Unicode
class Unicode
Provides Unicode-related conversions and operations.
Constants
PREG_CLASS_WORD_BOUNDARY
Matches Unicode characters that are word boundaries. Characters with the following General_category (gc) property values are used as word boundaries. This does not fully conform to the Word Boundaries algorithm described in http://unicode.org/reports/tr29, because PCRE does not contain the Word_Break property table, so this simpler algorithm has to do.
Non-boundary characters include the following General_category (gc) property values:
Note that the PCRE property matcher is not used, in order to remain compatible with Unicode 5.2.0 regardless of the PCRE version used (and of any bugs in PCRE property tables).
STATUS_SINGLEBYTE
Indicates that standard PHP (emulated) Unicode support is being used.
STATUS_MULTIBYTE
Indicates that full Unicode support with the PHP mbstring extension is being used.
STATUS_ERROR
Indicates an error during the check for PHP Unicode support.
Methods
Gets the current status of unicode/multibyte support on this environment.
Sets the value for multibyte support status for the current environment.
Checks for Unicode support in PHP and sets the proper settings if possible.
Decodes UTF byte-order mark (BOM) into the encoding's name.
Converts data to UTF-8.
Truncates a UTF-8-encoded string safely to a number of bytes.
Capitalizes the first character of a UTF-8 string.
Converts the first character of a UTF-8 string to lowercase.
Capitalizes the first character of each word in a UTF-8 string.
Cuts off a piece of a string based on character indices and counts.
Truncates a UTF-8-encoded string safely to a number of characters.
Compares UTF-8-encoded strings in a binary safe case-insensitive manner.
Encodes MIME/HTTP headers that contain incorrectly encoded characters.
Decodes MIME/HTTP encoded header values.
Flips U+C0-U+DE to U+E0-U+FE and back. Can be used as a preg_replace callback.
Checks whether a string is valid UTF-8.
Finds the position of the first occurrence of a string in another string.
Details
static int
getStatus()
Gets the current status of unicode/multibyte support on this environment.
static
setStatus(int $status)
deprecated
Sets the value for multibyte support status for the current environment.
The following status keys are supported:
- \Drupal\Component\Utility\Unicode::STATUS_MULTIBYTE Full unicode support using an extension.
- \Drupal\Component\Utility\Unicode::STATUS_SINGLEBYTE Standard PHP (emulated) unicode support.
- \Drupal\Component\Utility\Unicode::STATUS_ERROR An error occurred. No unicode support.
static string
check()
Checks for Unicode support in PHP and sets the proper settings if possible.
Because of the need to be able to handle text in various encodings, we do not support mbstring function overloading. HTTP input/output conversion must be disabled for similar reasons.
static string|bool
encodingFromBOM(string $data)
Decodes UTF byte-order mark (BOM) into the encoding's name.
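As a rough illustration of BOM detection, the sketch below maps a leading byte sequence to an encoding name. The function name sketch_encoding_from_bom() and the BOM table are assumptions made for this example; the class may recognize a different set of marks.

```php
<?php
// Hypothetical helper, not the class's own code: returns the encoding name
// for a leading byte-order mark, or false when no known mark is present.
function sketch_encoding_from_bom(string $data)
{
    // Longer marks come first so that UTF-32LE (FF FE 00 00) is not
    // mistaken for UTF-16LE (FF FE).
    $bom_map = [
        "\xEF\xBB\xBF" => 'UTF-8',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFE\xFF" => 'UTF-16BE',
        "\xFF\xFE" => 'UTF-16LE',
    ];
    foreach ($bom_map as $bom => $encoding) {
        // strncmp() is binary safe, so NUL bytes in the mark are fine.
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return false;
}
```

A caller would typically strip the mark and convert the remainder to UTF-8 once the encoding is known.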
static string|bool
convertToUtf8(string $data, string $encoding)
Converts data to UTF-8.
Requires the iconv, GNU recode or mbstring PHP extension.
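The extension fallback could look roughly like the sketch below. It assumes only iconv and mbstring are tried (GNU recode is left out), and the name sketch_convert_to_utf8() is hypothetical.

```php
<?php
// Hypothetical helper: converts $data from $encoding to UTF-8 using
// whichever supported extension is loaded. Returns false when neither
// extension is available or the conversion fails.
function sketch_convert_to_utf8(string $data, string $encoding)
{
    if (function_exists('iconv')) {
        // @ suppresses the notice iconv() raises on invalid input.
        return @iconv($encoding, 'UTF-8', $data);
    }
    if (function_exists('mb_convert_encoding')) {
        return @mb_convert_encoding($data, 'UTF-8', $encoding);
    }
    // No supported extension is available.
    return false;
}
```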
static string
truncateBytes(string $string, int $len)
Truncates a UTF-8-encoded string safely to a number of bytes.
If the end position is in the middle of a UTF-8 sequence, it scans backwards until the beginning of the byte sequence.
Use this function whenever you want to chop off a string at an unsure location. On the other hand, if you're sure that you're splitting on a character boundary (e.g. after using strpos() or similar), you can safely use substr() instead.
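The backwards scan can be sketched as follows. This is a minimal standalone version under the assumption that the input is valid UTF-8; sketch_truncate_bytes() is a hypothetical name.

```php
<?php
// Hypothetical standalone version of a byte-safe truncation.
function sketch_truncate_bytes(string $string, int $len): string
{
    if ($len <= 0) {
        return '';
    }
    if (strlen($string) <= $len) {
        return $string;
    }
    // The byte at index $len is the first byte that would be dropped. If it
    // is a continuation byte (10xxxxxx), the cut falls inside a sequence, so
    // step back until the byte that starts the sequence and cut before it.
    while ($len > 0 && (ord($string[$len]) & 0xC0) === 0x80) {
        $len--;
    }
    return substr($string, 0, $len);
}
```

With "tést" (bytes 74 C3 A9 73 74), a limit of 2 would fall between C3 and A9, so the result is shortened to "t"; a limit of 3 already falls on a boundary and yields "té".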
static int
strlen(string $text)
deprecated
Counts the number of characters in a UTF-8 string.
This is less than or equal to the byte count.
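To illustrate the difference between character count and byte count, the sketch below counts every byte that is not a UTF-8 continuation byte. It is an illustrative approach for valid UTF-8 input, not necessarily how the class counts; sketch_utf8_strlen() is a hypothetical name.

```php
<?php
// Hypothetical helper: counts characters in a valid UTF-8 string by
// counting the bytes that start a character (ASCII bytes and lead bytes),
// skipping continuation bytes in the range 0x80-0xBF.
function sketch_utf8_strlen(string $text): int
{
    return (int) preg_match_all('/[\x00-\x7F\xC0-\xFF]/', $text);
}

$text = "t\xC3\xA9st"; // "tést": the é occupies two bytes
$characters = sketch_utf8_strlen($text); // 4
$bytes = strlen($text); // 5
```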
static string
strtoupper(string $text)
deprecated
Converts a UTF-8 string to uppercase.
static string
strtolower(string $text)
deprecated
Converts a UTF-8 string to lowercase.
static string
ucfirst(string $text)
Capitalizes the first character of a UTF-8 string.
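A multibyte-aware capitalization might be sketched like this, assuming the mbstring extension is loaded. The name sketch_ucfirst() is hypothetical.

```php
<?php
// Hypothetical helper: uppercases the first character (not the first byte)
// of a UTF-8 string. Assumes the mbstring extension is available.
function sketch_ucfirst(string $text): string
{
    $first = mb_substr($text, 0, 1, 'UTF-8');
    $rest = mb_substr($text, 1, null, 'UTF-8');
    return mb_strtoupper($first, 'UTF-8') . $rest;
}
```

Splitting on characters matters here because the first character of a word such as "école" is two bytes long.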
static string
lcfirst(string $text)
Converts the first character of a UTF-8 string to lowercase.
static string
ucwords(string $text)
Capitalizes the first character of each word in a UTF-8 string.
static string
substr(string $text, int $start, int $length = NULL)
deprecated
Cuts off a piece of a string based on character indices and counts.
Follows the same behavior as PHP's own substr() function. Note that for cutting off a string at a known character/substring location, the usage of PHP's normal strpos/substr is safe and much faster.
static string
truncate(string $string, int $max_length, bool $wordsafe = FALSE, bool $add_ellipsis = FALSE, int $min_wordsafe_length = 1)
Truncates a UTF-8-encoded string safely to a number of characters.
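The parameters suggest a behavior along the lines of the simplified sketch below. Several points are assumptions of this example rather than documented behavior: the ellipsis is counted toward $max_length, whitespace stands in for the real word-boundary class, and the name sketch_truncate() is hypothetical. It also assumes the mbstring extension.

```php
<?php
// Hypothetical, simplified character-based truncation.
function sketch_truncate(
    string $string,
    int $max_length,
    bool $wordsafe = false,
    bool $add_ellipsis = false,
    int $min_wordsafe_length = 1
): string {
    if (mb_strlen($string, 'UTF-8') <= $max_length) {
        return $string;
    }
    $ellipsis = $add_ellipsis ? "\u{2026}" : '';
    // Assumption: the ellipsis counts toward the maximum length.
    $max_length = max($max_length - mb_strlen($ellipsis, 'UTF-8'), 0);
    $cut = mb_substr($string, 0, $max_length, 'UTF-8');
    $min = max($min_wordsafe_length, 0);
    if ($wordsafe && $min <= $max_length) {
        // Assumption: whitespace approximates a word boundary. The greedy
        // quantifier finds the longest prefix that ends just before one.
        $pattern = '/^(.{' . $min . ',' . $max_length . '})\s/us';
        if (preg_match($pattern, $string, $matches)) {
            $cut = $matches[1];
        }
    }
    return $cut . $ellipsis;
}
```

If no boundary is found within the allowed range, this sketch falls back to the plain character cut.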
static int
strcasecmp(string $str1, string $str2)
Compares UTF-8-encoded strings in a binary safe case-insensitive manner.
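One way to get a case-insensitive, binary-safe comparison is to fold both strings to the same case and then compare bytes. The sketch below assumes the mbstring extension; sketch_strcasecmp() is a hypothetical name.

```php
<?php
// Hypothetical helper: folds both strings to uppercase with mbstring and
// compares the results byte by byte. Returns a negative number, zero, or a
// positive number, like PHP's strcmp().
function sketch_strcasecmp(string $str1, string $str2): int
{
    return strcmp(
        mb_strtoupper($str1, 'UTF-8'),
        mb_strtoupper($str2, 'UTF-8')
    );
}
```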
static string
mimeHeaderEncode(string $string, bool $shorten = FALSE)
Encodes MIME/HTTP headers that contain incorrectly encoded characters.
For example, Unicode::mimeHeaderEncode('tést.txt') returns "=?UTF-8?B?dMOpc3QudHh0?=".
See http://www.rfc-editor.org/rfc/rfc2047.txt for more information.
Notes:
- Only encode strings that contain non-ASCII characters.
- Chunks are cut off progressively with self::truncateBytes(), which ensures each chunk starts and ends on a character boundary.
- Using \n as the chunk separator may cause problems on some systems and may have to be changed to \r\n or \r.
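The notes above can be sketched as a standalone function. The chunk size of 47 bytes, the inlined boundary-safe cut (standing in for self::truncateBytes()), and the names sketch_cut_bytes() and sketch_mime_header_encode() are assumptions of this example; the $shorten parameter is not modeled.

```php
<?php
// Hypothetical stand-in for a byte-safe cut: returns at most $len bytes
// without splitting a UTF-8 sequence.
function sketch_cut_bytes(string $string, int $len): string
{
    if (strlen($string) <= $len) {
        return $string;
    }
    while ($len > 0 && (ord($string[$len]) & 0xC0) === 0x80) {
        $len--;
    }
    return substr($string, 0, $len);
}

// Hypothetical encoder producing RFC 2047 base64 encoded words.
function sketch_mime_header_encode(string $string): string
{
    // Strings made only of printable ASCII are returned unchanged.
    if (!preg_match('/[^\x20-\x7E]/', $string)) {
        return $string;
    }
    // Assumption: 47 raw bytes per chunk, which base64 expands to 64
    // characters.
    $chunk_size = 47;
    $words = [];
    while ($string !== '') {
        $chunk = sketch_cut_bytes($string, $chunk_size);
        if ($chunk === '') {
            // Malformed input: take a plain byte cut to guarantee progress.
            $chunk = substr($string, 0, $chunk_size);
        }
        $words[] = '=?UTF-8?B?' . base64_encode($chunk) . '?=';
        $string = (string) substr($string, strlen($chunk));
    }
    // "\n" is used as the chunk separator, as discussed in the notes.
    return implode("\n ", $words);
}
```

For 'tést.txt' this sketch produces the same encoded word as the example above.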
static string
mimeHeaderDecode(string $header)
Decodes MIME/HTTP encoded header values.
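Decoding can be sketched with a regular expression over RFC 2047 encoded words. This is a simplified illustration: the patterns, the charset handling, and the name sketch_mime_header_decode() are assumptions, and edge cases of the RFC are not covered.

```php
<?php
// Hypothetical decoder for encoded words of the form =?charset?B|Q?data?=.
function sketch_mime_header_decode(string $header): string
{
    // RFC 2047: whitespace between adjacent encoded words is ignored.
    $joined = preg_replace('/(\?=)\s+(?==\?)/', '$1', $header) ?? $header;
    $decoded = preg_replace_callback(
        '/=\?([^?]+)\?([BbQq])\?([^?]*)\?=/',
        function (array $m): string {
            // B is base64; Q is quoted-printable with "_" meaning a space.
            $data = strtoupper($m[2]) === 'B'
                ? (string) base64_decode($m[3])
                : quoted_printable_decode(str_replace('_', ' ', $m[3]));
            if (strcasecmp($m[1], 'UTF-8') === 0) {
                return $data;
            }
            // Other charsets are converted when iconv is available;
            // otherwise the raw bytes are returned as they are.
            $converted = function_exists('iconv')
                ? @iconv($m[1], 'UTF-8', $data)
                : false;
            return $converted === false ? $data : $converted;
        },
        $joined
    );
    return $decoded ?? $header;
}
```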
static string
caseFlip(array $matches)
deprecated
Flips U+C0-U+DE to U+E0-U+FE and back. Can be used as a preg_replace callback.
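In UTF-8 these code points are two-byte sequences that share the lead byte 0xC3 and differ only in bit 0x20 of the second byte, so the flip can be sketched as a single XOR. The name sketch_case_flip() and the pattern in the usage line are assumptions of this example.

```php
<?php
// Hypothetical callback: receives a match of one two-byte UTF-8 sequence
// and toggles bit 0x20 of its second byte, which switches the case.
function sketch_case_flip(array $matches): string
{
    return $matches[0][0] . chr(ord($matches[0][1]) ^ 32);
}

// Usage: lowercase the uppercase Latin-1 letters of "ÉCOLE". The byte
// range skips 0x97, which is the multiplication sign U+D7.
$lower = preg_replace_callback(
    '/\xC3[\x80-\x96\x98-\x9E]/',
    'sketch_case_flip',
    "\xC3\x89COLE"
);
```

Only the É is affected here; ASCII letters would still need a separate conversion.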
static bool
validateUtf8(string $text)
Checks whether a string is valid UTF-8.
All functions designed to filter input should use drupal_validate_utf8 to ensure they operate on valid UTF-8 strings, so that the filter cannot be bypassed.
When text containing an invalid UTF-8 lead byte (0xC0 - 0xFF) is presented as UTF-8 to Internet Explorer 6, the program may misinterpret subsequent bytes. When these subsequent bytes are HTML control characters such as quotes or angle brackets, parts of the text that were deemed safe by filters end up in locations that are potentially unsafe. For example, an onerror attribute that is outside of a tag, and thus deemed safe by a filter, can be interpreted by the browser as if it were inside the tag.
The function does not return FALSE for strings containing character codes above U+10FFFF, even though these are prohibited by RFC 3629.
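A validity check of this kind can be sketched with PCRE's UTF-8 mode, which refuses to match a subject that is not valid UTF-8. The name sketch_validate_utf8() is hypothetical, and the exact set of rejected sequences depends on the PCRE version in use.

```php
<?php
// Hypothetical helper: relies on the /u modifier, under which preg_match()
// returns false for a subject that is not valid UTF-8.
function sketch_validate_utf8(string $text): bool
{
    if ($text === '') {
        return true;
    }
    // /s lets "." match a newline, so any valid non-empty string matches.
    return preg_match('/^./us', $text) === 1;
}
```

An input filter would call this first and reject the input outright when it returns false, before any tag or attribute filtering runs.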
static int|false
strpos(string $haystack, string $needle, int $offset = 0)
deprecated
Finds the position of the first occurrence of a string in another string.
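The sketch below shows how a character-based position differs from the byte-based position returned by PHP's own strpos(). It assumes the mbstring extension; sketch_strpos() is a hypothetical wrapper.

```php
<?php
// Hypothetical wrapper: returns the position in characters rather than in
// bytes, or false when the needle is not found.
function sketch_strpos(string $haystack, string $needle, int $offset = 0)
{
    return mb_strpos($haystack, $needle, $offset, 'UTF-8');
}

$text = "t\xC3\xA9st"; // "tést"
$char_pos = sketch_strpos($text, 's'); // 2, counted in characters
$byte_pos = strpos($text, 's'); // 3, counted in bytes
```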