Drupal\Component\Transliteration\PhpTransliteration

class PhpTransliteration implements TransliterationInterface

Implements transliteration without using the PECL extensions.

Transliterations are done character-by-character, by looking up non-US-ASCII characters in a transliteration database.

The database comes from two types of files, both of which are searched for in the PhpTransliteration::$dataDirectory directory. First, language-specific overrides are searched (see PhpTransliteration::readLanguageOverrides()). If there is no language-specific override for a character, the generic transliteration character tables are searched (see PhpTransliteration::readGenericData()). If looking up the character in the generic table results in a NULL value, or an illegal character is encountered, then a substitute character is returned.
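The lookup order described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the function name `sketch_replace()`, the in-memory arrays standing in for the data files, and the use of a negative code to signal an illegal character are all assumptions made for the example.

```php
<?php
/**
 * Sketch of the lookup order: overrides first, then generic data, then
 * the substitute character. All names here are hypothetical.
 */
function sketch_replace(int $code, string $langcode, string $unknown_character, array $overrides, array $generic): string {
  // An illegal character (signalled here by a negative code) always
  // yields the substitute character.
  if ($code < 0) {
    return $unknown_character;
  }
  // 1. Language-specific overrides take precedence.
  if (isset($overrides[$langcode][$code])) {
    return $overrides[$langcode][$code];
  }
  // 2. Generic table next. The ?? operator also falls through to the
  //    substitute when the stored value is NULL.
  return $generic[$code] ?? $unknown_character;
}

// Hypothetical sample data standing in for the data files.
$sample_overrides = ['de' => [0xE4 => 'ae']];
$sample_generic = [0xE4 => 'a', 0x2603 => NULL];
```

With this sample data, code 0xE4 resolves to 'ae' for 'de' and to 'a' for any other language, while 0x2603 (NULL in the generic table) resolves to the substitute character.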

Properties

protected string	$dataDirectory	Directory where data for transliteration resides.
protected array	$languageOverrides	Associative array of language-specific character transliteration tables.
protected array	$genericMap	Non-language-specific transliteration tables.
protected	$fixTransliterateForRemoveDiacritics	Special characters for ::removeDiacritics().

Methods

	__construct(string $data_directory = NULL)	Constructs a transliteration object.
string	removeDiacritics(string $string)	Removes diacritics (accents) from certain letters.
string	transliterate(string $string, string $langcode = 'en', string $unknown_character = '?', int $max_length = NULL)	Transliterates text from Unicode to US-ASCII.
static protected int	ordUTF8(string $character)	Finds the character code for a UTF-8 character: like ord() but for UTF-8.
protected string	replace(int $code, string $langcode, string $unknown_character)	Replaces a single Unicode character using the transliteration database.
protected string	lookupReplacement($code, string $unknown_character = '?')	Looks up the generic replacement for a UTF-8 character code.
protected	readLanguageOverrides($langcode)	Reads in language overrides for a language code.
protected	readGenericData($bank)	Reads in generic transliteration data for a bank of characters.

Details

at line 84
`__construct(string $data_directory = NULL)`

Constructs a transliteration object.

Parameters

string	$data_directory	(optional) The directory where data files reside. If omitted, defaults to the 'data' subdirectory of the directory where the class's PHP file resides.

at line 91
`string removeDiacritics(string $string)`

Removes diacritics (accents) from certain letters.

This only applies to certain letters: accented Latin characters like a-with-acute-accent, in the UTF-8 character ranges of 0xE0 to 0xE6 and 0x01CD to 0x024F. Replacements that would result in the string changing length are excluded, as well as characters that are not accented US-ASCII letters.

Parameters

string	$string	The string holding diacritics.

Return Value

string	$string with accented letters replaced by their unaccented equivalents.
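The length-preserving rule can be illustrated with a short sketch. The function name `sketch_remove_diacritics()` and its small hard-coded map are hypothetical; the map covers only the 0xE0 to 0xE6 range and is not the class's real data. The sketch assumes valid UTF-8 input.

```php
<?php
/**
 * Sketch: strip accents only where the replacement is a single character.
 * The map below is a hypothetical sample, not the class's data.
 */
function sketch_remove_diacritics(string $string): string {
  $map = [
    "\u{E0}" => 'a', "\u{E1}" => 'a', "\u{E2}" => 'a', "\u{E3}" => 'a',
    "\u{E4}" => 'a', "\u{E5}" => 'a',
    // 0xE6 (ae ligature) would become two characters, so it is skipped below.
    "\u{E6}" => 'ae',
  ];
  $result = '';
  // Split into whole UTF-8 characters rather than bytes.
  foreach (preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY) as $character) {
    $replacement = $map[$character] ?? $character;
    // Keep the original character when the replacement is not exactly
    // one ASCII character, so the character count never changes.
    $result .= strlen($replacement) === 1 ? $replacement : $character;
  }
  return $result;
}
```

In this sketch a-with-acute-accent becomes 'a', while the ae ligature is left untouched because its replacement would lengthen the string.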

at line 125
`string transliterate(string $string, string $langcode = 'en', string $unknown_character = '?', int $max_length = NULL)`

Transliterates text from Unicode to US-ASCII.

Parameters

string	$string	The string to transliterate.
string	$langcode	(optional) The language code of the language the string is in. Defaults to 'en' if not provided. Warning: this can be unfiltered user input.
string	$unknown_character	(optional) The character to substitute for characters in $string without transliterated equivalents. Defaults to '?'.
int	$max_length	(optional) If provided, return at most this many characters, ensuring that the transliteration does not split in the middle of an input character's transliteration.

Return Value

string	$string with non-US-ASCII characters transliterated to US-ASCII characters, and unknown characters replaced with $unknown_character.
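The effect of $max_length is easiest to see in a small sketch. The function `sketch_transliterate()` and the lookup table passed to it are hypothetical stand-ins; the point is only that output stops before a multi-character replacement would be cut in half. Valid UTF-8 input is assumed.

```php
<?php
/**
 * Sketch: character-by-character transliteration with a length cap.
 * $table is a hypothetical map from UTF-8 characters to replacements.
 */
function sketch_transliterate(string $string, array $table, string $unknown_character = '?', ?int $max_length = NULL): string {
  $result = '';
  $length = 0;
  foreach (preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY) as $character) {
    // Single-byte (US-ASCII) characters pass through unchanged.
    $replacement = strlen($character) === 1
      ? $character
      : ($table[$character] ?? $unknown_character);
    $length += strlen($replacement);
    // Stop before adding a replacement that would exceed the cap, so no
    // replacement is ever split.
    if ($max_length !== NULL && $length > $max_length) {
      break;
    }
    $result .= $replacement;
  }
  return $result;
}

// Hypothetical sample table: sharp s maps to two characters.
$sample_table = ["\u{DF}" => 'ss'];
```

With the sample table, a cap of 5 on "Straße" yields 'Stra' rather than 'Stras', because the two-character replacement for the sharp s does not fit whole.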

at line 185
`static protected int ordUTF8(string $character)`

Finds the character code for a UTF-8 character: like ord() but for UTF-8.

Parameters

string	$character	A single UTF-8 character.

Return Value

int	The character code, or -1 if an illegal character is found.
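For readers unfamiliar with UTF-8 decoding, the following sketch shows one way to turn a single UTF-8 character into its code, returning -1 for malformed input. The function name `sketch_ord_utf8()` is hypothetical, and the sketch is deliberately simplified: it does not reject overlong encodings and it ignores any bytes after the first character.

```php
<?php
/**
 * Sketch: decode the first UTF-8 character of a string to its code,
 * or return -1 if the bytes are not a well-formed sequence.
 */
function sketch_ord_utf8(string $character): int {
  if ($character === '') {
    return -1;
  }
  $first = ord($character[0]);
  // Single-byte (US-ASCII) character.
  if ($first < 0x80) {
    return $first;
  }
  // The lead byte announces the sequence length and carries the top bits.
  if (($first & 0xE0) === 0xC0) {
    $bytes = 2;
    $code = $first & 0x1F;
  }
  elseif (($first & 0xF0) === 0xE0) {
    $bytes = 3;
    $code = $first & 0x0F;
  }
  elseif (($first & 0xF8) === 0xF0) {
    $bytes = 4;
    $code = $first & 0x07;
  }
  else {
    // A stray continuation byte or an invalid lead byte.
    return -1;
  }
  if (strlen($character) < $bytes) {
    return -1;
  }
  // Each continuation byte must look like 10xxxxxx and adds 6 bits.
  for ($i = 1; $i < $bytes; $i++) {
    $byte = ord($character[$i]);
    if (($byte & 0xC0) !== 0x80) {
      return -1;
    }
    $code = ($code << 6) | ($byte & 0x3F);
  }
  return $code;
}
```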

at line 225
`protected string replace(int $code, string $langcode, string $unknown_character)`

Replaces a single Unicode character using the transliteration database.

Parameters

int	$code	The character code of a Unicode character.
string	$langcode	The language code of the language the character is in.
string	$unknown_character	The character to substitute for characters without transliterated equivalents.

Return Value

string	US-ASCII replacement character. If the character has a mapping, the mapped replacement is returned; otherwise, $unknown_character is returned. The replacement can contain multiple characters.

at line 256
`protected string lookupReplacement($code, string $unknown_character = '?')`

Looks up the generic replacement for a UTF-8 character code.

Parameters

	$code	The UTF-8 character code.
string	$unknown_character	(optional) The character to substitute for characters without entries in the replacement tables.

Return Value

string	US-ASCII replacement characters. If the character code has a mapping, the mapped replacement is returned; otherwise, $unknown_character is returned. The replacement can contain multiple characters.

at line 279
`protected readLanguageOverrides($langcode)`

Reads in language overrides for a language code.

The data is read from files named "$langcode.php" in PhpTransliteration::$dataDirectory. These files should set up an array variable $overrides with an element whose key is $langcode and whose value is an array whose keys are character codes, and whose values are their transliterations in this language. The character codes can be for any valid Unicode character, independent of the number of bytes.
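As an illustration of the file layout just described, a language override file might look like the following. The file name "de.php" and the specific mappings are examples chosen for this sketch, not a copy of any shipped data file.

```php
<?php
// Hypothetical contents of a file named "de.php" in the data directory.
// The outer key is the language code; inner keys are character codes and
// inner values are the transliterations to use for that language.
$overrides['de'] = [
  0xC4 => 'Ae',
  0xD6 => 'Oe',
  0xDC => 'Ue',
  0xE4 => 'ae',
  0xF6 => 'oe',
  0xFC => 'ue',
  0xDF => 'ss',
];
```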

Parameters

	$langcode	Code for the language to read.

at line 308
`protected readGenericData($bank)`

Reads in generic transliteration data for a bank of characters.

The data is read in from a file named "x$bank.php" (with $bank in hexadecimal notation) in PhpTransliteration::$dataDirectory. These files should set up a variable $bank containing an array whose numerical indices are the remaining two bytes of the character code, and whose values are the transliterations of these characters into US-ASCII. Note that the maximum Unicode character that can be encoded in this way is 4 bytes.

Parameters

	$bank	First two bytes of the Unicode character, or 0 for the ASCII range.
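One plausible way to derive the bank and the index within it from a character code is sketched below. It rests on assumptions the text above does not spell out: that the bank is the part of the code above its low 8 bits, that the index is the low 8 bits, and that the file name uses at least two lowercase hexadecimal digits. The function name `sketch_split_code()` is hypothetical.

```php
<?php
/**
 * Sketch: split a character code into a bank, an index within the bank,
 * and a data file name. The bit widths and file name format are
 * assumptions made for illustration.
 */
function sketch_split_code(int $code): array {
  // Assumed: the bank is everything above the low 8 bits.
  $bank = $code >> 8;
  // Assumed: the index within the bank is the low 8 bits.
  $index = $code & 0xFF;
  return [
    'bank' => $bank,
    'index' => $index,
    // "x" followed by the bank in hexadecimal, for example "x20.php".
    'file' => sprintf('x%02x.php', $bank),
  ];
}
```

Under these assumptions, code 0x20AC falls in bank 0x20 at index 0xAC, and every US-ASCII code falls in bank 0.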
