PhpTransliteration
class PhpTransliteration implements TransliterationInterface
Implements transliteration without using the PECL extensions.
Transliterations are done character-by-character, by looking up non-US-ASCII characters in a transliteration database.
The database comes from two types of files, both of which are searched for in the PhpTransliteration::$dataDirectory directory. First, language-specific overrides are searched (see PhpTransliteration::readLanguageOverrides()). If there is no language-specific override for a character, the generic transliteration character tables are searched (see PhpTransliteration::readGenericData()). If looking up the character in the generic table results in a NULL value, or an illegal character is encountered, then a substitute character is returned.
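The lookup order described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the function name `lookup_sketch` and the in-memory `$overrides` and `$generic` arrays are hypothetical stand-ins for the data files, and the generic table is flattened here into a single array keyed by character code.

```php
<?php
// Hypothetical stand-ins for the data files; the real class reads its
// tables from PhpTransliteration::$dataDirectory.
$overrides = [
  'de' => [0xE4 => 'ae', 0xF6 => 'oe', 0xFC => 'ue'],
];
$generic = [
  0xE4 => 'a',
  0xE9 => 'e',
  0xF6 => 'o',
  0xFC => 'u',
  0x2603 => NULL, // A NULL entry: no transliteration is known.
];

/**
 * Sketch of the lookup order: override, then generic table, then substitute.
 */
function lookup_sketch(int $code, string $langcode, array $overrides, array $generic, string $unknown_character = '?'): string {
  // 1. A language-specific override wins if one exists for this character.
  if (isset($overrides[$langcode][$code])) {
    return $overrides[$langcode][$code];
  }
  // 2. Otherwise try the generic table. isset() is FALSE for both a
  //    missing entry and a NULL entry.
  if (isset($generic[$code])) {
    return $generic[$code];
  }
  // 3. Nothing usable was found: return the substitute character.
  return $unknown_character;
}

echo lookup_sketch(0xE4, 'de', $overrides, $generic), "\n";   // ae
echo lookup_sketch(0xE4, 'en', $overrides, $generic), "\n";   // a
echo lookup_sketch(0x2603, 'en', $overrides, $generic), "\n"; // ?
```

The same character (0xE4) yields "ae" for German and "a" for any language without an override, which is the point of searching the overrides first.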
Some parts of this code were derived from the MediaWiki project's UtfNormal class, Copyright © 2004 Brion Vibber brion@pobox.com, http://www.mediawiki.org/
Properties
| protected string | $dataDirectory | Directory where data for transliteration resides. |
| protected array | $languageOverrides | Associative array of language-specific character transliteration tables. |
| protected array | $genericMap | Non-language-specific transliteration tables. |
| protected | $fixTransliterateForRemoveDiacritics | Special characters for ::removeDiacritics(). |
Methods
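| | __construct(string $data_directory = NULL) | Constructs a transliteration object. |
| string | removeDiacritics(string $string) | Removes diacritics (accents) from certain letters. |
| string | transliterate(string $string, string $langcode = 'en', string $unknown_character = '?', int $max_length = NULL) | Transliterates text from Unicode to US-ASCII. |
| static protected int | ordUTF8(string $character) | Finds the character code for a UTF-8 character: like ord() but for UTF-8. |
| protected string | replace(int $code, string $langcode, string $unknown_character) | Replaces a single Unicode character using the transliteration database. |
| protected string | lookupReplacement($code, string $unknown_character = '?') | Looks up the generic replacement for a UTF-8 character code. |
| protected | readLanguageOverrides($langcode) | Reads in language overrides for a language code. |
| protected | readGenericData($bank) | Reads in generic transliteration data for a bank of characters. |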
Details
__construct(string $data_directory = NULL)
Constructs a transliteration object.
string
removeDiacritics(string $string)
Removes diacritics (accents) from certain letters.
This only applies to certain letters: accented Latin characters like a-with-acute-accent, in the Unicode code point ranges 0x00E0 to 0x00E6 and 0x01CD to 0x024F. Replacements that would change the length of the string are excluded, as are characters that are not accented US-ASCII letters.
string
transliterate(string $string, string $langcode = 'en', string $unknown_character = '?', int $max_length = NULL)
Transliterates text from Unicode to US-ASCII.
static protected int
ordUTF8(string $character)
Finds the character code for a UTF-8 character: like ord() but for UTF-8.
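To illustrate what "like ord() but for UTF-8" means, here is a minimal sketch that decodes the first UTF-8 character of a string into its Unicode character code. The function name `utf8_ord_sketch` is hypothetical, and returning -1 for input it cannot decode is an assumption of this sketch, not a statement about how ordUTF8() reports errors.

```php
<?php
/**
 * Sketch: returns the Unicode character code of the first UTF-8 character.
 *
 * Returns -1 (an assumption of this sketch) for empty, truncated, or
 * malformed input. Overlong encodings are not rejected here.
 */
function utf8_ord_sketch(string $character): int {
  if ($character === '') {
    return -1;
  }
  $first = ord($character[0]);
  // A single byte below 0x80 is plain US-ASCII.
  if ($first < 0x80) {
    return $first;
  }
  // The lead byte announces the sequence length and carries the top bits.
  if (($first & 0xE0) === 0xC0) {
    $length = 2;
    $code = $first & 0x1F;
  }
  elseif (($first & 0xF0) === 0xE0) {
    $length = 3;
    $code = $first & 0x0F;
  }
  elseif (($first & 0xF8) === 0xF0) {
    $length = 4;
    $code = $first & 0x07;
  }
  else {
    return -1;
  }
  if (strlen($character) < $length) {
    return -1;
  }
  // Each continuation byte has the form 10xxxxxx and adds six bits.
  for ($i = 1; $i < $length; $i++) {
    $byte = ord($character[$i]);
    if (($byte & 0xC0) !== 0x80) {
      return -1;
    }
    $code = ($code << 6) | ($byte & 0x3F);
  }
  return $code;
}

printf("0x%X\n", utf8_ord_sketch("\xC3\xA9"));     // 0xE9 (e with acute)
printf("0x%X\n", utf8_ord_sketch("\xE2\x82\xAC")); // 0x20AC (euro sign)
```

By contrast, PHP's built-in ord() would return only the first byte (0xC3 for the first example), which is why a UTF-8 aware variant is needed.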
protected string
replace(int $code, string $langcode, string $unknown_character)
Replaces a single Unicode character using the transliteration database.
protected string
lookupReplacement($code, string $unknown_character = '?')
Looks up the generic replacement for a UTF-8 character code.
protected
readLanguageOverrides($langcode)
Reads in language overrides for a language code.
The data is read from files named "$langcode.php" in PhpTransliteration::$dataDirectory. These files should set up an array variable $overrides with an element whose key is $langcode and whose value is an array whose keys are character codes, and whose values are their transliterations in this language. The character codes can be for any valid Unicode character, independent of the number of bytes.
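A file following this description might look like the sketch below, shown for a hypothetical "de.php". The specific mappings are illustrative assumptions and are not copied from any shipped data file.

```php
<?php
// Hypothetical contents of "de.php" in the data directory.
// The outer key is the language code; the inner keys are Unicode character
// codes and the values are the transliterations to use for that language.
$overrides['de'] = [
  0xC4 => 'Ae', // A with diaeresis
  0xD6 => 'Oe', // O with diaeresis
  0xDC => 'Ue', // U with diaeresis
  0xE4 => 'ae', // a with diaeresis
  0xF6 => 'oe', // o with diaeresis
  0xFC => 'ue', // u with diaeresis
];
```

Because the keys are full character codes rather than byte values, an override can target any Unicode character, whatever the length of its UTF-8 encoding.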
protected
readGenericData($bank)
Reads in generic transliteration data for a bank of characters.
The data is read in from a file named "x$bank.php" (with $bank in hexadecimal notation) in PhpTransliteration::$dataDirectory. These files should set up a variable $bank containing an array whose numerical indices are the remaining two bytes of the character code, and whose values are the transliterations of these characters into US-ASCII. Note that the maximum Unicode character that can be encoded in this way is 4 bytes.
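As a rough sketch of how a character code could be mapped to a data file and an index within it: only the "x" prefix, the hexadecimal bank, and the ".php" suffix come from the description above. The split used here (the low byte is the index, everything above it selects the bank), the two-digit zero padding, and the function name `generic_location_sketch` are assumptions made for illustration.

```php
<?php
/**
 * Sketch: derives a data file name and array index from a character code.
 *
 * Assumption: the bank is the code with its low byte removed, and the
 * low byte is the index inside that bank's array.
 */
function generic_location_sketch(int $code): array {
  $bank = $code >> 8;    // Selects the file.
  $index = $code & 0xFF; // Selects the entry within the file.
  return [
    'file' => sprintf('x%02x.php', $bank),
    'index' => $index,
  ];
}

$location = generic_location_sketch(0x20AC); // Euro sign.
echo $location['file'], ' ', $location['index'], "\n"; // x20.php 172
```

Under these assumptions, all characters from 0x2000 to 0x20FF would share one file, so a table only needs to be read once per bank of characters.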