class PhpTransliteration implements TransliterationInterface

Implements transliteration without using the PECL extensions.

Transliterations are done character-by-character, by looking up non-US-ASCII characters in a transliteration database.

The database comes from two types of files, both of which are searched for in the PhpTransliteration::$dataDirectory directory. First, language-specific overrides are searched (see PhpTransliteration::readLanguageOverrides()). If there is no language-specific override for a character, the generic transliteration character tables are searched (see PhpTransliteration::readGenericData()). If looking up the character in the generic table results in a NULL value, or an illegal character is encountered, then a substitute character is returned.
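The lookup order described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the function name sketchReplace() and the in-memory arrays are hypothetical stand-ins for the data files, and the mappings are sample values only.

```php
<?php
// Hypothetical stand-ins for the two kinds of data files.
$sampleOverrides = [
    'de' => [0xE4 => 'ae', 0xF6 => 'oe', 0xFC => 'ue'],
];
$sampleGeneric = [0xE4 => 'a', 0xF6 => 'o', 0xFC => 'u', 0xE9 => 'e', 0x80 => null];

/**
 * Sketch of the lookup order: language override first, then the generic
 * table, then the substitute character.
 */
function sketchReplace(int $code, string $langcode, array $overrides, array $generic, string $unknown = '?'): string
{
    // A negative code stands for an illegal character in this sketch.
    if ($code < 0) {
        return $unknown;
    }
    // 1. Language-specific override, if one exists for this character.
    if (isset($overrides[$langcode][$code])) {
        return $overrides[$langcode][$code];
    }
    // 2. Generic table. isset() is false for NULL entries, so a NULL
    //    value falls through to the substitute character.
    if (isset($generic[$code])) {
        return $generic[$code];
    }
    // 3. Nothing usable found.
    return $unknown;
}
```

With these sample tables, code 0xE4 yields 'ae' for 'de' and 'a' for 'en', while 0x80 (a NULL entry) yields the substitute character.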

Some parts of this code were derived from the MediaWiki project's UtfNormal class, Copyright © 2004 Brion Vibber <brion@pobox.com>, http://www.mediawiki.org/

Properties

protected string $dataDirectory

Directory where data for transliteration resides.

protected array $languageOverrides

Associative array of language-specific character transliteration tables.

protected array $genericMap

Non-language-specific transliteration tables.

protected $fixTransliterateForRemoveDiacritics

Special characters for ::removeDiacritics().

Methods

__construct(string $data_directory = NULL)

Constructs a transliteration object.

string
removeDiacritics(string $string)

Removes diacritics (accents) from certain letters.

string
transliterate(string $string, string $langcode = 'en', string $unknown_character = '?', int $max_length = NULL)

Transliterates text from Unicode to US-ASCII.

static int
ordUTF8(string $character)

Finds the character code for a UTF-8 character: like ord() but for UTF-8.

string
replace(int $code, string $langcode, string $unknown_character)

Replaces a single Unicode character using the transliteration database.

string
lookupReplacement($code, string $unknown_character = '?')

Looks up the generic replacement for a UTF-8 character code.

readLanguageOverrides($langcode)

Reads in language overrides for a language code.

readGenericData($bank)

Reads in generic transliteration data for a bank of characters.

Details

__construct(string $data_directory = NULL)

Constructs a transliteration object.

Parameters

string $data_directory

(optional) The directory where data files reside. If omitted, defaults to subdirectory 'data' underneath the directory where the class's PHP file resides.

string removeDiacritics(string $string)

Removes diacritics (accents) from certain letters.

This only applies to certain letters: accented Latin characters like a-with-acute-accent, in the UTF-8 character ranges 0xE0 to 0xE6 and 0x01CD to 0x024F. Replacements that would change the length of the string are excluded, as are characters that are not accented US-ASCII letters.

Parameters

string $string

The string holding diacritics.

Return Value

string

$string with accented letters replaced by their unaccented equivalents.
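The length-preserving rule can be illustrated with a simplified stand-in. The function sketchRemoveDiacritics() and the $diacriticMap table are hypothetical and keyed by character rather than by character code; the sketch assumes valid UTF-8 input and does not reproduce the character ranges or exclusions used by the real method.

```php
<?php
// Hypothetical sample table; 'æ' maps to two letters on purpose.
$diacriticMap = ['é' => 'e', 'á' => 'a', 'æ' => 'ae'];

/**
 * Replaces a character only when its replacement is exactly one byte long,
 * so the number of characters in the string never changes.
 */
function sketchRemoveDiacritics(string $string, array $map): string
{
    $result = '';
    // Split into UTF-8 characters (assumes $string is valid UTF-8).
    foreach (preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $replacement = $map[$char] ?? null;
        if ($replacement !== null && strlen($replacement) === 1) {
            $result .= $replacement;
        } else {
            // No mapping, or a multi-character mapping: keep the original.
            $result .= $char;
        }
    }
    return $result;
}
```

Here 'é' becomes 'e', but 'æ' is left untouched because its replacement 'ae' would lengthen the string.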

string transliterate(string $string, string $langcode = 'en', string $unknown_character = '?', int $max_length = NULL)

Transliterates text from Unicode to US-ASCII.

Parameters

string $string

The string to transliterate.

string $langcode

(optional) The language code of the language the string is in. Defaults to 'en' if not provided. Warning: this can be unfiltered user input.

string $unknown_character

(optional) The character to substitute for characters in $string without transliterated equivalents. Defaults to '?'.

int $max_length

(optional) If provided, return at most this many characters, ensuring that the transliteration does not split in the middle of an input character's transliteration.

Return Value

string

$string with non-US-ASCII characters transliterated to US-ASCII characters, and unknown characters replaced with $unknown_character.
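The effect of $max_length can be sketched as below. This is an illustration under stated assumptions, not the method's actual code: sketchTransliterate() and $sampleMap are hypothetical, the table is keyed by character instead of by character code, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical sample table with one multi-character replacement.
$sampleMap = ['Ä' => 'Ae', 'ß' => 'ss', 'é' => 'e'];

/**
 * Transliterates character by character and stops before any replacement
 * that would push the result past $maxLength.
 */
function sketchTransliterate(string $string, array $map, string $unknown = '?', ?int $maxLength = null): string
{
    $result = '';
    foreach (preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (strlen($char) === 1 && ord($char) < 0x80) {
            // US-ASCII characters pass through unchanged.
            $toAdd = $char;
        } else {
            $toAdd = $map[$char] ?? $unknown;
        }
        // Never emit part of a replacement: stop if it does not fit whole.
        if ($maxLength !== null && strlen($result) + strlen($toAdd) > $maxLength) {
            return $result;
        }
        $result .= $toAdd;
    }
    return $result;
}
```

With a limit of 5, 'Straße' yields 'Stra' rather than 'Stras', because the two-letter replacement for 'ß' does not fit in the remaining space.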

static protected int ordUTF8(string $character)

Finds the character code for a UTF-8 character: like ord() but for UTF-8.

Parameters

string $character

A single UTF-8 character.

Return Value

int

The character code, or -1 if an illegal character is found.
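A minimal sketch of such a decoder is shown below. The name sketchOrdUtf8() is hypothetical and the validation rules (rejecting overlong forms, surrogates, and trailing bytes) are assumptions of this sketch; the actual method may be more or less strict.

```php
<?php
/**
 * Decodes a single UTF-8 character into its code point, or returns -1
 * if the byte sequence is not one well-formed character.
 */
function sketchOrdUtf8(string $character): int
{
    $length = strlen($character);
    if ($length === 0) {
        return -1;
    }
    $first = ord($character[0]);
    // The lead byte determines the sequence length and the initial bits.
    if ($first < 0x80) {
        $expected = 1;
        $code = $first;
    } elseif ($first >= 0xC2 && $first <= 0xDF) {
        $expected = 2;
        $code = $first & 0x1F;
    } elseif ($first >= 0xE0 && $first <= 0xEF) {
        $expected = 3;
        $code = $first & 0x0F;
    } elseif ($first >= 0xF0 && $first <= 0xF4) {
        $expected = 4;
        $code = $first & 0x07;
    } else {
        return -1;
    }
    if ($length !== $expected) {
        return -1;
    }
    // Each continuation byte has the form 10xxxxxx and adds six bits.
    for ($i = 1; $i < $expected; $i++) {
        $byte = ord($character[$i]);
        if (($byte & 0xC0) !== 0x80) {
            return -1;
        }
        $code = ($code << 6) | ($byte & 0x3F);
    }
    // Reject overlong encodings, surrogates, and out-of-range values.
    $minimum = [1 => 0, 2 => 0x80, 3 => 0x800, 4 => 0x10000];
    if ($code < $minimum[$expected] || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return -1;
    }
    return $code;
}
```

For US-ASCII input the result matches ord(); for a multi-byte character such as the euro sign it returns the full code point, 0x20AC.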

protected string replace(int $code, string $langcode, string $unknown_character)

Replaces a single Unicode character using the transliteration database.

Parameters

int $code

The character code of a Unicode character.

string $langcode

The language code of the language the character is in.

string $unknown_character

The character to substitute for characters without transliterated equivalents.

Return Value

string

US-ASCII replacement for the character. If the character has a mapping, the mapped replacement is returned; otherwise, $unknown_character is returned. The replacement can contain multiple characters.

protected string lookupReplacement($code, string $unknown_character = '?')

Looks up the generic replacement for a UTF-8 character code.

Parameters

$code

The UTF-8 character code.

string $unknown_character

(optional) The character to substitute for characters without entries in the replacement tables.

Return Value

string

US-ASCII replacement for the character. If the character code has a mapping, the mapped replacement is returned; otherwise, $unknown_character is returned. The replacement can contain multiple characters.

protected readLanguageOverrides($langcode)

Reads in language overrides for a language code.

The data is read from files named "$langcode.php" in PhpTransliteration::$dataDirectory. These files should set up an array variable $overrides with an element whose key is $langcode and whose value is an array whose keys are character codes, and whose values are their transliterations in this language. The character codes can be for any valid Unicode character, independent of the number of bytes.

Parameters

$langcode

Code for the language to read.
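As an illustration of the layout described above, an override file for German (which would be named "de.php") might look like the following. The specific mappings are sample values chosen for this sketch, not a copy of any shipped data file, and the last entry is purely hypothetical.

```php
<?php
// Sketch of a language override file: $overrides has one element keyed
// by the language code; its value maps character codes to replacements.
$overrides['de'] = [
    0xC4 => 'Ae', // Ä
    0xD6 => 'Oe', // Ö
    0xDC => 'Ue', // Ü
    0xE4 => 'ae', // ä
    0xF6 => 'oe', // ö
    0xFC => 'ue', // ü
    // Keys are not limited to small values; any valid Unicode code point
    // works. This entry is a made-up example of a code above 0xFFFF.
    0x1F600 => ':)',
];
```

Because the keys are plain integers, a lookup is a single array access on $overrides[$langcode][$code].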

protected readGenericData($bank)

Reads in generic transliteration data for a bank of characters.

The data is read from a file named "x$bank.php" (with $bank in hexadecimal notation) in PhpTransliteration::$dataDirectory. These files should set up a variable $bank containing an array whose numerical indices are the remaining two bytes of the character code, and whose values are the transliterations of these characters into US-ASCII. Note that the largest Unicode character that can be encoded in this way is 4 bytes long.

Parameters

$bank

First two bytes of the Unicode character, or 0 for the ASCII range.
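One plausible reading of the bank scheme is sketched below. It assumes that the bank is the character code with its lowest 8 bits removed, that the array index is those lowest 8 bits, and that the bank is written with at least two hexadecimal digits in the file name. These are assumptions of the sketch, and sketchBankFile() is a hypothetical helper, not part of the class.

```php
<?php
/**
 * Splits a character code into a bank and an index within that bank, and
 * builds the data file name for the bank.
 */
function sketchBankFile(int $code): array
{
    // Assumption: the high part of the code selects the file.
    $bank = $code >> 8;
    // Assumption: the low 8 bits select the entry inside the file.
    $index = $code & 0xFF;
    return [
        'file' => sprintf('x%02x.php', $bank),
        'bank' => $bank,
        'index' => $index,
    ];
}
```

Under these assumptions, code 0xE9 lands in "x00.php" at index 0xE9, and code 0x20AC lands in "x20.php" at index 0xAC.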