SearchTextProcessor
class SearchTextProcessor implements SearchTextProcessorInterface
Processes search text for indexing.
Properties
| protected TransliterationInterface | $transliteration | The transliteration service. |
| protected ConfigFactoryInterface | $configFactory | The config factory. |
| protected ModuleHandlerInterface | $moduleHandler | The module handler. |
Methods
- __construct(): SearchTextProcessor constructor.
- process(): Processes text into words for indexing.
- analyze(): Runs the text through character analyzers in preparation for indexing.
- invokePreprocess(): Invokes hook_search_preprocess() to simplify text.
- expandCjk(): Splits CJK (Chinese, Japanese, Korean) text into tokens.
- truncate(): Helper function for array_walk in ::analyze().
Details
__construct(TransliterationInterface $transliteration, ConfigFactoryInterface $config_factory, ModuleHandlerInterface $module_handler)
SearchTextProcessor constructor.
array
process(string $text, string|null $langcode = NULL)
Processes text into words for indexing.
string
analyze(string $text, string|null $langcode = NULL)
Runs the text through character analyzers in preparation for indexing.
Processing steps:
- Entities are decoded.
- Text is lower-cased and diacritics (accents) are removed.
- hook_search_preprocess() is invoked.
- CJK (Chinese, Japanese, Korean) characters are processed, depending on the search settings.
- Punctuation is processed (removed or replaced with spaces, depending on where it is; see code for details).
- Words are truncated to 50 characters maximum.
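The steps above can be sketched in plain PHP. This is a simplified illustration, not the class's actual code: the function name `sketch_analyze()` and the small accent map are hypothetical, the hook and CJK steps are left out, and all punctuation is replaced with spaces regardless of position. It assumes the mbstring extension is available.

```php
<?php

/**
 * Simplified stand-in for the analyze() steps (hypothetical helper).
 */
function sketch_analyze(string $text, int $max_length = 50): string {
  // Decode HTML entities such as &eacute; into real characters.
  $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

  // Lower-case the text, then strip a few accents using a tiny illustrative
  // map. The real class delegates this to its transliteration service.
  $text = mb_strtolower($text, 'UTF-8');
  $text = strtr($text, ['é' => 'e', 'è' => 'e', 'ü' => 'u', 'ñ' => 'n']);

  // (The hook_search_preprocess() and CJK steps are omitted in this sketch.)

  // Replace runs of punctuation and symbols with a single space, then
  // collapse whitespace so words are separated by exactly one space.
  $text = preg_replace('/[\p{P}\p{S}]+/u', ' ', $text);
  $text = trim(preg_replace('/\s+/u', ' ', $text));

  // Truncate every word to at most $max_length characters.
  $words = array_map(
    fn(string $word): string => mb_substr($word, 0, $max_length, 'UTF-8'),
    explode(' ', $text)
  );

  return implode(' ', $words);
}
```

Splitting the returned string on spaces then yields the kind of word array that process() is documented to return.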
protected void
invokePreprocess(string $text, string|null $langcode = NULL)
Invokes hook_search_preprocess() to simplify text.
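As a rough illustration of what a preprocess hook can do, the sketch below defines a hypothetical implementation for a module named `my_module` and passes text through a list of such implementations. The helper `sketch_invoke_preprocess()` is an assumption made for this example; the class itself goes through the module handler, and this sketch returns the result rather than mirroring the documented return type.

```php
<?php

/**
 * Hypothetical implementation of hook_search_preprocess() for "my_module".
 */
function my_module_search_preprocess(string $text, ?string $langcode = NULL): string {
  // For English, or when the language is unknown, drop possessive endings
  // so that "john's" is indexed as "john".
  if ($langcode === NULL || $langcode === 'en') {
    $text = preg_replace("/'s\b/", '', $text);
  }
  return $text;
}

/**
 * Stand-in for the hook invocation: runs the text through each callable.
 */
function sketch_invoke_preprocess(array $implementations, string $text, ?string $langcode = NULL): string {
  foreach ($implementations as $implementation) {
    // Each implementation receives the output of the previous one.
    $text = $implementation($text, $langcode);
  }
  return $text;
}
```

Calling `sketch_invoke_preprocess(['my_module_search_preprocess'], "john's book", 'en')` passes the text through the single implementation defined above.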
protected string
expandCjk(array $matches)
Splits CJK (Chinese, Japanese, Korean) text into tokens.
The Search module matches exact words, where a word is defined as a sequence of characters delimited by spaces or punctuation. CJK languages, however, are written as long strings of characters that are not split into words. To allow search matching, CJK text is split into tokens consisting of consecutive, overlapping sequences of characters whose length equals the 'minimum_word_size' setting. This tokenizing is done only if the 'overlap_cjk' setting is TRUE.
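The overlapping tokenization described above can be sketched as follows. The function name `sketch_expand_cjk()` is hypothetical, and it takes the character run and token size directly instead of a `$matches` array. Keeping short runs as a single token and padding the result with spaces are assumptions made for this example.

```php
<?php

/**
 * Splits a run of characters into overlapping tokens of $size characters.
 */
function sketch_expand_cjk(string $run, int $size): string {
  // Guard against a non-positive token size.
  $size = max(1, $size);

  // Split into individual UTF-8 characters rather than bytes.
  $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY);
  $count = count($chars);

  // A run no longer than the token size is kept as one token (assumption).
  if ($count <= $size) {
    return ' ' . $run . ' ';
  }

  // Slide a window of $size characters across the run, one step at a time.
  $tokens = [];
  for ($i = 0; $i + $size <= $count; $i++) {
    $tokens[] = implode('', array_slice($chars, $i, $size));
  }

  // Surround with spaces so the tokens stay separate from neighboring text.
  return ' ' . implode(' ', $tokens) . ' ';
}
```

With a token size of 2, a five-character run yields four overlapping two-character tokens, so a search for any adjacent pair of characters can match an indexed token.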
protected void
truncate(string $text)
Helper function for array_walk in ::analyze().