class SearchTextProcessor implements SearchTextProcessorInterface

Processes search text for indexing.

Properties

protected TransliterationInterface $transliteration

The transliteration service.

protected ConfigFactoryInterface $configFactory

The config factory.

protected ModuleHandlerInterface $moduleHandler

The module handler.

Methods

__construct(TransliterationInterface $transliteration, ConfigFactoryInterface $config_factory, ModuleHandlerInterface $module_handler)

SearchTextProcessor constructor.

array
process(string $text, string|null $langcode = NULL)

Processes text into words for indexing.

string
analyze(string $text, string|null $langcode = NULL)

Runs the text through character analyzers in preparation for indexing.

void
invokePreprocess(string &$text, string|null $langcode = NULL)

Invokes hook_search_preprocess() to simplify text.

string
expandCjk(array $matches)

Splits CJK (Chinese, Japanese, Korean) text into tokens.

void
truncate(string &$text)

Truncates a word in place; used as the array_walk() callback in ::analyze().

Details

__construct(TransliterationInterface $transliteration, ConfigFactoryInterface $config_factory, ModuleHandlerInterface $module_handler)

SearchTextProcessor constructor.

Parameters

TransliterationInterface $transliteration

The transliteration service.

ConfigFactoryInterface $config_factory

The config factory.

ModuleHandlerInterface $module_handler

The module handler.

array process(string $text, string|null $langcode = NULL)

Processes text into words for indexing.

Parameters

string $text

Text to process.

string|null $langcode

(optional) Language code for the language of $text, if known.

Return Value

array

Array of words in the simplified, preprocessed text.

string analyze(string $text, string|null $langcode = NULL)

Runs the text through character analyzers in preparation for indexing.

Processing steps:

  • Entities are decoded.
  • Text is lower-cased and diacritics (accents) are removed.
  • hook_search_preprocess() is invoked.
  • CJK (Chinese, Japanese, Korean) characters are processed, depending on the search settings.
  • Punctuation is processed (removed or replaced with spaces, depending on where it is; see code for details).
  • Words are truncated to 50 characters maximum.
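The steps above can be approximated in a few lines of standalone PHP. This is a simplified sketch, not Drupal's implementation: the function name sketch_analyze() is hypothetical, the hook and CJK steps are left out, the accent map is a tiny illustrative stand-in for the transliteration service, and the punctuation rule is cruder than the real one. It assumes the mbstring extension is available.

```php
<?php

/**
 * Simplified, standalone approximation of the analysis steps.
 *
 * Assumptions: no hook invocation, no CJK handling, and a hand-written
 * accent map instead of a transliteration service.
 */
function sketch_analyze(string $text): string {
  // Decode HTML entities such as &eacute;.
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

  // Lower-case, then strip a few diacritics with an illustrative map.
  $text = mb_strtolower($text, 'UTF-8');
  $text = strtr($text, ['é' => 'e', 'è' => 'e', 'ü' => 'u', 'ñ' => 'n', 'ç' => 'c']);

  // Crude punctuation handling: every run of characters that are not
  // letters or digits becomes a single space.
  $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text);

  // Truncate each word to 50 characters maximum.
  $words = explode(' ', trim($text));
  $words = array_map(fn(string $w): string => mb_substr($w, 0, 50, 'UTF-8'), $words);

  return implode(' ', $words);
}

echo sketch_analyze('Caf&eacute; M&uuml;ller, OPEN!'), "\n"; // prints "cafe muller open"
```

Splitting the returned string on spaces gives a word list of the kind process() is documented to return.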

Parameters

string $text

Text to simplify.

string|null $langcode

(optional) Language code for the language of $text, if known.

Return Value

string

Simplified and processed text.

protected void invokePreprocess(string &$text, string|null $langcode = NULL)

Invokes hook_search_preprocess() to simplify text.

Parameters

string $text

Text to preprocess, passed by reference and altered in place.

string|null $langcode

(optional) Language code for the language of $text, if known.

Return Value

void
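As an illustration, a module could implement hook_search_preprocess() roughly as follows. The module name "mymodule", the spelling rule, and the helper sketch_invoke_preprocess() are all hypothetical; the helper only mimics the by-reference calling pattern described above so the example can run outside Drupal, and it is not the actual method body.

```php
<?php

/**
 * Implements hook_search_preprocess() for a hypothetical module "mymodule".
 */
function mymodule_search_preprocess($text, $langcode = NULL) {
  // Only alter text that is English or of unknown language.
  if ($langcode === NULL || $langcode === 'en') {
    // Illustrative rule (an assumption): fold one spelling variant.
    $text = str_ireplace('colour', 'color', $text);
  }
  // The hook returns the processed text.
  return $text;
}

/**
 * Standalone stand-in for the by-reference pattern: $text is altered in place.
 */
function sketch_invoke_preprocess(string &$text, ?string $langcode = NULL): void {
  $text = mymodule_search_preprocess($text, $langcode);
}

$text = 'Colour charts';
sketch_invoke_preprocess($text, 'en');
echo $text, "\n"; // prints "color charts"
```

Because the text is passed by reference, the caller sees the simplified text without needing a return value.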

protected string expandCjk(array $matches)

Splits CJK (Chinese, Japanese, Korean) text into tokens.

The Search module matches exact words, where a word is a sequence of characters delimited by spaces or punctuation. CJK text, however, is written as long strings of characters that are not split into words. To allow search matching, CJK text is split into tokens consisting of consecutive, overlapping sequences of characters whose length equals the 'minimum_word_size' setting. This tokenizing is only done if the 'overlap_cjk' setting is TRUE.

Parameters

array $matches

This method is a callback for preg_replace_callback(), which is called from self::analyze(). $matches is therefore an array of regular expression matches, and $matches[0] contains the matched text: a string of CJK characters to tokenize.

Return Value

string

Tokenized text, starting and ending with a space character.
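The overlapping tokenization described above can be sketched as a sliding window over the characters of the match. This is a minimal standalone illustration, not the class's code: the function name sketch_expand_cjk() is hypothetical, the $min parameter stands in for the 'minimum_word_size' setting, and keeping short strings as a single token is an assumption.

```php
<?php

/**
 * Standalone sketch of overlapping tokenization.
 *
 * $matches mirrors the preg_replace_callback() argument; $min is a
 * hypothetical stand-in for the 'minimum_word_size' setting.
 */
function sketch_expand_cjk(array $matches, int $min = 3): string {
  $str = $matches[0];

  // Split the matched text into individual UTF-8 characters.
  $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
  $length = count($chars);

  // Assumption: text no longer than the token size stays a single token.
  if ($length <= $min) {
    return ' ' . $str . ' ';
  }

  // Slide a window of $min characters across the text, one character per step.
  $tokens = [];
  for ($i = 0; $i + $min <= $length; $i++) {
    $tokens[] = implode('', array_slice($chars, $i, $min));
  }

  // The result starts and ends with a space character.
  return ' ' . implode(' ', $tokens) . ' ';
}

echo sketch_expand_cjk(['日本語検索']), "\n"; // prints " 日本語 本語検 語検索 "
```

With a window of three, a five-character string yields three tokens, each sharing two characters with its neighbor, so a query for any three consecutive characters can match an indexed token.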

protected void truncate(string &$text)

Truncates a word in place; used as the array_walk() callback in ::analyze().

Parameters

string $text

The text to be truncated.

Return Value

void
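A void callback can only be useful with array_walk() if it modifies each element by reference. The sketch below shows that pattern under stated assumptions: sketch_truncate() is a hypothetical stand-in, the 50-character limit is taken from the ::analyze() description, and any other processing the real method may perform is omitted. It assumes the mbstring extension.

```php
<?php

/**
 * Hypothetical stand-in for the truncate() callback.
 *
 * Shortens a word in place to at most 50 characters.
 */
function sketch_truncate(string &$word): void {
  $word = mb_substr($word, 0, 50, 'UTF-8');
}

$words = ['short', str_repeat('x', 60)];

// array_walk() passes each element by reference, so the array is updated
// even though the callback returns nothing.
array_walk($words, 'sketch_truncate');

echo mb_strlen($words[1], 'UTF-8'), "\n"; // prints "50"
```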