The icu_tokenizer uses the same Unicode Text Segmentation algorithm as the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and by using custom rules to break Myanmar and Khmer text into syllables.
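To use the icu_tokenizer in your own indices, you can reference it from a custom analyzer. The request below is a minimal sketch: it assumes the ICU analysis plugin is installed, and the index name my_index and analyzer name my_icu_analyzer are hypothetical:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_icu_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  }
}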
For instance, compare the tokens produced by the standard and icu_tokenizer tokenizers, respectively, when tokenizing "Hello. I am from Bangkok." in Thai:
GET /_analyze?tokenizer=standard
สวัสดี ผมมาจากกรุงเทพฯ
The standard tokenizer produces two tokens, one for each sentence: สวัสดี and ผมมาจากกรุงเทพฯ. That is useful only if you want to search for the whole sentence "I am from Bangkok", not if you want to search for just "Bangkok".
GET /_analyze?tokenizer=icu_tokenizer
สวัสดี ผมมาจากกรุงเทพฯ
The icu_tokenizer, on the other hand, is able to break up the text into the individual words สวัสดี, ผม, มา, จาก, and กรุงเทพฯ, making them easier to search.
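Once the text is indexed word by word, a query for just the word for Bangkok can match. The sketch below assumes a hypothetical title field that is analyzed with the my_icu_analyzer defined earlier:

GET /my_index/_search
{
  "query": {
    "match": {
      "title": "กรุงเทพฯ"
    }
  }
}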
In contrast, the standard tokenizer "over-tokenizes" Chinese and Japanese text, often breaking up whole words into single characters. Because there are no spaces between words, it can be difficult to tell whether consecutive characters are separate words or form a single word. For instance:
- 向 means "facing", 日 means "sun", and 葵 means "hollyhock". When written together as 向日葵, they mean "sunflower".
- 五 means "five" or "fifth", 月 means "month", and 雨 means "rain". The first two characters written together as 五月 mean "the month of May", and together with the third character, 五月雨 means "continuous rain". When combined with a fourth character 式, meaning "style", the word 五月雨式 becomes an adjective for anything consecutive or unrelenting.
Although each character may be a word in its own right, tokens are more meaningful when they retain the bigger original concept instead of just the component parts.
GET /_analyze?tokenizer=standard
向日葵
GET /_analyze?tokenizer=icu_tokenizer
向日葵
The standard tokenizer in the above example would emit each character as a separate token (向, 日, 葵), while the icu_tokenizer would emit the single token 向日葵 (sunflower).
Another difference between the standard tokenizer and the icu_tokenizer is that the latter will break a word containing characters written in different scripts (for example, βeta) into separate tokens (β, eta), while the former will emit the word as a single token: βeta.