The icu_tokenizer uses the same Unicode Text Segmentation algorithm as the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and by using custom rules to break Myanmar and Khmer text into syllables.
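To use the icu_tokenizer in your own indices, you can reference it from a custom analyzer. The request below is a minimal sketch: it assumes the ICU analysis plugin is installed, and the index name my_index and analyzer name my_icu_analyzer are hypothetical:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_icu_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  }
}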
For instance, compare the tokens produced by the standard and icu_tokenizer tokenizers, respectively, when tokenizing "Hello. I am from Bangkok." in Thai:
GET /_analyze?tokenizer=standard
สวัสดี ผมมาจากกรุงเทพฯ
The standard tokenizer produces two tokens, one for each sentence: สวัสดี and ผมมาจากกรุงเทพฯ. That is useful only if you want to search for the whole sentence "I am from Bangkok", not if you want to search for just "Bangkok".
GET /_analyze?tokenizer=icu_tokenizer
สวัสดี ผมมาจากกรุงเทพฯ
The icu_tokenizer, on the other hand, is able to break up the text into the individual words สวัสดี, ผม, มา, จาก, and กรุงเทพฯ, making them easier to search.
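Once the text is indexed word by word, a query for just the word for Bangkok can match. The sketch below assumes a hypothetical title field that is analyzed with the my_icu_analyzer defined earlier:

GET /my_index/_search
{
  "query": {
    "match": {
      "title": "กรุงเทพฯ"
    }
  }
}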
In contrast, the standard tokenizer "over-tokenizes" Chinese and Japanese text, often breaking up whole words into single characters. Because there are no spaces between words, it can be difficult to tell whether consecutive characters are separate words or form a single word. For instance:
- 向 means "facing", 日 means "sun", and 葵 means "hollyhock". When written together as 向日葵, they mean "sunflower".
- 五 means "five" or "fifth", 月 means "month", and 雨 means "rain". The first two characters written together as 五月 mean "the month of May", and together with the third character, 五月雨 means "continuous rain". When combined with a fourth character 式, meaning "style", the word 五月雨式 becomes an adjective for anything consecutive or unrelenting.
Although each character may be a word in its own right, tokens are more meaningful when they retain the bigger original concept instead of just the component parts.
GET /_analyze?tokenizer=standard
向日葵
GET /_analyze?tokenizer=icu_tokenizer
向日葵
The standard tokenizer in the above example would emit each character as a separate token (向, 日, 葵), while the icu_tokenizer would emit the single token 向日葵 (sunflower).
Another difference between the standard tokenizer and the icu_tokenizer is that the latter will break a word containing characters written in different scripts (for example, βeta) into separate tokens (β, eta), while the former will emit the word as a single token: βeta.