Humans are nothing if not inventive, and human language reflects that. Changing the case of a word seems like such a simple task, until you have to deal with multiple languages.
Take, for example, the lower case German letter ß
. Converting that to upper
case gives you SS
which, converted back to lower case gives you ss
. Or the
Greek letter ς
(sigma, when used at the end of a word). Converting it to
upper case results in Σ
which, converted back to lower case, gives you σ
.
The whole point of lower casing terms is to make them more likely to match, not less! In Unicode, this job is done by case folding rather than by lower casing. Case folding is the act of converting words into a (usually lower case) form which does not necessarily result in the correct spelling, but does allow case-insensitive comparisons.
For instance, the letter ß
, which is already lower case, is folded to
ss
. Similarly, the lower case ς
is folded to σ
, to make σ
, ς
and Σ
comparable, no matter where the letter appears in a word.
The default normalization form which the icu_normalizer
token filter uses
is nfkc_cf
which, like the nfkc
form:
-
composes characters into the shortest byte representation,
-
uses compatibility mode to convert characters like
ffi
into the simplerffi
But also:
-
case-folds characters into a form suitable for case comparison
In other words, nfkc_cf
is the equivalent of the lowercase
token filter,
but suitable for use with all languages. The `on-steriods'' equivalent of the
`standard
analyzer would be the following:
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_lowercaser": {
"tokenizer": "icu_tokenizer",
"filter": [ "icu_normalizer" ] (1)
}
}
}
}
}
-
The
icu_normalizer
defaults to thenfkc_cf
form.
We can compare the results of running Weißkopfseeadler'' and
WEISSKOPFSEEADLER'' (the upper case equivalent) through the standard
analyzer and through our Unicode-aware analyzer:
GET /_analyze?analyzer=standard (1)
Weißkopfseeadler WEISSKOPFSEEADLER
GET /my_index/_analyze?analyzer=my_lowercaser (2)
Weißkopfseeadler WEISSKOPFSEEADLER
-
Emits tokens:
weißkopfseeadler
,weisskopfseeadler
-
Emits tokens:
weisskopfseeadler
,weisskopfseeadler
The standard
analyzer emits two different, incomparable tokens, while our
custom analyzer produces tokens that are comparable, regardless of the
original case.