Commit

update 20_Custom_Analyzers.md
sailxjx committed Apr 2, 2015
1 parent 110bcda commit 879c0a9
Showing 3 changed files with 65 additions and 119 deletions.
2 changes: 1 addition & 1 deletion 070_Index_Mgmt/05_Create_Delete.md
@@ -23,7 +23,7 @@ action.auto_create_index: false
> **NOTE**
> Later on we will show how to use <<index templates>> to pre-configure new indices automatically. This is particularly useful when indexing log data:
> Later on we will show how to use index templates to pre-configure new indices automatically. This is particularly useful when indexing log data:
> you index your logs into an index whose name ends with a date and, the next day, a new index with the right configuration springs into existence automatically.
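
A rough sketch of what such a template could look like (the template name, index pattern, and settings below are purely illustrative; index templates are covered in detail later in the book):

```
PUT /_template/logs
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 1
  }
}
```

Any newly created index whose name matches `logs-*` would pick up these settings automatically.
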
### Deleting an Index
8 changes: 4 additions & 4 deletions 070_Index_Mgmt/15_Configure_Analyzer.md
@@ -2,14 +2,14 @@

The third important index setting is the `analysis` section, which you use to configure existing analyzers or to create custom analyzers specific to your index.

In the <<analysis introduction>>, we looked at some of the built-in analyzers, which convert full-text strings into an inverted index suitable for search.
In the analysis introduction, we looked at some of the built-in analyzers, which convert full-text strings into an inverted index suitable for search.

The `standard` analyzer is the default analyzer for full-text fields and is a good choice for most Western languages. It consists of:

* The `standard` tokenizer, which splits the input text at the word level.
* The `standard` filter, intended to tidy up the tokens emitted by the tokenizer (but currently a no-op).
* The `lowercase` filter, which converts all tokens to lowercase.
* The `stop` filter, which removes stopwords such as `a`, `the`, `and`, `is` that would otherwise add noise to searches.
* The `standard` token filter, intended to tidy up the tokens emitted by the tokenizer (but currently a no-op).
* The `lowercase` token filter, which converts all tokens to lowercase.
* The `stop` token filter, which removes stopwords such as `a`, `the`, `and`, `is` that would otherwise add noise to searches.
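
To see these pieces in action, you can run a short piece of text through the `standard` analyzer with the `analyze` API (the sample sentence is only an illustration):

```
GET /_analyze?analyzer=standard
The quick & brown fox
```

With the default (empty) stopword list you should get back something like the tokens `the`, `quick`, `brown`, and `fox`, lowercased and with the `&` dropped as punctuation.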

By default, the stopword filter is disabled. To enable it, create a custom analyzer based on the `standard` analyzer and set the `stopwords` parameter. You can either supply a list of stopwords or use a predefined stopword list for a particular language.
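
One way to do this, as a minimal sketch (the index name is a placeholder; `es_std` is the analyzer name that 20_Custom_Analyzers.md refers back to), is to base a custom analyzer on `standard` and point it at the built-in Spanish stopword list:

```
PUT /spanish_docs
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_spanish_"
        }
      }
    }
  }
}
```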

174 changes: 60 additions & 114 deletions 070_Index_Mgmt/20_Custom_Analyzers.md
@@ -1,76 +1,36 @@
### Custom Analyzers

While Elasticsearch comes with a number of analyzers available out of the box,
the real power comes from the ability to create your own custom analyzers
by combining character filters, tokenizers, and token filters in a
configuration that suits your particular data.

In <<analysis-intro>>, we said that an _analyzer_ is a wrapper that combines
three functions into a single package,((("analyzers", "character filters, tokenizers, and token filters in"))) which are executed in sequence:

Character filters::
+
--
Character filters((("character filters"))) are used to ``tidy up'' a string before it is tokenized.
For instance, if our text is in HTML format, it will contain HTML tags like
`<p>` or `<div>` that we don't want to be indexed. We can use the
http://bit.ly/1B6f4Ay[`html_strip` character filter]
to remove all HTML tags and to convert HTML entities like `&Aacute;` into the
corresponding Unicode character `Á`.

An analyzer may have zero or more character filters.
--

Tokenizers::
+
--
An analyzer _must_ have a single tokenizer.((("tokenizers", "in analyzers"))) The tokenizer breaks up the
string into individual terms or tokens. The
http://bit.ly/1E3Fd1b[`standard` tokenizer],
which is used((("standard tokenizer"))) in the `standard` analyzer, breaks up a string into
individual terms on word boundaries, and removes most punctuation, but
other tokenizers exist that have different behavior.

For instance, the
http://bit.ly/1ICd585[`keyword` tokenizer]
outputs exactly((("keyword tokenizer"))) the same string as it received, without any tokenization. The
http://bit.ly/1xt3t7d[`whitespace` tokenizer]
splits text((("whitespace tokenizer"))) on whitespace only. The
http://bit.ly/1ICdozA[`pattern` tokenizer] can
be used to split text on a ((("pattern tokenizer")))matching regular expression.
--

Token filters::
+
--
After tokenization, the resulting _token stream_ is passed through any
specified token filters,((("token filters"))) in the order in which they are specified.

Token filters may change, add, or remove tokens. We have already mentioned the
http://bit.ly/1DIeXvZ[`lowercase`] and
http://bit.ly/1INX4tN[`stop` token filters],
but there are many more available in Elasticsearch.
http://bit.ly/1AUfpDN[Stemming token filters]
``stem'' words to ((("stemming token filters")))their root form. The
http://bit.ly/1ylU7Q7[`ascii_folding` filter]
removes diacritics,((("ascii_folding filter"))) converting a term like `"très"` into `"tres"`. The
http://bit.ly/1CbkmYe[`ngram`] and
http://bit.ly/1DIf6j5[`edge_ngram` token filters] can produce((("edge_engram token filter")))((("ngram and edge_ngram token filters")))
tokens suitable for partial matching or autocomplete.
--

In <<search-in-depth>>, we discuss examples of where and how to use these
tokenizers and filters. But first, we need to explain how to create a custom
analyzer.

==== Creating a Custom Analyzer

In the same way as((("index settings", "analysis", "creating custom analyzers")))((("analyzers", "custom", "creating"))) we configured the `es_std` analyzer previously, we can configure
character filters, tokenizers, and token filters in their respective sections
under `analysis`:

[source,js]
--------------------------------------------------
Although Elasticsearch ships with a range of analyzers out of the box, its real power lies in the ability to define your own: by combining character filters, tokenizers, and token filters in your index configuration, you can tailor analysis to your particular data.

In the analysis introduction, we said that an _analyzer_ wraps three kinds of components that are executed in sequence: character filters, tokenizers, and token filters.

Character filters

> Character filters are used to "tidy up" a string before it is tokenized. For example, if our text is in HTML format, it will contain HTML tags such as `<p>` or `<div>` that we do not want to be indexed.
> We can use the [`html_strip` character filter](http://bit.ly/1B6f4Ay) to remove all HTML tags and to convert HTML entities such as `&Aacute;` into the corresponding Unicode character `Á`.
> An analyzer may have zero or more character filters.

Tokenizers

> An analyzer _must_ contain exactly one tokenizer. The tokenizer splits the string into individual terms or tokens. The `standard` analyzer uses the [`standard` tokenizer](http://bit.ly/1E3Fd1b), which splits a string into individual words and removes most punctuation, but other tokenizers with different behavior are available.
> For example, the [`keyword` tokenizer](http://bit.ly/1ICd585) outputs exactly the same string it received, without any tokenization. The [`whitespace` tokenizer](http://bit.ly/1xt3t7d) splits text on whitespace only. The [`pattern` tokenizer](http://bit.ly/1ICdozA) splits text wherever a regular expression matches.

Token filters

> After tokenization, the resulting _token stream_ is passed through the specified token filters, in the order in which they are specified.
> Token filters may change, add, or remove tokens. We have already mentioned the [`lowercase`](http://bit.ly/1DIeXvZ) and [`stop`](http://bit.ly/1INX4tN) token filters, but Elasticsearch offers many more. The [`stemmer` token filters](http://bit.ly/1AUfpDN) reduce words to their root form. The [`ascii_folding` token filter](http://bit.ly/1ylU7Q7) removes diacritics, turning a term like `très` into `tres`. The [`ngram`](http://bit.ly/1CbkmYe) and [`edge_ngram`](http://bit.ly/1DIf6j5) token filters produce tokens suitable for partial matching or autocomplete.
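
A tokenizer and token filter can be tried out ad hoc with the `analyze` API, without creating an index first. A minimal sketch, assuming the query-string parameters `tokenizer` and `filters` accepted by the 1.x `_analyze` API (the sample text is only illustrative):

```
GET /_analyze?tokenizer=keyword&filters=lowercase
The QUICK Brown FOX
```

Because the `keyword` tokenizer emits its input unchanged, the response should contain a single lowercased token, `the quick brown fox`.
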
In Search in Depth, we will show examples of where and how to use these tokenizers and filters. But first, we need to explain how to create a custom analyzer.

### Creating a Custom Analyzer

In the same way as we configured the `es_std` analyzer earlier, we can configure character filters, tokenizers, and token filters in their respective sections under `analysis`:

```
PUT /my_index
{
"settings": {
Expand All @@ -82,48 +42,41 @@ PUT /my_index
}
}
}
--------------------------------------------------
```

As an example, let's set up a custom analyzer that will do the following:

As an example, let's set up a custom analyzer that will do the following:
1.`html_strip` 字符过滤器去除所有的 HTML 标签

1. Strip out HTML by using the `html_strip` character filter.
2.`&` 替换成 `and`,使用一个自定义的 `mapping` 字符过滤器

2. Replace `&` characters with `" and "`, using a custom `mapping`
character filter:
+
[source,js]
--------------------------------------------------
```
"char_filter": {
"&_to_and": {
"type": "mapping",
"mappings": [ "&=> and "]
}
}
--------------------------------------------------
```

3. Tokenize words with the `standard` tokenizer.

3. Tokenize words, using the `standard` tokenizer.
4. Lowercase terms with the `lowercase` token filter.

4. Lowercase terms, using the `lowercase` token filter.
5.`stop` 表征过滤器去除一些自定义停用词。

5. Remove a custom list of stopwords, using a custom `stop` token filter:
+
[source,js]
--------------------------------------------------
```
"filter": {
"my_stopwords": {
"type": "stop",
"stopwords": [ "the", "a" ]
}
}
--------------------------------------------------
```

Our analyzer definition combines the predefined tokenizer and filters with the
custom filters that we have configured previously:
Our analyzer definition combines the predefined tokenizer and filters with the custom filters configured above:

[source,js]
--------------------------------------------------
```
"analyzer": {
"my_analyzer": {
"type": "custom",
@@ -132,13 +85,11 @@
"filter": [ "lowercase", "my_stopwords" ]
}
}
--------------------------------------------------
```

Putting it all together, the whole create-index request looks like this:

To put it all together, the whole `create-index` request((("create-index request"))) looks like this:

[source,js]
--------------------------------------------------
```
PUT /my_index
{
"settings": {
@@ -161,24 +112,22 @@
"filter": [ "lowercase", "my_stopwords" ]
}}
}}}
--------------------------------------------------
// SENSE: 070_Index_Mgmt/20_Custom_analyzer.json
```

<!-- SENSE: 070_Index_Mgmt/20_Custom_analyzer.json -->

After creating the index, use the `analyze` API to((("analyzers", "testing using analyze API"))) test the new analyzer:
After creating the index, use the `analyze` API to test the new analyzer:

[source,js]
--------------------------------------------------
```
GET /my_index/_analyze?analyzer=my_analyzer
The quick & brown fox
--------------------------------------------------
// SENSE: 070_Index_Mgmt/20_Custom_analyzer.json
```

<!-- SENSE: 070_Index_Mgmt/20_Custom_analyzer.json -->

The following abbreviated results show that our analyzer is working correctly:
The abbreviated results below show that our analyzer is working correctly:

[source,js]
--------------------------------------------------
```
{
"tokens" : [
{ "token" : "quick", "position" : 2 },
@@ -187,13 +136,11 @@
{ "token" : "fox", "position" : 5 }
]
}
--------------------------------------------------
```

The analyzer is not much use unless we tell ((("analyzers", "custom", "telling Elasticsearch where to use")))((("mapping (types)", "applying custom analyzer to a string field")))Elasticsearch where to use it. We
can apply it to a `string` field with a mapping such as the following:
The analyzer is of little use unless we tell Elasticsearch where to apply it. We can apply it to a `string` field with a mapping such as the following:

[source,js]
--------------------------------------------------
```
PUT /my_index/_mapping/my_type
{
"properties": {
@@ -203,7 +150,6 @@ PUT /my_index/_mapping/my_type
}
}
}
--------------------------------------------------
// SENSE: 070_Index_Mgmt/20_Custom_analyzer.json

```

<!-- SENSE: 070_Index_Mgmt/20_Custom_analyzer.json -->
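
To confirm that the mapping is in effect, the analyzer attached to the `title` field can be exercised directly. This assumes the `field` parameter of the `analyze` API, and the sample sentence is only illustrative:

```
GET /my_index/_analyze?field=title
The quick & brown fox
```

If `my_analyzer` is applied to `title`, the `&` should come back as the token `and` and the custom stopwords should be gone, just as in the earlier test.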
