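Preface
analysis is the text analysis module of ES (Elasticsearch). Raw text passes through the pipeline in a fixed order:
raw text → char_filter (character preprocessing) → tokenizer (splitting into tokens) → token_filter (per-token processing)
Two kinds of finished product are built from these stages:
analyzer: assembles all three stages; used by text fields for full-text search (the text is tokenized)
normalizer: char_filter + token_filter only, with no tokenizer; used by keyword fields for exact matching (the text is not tokenized)
The five components:
char_filter: character filter
tokenizer: the tokenizing core
token_filter: token filter (configured under the key filter)
analyzer: custom analyzer
normalizer: normalizer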
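1. char_filter: character filter
Purpose: before tokenization, replace, delete, or clean characters across the whole raw text. It rewrites the character stream; no tokens exist yet. Common types: mapping, html_strip, pattern_replace
Example configuration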
"char_filter": {
  // 1. mapping: fixed character-to-string replacements
  "symbol_map": {
    "type": "mapping",
    "mappings": [
      "- => _",
      "& => and",
      "# => num"
    ]
  },
  // 2. html_strip: remove HTML tags
  "clear_html": {
    "type": "html_strip"
  },
  // 3. pattern_replace: regex-based replacement
  "del_all_digit": {
    "type": "pattern_replace",
    "pattern": "\\d+",
    "replacement": ""
  }
}
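Demonstration
Input text: <h1>Hello&Java-2026#test</h1>
After clear_html: Hello&Java-2026#test (html_strip also replaces block-level tags with line breaks, omitted here)
After symbol_map: HelloandJava_2026numtest
After del_all_digit: HelloandJava_numtest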
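To check a char_filter chain without creating an index, the _analyze API accepts inline definitions. The request below is a minimal sketch in Kibana Dev Tools console syntax and assumes a reachable cluster. The keyword tokenizer is used only so that the cleaned text comes back as a single token; it is not part of the char_filter stage itself.

```json
POST /_analyze
{
  "char_filter": [
    "html_strip",
    { "type": "mapping", "mappings": ["- => _", "& => and", "# => num"] },
    { "type": "pattern_replace", "pattern": "\\d+", "replacement": "" }
  ],
  "tokenizer": "keyword",
  "text": "<h1>Hello&Java-2026#test</h1>"
}
```

Comparing the returned token with the walkthrough above is a quick way to confirm that each rule behaves as expected on your ES version. Removing entries from the char_filter array one at a time shows the effect of each step.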
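2. tokenizer: the tokenizing core
Purpose
Splits the complete string produced by the char_filter stage into individual tokens. It is the heart of analysis and is required in every analyzer.
Common types: standard, whitespace, pattern, ik_max_word (Chinese; provided by the IK analysis plugin, which is installed separately)
Example configuration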
"tokenizer": {
  // split on whitespace
  "split_by_space": {
    "type": "whitespace"
  },
  // custom regex separator (split on commas)
  "split_by_comma": {
    "type": "pattern",
    "pattern": ","
  },
  // ES default standard tokenizer (splits English text on word boundaries and punctuation)
  "std_token": {
    "type": "standard"
  }
}
Demonstration
Text: Java,Go Python ES
split_by_comma result: "Java","Go Python ES"
split_by_space result: "Java,Go","Python","ES"
std_token (standard) result: "Java","Go","Python","ES"
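The three results above can be reproduced by passing each tokenizer to the _analyze API. This is a sketch in console syntax that assumes a reachable cluster. The pattern tokenizer is defined inline, so no index settings are required.

```json
POST /_analyze
{
  "tokenizer": "whitespace",
  "text": "Java,Go Python ES"
}

POST /_analyze
{
  "tokenizer": { "type": "pattern", "pattern": "," },
  "text": "Java,Go Python ES"
}

POST /_analyze
{
  "tokenizer": "standard",
  "text": "Java,Go Python ES"
}
```

Each response lists the tokens together with their offsets and positions, which makes it easy to see exactly where each tokenizer cut the text.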
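3. filter (token_filter): token filter
Purpose
After tokenization, processes each token in turn. Filters run in the order they are listed in the analyzer's filter array.
Common types: lowercase, trim, stop, unique, asciifolding
Example configuration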
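"filter": {
  // convert every token to lowercase
  "to_lower": {
    "type": "lowercase"
  },
  // strip leading and trailing whitespace from each token
  "trim_space": {
    "type": "trim"
  },
  // custom stop words: drop tokens that carry little meaning
  "remove_stop_word": {
    "type": "stop",
    "stopwords": ["a", "the", "is", "and"]
  },
  // remove duplicate tokens
  "unique_token": {
    "type": "unique"
  }
}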
Full walkthrough
Tokens after tokenization: " Hello","AND","Hello","JAVA "
to_lower: " hello","and","hello","java "
trim_space: "hello","and","hello","java"
remove_stop_word: "hello","hello","java"
unique_token: "hello","java"
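The filter chain above can be tried with an inline _analyze request. This sketch assumes a reachable cluster. To obtain tokens that still carry leading and trailing spaces, it splits a comma-separated sample string with a pattern tokenizer; that sample string is an assumption made for illustration. Setting explain to true asks ES to report the tokens after every stage rather than only the final result.

```json
POST /_analyze
{
  "tokenizer": { "type": "pattern", "pattern": "," },
  "filter": [
    "lowercase",
    "trim",
    { "type": "stop", "stopwords": ["a", "the", "is", "and"] },
    "unique"
  ],
  "text": " Hello,AND,Hello,JAVA ",
  "explain": true
}
```

Reordering the filter array (for example, placing the stop filter before lowercase) and rerunning the request shows why the order matters: "AND" only matches the stop word "and" once it has been lowercased.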
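4. analyzer: custom analyzer (for text fields)
Structure
analyzer = char_filter (optional) + tokenizer (required) + filter (optional)
Purpose
Assembles the three kinds of component above into one complete set of analysis rules. It is used on text fields and supports fuzzy and full-text search.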
"analyzer": {
  "my_text_analyzer": {
    "char_filter": ["clear_html", "symbol_map"],
    "tokenizer": "std_token",
    "filter": ["to_lower", "remove_stop_word", "unique_token"]
  }
}
Full pipeline walkthrough
Input text:
Hello&Java AND Hello Python-2026
char_filter: strip HTML, replace symbols → HelloandJava AND Hello Python_2026
tokenizer (standard): "HelloandJava","AND","Hello","Python_2026"
filter chain: lowercase → drop the stop word "and" → deduplicate
Final tokens: "helloandjava", "hello", "python_2026"
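The whole pipeline can also be assembled inline in one _analyze request, which is handy for experimenting before committing the analyzer to index settings. The sketch below assumes a reachable cluster and mirrors my_text_analyzer using built-in component names plus inline definitions for the mapping and stop components.

```json
POST /_analyze
{
  "char_filter": [
    "html_strip",
    { "type": "mapping", "mappings": ["- => _", "& => and", "# => num"] }
  ],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stop", "stopwords": ["a", "the", "is", "and"] },
    "unique"
  ],
  "text": "Hello&Java AND Hello Python-2026"
}
```

If the components behave as described in the walkthrough, the response should contain the same three final tokens.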
Where it is used
Referenced by a text field in the mapping
"title": {
  "type": "text",
  "analyzer": "my_text_analyzer"
}
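5. normalizer: normalizer (for keyword fields)
Structure
normalizer = char_filter (optional) + filter (optional)
There is no tokenizer, so the text is never split: the whole value always becomes exactly one token
Purpose
Cleans without splitting. Used for exact matching, aggregation, and sorting on keyword fields, where it removes differences in letter case and stray whitespace.
Custom normalizers are declared with type custom; ES also ships a built-in lowercase normalizer
Example configuration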
"normalizer": {
  "lower_trim_norm": {
    "type": "custom",
    "char_filter": ["symbol_map"],
    "filter": ["lowercase", "trim"]
  }
}
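Demonstration
Input: ABC-TEST
After normalization: abc_test
At query time, both ABC-TEST and abc-test match exactly, because the query term passes through the same normalizer before it is compared with the indexed value.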
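A normalizer's output can be approximated without an index by pairing the same char_filter and filter entries with the keyword tokenizer, which emits the entire input as one token. This is a sketch under that assumption, not how ES wires a normalizer internally. The padded input value is chosen here only to make the effect of trim visible.

```json
POST /_analyze
{
  "char_filter": [
    { "type": "mapping", "mappings": ["- => _", "& => and", "# => num"] }
  ],
  "tokenizer": "keyword",
  "filter": ["lowercase", "trim"],
  "text": " ABC-TEST "
}
```

The response should contain a single token, which is the value that would be stored and compared for a keyword field using lower_trim_norm.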
Where it is used
Referenced by a keyword field in the mapping
"code": {
  "type": "keyword",
  "normalizer": "lower_trim_norm"
}
Complete, ready-to-run index demo (all five components combined)
PUT /test_demo
{
  "settings": {
    "analysis": {
      // 1. char_filter: character preprocessing
      "char_filter": {
        "symbol_map": {
          "type": "mapping",
          "mappings": ["- => _", "& => and"]
        },
        "clear_html": {
          "type": "html_strip"
        }
      },
      // 2. tokenizer: the tokenizing core
      "tokenizer": {
        "std_token": {
          "type": "standard"
        }
      },
      // 3. token_filter: token filters
      "filter": {
        "to_lower": {"type": "lowercase"},
        "trim_space": {"type": "trim"},
        "remove_stop_word": {
          "type": "stop",
          "stopwords": ["a", "and", "the"]
        },
        "unique_token": {"type": "unique"}
      },
      // 4. analyzer: complete analyzer (for text fields)
      "analyzer": {
        "my_text_analyzer": {
          "char_filter": ["clear_html", "symbol_map"],
          "tokenizer": "std_token",
          "filter": ["to_lower", "remove_stop_word", "unique_token"]
        }
      },
      // 5. normalizer: normalizer (for keyword fields)
      "normalizer": {
        "lower_trim_norm": {
          "type": "custom",
          "char_filter": ["symbol_map"],
          "filter": ["lowercase", "trim_space"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article": {
        "type": "text",
        "analyzer": "my_text_analyzer"
      },
      "sku_code": {
        "type": "keyword",
        "normalizer": "lower_trim_norm"
      }
    }
  }
}
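Once the index exists, the requests below are one way to verify it. They are a sketch in console syntax. The document ID and field values are made up for illustration. The first two requests run the named analyzer and normalizer against sample text; the last two index a document and look it up with a differently cased term.

```json
GET /test_demo/_analyze
{
  "analyzer": "my_text_analyzer",
  "text": "Hello&Java AND Hello Python-2026"
}

GET /test_demo/_analyze
{
  "normalizer": "lower_trim_norm",
  "text": " ABC-TEST "
}

PUT /test_demo/_doc/1?refresh=true
{
  "article": "<p>Hello&Java AND Hello Python-2026</p>",
  "sku_code": "ABC-TEST"
}

GET /test_demo/_search
{
  "query": {
    "term": { "sku_code": "abc-test" }
  }
}
```

Because the term query on sku_code is normalized the same way as the stored value, the search is expected to return the document even though the case and the hyphen differ from what was indexed.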
