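Elasticsearch Analysis: The Five Components Explained, with Examples

Preface

Analysis is the text-processing module of Elasticsearch (ES). A piece of raw text passes through the pipeline in this order:

raw text → char_filter (character preprocessing) → tokenizer (splitting into tokens) → token_filter (per-token processing)

Two finished products are built from these stages

analyzer: combines all three stages; used by text fields for full-text search (the text is tokenized)

normalizer: char_filter + token_filter only, with no tokenizer; used by keyword fields for exact matching (the text is not tokenized)

The five components

char_filter: character filter

tokenizer: the tokenizing core

token_filter: token filter (written as filter in the settings)

analyzer: custom analyzer

normalizer: normalizer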
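
To get a feel for the three-stage pipeline before looking at each component, the _analyze API can run an ad-hoc chain without creating an index. The request below is a minimal sketch: the sample text is made up for illustration, and the built-in html_strip, standard, and lowercase components stand in for the custom ones defined later.

```console
POST /_analyze
{
  // Assumption: the sample text is illustrative and not taken from the article.
  "char_filter": ["html_strip"],   // stage 1: clean the raw character stream
  "tokenizer": "standard",         // stage 2: split into tokens
  "filter": ["lowercase"],         // stage 3: process each token
  "text": "<b>Quick</b> Brown FOX"
}
```

The response lists each token with its position and offsets, which makes it easy to see what every stage contributed.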

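1. char_filter: character filter

Purpose: before tokenization, a char_filter replaces, deletes, or cleans characters in the raw text as a whole. It modifies the original character stream; nothing has been split into tokens yet. Multiple char filters run in the order they are listed. Common types: mapping, html_strip, pattern_replace

Example configuration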

"char_filter": {
  // 1. mapping: replace fixed character sequences
  "symbol_map": {
    "type": "mapping",
    "mappings": [
      "- => _",
      "& => and",
      "# => num"
    ]
  },
  // 2. html_strip: remove HTML tags
  "clear_html": {
    "type": "html_strip"
  },
  // 3. pattern_replace: regex-based replacement (here: delete every run of digits)
  "del_all_digit": {
    "type": "pattern_replace",
    "pattern": "\\d+",
    "replacement": ""
  }
}

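Effect

Input text: <h1>Hello&Java-2026#test</h1>

After clear_html: Hello&Java-2026#test (html_strip also replaces block-level tags such as <h1> with a line break; those are omitted here for readability)

After symbol_map: HelloandJava_2026numtest

After del_all_digit: HelloandJava_numtest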
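
The walk-through above can be checked with an ad-hoc _analyze request. This is a sketch under one assumption: the keyword tokenizer is used only so that the char_filter output comes back as a single unsplit token. The three char filters are copied inline from the example configuration and run in the listed order.

```console
POST /_analyze
{
  // Assumption: "keyword" is chosen purely to keep the text in one piece.
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip",
    { "type": "mapping", "mappings": ["- => _", "& => and", "# => num"] },
    { "type": "pattern_replace", "pattern": "\\d+", "replacement": "" }
  ],
  "text": "<h1>Hello&Java-2026#test</h1>"
}
```

Reordering the entries in the char_filter array is a quick way to see how the order changes the result.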

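2. tokenizer: the tokenizing core

Purpose

The tokenizer takes the complete string produced by the char_filter stage and splits it into individual tokens. It is the core of analysis, and every analyzer must have exactly one.

Common types: standard, whitespace, pattern, ik_max_word (Chinese; requires the IK analysis plugin)

Example configuration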

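"tokenizer": {
  // split on whitespace
  "split_by_space": {
    "type": "whitespace"
  },
  // custom regex delimiter (split on commas)
  "split_by_comma": {
    "type": "pattern",
    "pattern": ","
  },
  // ES default standard tokenizer (splits English text on word boundaries and punctuation)
  "std_token": {
    "type": "standard"
  }
}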

Effect

Text: Java,Go Python ES

split_by_comma result: "Java", "Go Python ES"

split_by_space result: "Java,Go", "Python", "ES"

std_token (standard) result: "Java", "Go", "Python", "ES"
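
The three results above can be compared side by side with _analyze. The requests below are a sketch that passes each tokenizer inline (the pattern tokenizer as an object, the other two by their built-in names), so no index is needed.

```console
POST /_analyze
{
  // same definition as split_by_comma
  "tokenizer": { "type": "pattern", "pattern": "," },
  "text": "Java,Go Python ES"
}

POST /_analyze
{
  // same behavior as split_by_space
  "tokenizer": "whitespace",
  "text": "Java,Go Python ES"
}

POST /_analyze
{
  // same behavior as std_token
  "tokenizer": "standard",
  "text": "Java,Go Python ES"
}
```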

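3. filter (token_filter): token filter

Purpose

After tokenization, token filters process each token in turn; the filters run in the order given in the analyzer's filter array.

Common types: lowercase, trim, stop, unique, asciifolding

Example configuration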

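"filter": {
  // convert every token to lowercase
  "to_lower": {
    "type": "lowercase"
  },
  // strip leading and trailing whitespace from each token
  "trim_space": {
    "type": "trim"
  },
  // custom stop words: drop tokens that carry no meaning
  // (the parameter name is "stopwords", not "stop_words")
  "remove_stop_word": {
    "type": "stop",
    "stopwords": ["a", "the", "is", "and"]
  },
  // remove duplicate tokens
  "unique_token": {
    "type": "unique"
  }
}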

Full walk-through

Tokens after tokenization: " Hello", "AND", "Hello", "JAVA "

to_lower: " hello", "and", "hello", "java "

trim_space: "hello", "and", "hello", "java"

remove_stop_word: "hello", "hello", "java"

unique_token: "hello", "java"
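
The filter chain above can be reproduced with _analyze. This sketch makes one assumption: a comma-based pattern tokenizer and a comma-separated sample string are used only to obtain starting tokens that still carry leading and trailing spaces, as in the walk-through.

```console
POST /_analyze
{
  // Assumption: this tokenizer and text exist only to yield " Hello", "AND", "Hello", "JAVA ".
  "tokenizer": { "type": "pattern", "pattern": "," },
  "filter": [
    "lowercase",
    "trim",
    { "type": "stop", "stopwords": ["a", "the", "is", "and"] },
    "unique"
  ],
  "text": " Hello,AND,Hello,JAVA "
}
```

Moving the stop filter ahead of lowercase is a useful experiment: "AND" no longer matches the lowercase stop word "and", which shows why filter order matters.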

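4. analyzer: custom analyzer (for text fields)

Structure

analyzer = char_filter (optional) + tokenizer (required) + filter (optional)

Purpose

An analyzer assembles the three kinds of components above into one complete set of analysis rules. It is used on text fields and supports full-text and fuzzy search.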

"analyzer": {
  "my_text_analyzer": {
    "char_filter": ["clear_html", "symbol_map"],
    "tokenizer": "std_token",
    "filter": ["to_lower", "remove_stop_word", "unique_token"]
  }
}

Full pipeline walk-through

Input text:
Hello&Java AND Hello Python-2026

char_filter: strip HTML, replace symbols → HelloandJava AND Hello Python_2026
tokenizer (standard): "HelloandJava", "AND", "Hello", "Python_2026"
filter chain: lowercase → remove the stop word "and" → deduplicate
Final tokens: "helloandjava", "hello", "python_2026"
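
To verify this walk-through against a real cluster, the named analyzer can be called through the index-scoped _analyze endpoint. The sketch assumes that the test_demo index from the complete demo at the end of this article has already been created, since a custom analyzer only exists inside the index that defines it.

```console
GET /test_demo/_analyze
{
  // Assumption: the test_demo index (see the complete demo) already exists.
  "analyzer": "my_text_analyzer",
  // Optional: "explain" adds per-stage detail to the response.
  "explain": true,
  "text": "Hello&Java AND Hello Python-2026"
}
```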

Where it is used

Referenced from a text field in the mapping

"title": {
  "type": "text",
  "analyzer": "my_text_analyzer"
}

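5. normalizer: normalizer (for keyword fields)

Structure

normalizer = char_filter (optional) + filter (optional)

There is no tokenizer, so the text is never split: the whole value always yields exactly one token

Purpose

A normalizer only cleans the value and never splits it. It is used for exact matching, aggregations, and sorting on keyword fields, and takes care of differences in letter case and stray whitespace.

A normalizer is either a custom one (type custom) or a built-in one such as lowercase. Because it must emit a single token, only char filters and token filters that work character by character (for example lowercase, asciifolding, trim) can be used

Example configuration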

"normalizer": {
  "lower_trim_norm": {
    "type": "custom",
    "char_filter": ["symbol_map"],
    "filter": ["lowercase", "trim"]
  }
}

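Effect

Input: ABC-TEST

After normalization: abc_test

At query time, both ABC-TEST and abc-test match exactly, because the normalizer is also applied to the search term.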
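
The _analyze API also accepts a normalizer parameter, which gives a quick way to inspect the single token a normalizer produces. The sketch assumes the test_demo index from the complete demo at the end of this article exists; the surrounding spaces in the sample value are added here only to exercise the trim filter.

```console
GET /test_demo/_analyze
{
  // Assumption: test_demo and its lower_trim_norm normalizer already exist.
  "normalizer": "lower_trim_norm",
  "text": " ABC-TEST "
}
```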

Where it is used

Referenced from a keyword field in the mapping

"code": {
  "type": "keyword",
  "normalizer": "lower_trim_norm"
}

A complete, ready-to-run index demo (all five components together)

PUT /test_demo
{
  "settings": {
    "analysis": {
      // 1. char_filter: character preprocessing
      "char_filter": {
        "symbol_map": {
          "type": "mapping",
          "mappings": ["- => _", "& => and"]
        },
        "clear_html": {
          "type": "html_strip"
        }
      },
      // 2. tokenizer: the tokenizing core
      "tokenizer": {
        "std_token": {
          "type": "standard"
        }
      },
      // 3. token_filter: token filters
      "filter": {
        "to_lower": {"type": "lowercase"},
        "trim_space": {"type": "trim"},
        "remove_stop_word": {
          "type": "stop",
          "stopwords": ["a", "and", "the"]
        },
        "unique_token": {"type": "unique"}
      },
      // 4. analyzer: complete analyzer (for text fields)
      "analyzer": {
        "my_text_analyzer": {
          "char_filter": ["clear_html", "symbol_map"],
          "tokenizer": "std_token",
          "filter": ["to_lower", "remove_stop_word", "unique_token"]
        }
      },
      // 5. normalizer: normalizer (for keyword fields)
      "normalizer": {
        "lower_trim_norm": {
          "type": "custom",
          "char_filter": ["symbol_map"],
          "filter": ["lowercase", "trim_space"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article": {
        "type": "text",
        "analyzer": "my_text_analyzer"
      },
      "sku_code": {
        "type": "keyword",
        "normalizer": "lower_trim_norm"
      }
    }
  }
}
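
Once the index exists, a short round trip shows both field types at work. The requests below are a sketch: the document ID, the field values, and the search terms are made up for illustration. If the walk-throughs in the earlier sections hold, the term query on sku_code and the match query on article should both find the document.

```console
PUT /test_demo/_doc/1?refresh=true
{
  // Assumption: ID 1 and these values are illustrative only.
  "article": "<p>Hello&Java AND Hello Python-2026</p>",
  "sku_code": " ABC-TEST "
}

GET /test_demo/_search
{
  // keyword field: the search term goes through lower_trim_norm as well
  "query": { "term": { "sku_code": "abc-test" } }
}

GET /test_demo/_search
{
  // text field: the query text goes through my_text_analyzer
  "query": { "match": { "article": "python_2026" } }
}
```

The refresh=true parameter makes the document searchable immediately, which is convenient in a demo but usually unnecessary in production indexing.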