[Elasticsearch] Search Highlighting

1. What Is a Fragment
2. Hands-On Examples
- 2.1 Preparing Test Data
- 2.2 Basic Highlight Syntax
- 2.3 Custom Highlight Tags
- 2.4 Highlighting Multiple Fields
- 2.5 Returning the Full Field Content
- 2.6 Long Text with Multiple Matches
- 2.7 Limiting Fragment Count and Size
3. Why Fragments Sometimes Exceed fragment_size
- 3.1 Example Analysis
  - 3.1.1 Query 1: A Tight Fragment Size
  - 3.1.2 Query 2: Observing Fragment Expansion
- 3.2 Key Takeaways

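Highlighting marks the matching text fragments in search results.

1. What Is a Fragment

An Elasticsearch highlight fragment is a short piece of text, extracted from the original field, that contains the search term. Its purpose is to show users where the match occurs in the source text without returning the entire field.

The fragments returned are controlled jointly by fragment_size and number_of_fragments.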

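| Setting | Description |
| --- | --- |
| `fragment_size` | Target length of each highlighted fragment, in characters (the actual length may be slightly larger so that words stay intact). Default: 100. For long fields, only the fragments containing matches are returned rather than the whole field, which reduces network transfer and improves readability. |
| `number_of_fragments` | Maximum number of fragments returned for a single field of a single document. Default: 5. A long text may match the query in several places; this setting caps how many of them are shown. Set it to 0 to return the entire field without fragmenting, which suits short text or cases that need the full content. |
| Fragment content | Always contains at least one matched term, wrapped in highlight tags (such as `<em>`). |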

2. Hands-On Examples

2.1 Preparing Test Data

First, create an index named blog_posts and insert some test data:

PUT /blog_posts
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "author": { "type": "keyword" },
      "views": { "type": "integer" },
      "publish_date": { "type": "date" },
      "tags": { "type": "keyword" }
    }
  }
}

POST /blog_posts/_bulk
{"index":{}}
{"title":"Elasticsearch Basics","content":"Learn the basics of Elasticsearch and how to perform simple queries.","author":"John Doe","views":1500,"publish_date":"2023-01-15","tags":["search","database"]}
{"index":{}}
{"title":"Advanced Search Techniques","content":"Explore advanced search techniques in Elasticsearch including aggregations and filters.","author":"Jane Smith","views":3200,"publish_date":"2023-02-20","tags":["search","advanced"]}
{"index":{}}
{"title":"Data Analytics with ELK","content":"How to use the ELK stack for data analytics and visualization.","author":"John Doe","views":2800,"publish_date":"2023-03-10","tags":["analytics","elk"]}
{"index":{}}
{"title":"Elasticsearch Performance Tuning","content":"Tips and tricks for optimizing Elasticsearch performance in production environments.","author":"Mike Johnson","views":4200,"publish_date":"2023-04-05","tags":["performance","optimization"]}
{"index":{}}
{"title":"Kibana Dashboard Guide","content":"Creating effective dashboards in Kibana for monitoring and analysis.","author":"Jane Smith","views":1900,"publish_date":"2023-05-12","tags":["kibana","visualization"]}

2.2 Basic Highlight Syntax

The simplest request specifies no highlight parameters. The defaults still apply: fragment_size is 100 and number_of_fragments is 5.

GET /blog_posts/_search
{
  "query": {
    "match": {
      "content": "Elasticsearch"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}
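
The highlighted fragments come back under each hit's `highlight` key, keyed by field name. As a minimal sketch of reading them on the client side, the Python below walks a response dictionary. The `sample_response` is a hand-written stand-in shaped like a search response (not real cluster output), and `collect_highlights` is a hypothetical helper name.

```python
from typing import Dict, List


def collect_highlights(response: dict, field: str) -> Dict[str, List[str]]:
    """Map each hit's _id to its highlighted fragments for one field."""
    result: Dict[str, List[str]] = {}
    for hit in response.get("hits", {}).get("hits", []):
        # "highlight" is absent when a hit has nothing to highlight in the field.
        fragments = hit.get("highlight", {}).get(field, [])
        if fragments:
            result[hit["_id"]] = fragments
    return result


# Hand-written stand-in for a search response; IDs and text are illustrative.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "doc-1",
                "_source": {"content": "Learn the basics of Elasticsearch."},
                "highlight": {
                    "content": ["Learn the basics of <em>Elasticsearch</em>."]
                },
            },
            {"_id": "doc-2", "_source": {"content": "No match here."}},
        ]
    }
}

highlights = collect_highlights(sample_response, "content")
print(highlights)
```

Hits without a `highlight` entry for the field are skipped, so the caller can fall back to `_source` for those documents if needed.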

2.3 Custom Highlight Tags

Use pre_tags and post_tags to replace the default <em> tags.

GET /blog_posts/_search
{
  "query": {
    "match": {
      "content": "techniques"
    }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "content": {
        "fragment_size": 150,
        "number_of_fragments": 3
      }
    }
  }
}

2.4 Highlighting Multiple Fields

Each field listed under fields can carry its own fragment settings.

GET /blog_posts/_search
{
  "query": {
    "multi_match": {
      "query": "search",
      "fields": ["title", "content"]
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "content": {
        "fragment_size": 100,
        "number_of_fragments": 2
      }
    }
  }
}
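
When requests like the two above are assembled in application code, the shared tag settings and the per-field settings can be kept apart. The following is a rough sketch, assuming the request body is built as a plain Python dictionary before being sent by whatever client you use. `build_highlight` is a hypothetical helper, and the `encoder` option (a standard highlight setting that HTML-escapes the fragment text) does not appear in the original examples; it is added on the assumption that the fragments are rendered in a web page.

```python
import json
from typing import Dict, Optional


def build_highlight(
    fields: Dict[str, dict],
    pre_tag: str = "<em>",
    post_tag: str = "</em>",
    encoder: Optional[str] = None,
) -> dict:
    """Build the "highlight" section of a search request body."""
    highlight = {
        "pre_tags": [pre_tag],    # applies to every field below
        "post_tags": [post_tag],
        "fields": fields,         # per-field settings such as fragment_size
    }
    if encoder is not None:
        highlight["encoder"] = encoder  # assumption: "html" when rendering in a page
    return highlight


body = {
    "query": {"multi_match": {"query": "search", "fields": ["title", "content"]}},
    "highlight": build_highlight(
        {"title": {}, "content": {"fragment_size": 100, "number_of_fragments": 2}},
        pre_tag="<strong>",
        post_tag="</strong>",
        encoder="html",
    ),
}

print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after `GET /blog_posts/_search` in the console to try it out.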

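2.5 Returning the Full Field Content

Set number_of_fragments to 0 to return the whole field with the highlights applied; fragment_size is then ignored. This suits short text such as titles.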

GET /blog_posts/_search
{
  "query": { "match": { "title": "Kibana" } },
  "highlight": {
    "fields": {
      "title": { "number_of_fragments": 0 }
    }
  }
}

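2.6 Long Text with Multiple Matches

To simulate a long text field, update one of the documents. The ID below was auto-generated when the bulk request ran, so substitute an _id from your own index: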

POST /blog_posts/_update/Nmgc2ZcB9mA5oeTvZT0A
{
  "doc": {
    "content": "Elasticsearch is a tool. Elasticsearch is fast. Elasticsearch scales well. Repeat: Elasticsearch is a tool."
  }
}

GET /blog_posts/_search
{
  "query": { "match": { "content": "Elasticsearch" } },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 30,
        "number_of_fragments": 3
      }
    }
  }
}

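For the updated document, at most 3 fragments are returned, so one of its four matches is left out.

2.7 Limiting Fragment Count and Size

Each fragment targets roughly 30 characters, and at most 2 fragments are returned per field even though more matches exist.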

GET /blog_posts/_search
{
  "query": { "match": { "content": "Elasticsearch" } },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 30,
        "number_of_fragments": 2
      }
    }
  }
}

If number_of_fragments is omitted, the default of 5 applies, so all four matches in the updated document are returned.

GET /blog_posts/_search
{
  "query": { "match": { "content": "Elasticsearch" } },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 30
      }
    }
  }
}

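3. Why Fragments Sometimes Exceed fragment_size

Even with fragment_size set, the returned fragments can be somewhat longer. The reasons include:

Word integrity: Elasticsearch does not cut a fragment in the middle of a word, so the fragment extends to the next word boundary (a space or punctuation mark).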

// Assume fragment_size=20 and the matched term is "Elasticsearch"
"This is a test with Elasticsearch and other words"
→ might return: "test with <em>Elasticsearch</em> and other" (33 characters excluding tags)

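Highlight tags: tags such as <em> add extra characters to the returned string, but they are not counted toward fragment_size.
Boundary expansion: to keep the context readable, Elasticsearch may extend a fragment slightly. With the default unified highlighter, fragments are split on sentence boundaries where possible.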
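
To check how long a returned fragment really is, it helps to measure it with and without the highlight tags. A small sketch, assuming the default `<em>` tags and using the illustrative fragment from the example above; `visible_length` is a hypothetical helper name.

```python
def visible_length(fragment: str, pre_tag: str = "<em>", post_tag: str = "</em>") -> int:
    """Length of a fragment once the highlight tags are removed."""
    return len(fragment.replace(pre_tag, "").replace(post_tag, ""))


fragment = "test with <em>Elasticsearch</em> and other"

print(visible_length(fragment))  # 33: the text a reader actually sees
print(len(fragment))             # 42: the same text plus the 9 tag characters
```

Comparing the first number against fragment_size shows how far the highlighter overshot the target; the second number is what travels over the wire.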

3.1 Example Analysis

Insert a test document.

PUT /test/_doc/1
{
  "text": "Elasticsearch is a distributed search engine. It is built on top of Lucene. Elasticsearch provides powerful full-text search capabilities. Many companies use Elasticsearch for log analytics."
}

3.1.1 Query 1: A Tight Fragment Size

GET /test/_search
{
  "query": { "match": { "text": "Elasticsearch" } },
  "highlight": {
    "fields": {
      "text": { 
        "fragment_size": 20,
        "number_of_fragments": 2
      }
    }
  }
}

A possible response (illustrative; the exact fragments depend on the highlighter type and version):

"highlight": {
  "text": [
    "<em>Elasticsearch</em> is a",  // 18 characters excluding tags, 27 with them
    "use <em>Elasticsearch</em> for" // 21 characters excluding tags, 30 with them
  ]
}

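Note: fragment_size=20 is a target, not a hard limit. Fragments end on word boundaries, so their length can land slightly above or below 20, and the highlight tags add further characters on top.

3.1.2 Query 2: Observing Fragment Expansion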

GET /test/_search
{
  "query": { "match": { "text": "search" } },
  "highlight": {
    "fields": {
      "text": { 
        "fragment_size": 10,
        "number_of_fragments": 1
      }
    }
  }
}

Example response (illustrative):

"highlight": {
  "text": [
    "distributed <em>search</em> engine"  // 25 characters excluding tags (far more than 10)
  ]
}

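Reason: the highlighter keeps a minimum amount of context around the matched term "search", so it cannot cut the fragment at exactly 10 characters.

3.2 Key Takeaways

A fragment is a piece of text around the matched term, used to show the keyword in context.
fragment_size is a target value; the actual length can exceed it because of:
- keeping words intact
- the added highlight tags
- a minimum amount of retained context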

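When tighter control is needed, one option is "type": "fvh" with "boundary_scanner": "chars". The chars scanner is only valid for the fvh highlighter, not for plain, and fvh requires the field to be mapped with "term_vector": "with_positions_offsets". Even then, fragment_size remains approximate, and words may be broken.

"highlight": {
  "fields": {
    "text": {
      "fragment_size": 20,
      "number_of_fragments": 1,
      "type": "fvh",  // needs term_vector: with_positions_offsets in the mapping
      "boundary_scanner": "chars" // use boundary_chars as fragment boundaries
    }
  }
}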

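This design balances precise control against readability. If your application is sensitive to fragment length, consider post-processing the fragments on the backend.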
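
As one possible shape for that backend post-processing, the sketch below cuts a highlighted fragment to a hard limit of visible characters while keeping the highlight tags balanced. It is an assumption-laden illustration, not part of Elasticsearch: `truncate_fragment` is a hypothetical helper, it assumes the tag strings never occur in the field text itself, and it assumes the fragment was not HTML-encoded (an entity such as `&amp;` could otherwise be cut in half).

```python
def truncate_fragment(
    fragment: str,
    max_chars: int,
    pre_tag: str = "<em>",
    post_tag: str = "</em>",
) -> str:
    """Cut a fragment to at most max_chars visible characters.

    Highlight tags do not count toward the limit, and a tag that is
    still open at the cut point is closed.
    """
    out = []
    visible = 0      # characters emitted so far, tags excluded
    inside = False   # True while between pre_tag and post_tag
    i = 0
    while i < len(fragment) and visible < max_chars:
        if fragment.startswith(pre_tag, i):
            out.append(pre_tag)
            inside = True
            i += len(pre_tag)
        elif fragment.startswith(post_tag, i):
            out.append(post_tag)
            inside = False
            i += len(post_tag)
        else:
            out.append(fragment[i])
            visible += 1
            i += 1
    if inside:
        out.append(post_tag)  # close the tag left open by the cut
    return "".join(out)


print(truncate_fragment("distributed <em>search</em> engine", 15))
# distributed <em>sea</em>
```

This version cuts from the start of the fragment, so a very small limit can drop the highlighted term entirely; a variant that centers the window on the first tag would avoid that at the cost of a little more code.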