Elasticsearch：语义文本 - 更简单、更好、更精炼、更强大 8.18

作者：来自 Elastic Mike Pellegrini

我们最新的 semantic_text 迭代带来了大量改进。除了简化 _source 中的表示之外，其好处还包括减少冗长程度、更高效的磁盘利用率以及更好地与其他 Elasticsearch 功能集成。你现在可以使用突出显示来检索与你的查询最相关的块。也许最重要的是，它现在是一个正式发的功能！

我们在 semantic_text 字段类型上经历了一段漫长的旅程，而这一最新版本承诺让语义搜索变得前所未有的简单。除了在 _source 中简化 semantic_text 的表示外，它还带来了减少冗余、更高效的磁盘利用率以及更好的 Elasticsearch 其他功能集成等优势。你现在可以使用高亮功能来检索与你查询最相关的文本片段。而最重要的是，它现已成为一个正式发布的功能！

语义搜索的演进

多年来，我们对语义搜索的方法不断发展，目标是让它尽可能简单。在 semantic_text 字段类型推出之前，执行语义搜索需要手动完成以下步骤：

配置映射以兼容你的嵌入向量。
配置一个摄取管道，以使用机器学习模型生成嵌入向量。
使用该管道来摄取你的文档。
在查询时使用相同的机器学习模型为查询文本生成嵌入向量。

当时，我们称这种设置为 "简单"，但我们知道，我们可以让它变得更简单。于是，semantic_text 诞生了。

初始阶段

我们在 Elasticsearch 8.15 中引入了 semantic_text，目标是简化语义搜索。如果你对 semantic_text 还不熟悉，建议先阅读我们的原始博客文章，以了解我们的方法背景。

我们最初将 semantic_text 作为 Beta 功能发布，是有充分理由的。在软件开发中，有一句广为流传的真理：让某样东西变得简单，往往是极其困难的，semantic_text 也不例外。在它的背后，有许多复杂的组件协同工作，以实现这种 "魔法般" 的语义搜索体验。我们希望花时间确保这些组件正确无误，然后再将该功能正式推出。事实证明，这段时间的投入是值得的 ------ 我们在最初的方法上进行了多次迭代，增加了新功能，并优化了存储，使 semantic_text 变得更简单、更精炼，并且更具长期可维护性。

我们最初的实现依赖于修改 _source 来存储推理结果。这导致 semantic_text 字段具有相对复杂的子字段结构：

复制代码

GET test-index/_doc/doc1
{
  "_index": "test-sparse",
  "_id": "doc1",
  "_source": {
    "infer_field": {
      "text": "these are not the droids you're looking for. He's free to go around",
      "inference": {
        "inference_id": "my-elser-endpoint",
        "model_settings": {
          "task_type": "sparse_embedding"
        },
        "chunks": [
          {
            "text": "these are not the droids you're looking for. He's free to go around",
            "embeddings": {
              "##oid": 1.9103845,
              "##oids": 1.768872,
              "free": 1.693662,
              "dr": 1.6103356,
              "around": 1.4376559,
              "these": 1.1396849

              ...
            }
          }
        ]
      }
    }
  }
}

这种结构带来了一些问题：

过于冗长。 除了原始值外，它还包含元数据和文本分片信息，使 API 响应既难以阅读，又比实际需要的更大。
增加了磁盘上的索引大小。 由于嵌入向量可能非常大，它们实际上被存储了两次：一次用于 Lucene 索引以支持检索，另一次存储在 _source 中。这极大地影响了 semantic_text 在大规模数据集上的可扩展性。
管理起来不够直观。 原始提供的文本值存储在 text 子字段中，因此需要特殊处理才能在后续操作中获取该值。这导致 semantic_text 字段的行为与文本字段族中的其他字段不一致，从而带来了一系列复杂的连锁反应，使得它更难集成到更高级的工作流中。

语义文本即文本

我们的改进版本优雅地解决了这些问题，通过优化 semantic_text 在 _source 中的表示方式，使其更加简洁。与其在 semantic_text 字段内直接存储元数据和文本分片信息的复杂子字段结构，我们现在改用一个 隐藏的元字段（metafield） 来处理这些内容。这意味着，我们不再需要修改 _source 来存储推理结果。

从实际效果来看，这一改进意味着你提交用于索引的文档 _source，在检索时依然会以相同的 _source 返回，不会再被额外修改或扩展。

复制代码

GET test-index/_doc/doc1
{
  "_index": "test-sparse",
  "_id": "doc1",
  "_source": {
    "infer_field": "these are not the droids you're looking for. He's free to go around"
  }
}

结构简化带来的变化

你会注意到，_source 结构中不再包含 text 或 inference 之类的子字段。现在，_source 保持与索引时提供的内容完全一致，变得更加简单！

🚨 注意：如果你的代码依赖于解析 semantic_text 字段在搜索结果或 Get API 返回值中的子字段结构，这将是一个 破坏性变更 。也就是说，如果你以前解析的是 infer_field.text 子字段的值，你需要更新代码，使其改为解析 infer_field 的值。我们尽力避免破坏性变更，但由于 _source 结构调整而移除子字段结构，这次变更不可避免。

这一简化带来的主要优势：

更易使用。 不再需要解析复杂的子字段结构来获取原始文本值，直接读取字段值即可。
减少冗余。 元数据和文本分片信息不会再干扰 API 响应，使其更加简洁。
提高磁盘利用率。 嵌入向量不再存储在 _source 中，减少了存储占用。
更好的集成性。 使 semantic_text 更好地支持 Elasticsearch 其他功能，如多字段（multi-fields）、文档的部分更新（partial updates）以及重新索引（reindexing）。

让我们稍微扩展一下最后一点，因为它涵盖了几个领域。通过这种简化，semantic_text 字段现在可以用作 multi-fields 的源和目标

复制代码

PUT multi-field-source-example-index
{
  "mappings": {
    "properties": {
      "inference": {
        "type": "semantic_text",
        "fields": {
          "text": {
            "type": "text"
           }
        }
      }
    }
  }
}

PUT multi-field-target-example-index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "inference": {
            "type": "semantic_text"
           }
        }
      }
    }
  }
}

semantic_text 字段现在还支持通过 Bulk API 进行部分文档更新：

复制代码

PUT partial-update-example-index
{
  "mappings": {
    "properties": {
      "inference": {
        "type": "semantic_text"
      },
      "source_field": {
        "type": "text",
        "copy_to": "inference"
      }
    }
  }
}

POST my-index/_doc/1
{
  "inference": "a test value",
  "source_field": "another test value"
}

POST my-index/_bulk
{ "update": {"_id": "1"} }
{ "doc": {"source_field": "a different test value"} }

现在，你可以将数据重新索引到使用不同 inference_id 的 semantic_text 字段中。

复制代码

PUT source-index
{
  "mappings": {
    "properties": {
      "inference": {
        "type": "semantic_text",
        "inference_id": "my-elser-endpoint"
      }
    }
  }
}

PUT dest-index
{
  "mappings": {
    "properties": {
      "inference": {
        "type": "semantic_text",
        "inference_id": "my-e5-endpoint"
      }
    }
  }
}

POST dest-index/_doc/1
{
  "inference": "a test value"
}

POST _reindex
{
  "source": {
    "index": "source-index"
  },
  "dest": {
    "index": "dest-index"
  }
}

语义高亮（Semantic Highlighting）

semantic_text 最受欢迎的功能请求之一，就是能够在字段内检索最相关的文本片段。这一功能对于 RAG（检索增强生成）应用至关重要。此前，我们（非官方地）通过 inner_hits 提供了一些临时解决方案，但现在我们决定淘汰 inner_hits，转而采用更精简的方案：高亮（highlighting）。

高亮是一种常见的 词法搜索技术 ，通常用于文本字段。由于 semantic_text 属于文本字段家族，因此将高亮技术适配到 semantic_text 是合理的。为此，我们新增了 语义高亮（semantic highlighter），它可以帮助你检索与查询最相关的文本片段：

复制代码

PUT highlighting-index
{
  "mappings": {
    "properties": {
      "inference": {
        "type": "semantic_text"
      }
    }
  }
}

POST highlighting-index/_doc/1
{
  "inference": ["Yosemite is one of the most popular national parks in the USA", "Park visitors should dispose of their trash properly in bear-proof trash bins to avoid attracting bears", "Bears are quite clever and curious though, so visitors should always be on the lookout for bear activity regardless"]
}

GET highlighting-index/_search
{
  "query": {
    "semantic": {
      "field": "inference",
      "query": "Where should I throw away my trash?"
    }
  },
  "highlight": {
    "fields": {
      "inference": {
        "order": "score",
        "number_of_fragments": 1
      }
    }
  }
}

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 15.898441,
        "hits": [
            {
                "_index": "highlighting-index",
                "_id": "1",
                "_score": 15.898441,
                "_source": {
                    "inference": [
                        "Yosemite is one of the most popular national parks in the USA",
                        "Park visitors should dispose of their trash properly in bear-proof trash bins to avoid attracting bears",
                        "Bears are quite clever and curious though, so visitors should always be on the lookout for bear activity regardless"
                    ]
                },
                "highlight": {
                    "inference": [
                        "Park visitors should dispose of their trash properly in bear-proof trash bins to avoid attracting bears"
                    ]
                }
            }
        ]
    }
}


GET highlighting-index/_search
{
  "query": {
    "semantic": {
      "field": "inference",
      "query": "Are bears smart?"
    }
  },
  "highlight": {
    "fields": {
      "inference": {
        "order": "score",
        "number_of_fragments": 1
      }
    }
  }
}

{
    "took": 20,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 18.584934,
        "hits": [
            {
                "_index": "highlighting-index",
                "_id": "1",
                "_score": 18.584934,
                "_source": {
                    "inference": [
                        "Yosemite is one of the most popular national parks in the USA",
                        "Park visitors should dispose of their trash properly in bear-proof trash bins to avoid attracting bears",
                        "Bears are quite clever and curious though, so visitors should always be on the lookout for bear activity regardless"
                    ]
                },
                "highlight": {
                    "inference": [
                        "Bears are quite clever and curious though, so visitors should always be on the lookout for bear activity regardless"
                    ]
                }
            }
        ]
    }
}

请参阅 semantic_text 文档，了解如何使用高亮功能的更多信息。

正式上线

随着 _source 表示方式的调整，我们现在正式宣布 semantic_text 已成为正式发布的功能 🎉！这意味着我们承诺不再对该功能进行破坏性更改，并正式支持在生产环境中使用。作为用户，你可以放心地将 semantic_text 集成到生产工作流中，Elastic 也将持续提供支持，确保长期稳定性。

从 Beta 迁移

为了确保从 Beta 版本的平稳迁移，所有在 Elasticsearch 8.15 至 8.17 版本创建的索引 ，或 在 1 月 30 日之前在 Serverless 中创建的索引 ，仍将按照 Beta _source 结构运行。换句话说，它们仍将使用 Beta 版本的 _source 表示方式。

我们建议尽早迁移到正式发布版本的 _source 结构。你可以通过 重新索引（reindexing）到新索引 来完成迁移：

复制代码

PUT my-new-index
{
  "mappings": {
    "properties": {
      "inference": {
        "type": "semantic_text"
      }
    }
  }
}

POST _reindex
{
  "source": {
    "index": "my-old-index"
  },
  "dest": {
    "index": "my-new-index"
  },
  "script": {
    "source": "ctx._source.inference = ctx._source.inference.text"
  }
}

请注意 script 参数的使用，它用于适配 _source 表示方式的更改。该脚本会从 text 子字段提取值，并直接赋给 semantic_text 字段。

立即尝试

这些更改将在 Elasticsearch 8.18+ （Stack 版）中提供，但如果你想现在就体验，它们已在 Serverless 版本中可用。此外，它们还与我们同期推出的 语义搜索简化 方案完美结合。使用这两者，可以让你的语义搜索能力更上一层楼！

你可以通过 自助式 Search AI 实践课程 亲自体验向量搜索。现在就 开启免费云试用 ，或在本地安装 Elastic 进行尝试！

原文：Semantic Text: Simpler, better, leaner, stronger - Elasticsearch Labs