Elasticsearch: Semantic Search Quick Start

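This interactive Jupyter notebook introduces some basic Elasticsearch operations using the official Elasticsearch Python client. You will run semantic search over text embeddings produced with Sentence Transformers, and learn how to combine traditional text-based search with semantic search to build a hybrid search system. In this example, the text is vectorized with the model's own API rather than with an Elasticsearch ingest pipeline. The advantage is that this approach is completely free and requires no paid license. If you would rather vectorize through an ingest pipeline, see the article "Elasticsearch: NLP text search using huggingface models".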

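Installation

Elasticsearch and Kibana

If you have not yet installed your own Elasticsearch and Kibana, please refer to the following links to install them:

When installing, follow the installation guide for Elastic Stack 8.x. In this post, I use Elastic Stack 8.10, the latest release at the time of writing.

During the Elasticsearch installation, we need to note down the following information:

Python packages

This demo uses Python. We need to install the elasticsearch package, which provides access to Elasticsearch, together with sentence-transformers and transformers: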

python3 -m pip install -qU sentence-transformers elasticsearch transformers

We will use Jupyter Notebook for the demo.

$ pwd
/Users/liuxg/python/elser
$ jupyter notebook

Create the application and run the demo

Set up the embedding model

In this example we use all-MiniLM-L6-v2, a model available through the sentence_transformers library. You can read more about this model on Huggingface.

from sentence_transformers import SentenceTransformer
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

Initialize the Elasticsearch client

Now we can instantiate the Elasticsearch Python client. We must provide the corresponding account credentials and certificate information:

from elasticsearch import Elasticsearch

# Path to the CA certificate generated when Elasticsearch was installed
ELASTICSEARCH_CERT_PATH = "/Users/liuxg/elastic/elasticsearch-8.10.0/config/certs/http_ca.crt"

# Replace the password with the one from your own installation
client = Elasticsearch(
    ['https://localhost:9200'],
    basic_auth=('elastic', 'vXDWYtL*my3vnKY9zCfL'),
    ca_certs=ELASTICSEARCH_CERT_PATH,
    verify_certs=True)
print(client.info())
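
Hard-coding the password in a notebook makes it easy to leak. As a minimal sketch, the connection settings could instead be read from environment variables. The variable names ES_URL, ES_USER, ES_PASSWORD and ES_CA_CERT, the helper client_settings_from_env, and the sample values are all hypothetical and not part of the original notebook.

```python
import os


def client_settings_from_env(env=os.environ):
    """Collect Elasticsearch connection settings from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    required = ("ES_URL", "ES_USER", "ES_PASSWORD", "ES_CA_CERT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "hosts": [env["ES_URL"]],
        "basic_auth": (env["ES_USER"], env["ES_PASSWORD"]),
        "ca_certs": env["ES_CA_CERT"],
        "verify_certs": True,
    }


# Illustration with a plain dict standing in for the real environment
example_env = {
    "ES_URL": "https://localhost:9200",
    "ES_USER": "elastic",
    "ES_PASSWORD": "changeme",
    "ES_CA_CERT": "/path/to/http_ca.crt",
}
settings = client_settings_from_env(example_env)
print(sorted(settings))
```

With the variables exported in the shell, the client could then be created with Elasticsearch(**client_settings_from_env()).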

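Ingest some data

Our client is now set up and connected to our Elasticsearch deployment. Next we need some data to try out basic Elasticsearch queries. We will use a small book index with the following fields:

title
authors
summary
publish_date
num_reviews
publisher

Create a books.json file in the root directory of the current project: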

books.json

[
  {
    "title": "The Pragmatic Programmer: Your Journey to Mastery",
    "authors": ["andrew hunt", "david thomas"],
    "summary": "A guide to pragmatic programming for software engineers and developers",
    "publish_date": "2019-10-29",
    "num_reviews": 30,
    "publisher": "addison-wesley"
  },
  {
    "title": "Python Crash Course",
    "authors": ["eric matthes"],
    "summary": "A fast-paced, no-nonsense guide to programming in Python",
    "publish_date": "2019-05-03",
    "num_reviews": 42,
    "publisher": "no starch press"
  },
  {
    "title": "Artificial Intelligence: A Modern Approach",
    "authors": ["stuart russell", "peter norvig"],
    "summary": "Comprehensive introduction to the theory and practice of artificial intelligence",
    "publish_date": "2020-04-06",
    "num_reviews": 39,
    "publisher": "pearson"
  },
  {
    "title": "Clean Code: A Handbook of Agile Software Craftsmanship",
    "authors": ["robert c. martin"],
    "summary": "A guide to writing code that is easy to read, understand and maintain",
    "publish_date": "2008-08-11",
    "num_reviews": 55,
    "publisher": "prentice hall"
  },
  {
    "title": "You Don't Know JS: Up & Going",
    "authors": ["kyle simpson"],
    "summary": "Introduction to JavaScript and programming as a whole",
    "publish_date": "2015-03-27",
    "num_reviews": 36,
    "publisher": "oreilly"
  },
  {
    "title": "Eloquent JavaScript",
    "authors": ["marijn haverbeke"],
    "summary": "A modern introduction to programming",
    "publish_date": "2018-12-04",
    "num_reviews": 38,
    "publisher": "no starch press"
  },
  {
    "title": "Design Patterns: Elements of Reusable Object-Oriented Software",
    "authors": [
      "erich gamma",
      "richard helm",
      "ralph johnson",
      "john vlissides"
    ],
    "summary": "Guide to design patterns that can be used in any object-oriented language",
    "publish_date": "1994-10-31",
    "num_reviews": 45,
    "publisher": "addison-wesley"
  },
  {
    "title": "The Clean Coder: A Code of Conduct for Professional Programmers",
    "authors": ["robert c. martin"],
    "summary": "A guide to professional conduct in the field of software engineering",
    "publish_date": "2011-05-13",
    "num_reviews": 20,
    "publisher": "prentice hall"
  },
  {
    "title": "JavaScript: The Good Parts",
    "authors": ["douglas crockford"],
    "summary": "A deep dive into the parts of JavaScript that are essential to writing maintainable code",
    "publish_date": "2008-05-15",
    "num_reviews": 51,
    "publisher": "oreilly"
  },
  {
    "title": "Introduction to the Theory of Computation",
    "authors": ["michael sipser"],
    "summary": "Introduction to the theory of computation and complexity theory",
    "publish_date": "2012-06-27",
    "num_reviews": 33,
    "publisher": "cengage learning"
  }
]

$ pwd
/Users/liuxg/python/elser
$ ls
Multilingual semantic search.ipynb
NLP text search using hugging face transformer model.ipynb
Semantic search - ELSER.ipynb
Semantic search quick start.ipynb
books.json
data.json

Create the index

Let's create an Elasticsearch index with the correct mapping for the test data.

INDEX_NAME = "book_index"

# Delete the index if it already exists so the example starts clean
if client.indices.exists(index=INDEX_NAME):
    print("Deleting existing %s" % INDEX_NAME)
    client.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)

# Define the mapping: all-MiniLM-L6-v2 produces 384-dimensional vectors
mappings = {
    "properties": {
        "title_vector": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine"
        }
    }
}

# Create the index
client.indices.create(index=INDEX_NAME, mappings=mappings)
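
The mapping above asks for "cosine" similarity. To make that concrete, here is a small pure-Python sketch of how cosine similarity is computed for two vectors, and how it can be turned into a non-negative score with (1 + cosine) / 2, the conversion the Elasticsearch documentation describes for cosine kNN scores. The helper names cosine_similarity and cosine_to_score are made up for this illustration; Elasticsearch does this work internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_to_score(cosine):
    """Map a cosine in [-1, 1] to a score in [0, 1]."""
    return (1 + cosine) / 2


# Vectors pointing in the same direction, and vectors at a right angle
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(cosine_to_score(same_direction), cosine_to_score(orthogonal))
```

Because cosine similarity ignores vector length, two titles are considered close when their embeddings point in a similar direction, regardless of magnitude.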

Index the test data

Run the following code to upload the test data, which contains information about 10 popular programming books. model.encode uses the model we initialized earlier to encode the text into vectors on the fly.

import json

# Load data into a JSON object
with open('books.json') as f:
    data_json = json.load(f)

print(data_json)

# The bulk API expects an action line followed by the document itself
actions = []
for book in data_json:
    actions.append({"index": {"_index": INDEX_NAME}})
    # Transforming the title into an embedding using the model
    book["title_vector"] = model.encode(book["title"]).tolist()
    actions.append(book)
client.bulk(index=INDEX_NAME, operations=actions)
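
The alternating action/document layout of the bulk request can be hard to see inside the loop. The sketch below pulls it out into a helper that takes the encoder as a parameter, so the structure can be inspected without loading the model or connecting to Elasticsearch. Both build_bulk_actions and fake_encode are hypothetical names introduced here, and fake_encode does not produce real embeddings.

```python
def build_bulk_actions(books, index_name, encode):
    """Return the alternating [action, document, action, document, ...] list."""
    actions = []
    for book in books:
        doc = dict(book)  # shallow copy so the caller's data is not mutated
        doc["title_vector"] = list(encode(doc["title"]))
        actions.append({"index": {"_index": index_name}})
        actions.append(doc)
    return actions


def fake_encode(text):
    # Hypothetical stand-in for model.encode: NOT a real embedding
    return [float(len(text)), float(text.count(" "))]


sample_books = [
    {"title": "Python Crash Course"},
    {"title": "Eloquent JavaScript"},
]
sample_actions = build_bulk_actions(sample_books, "book_index", fake_encode)
for entry in sample_actions:
    print(entry)
```

With the real model, the encoder argument would be something like lambda text: model.encode(text).tolist(). It is also worth checking the "errors" flag in the response returned by client.bulk, since a bulk request can succeed as a whole while individual documents fail.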

We can inspect the indexed documents in Kibana:

GET book_index/_search

Pretty-print the Elasticsearch response

Your API calls return nested JSON that is hard to read. We will create a small function called pretty_response that prints readable output for the examples.

def pretty_response(response):
    # Print one readable block per hit
    for hit in response['hits']['hits']:
        doc_id = hit['_id']
        publication_date = hit['_source']['publish_date']
        score = hit['_score']
        title = hit['_source']['title']
        summary = hit['_source']['summary']
        publisher = hit["_source"]["publisher"]
        num_reviews = hit["_source"]["num_reviews"]
        authors = hit["_source"]["authors"]
        pretty_output = (f"\nID: {doc_id}\nPublication date: {publication_date}\nTitle: {title}\nSummary: {summary}\nPublisher: {publisher}\nReviews: {num_reviews}\nAuthors: {authors}\nScore: {score}")
        print(pretty_output)

Query

Now that the books are indexed, we want to run a semantic search for books similar to a given query. We embed the query and run the search.

response = client.search(index="book_index", body={
    "knn": {
      "field": "title_vector",
      # Convert the NumPy array to a list so it can be serialized as JSON
      "query_vector": model.encode("Best javascript books?").tolist(),
      "k": 10,
      "num_candidates": 100
    }
})

pretty_response(response)
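
The knn clause has a few constraints that are easy to trip over: the query vector must be JSON-serializable, and, as far as I know from the Elasticsearch documentation, num_candidates must not be smaller than k and must not exceed 10,000. The following is a minimal sketch of a helper that builds the request body and checks those constraints up front. The name build_knn_search and its parameters are hypothetical.

```python
def build_knn_search(field, query_vector, k=10, num_candidates=100,
                     filter_clause=None):
    """Build a search body with a knn clause, validating the parameters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates < k:
        raise ValueError("num_candidates must not be smaller than k")
    if num_candidates > 10000:
        raise ValueError("num_candidates must not exceed 10000")
    knn = {
        "field": field,
        # float() also converts NumPy scalars, so arrays become plain lists
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
    }
    if filter_clause is not None:
        knn["filter"] = filter_clause
    return {"knn": knn}


# Toy 3-dimensional vector used only to show the resulting structure
body = build_knn_search("title_vector", [0.1, 0.2, 0.3], k=5, num_candidates=10)
print(body)
```

The resulting dictionary could be passed as the body argument of client.search, with model.encode("Best javascript books?") as the query vector.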

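Filtering

Filter context is mostly used for filtering structured data. For example, use filter context to answer questions such as:

Does this timestamp fall within the range 2015 to 2016?
Is the status field set to "published"?

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in a bool query.

Learn more about filter context in the Elasticsearch documentation.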

Example: keyword filtering

Here is an example of adding a keyword filter to the query.

It narrows the results to documents whose "publisher" field equals "addison-wesley".

The code retrieves the top books similar to "Best javascript books?" based on their title vectors, restricted to "addison-wesley" as the publisher.

response = client.search(index=INDEX_NAME, body={
    "knn": {
      "field": "title_vector",
      # Convert the NumPy array to a list so it can be serialized as JSON
      "query_vector": model.encode("Best javascript books?").tolist(),
      "k": 10,
      "num_candidates": 100,
      "filter": {
          "term": {
              "publisher.keyword": "addison-wesley"
          }
      }
    }
})

pretty_response(response)

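Example: advanced filtering

Advanced filtering in Elasticsearch refines search results precisely by applying conditions. It supports a variety of operators that filter results on specific fields, ranges, or conditions, improving the precision and relevance of the results. Learn more in this query and filter context example. The query below keeps only books that are published by "addison-wesley" or written by "robert c. martin".

response = client.search(index="book_index", body={
    "knn": {
      "field": "title_vector",
      # Convert the NumPy array to a list so it can be serialized as JSON
      "query_vector": model.encode("Best javascript books?").tolist(),
      "k": 10,
      "num_candidates": 100,
      "filter": {
          "bool": {
              "should": [
                  {
                    "term": {
                        "publisher.keyword": "addison-wesley"
                    }
                  },
                  {
                    "term": {
                        "authors.keyword": "robert c. martin"
                    }
                  }
              ]
          }
      }
    }
})

pretty_response(response)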
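
To see which documents the bool/should filter above lets through, its logic can be imitated locally in plain Python. This is only a rough sketch: a term clause is treated as exact equality (or membership for array fields such as authors), and a should list as "at least one clause matches". The helpers term_matches and passes_should are hypothetical, and the four sample books are a subset of books.json.

```python
def term_matches(doc, field, value):
    """Exact match; for list-valued fields, any element may match."""
    field_value = doc.get(field)
    if isinstance(field_value, list):
        return value in field_value
    return field_value == value


def passes_should(doc, clauses):
    """True when at least one (field, value) term clause matches."""
    return any(term_matches(doc, field, value) for field, value in clauses)


sample_docs = [
    {"title": "The Pragmatic Programmer: Your Journey to Mastery",
     "authors": ["andrew hunt", "david thomas"], "publisher": "addison-wesley"},
    {"title": "Python Crash Course",
     "authors": ["eric matthes"], "publisher": "no starch press"},
    {"title": "Clean Code: A Handbook of Agile Software Craftsmanship",
     "authors": ["robert c. martin"], "publisher": "prentice hall"},
    {"title": "JavaScript: The Good Parts",
     "authors": ["douglas crockford"], "publisher": "oreilly"},
]

clauses = [("publisher", "addison-wesley"), ("authors", "robert c. martin")]
kept_titles = [d["title"] for d in sample_docs if passes_should(d, clauses)]
print(kept_titles)
```

In the real query the filter restricts the kNN search itself, so the nearest neighbors are chosen among the documents that pass it.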

Hybrid search

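In this example we look at combining two search algorithms: BM25 for text search and HNSW for nearest-neighbor search. By combining multiple ranking methods, such as BM25 and an ML model that generates dense vector embeddings, we can obtain better-ranked results. This approach lets us exploit the strengths of each algorithm and improve overall search performance.

Reciprocal Rank Fusion (RRF) is a ranking algorithm for combining results from different information retrieval strategies. Because it works on each document's rank rather than on raw scores, RRF needs no calibration between the retrievers, and it is reported to outperform other ranking methods in that uncalibrated setting. In short, it enables high-quality hybrid search out of the box.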

response = client.search(index="book_index", body={
    "query": {
        "match": {
            "summary": "python"
        }
    },
    "knn": {
        "field": "title_vector",
        # generate embedding for query so it can be compared to `title_vector`
        "query_vector": model.encode("python programming").tolist(),
        "k": 5,
        "num_candidates": 10
    },
    "rank": {
        "rrf": {
            "window_size": 100,
            "rank_constant": 20
        }
    }
})

pretty_response(response)
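
To illustrate what the rrf section of the request does, here is a small pure-Python sketch of the reciprocal rank fusion formula: each document receives 1 / (rank_constant + rank) from every result list it appears in, with ranks starting at 1, and the contributions are summed. The function rrf_fuse and the document IDs are hypothetical, and the sketch does not model window_size; Elasticsearch performs the fusion on the server.

```python
def rrf_fuse(rankings, rank_constant=20):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up result lists standing in for the BM25 and kNN retrievers
bm25_ranking = ["python_crash_course", "pragmatic_programmer"]
knn_ranking = ["python_crash_course", "eloquent_javascript", "pragmatic_programmer"]

fused = rrf_fuse([bm25_ranking, knn_ranking], rank_constant=20)
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

A document ranked highly by both retrievers ends up first, and a larger rank_constant flattens the difference between adjacent ranks.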

The final Jupyter notebook file is available for download.