Elasticsearch: NLP text search using a Hugging Face model

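This post implements NLP text search in Elasticsearch on a simple dataset made up of Elastic blog titles. You will index the blog documents and use an ingest pipeline to generate text embeddings. With an NLP model, you can then query the blog documents in natural language. Note that the sample data.json used below contains movie records; the pipeline embeds the title field of each record.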

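Installation

Elasticsearch and Kibana

If you have not installed Elasticsearch and Kibana yet, install them first. When installing, follow the installation guide for Elastic Stack 8.x. In this post I use Elastic Stack 8.10, the latest release at the time of writing.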

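During the Elasticsearch installation, note down the password of the elastic superuser and the location of the generated CA certificate (http_ca.crt). Both are used later in this post.

We will use eland to upload the model. Deploying the uploaded model is a paid feature, so we need to activate the trial license first.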
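The trial can also be started from code instead of from Kibana. The following is a minimal sketch, assuming an `es` client like the one created later in this post and the Python client's `license.post_start_trial` method, which wraps `POST /_license/start_trial`. The helper name `start_trial` is hypothetical.

```python
def start_trial(es):
    """Start the trial license and report whether it was started.

    `es` is assumed to be an elasticsearch.Elasticsearch client (8.x).
    """
    # acknowledge=True accepts the trial terms, which the REST API requires
    resp = es.license.post_start_trial(acknowledge=True)
    # The response body carries a boolean "trial_was_started" flag
    return bool(resp["trial_was_started"])
```

Call `start_trial(es)` once, after the client connection has been created.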

Python packages

This demo uses Python. We need to install the elasticsearch client package, together with the other packages used below:

python3 -m pip install -qU sentence-transformers eland elasticsearch transformers
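To confirm that the packages were installed into the interpreter that Jupyter will use, a small check such as the following can help. It is only a sketch based on the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata

# Distribution names taken from the pip command above
PACKAGES = ["sentence-transformers", "eland", "elasticsearch", "transformers"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in PACKAGES:
    print(name, installed_version(name) or "NOT INSTALLED")
```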

We will use Jupyter Notebook for the demonstration.

$ pwd
/Users/liuxg/python/elser
$ jupyter notebook

Prepare the data

In the root directory of the project, create the following data file, data.json:

data.json
[
  {
    "title": "Pulp Fiction",
    "runtime": "154",
    "plot": "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.",
    "keyScene": "John Travolta is forced to inject adrenaline directly into Uma Thurman's heart after she overdoses on heroin.",
    "genre": "Crime, Drama",
    "released": "1994"
  },
  {
    "title": "The Dark Knight",
    "runtime": "152",
    "plot": "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",
    "keyScene": "Batman angrily responds 'I'm Batman' when asked who he is by Falcone.",
    "genre": "Action, Crime, Drama, Thriller",
    "released": "2008"
  },
  {
    "title": "Fight Club",
    "runtime": "139",
    "plot": "An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.",
    "keyScene": "Brad Pitt explains the rules of Fight Club to Edward Norton. The first rule of Fight Club is: You do not talk about Fight Club. The second rule of Fight Club is: You do not talk about Fight Club.",
    "genre": "Drama",
    "released": "1999"
  },
  {
    "title": "Inception",
    "runtime": "148",
    "plot": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.",
    "keyScene": "Leonardo DiCaprio explains the concept of inception to Ellen Page by using a child's spinning top.",
    "genre": "Action, Adventure, Sci-Fi, Thriller",
    "released": "2010"
  },
  {
    "title": "The Matrix",
    "runtime": "136",
    "plot": "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.",
    "keyScene": "Red pill or blue pill? Morpheus offers Neo a choice between the red pill, which will allow him to learn the truth about the Matrix, or the blue pill, which will return him to his former life.",
    "genre": "Action, Sci-Fi",
    "released": "1999"
  },
  {
    "title": "The Shawshank Redemption",
    "runtime": "142",
    "plot": "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.",
    "keyScene": "Andy Dufresne escapes from Shawshank prison by crawling through a sewer pipe.",
    "genre": "Drama",
    "released": "1994"
  },
  {
    "title": "Goodfellas",
    "runtime": "146",
    "plot": "The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.",
    "keyScene": "Joe Pesci's character Tommy DeVito shoots young Spider in the foot for not getting him a drink.",
    "genre": "Biography, Crime, Drama",
    "released": "1990"
  },
  {
    "title": "Se7en",
    "runtime": "127",
    "plot": "Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.",
    "keyScene": "Brad Pitt's character David Mills shoots John Doe after he reveals that he murdered Mills' wife.",
    "genre": "Crime, Drama, Mystery, Thriller",
    "released": "1995"
  },
  {
    "title": "The Silence of the Lambs",
    "runtime": "118",
    "plot": "A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.",
    "keyScene": "Hannibal Lecter explains to Clarice Starling that he ate a census taker's liver with some fava beans and a nice Chianti.",
    "genre": "Crime, Drama, Thriller",
    "released": "1991"
  },
  {
    "title": "The Godfather",
    "runtime": "175",
    "plot": "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.",
    "keyScene": "James Caan's character Sonny Corleone is shot to death at a toll booth by a number of machine gun toting enemies.",
    "genre": "Crime, Drama",
    "released": "1972"
  },
  {
    "title": "The Departed",
    "runtime": "151",
    "plot": "An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.",
    "keyScene": "Leonardo DiCaprio's character Billy Costigan is shot to death by Matt Damon's character Colin Sullivan.",
    "genre": "Crime, Drama, Thriller",
    "released": "2006"
  },
  {
    "title": "The Usual Suspects",
    "runtime": "106",
    "plot": "A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.",
    "keyScene": "Kevin Spacey's character Verbal Kint is revealed to be the mastermind behind the crime, when his limp disappears as he walks away from the police station.",
    "genre": "Crime, Mystery, Thriller",
    "released": "1995"
  }
]

$ pwd
/Users/liuxg/python/elser
$ ls
Multilingual semantic search.ipynb
NLP text search using hugging face transformer model.ipynb
Semantic search - ELSER.ipynb
data.json

Create the application and run the demo

Import modules

import json
import pandas as pd
from elasticsearch import Elasticsearch
from getpass import getpass
from urllib.request import urlopen

Deploy the NLP model

We use the eland tool to install a text_embedding model. For our model, we use all-MiniLM-L6-v2 to convert the search text into dense vectors.

The model converts your search query into a vector, and that vector is then used to search the set of documents stored in Elasticsearch.

Run the following command in a terminal:

eland_import_hub_model --url https://elastic:vXDWYtL*my3vnKY9zCfL@localhost:9200 \
      --hub-model-id sentence-transformers/all-MiniLM-L6-v2 \
      --task-type text_embedding \
      --ca-cert /Users/liuxg/elastic/elasticsearch-8.10.0/config/certs/http_ca.crt \
      --start

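Please note:

  • Replace the password of the elastic superuser above with the one from your own deployment
  • Replace the Elasticsearch address above with the address of your own Elasticsearch cluster
  • Replace the certificate path above with the path of the certificate from your own deployment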
Back in the Kibana UI, we can check the uploaded model.

Connect to Elasticsearch

We create a client connection:

ELASTICSEARCH_CERT_PATH = "/Users/liuxg/elastic/elasticsearch-8.10.0/config/certs/http_ca.crt"

# Connect over HTTPS and verify the server against the cluster's CA certificate
es = Elasticsearch(
    ["https://localhost:9200"],
    basic_auth=("elastic", "vXDWYtL*my3vnKY9zCfL"),
    ca_certs=ELASTICSEARCH_CERT_PATH,
    verify_certs=True,
)
print(es.info())

Create the ingest pipeline

We need to create a text-embedding ingest pipeline that generates vector (text) embeddings for the title field.

The pipeline below defines an inference processor for the NLP model.

# ingest pipeline definition
PIPELINE_ID = "vectorize_blogs"

es.ingest.put_pipeline(id=PIPELINE_ID, processors=[{
    "inference": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "target_field": "text_embedding",
        # pass the document's title field to the model's text_field input
        "field_map": {
            "title": "text_field"
        }
    }
}])
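Before indexing real data, it can be useful to push one sample document through the pipeline and check the length of the vector it produces. The following is a sketch only: it assumes the Python client's `ingest.simulate` method and that the inference processor writes its output to `text_embedding.predicted_value`, as the mapping in the next section expects. The helper name `embedding_dims` is hypothetical.

```python
def embedding_dims(es, pipeline_id, title):
    """Run one sample document through the pipeline and return the vector length.

    `es` is assumed to be an elasticsearch.Elasticsearch client (8.x).
    """
    resp = es.ingest.simulate(
        id=pipeline_id,
        docs=[{"_source": {"title": title}}],
    )
    # Each simulated result holds the transformed document under "doc";
    # a failed document would carry an "error" entry instead
    source = resp["docs"][0]["doc"]["_source"]
    return len(source["text_embedding"]["predicted_value"])
```

For the model used in this post, `embedding_dims(es, "vectorize_blogs", "Pulp Fiction")` should match the dims value of the dense_vector field in the index mapping.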

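Create the index with mappings

Now, before indexing any documents, we create an Elasticsearch index with the correct mapping. We add a text_embedding field that contains model_id and predicted_value to store the embeddings.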

# define index name (the same "blogs" index is used when indexing and searching)
INDEX_NAME = "blogs"

# flag to check if index has to be deleted before creating
SHOULD_DELETE_INDEX = True

# define index mapping
INDEX_MAPPING = {
    "properties": {
        "title": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "text_embedding": {
            "properties": {
                "is_truncated": {
                    "type": "boolean"
                },
                "model_id": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "predicted_value": {
                    "type": "dense_vector",
                    "dims": 384,
                    "index": True,
                    "similarity": "l2_norm"
                }
            }
        }
    }
}

INDEX_SETTINGS = {
    "index": {
        "number_of_replicas": "1",
        "number_of_shards": "1",
        # run every indexed document through the embedding pipeline
        "default_pipeline": PIPELINE_ID
    }
}

# check if we want to delete index before creating the index
if SHOULD_DELETE_INDEX:
    if es.indices.exists(index=INDEX_NAME):
        print("Deleting existing %s" % INDEX_NAME)
        es.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)

print("Creating index %s" % INDEX_NAME)
es.options(ignore_status=[400, 404]).indices.create(index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS)
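The mapping above sets similarity to l2_norm. For that setting, the Elasticsearch documentation gives the document score as 1 / (1 + l2_norm(query, vector)^2). The pure-Python sketch below only illustrates that formula on toy three-dimensional vectors; it is not how Elasticsearch computes scores internally, and the function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_norm_score(query, vector):
    """Score for l2_norm similarity: 1 / (1 + distance^2)."""
    return 1.0 / (1.0 + l2_distance(query, vector) ** 2)


query = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]   # close to the query vector
far = [0.0, 1.0, 0.0]    # further away from the query vector

# A smaller distance gives a score closer to 1
print(l2_norm_score(query, near) > l2_norm_score(query, far))
```

Identical vectors have distance 0 and therefore the maximum score of 1.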

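Ingest the data into Elasticsearch

Let's index the sample blog data using the ingest pipeline.

Note: before we start indexing, make sure the trained model deployment has been started.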
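One way to check the deployment from the notebook is to read the trained model statistics. This is a sketch under assumptions: it relies on the Python client's `ml.get_trained_models_stats` method and on the response listing a `deployment_stats.state` value for deployed models. The helper name `is_model_started` is hypothetical.

```python
def is_model_started(es, model_id):
    """Return True if the trained model has a deployment in the "started" state.

    `es` is assumed to be an elasticsearch.Elasticsearch client (8.x).
    """
    resp = es.ml.get_trained_models_stats(model_id=model_id)
    for stats in resp["trained_model_stats"]:
        # "deployment_stats" is only present once the model has been deployed
        state = stats.get("deployment_stats", {}).get("state")
        if state == "started":
            return True
    return False
```

For this post the call would be `is_model_started(es, "sentence-transformers__all-minilm-l6-v2")`.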

from elasticsearch import helpers

# Load data into a JSON object
with open('data.json') as f:
    data_json = json.load(f)

print(data_json)

# Prepare the documents to be indexed
documents = []
for doc in data_json:
    documents.append({
        "_index": "blogs",
        "_source": doc,
    })

# Use helpers.bulk to index; the index's default pipeline adds the embeddings
helpers.bulk(es, documents)

Back in Kibana, we can view the documents that were indexed:

GET blogs/_search

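Query the dataset

The next step is to run a query to search for relevant blogs. The example query uses the model we uploaded to Elasticsearch, sentence-transformers__all-minilm-l6-v2, to search for "model_text": "scientific fiction".

The process is a single query, although it involves two tasks internally. First, the query uses the NLP model to generate a vector for your search text. It then passes that vector to a search over the dataset.

As a result, the output shows a list of matching documents sorted by their closeness to the search query.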

INDEX_NAME = "blogs"

source_fields = ["id", "title"]

# kNN search: the model named in query_vector_builder embeds model_text at query time
query = {
    "field": "text_embedding.predicted_value",
    "k": 10,
    "num_candidates": 50,
    "query_vector_builder": {
        "text_embedding": {
            "model_id": "sentence-transformers__all-minilm-l6-v2",
            "model_text": "scientific fiction"
        }
    }
}

response = es.search(
    index=INDEX_NAME,
    fields=source_fields,
    knn=query,
    source=False)

# Flatten the hits into a DataFrame for display
results = pd.json_normalize(response.body["hits"]["hits"])

# shows the result
results[["_id", "_score", "fields.title"]]

Running the code above displays the matching documents together with their scores.

You can try another search, for example: dark knight
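To try other search texts without editing the query dictionary by hand, the knn section can be produced by a small function. This is a sketch that mirrors the query used in this post; the function name `build_knn_query` and its default values are illustrative choices.

```python
def build_knn_query(text, k=10, num_candidates=50,
                    model_id="sentence-transformers__all-minilm-l6-v2"):
    """Build the knn section of a search request for the given query text."""
    # Elasticsearch rejects requests where num_candidates is smaller than k
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "field": "text_embedding.predicted_value",
        "k": k,
        "num_candidates": num_candidates,
        "query_vector_builder": {
            "text_embedding": {"model_id": model_id, "model_text": text}
        },
    }


knn = build_knn_query("dark knight")
# With the client from this post:
# es.search(index="blogs", knn=knn, fields=["id", "title"], source=False)
```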

The final Jupyter notebook, NLP text search using hugging face transformer model.ipynb, is available for download.
