如何基于DSL脚本进行elasticsearch向量检索示例

向量检索是目前RAG库主要依赖功能,这里示例如何在elasticsearch中构建dense_vector向量库,并示例基于DSL脚本的向量相似度检索过程

1 es向量检索

1.1 DSL检索

es访问dense_vector的DSL方法有cosinessimilarity, dotProduct, 1norm或l2norm函数。

需要注意每个DSL脚本只能调用这些函数一次。不要在循环中使用这些函数来计算文档向量和多个其他向量之间的相似性。

如果需要该功能,可以通过直接访问向量值来重新实现这些函数。

1.2 建立连接

假设elasticsearch环境已经构建

以下时连接elasticsearch的代码示例

复制代码
from elasticsearch import Elasticsearch
es = Elasticsearch(hosts=["http://localhost:9200"])
es.info()

输出如下所示

ObjectApiResponse({'name': '87eb7cfd07bb', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'HE1OJj1aQYmnnRL2nwVNgw', 'version': {'number': '8.11.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '64cf052f3b56b1fd4449f5454cb88aca7e739d9a', 'build_date': '2023-12-08T11:33:53.634979452Z', 'build_snapshot': False, 'lucene_version': '9.8.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

2 虚拟简单示例

这里通过虚构简单的三维向量,示例向量入库和检索过程。

2.1 建立索引

构建文本字段和向量字段,采用虚拟三维向量。

复制代码
index_name = "vec_index_0"
es.indices.create(
        index=index_name,
        mappings = {
            "properties": {
              "my_vector": {
                "type": "dense_vector",
                "dims": 3
              },
              "my_text" : {
                "type" : "keyword"
              }
            }
          }
    )

输出如下

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_0'})

2.2 插入数据

数据插入代码如下所示,并虚构3个三维向量。

复制代码
from elasticsearch import Elasticsearch, helpers

texts = [
    "Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.",
    "Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.",
    "Rocky Mountain National Park  is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = [
    [0.5, 10, 6],
    [-0.5, 10, 10],
    [-1, 11, 3]
]
print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)

输出如下

3 3

(3, [])

2.3 向量检索

cosinessimilarity函数计算给定查询向量和文档向量之间的余弦相似性度量。

这里通过DSL脚本应用cosine相似度计算,并加1方式相似度为负。

检索代码如下所示。

复制代码
vec = [-0.5, 10, 6]
print(len(vec))

query = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}

res = es.search(index=index_name, body=query)
print(res)

输出如下

{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.9963301, 'hits': [{'_index': 'vec_query_test_v4', '_id': 'erQPgpoBzVHvZ0eub0Bb', '_score': 1.9963301, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.5, 10, 6]}}, {'_index': 'vec_query_test_v4', '_id': 'e7QPgpoBzVHvZ0eub0Bb', '_score': 1.9701604, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [-0.5, 10, 10]}}, {'_index': 'vec_query_test_v4', '_id': 'fLQPgpoBzVHvZ0eub0Bb', '_score': 1.9618319, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [-1, 11, 3]}}]}}

3 实际向量示例

3.1 创建向量引擎

这里基于langchain的OllamaEmbeddings创建向量计算引起,并假设ollama和langchain已安装。

创建代码示例如下

复制代码
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="bge-m3")

3.2 建立索引

构建文本字段和向量字段,采用bge-m3的1024维向量。

复制代码
index_name = "vec_index_ollama"
es.indices.create(
        index=index_name,
        mappings = {
            "properties": {
              "my_vector": {
                "type": "dense_vector",
                "dims": 1024
              },
              "my_text" : {
                "type" : "keyword"
              }
            }
          }
    )

输出如下

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_ollama'})

3.3 插入数据

数据插入代码如下所示,并虚构3个bge-m3向量。

复制代码
from elasticsearch import Elasticsearch, helpers

texts = [
    "Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.",
    "Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.",
    "Rocky Mountain National Park  is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = embeddings.embed_documents(texts)

print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)

输出如下

3 3

(3, [])

3.4 向量查询

检索代码如下所示,在计算相似度之前,这里需要先用ollama embedding计算查询向量,。

程序代码示例如下。

复制代码
vec = embeddings.embed_query("Rocky Mountain National Park")
print(len(vec))

query = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0",
        "params": {
          "queryVector": vec
        }
      }
    }
  }
}

res = es.search(index=index_name, body=query)
print(res)

输出如下

1024

{'took': 9, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.5386453, 'hits': [{'_index': 'vec_index_ollama', '_id': 'grQbgpoBzVHvZ0eubkCf', '_score': 1.5386453, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [0.030506486, -0.008855448, -0.0542615, 0.032204177, -0.03877517, -0.061205674, ..., 0.012325888, 0.018495426, 0.018682847]}}, {'_index': 'vec_index_ollama', '_id': 'gbQbgpoBzVHvZ0eubkCf', '_score': 1.4467602, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [0.019682772, 0.0011470582, -0.055675734, 0.019353507, 0.0018442876, -0.04981726, -0.0319025, 0.003043613, 0.0074340384, 0.004517093, 0.0070879683, -0.0009796194, ..., 0.036385845, -0.03322887, 0.032527614, 0.047814477]}}, {'_index': 'vec_index_ollama', '_id': 'gLQbgpoBzVHvZ0eubkCf', '_score': 1.4366195, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.04341205, 0.010377382, -0.053659115, 0.034629174, -0.040680032, -0.05273915, -0.021637093, -0.014424741, 0.003394217, 0.005586104, 0.010616635, 0.039135892, -0.021042265, -0.0010980334, ..., -0.0015615255, 0.02610852, 0.015156581, 0.010001613, 0.034023464]}}]}}

reference


向量数据库:使用Elasticsearch实现向量数据存储与搜索

https://zhuanlan.zhihu.com/p/634015951

相关推荐
周杰伦_Jay2 小时前
【电商微服务日志处理全方案】从MySQL瓶颈到大数据架构的实战转型
大数据·mysql·微服务·架构
闲人编程2 小时前
Python与大数据:使用PySpark处理海量数据
大数据·开发语言·分布式·python·spark·codecapsule·大规模
程序员小羊!2 小时前
电商项目练习实操(二)
大数据·数据分析·etl·flume
hadage2332 小时前
--- git 笔记 ---
笔记·git·elasticsearch
谅望者2 小时前
数据分析笔记01:数据分析概述
大数据·数据库·数据仓库·数据分析
一瓢一瓢的饮 alanchan3 小时前
Flink原理与实战(java版)#第2章 Flink的入门(第二节Flink简介)
java·大数据·flink·kafka·实时计算·离线计算·流批一体化计算
尘世壹俗人3 小时前
分离Hadoop客户端单独使用
大数据·hadoop·分布式
厨 神3 小时前
11月10日ES本机
大数据·elasticsearch·搜索引擎
微盛企微增长小知识4 小时前
企业微信AI怎么用?从智能表格落地看如何提升运营效率
大数据·人工智能·企业微信