如何基于DSL脚本进行elasticsearch向量检索示例

向量检索是目前RAG库主要依赖功能,这里示例如何在elasticsearch中构建dense_vector向量库,并示例基于DSL脚本的向量相似度检索过程

1 es向量检索

1.1 DSL检索

es访问dense_vector的DSL方法有cosinessimilarity, dotProduct, 1norm或l2norm函数。

需要注意每个DSL脚本只能调用这些函数一次。不要在循环中使用这些函数来计算文档向量和多个其他向量之间的相似性。

如果需要该功能,可以通过直接访问向量值来重新实现这些函数。

1.2 建立连接

假设elasticsearch环境已经构建

以下时连接elasticsearch的代码示例

复制代码
from elasticsearch import Elasticsearch
es = Elasticsearch(hosts=["http://localhost:9200"])
es.info()

输出如下所示

ObjectApiResponse({'name': '87eb7cfd07bb', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'HE1OJj1aQYmnnRL2nwVNgw', 'version': {'number': '8.11.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '64cf052f3b56b1fd4449f5454cb88aca7e739d9a', 'build_date': '2023-12-08T11:33:53.634979452Z', 'build_snapshot': False, 'lucene_version': '9.8.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

2 虚拟简单示例

这里通过虚构简单的三维向量,示例向量入库和检索过程。

2.1 建立索引

构建文本字段和向量字段,采用虚拟三维向量。

复制代码
index_name = "vec_index_0"
es.indices.create(
        index=index_name,
        mappings = {
            "properties": {
              "my_vector": {
                "type": "dense_vector",
                "dims": 3
              },
              "my_text" : {
                "type" : "keyword"
              }
            }
          }
    )

输出如下

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_0'})

2.2 插入数据

数据插入代码如下所示,并虚构3个三维向量。

复制代码
from elasticsearch import Elasticsearch, helpers

texts = [
    "Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.",
    "Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.",
    "Rocky Mountain National Park  is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = [
    [0.5, 10, 6],
    [-0.5, 10, 10],
    [-1, 11, 3]
]
print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)

输出如下

3 3

(3, [])

2.3 向量检索

cosinessimilarity函数计算给定查询向量和文档向量之间的余弦相似性度量。

这里通过DSL脚本应用cosine相似度计算,并加1方式相似度为负。

检索代码如下所示。

复制代码
vec = [-0.5, 10, 6]
print(len(vec))

query = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}

res = es.search(index=index_name, body=query)
print(res)

输出如下

{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.9963301, 'hits': [{'_index': 'vec_query_test_v4', '_id': 'erQPgpoBzVHvZ0eub0Bb', '_score': 1.9963301, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.5, 10, 6]}}, {'_index': 'vec_query_test_v4', '_id': 'e7QPgpoBzVHvZ0eub0Bb', '_score': 1.9701604, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [-0.5, 10, 10]}}, {'_index': 'vec_query_test_v4', '_id': 'fLQPgpoBzVHvZ0eub0Bb', '_score': 1.9618319, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [-1, 11, 3]}}]}}

3 实际向量示例

3.1 创建向量引擎

这里基于langchain的OllamaEmbeddings创建向量计算引起,并假设ollama和langchain已安装。

创建代码示例如下

复制代码
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="bge-m3")

3.2 建立索引

构建文本字段和向量字段,采用bge-m3的1024维向量。

复制代码
index_name = "vec_index_ollama"
es.indices.create(
        index=index_name,
        mappings = {
            "properties": {
              "my_vector": {
                "type": "dense_vector",
                "dims": 1024
              },
              "my_text" : {
                "type" : "keyword"
              }
            }
          }
    )

输出如下

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_ollama'})

3.3 插入数据

数据插入代码如下所示,并虚构3个bge-m3向量。

复制代码
from elasticsearch import Elasticsearch, helpers

texts = [
    "Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.",
    "Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.",
    "Rocky Mountain National Park  is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = embeddings.embed_documents(texts)

print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)

输出如下

3 3

(3, [])

3.4 向量查询

检索代码如下所示,在计算相似度之前,这里需要先用ollama embedding计算查询向量,。

程序代码示例如下。

复制代码
vec = embeddings.embed_query("Rocky Mountain National Park")
print(len(vec))

query = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0",
        "params": {
          "queryVector": vec
        }
      }
    }
  }
}

res = es.search(index=index_name, body=query)
print(res)

输出如下

1024

{'took': 9, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.5386453, 'hits': [{'_index': 'vec_index_ollama', '_id': 'grQbgpoBzVHvZ0eubkCf', '_score': 1.5386453, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [0.030506486, -0.008855448, -0.0542615, 0.032204177, -0.03877517, -0.061205674, ..., 0.012325888, 0.018495426, 0.018682847]}}, {'_index': 'vec_index_ollama', '_id': 'gbQbgpoBzVHvZ0eubkCf', '_score': 1.4467602, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [0.019682772, 0.0011470582, -0.055675734, 0.019353507, 0.0018442876, -0.04981726, -0.0319025, 0.003043613, 0.0074340384, 0.004517093, 0.0070879683, -0.0009796194, ..., 0.036385845, -0.03322887, 0.032527614, 0.047814477]}}, {'_index': 'vec_index_ollama', '_id': 'gLQbgpoBzVHvZ0eubkCf', '_score': 1.4366195, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.04341205, 0.010377382, -0.053659115, 0.034629174, -0.040680032, -0.05273915, -0.021637093, -0.014424741, 0.003394217, 0.005586104, 0.010616635, 0.039135892, -0.021042265, -0.0010980334, ..., -0.0015615255, 0.02610852, 0.015156581, 0.010001613, 0.034023464]}}]}}

reference


向量数据库:使用Elasticsearch实现向量数据存储与搜索

https://zhuanlan.zhihu.com/p/634015951

相关推荐
字节跳动数据平台1 小时前
代码量减少 70%、GPU 利用率达 95%:火山引擎多模态数据湖如何释放模思智能的算法生产力
大数据
得物技术2 小时前
深入剖析Spark UI界面:参数与界面详解|得物技术
大数据·后端·spark
武子康4 小时前
大数据-238 离线数仓 - 广告业务 Hive分析实战:ADS 点击率、购买率与 Top100 排名避坑
大数据·后端·apache hive
武子康1 天前
大数据-237 离线数仓 - Hive 广告业务实战:ODS→DWD 事件解析、广告明细与转化分析落地
大数据·后端·apache hive
大大大大晴天1 天前
Flink生产问题排障-Kryo serializer scala extensions are not available
大数据·flink
Elasticsearch2 天前
如何使用 Agent Builder 排查 Kubernetes Pod 重启和 OOMKilled 事件
elasticsearch
Elasticsearch3 天前
通用表达式语言 ( CEL ): CEL 输入如何改进 Elastic Agent 集成中的数据收集
elasticsearch
武子康3 天前
大数据-236 离线数仓 - 会员指标验证、DataX 导出与广告业务 ODS/DWD/ADS 全流程
大数据·后端·apache hive
武子康4 天前
大数据-235 离线数仓 - 实战:Flume+HDFS+Hive 搭建 ODS/DWD/DWS/ADS 会员分析链路
大数据·后端·apache hive
DianSan_ERP5 天前
电商API接口全链路监控:构建坚不可摧的线上运维防线
大数据·运维·网络·人工智能·git·servlet