如何基于DSL脚本进行elasticsearch向量检索示例

向量检索是目前RAG库主要依赖功能,这里示例如何在elasticsearch中构建dense_vector向量库,并示例基于DSL脚本的向量相似度检索过程

1 es向量检索

1.1 DSL检索

es访问dense_vector的DSL方法有cosinessimilarity, dotProduct, 1norm或l2norm函数。

需要注意每个DSL脚本只能调用这些函数一次。不要在循环中使用这些函数来计算文档向量和多个其他向量之间的相似性。

如果需要该功能,可以通过直接访问向量值来重新实现这些函数。

1.2 建立连接

假设elasticsearch环境已经构建

以下时连接elasticsearch的代码示例

复制代码
from elasticsearch import Elasticsearch
es = Elasticsearch(hosts=["http://localhost:9200"])
es.info()

输出如下所示

ObjectApiResponse({'name': '87eb7cfd07bb', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'HE1OJj1aQYmnnRL2nwVNgw', 'version': {'number': '8.11.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '64cf052f3b56b1fd4449f5454cb88aca7e739d9a', 'build_date': '2023-12-08T11:33:53.634979452Z', 'build_snapshot': False, 'lucene_version': '9.8.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

2 虚拟简单示例

这里通过虚构简单的三维向量,示例向量入库和检索过程。

2.1 建立索引

构建文本字段和向量字段,采用虚拟三维向量。

复制代码
index_name = "vec_index_0"
es.indices.create(
        index=index_name,
        mappings = {
            "properties": {
              "my_vector": {
                "type": "dense_vector",
                "dims": 3
              },
              "my_text" : {
                "type" : "keyword"
              }
            }
          }
    )

输出如下

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_0'})

2.2 插入数据

数据插入代码如下所示,并虚构3个三维向量。

复制代码
from elasticsearch import Elasticsearch, helpers

texts = [
    "Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.",
    "Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.",
    "Rocky Mountain National Park  is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = [
    [0.5, 10, 6],
    [-0.5, 10, 10],
    [-1, 11, 3]
]
print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)

输出如下

3 3

(3, [])

2.3 向量检索

cosinessimilarity函数计算给定查询向量和文档向量之间的余弦相似性度量。

这里通过DSL脚本应用cosine相似度计算,并加1方式相似度为负。

检索代码如下所示。

复制代码
vec = [-0.5, 10, 6]
print(len(vec))

query = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}

res = es.search(index=index_name, body=query)
print(res)

输出如下

{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.9963301, 'hits': [{'_index': 'vec_query_test_v4', '_id': 'erQPgpoBzVHvZ0eub0Bb', '_score': 1.9963301, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.5, 10, 6]}}, {'_index': 'vec_query_test_v4', '_id': 'e7QPgpoBzVHvZ0eub0Bb', '_score': 1.9701604, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [-0.5, 10, 10]}}, {'_index': 'vec_query_test_v4', '_id': 'fLQPgpoBzVHvZ0eub0Bb', '_score': 1.9618319, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [-1, 11, 3]}}]}}

3 实际向量示例

3.1 创建向量引擎

这里基于langchain的OllamaEmbeddings创建向量计算引起,并假设ollama和langchain已安装。

创建代码示例如下

复制代码
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="bge-m3")

3.2 建立索引

构建文本字段和向量字段,采用bge-m3的1024维向量。

复制代码
index_name = "vec_index_ollama"
es.indices.create(
        index=index_name,
        mappings = {
            "properties": {
              "my_vector": {
                "type": "dense_vector",
                "dims": 1024
              },
              "my_text" : {
                "type" : "keyword"
              }
            }
          }
    )

输出如下

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_ollama'})

3.3 插入数据

数据插入代码如下所示,并虚构3个bge-m3向量。

复制代码
from elasticsearch import Elasticsearch, helpers

texts = [
    "Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.",
    "Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.",
    "Rocky Mountain National Park  is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = embeddings.embed_documents(texts)

print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)

输出如下

3 3

(3, [])

3.4 向量查询

检索代码如下所示,在计算相似度之前,这里需要先用ollama embedding计算查询向量,。

程序代码示例如下。

复制代码
vec = embeddings.embed_query("Rocky Mountain National Park")
print(len(vec))

query = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0",
        "params": {
          "queryVector": vec
        }
      }
    }
  }
}

res = es.search(index=index_name, body=query)
print(res)

输出如下

1024

{'took': 9, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.5386453, 'hits': [{'_index': 'vec_index_ollama', '_id': 'grQbgpoBzVHvZ0eubkCf', '_score': 1.5386453, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [0.030506486, -0.008855448, -0.0542615, 0.032204177, -0.03877517, -0.061205674, ..., 0.012325888, 0.018495426, 0.018682847]}}, {'_index': 'vec_index_ollama', '_id': 'gbQbgpoBzVHvZ0eubkCf', '_score': 1.4467602, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [0.019682772, 0.0011470582, -0.055675734, 0.019353507, 0.0018442876, -0.04981726, -0.0319025, 0.003043613, 0.0074340384, 0.004517093, 0.0070879683, -0.0009796194, ..., 0.036385845, -0.03322887, 0.032527614, 0.047814477]}}, {'_index': 'vec_index_ollama', '_id': 'gLQbgpoBzVHvZ0eubkCf', '_score': 1.4366195, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.04341205, 0.010377382, -0.053659115, 0.034629174, -0.040680032, -0.05273915, -0.021637093, -0.014424741, 0.003394217, 0.005586104, 0.010616635, 0.039135892, -0.021042265, -0.0010980334, ..., -0.0015615255, 0.02610852, 0.015156581, 0.010001613, 0.034023464]}}]}}

reference


向量数据库:使用Elasticsearch实现向量数据存储与搜索

https://zhuanlan.zhihu.com/p/634015951

相关推荐
Elastic 中国社区官方博客4 分钟前
Elasticsearch:Workflows 介绍 - 9.3
大数据·数据库·人工智能·elasticsearch·ai·全文检索
B站_计算机毕业设计之家7 分钟前
豆瓣电影推荐系统 | Python Django Echarts构建个性化影视推荐平台 大数据 毕业设计源码 (建议收藏)✅
大数据·python·机器学习·django·毕业设计·echarts·推荐算法
莽撞的大地瓜19 分钟前
洞察,始于一目了然——让舆情数据自己“说话”
大数据·网络·数据分析
证榜样呀27 分钟前
2026 中专大数据技术专业可考的证书有哪些,必看!
大数据·sql
星辰_mya27 分钟前
Elasticsearch主分片数写入后不能改
大数据·elasticsearch·搜索引擎
班德先生36 分钟前
深耕多赛道品牌全案策划,为科技与时尚注入商业表达力
大数据·人工智能·科技
鸿乃江边鸟1 小时前
Spark Datafusion Comet 向量化Rust Native--CometShuffleExchangeExec怎么控制读写
大数据·rust·spark·native
忆~遂愿1 小时前
CANN metadef 深度解析:动态形状元数据管理、图编译器接口规范与序列化执行机制
大数据·linux
AI职业加油站1 小时前
职业提升之路:我的大数据分析师学习与备考分享
大数据·人工智能·经验分享·学习·职场和发展·数据分析
风指引着方向1 小时前
昇腾算子性能调优:ops-nn 中的内存布局与向量化技巧
java·大数据·人工智能