Hands-On with Pinecone, a Vector Database for AI

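Most of us have used an AI product in some form by now, and vectors underpin much of the technology behind them. This article walks through the basics of working with a vector database in practice, using Pinecone as the service provider. I hope it helps!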

1. Install dependencies

# install pinecone
$ pip install pinecone-client

# install sentence-transformers
$ pip install sentence-transformers

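The sentence-transformers library encodes our text data into vector embeddings, which we then store in the vector database.

sentence-transformers provides a range of pretrained models built on architectures such as BERT, RoBERTa, and DistilBERT, fine-tuned specifically for sentence embeddings.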

2. Import dependencies

from pinecone import Pinecone, PodSpec
from sentence_transformers import SentenceTransformer

3. Download and instantiate the DistilBERT model

As mentioned in an earlier article, DistilBERT is about 40% smaller than BERT, so we use it as the example model here.

model_name = 'distilbert-base-nli-stsb-mean-tokens'
model = SentenceTransformer(model_name)
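
The `mean-tokens` suffix in the model name refers to mean pooling: the sentence embedding is the average of the per-token vectors the transformer produces. Below is a minimal sketch of that idea in plain Python. The token vectors are made-up 3-dimensional values (the real model works with 768 dimensions), `mean_pool` is a hypothetical helper rather than part of sentence-transformers, and the sketch ignores padding tokens, which a real implementation would mask out.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]


# Toy token vectors for a three-token sentence (illustrative values only).
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]
sentence_vector = mean_pool(tokens)
print(sentence_vector)  # [2.0, 2.0, 2.0]
```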

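4. Get an API key

To use the Pinecone service and create a vector database, we need a Pinecone API key.

After signing up, open the dashboard and get your API key from its left-hand panel.

You can use the default API key or create a new one.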

5. Connect with the API key

pinecone_key = "<YOUR-API-KEY>"
pc = Pinecone(api_key=pinecone_key)
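
Hardcoding the key is fine for a quick test, but it is easy to commit by accident. One common alternative is to read it from an environment variable. The sketch below assumes the variable is called `PINECONE_API_KEY`; both that name and the `load_api_key` helper are choices made for this illustration, not something the steps above require.

```python
import os


def load_api_key(env=None, name="PINECONE_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {name} is not set")
    return key


# Demo with a stand-in mapping so the snippet runs without a real key.
demo_env = {"PINECONE_API_KEY": "demo-key-not-real"}
pinecone_key = load_api_key(demo_env)
print(pinecone_key)  # demo-key-not-real
```

In the real script you would call `load_api_key()` with no arguments and pass the result to `Pinecone(api_key=...)`.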

You can test whether the connection works by calling list_indexes:

# A newly registered account has no indexes yet, so the list is empty
>>> print(pc.list_indexes())
{'indexes': []}

You may find that Hugging Face is unreachable when the model is downloaded.

There are several ways around this; I set up a proxy.
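
One way to configure a proxy from inside the script is to set the standard `HTTP_PROXY` and `HTTPS_PROXY` environment variables before the model is downloaded, since common Python HTTP clients such as requests read them. This is only a sketch: the address `http://127.0.0.1:7890` is a hypothetical local proxy, and `use_proxy` is a helper invented for this example.

```python
import os


def use_proxy(proxy_url):
    """Point HTTP(S) traffic from this process at the given proxy."""
    for var in ("HTTP_PROXY", "HTTPS_PROXY"):
        os.environ[var] = proxy_url


# Hypothetical local proxy address; replace it with your own.
use_proxy("http://127.0.0.1:7890")
print(os.environ["HTTPS_PROXY"])  # http://127.0.0.1:7890
```

Call it before `SentenceTransformer(model_name)` so the download goes through the proxy.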

6. Create an index

An index here is somewhat like a database, and close in meaning to an index in Elasticsearch. Indexes are created with the create_index method:

>>> pc.create_index(
...     name="vector-demo",
...     dimension=768,
...     metric="euclidean",
...     spec=PodSpec(environment="gcp-starter")
...   )

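Parameters:

name: the name of the index.
dimension: the dimensionality of the vectors stored in this index. It must match the vectors you insert; the SentenceTransformer model used here returns 768-dimensional embeddings, so the value is 768.
metric: the method used to compute similarity between vectors. euclidean means Euclidean distance.
spec: PodSpec specifies the environment the index is created in. In this example, the index is created in the GCP (Google Cloud Platform) environment named gcp-starter.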
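
Besides `euclidean`, Pinecone also accepts `cosine` and `dotproduct` as the metric. To make the difference concrete, here is a small plain-Python sketch of the three measures on toy 2-dimensional vectors. The helper functions are written for this illustration only, and the scores Pinecone returns may be scaled differently; what matters is which direction counts as "more similar".

```python
import math


def euclidean(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between the vectors; larger means more similar."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as `a`, twice the length

print(euclidean(a, b))    # 5.0
print(dot_product(a, b))  # 50.0
print(cosine(a, b))       # 1.0
```

The two vectors point the same way, so cosine similarity treats them as identical, while Euclidean distance still sees them as 5 units apart.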

Refresh the dashboard and the new index appears there as well.

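7. Upload vector data

Now that the index exists, we can generate vector embeddings and upload them to it.

To do that, we need some text data to encode with the SentenceTransformer model. The sample data is: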

data = [
    {"id": "vector1",  "text": "I love using vector databases"},
    {"id": "vector2",  "text": "Vector databases are great for storing and retrieving vectors"},
    {"id": "vector3",  "text": "Using vector databases makes my life easier"},
    {"id": "vector4",  "text": "Vector databases are efficient for storing vectors"},
    {"id": "vector5",  "text": "I enjoy working with vector databases"},
    {"id": "vector6",  "text": "Vector databases are useful for many applications"},
    {"id": "vector7",  "text": "I find vector databases very helpful"},
    {"id": "vector8",  "text": "Vector databases can handle large amounts of data"},
    {"id": "vector9",  "text": "I think vector databases are the future of data storage"},
    {"id": "vector10", "text": "Using vector databases has improved my workflow"}
]

We create a vector embedding for each sentence as follows:

vector_data = []

for sentence in data:
    embedding = model.encode(sentence["text"])
    vector_info = {"id":sentence["id"], "values": embedding.tolist()}
    
    vector_data.append(vector_info)

Each entry in vector_data is a dict with an id and a values list; the length of values is the vector's dimensionality.
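
Since a mismatch between the vector length and the index dimension is an easy mistake to make, it can help to check the records before uploading. The following is a minimal sketch: `check_records` is a hypothetical helper, the sample records use 4 values instead of 768 to stay readable, and rejecting duplicate ids is a choice made for this example.

```python
def check_records(records, dimension):
    """Raise ValueError if any record is malformed for an index of `dimension`."""
    seen_ids = set()
    for record in records:
        record_id = record["id"]
        values = record["values"]
        if record_id in seen_ids:
            raise ValueError(f"duplicate id: {record_id}")
        if len(values) != dimension:
            raise ValueError(
                f"{record_id}: expected {dimension} values, got {len(values)}"
            )
        seen_ids.add(record_id)
    return len(seen_ids)


# Stand-in records with 4 values each; real ones from the model carry 768.
sample = [
    {"id": "vector1", "values": [0.12, -0.03, 0.88, 0.41]},
    {"id": "vector2", "values": [0.07, 0.56, -0.22, 0.10]},
]
print(check_records(sample, dimension=4))  # 2
```

With the real data the call would be `check_records(vector_data, dimension=768)`.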

Because one account can hold several indexes, we have to select an index before uploading the vector data:

>>> index = pc.Index("vector-demo")

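The upload method used here is upsert, a database operation that combines update and insert. If a record with the given id does not exist yet, it is inserted; if it does, the existing record is updated (this will feel very familiar if you have used MongoDB).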

>>> index.upsert(vectors=vector_data)
{'upserted_count': 10}
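
The insert-or-overwrite behavior of upsert can be sketched with an ordinary dict keyed by id. This is only an in-memory illustration of the semantics, not how Pinecone stores data; the `upsert` function and `store` dict are stand-ins invented for this example.

```python
def upsert(store, records):
    """Insert new ids and overwrite existing ones; report how many were written."""
    for record in records:
        store[record["id"]] = record["values"]
    return {"upserted_count": len(records)}


store = {}
first = upsert(store, [
    {"id": "vector1", "values": [0.1, 0.2]},
    {"id": "vector2", "values": [0.3, 0.4]},
])
print(first)  # {'upserted_count': 2}

# Writing vector1 again replaces its values instead of adding a second copy.
second = upsert(store, [{"id": "vector1", "values": [0.9, 0.9]}])
print(second)                       # {'upserted_count': 1}
print(len(store), store["vector1"])  # 2 [0.9, 0.9]
```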

The response from index.upsert already reports that 10 vectors were written. If you want to double-check, call describe_index_stats:

>>> index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0001,
 'namespaces': {'': {'vector_count': 10}},
 'total_vector_count': 10}

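Response fields:

dimension: the dimensionality of the vectors stored in the index.
index_fullness: a measure of how full the index is, i.e. the share of its capacity currently in use.
namespaces: statistics for each namespace in the index.
total_vector_count: the total number of vectors in the index across all namespaces.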
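
To see how the fields relate, the sketch below copies the response shown earlier into a plain dict and checks that the per-namespace counts add up to the total. Treating the response as a dict is an assumption made to keep the example self-contained, and `count_by_namespace` is a hypothetical helper. The empty string is the name of the default namespace.

```python
# Plain-dict copy of the describe_index_stats output from this walkthrough.
stats = {
    "dimension": 768,
    "index_fullness": 0.0001,
    "namespaces": {"": {"vector_count": 10}},
    "total_vector_count": 10,
}


def count_by_namespace(stats):
    """Map each namespace name to its vector count ('' is the default one)."""
    return {
        name or "(default)": info["vector_count"]
        for name, info in stats["namespaces"].items()
    }


counts = count_by_namespace(stats)
print(counts)  # {'(default)': 10}
print(sum(counts.values()) == stats["total_vector_count"])  # True
```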

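8. Similarity search

With the vector data stored in the index, we can run a similarity search and look at the results.

First, define the search text and generate its embedding: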

>>> search_text = "Vector database are really helpful"
>>> search_embedding = model.encode(search_text).tolist()

Then run the following query:

>>> index.query(vector=search_embedding, top_k=3)
                
{'matches': [{'id': 'vector7', 'score': 27.7402039, 'values': []},
             {'id': 'vector4', 'score': 60.1513977, 'values': []},
             {'id': 'vector5', 'score': 62.3066101, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 5}}

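The query returns the three closest texts. Because we use Euclidean distance, a shorter distance means the two vectors are closer, and you can see that the scores are listed from smallest to largest.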
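
The ranking logic can be reproduced on a small scale with a brute-force scan: compute the distance from the query to every stored vector, sort ascending, and keep the first k. This sketch uses toy 2-dimensional vectors and a hypothetical `top_k` helper; it illustrates the ordering only and is not a description of how Pinecone searches internally.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, records, k):
    """Return the k records closest to `query`, nearest first."""
    scored = [
        {"id": r["id"], "score": euclidean(query, r["values"])}
        for r in records
    ]
    scored.sort(key=lambda match: match["score"])
    return scored[:k]


records = [
    {"id": "a", "values": [0.0, 0.0]},
    {"id": "b", "values": [3.0, 4.0]},
    {"id": "c", "values": [1.0, 0.0]},
    {"id": "d", "values": [6.0, 8.0]},
]
matches = top_k([0.0, 0.0], records, k=3)
print([m["id"] for m in matches])  # ['a', 'c', 'b']
```

As in the Pinecone response, the matches carry only ids and scores, so you would look the ids up in the original `data` list to get the text back.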

That wraps up this walkthrough. If it helped, a like would be much appreciated!