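RAG (retrieval-augmented generation) relies on a vector database to store and search embeddings.
Installing and Using the Milvus Vector Database
Installation
Docker deployment (recommended):
Install Docker and Docker Compose.
Download the Milvus Docker Compose configuration file: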
|------------------------------------------------------------------------------------------------------------------------------|
| wget https://github.com/milvus-io/milvus/releases/download/v2.6.9/milvus-standalone-docker-compose.yml -O docker-compose.yml |
Start the Milvus service:
|----------------------|
| docker-compose up -d |
Verify that the service is running:
|-------------------|
| docker-compose ps |
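Beyond `docker-compose ps`, it can help to confirm that the Milvus port is actually accepting connections before running client code. The following is a minimal sketch using only the Python standard library. The helper name `wait_for_port` and the timeout values are illustrative assumptions; 19530 is the port used by the connection examples in this document. An open port only shows that something is listening, not that Milvus has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A call such as `wait_for_port("localhost", 19530)` could be placed before the first `connections.connect(...)` in a startup script.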
Basic usage
Connect to Milvus:
Use the PyMilvus SDK to connect to the Milvus service:
|------------------------------------------------------------------------------|
| from pymilvus import connections
|
| # 19530 is the default Milvus gRPC port
| connections.connect(alias="default", host="localhost", port="19530")
Create a collection:
A collection is the basic unit for storing vectors:
|------------------------------------------------------------------------------|
| from pymilvus import Collection, CollectionSchema, DataType, FieldSchema
|
| fields = [
|     FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
|     FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
| ]
| schema = CollectionSchema(fields=fields)
| collection = Collection("example_collection", schema)
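Insert data:
|------------------------------------------------------------------------------|
| import numpy as np
|
| vectors = np.random.rand(10, 768).astype(np.float32)
| # Column order must match the schema: ids first, then vectors
| collection.insert([list(range(10)), vectors])
|
| # Build an index on the vector field, then load the collection so it can be searched
| collection.create_index(
|     field_name="embedding",
|     index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
| )
| collection.load()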
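For more than a handful of rows, inserts are usually sent in batches rather than in a single call. The sketch below shows one way to split column-oriented data (the list-of-columns layout passed to `collection.insert` above) into fixed-size batches. The helper name `iter_column_batches` and the batch size are assumptions for illustration, not part of PyMilvus.

```python
from typing import Iterator, List, Sequence


def iter_column_batches(columns: Sequence[Sequence], batch_size: int) -> Iterator[List[list]]:
    """Yield slices of column-oriented data, batch_size rows at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    num_rows = len(columns[0])
    # Every column must describe the same number of rows
    if any(len(col) != num_rows for col in columns):
        raise ValueError("all columns must have the same length")
    for start in range(0, num_rows, batch_size):
        yield [list(col[start:start + batch_size]) for col in columns]


# Example with 5 rows of toy data (2-dimensional vectors for brevity)
ids = [0, 1, 2, 3, 4]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
batches = list(iter_column_batches([ids, vectors], batch_size=2))
print(len(batches))  # 3 batches: 2 + 2 + 1 rows
```

Each yielded batch keeps the same `[ids, vectors]` layout, so it could be passed to `collection.insert(batch)` inside a loop.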
Vector search:
|------------------------------------------------------------------------------|
| query_vector = np.random.rand(1, 768).astype(np.float32)
| results = collection.search(
|     data=query_vector,
|     anns_field="embedding",
|     param={"metric_type": "L2", "params": {"nprobe": 10}},
|     limit=3,
| )
| print(results[0].ids)  # IDs of the most similar vectors
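To make the result easier to interpret: with `metric_type` set to `L2`, smaller distances mean more similar vectors, and `limit=3` asks for the three closest. The following is a brute-force sketch of that ranking in plain Python, for intuition only. Milvus uses an approximate index rather than scanning every vector, and it is understood to report the squared Euclidean distance for L2, which is an assumption worth checking against the Milvus metric documentation. The function name `l2_top_k` and the toy data are hypothetical.

```python
from typing import List, Tuple


def l2_top_k(query: List[float], vectors: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, squared L2 distance) for the k vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance: sum of squared per-dimension differences
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((idx, dist))
    # Smaller distance means more similar, so sort ascending
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


vectors = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
print(l2_top_k([1.0, 1.0], vectors, k=3))
# [(1, 0.0), (0, 2.0), (2, 2.0)]
```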
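Inserting documents into Milvus requires a text embedding model (such as BERT or Sentence-BERT) to convert each document into a vector; the Milvus SDK then inserts the vectors together with metadata (document ID, title, and so on) into a collection. The steps and sample code follow.
Install the dependencies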
|----------------------------------------------------------------------------|
| pip install pymilvus sentence-transformers  # this example uses sentence-transformers to generate embeddings |
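Start the Milvus service:
If Milvus is not already running from the Docker Compose setup above, start it with Docker (recommended). Note that a standalone Milvus instance also depends on etcd and object storage, so a bare docker run of the image, like the one below, may not be enough on its own; the Docker Compose file above starts those dependencies together.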
|-------------------------------------------------------------------------------------------|
| docker run -d --name milvus-standalone -p 19530:19530 -p 9091:9091 milvusdb/milvus:latest |
Complete workflow for inserting documents
- Define the document structure
Suppose there is a set of documents, each with an ID, a title, and content:
|------------------------------------------------------------------------------|
| # Sample documents (Chinese text; the embedding model used below is multilingual)
| documents = [
|     {"id": 1, "title": "Milvus简介", "content": "Milvus是一款开源的向量数据库,支持高效相似性搜索。"},
|     {"id": 2, "title": "RAG技术", "content": "RAG(检索增强生成)结合了检索和生成模型,提升问答准确性。"},
|     {"id": 3, "title": "向量数据库选型", "content": "Milvus适合大规模向量检索,Chroma适合快速原型开发。"},
| ]
- Generate document embeddings
Use sentence-transformers to convert the document content into 768-dimensional vectors:
|------------------------------------------------------------------------------|
| from sentence_transformers import SentenceTransformer
|
| # Multilingual model with 768-dimensional output; paraphrase-multilingual-MiniLM-L12-v2
| # would produce 384-dimensional vectors, which would not match dim=768 in the schema
| model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
| embeddings = model.encode([doc["content"] for doc in documents])  # array of shape (3, 768)
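The index created later in this walkthrough uses the inner-product (`IP`) metric. Inner product matches cosine similarity only when the vectors have unit length, so one option is to L2-normalize the embeddings before inserting them. Below is a minimal plain-Python sketch of that normalization; the helper name `l2_normalize` is hypothetical. With sentence-transformers, passing `normalize_embeddings=True` to `model.encode` should have the same effect.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


a = l2_normalize([3.0, 4.0])
b = l2_normalize([6.0, 8.0])
# After normalization, the inner product of two parallel vectors is 1 (up to rounding)
inner = sum(x * y for x, y in zip(a, b))
print(a, round(inner, 6))
```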
- Create the Milvus collection
The collection defines these fields:
id: primary key field (uniquely identifies a document).
embedding: float vector field (stores the embedding).
Optional metadata fields (such as title): store the document title or other attributes.
|------------------------------------------------------------------------------|
| from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections
|
| # Connect to Milvus
| connections.connect("default", host="localhost", port="19530")
|
| # Define the fields
| fields = [
|     FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
|     # dim must equal the embedding model's output dimension
|     FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
|     # Optional metadata field
|     FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
| ]
|
| # Create the collection
| schema = CollectionSchema(fields=fields)
| collection = Collection("document_collection", schema)
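A dimension mismatch between the schema and the embeddings is a common cause of insert failures, so it can be worth checking before inserting. A minimal sketch, assuming the embeddings are a sequence of equal-length vectors; the helper name `check_embedding_dim` is hypothetical. With sentence-transformers, `model.get_sentence_embedding_dimension()` reports the model's output size.

```python
from typing import Sequence


def check_embedding_dim(embeddings: Sequence[Sequence[float]], expected_dim: int) -> None:
    """Raise ValueError if any embedding does not have expected_dim components."""
    for i, vec in enumerate(embeddings):
        if len(vec) != expected_dim:
            raise ValueError(
                f"embedding {i} has dimension {len(vec)}, but the schema expects {expected_dim}"
            )


# Toy example: the schema declares dim=4 and one vector is too short
good = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
bad = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]]

check_embedding_dim(good, expected_dim=4)  # passes silently
try:
    check_embedding_dim(bad, expected_dim=4)
except ValueError as exc:
    print(exc)
```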
- Insert the data
Insert the document IDs, vectors, and metadata into the collection:
|------------------------------------------------------------------------------|
| # Extract the columns
| ids = [doc["id"] for doc in documents]
| titles = [doc["title"] for doc in documents]
|
| # Insert the data (columns must follow the schema's field order)
| collection.insert([ids, embeddings, titles])
|
| # Create an index on the vector field; recent Milvus versions require one before load()
| collection.create_index(
|     "embedding",
|     {"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}},
| )
| collection.load()  # load the collection into memory
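Because this style of `collection.insert` takes one list per field, in schema order, a small helper that derives the columns from the row dictionaries can reduce the risk of mixing up the order. A sketch under that assumption; `rows_to_columns` and its parameters are hypothetical names, and the vectors here are shortened toy values.

```python
from typing import Dict, List, Sequence


def rows_to_columns(
    rows: Sequence[Dict],
    vectors: Sequence[Sequence[float]],
    field_order: Sequence[str],
    vector_field: str = "embedding",
) -> List[list]:
    """Build one list per schema field, in field_order, from row dicts plus vectors."""
    if len(rows) != len(vectors):
        raise ValueError("rows and vectors must have the same length")
    columns = []
    for name in field_order:
        if name == vector_field:
            columns.append([list(v) for v in vectors])
        else:
            # A missing key raises KeyError, surfacing incomplete rows early
            columns.append([row[name] for row in rows])
    return columns


docs = [{"id": 1, "title": "A", "content": "..."}, {"id": 2, "title": "B", "content": "..."}]
vecs = [[0.1, 0.2], [0.3, 0.4]]
print(rows_to_columns(docs, vecs, field_order=["id", "embedding", "title"]))
# [[1, 2], [[0.1, 0.2], [0.3, 0.4]], ['A', 'B']]
```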
Verify the insert
Query the data in the collection:
|------------------------------------------------------------------------------|
| # Query the ID and title of every document
| results = collection.query(
|     expr="id in [1, 2, 3]",         # filter condition
|     output_fields=["id", "title"],  # fields to return
| )
| print(results)
| # Example output:
| # [{'id': 1, 'title': 'Milvus简介'}, {'id': 2, 'title': 'RAG技术'}, {'id': 3, 'title': '向量数据库选型'}]
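When the IDs come from a variable rather than being typed by hand, the filter expression has to be assembled as a string. A minimal sketch, assuming integer primary keys as in the schema above; `build_id_filter` is a hypothetical helper, not a PyMilvus function.

```python
from typing import Iterable


def build_id_filter(ids: Iterable[int], field: str = "id") -> str:
    """Return a boolean expression such as 'id in [1, 2, 3]' for integer keys."""
    id_list = [int(i) for i in ids]  # int() rejects non-numeric input early
    if not id_list:
        raise ValueError("at least one id is required")
    return f"{field} in [{', '.join(str(i) for i in id_list)}]"


print(build_id_filter([1, 2, 3]))  # id in [1, 2, 3]
```

The returned string could then be passed as the `expr` argument of `collection.query`.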
Complete code example
|------------------------------------------------------------------------------|
| from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections
| from sentence_transformers import SentenceTransformer
|
| # 1. Prepare the documents
| documents = [
|     {"id": 1, "title": "Milvus简介", "content": "Milvus是一款开源的向量数据库,支持高效相似性搜索。"},
|     {"id": 2, "title": "RAG技术", "content": "RAG(检索增强生成)结合了检索和生成模型,提升问答准确性。"},
|     {"id": 3, "title": "向量数据库选型", "content": "Milvus适合大规模向量检索,Chroma适合快速原型开发。"},
| ]
|
| # 2. Generate the embeddings (this model outputs 768-dimensional vectors;
| #    paraphrase-multilingual-MiniLM-L12-v2 would output 384 and not match dim=768)
| model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
| embeddings = model.encode([doc["content"] for doc in documents])
|
| # 3. Connect to Milvus and create the collection
| connections.connect("default", host="localhost", port="19530")
| fields = [
|     FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
|     FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
|     FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
| ]
| schema = CollectionSchema(fields=fields)
| collection = Collection("document_collection", schema)
|
| # 4. Insert the data (columns follow the schema's field order)
| ids = [doc["id"] for doc in documents]
| titles = [doc["title"] for doc in documents]
| collection.insert([ids, embeddings, titles])
|
| # 5. Create the index and load the collection
| collection.create_index(
|     "embedding",
|     {"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}},
| )
| collection.load()
|
| print("Documents inserted successfully!")
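With the documents stored, the retrieval half of RAG amounts to embedding the question with the same model and searching the collection. The sketch below assumes a loaded `Collection` and a `SentenceTransformer` like the ones created above and takes them as arguments. The function name `retrieve_titles`, the `nprobe` value, and the returned tuple layout are illustrative choices rather than a prescribed API.

```python
from typing import Any, List, Tuple


def retrieve_titles(
    collection: Any, model: Any, question: str, top_k: int = 3
) -> List[Tuple[int, float, str]]:
    """Embed the question and return (id, score, title) for the top_k hits."""
    # encode() takes a list of texts and returns one vector per text
    query_vectors = model.encode([question])
    results = collection.search(
        data=query_vectors,
        anns_field="embedding",
        # The metric must match the one the index was built with (IP here)
        param={"metric_type": "IP", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["title"],
    )
    # results holds one list of hits per query vector; only one query was sent
    return [(hit.id, hit.distance, hit.entity.get("title")) for hit in results[0]]
```

For example, `retrieve_titles(collection, model, "什么是向量数据库?")` would return up to three `(id, score, title)` tuples; with the `IP` metric, a larger score means a closer match.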
