Introduction to Weaviate and Basic Usage

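[Weaviate vector database](#Weaviate vector database)
- [🚀 Quick start](#🚀 Quick start)
- [⚡ Core features](#⚡ Core features)
- How it works
- Installation and deployment
  - [Method 1: Install with Docker](#Method 1: Install with Docker)
  - [Method 2: Weaviate Cloud (WCS)](#Method 2: Weaviate Cloud (WCS))
- Detailed usage
  - [1. Connecting and initializing](#1. Connecting and initializing)
  - [2. Creating a collection (schema)](#2. Creating a collection (schema))
  - [3. Data operations](#3. Data operations)
  - [4. Advanced configuration](#4. Advanced configuration)
    - Vectorizer configuration
    - [HNSW index tuning](#HNSW index tuning)
    - Inverted index configuration
  - [5. Practical use cases](#5. Practical use cases)
    - [Scenario 1: Knowledge-base Q&A system](#Scenario 1: Knowledge-base Q&A system)
    - [Scenario 2: Semantic search engine](#Scenario 2: Semantic search engine)
    - [Scenario 3: Recommendation system](#Scenario 3: Recommendation system)
- [🔗 Related resources](#🔗 Related resources)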

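Weaviate vector database

Definition

Weaviate is an open-source, cloud-native, AI-native vector database. It stores objects together with their vector embeddings, lets you combine vector search with structured filtering, and offers the fault tolerance and scalability expected of a cloud-native database.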

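🚀 Quick start

Install the client

pip install weaviate-client

Up and running in 5 minutes

import os

import weaviate
from weaviate.classes.config import Configure, Property, DataType

# 1. Connect to a local instance.
#    text2vec-openai needs an OpenAI API key, passed here as a request header.
client = weaviate.connect_to_local(
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
)

# 2. Create a collection
articles = client.collections.create(
    name="Article",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT)
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai()
)

# 3. Insert data
articles.data.insert({
    "title": "Getting started with Weaviate",
    "content": "Weaviate is a powerful vector database..."
})

# 4. Search
response = articles.query.near_text(
    query="vector database",
    limit=5
)
for item in response.objects:
    print(item.properties["title"])

# 5. Close the connection
client.close()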

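⚡ Core features

Hybrid search
- Vector search (based on semantic similarity)
- Keyword search (BM25 algorithm)
- Hybrid search (combines vector and keyword search)

Modular architecture
- Built-in vectorizer modules (text2vec-openai, text2vec-huggingface, and others)
- Support for custom vectorizers and pluggable storage backends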

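How it works

Architecture overview

[Architecture diagram. Components: client application, API layer (reached over GraphQL/REST), query processor, HNSW vector index, BM25 inverted index, object store, vectorizer modules (generate embeddings), LLM modules (generative search), LSM-tree storage engine.]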

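Core mechanisms

Vector index: the HNSW algorithm

Weaviate uses the Hierarchical Navigable Small World (HNSW) algorithm:
- Layered graph structure: the sparse upper layers locate the target region quickly, and the dense lower layers refine the search.
- Complexity: query time is approximately $O(\log N)$.
- Optimization: supports memory mapping (mmap) and on-disk persistence.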
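To see why the upper layers end up sparse, here is a minimal sketch of the level-assignment rule from the original HNSW paper: each new node draws its top layer from an exponentially decaying distribution. The function names, the choice of `max_connections=16`, and the node count are illustrative assumptions, not Weaviate internals.

```python
import math
import random
from collections import Counter


def assign_level(max_connections: int, rng: random.Random) -> int:
    """Draw the top layer for a new node: floor(-ln(U) * mL) with mL = 1 / ln(M)."""
    m_l = 1.0 / math.log(max_connections)  # level normalization factor
    u = 1.0 - rng.random()                 # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)


def layer_sizes(num_nodes: int, max_connections: int = 16, seed: int = 0) -> list[int]:
    """Count the nodes present on each layer (a node with top level k lives on layers 0..k)."""
    rng = random.Random(seed)
    top_levels = Counter(assign_level(max_connections, rng) for _ in range(num_nodes))
    highest = max(top_levels)
    return [
        sum(count for level, count in top_levels.items() if level >= layer)
        for layer in range(highest + 1)
    ]


sizes = layer_sizes(100_000)
for layer, size in enumerate(sizes):
    print(f"layer {layer}: {size} nodes")
```

Under this rule each layer holds roughly 1/M of the nodes of the layer below it, so the number of layers grows logarithmically with the dataset size, which is where the logarithmic query cost comes from.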
Hybrid search fusion

[Flow diagram: user query → vector search score and BM25 keyword score → weighted fusion α·vector + (1−α)·BM25 → final ranked results]

Fusion formula:

$$score = \alpha \cdot score_{vector} + (1-\alpha) \cdot score_{BM25}$$

$\alpha$ defaults to 0.75 (biased toward semantic search).
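The weighted fusion can be sketched in a few lines of plain Python. This is only an illustration of the formula, assuming both score sets are min-max normalized to [0, 1] before weighting; Weaviate's own fusion strategies (ranked and relative-score fusion) differ in detail, and the helper names and sample scores below are hypothetical.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and BM25 scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def hybrid_fuse(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    alpha: float = 0.75,
) -> list[tuple[str, float]]:
    """Return documents ranked by alpha * vector + (1 - alpha) * BM25."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(bm25_scores)
    fused = {
        doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in vec.keys() | kw.keys()
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores: higher means more relevant in both inputs
vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
bm25_scores = {"b": 4.0, "c": 2.0, "d": 1.0}
ranked = hybrid_fuse(vector_scores, bm25_scores, alpha=0.75)
print(ranked)
```

With `alpha=1.0` the ranking follows the vector scores only, and with `alpha=0.0` it follows BM25 only.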

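Installation and deployment

Method 1: Install with Docker

Pull the Weaviate image
This step is optional: `docker run` pulls the image automatically if it is not available locally.

docker pull cr.weaviate.io/semitechnologies/weaviate:1.35.6

Run the Weaviate container

docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.35.6

Verify that the instance is up

curl http://localhost:8080/v1/meta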
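The same check can be scripted. The sketch below uses only the standard library and polls the readiness endpoint `/v1/.well-known/ready`; the helper name `is_ready` and the default base URL are assumptions for illustration.

```python
import urllib.error
import urllib.request


def is_ready(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if the Weaviate readiness endpoint answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response
        return False


print("Weaviate ready:", is_ready())
```

This is handy in startup scripts: loop on `is_ready()` with a short sleep before running any import job.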

Method 2: Weaviate Cloud (WCS)

URL: console.weaviate.cloud
Highlights: a 14-day free sandbox, with no infrastructure to maintain.

Detailed usage

1. Connecting and initializing

import weaviate
from weaviate.classes.init import Auth

# Connect to a local instance
client = weaviate.connect_to_local(port=8080)

# Connect to Weaviate Cloud
# client = weaviate.connect_to_weaviate_cloud(cluster_url="...", auth_credentials=Auth.api_key("..."))

2. Creating a collection (schema)

from weaviate.classes.config import Configure, Property, DataType

articles = client.collections.create(
    name="Article",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="views", data_type=DataType.INT)
    ],
    # Embedding model used to vectorize objects and queries
    vectorizer_config=Configure.Vectorizer.text2vec_openai(model="text-embedding-3-small"),
    # Generative model used by the RAG queries
    generative_config=Configure.Generative.openai(model="gpt-4")
)

Note

Once the schema has been created, the data type of a property cannot be changed. camelCase is recommended for property names.
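If your source data uses snake_case or spaced field names, a small helper can normalize them before you define the schema. This is a stand-alone sketch; `to_camel_case` is a hypothetical helper, not part of the Weaviate client.

```python
import re


def to_camel_case(name: str) -> str:
    """Convert names such as 'publish_date' or 'Publish Date' to 'publishDate'."""
    parts = [part for part in re.split(r"[_\-\s]+", name.strip()) if part]
    if not parts:
        raise ValueError("property name must not be empty")
    head, *tail = parts
    # Lower-case the first character only, so existing camelCase names are preserved
    return head[0].lower() + head[1:] + "".join(p[0].upper() + p[1:] for p in tail)


for raw in ["publish_date", "Publish Date", "views", "alreadyCamel"]:
    print(raw, "->", to_camel_case(raw))
```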

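3. Data operations

Inserting data

Single insert:

# Insert a single object; the call returns the new object's UUID
uuid = articles.data.insert(
    properties={
        "title": "A beginner's guide to Weaviate",
        "content": "Weaviate is a powerful vector database...",
        "views": 100
    }
)

Batch insert:

# Sample objects for illustration; replace with your own data
data = [
    {"title": "Hybrid search basics", "content": "...", "views": 250},
    {"title": "Tuning HNSW", "content": "...", "views": 80},
]

# Batch insert (recommended for large volumes of data)
with articles.batch.dynamic() as batch:
    for doc in data:
        batch.add_object(properties=doc)

# Objects that failed to import are collected on the batch handle
if articles.batch.failed_objects:
    print(f"Failed imports: {len(articles.batch.failed_objects)}")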

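Querying data

Vector search (semantic search):

from weaviate.classes.query import MetadataQuery

# Search by semantic similarity
response = articles.query.near_text(
    query="advantages of vector databases",
    limit=5,
    return_metadata=MetadataQuery(distance=True)  # distance is only returned on request
)

for item in response.objects:
    print(f"Title: {item.properties['title']}")
    print(f"Distance: {item.metadata.distance}")  # smaller means more similar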
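Note that the metadata field is a distance, not a similarity: smaller values mean closer matches. For the cosine metric, the distance is one minus the cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction). A minimal stand-alone sketch, with `cosine_distance` as an illustrative helper name:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 0 means identical direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```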

Keyword search (BM25):

# Keyword-based lexical search
response = articles.query.bm25(
    query="Weaviate",
    limit=5
)

Hybrid search:

# Combine semantic and keyword search
response = articles.query.hybrid(
    query="Weaviate tutorial",
    alpha=0.75,  # 0 = pure BM25, 1 = pure vector, 0.75 = biased toward semantic
    limit=5
)

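Search with filters:

from weaviate.classes.query import Filter

# Find articles with more than 1000 views
response = articles.query.near_text(
    query="machine learning",
    filters=Filter.by_property("views").greater_than(1000),
    limit=5
)

Generative search (RAG):

# Search, then generate an answer for each result.
# {content} is replaced with the `content` property of each retrieved object.
response = articles.generate.near_text(
    query="advantages of vector databases",
    single_prompt="Summarize in one sentence: {content}",
    limit=3
)

# Access the generated text
for item in response.objects:
    print(f"Source: {item.properties['title']}")
    print(f"Generated: {item.generated}")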

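Aggregation queries:

from weaviate.classes.aggregate import GroupByAggregate, Metrics

# Count all articles
response = articles.aggregate.over_all(
    total_count=True
)
print(f"Total articles: {response.total_count}")

# Grouped aggregation: total views per category.
# Assumes the collection has a `category` property,
# which the Article schema defined earlier does not include.
response = articles.aggregate.over_all(
    group_by=GroupByAggregate(prop="category"),
    return_metrics=[
        Metrics("views").integer(sum_=True)
    ]
)
for group in response.groups:
    print(group.grouped_by.value, group.properties["views"].sum_)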

Updating data

# Update selected properties (other properties are left unchanged)
articles.data.update(
    uuid=uuid,
    properties={
        "views": 150,
        "title": "The complete guide to Weaviate"
    }
)

# Replace the entire object
articles.data.replace(
    uuid=uuid,
    properties={
        "title": "New title",
        "content": "New content",
        "views": 200
    }
)

Deleting data

# Delete a single object
articles.data.delete_by_id(uuid=uuid)

# Bulk delete (by filter condition); Filter comes from weaviate.classes.query
articles.data.delete_many(
    where=Filter.by_property("views").less_than(10)
)

4. Advanced configuration

Vectorizer configuration

from weaviate.classes.config import Configure

# The three create() calls below are alternatives for the same collection name;
# run only one of them, since a collection name can be created only once.

# Option A: OpenAI embeddings
collection = client.collections.create(
    name="Documents",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",
        dimensions=1536,
        vectorize_collection_name=False
    )
)

# Option B: Cohere embeddings
collection = client.collections.create(
    name="Documents",
    vectorizer_config=Configure.Vectorizer.text2vec_cohere(
        model="embed-multilingual-v3.0"
    )
)

# Option C: bring your own vectors (no built-in vectorizer)
collection = client.collections.create(
    name="Documents",
    vectorizer_config=Configure.Vectorizer.none()
)

# With option C, supply the vector at insert time
collection.data.insert(
    properties={"title": "Title", "content": "Content"},
    vector=[0.1, 0.2, 0.3, ...]  # truncated here; pass the full embedding
)

HNSW index tuning

from weaviate.classes.config import Configure, VectorDistances

# Configure HNSW parameters
collection = client.collections.create(
    name="Documents",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    vector_index_config=Configure.VectorIndex.hnsw(
        distance_metric=VectorDistances.COSINE,  # cosine, dot, l2-squared, hamming, manhattan
        ef_construction=128,       # search breadth at build time (larger = more accurate, slower)
        max_connections=64,        # maximum number of connections per node
        ef=-1,                     # search breadth at query time (-1 = dynamic)
        dynamic_ef_min=100,        # lower bound for dynamic ef
        dynamic_ef_max=500,        # upper bound for dynamic ef
        dynamic_ef_factor=8        # multiplier for dynamic ef
    )
)
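When `ef` is set to -1, the query-time `ef` is derived from the query's `limit`. The sketch below shows that rule as I understand it from the Weaviate documentation: `limit` times the factor, clamped between the minimum and maximum, and never lower than `limit` itself. Treat the exact clamping as an assumption; `dynamic_ef` is an illustrative helper, not a client function.

```python
def dynamic_ef(limit: int, ef_min: int = 100, ef_max: int = 500, factor: int = 8) -> int:
    """Approximate the query-time ef chosen when ef is configured as -1."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ef = max(ef_min, min(ef_max, limit * factor))
    # If the requested limit exceeds the upper bound, ef falls back to the limit
    return max(ef, limit)


for limit in (5, 20, 100, 1000):
    print(f"limit={limit} -> ef={dynamic_ef(limit)}")
```

With the values configured above, small result sets are searched with `ef=100`, while a query with `limit=20` uses `ef=160`.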

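Inverted index configuration

# Configure BM25 parameters
collection = client.collections.create(
    name="Documents",
    inverted_index_config=Configure.inverted_index(
        bm25_b=0.75,              # document-length normalization
        bm25_k1=1.2,              # term-frequency saturation
        index_null_state=True,    # index null values (enables null filters)
        index_property_length=True,  # index property length (enables length filters)
        index_timestamps=True     # index creation/update timestamps
    )
)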
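To make the roles of `bm25_k1` and `bm25_b` concrete, here is a compact textbook BM25 scorer over pre-tokenized documents. It is a sketch of the standard formula, not Weaviate's implementation; the function name and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(
    query_terms: list[str],
    docs: list[list[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> list[float]:
    """Score every tokenized document against the query with the BM25 formula."""
    if not docs:
        raise ValueError("docs must not be empty")
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps the benefit of repeated terms; b scales the length penalty
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    ["weaviate", "vector", "database"],
    ["weaviate", "weaviate", "tutorial"],
    ["keyword", "search", "engine", "with", "inverted", "index"],
]
scores = bm25_scores(["weaviate"], corpus)
print(scores)
```

Setting `b=0` switches off length normalization entirely, while a larger `k1` lets repeated occurrences of a term keep adding to the score for longer.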

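5. Practical use cases

Scenario 1: Knowledge-base Q&A system

# 1. Create the knowledge-base collection
kb = client.collections.create(
    name="KnowledgeBase",
    properties=[
        Property(name="question", data_type=DataType.TEXT),
        Property(name="answer", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
        Property(name="tags", data_type=DataType.TEXT_ARRAY)
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    generative_config=Configure.Generative.openai(model="gpt-4")
)

# 2. Import knowledge-base data
knowledge_data = [
    {
        "question": "What is a vector database?",
        "answer": "A vector database is a database built to store and retrieve vector embeddings...",
        "category": "Fundamentals",
        "tags": ["vector", "database", "AI"]
    },
    # more entries...
]

with kb.batch.dynamic() as batch:
    for item in knowledge_data:
        batch.add_object(properties=item)

# 3. Question answering
def ask_question(user_query: str) -> str:
    response = kb.generate.near_text(
        query=user_query,
        # The question is interpolated here; {answer} is filled in by Weaviate
        # from each retrieved object (hence the doubled braces in the f-string).
        single_prompt=f"Answer the question \"{user_query}\" using the following content: {{answer}}",
        limit=3
    )
    if not response.objects:
        return "No matching entry found."
    # Use the answer generated from the best-matching entry
    return response.objects[0].generated

# Usage
answer = ask_question("What are the advantages of a vector database?")
print(answer)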

Scenario 2: Semantic search engine

from datetime import datetime, timezone
from typing import Optional

from weaviate.classes.query import Filter, MetadataQuery

# 1. Create the document collection
docs = client.collections.create(
    name="Documents",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="author", data_type=DataType.TEXT),
        Property(name="publish_date", data_type=DataType.DATE),
        Property(name="category", data_type=DataType.TEXT)
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai()
)

# 2. Search with optional filters
def semantic_search(query: str, category: Optional[str] = None, date_from: Optional[str] = None):
    filters = []

    if category:
        filters.append(Filter.by_property("category").equal(category))

    if date_from:
        # DATE filters expect a timezone-aware datetime; treat the ISO date as UTC
        start = datetime.fromisoformat(date_from).replace(tzinfo=timezone.utc)
        filters.append(Filter.by_property("publish_date").greater_or_equal(start))

    combined_filter = Filter.all_of(filters) if filters else None

    response = docs.query.hybrid(
        query=query,
        alpha=0.8,
        filters=combined_filter,
        limit=10,
        return_metadata=MetadataQuery(score=True, distance=True)
    )

    return response.objects

# Usage
results = semantic_search(
    query="machine learning best practices",
    category="Technology",
    date_from="2024-01-01"
)

Scenario 3: Recommendation system

# 1. Create the product collection
products = client.collections.create(
    name="Products",
    properties=[
        Property(name="name", data_type=DataType.TEXT),
        Property(name="description", data_type=DataType.TEXT),
        Property(name="price", data_type=DataType.NUMBER),
        Property(name="category", data_type=DataType.TEXT),
        Property(name="tags", data_type=DataType.TEXT_ARRAY)
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai()
)

# 2. Recommend products similar to a given product
def recommend_similar_products(product_id: str, limit: int = 5):
    # Fetch the product together with its vector (vectors are not returned by default)
    product = products.query.fetch_object_by_id(product_id, include_vector=True)

    # Find similar products; the filter already excludes the product itself
    response = products.query.near_vector(
        near_vector=product.vector["default"],
        limit=limit,
        filters=Filter.by_id().not_equal(product_id)
    )

    return response.objects

# 3. Recommend based on user behavior
def recommend_by_user_history(user_viewed_products: list, limit: int = 10):
    if not user_viewed_products:
        return []

    # Collect the vectors of the products the user has viewed
    vectors = []
    for product_id in user_viewed_products:
        product = products.query.fetch_object_by_id(product_id, include_vector=True)
        vectors.append(product.vector["default"])

    # Average the vectors component by component
    avg_vector = [sum(x) / len(vectors) for x in zip(*vectors)]

    # Recommend products close to the averaged vector
    response = products.query.near_vector(
        near_vector=avg_vector,
        limit=limit
    )

    return response.objects
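The averaging step is worth isolating, because an empty history or vectors of different lengths would otherwise fail silently or produce a truncated result. A minimal stand-alone sketch, with `mean_vector` as a hypothetical helper name:

```python
def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(vec) != dim for vec in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(component) / len(vectors) for component in zip(*vectors)]


profile = mean_vector([[1.0, 2.0], [3.0, 4.0]])
print(profile)
```

A plain mean treats every viewed product equally; weighting recent views more heavily is a common variation.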
