ChromaDB Vector Database Python Tutorial: A Complete Guide from Basics to Practice (with a Game Resource Management System Case Study)

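This project offers a complete ChromaDB learning path, from the basics to a real-world application. The document embeds the full code of four examples: basic usage, complex queries, persistence with custom embeddings, and a hands-on "game resource management system" case study. It also provides the dependency list, the environment setup steps (using the Tsinghua PyPI mirror for faster downloads), the run commands, and an overview of the overall flow with the purpose of each step.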


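1. Environment Setup (virtual environment recommended)

  • Python version: 3.8+ (chromadb 1.x may require a newer interpreter; check the package's PyPI page)
  • Use a virtual environment to isolate dependencies

Windows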

bash
# 1) Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate

# 2) Upgrade pip
python -m pip install -U pip

# 3) Install the dependencies through the Tsinghua mirror (recommended, faster downloads)
pip install -r tutorial/requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

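macOS/Linux

bash
# 1) Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# 2) Upgrade pip
python -m pip install -U pip

# 3) Install the dependencies through the Tsinghua mirror (recommended, faster downloads)
pip install -r tutorial/requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Optional: the -i https://pypi.tuna.tsinghua.edu.cn/simple argument applies to a single pip command only, so it can be appended to any individual install without changing the global pip configuration.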


2. Dependencies (from tutorial/requirements.txt)

text
chromadb>=1.0.0
numpy>=1.21.0
python-dateutil>=2.8.0

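3. Overall Flow and the Purpose of Each Step

The tutorial has four levels that build on one another:

  • Tutorial 1 (basic)
    • Goal: get to know ChromaDB's core objects and the minimum viable operations
    • Steps and purpose:
      • Create a client: the entry point for talking to the vector database
      • Create a collection: similar to a "table"; it holds documents, vectors, and metadata
      • Add documents: embeddings are generated automatically and indexed
      • Semantic search: return related documents based on text similarity
      • Metadata filtering: apply structured conditions on top of semantic search
      • Data retrieval: list everything or fetch by ID to verify what was written
  • Tutorial 2 (intermediate)
    • Goal: master complex queries, batch writes, updates, deletes, and statistics
    • Steps and purpose:
      • Batch add: build a richer business dataset (e-commerce products)
      • Complex conditions: precise filtering with $and/$or/$gt/$gte/$lt/$lte/$ne
      • Update and verify: keep document text and metadata consistent
      • Statistics: aggregate basic indicators (average price, brand distribution, stock)
      • Conditional delete: delete documents by status and check the result
  • Tutorial 3 (advanced)
    • Goal: persistence, multiple collections, custom embeddings, and batch optimization
    • Steps and purpose:
      • Persistent client: data is saved to disk and survives a program restart
      • Multiple collections: split by domain (document library / user profiles / recommendations)
      • Custom embeddings: replace the default embeddings with external or in-house vectors
      • Advanced queries: combined conditions, vector search, and filtering by author
      • Batch update: the example increments the "views" field of every document in one call
      • Export statistics: build a structured summary of a collection for backup or analysis
      • Reconnect check: verify that the data is intact after closing and reconnecting
  • Tutorial 4 (real-world application: game resource management system)
    • Goal: connect search, recommendation, ranking, statistics, and logging into one closed business loop
    • Core modules and purpose:
      • Resource library: holds "resource description text + metadata" for search and filtering
      • User profiles: can be extended into the data foundation for personalization (the example focuses on resources)
      • Search logs: record query text, category, hit count, and time to support dashboard analysis
      • Search: semantic retrieval plus metadata conditions (price, rating, category)
      • Recommendation: semantic results re-weighted by similarity, rating, and popularity
      • Popularity ranking: sort by a combined score of downloads and views
      • Resource statistics: counts per category, difficulty distribution, price ranges, average rating
      • Similar items: query again with the target document to find similar resources
      • Search analysis: popular query terms, category distribution, average hit count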
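
The recommendation step listed above (semantic results re-weighted by similarity, rating, and popularity) can be sketched without a database. This is a minimal illustration rather than the project's implementation: the names `blended_score` and `rerank` are hypothetical, the 0.4/0.3/0.3 weights and the download cap of 100 mirror script 04, and the `1 / (1 + distance)` similarity mapping is an assumption chosen so that the score stays within (0, 1].

```python
from typing import Dict, List


def blended_score(distance: float, rating: float, downloads: int) -> float:
    """Combine similarity, rating and popularity into one score in (0, 1]."""
    similarity = 1.0 / (1.0 + distance)       # smaller distance -> higher similarity
    rating_part = rating / 5.0                # ratings are on a 0-5 scale
    popularity = min(downloads / 100.0, 1.0)  # cap so very popular items do not dominate
    return 0.4 * similarity + 0.3 * rating_part + 0.3 * popularity


def rerank(candidates: List[Dict], top_n: int) -> List[Dict]:
    """Return the top_n candidates ordered by blended score, best first."""
    scored = [
        {**c, "score": blended_score(c["distance"], c["rating"], c["downloads"])}
        for c in candidates
    ]
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:top_n]


# Made-up candidates: A is semantically closer, B is better rated and more popular
candidates = [
    {"name": "A", "distance": 0.2, "rating": 4.0, "downloads": 10},
    {"name": "B", "distance": 0.5, "rating": 4.8, "downloads": 150},
]
ranked = rerank(candidates, top_n=2)
print([c["name"] for c in ranked])
```

Fetching more candidates than needed and then re-sorting them lets rating and popularity influence the final order without changing the vector query itself.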

4. How to Run

bash
# Basic tutorial
python tutorial/01_basic_example.py

# Intermediate tutorial
python tutorial/02_intermediate_example.py

# Advanced tutorial
python tutorial/03_advanced_example.py

# Real-world application
python tutorial/04_real_world_application.py

5. Full Code

01_basic_example.py

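python
"""
ChromaDB Tutorial 1: Basic Usage
================================

This file demonstrates the basic features of ChromaDB:
1. Creating a client and a collection
2. Adding documents
3. Basic queries
4. Retrieving documents
"""

import chromadb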

def basic_chromadb_tutorial():
    """基础ChromaDB教程"""
    
    print("=" * 60)
    print("ChromaDB 基础教程开始")
    print("=" * 60)
    
    # 步骤1: 创建ChromaDB客户端
    print("\n步骤1: 创建ChromaDB客户端")
    print("-" * 30)
    client = chromadb.Client()
    print("✓ 客户端创建成功")
    
    # 步骤2: 创建集合(相当于数据库中的表)
    print("\n步骤2: 创建集合")
    print("-" * 30)
    collection_name = "my_first_collection"
    collection = client.create_collection(collection_name)
    print(f"✓ 集合 '{collection_name}' 创建成功")
    
    # 步骤3: 准备一些简单的文档数据
    print("\n步骤3: 准备文档数据")
    print("-" * 30)
    documents = [
        "我喜欢吃苹果,苹果很甜很好吃",
        "香蕉是黄色的水果,营养丰富",
        "橙子含有丰富的维生素C",
        "草莓是红色的小浆果",
        "葡萄可以用来酿酒"
    ]
    
    # 为每个文档创建唯一的ID
    ids = [f"doc_{i+1}" for i in range(len(documents))]
    
    # 创建元数据(可选)
    metadatas = [
        {"category": "水果", "color": "红/绿"},
        {"category": "水果", "color": "黄"},
        {"category": "水果", "color": "橙"},
        {"category": "水果", "color": "红"},
        {"category": "水果", "color": "紫/绿"}
    ]
    
    print(f"✓ 准备了 {len(documents)} 个文档")
    for i, doc in enumerate(documents):
        print(f"  {i+1}. {doc[:20]}...")
    
    # 步骤4: 添加文档到集合
    print("\n步骤4: 添加文档到集合")
    print("-" * 30)
    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )
    print("✓ 所有文档已添加到集合中")
    
    # 步骤5: 基本查询 - 语义搜索
    print("\n步骤5: 进行语义搜索")
    print("-" * 30)
    query_text = "甜的水果"
    print(f"查询文本: '{query_text}'")
    
    results = collection.query(
        query_texts=[query_text],
        n_results=3  # 返回最相关的3个结果
    )
    
    print("搜索结果:")
    for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
        print(f"  {i+1}. {doc} (相似度距离: {distance:.4f})")
    
    # 步骤6: 基于元数据的过滤查询
    print("\n步骤6: 基于元数据的过滤查询")
    print("-" * 30)
    filtered_results = collection.query(
        query_texts=["营养"],
        n_results=3,
        where={"color": "黄"}  # 只查询黄色的水果
    )
    
    print("过滤查询结果(只查询黄色水果):")
    for doc in filtered_results['documents'][0]:
        print(f"  - {doc}")
    
    # 步骤7: 获取所有文档
    print("\n步骤7: 获取集合中的所有文档")
    print("-" * 30)
    all_docs = collection.get()
    print(f"集合中共有 {len(all_docs['documents'])} 个文档:")
    for i, (doc_id, doc, metadata) in enumerate(zip(
        all_docs['ids'], 
        all_docs['documents'], 
        all_docs['metadatas']
    )):
        print(f"  {i+1}. ID: {doc_id}, 内容: {doc[:30]}..., 元数据: {metadata}")
    
    # 步骤8: 根据ID获取特定文档
    print("\n步骤8: 根据ID获取特定文档")
    print("-" * 30)
    specific_doc = collection.get(ids=["doc_1", "doc_3"])
    print("获取到的特定文档:")
    for doc_id, doc in zip(specific_doc['ids'], specific_doc['documents']):
        print(f"  ID: {doc_id}, 内容: {doc}")
    
    print("\n" + "=" * 60)
    print("基础教程完成!")
    print("=" * 60)

if __name__ == "__main__":
    basic_chromadb_tutorial() 
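
`collection.query` returns one inner list per query text, which is why the script above indexes `results['documents'][0]`. The helper below is a small sketch of turning that nested shape into flat records. `flatten_query_result` is a hypothetical name, and the sample dictionary is written by hand to imitate the result shape; it is not output produced by ChromaDB.

```python
from typing import Any, Dict, List


def flatten_query_result(result: Dict[str, Any], query_index: int = 0) -> List[Dict[str, Any]]:
    """Turn the per-query lists of a query result into one dict per hit."""
    ids = result["ids"][query_index]
    documents = result["documents"][query_index]
    metadatas = result["metadatas"][query_index]
    distances = result["distances"][query_index]
    return [
        {"id": i, "document": d, "metadata": m, "distance": dist}
        for i, d, m, dist in zip(ids, documents, metadatas, distances)
    ]


# Hand-written sample with the same nesting as a one-query result (values are made up)
sample = {
    "ids": [["doc_1", "doc_2"]],
    "documents": [["我喜欢吃苹果,苹果很甜很好吃", "香蕉是黄色的水果,营养丰富"]],
    "metadatas": [[{"color": "红/绿"}, {"color": "黄"}]],
    "distances": [[0.31, 0.52]],
}

for hit in flatten_query_result(sample):
    print(hit["id"], hit["distance"], hit["metadata"]["color"])
```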

02_intermediate_example.py

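python
"""
ChromaDB Tutorial 2: Intermediate Features
==========================================

This file demonstrates intermediate features of ChromaDB:
1. Complex queries and filtering
2. Batch operations
3. Updating and deleting documents
4. Collection management
5. Multi-condition queries
"""

import chromadb
import json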

def intermediate_chromadb_tutorial():
    """中级ChromaDB教程"""
    
    print("=" * 60)
    print("ChromaDB 中级教程开始")
    print("=" * 60)
    
    # 步骤1: 创建客户端和集合
    print("\n步骤1: 创建客户端和集合")
    print("-" * 30)
    client = chromadb.Client()
    
    # Delete the collection if it is left over from an earlier run
    try:
        client.delete_collection("advanced_collection")
        print("已删除现有集合")
    except Exception:  # the collection does not exist yet
        pass
    
    collection = client.create_collection("advanced_collection")
    print("✓ 集合创建成功")
    
    # 步骤2: 准备更复杂的数据集
    print("\n步骤2: 准备复杂数据集")
    print("-" * 30)
    
    # 模拟电商产品数据
    products = [
        {
            "id": "prod_001",
            "name": "苹果iPhone 15 Pro",
            "description": "最新款苹果手机,配备A17 Pro芯片,48MP主摄像头,支持5G网络",
            "category": "电子产品",
            "brand": "苹果",
            "price": 7999,
            "rating": 4.8,
            "in_stock": True,
            "tags": ["手机", "5G", "摄影", "高端"]
        },
        {
            "id": "prod_002", 
            "name": "小米13 Ultra",
            "description": "小米旗舰手机,徕卡专业摄影,骁龙8 Gen2处理器",
            "category": "电子产品",
            "brand": "小米",
            "price": 5999,
            "rating": 4.6,
            "in_stock": True,
            "tags": ["手机", "摄影", "性价比"]
        },
        {
            "id": "prod_003",
            "name": "华为MatePad Pro",
            "description": "华为平板电脑,适合办公和娱乐,支持手写笔",
            "category": "电子产品", 
            "brand": "华为",
            "price": 3999,
            "rating": 4.5,
            "in_stock": False,
            "tags": ["平板", "办公", "手写"]
        },
        {
            "id": "prod_004",
            "name": "Nike Air Force 1",
            "description": "经典白色运动鞋,舒适透气,适合日常穿着",
            "category": "服装鞋帽",
            "brand": "耐克",
            "price": 899,
            "rating": 4.7,
            "in_stock": True,
            "tags": ["运动鞋", "经典", "白色"]
        },
        {
            "id": "prod_005",
            "name": "阿迪达斯Ultraboost 22",
            "description": "专业跑步鞋,Boost缓震科技,适合长跑",
            "category": "服装鞋帽",
            "brand": "阿迪达斯", 
            "price": 1299,
            "rating": 4.9,
            "in_stock": True,
            "tags": ["跑步鞋", "专业", "缓震"]
        }
    ]
    
    # 提取文档文本和元数据
    documents = []
    metadatas = []
    ids = []
    
    for product in products:
        # 组合名称和描述作为文档内容
        doc_text = f"{product['name']} {product['description']}"
        documents.append(doc_text)
        
        # 创建元数据
        metadata = {
            "category": product["category"],
            "brand": product["brand"],
            "price": product["price"],
            "rating": product["rating"],
            "in_stock": product["in_stock"],
            "tags": json.dumps(product["tags"], ensure_ascii=False)  # 转换为JSON字符串
        }
        metadatas.append(metadata)
        ids.append(product["id"])
    
    print(f"✓ 准备了 {len(products)} 个产品数据")
    
    # 步骤3: 批量添加文档
    print("\n步骤3: 批量添加文档")
    print("-" * 30)
    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )
    print("✓ 所有产品数据已添加到集合中")
    
    # 步骤4: 复杂查询示例
    print("\n步骤4: 复杂查询示例")
    print("-" * 30)
    
    # 4.1 语义搜索 + 价格过滤
    print("4.1 查找价格在5000以下的摄影设备:")
    camera_results = collection.query(
        query_texts=["摄影 拍照 相机"],
        n_results=5,
        where={"$and": [
            {"price": {"$lt": 5000}},  # 价格小于5000
            {"in_stock": True}         # 有库存
        ]}
    )
    
    for i, (doc, metadata) in enumerate(zip(camera_results['documents'][0], camera_results['metadatas'][0])):
        print(f"  {i+1}. {doc[:40]}... - 价格: ¥{metadata['price']}, 评分: {metadata['rating']}")
    
    # 4.2 品牌过滤查询
    print("\n4.2 查找苹果品牌的产品:")
    apple_results = collection.query(
        query_texts=["高端 专业"],
        n_results=3,
        where={"brand": "苹果"}
    )
    
    for doc in apple_results['documents'][0]:
        print(f"  - {doc[:50]}...")
    
    # 4.3 评分范围查询
    print("\n4.3 查找评分4.7以上的产品:")
    high_rating_results = collection.query(
        query_texts=["质量好 推荐"],
        n_results=5,
        where={"rating": {"$gte": 4.7}}  # 评分大于等于4.7
    )
    
    for doc, metadata in zip(high_rating_results['documents'][0], high_rating_results['metadatas'][0]):
        print(f"  - {doc[:40]}... (评分: {metadata['rating']})")
    
    # 步骤5: 文档更新操作
    print("\n步骤5: 文档更新操作")
    print("-" * 30)
    
    # 更新产品信息(比如价格变动)
    collection.update(
        ids=["prod_001"],
        documents=["苹果iPhone 15 Pro 最新款苹果手机,现在特价促销中!配备A17 Pro芯片"],
        metadatas=[{
            "category": "电子产品",
            "brand": "苹果", 
            "price": 7599,  # 降价了
            "rating": 4.8,
            "in_stock": True,
            "tags": json.dumps(["手机", "5G", "摄影", "高端", "促销"], ensure_ascii=False)
        }]
    )
    print("✓ 已更新iPhone 15 Pro的价格和描述")
    
    # 步骤6: 条件查询验证更新
    print("\n步骤6: 验证更新结果")
    print("-" * 30)
    updated_product = collection.get(ids=["prod_001"])
    print("更新后的产品信息:")
    print(f"  文档: {updated_product['documents'][0]}")
    print(f"  元数据: {updated_product['metadatas'][0]}")
    
    # 步骤7: 多条件复合查询
    print("\n步骤7: 多条件复合查询")
    print("-" * 30)
    
    # 查找价格在1000-6000之间,有库存的电子产品
    complex_results = collection.query(
        query_texts=["科技 数码 电子"],
        n_results=5,
        where={
            "$and": [
                {"category": "电子产品"},
                {"price": {"$gte": 1000}},
                {"price": {"$lte": 6000}},
                {"in_stock": True}
            ]
        }
    )
    
    print("符合条件的电子产品:")
    for doc, metadata in zip(complex_results['documents'][0], complex_results['metadatas'][0]):
        print(f"  - {doc[:45]}... - ¥{metadata['price']}")
    
    # 步骤8: 集合统计信息
    print("\n步骤8: 集合统计信息")
    print("-" * 30)
    
    all_items = collection.get()
    total_count = len(all_items['documents'])
    
    # 统计各个品牌的产品数量
    brand_count = {}
    total_value = 0
    in_stock_count = 0
    
    for metadata in all_items['metadatas']:
        brand = metadata['brand']
        brand_count[brand] = brand_count.get(brand, 0) + 1
        total_value += metadata['price']
        if metadata['in_stock']:
            in_stock_count += 1
    
    print(f"集合统计:")
    print(f"  - 总产品数: {total_count}")
    print(f"  - 有库存产品: {in_stock_count}")
    print(f"  - 平均价格: ¥{total_value/total_count:.2f}")
    print(f"  - 品牌分布: {brand_count}")
    
    # 步骤9: 删除文档示例
    print("\n步骤9: 删除文档示例")
    print("-" * 30)
    
    print("删除缺货的产品...")
    # 首先找到缺货的产品
    out_of_stock = collection.get(where={"in_stock": False})
    if out_of_stock['ids']:
        collection.delete(ids=out_of_stock['ids'])
        print(f"✓ 已删除 {len(out_of_stock['ids'])} 个缺货产品")
    else:
        print("没有找到缺货产品")
    
    # 验证删除结果
    remaining = collection.get()
    print(f"剩余产品数量: {len(remaining['documents'])}")
    
    print("\n" + "=" * 60)
    print("中级教程完成!")
    print("=" * 60)

if __name__ == "__main__":
    intermediate_chromadb_tutorial() 
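
The filters in the script above are written out by hand. When conditions are optional, it can help to assemble the `where` dictionary programmatically. The following is a sketch under assumptions: `build_where` is a hypothetical helper, and it returns a single condition unwrapped because `$and` is normally given at least two sub-conditions (script 04 makes the same choice).

```python
from typing import Any, Dict, List, Optional


def build_where(category: Optional[str] = None,
                min_price: Optional[float] = None,
                max_price: Optional[float] = None,
                in_stock: Optional[bool] = None) -> Optional[Dict[str, Any]]:
    """Build a metadata filter from whichever conditions are provided."""
    conditions: List[Dict[str, Any]] = []
    if category is not None:
        conditions.append({"category": category})
    if min_price is not None:
        conditions.append({"price": {"$gte": min_price}})
    if max_price is not None:
        conditions.append({"price": {"$lte": max_price}})
    if in_stock is not None:
        conditions.append({"in_stock": in_stock})

    if not conditions:
        return None            # no filter at all
    if len(conditions) == 1:
        return conditions[0]   # a single condition needs no $and wrapper
    return {"$and": conditions}


print(build_where(category="电子产品", min_price=1000, max_price=6000, in_stock=True))
print(build_where(max_price=5000))
print(build_where())
```

The returned value can be passed as the `where=` argument of `collection.query` or `collection.get`.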

03_advanced_example.py

python
"""
ChromaDB Tutorial 3: Advanced Features
======================================

This file demonstrates advanced features of ChromaDB:
1. Custom embeddings (precomputed vectors)
2. Persistent storage
3. Managing multiple collections
4. Batch operation optimization
5. Working with embedding vectors
6. Collection configuration and metadata
"""

import chromadb
import os
import shutil

def advanced_chromadb_tutorial():
    """高级ChromaDB教程"""
    
    print("=" * 60)
    print("ChromaDB 高级教程开始")
    print("=" * 60)
    
    # 步骤1: 配置持久化存储
    print("\n步骤1: 配置持久化存储")
    print("-" * 30)
    
    # 设置持久化目录
    persist_directory = "./chroma_db_storage"
    
    # 如果目录存在,先清理
    if os.path.exists(persist_directory):
        shutil.rmtree(persist_directory)
        print("已清理旧的存储目录")
    
    # 创建持久化客户端
    client = chromadb.PersistentClient(path=persist_directory)
    print(f"✓ 持久化客户端创建成功,存储路径: {persist_directory}")
    
    # 步骤2: 创建多个不同配置的集合
    print("\n步骤2: 创建多个不同配置的集合")
    print("-" * 30)
    
    # 集合1: 文档库(使用默认嵌入)
    doc_collection = client.create_collection(
        name="document_library",
        metadata={"description": "文档库集合", "type": "documents"}
    )
    print("✓ 文档库集合创建成功")
    
    # Collection 2: user profiles (precomputed embeddings are supplied in step 4; the distance metric stays at the default)
    user_collection = client.create_collection(
        name="user_profiles", 
        metadata={"description": "用户画像集合", "type": "profiles"}
    )
    print("✓ 用户画像集合创建成功")
    
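    # Collection 3: product recommendations (created for illustration only; no distance metric is configured, so the default applies)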
    product_collection = client.create_collection(
        name="product_recommendations",
        metadata={"description": "产品推荐集合", "type": "products"}
    )
    print("✓ 产品推荐集合创建成功")
    
    # 步骤3: 向文档库添加技术文档
    print("\n步骤3: 向文档库添加技术文档")
    print("-" * 30)
    
    tech_docs = [
        {
            "id": "doc_python_001",
            "title": "Python基础教程",
            "content": "Python是一种高级编程语言,具有简洁的语法和强大的功能。适合初学者学习编程,也广泛用于数据科学、web开发、人工智能等领域。",
            "category": "编程语言",
            "difficulty": "初级",
            "tags": ["Python", "编程", "基础"],
            "author": "张三",
            "views": 1250
        },
        {
            "id": "doc_ai_002", 
            "title": "机器学习入门指南",
            "content": "机器学习是人工智能的一个重要分支,通过算法使计算机能够从数据中学习和改进。包括监督学习、无监督学习和强化学习等方法。",
            "category": "人工智能",
            "difficulty": "中级",
            "tags": ["机器学习", "AI", "算法"],
            "author": "李四",
            "views": 2100
        },
        {
            "id": "doc_web_003",
            "title": "React前端开发最佳实践",
            "content": "React是Facebook开发的JavaScript库,用于构建用户界面。具有组件化、虚拟DOM、单向数据流等特点,是现代前端开发的热门选择。",
            "category": "前端开发", 
            "difficulty": "中级",
            "tags": ["React", "JavaScript", "前端"],
            "author": "王五",
            "views": 1800
        },
        {
            "id": "doc_db_004",
            "title": "数据库设计原则与实践",
            "content": "数据库设计是系统开发的重要环节,需要考虑数据模型、范式、索引、性能优化等因素。好的数据库设计能提高系统效率和维护性。",
            "category": "数据库",
            "difficulty": "高级", 
            "tags": ["数据库", "设计", "SQL"],
            "author": "赵六",
            "views": 980
        },
        {
            "id": "doc_cloud_005",
            "title": "云计算架构与部署策略",
            "content": "云计算提供可扩展的计算资源和服务,包括IaaS、PaaS、SaaS等服务模式。合理的云架构设计能提高系统可靠性和降低成本。",
            "category": "云计算",
            "difficulty": "高级",
            "tags": ["云计算", "架构", "部署"],
            "author": "孙七",
            "views": 1450
        }
    ]
    
    # 批量添加文档
    doc_texts = []
    doc_metadatas = []
    doc_ids = []
    
    for doc in tech_docs:
        text = f"{doc['title']} {doc['content']}"
        doc_texts.append(text)
        
        metadata = {
            "title": doc["title"],
            "category": doc["category"],
            "difficulty": doc["difficulty"],
            "author": doc["author"],
            "views": doc["views"],
            "tags": ",".join(doc["tags"])
        }
        doc_metadatas.append(metadata)
        doc_ids.append(doc["id"])
    
    doc_collection.add(
        documents=doc_texts,
        metadatas=doc_metadatas,
        ids=doc_ids
    )
    print(f"✓ 已添加 {len(tech_docs)} 个技术文档")
    
    # 步骤4: 使用自定义嵌入向量
    print("\n步骤4: 使用自定义嵌入向量操作")
    print("-" * 30)
    
    # 为用户画像创建自定义嵌入向量
    user_profiles = [
        {
            "id": "user_001",
            "name": "小明",
            "interests": "编程 游戏 音乐",
            "skills": ["Python", "JavaScript", "SQL"],
            "experience_years": 3,
            "embedding": [0.1, 0.8, 0.3, 0.6, 0.2, 0.9, 0.4, 0.7]  # 自定义8维向量
        },
        {
            "id": "user_002", 
            "name": "小红",
            "interests": "设计 艺术 摄影",
            "skills": ["Photoshop", "Illustrator", "UI设计"],
            "experience_years": 5,
            "embedding": [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.6, 0.4]
        },
        {
            "id": "user_003",
            "name": "小李",
            "interests": "数据分析 机器学习 统计",
            "skills": ["Python", "R", "机器学习", "统计学"],
            "experience_years": 4,
            "embedding": [0.2, 0.9, 0.1, 0.8, 0.5, 0.6, 0.3, 0.7]
        }
    ]
    
    # 添加用户画像(使用自定义嵌入向量)
    user_texts = []
    user_metadatas = []
    user_ids = []
    user_embeddings = []
    
    for user in user_profiles:
        text = f"{user['name']} 的兴趣是 {user['interests']}"
        user_texts.append(text)
        
        metadata = {
            "name": user["name"],
            "interests": user["interests"],
            "skills": ",".join(user["skills"]),
            "experience_years": user["experience_years"]
        }
        user_metadatas.append(metadata)
        user_ids.append(user["id"])
        user_embeddings.append(user["embedding"])
    
    user_collection.add(
        documents=user_texts,
        metadatas=user_metadatas,
        ids=user_ids,
        embeddings=user_embeddings  # 使用自定义嵌入向量
    )
    print(f"✓ 已添加 {len(user_profiles)} 个用户画像(使用自定义嵌入向量)")
    
    # 步骤5: 高级查询操作
    print("\n步骤5: 高级查询操作")
    print("-" * 30)
    
    # 5.1 多条件组合查询文档
    print("5.1 查找适合初学者的编程相关文档:")
    beginner_docs = doc_collection.query(
        query_texts=["编程 学习 入门"],
        n_results=3,
        where={
            "$or": [
                {"difficulty": "初级"},
                {"category": "编程语言"}
            ]
        }
    )
    
    for doc, metadata in zip(beginner_docs['documents'][0], beginner_docs['metadatas'][0]):
        print(f"  - {metadata['title']} ({metadata['difficulty']}) - 浏览量: {metadata['views']}")
    
    # 5.2 基于自定义嵌入向量的用户相似性查询
    print("\n5.2 查找与指定向量相似的用户:")
    query_embedding = [0.15, 0.85, 0.25, 0.65, 0.35, 0.75, 0.45, 0.55]  # 查询向量
    
    similar_users = user_collection.query(
        query_embeddings=[query_embedding],
        n_results=2
    )
    
    for user, metadata, distance in zip(
        similar_users['documents'][0], 
        similar_users['metadatas'][0],
        similar_users['distances'][0]
    ):
        print(f"  - {metadata['name']}: {user} (距离: {distance:.4f})")
        print(f"    技能: {metadata['skills']}")
    
    # 5.3 获取特定作者的所有文档
    print("\n5.3 获取特定作者的所有文档:")
    author_docs = doc_collection.get(
        where={"author": "李四"}
    )
    
    for doc_id, metadata in zip(author_docs['ids'], author_docs['metadatas']):
        print(f"  - {doc_id}: {metadata['title']}")
    
    # 步骤6: 批量操作和性能优化
    print("\n步骤6: 批量操作和性能优化")
    print("-" * 30)
    
    # 批量更新文档的浏览量
    print("批量更新文档浏览量...")
    all_docs = doc_collection.get()
    
    updated_metadatas = []
    for metadata in all_docs['metadatas']:
        new_metadata = metadata.copy()
        new_metadata['views'] += 100  # 增加100次浏览
        updated_metadatas.append(new_metadata)
    
    doc_collection.update(
        ids=all_docs['ids'],
        metadatas=updated_metadatas
    )
    print("✓ 批量更新完成")
    
    # 验证更新结果
    updated_docs = doc_collection.get(ids=["doc_python_001"])
    print(f"验证:Python教程新浏览量 = {updated_docs['metadatas'][0]['views']}")
    
    # 步骤7: 集合管理和统计
    print("\n步骤7: 集合管理和统计")
    print("-" * 30)
    
    # 获取所有集合信息
    collections = client.list_collections()
    print("当前所有集合:")
    for collection in collections:
        print(f"  - {collection.name}: {collection.metadata}")
        
        # count() reports the number of records without loading them
        print(f"    文档数量: {collection.count()}")
    
    # 步骤8: 数据导出和备份
    print("\n步骤8: 数据导出示例")
    print("-" * 30)
    
    # 导出文档库的所有数据
    export_data = doc_collection.get()
    export_summary = {
        "collection_name": "document_library",
        "total_documents": len(export_data['documents']),
        "categories": {},
        "authors": {}
    }
    
    # 统计分类和作者分布
    for metadata in export_data['metadatas']:
        category = metadata['category']
        author = metadata['author']
        
        export_summary["categories"][category] = export_summary["categories"].get(category, 0) + 1
        export_summary["authors"][author] = export_summary["authors"].get(author, 0) + 1
    
    print("导出数据统计:")
    print(f"  总文档数: {export_summary['total_documents']}")
    print(f"  分类分布: {export_summary['categories']}")
    print(f"  作者分布: {export_summary['authors']}")
    
    # 步骤9: 高级搜索示例
    print("\n步骤9: 高级搜索示例")
    print("-" * 30)
    
    # 多关键词搜索
    multi_keyword_results = doc_collection.query(
        query_texts=["Python 数据分析", "JavaScript 前端开发"],
        n_results=2
    )
    
    print("多关键词搜索结果:")
    for i, query_results in enumerate(multi_keyword_results['documents']):
        print(f"  查询 {i+1} 的结果:")
        for doc, metadata in zip(query_results, multi_keyword_results['metadatas'][i]):
            print(f"    - {metadata['title']}")
    
    # 步骤10: 清理和持久化验证
    print("\n步骤10: 持久化验证")
    print("-" * 30)
    
    print("关闭客户端,验证数据持久化...")
    # Note: `del` only drops this reference; it does not shut ChromaDB down.
    # For a strict check, open the same path from a new Python process.
    del client
    
    # Reconnect and confirm that the data is still there
    client_new = chromadb.PersistentClient(path=persist_directory)
    collections_after_restart = client_new.list_collections()
    
    print("重新连接后的集合:")
    for collection in collections_after_restart:
        print(f"  - {collection.name}: {collection.count()} 个文档")
    
    print("✓ 数据持久化验证成功")
    
    print("\n" + "=" * 60)
    print("高级教程完成!")
    print("数据已保存到:", persist_directory)
    print("=" * 60)

if __name__ == "__main__":
    advanced_chromadb_tutorial() 
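
Step 5.2 above prints a distance for each matching user vector. ChromaDB's default metric is squared Euclidean (L2) distance, and cosine distance is a common alternative. The sketch below computes both by hand for the query vector and two of the user vectors from step 4, which can make the printed distances easier to interpret. The function names are illustrative and no ChromaDB call is involved.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Query vector and user vectors copied from the tutorial script
query = [0.15, 0.85, 0.25, 0.65, 0.35, 0.75, 0.45, 0.55]
users = {
    "user_001": [0.1, 0.8, 0.3, 0.6, 0.2, 0.9, 0.4, 0.7],
    "user_002": [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.6, 0.4],
}

for user_id, vector in users.items():
    print(
        user_id,
        f"squared L2 = {squared_l2(query, vector):.4f}",
        f"cosine = {cosine_distance(query, vector):.4f}",
    )
```

Because squared L2 distance is unbounded above, an expression such as `1 - distance` is only a rough similarity proxy under this metric and can become negative.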

04_real_world_application.py

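python
"""
ChromaDB Tutorial 4: Real-World Application - Game Resource Management System
=============================================================================

This file demonstrates a complete real-world use case:
1. Game resource search engine
2. Personalized recommendation system
3. Popular resource ranking
4. Category management and statistics
5. User behavior analysis
6. Resource similarity matching
"""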

import chromadb
import json
import random
from datetime import datetime, timedelta
from typing import List, Dict, Tuple
import re

class GameResourceManager:
    """游戏资源管理系统"""
    
    def __init__(self, persist_path="./game_resource_db"):
        """初始化系统"""
        self.client = chromadb.PersistentClient(path=persist_path)
        self.resources_collection = None
        self.users_collection = None
        self.search_logs_collection = None
        self._init_collections()
    
    def _init_collections(self):
        """初始化集合"""
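        # Delete leftover collections from earlier runs (demo only).
        # Each deletion gets its own try block so that one missing
        # collection does not skip the remaining deletions.
        for name in ("game_resources", "user_profiles", "search_logs"):
            try:
                self.client.delete_collection(name)
            except Exception:  # the collection does not exist yet
                pass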
        
        # 创建游戏资源集合
        self.resources_collection = self.client.create_collection(
            name="game_resources",
            metadata={"description": "游戏资源库", "version": "1.0"}
        )
        
        # 创建用户画像集合
        self.users_collection = self.client.create_collection(
            name="user_profiles",
            metadata={"description": "用户画像", "version": "1.0"}
        )
        
        # 创建搜索日志集合
        self.search_logs_collection = self.client.create_collection(
            name="search_logs",
            metadata={"description": "用户搜索日志", "version": "1.0"}
        )
        
        print("✓ 所有集合初始化完成")
    
    def load_sample_resources(self):
        """加载示例游戏资源数据"""
        print("\n开始加载示例游戏资源...")
        
        # 扩展的游戏资源数据
        resources_data = [
            # 角色类资源
            {
                "name": "高达:独角兽(双形态)",
                "category": "怪兽", 
                "subcategory": "机甲",
                "description": "经典高达系列独角兽高达,具备双形态变换能力,毁灭模式和独角兽模式",
                "tags": ["高达", "机甲", "变形", "独角兽", "毁灭模式"],
                "views": 30, "downloads": 26, "rating": 4.8, "price": 0,
                "difficulty": "高级", "size_mb": 15.6, "format": "fbx"
            },
            {
                "name": "都市男J",
                "category": "都市角色",
                "subcategory": "男性角色", 
                "description": "现代都市风格男性角色,适合都市题材游戏",
                "tags": ["都市", "男性", "现代", "角色"],
                "views": 75, "downloads": 154, "rating": 4.5, "price": 0,
                "difficulty": "中级", "size_mb": 8.2, "format": "fbx"
            },
            {
                "name": "绿裙女",
                "category": "修仙角色",
                "subcategory": "女性角色",
                "description": "古风修仙题材女性角色,身着绿色长裙,仙气飘逸",
                "tags": ["修仙", "古风", "女性", "仙女", "绿裙"],
                "views": 17, "downloads": 41, "rating": 4.6, "price": 5,
                "difficulty": "中级", "size_mb": 12.3, "format": "fbx"
            },
            {
                "name": "狼兽人",
                "category": "女频角色",
                "subcategory": "兽人",
                "description": "狼族兽人角色,适合奇幻冒险游戏",
                "tags": ["兽人", "狼族", "奇幻", "野性"],
                "views": 45, "downloads": 57, "rating": 4.3, "price": 3,
                "difficulty": "高级", "size_mb": 18.7, "format": "fbx"
            },
            
            # 道具类资源
            {
                "name": "武器:石质锤子",
                "category": "道具",
                "subcategory": "武器",
                "description": "原始风格的石质战锤,适合史前或野蛮人题材",
                "tags": ["武器", "锤子", "石质", "原始", "近战"],
                "views": 10, "downloads": 24, "rating": 4.1, "price": 2,
                "difficulty": "初级", "size_mb": 3.4, "format": "obj"
            },
            {
                "name": "蛋糕",
                "category": "道具",
                "subcategory": "食物",
                "description": "精美的生日蛋糕模型,适合休闲游戏",
                "tags": ["食物", "蛋糕", "甜品", "生日"],
                "views": 7, "downloads": 21, "rating": 4.0, "price": 1,
                "difficulty": "初级", "size_mb": 2.1, "format": "obj"
            },
            {
                "name": "鱼竿3",
                "category": "道具", 
                "subcategory": "工具",
                "description": "钓鱼竿道具,适合钓鱼或休闲游戏",
                "tags": ["工具", "鱼竿", "钓鱼", "休闲"],
                "views": 8, "downloads": 21, "rating": 3.9, "price": 1,
                "difficulty": "初级", "size_mb": 1.8, "format": "obj"
            },
            
            # 场景类资源  
            {
                "name": "跑车内部后排",
                "category": "场景",
                "subcategory": "车辆内饰",
                "description": "豪华跑车后排座椅内饰场景",
                "tags": ["跑车", "内饰", "豪华", "车辆"],
                "views": 65, "downloads": 143, "rating": 4.7, "price": 8,
                "difficulty": "高级", "size_mb": 25.3, "format": "fbx"
            },
            {
                "name": "草屋室内",
                "category": "场景",
                "subcategory": "建筑内部",
                "description": "乡村风格草屋内部场景,温馨朴素",
                "tags": ["草屋", "乡村", "室内", "朴素"],
                "views": 60, "downloads": 128, "rating": 4.4, "price": 6,
                "difficulty": "中级", "size_mb": 19.8, "format": "fbx"
            },
            {
                "name": "山洞出口",
                "category": "场景",
                "subcategory": "自然环境",
                "description": "神秘山洞出口场景,适合冒险游戏",
                "tags": ["山洞", "出口", "自然", "冒险"],
                "views": 40, "downloads": 74, "rating": 4.2, "price": 4,
                "difficulty": "中级", "size_mb": 16.5, "format": "fbx"
            },
            
            # 特效类资源
            {
                "name": "寒冰特效5",
                "category": "特效",
                "subcategory": "元素特效",
                "description": "寒冰系魔法特效,冰晶飞舞效果",
                "tags": ["特效", "寒冰", "魔法", "冰晶"],
                "views": 9, "downloads": 27, "rating": 4.5, "price": 3,
                "difficulty": "高级", "size_mb": 5.7, "format": "vfx"
            },
            {
                "name": "寒冰攻击5", 
                "category": "特效",
                "subcategory": "攻击特效",
                "description": "寒冰系攻击技能特效",
                "tags": ["特效", "攻击", "寒冰", "技能"],
                "views": 10, "downloads": 29, "rating": 4.3, "price": 3,
                "difficulty": "高级", "size_mb": 6.2, "format": "vfx"
            }
        ]
        
        # 准备数据
        documents = []
        metadatas = []
        ids = []
        
        for i, resource in enumerate(resources_data):
            # 构建搜索文档
            doc_text = f"{resource['name']} {resource['description']} {' '.join(resource['tags'])}"
            documents.append(doc_text)
            
            # 构建元数据
            metadata = {
                "name": resource["name"],
                "category": resource["category"],
                "subcategory": resource["subcategory"],
                "description": resource["description"],
                "tags": json.dumps(resource["tags"], ensure_ascii=False),
                "views": resource["views"],
                "downloads": resource["downloads"],
                "rating": resource["rating"],
                "price": resource["price"],
                "difficulty": resource["difficulty"],
                "size_mb": resource["size_mb"],
                "format": resource["format"],
                "upload_date": (datetime.now() - timedelta(days=random.randint(1, 30))).isoformat()
            }
            metadatas.append(metadata)
            ids.append(f"resource_{i+1:03d}")
        
        # 批量添加到数据库
        self.resources_collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids
        )
        
        print(f"✓ 已加载 {len(resources_data)} 个游戏资源")
        return len(resources_data)
    
    def search_resources(self, query: str, category: str = None, 
                        max_price: float = None, min_rating: float = None, 
                        n_results: int = 5) -> List[Dict]:
        """搜索游戏资源"""
        
        # 构建查询条件
        where_conditions = []
        
        if category:
            where_conditions.append({"category": category})
        
        if max_price is not None:
            where_conditions.append({"price": {"$lte": max_price}})
            
        if min_rating is not None:
            where_conditions.append({"rating": {"$gte": min_rating}})
        
        # 组合条件
        where_clause = None
        if where_conditions:
            if len(where_conditions) == 1:
                where_clause = where_conditions[0]
            else:
                where_clause = {"$and": where_conditions}
        
        # 执行搜索
        results = self.resources_collection.query(
            query_texts=[query],
            n_results=n_results,
            where=where_clause
        )
        
        # 记录搜索日志
        self._log_search(query, category, len(results['documents'][0]))
        
        # 格式化结果
        search_results = []
        for doc, metadata, distance in zip(
            results['documents'][0], 
            results['metadatas'][0],
            results['distances'][0]
        ):
            result = {
                "name": metadata["name"],
                "category": metadata["category"],
                "description": metadata["description"],
                "tags": json.loads(metadata["tags"]),
                "views": metadata["views"],
                "downloads": metadata["downloads"], 
                "rating": metadata["rating"],
                "price": metadata["price"],
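                "similarity_score": 1 - distance  # Equals cosine similarity only in a cosine-space collection; under Chroma's default squared-L2 distance it can be negative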
            }
            search_results.append(result)
        
        return search_results
    
    def get_recommendations(self, user_preferences: List[str], n_results: int = 5) -> List[Dict]:
        """基于用户偏好获取推荐"""
        
        # 组合用户偏好作为查询
        query = " ".join(user_preferences)
        
        # 获取推荐(偏向高评分和热门资源)
        results = self.resources_collection.query(
            query_texts=[query],
            n_results=n_results * 2,  # 获取更多结果用于排序
            where={"rating": {"$gte": 4.0}}  # 只推荐高评分资源
        )
        
        # 重新排序(考虑评分、下载量等因素)
        recommendations = []
        for metadata, distance in zip(results['metadatas'][0], results['distances'][0]):
            score = (
                (1 - distance) * 0.4 +  # 相似度权重40%
                (metadata["rating"] / 5.0) * 0.3 +  # 评分权重30%
                min(metadata["downloads"] / 100, 1.0) * 0.3  # 下载量权重30%
            )
            
            recommendation = {
                "name": metadata["name"],
                "category": metadata["category"],
                "description": metadata["description"],
                "rating": metadata["rating"],
                "downloads": metadata["downloads"],
                "price": metadata["price"],
                "recommendation_score": score
            }
            recommendations.append(recommendation)
        
        # 按推荐分数排序并返回指定数量
        recommendations.sort(key=lambda x: x["recommendation_score"], reverse=True)
        return recommendations[:n_results]
    
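    def get_popular_resources(self, category: str = None, period_days: int = 30) -> List[Dict]:
        """Return up to 10 popular resources, ranked by downloads and views.

        Note: `period_days` is accepted but not applied; the metadata only
        stores cumulative download and view counts, so there is no
        per-period data to filter on.
        """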
        
        where_clause = None
        if category:
            where_clause = {"category": category}
        
        # 获取所有资源
        all_resources = self.resources_collection.get(where=where_clause)
        
        # Collect the fields used for ranking and display
        popular_resources = []
        for metadata in all_resources['metadatas']:
            resource = {
                "name": metadata["name"],
                "category": metadata["category"],
                "downloads": metadata["downloads"],
                "views": metadata["views"],
                "rating": metadata["rating"],
                "price": metadata["price"]
            }
            popular_resources.append(resource)
        
        # 排序(下载量权重70%,浏览量权重30%)
        popular_resources.sort(
            key=lambda x: x["downloads"] * 0.7 + x["views"] * 0.3, 
            reverse=True
        )
        
        return popular_resources[:10]
    
    def get_category_statistics(self) -> Dict:
        """获取分类统计信息"""
        
        all_resources = self.resources_collection.get()
        
        stats = {
            "total_resources": len(all_resources['documents']),
            "categories": {},
            "avg_rating": 0,
            "total_downloads": 0,
            "price_range": {"min": float('inf'), "max": 0},
            "difficulty_distribution": {}
        }
        
        total_rating = 0
        
        for metadata in all_resources['metadatas']:
            category = metadata["category"]
            difficulty = metadata["difficulty"]
            rating = metadata["rating"]
            downloads = metadata["downloads"]
            price = metadata["price"]
            
            # Per-category counters ("avg_rating" holds a running sum until averaged below)
            if category not in stats["categories"]:
                stats["categories"][category] = {
                    "count": 0, "avg_rating": 0, "total_downloads": 0
                }
            stats["categories"][category]["count"] += 1
            stats["categories"][category]["avg_rating"] += rating
            stats["categories"][category]["total_downloads"] += downloads
            
            # Difficulty distribution
            stats["difficulty_distribution"][difficulty] = stats["difficulty_distribution"].get(difficulty, 0) + 1
            
            # Overall totals
            total_rating += rating
            stats["total_downloads"] += downloads
            stats["price_range"]["min"] = min(stats["price_range"]["min"], price)
            stats["price_range"]["max"] = max(stats["price_range"]["max"], price)
        
        # Empty collection: avoid dividing by zero and reset the placeholder minimum
        if not all_resources['metadatas']:
            stats["price_range"]["min"] = 0
            return stats
        
        # Overall and per-category averages
        stats["avg_rating"] = total_rating / len(all_resources['metadatas'])
        
        for category in stats["categories"]:
            category_count = stats["categories"][category]["count"]
            stats["categories"][category]["avg_rating"] /= category_count
            stats["categories"][category]["avg_downloads"] = (
                stats["categories"][category]["total_downloads"] / category_count
            )
        
        return stats
    
    def find_similar_resources(self, resource_name: str, n_results: int = 5) -> List[Dict]:
        """查找相似资源"""
        
        # 首先找到指定资源
        target_resource = self.resources_collection.get(
            where={"name": resource_name}
        )
        
        if not target_resource['documents']:
            return []
        
        # 使用目标资源作为查询
        target_doc = target_resource['documents'][0]
        target_metadata = target_resource['metadatas'][0]
        
        # 查找相似资源(排除自身)
        similar_results = self.resources_collection.query(
            query_texts=[target_doc],
            n_results=n_results + 1,  # +1 因为结果会包含自身
            where={"category": target_metadata["category"]}  # 同类别
        )
        
        # 过滤掉自身并格式化结果
        similar_resources = []
        for metadata, distance in zip(similar_results['metadatas'][0], similar_results['distances'][0]):
            if metadata["name"] != resource_name:  # 排除自身
                similar_resource = {
                    "name": metadata["name"],
                    "category": metadata["category"],
                    "description": metadata["description"],
                    "rating": metadata["rating"],
                    "similarity": 1 - distance
                }
                similar_resources.append(similar_resource)
        
        return similar_resources[:n_results]
    
    def _log_search(self, query: str, category: str, result_count: int):
        """记录搜索日志"""
        log_doc = f"用户搜索: {query}"
        if category:
            log_doc += f" 分类: {category}"
        
        log_metadata = {
            "query": query,
            "category": category or "全部",
            "result_count": result_count,
            "timestamp": datetime.now().isoformat()
        }
        
        # 生成日志ID
        log_id = f"search_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{random.randint(1000, 9999)}"
        
        self.search_logs_collection.add(
            documents=[log_doc],
            metadatas=[log_metadata],
            ids=[log_id]
        )
    
    def get_search_analytics(self, days: int = 7) -> Dict:
        """Summarize search logs from the last `days` days."""
        
        # Timestamps are stored as ISO-8601 strings, so fetch all logs
        # and filter by date in Python
        all_logs = self.search_logs_collection.get()
        cutoff = datetime.now() - timedelta(days=days)
        recent_logs = [
            metadata for metadata in (all_logs['metadatas'] or [])
            if datetime.fromisoformat(metadata["timestamp"]) >= cutoff
        ]
        
        if not recent_logs:
            return {"message": "暂无搜索数据"}
        
        # Aggregate the filtered logs
        analytics = {
            "total_searches": len(recent_logs),
            "popular_queries": {},
            "category_searches": {},
            "avg_results_per_search": 0
        }
        
        total_results = 0
        
        for metadata in recent_logs:
            query = metadata["query"]
            category = metadata["category"]
            result_count = metadata["result_count"]
            
            # Count occurrences of each query
            analytics["popular_queries"][query] = analytics["popular_queries"].get(query, 0) + 1
            
            # Count searches per category
            analytics["category_searches"][category] = analytics["category_searches"].get(category, 0) + 1
            
            total_results += result_count
        
        # Average number of hits per search
        analytics["avg_results_per_search"] = total_results / len(recent_logs)
        
        # 排序热门查询
        analytics["popular_queries"] = dict(
            sorted(analytics["popular_queries"].items(), key=lambda x: x[1], reverse=True)
        )
        
        return analytics

def demo_game_resource_system():
    """演示游戏资源管理系统"""
    
    print("=" * 60)
    print("游戏资源管理系统演示")
    print("=" * 60)
    
    # 初始化系统
    print("\n1. 初始化系统")
    print("-" * 30)
    manager = GameResourceManager()
    
    # 加载示例数据
    print("\n2. 加载示例资源数据")
    print("-" * 30)
    total_resources = manager.load_sample_resources()
    
    # 基础搜索演示
    print("\n3. 基础搜索演示")
    print("-" * 30)
    
    search_queries = [
        ("高达机甲", None),
        ("武器", "道具"),
        ("场景", "场景"),
        ("特效", None)
    ]
    
    for query, category in search_queries:
        print(f"\n搜索: '{query}'" + (f" (分类: {category})" if category else ""))
        results = manager.search_resources(query, category=category, n_results=3)
        
        for i, result in enumerate(results, 1):
            print(f"  {i}. {result['name']} - {result['category']}")
            print(f"     评分: {result['rating']}, 下载: {result['downloads']}, 相似度: {result['similarity_score']:.3f}")
    
    # 个性化推荐演示
    print("\n4. 个性化推荐演示")
    print("-" * 30)
    
    user_preferences = ["机甲", "高达", "科幻", "变形"]
    print(f"用户偏好: {user_preferences}")
    
    recommendations = manager.get_recommendations(user_preferences, n_results=3)
    print("推荐资源:")
    for i, rec in enumerate(recommendations, 1):
        print(f"  {i}. {rec['name']} - {rec['category']}")
        print(f"     推荐分数: {rec['recommendation_score']:.3f}, 价格: ¥{rec['price']}")
    
    # 热门资源榜
    print("\n5. 热门资源排行榜")
    print("-" * 30)
    
    popular_all = manager.get_popular_resources()[:5]
    print("全部分类热门资源:")
    for i, resource in enumerate(popular_all, 1):
        print(f"  {i}. {resource['name']} - 下载: {resource['downloads']}, 浏览: {resource['views']}")
    
    # 分类统计
    print("\n6. 资源统计分析")
    print("-" * 30)
    
    stats = manager.get_category_statistics()
    print(f"总资源数: {stats['total_resources']}")
    print(f"平均评分: {stats['avg_rating']:.2f}")
    print(f"总下载量: {stats['total_downloads']}")
    print(f"价格范围: ¥{stats['price_range']['min']} - ¥{stats['price_range']['max']}")
    
    print("\n分类分布:")
    for category, info in stats['categories'].items():
        print(f"  {category}: {info['count']} 个资源, 平均下载量: {info['avg_downloads']:.1f}")
    
    print("\n难度分布:")
    for difficulty, count in stats['difficulty_distribution'].items():
        print(f"  {difficulty}: {count} 个资源")
    
    # 相似资源推荐
    print("\n7. 相似资源推荐")
    print("-" * 30)
    
    target_resource = "高达:独角兽(双形态)"
    print(f"查找与 '{target_resource}' 相似的资源:")
    
    similar_resources = manager.find_similar_resources(target_resource, n_results=3)
    for i, similar in enumerate(similar_resources, 1):
        print(f"  {i}. {similar['name']} - 相似度: {similar['similarity']:.3f}")
        print(f"     {similar['description'][:50]}...")
    
    # 搜索分析
    print("\n8. 搜索行为分析")
    print("-" * 30)
    
    analytics = manager.get_search_analytics()
    print(f"总搜索次数: {analytics['total_searches']}")
    print(f"平均搜索结果数: {analytics['avg_results_per_search']:.1f}")
    
    print("\n热门搜索词:")
    for query, count in list(analytics['popular_queries'].items())[:5]:
        print(f"  '{query}': {count} 次")
    
    print("\n分类搜索分布:")
    for category, count in analytics['category_searches'].items():
        print(f"  {category}: {count} 次")
    
    print("\n" + "=" * 60)
    print("游戏资源管理系统演示完成!")
    print("=" * 60)

if __name__ == "__main__":
    demo_game_resource_system() 
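
The code above reports `1 - distance` as a similarity score. That value equals cosine similarity only when the collection is configured with the cosine space. Chroma's documented default is squared L2, which is unbounded above, so `1 - distance` can go negative. The sketch below shows one possible way to map each distance to a "higher is better" score. `distance_to_similarity` is a hypothetical helper, not part of ChromaDB, and the `1 / (1 + d)` mapping for L2 is an arbitrary monotone choice assumed here for illustration.

```python
def distance_to_similarity(distance: float, space: str = "l2") -> float:
    """Map a Chroma query distance to a 'higher is better' score.

    Hypothetical helper for illustration; `space` must match the
    distance function the collection was created with.
    """
    if space in ("cosine", "ip"):
        # Chroma documents both as 1 - (cosine similarity / inner product),
        # so subtracting from 1 recovers the underlying similarity.
        return 1.0 - distance
    if space == "l2":
        # Squared L2 is unbounded above; 1 / (1 + d) is an assumed
        # monotone mapping into (0, 1], not an official conversion.
        return 1.0 / (1.0 + distance)
    raise ValueError(f"unknown distance space: {space!r}")


print(distance_to_similarity(0.25, "cosine"))  # 0.75
print(distance_to_similarity(3.0, "l2"))       # 0.25
```

If the collection is created with cosine distance, the existing `1 - distance` expressions can stay as they are; otherwise a mapping like this keeps the weighted recommendation score within a predictable range.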

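6. Application Value and Deployment Recommendations

  • Search efficiency: semantic retrieval combined with metadata filtering locates target content quickly
  • Personalization: recommendations weigh similarity, rating, and download count together
  • Operability: search logs and aggregate statistics supply the raw data for a dashboard
  • Extensibility: collections are split by domain, leaving room for custom embeddings or multimodal vectors
  • Persistence: data survives restarts, which makes production deployment more robust

To expose the system as a service, consider wrapping it in a REST or GraphQL API and adding authentication, rate limiting, and monitoring with alerting.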
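
As a rough illustration of the authentication and rate-limiting layer mentioned above, the sketch below wraps a search function in a framework-agnostic handler. Everything in it is an assumption made for the example: the `X-API-Key` header name, the `VALID_API_KEYS` set, the `SlidingWindowLimiter` class, and the `handle_search` function are hypothetical, and `fake_search` stands in for `GameResourceManager.search_resources`. A real service would sit behind a web framework and load its keys from secure storage.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Dict, List, Optional, Tuple

VALID_API_KEYS = {"demo-key"}  # assumption: real keys come from secure config


class SlidingWindowLimiter:
    """Allow at most `limit` calls per client within `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self._calls: Dict[str, deque] = defaultdict(deque)

    def allow(self, client: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._calls[client]
        # Drop timestamps that have fallen out of the window
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.limit:
            return False
        calls.append(now)
        return True


def handle_search(headers: Dict[str, str], query: str,
                  search_fn: Callable[[str], List[dict]],
                  limiter: SlidingWindowLimiter) -> Tuple[int, dict]:
    """Return (HTTP-style status code, JSON-serializable body)."""
    api_key = headers.get("X-API-Key", "")
    if api_key not in VALID_API_KEYS:
        return 401, {"error": "invalid API key"}
    if not limiter.allow(api_key):
        return 429, {"error": "rate limit exceeded"}
    return 200, {"results": search_fn(query)}


def fake_search(query: str) -> List[dict]:
    # Stand-in for manager.search_resources(query)
    return [{"name": f"result for {query}"}]


limiter = SlidingWindowLimiter(limit=2, window=60.0)
headers = {"X-API-Key": "demo-key"}
for _ in range(3):
    status, body = handle_search(headers, "mecha", fake_search, limiter)
    print(status)  # 200, 200, then 429
print(handle_search({}, "mecha", fake_search, limiter)[0])  # 401
```

Keeping the handler independent of any web framework means the same checks can be reused from a REST route or a GraphQL resolver, and the limiter can be tested by passing explicit `now` values.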