Installing and Getting Started with the Chroma Vector Database

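Goal

Get a first working grasp of the Chroma vector database, including create, read, update, and delete operations and custom embedding models. Chroma can be used in two ways, Client-Server Mode and Chroma Clients; this introduction uses the Chroma Clients mode.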

Chroma Version

1.5.5

Official Site

https://docs.trychroma.com/docs/overview/getting-started

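Installing Chroma

Step 1: Install the Chroma vector database. If you cannot reach PyPI directly, you can install from a mirror hosted in mainland China.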

pip install chromadb -i https://mirrors.aliyun.com/pypi/simple/
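
To confirm which version ended up installed (this walkthrough was written against 1.5.5), a small standard-library check can help. This is only a sketch: the helper name `installed_version` is made up for illustration, while `importlib.metadata.version` is part of the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("chromadb")
    print(found if found else "chromadb is not installed")
```

If the printed version differs from 1.5.5, some behavior described here may differ as well.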

Hands-On

Minimal Example

Step 1: Create a collection.

import chromadb

chroma_client = chromadb.Client()
# Create a collection
collection = chroma_client.create_collection(name="first_collection")

Step 2: Add data to the collection.

# Chroma stores the text and handles embedding and indexing automatically
# The eight documents below map one-to-one to the eight IDs
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "我喜欢看金庸的武侠小说。",
        "今天的工作任务很多。",
        "人工智能非常难学。",
        "凡人修仙传动画片很好看。",
        "今天的股票大涨。",
        "国际油价持续上涨。",
        "金价持续上涨。",
        "乔丹是NBA当之无愧的第一人。"
    ]
)

Step 3: Query the collection for related data. Note: the first time you use Chroma, the program downloads and installs the all-MiniLM-L6-v2 embedding model.

# Query the collection with a list of query texts; Chroma returns the n most similar results.
results = collection.query(
    query_texts=["经济基础决定上层建筑。"],
    n_results=3
)
print(results)
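
The printed result is column-oriented: each field holds one inner list per query text. A minimal sketch of turning that shape into per-row dictionaries follows. The helper `flatten_query_results` and the `sample` dictionary are hypothetical; the IDs, ordering, and distances in `sample` are made up and are not the output of the query above. The sketch also assumes that fields which were not returned appear as `None`.

```python
def flatten_query_results(results):
    """Turn column-wise, per-query result lists into one list of row dicts per query."""
    rows_per_query = []
    for q, ids in enumerate(results["ids"]):
        rows = []
        for i, doc_id in enumerate(ids):
            row = {"id": doc_id}
            for key in ("documents", "distances", "metadatas"):
                column = results.get(key)
                if column is not None:  # assumption: fields not returned are None
                    row[key] = column[q][i]
            rows.append(row)
        rows_per_query.append(rows)
    return rows_per_query


# Hand-written stand-in for a query result; ordering and distances are invented.
sample = {
    "ids": [["123460", "123462"]],
    "documents": [["今天的股票大涨。", "金价持续上涨。"]],
    "distances": [[0.9, 1.1]],
    "metadatas": None,
}

for row in flatten_query_results(sample)[0]:
    print(row["id"], row["distances"], row["documents"])
```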

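Adding Data

Adding Metadata (metadatas)

A vector database does not store only vectors; metadata matters too. Vectors only solve similarity retrieval, while metadata makes retrieval controllable from a business standpoint. Typical uses of metadata include: document source, business category, time information, permission and security control, and restricting the retrieval scope (so that different users or companies asking the same question get different results). In the code below, the metadata explicitly records which role each piece of data is intended for.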

import chromadb

chroma_client = chromadb.Client()
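# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents below map one-to-one to the eight IDs and eight metadata entries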
collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则：电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则：理财产品需T+2日到账，提前赎回可能产生手续费。",
        "员工请假制度：年假需提前3天申请，病假需提供医疗证明。",
        "服务器故障处理流程：P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策：VIP用户可享受优先退款审核通道，24小时内处理。",
        "新加坡地区税务说明：GST税率为9%，适用于所有消费类交易。",
        "数据访问权限说明：敏感数据仅限admin角色访问，普通用户无权限。",
        "物流配送规则：标准配送3-5天，加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)
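
Metadata pays off when it is used as a filter: in Chroma the filter is passed as the `where` argument, and operators such as `$and` and `$or` combine clauses. The following is a plain-Python mental model of that matching, not Chroma's implementation. The function `matches` is hypothetical and models only equality, `$and`, and `$or`; the records are a trimmed copy of the metadata above.

```python
def matches(metadata, where):
    """Return True if `metadata` satisfies a simplified where filter."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    # A plain {"key": value} clause is treated as an equality check.
    return all(metadata.get(key) == value for key, value in where.items())


# Trimmed copy of the metadata from the example above (two fields per record).
records = {
    "123456": {"tenant_id": "shop_A", "permission": "user"},
    "123457": {"tenant_id": "shop_A", "permission": "user"},
    "123458": {"tenant_id": "company_hr", "permission": "employee"},
    "123459": {"tenant_id": "company_ops", "permission": "engineer"},
    "123460": {"tenant_id": "shop_A", "permission": "vip"},
    "123461": {"tenant_id": "gov_sg", "permission": "public"},
    "123462": {"tenant_id": "company_it", "permission": "admin"},
    "123463": {"tenant_id": "shop_A", "permission": "user"},
}

where = {"$and": [{"tenant_id": "shop_A"}, {"permission": "user"}]}
visible = [doc_id for doc_id, meta in records.items() if matches(meta, where)]
print(visible)
```

Only records that pass the filter would then compete in the similarity ranking, which is how the same question can yield different answers for different tenants or roles.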

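Adding Externally Computed Vectors

**An add operation must supply documents, vectors, or both (if the documents are stored elsewhere or are very large, adding only embeddings and metadata is recommended; that is not demonstrated here).** Metadata is optional. When only documents are supplied, Chroma uses the collection's embedding function to generate the vectors, as the minimal example above already showed. If the vectors for the documents were obtained by some other means, they can be saved into Chroma directly; in that case Chroma's embedding model does not regenerate the vectors and overwrite them.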

import chromadb

chroma_client = chromadb.Client()
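# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# The documents are stored as-is; the embeddings are supplied manually below
# The eight documents map one-to-one to the eight IDs and eight embeddings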
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "我喜欢看金庸的武侠小说。",
        "今天的工作任务很多。",
        "人工智能非常难学。",
        "凡人修仙传动画片很好看。",
        "今天的股票大涨。",
        "国际油价持续上涨。",
        "金价持续上涨。",
        "乔丹是NBA当之无愧的第一人。"
    ],
    embeddings=[
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2],
        [4.6, 6.3, 4.4],
        [2.0, 3.1, 5.6],
        [7.2, 1.4, 0.9],
        [3.3, 8.8, 2.1],
        [5.5, 2.2, 9.7]
    ]
)
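# Query the collection; Chroma returns the n most similar results.
results = collection.query(
    # The collection holds manually supplied 3-dimensional embeddings.
    # With query_texts, Chroma would embed the query with the default
    # embedding_function (384 dimensions), causing a dimension-mismatch error.
    # So query_embeddings is used to pass a vector of the same dimension (3) directly.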
    query_embeddings=[[0.1, 0.2, 0.3]],
    n_results=3,
    include=["embeddings", "documents", "distances"]
)
print(results)
# Vector dimension
print(len(results["embeddings"][0][0]))
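
To make the dimension requirement and the ranking concrete, here is a brute-force sketch over the same 3-dimensional vectors. It uses squared Euclidean distance, which is assumed here to correspond to Chroma's default `l2` space; Chroma itself searches through an approximate HNSW index rather than a full scan, so this is a mental model only. The helper names `squared_l2` and `top_n` are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance; refuses vectors of different dimensions."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_n(query, vectors, n):
    """Return the n (id, distance) pairs closest to `query`."""
    scored = [(squared_l2(query, vec), doc_id) for doc_id, vec in vectors.items()]
    # Sort by distance only; Python's sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0])
    return [(doc_id, dist) for dist, doc_id in scored[:n]]


vectors = {
    "123456": [1.1, 2.3, 3.2],
    "123457": [4.5, 6.9, 4.4],
    "123458": [1.1, 2.3, 3.2],
    "123459": [4.6, 6.3, 4.4],
    "123460": [2.0, 3.1, 5.6],
    "123461": [7.2, 1.4, 0.9],
    "123462": [3.3, 8.8, 2.1],
    "123463": [5.5, 2.2, 9.7],
}

nearest = top_n([0.1, 0.2, 0.3], vectors, 3)
print(nearest)
```

Passing a 384-dimensional query to `squared_l2` against these vectors raises `ValueError`, which mirrors why `query_texts` cannot be used with this collection.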

Deleting Data

Deleting by ID

import chromadb

chroma_client = chromadb.Client()
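# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents below map one-to-one to the eight IDs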
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "我喜欢看金庸的武侠小说。",
        "今天的工作任务很多。",
        "人工智能非常难学。",
        "凡人修仙传动画片很好看。",
        "今天的股票大涨。",
        "国际油价持续上涨。",
        "湖人总冠军。",
        "乔丹是NBA当之无愧的第一人。"
    ]
)
# Query the collection with a list of query texts; Chroma returns the n most similar results.
results = collection.query(
    query_texts=["谁是NBA第一人。"],
    n_results=3
)
print(results)

collection.delete(
    ids=["123463"],
)
results = collection.query(
    query_texts=["谁是NBA第一人。"],
    n_results=3
)
print(results)

Deleting with a where Filter

import chromadb

chroma_client = chromadb.Client()
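# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents below map one-to-one to the eight IDs and eight metadata entries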
collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则：电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则：理财产品需T+2日到账，提前赎回可能产生手续费。",
        "员工请假制度：年假需提前3天申请，病假需提供医疗证明。",
        "服务器故障处理流程：P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策：VIP用户可享受优先退款审核通道，24小时内处理。",
        "新加坡地区税务说明：GST税率为9%，适用于所有消费类交易。",
        "数据访问权限说明：敏感数据仅限admin角色访问，普通用户无权限。",
        "物流配送规则：标准配送3-5天，加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)
results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1
)
print(results)

collection.delete(
    where={
        "tenant_id": "shop_A",
    }
)

results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1
)
print(results)
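
Which records survive the delete can be worked out from the metadata alone. The sketch below lists the surviving IDs in plain Python; `ids_after_delete` is a hypothetical helper, and the mapping is copied from the `tenant_id` values above.

```python
# tenant_id of each record, copied from the metadata in the example above
tenant_by_id = {
    "123456": "shop_A",
    "123457": "shop_A",
    "123458": "company_hr",
    "123459": "company_ops",
    "123460": "shop_A",
    "123461": "gov_sg",
    "123462": "company_it",
    "123463": "shop_A",
}


def ids_after_delete(tenants, deleted_tenant):
    """IDs that remain once every record of `deleted_tenant` is removed."""
    return [doc_id for doc_id, tenant in tenants.items() if tenant != deleted_tenant]


remaining = ids_after_delete(tenant_by_id, "shop_A")
print(remaining)
```

Because the refund rule (ID 123456) belongs to shop_A, the second query can only return the closest of the four remaining documents. On a real collection, `collection.count()` is one way to check that the number of records dropped as expected.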

Updating Data

upsert overwrites a record if its ID already exists and inserts it otherwise; update only modifies existing records.
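
The difference can be modeled with an ordinary dictionary. This is only a sketch of the semantics stated above: the functions `upsert` and `update` here are stand-ins operating on a dict, and how Chroma reports an update to a missing ID is not modeled.

```python
def upsert(store, doc_id, document):
    """Overwrite the record if the ID exists, insert it otherwise."""
    store[doc_id] = document


def update(store, doc_id, document):
    """Modify an existing record only; return False if the ID is unknown."""
    if doc_id not in store:
        return False
    store[doc_id] = document
    return True


store = {"123456": "退款规则：电商订单在7天内可申请无理由退款。"}

upsert(store, "123456", "退款规则：电商订单在30天内可申请无理由退款。")  # overwrites
upsert(store, "888888", "游戏很好玩。")                                  # inserts
changed = update(store, "999999", "张三的爱好是打篮球。")               # no such ID
print(sorted(store), changed)
```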

The upsert Method

import chromadb

chroma_client = chromadb.Client()
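# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents below map one-to-one to the eight IDs and eight metadata entries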
collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则：电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则：理财产品需T+2日到账，提前赎回可能产生手续费。",
        "员工请假制度：年假需提前3天申请，病假需提供医疗证明。",
        "服务器故障处理流程：P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策：VIP用户可享受优先退款审核通道，24小时内处理。",
        "新加坡地区税务说明：GST税率为9%，适用于所有消费类交易。",
        "数据访问权限说明：敏感数据仅限admin角色访问，普通用户无权限。",
        "物流配送规则：标准配送3-5天，加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)
results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

collection.upsert(
    ids=["123456", ],
    documents=["退款规则：电商订单在30天内可申请无理由退款。", ],
)

results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

collection.upsert(
    ids=["888888", ],
    documents=["游戏很好玩。", ],
)

results = collection.query(
    query_texts=["游戏好不好玩？"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

The update Method

import chromadb

chroma_client = chromadb.Client()
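# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents below map one-to-one to the eight IDs and eight metadata entries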
collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则：电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则：理财产品需T+2日到账，提前赎回可能产生手续费。",
        "员工请假制度：年假需提前3天申请，病假需提供医疗证明。",
        "服务器故障处理流程：P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策：VIP用户可享受优先退款审核通道，24小时内处理。",
        "新加坡地区税务说明：GST税率为9%，适用于所有消费类交易。",
        "数据访问权限说明：敏感数据仅限admin角色访问，普通用户无权限。",
        "物流配送规则：标准配送3-5天，加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)
results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

collection.update(
    ids=["123456", ],
    documents=["退款规则：电商订单在30天内可申请无理由退款。", ],
)

results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

collection.update(
    ids=["888888", ],
    documents=["张三的爱好是打篮球。", ],
)

results = collection.query(
    query_texts=["张三的爱好是什么？"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

Custom Embedding Model

Adapter Class

import requests

class MyOllamaEmbeddingFunction:
    def __init__(self, model="qwen3-embedding:8b"):
        self.model = model
        self.url = "http://localhost:11434/api/embeddings"
        self.session = requests.Session()

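    def _embed(self, texts):
        # Send one request per text and collect the returned vectors in order
        embeddings = []
        for text in texts:
            res = self.session.post(self.url, json={
                "model": self.model,
                "prompt": text
            })
            res.raise_for_status()  # fail loudly on HTTP errors
            embeddings.append(res.json()["embedding"])
        return embeddings

    # Intended for add
    def embed_documents(self, input):
        return self._embed(input)

    # Intended for query
    def embed_query(self, input):
        return self._embed(input)

    # Callable form: Chroma's EmbeddingFunction protocol is defined around __call__(self, input)
    def __call__(self, input):
        return self._embed(input)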
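
When no Ollama server is running, the wiring can still be exercised with a stand-in that exposes the same callable shape. The class below is a hypothetical test double: it derives deterministic numbers from a hash, so its vectors carry no semantic meaning and must not be used for real retrieval. Whether a given Chroma version accepts a plain class like this without subclassing its own base class is an assumption to verify.

```python
import hashlib


class FakeEmbeddingFunction:
    """Deterministic stand-in for an embedding model (testing only)."""

    def __init__(self, dimension=8):
        if not 1 <= dimension <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dimension must be between 1 and 32")
        self.dimension = dimension

    def _embed_one(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale each byte into [0, 1]; the same text always yields the same vector.
        return [digest[i] / 255.0 for i in range(self.dimension)]

    def __call__(self, input):
        return [self._embed_one(text) for text in input]


fake = FakeEmbeddingFunction(dimension=8)
vectors = fake(["请说一下退款规则", "游戏好不好玩？"])
print(len(vectors), len(vectors[0]))
```

Because every vector has the same fixed dimension, the double also makes it easy to see that add and query must share one embedding function.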

Test

import chromadb

from chroma_test.MyOllamaEmbeddingFunction import MyOllamaEmbeddingFunction

client = chromadb.Client()

embedding_fn = MyOllamaEmbeddingFunction()

collection = client.create_collection(
    name="rag_collection",
    embedding_function=embedding_fn
)

collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则：电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则：理财产品需T+2日到账，提前赎回可能产生手续费。",
        "员工请假制度：年假需提前3天申请，病假需提供医疗证明。",
        "服务器故障处理流程：P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策：VIP用户可享受优先退款审核通道，24小时内处理。",
        "新加坡地区税务说明：GST税率为9%，适用于所有消费类交易。",
        "数据访问权限说明：敏感数据仅限admin角色访问，普通用户无权限。",
        "物流配送规则：标准配送3-5天，加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)

results = collection.query(
    query_texts=["请说一下退款规则"],
    n_results=1,
    include=["embeddings", "documents", "distances"]
)

print(results)
# Vector dimension
print(len(results["embeddings"][0][0]))

# Check whether modifications use Chroma's default embedding model or our custom model
collection.upsert(
    ids=["888888", ],
    documents=["游戏很好玩。", ],
)
results = collection.query(
    query_texts=["游戏好不好玩？"],
    n_results=1,
    include=["embeddings", "documents", "distances"]
)
print(results)
print(len(results["embeddings"][0][0]))

collection.update(
    ids=["888888", ],
    documents=["游戏不好玩。", ],
)
results = collection.query(
    query_texts=["游戏好不好玩？"],
    n_results=1,
    include=["embeddings", "documents", "distances"]
)
print(results)
print(len(results["embeddings"][0][0]))

Verification Results