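Installing and Getting Started with the Chroma Vector Database

Contents

Goal

Chroma version

Official site

Installing Chroma

Hands-on

Minimal example

Adding data

Adding metadata (metadatas)

Adding external vectors

Deleting data

Deleting by ID

Deleting by where

Modifying data

The upsert method

The update method

Custom embedding model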


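Goal

Get a working grasp of the Chroma vector database: adding, deleting, updating, and querying data, plus plugging in a custom embedding model. Chroma can be used in two ways, Client-Server Mode and Chroma Clients; this walkthrough uses the Chroma Clients mode, where the client runs inside your own Python process, as the entry point.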


Chroma version

1.5.5


Official site

https://docs.trychroma.com/docs/overview/getting-started


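Installing Chroma

Step 1: install the Chroma vector database. If PyPI is slow or unreachable from your network, you can install from a mirror in China instead.

bash
pip install chromadb -i https://mirrors.aliyun.com/pypi/simple/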

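Hands-on

Minimal example

Step 1: create a collection.

python
import chromadb

# chromadb.Client() creates an in-memory client; data is lost when the process exits
chroma_client = chromadb.Client()
# Create a collection
collection = chroma_client.create_collection(name="first_collection")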

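Step 2: add data to the collection.

python
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents below map one-to-one to the eight IDs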
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "我喜欢看金庸的武侠小说。",
        "今天的工作任务很多。",
        "人工智能非常难学。",
        "凡人修仙传动画片很好看。",
        "今天的股票大涨。",
        "国际油价持续上涨。",
        "金价持续上涨。",
        "乔丹是NBA当之无愧的第一人。"
    ]
)

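Step 3: query the collection for related data. Note that the first time Chroma needs to embed text with its default embedding function, it downloads the all-MiniLM-L6-v2 embedding model, so the first run needs network access and takes longer.

python
# Query the collection with a list of query texts; Chroma returns the n most similar results.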
results = collection.query(
    query_texts=["经济基础决定上层建筑。"],
    n_results=3
)
print(results)
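
The dictionary returned by `query` holds one inner list per query text, which is easy to trip over when reading the printed output. The sketch below flattens that shape into rows. Note the assumptions: `iter_hits` is a hypothetical helper name, and the `sample` dictionary is hand-written to mimic the shape of a query result; its distances are made-up placeholders, not real output.

```python
def iter_hits(results):
    """Yield (query_index, id, document, distance) for every hit in a query result."""
    for q_idx, ids in enumerate(results["ids"]):
        documents = results["documents"][q_idx]
        distances = results["distances"][q_idx]
        for doc_id, document, distance in zip(ids, documents, distances):
            yield q_idx, doc_id, document, distance


# Hand-written stand-in for a query result: one query text, three hits.
# The distance values are placeholders chosen for illustration only.
sample = {
    "ids": [["123461", "123460", "123462"]],
    "documents": [["国际油价持续上涨。", "今天的股票大涨。", "金价持续上涨。"]],
    "distances": [[0.9, 1.0, 1.1]],
}

hits = list(iter_hits(sample))
for q_idx, doc_id, document, distance in hits:
    print(f"query {q_idx}: id={doc_id} distance={distance:.2f} document={document}")
```

With a real result, pass the value returned by `collection.query(...)` instead of `sample`; the outer index corresponds to the position of each text in `query_texts`.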

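Adding data

Adding metadata (metadatas)

A vector database does not store vectors alone; metadata matters just as much. Vectors only solve similarity search, while metadata makes retrieval controllable from a business standpoint. Typical uses include recording the document source, business category, and time information, enforcing permissions and security controls, and constraining the search scope (so that different users or companies asking the same question get different results). In the code below, each document's metadata records which tenant and which role it belongs to, so a query can filter on those fields.

python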
import chromadb

chroma_client = chromadb.Client()
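# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents, IDs, and metadata entries below line up by position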
collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则:电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则:理财产品需T+2日到账,提前赎回可能产生手续费。",
        "员工请假制度:年假需提前3天申请,病假需提供医疗证明。",
        "服务器故障处理流程:P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策:VIP用户可享受优先退款审核通道,24小时内处理。",
        "新加坡地区税务说明:GST税率为9%,适用于所有消费类交易。",
        "数据访问权限说明:敏感数据仅限admin角色访问,普通用户无权限。",
        "物流配送规则:标准配送3-5天,加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)

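Adding external vectors

**An add operation must supply documents, embeddings, or both.** (If the documents are stored elsewhere or are very large, adding only embeddings and metadata is recommended; that case is not demonstrated here.) Metadata is optional. When only documents are supplied, Chroma uses the collection's embedding function to generate the vectors, as the minimal example above already showed. If you have computed the vectors for your documents by some other means, you can store them in Chroma directly; in that case Chroma's embedding model does not regenerate the vectors and overwrite them.

python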
import chromadb

chroma_client = chromadb.Client()
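# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# The embeddings are supplied explicitly below, so Chroma stores them as given instead of computing its own
# The eight documents, IDs, and embeddings line up by position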
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "我喜欢看金庸的武侠小说。",
        "今天的工作任务很多。",
        "人工智能非常难学。",
        "凡人修仙传动画片很好看。",
        "今天的股票大涨。",
        "国际油价持续上涨。",
        "金价持续上涨。",
        "乔丹是NBA当之无愧的第一人。"
    ],
    embeddings=[
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2],
        [4.6, 6.3, 4.4],
        [2.0, 3.1, 5.6],
        [7.2, 1.4, 0.9],
        [3.3, 8.8, 2.1],
        [5.5, 2.2, 9.7]
    ]
)
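# Query the collection with a query vector; Chroma returns the n most similar results.
results = collection.query(
    # The collection was populated with hand-written 3-dimensional embeddings.
    # With query_texts, Chroma would embed the query using the default embedding_function (384 dimensions),
    # which would fail with a dimension mismatch.
    # So query_embeddings is used here to pass a vector of the same dimension (3) directly.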
    query_embeddings=[[0.1, 0.2, 0.3]],
    n_results=3,
    include=["embeddings", "documents", "distances"]
)
print(results)
# Embedding dimension
print(len(results["embeddings"][0][0]))
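
To see why particular IDs come back from this query, the ranking can be reproduced by hand. The sketch below assumes the collection uses squared Euclidean (L2) distance, which is my understanding of Chroma's default metric; if the collection is configured for cosine or inner product, the numbers differ. `squared_l2` is a helper written for this illustration, not part of Chroma.

```python
# The same IDs and 3-dimensional vectors that were added above
stored = {
    "123456": [1.1, 2.3, 3.2],
    "123457": [4.5, 6.9, 4.4],
    "123458": [1.1, 2.3, 3.2],
    "123459": [4.6, 6.3, 4.4],
    "123460": [2.0, 3.1, 5.6],
    "123461": [7.2, 1.4, 0.9],
    "123462": [3.3, 8.8, 2.1],
    "123463": [5.5, 2.2, 9.7],
}
query = [0.1, 0.2, 0.3]


def squared_l2(a, b):
    """Sum of squared component differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Rank IDs from smallest to largest distance and keep the three closest
ranked = sorted(stored, key=lambda doc_id: squared_l2(stored[doc_id], query))
top3 = ranked[:3]
for doc_id in top3:
    print(doc_id, round(squared_l2(stored[doc_id], query), 2))
```

IDs 123456 and 123458 were stored with identical vectors, so they tie for first place, followed by 123460.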

Deleting data

Deleting by ID

python
import chromadb

chroma_client = chromadb.Client()
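# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents below map one-to-one to the eight IDs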
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "我喜欢看金庸的武侠小说。",
        "今天的工作任务很多。",
        "人工智能非常难学。",
        "凡人修仙传动画片很好看。",
        "今天的股票大涨。",
        "国际油价持续上涨。",
        "湖人总冠军。",
        "乔丹是NBA当之无愧的第一人。"
    ]
)
# Query the collection with a list of query texts; Chroma returns the n most similar results.
results = collection.query(
    query_texts=["谁是NBA第一人。"],
    n_results=3
)
print(results)

# Delete the record with ID 123463, then repeat the query to confirm it no longer appears
collection.delete(
    ids=["123463"],
)
results = collection.query(
    query_texts=["谁是NBA第一人。"],
    n_results=3
)
print(results)

Deleting by where

python
import chromadb

chroma_client = chromadb.Client()
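# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents, IDs, and metadata entries below line up by position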
collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则:电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则:理财产品需T+2日到账,提前赎回可能产生手续费。",
        "员工请假制度:年假需提前3天申请,病假需提供医疗证明。",
        "服务器故障处理流程:P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策:VIP用户可享受优先退款审核通道,24小时内处理。",
        "新加坡地区税务说明:GST税率为9%,适用于所有消费类交易。",
        "数据访问权限说明:敏感数据仅限admin角色访问,普通用户无权限。",
        "物流配送规则:标准配送3-5天,加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)
results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1
)
print(results)

# Delete every record whose metadata has tenant_id == "shop_A", then repeat the query
collection.delete(
    where={
        "tenant_id": "shop_A",
    }
)

results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1
)
print(results)

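Modifying data

upsert overwrites the record if the ID already exists and inserts it otherwise; update only modifies existing records.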

The upsert method

python
import chromadb

chroma_client = chromadb.Client()
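# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents, IDs, and metadata entries below line up by position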
collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则:电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则:理财产品需T+2日到账,提前赎回可能产生手续费。",
        "员工请假制度:年假需提前3天申请,病假需提供医疗证明。",
        "服务器故障处理流程:P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策:VIP用户可享受优先退款审核通道,24小时内处理。",
        "新加坡地区税务说明:GST税率为9%,适用于所有消费类交易。",
        "数据访问权限说明:敏感数据仅限admin角色访问,普通用户无权限。",
        "物流配送规则:标准配送3-5天,加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)
results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

# ID 123456 already exists, so upsert overwrites its document
collection.upsert(
    ids=["123456", ],
    documents=["退款规则:电商订单在30天内可申请无理由退款。", ],
)

results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

# ID 888888 does not exist yet, so upsert inserts a new record
collection.upsert(
    ids=["888888", ],
    documents=["游戏很好玩。", ],
)

results = collection.query(
    query_texts=["游戏好不好玩?"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

The update method

python
import chromadb

chroma_client = chromadb.Client()
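# Create a collection
collection = chroma_client.create_collection(name="first_collection")
# Chroma stores the text and handles embedding and indexing automatically
# The eight documents, IDs, and metadata entries below line up by position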
collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则:电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则:理财产品需T+2日到账,提前赎回可能产生手续费。",
        "员工请假制度:年假需提前3天申请,病假需提供医疗证明。",
        "服务器故障处理流程:P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策:VIP用户可享受优先退款审核通道,24小时内处理。",
        "新加坡地区税务说明:GST税率为9%,适用于所有消费类交易。",
        "数据访问权限说明:敏感数据仅限admin角色访问,普通用户无权限。",
        "物流配送规则:标准配送3-5天,加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)
results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

# ID 123456 exists, so update replaces its document
collection.update(
    ids=["123456", ],
    documents=["退款规则:电商订单在30天内可申请无理由退款。", ],
)

results = collection.query(
    query_texts=["电商订单在多少天内可申请无理由退款。"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

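# ID 888888 does not exist, so update has nothing to modify and no record should be added;
# the query below can only return the closest of the existing documents
collection.update(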
    ids=["888888", ],
    documents=["张三的爱好是打篮球。", ],
)

results = collection.query(
    query_texts=["张三的爱好是什么?"],
    n_results=1,
    include=["metadatas", "documents", "distances"]
)
print(results)

Custom embedding model

Adapter class

python
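import requests


class MyOllamaEmbeddingFunction:
    """Embeds text by calling the embeddings endpoint of a local Ollama server."""

    def __init__(self, model="qwen3-embedding:8b"):
        self.model = model
        self.url = "http://localhost:11434/api/embeddings"
        self.session = requests.Session()

    def _embed(self, texts):
        # One request per text; each response carries a single "embedding" list
        embeddings = []
        for text in texts:
            res = self.session.post(self.url, json={
                "model": self.model,
                "prompt": text
            })
            # Fail loudly on HTTP errors instead of hitting a KeyError below
            res.raise_for_status()
            embeddings.append(res.json()["embedding"])
        return embeddings

    # Intended for the add path
    def embed_documents(self, input):
        return self._embed(input)

    # Intended for the query path
    def embed_query(self, input):
        return self._embed(input)

    # Chroma invokes the embedding function by calling the object itself, so keep __call__
    def __call__(self, input):
        return self._embed(input)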
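
When an Ollama server is not available (for example in unit tests), the same calling convention can be exercised with a stand-in. The class below is purely hypothetical and its vectors carry no semantic meaning: it hashes each text into a fixed-length list of floats, which is enough to check wiring such as dimensions and call signatures. Whether Chroma accepts it as-is as an `embedding_function` depends on the interface requirements of the installed version.

```python
import hashlib


class FakeEmbeddingFunction:
    """Deterministic stand-in: list of texts in, list of equal-length float vectors out."""

    def __init__(self, dim=8):
        # A SHA-256 digest has 32 bytes, so dim must not exceed 32
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def __call__(self, input):
        vectors = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes to floats in the range [0, 1]
            vectors.append([byte / 255 for byte in digest[: self.dim]])
        return vectors


fake = FakeEmbeddingFunction(dim=8)
vectors = fake(["退款规则", "游戏很好玩。"])
print(len(vectors), len(vectors[0]))
```

The same text always maps to the same vector, which keeps test assertions stable across runs.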

Test

python
import chromadb

from chroma_test.MyOllamaEmbeddingFunction import MyOllamaEmbeddingFunction

client = chromadb.Client()

embedding_fn = MyOllamaEmbeddingFunction()

collection = client.create_collection(
    name="rag_collection",
    embedding_function=embedding_fn
)

collection.add(
    ids=[
        "123456", "123457", "123458", "123459",
        "123460", "123461", "123462", "123463"
    ],
    documents=[
        "退款规则:电商订单在7天内可申请无理由退款。",
        "金融产品赎回规则:理财产品需T+2日到账,提前赎回可能产生手续费。",
        "员工请假制度:年假需提前3天申请,病假需提供医疗证明。",
        "服务器故障处理流程:P1级故障需10分钟内响应并启动应急预案。",
        "VIP客户退款政策:VIP用户可享受优先退款审核通道,24小时内处理。",
        "新加坡地区税务说明:GST税率为9%,适用于所有消费类交易。",
        "数据访问权限说明:敏感数据仅限admin角色访问,普通用户无权限。",
        "物流配送规则:标准配送3-5天,加急配送24小时内送达。"
    ],
    metadatas=[
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "fintech",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        },
        {
            "tenant_id": "company_hr",
            "business_unit": "hr",
            "region": "global",
            "language": "zh",
            "doc_type": "handbook",
            "permission": "employee"
        },
        {
            "tenant_id": "company_ops",
            "business_unit": "devops",
            "region": "global",
            "language": "zh",
            "doc_type": "runbook",
            "permission": "engineer"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "ecommerce",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "vip"
        },
        {
            "tenant_id": "gov_sg",
            "business_unit": "tax",
            "region": "SG",
            "language": "zh",
            "doc_type": "regulation",
            "permission": "public"
        },
        {
            "tenant_id": "company_it",
            "business_unit": "security",
            "region": "global",
            "language": "zh",
            "doc_type": "policy",
            "permission": "admin"
        },
        {
            "tenant_id": "shop_A",
            "business_unit": "logistics",
            "region": "SG",
            "language": "zh",
            "doc_type": "policy",
            "permission": "user"
        }
    ]
)

results = collection.query(
    query_texts=["请说一下退款规则"],
    n_results=1,
    include=["embeddings", "documents", "distances"]
)

print(results)
# Embedding dimension
print(len(results["embeddings"][0][0]))

# Check whether write operations (upsert, update) use Chroma's default embedding model or our custom one
collection.upsert(
    ids=["888888", ],
    documents=["游戏很好玩。", ],
)
results = collection.query(
    query_texts=["游戏好不好玩?"],
    n_results=1,
    include=["embeddings", "documents", "distances"]
)
print(results)
print(len(results["embeddings"][0][0]))

collection.update(
    ids=["888888", ],
    documents=["游戏不好玩。", ],
)
results = collection.query(
    query_texts=["游戏好不好玩?"],
    n_results=1,
    include=["embeddings", "documents", "distances"]
)
print(results)
print(len(results["embeddings"][0][0]))

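Verification results

Each run above prints the dimension of the returned embedding. The default model produces 384-dimensional vectors, so if all three printed dimensions instead match the custom model's output size, add, upsert, and update all went through the custom embedding function.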
