1. pgvector Core Principles and Architecture
1.1 What is pgvector?
pgvector is an open-source PostgreSQL extension that adds vector similarity search to the world's most advanced open-source relational database. Unlike Pinecone (SaaS) or Milvus (a standalone system), pgvector is fully integrated into PostgreSQL, reusing its storage engine, transaction machinery, replication architecture, and ecosystem tooling.
Core positioning:
- Zero additional infrastructure: enable it on an existing PostgreSQL instance; nothing new to deploy
- ACID guarantees: vector operations share the same transaction boundary as ordinary data
- Native SQL: vector operations use standard SQL; no new query language to learn
- Ecosystem reuse: pgAdmin, DBeaver, connection pools, and backup tools all work unchanged
Version evolution:
- 0.1.x-0.4.x: core functionality, ivfflat index
- 0.5.x: HNSW index introduced, large performance gains
- 0.6.x: parallel HNSW builds and other performance work
- 0.7.x+: halfvec and sparsevec types, quantization support, iterative index scans, production-grade stability
1.2 Architecture: the PostgreSQL Extension Mechanism
1.2.1 Postgres extension architecture
PostgreSQL's **extension** mechanism adds functionality without modifying the core code:
┌─────────────────────────────────────────┐
│        PostgreSQL core process          │
│ ┌─────────┐  ┌─────────┐  ┌─────────┐   │
│ │ Parser  │  │ Planner │  │ Executor│   │
│ └────┬────┘  └────┬────┘  └────┬────┘   │
│      │            │            │        │
│      └────────────┼────────────┘        │
│                   ▼                     │
│            ┌─────────────┐              │
│            │  Extension  │              │
│            │   Manager   │              │
│            └──────┬──────┘              │
└───────────────────┼─────────────────────┘
                    │
         ┌──────────┼──────────┐
         ▼          ▼          ▼
    ┌─────────┐ ┌─────────┐ ┌─────────────┐
    │ PostGIS │ │pgvector │ │ TimescaleDB │
    │(spatial)│ │(vectors)│ │(time-series)│
    └─────────┘ └─────────┘ └─────────────┘
pgvector's integration points:
- New data types: vector(dim), halfvec(dim), sparsevec(dim)
- New operators: <-> (L2 distance), <=> (cosine distance), <#> (negative inner product)
- New index types: ivfflat, hnsw (built on Postgres's generic index access method interface)
- New functions: vector_norm, vector_dims, l2_distance, and others
1.2.2 Vector storage format
pgvector stores vectors through PostgreSQL's ordinary varlena and TOAST (The Oversized-Attribute Storage Technique) machinery:
In memory (during processing):
[0.23, -0.56, 0.89, ..., 0.12] ← float4 array, 4 bytes per dimension
On disk:
┌────────────────────────────────────────────────────────────────┐
│ varlena header (1-4 bytes) │ dims (2 bytes) │ unused (2 bytes) │ float4[] data │
└────────────────────────────────────────────────────────────────┘
        │
        ▼
TOAST compression / out-of-line storage (for values over ~2KB)
        │
        ▼
Disk pages (8KB by default)
Storage optimizations:
- halfvec: 2 bytes per dimension (float16); 50% storage savings for a small precision loss
- sparsevec: sparse storage; only non-zero dimensions are kept (similar to CSR format)
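To make those savings concrete, here is a quick back-of-the-envelope calculation. The header layout is an approximation of the on-disk struct described above; real sizes also depend on row headers, alignment, and TOAST compression:

```python
def dense_bytes(dims: int, bytes_per_dim: int = 4) -> int:
    """Approximate payload of a dense vector: varlena header (4) + dims (2) + unused (2) + data."""
    return 4 + 2 + 2 + dims * bytes_per_dim

def sparse_bytes(nnz: int) -> int:
    """Approximate payload of a sparsevec: header + dim/nnz counts + (int32 index + float4 value) per non-zero."""
    return 4 + 4 + 4 + nnz * (4 + 4)

# A 1536-dim OpenAI-style embedding:
full = dense_bytes(1536)       # vector:  ~6152 bytes
half = dense_bytes(1536, 2)    # halfvec: ~3080 bytes, roughly 50% smaller
# A 100k-dim keyword-weight vector with only 50 non-zero terms:
sparse = sparse_bytes(50)      # sparsevec: ~412 bytes vs ~6KB per 1536-dim dense row
print(full, half, sparse)
```

At a million rows, the halfvec variant saves on the order of 3GB of table (and index) storage, which is why it is worth considering once recall tests confirm the precision loss is acceptable.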
1.3 Index Algorithms in Detail
pgvector implements two mainstream ANN indexes on top of PostgreSQL's index access method interface:
1.3.1 IVFFlat index
sql
-- Create an IVFFlat index
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- number of cluster centroids
How it works:
Build phase:
1. Sample existing rows and run K-Means to produce `lists` centroids
2. Assign each vector to its nearest centroid (inverted lists)
3. Store each list's vectors as an independent array
Query phase:
1. Find the `probes` centroids nearest to the query vector (default 1)
2. Brute-force search within those centroids' lists
3. Merge results and return the top_k
Parameter tuning:
| Parameter | Role | Suggested value |
|---|---|---|
| lists | Number of cluster centroids | rows/1000 up to ~1M rows, sqrt(rows) beyond that (pgvector's documented heuristic) |
| probes | Number of lists scanned at query time | 1-10+; higher means higher recall at the cost of speed |
Characteristics:
- Fast to build; suited to bulk-load-then-query workloads
- Moderate query speed; recall depends on probes
- Centroids are fixed at build time, so heavy updates degrade the index (rebuild periodically)
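The two phases above can be sketched in a few lines of NumPy. This is a toy illustration of the algorithm, not pgvector's C implementation; the centroids come from a deliberately minimal k-means:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(data, k, iters=10):
    """Build phase, step 1: tiny k-means returning (centroids, assignment)."""
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assign every vector to its nearest centroid (the inverted lists)
        assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = data[assign == c].mean(axis=0)
    return centroids, assign

def ivf_search(query, data, centroids, assign, probes=1, top_k=5):
    """Query phase: probe the nearest `probes` lists, brute-force inside them."""
    nearest_lists = np.argsort(np.linalg.norm(centroids - query, axis=1))[:probes]
    candidates = np.where(np.isin(assign, nearest_lists))[0]
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:top_k]]

data = rng.normal(size=(1000, 32)).astype(np.float32)
centroids, assign = kmeans(data, k=16)
query = rng.normal(size=32).astype(np.float32)
approx = ivf_search(query, data, centroids, assign, probes=4)
exact = np.argsort(np.linalg.norm(data - query, axis=1))[:5]
print(set(approx) & set(exact))  # overlap grows as probes increases
```

Setting probes equal to lists degenerates into exact brute-force search, which is exactly the recall/speed dial the `probes` parameter exposes.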
1.3.2 HNSW index (recommended)
sql
-- Create an HNSW index (pgvector 0.5.0+)
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
How it works:
Hierarchical Navigable Small World graph:
- Layer 0: a densely connected base graph containing all nodes
- Upper layers: each node is promoted upward with exponentially decaying probability, giving roughly log(N) layers
Build phase:
1. Insert nodes one at a time, computing distances to existing nodes
2. At each layer, connect the new node to its nearest neighbors (up to m edges)
3. Prune edges to keep the graph navigable and connected
Query phase:
1. Start a greedy search from the entry point in the top layer
2. After reaching a local optimum, descend to the next layer
3. Search the bottom layer in detail, with ef controlling the beam width
Parameter tuning:
| Parameter | Role | Suggested value |
|---|---|---|
| m | Maximum edges per node | 8-64; larger means a denser graph and slower builds |
| ef_construction | Candidate-list size during build | 64-200; larger means higher graph quality |
| hnsw.ef_search | Candidate-list size at query time | >= top_k; larger means higher recall |
Characteristics:
- Very fast queries with high recall (95%+)
- Supports incremental inserts (real-time updates)
- Slower to build; high memory footprint (roughly 2x the raw vector data)
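The exponentially decaying layer assignment mentioned above is simple to state: each node's top layer is floor(-ln(U) * mL), with U uniform in (0,1) and mL = 1/ln(M). A quick simulation (illustrative only):

```python
import math
import random

def hnsw_level(m: int, rng: random.Random) -> int:
    """Draw a node's top layer: floor(-ln(U) / ln(M)). Most nodes stay on layer 0."""
    ml = 1.0 / math.log(m)
    return int(-math.log(rng.random()) * ml)

rng = random.Random(42)
levels = [hnsw_level(16, rng) for _ in range(100_000)]
# With M=16, about 1/16 of nodes reach layer >= 1, 1/256 reach layer >= 2, ...
frac_l1 = sum(l >= 1 for l in levels) / len(levels)
print(round(frac_l1, 3))  # close to 1/16 ≈ 0.0625
```

This geometric thinning is what gives the upper layers their "express lane" role: a search crosses the dataset in a few long hops up top, then refines locally at layer 0.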
1.3.3 Index operator classes
sql
-- Choose the operator class to match your distance metric
CREATE INDEX ON items USING hnsw (
    embedding vector_cosine_ops  -- cosine distance (typical for text)
    -- embedding vector_l2_ops   -- Euclidean distance (typical for images)
    -- embedding vector_ip_ops   -- inner product (typical for recommenders)
    -- embedding bit_hamming_ops -- Hamming distance (binary bit vectors)
);
1.4 Query Execution Flow
sql
-- Example query
SELECT id, content, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM documents
WHERE category = 'tech' AND published > '2024-01-01'
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;
Execution plan:
┌─────────────────────────────────────────────────────────┐
│ Limit  (cost=100.00..100.10 rows=10)                    │
│   ->  Index Scan using documents_embedding_idx          │
│         Order By: (embedding <=> '[...]'::vector)       │
│         Filter: ((category = 'tech') AND (published >   │
│                 '2024-01-01'::date))                    │
└─────────────────────────────────────────────────────────┘
Planner decisions:
- If the category predicate is highly selective, the planner may scan a B-Tree index first and sort the surviving rows by distance
- If few rows survive the distance condition, it drives the scan through the HNSW index instead
- The LIMIT is pushed down into the index scan, avoiding a full-table scan
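The pre-filter vs. post-filter choice above comes down to a cost comparison. The following toy model is purely illustrative (Postgres's real cost model accounts for far more: page I/O, correlation, memory), but it captures the shape of the decision:

```python
def prefilter_cost(n_rows: int, selectivity: float, dist_cost: float = 1.0) -> float:
    """Filter on the B-Tree first, then compute a distance for every surviving row."""
    survivors = n_rows * selectivity
    return survivors * dist_cost  # exact scan over the survivors

def postfilter_cost(top_k: int, selectivity: float, hop_cost: float = 5.0) -> float:
    """HNSW scan first: keeping k rows after filtering means visiting ~k/selectivity candidates."""
    return (top_k / selectivity) * hop_cost

# 1M rows, top-10 query:
tight = 0.001   # category matches 0.1% of rows  → pre-filter wins
loose = 0.5     # category matches half the table → HNSW-first wins
print(prefilter_cost(1_000_000, tight) < postfilter_cost(10, tight))   # True
print(prefilter_cost(1_000_000, loose) > postfilter_cost(10, loose))   # True
```

The intuition: a very selective metadata filter shrinks the candidate set so much that exact distance ranking is cheap, while a loose filter makes the approximate index the only way to avoid touching most of the table.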
2. Deployment in Practice: From Development to Production
2.1 Development Environment: Quick Start with Docker
bash
# Option 1: official image (recommended)
docker run --name pgvector-dev \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  -d pgvector/pgvector:pg16
# Enable the extension inside the container
docker exec -it pgvector-dev psql -U postgres -c "CREATE EXTENSION vector;"
# Option 2: Docker Compose (full dev stack)
cat > docker-compose.yml <<'EOF'
version: '3.8'
services:
postgres:
image: pgvector/pgvector:pg16
environment:
POSTGRES_USER: dev
POSTGRES_PASSWORD: devpass
POSTGRES_DB: vectordb
ports:
- "5432:5432"
volumes:
- pgdata:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
command: >
postgres
-c shared_preload_libraries='pg_stat_statements'
-c max_connections=200
pgadmin:
image: dpage/pgadmin4:latest
environment:
PGADMIN_DEFAULT_EMAIL: admin@example.com
PGADMIN_DEFAULT_PASSWORD: admin
ports:
- "8080:80"
depends_on:
- postgres
volumes:
pgdata:
EOF
docker-compose up -d
2.2 Production: Building from Source
Use case: you need a specific PostgreSQL version, or custom compiler flags.
bash
# 1. Install dependencies
sudo apt-get update
sudo apt-get install -y postgresql-server-dev-16 build-essential git
# 2. Fetch the source
cd /tmp
git clone --branch v0.7.0 https://github.com/pgvector/pgvector.git
cd pgvector
# 3. Build and install
make OPTFLAGS="-march=native -O3"  # optimize for the local CPU
sudo make install
# 4. Enable the extension
sudo -u postgres psql -c "CREATE EXTENSION vector;"
# 5. Verify
psql -c "SELECT * FROM pg_available_extensions WHERE name = 'vector';"
2.3 Managed Cloud Services
2.3.1 AWS RDS/Aurora
sql
-- RDS supports pgvector (on recent engine versions)
-- Adjust parameter groups through the RDS console or CLI if needed
-- After connecting:
CREATE EXTENSION IF NOT EXISTS vector;
-- Check the installed version
SELECT * FROM pg_extension WHERE extname = 'vector';
Limitations:
- Some advanced HNSW tuning knobs may be unavailable
- No custom compiler optimizations
- New pgvector versions arrive on the RDS release cadence
2.3.2 Supabase/Neon and Other Serverless Postgres
sql
-- Supabase ships with pgvector preinstalled
-- Run directly in the SQL Editor:
CREATE EXTENSION vector;
-- Note: the Supabase free tier has a 500MB database limit
-- Suggestion: put large vector tables in a dedicated project
2.4 Connection Pooling and Performance Configuration
ini
# postgresql.conf tuning (production)
# Memory (assuming a 64GB server)
shared_buffers = 16GB             # 25% of RAM
effective_cache_size = 48GB       # 75% of RAM
work_mem = 256MB                  # complex sorts/hashes
maintenance_work_mem = 2GB        # index builds
# Concurrency
max_connections = 200
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
# WAL (SSD-oriented)
wal_buffers = 16MB
min_wal_size = 1GB
max_wal_size = 4GB
checkpoint_completion_target = 0.9
# Query planning
random_page_cost = 1.1            # 1.1 for SSD; the default 4 suits HDD
effective_io_concurrency = 200    # SSD concurrency
# pgvector-specific
# Few special GUCs beyond hnsw.ef_search / ivfflat.probes; rely on base Postgres tuning
Connection pooling (PgBouncer):
ini
; pgbouncer.ini
[databases]
vectordb = host=localhost port=5432 dbname=vectordb
[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
max_client_conn = 10000
default_pool_size = 25
reserve_pool_size = 5
pool_mode = transaction  ; transaction-level pooling recommended
3. Complete SQL Operations Guide
3.1 Data Types and Basic Operations
sql
-- ==================== Data type definitions ====================
-- Dense vectors (most common)
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),       -- 1536-dim float4 vector
    metadata JSONB                -- flexible metadata
);
-- Half-precision vectors (storage savings)
CREATE TABLE images (
    id SERIAL PRIMARY KEY,
    url TEXT,
    embedding halfvec(512),       -- 512-dim float16, 50% smaller
    category VARCHAR(50)
);
-- Sparse vectors (keyword weights)
CREATE TABLE keywords (
    id SERIAL PRIMARY KEY,
    doc_id INTEGER REFERENCES documents(id),
    embedding sparsevec(100000),  -- 100k dims, only non-zeros stored
    term TEXT
);
-- ==================== Construction and casts ====================
-- From a literal or an array
SELECT '[1,2,3]'::vector(3);
SELECT ARRAY[0.1, 0.2, 0.3]::vector(3);
-- Dimension checks are strict
SELECT '[1,2]'::vector(3);        -- ERROR: dimension mismatch
-- Casts between precisions
SELECT embedding::halfvec FROM documents;  -- down-cast
SELECT embedding::vector FROM images;      -- up-cast
-- ==================== Distance operators ====================
-- Euclidean (L2) distance
SELECT embedding <-> '[0.1, 0.2, ...]'::vector AS distance FROM documents;
-- Cosine distance (1 - cosine similarity, range [0, 2])
SELECT embedding <=> '[0.1, 0.2, ...]'::vector AS distance FROM documents;
-- Negative inner product (more negative = larger inner product; used for max-inner-product search)
SELECT embedding <#> '[0.1, 0.2, ...]'::vector AS neg_dot_product FROM documents;
-- L1 (taxicab) distance (pgvector 0.7.0+)
SELECT embedding <+> '[0.1, 0.2, ...]'::vector FROM documents;
-- Hamming distance (bit vectors)
SELECT embedding <~> '101...'::bit(10) FROM binary_items;
-- ==================== Vector properties ====================
SELECT
    vector_dims(embedding) AS dimensions,        -- dimensionality
    vector_norm(embedding) AS magnitude,         -- L2 norm
    embedding::text AS string_representation     -- text form
FROM documents;
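The operator semantics above are easy to sanity-check in NumPy. This is an illustrative mirror of what `<->`, `<=>`, `<#>`, and `<+>` compute, not pgvector code:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])

l2 = np.linalg.norm(a - b)                                       # a <-> b
cos_dist = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # a <=> b
neg_ip = -(a @ b)                                                # a <#> b
l1 = np.abs(a - b).sum()                                         # a <+> b (0.7.0+)

print(round(l2, 4), round(cos_dist, 4), neg_ip, l1)
```

Two details worth memorizing: `<=>` returns a *distance* (1 - cosine similarity), so recover similarity as `1 - distance`; and `<#>` is the *negative* inner product, because Postgres index scans always sort ascending.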
3.2 Table Design and Indexing Strategy
sql
-- ==================== Production-grade table design ====================
-- Partitioned table (range-partitioned by time; good for logs/time-series)
CREATE TABLE document_chunks (
    id BIGSERIAL,
    doc_id INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    metadata JSONB DEFAULT '{}',
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- Create partitions
CREATE TABLE document_chunks_2024_q1 PARTITION OF document_chunks
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE document_chunks_2024_q2 PARTITION OF document_chunks
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
-- Vector indexes (created per partition)
CREATE INDEX ON document_chunks_2024_q1 USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON document_chunks_2024_q2 USING hnsw (embedding vector_cosine_ops);
-- Ordinary B-Tree index (metadata filtering)
CREATE INDEX ON document_chunks (doc_id);
CREATE INDEX ON document_chunks USING gin (metadata);  -- JSONB index
-- ==================== Choosing an index strategy ====================
-- Strategy 1: plain HNSW (< ~1M rows, high query concurrency)
CREATE INDEX idx_hnsw ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Strategy 2: partial HNSW index (per-segment queries)
CREATE INDEX idx_hnsw_tech ON documents
USING hnsw (embedding vector_cosine_ops)
WHERE category = 'tech';
-- Strategy 3: IVFFlat (large datasets, bulk loads, few updates)
CREATE INDEX idx_ivf ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Strategy 4: hybrid retrieval (full-text + vector)
-- Requires pg_trgm (or a search extension) in addition
CREATE INDEX idx_content_gin ON documents USING gin (content gin_trgm_ops);
CREATE INDEX idx_embedding ON documents USING hnsw (embedding vector_cosine_ops);
-- ==================== Index maintenance ====================
-- Index size
SELECT pg_size_pretty(pg_relation_size('idx_hnsw'));
-- Rebuild (concurrent reindex avoids blocking writes)
REINDEX INDEX CONCURRENTLY idx_hnsw;
-- Drop and recreate (needed to re-train IVFFlat centroids)
DROP INDEX idx_ivf;
CREATE INDEX idx_ivf ON documents USING ivfflat (embedding vector_cosine_ops);
-- Refresh planner statistics
ANALYZE documents;
3.3 Queries in Detail
sql
-- ==================== Basic similarity search ====================
-- Nearest-neighbor search (KNN)
SELECT
    id,
    content,
    embedding <=> '[0.023, -0.456, ...]'::vector AS distance
FROM documents
ORDER BY embedding <=> '[0.023, -0.456, ...]'::vector
LIMIT 10;
-- Using prepared statements (application code)
PREPARE knn_query (vector) AS
SELECT id, content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
EXECUTE knn_query ('[0.1, 0.2, ...]'::vector);
-- ==================== Hybrid queries (vector + metadata) ====================
-- Option 1: filter on metadata first, then order by distance (when the filter is selective)
SELECT
    id, content,
    embedding <=> '[...]'::vector AS distance
FROM documents
WHERE category = 'tech'
  AND published > '2024-01-01'
  AND metadata->>'author' = 'Alice'
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;
-- Option 2: vector index scan + filter (when distance is the selective predicate)
-- The planner chooses automatically; you can nudge it:
SET enable_seqscan = off;
SELECT
    id, content,
    embedding <=> '[...]'::vector AS distance
FROM documents
WHERE embedding <=> '[...]'::vector < 0.3  -- distance threshold
  AND category = 'tech'
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;
-- ==================== Range queries ====================
-- Distance-threshold search (all rows with cosine similarity > 0.8)
SELECT
    id,
    1 - (embedding <=> '[...]'::vector) AS cosine_similarity
FROM documents
WHERE embedding <=> '[...]'::vector < 0.2  -- cosine distance < 0.2 means similarity > 0.8
ORDER BY embedding <=> '[...]'::vector;
-- ==================== Batched queries ====================
-- Use LATERAL to answer several query vectors in one statement
WITH queries AS (
    SELECT 1 AS qid, '[0.1, ...]'::vector AS vec
    UNION ALL
    SELECT 2, '[0.2, ...]'::vector
)
SELECT
    q.qid,
    d.id,
    d.content,
    d.embedding <=> q.vec AS distance
FROM queries q
LEFT JOIN LATERAL (
    SELECT * FROM documents
    ORDER BY embedding <=> q.vec
    LIMIT 5
) d ON true;
-- ==================== Aggregation and grouping ====================
-- Average distance per category
SELECT
    category,
    AVG(embedding <=> '[...]'::vector) AS avg_distance,
    COUNT(*) AS count
FROM documents
GROUP BY category;
-- Closest document in each category
SELECT DISTINCT ON (category)
    category,
    id,
    content,
    embedding <=> '[...]'::vector AS distance
FROM documents
ORDER BY category, embedding <=> '[...]'::vector;
-- ==================== Vector arithmetic ====================
-- Vector addition (semantic composition: "king" - "man" + "woman" ≈ "queen")
SELECT
    d1.embedding - d2.embedding + d3.embedding AS combined_vector
FROM documents d1, documents d2, documents d3
WHERE d1.content = 'king'
  AND d2.content = 'man'
  AND d3.content = 'woman';
-- Search with the combined vector
WITH combined AS (
    SELECT embedding - '[...]'::vector + '[...]'::vector AS vec
    FROM documents WHERE id = 1
)
SELECT d.id, d.content
FROM combined c
CROSS JOIN LATERAL (
    SELECT * FROM documents
    ORDER BY embedding <=> c.vec
    LIMIT 5
) d;
3.4 Transactions and Concurrency Control
sql
-- ==================== ACID transaction example ====================
BEGIN;
-- Insert a document with its vector
INSERT INTO documents (content, embedding, metadata)
VALUES (
    'PostgreSQL vector database',
    '[0.1, 0.2, ...]'::vector,
    '{"author": "Bob", "tags": ["database", "ai"]}'
)
RETURNING id;
-- Update a related stats table in the same transaction
UPDATE doc_stats SET count = count + 1 WHERE category = 'database';
-- Vector searches see the insert immediately within the transaction
SELECT * FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;
COMMIT;
-- ==================== High-concurrency insert optimization ====================
-- Batch commits under heavy insert load
-- Application-side logic:
-- 1. Accumulate ~1000 records
-- 2. Use a single multi-row INSERT
-- 3. Commit the transaction
INSERT INTO documents (content, embedding) VALUES
    ('doc1', '[...]'::vector),
    ('doc2', '[...]'::vector),
    ...;
-- ==================== Pessimistic locking for vector updates ====================
-- Update a vector in place (HNSW handles updates; no delete-then-insert needed)
BEGIN;
SELECT * FROM documents WHERE id = 100 FOR UPDATE;
UPDATE documents
SET embedding = '[new vector]'::vector,
    updated_at = NOW()
WHERE id = 100;
COMMIT;
4. Language Integrations and Tool Connectivity
4.1 Python: psycopg2 + pgvector
bash
pip install psycopg2-binary pgvector
python
import psycopg2
from psycopg2.extras import Json, execute_values
from pgvector.psycopg2 import register_vector
import numpy as np
from openai import OpenAI

# ==================== Connection and initialization ====================
class PgvectorClient:
    def __init__(self, dsn="postgresql://dev:devpass@localhost:5432/vectordb"):
        self.conn = psycopg2.connect(dsn)
        self.conn.autocommit = False  # manual transaction control
        register_vector(self.conn)    # register the vector type adapter

    def init_schema(self):
        """Initialize the database schema."""
        with self.conn.cursor() as cur:
            # Enable the extension
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
            # Create the table
            cur.execute("""
                CREATE TABLE IF NOT EXISTS documents (
                    id SERIAL PRIMARY KEY,
                    title VARCHAR(500),
                    content TEXT NOT NULL,
                    embedding vector(1536),
                    metadata JSONB DEFAULT '{}',
                    created_at TIMESTAMP DEFAULT NOW(),
                    updated_at TIMESTAMP DEFAULT NOW()
                );
            """)
            # Create the HNSW index if it does not exist
            cur.execute("""
                DO $$
                BEGIN
                    IF NOT EXISTS (
                        SELECT 1 FROM pg_indexes
                        WHERE indexname = 'documents_embedding_idx'
                    ) THEN
                        CREATE INDEX documents_embedding_idx
                        ON documents USING hnsw (embedding vector_cosine_ops)
                        WITH (m = 16, ef_construction = 64);
                    END IF;
                END $$;
            """)
            # Trigger to keep updated_at current
            cur.execute("""
                CREATE OR REPLACE FUNCTION update_updated_at_column()
                RETURNS TRIGGER AS $$
                BEGIN
                    NEW.updated_at = NOW();
                    RETURN NEW;
                END;
                $$ language 'plpgsql';
                DROP TRIGGER IF EXISTS update_documents_updated_at ON documents;
                CREATE TRIGGER update_documents_updated_at
                    BEFORE UPDATE ON documents
                    FOR EACH ROW
                    EXECUTE FUNCTION update_updated_at_column();
            """)
        self.conn.commit()
        print("Schema initialized")

    # ==================== Data operations ====================
    def insert_document(self, title: str, content: str, embedding: list, metadata: dict = None):
        """Insert a single document."""
        with self.conn.cursor() as cur:
            cur.execute("""
                INSERT INTO documents (title, content, embedding, metadata)
                VALUES (%s, %s, %s, %s)
                RETURNING id;
            """, (title, content, embedding, Json(metadata or {})))
            doc_id = cur.fetchone()[0]
        self.conn.commit()
        return doc_id

    def batch_insert(self, documents: list, batch_size: int = 1000):
        """Bulk insert (efficient)."""
        total = len(documents)
        for i in range(0, total, batch_size):
            batch = documents[i:i+batch_size]
            with self.conn.cursor() as cur:
                execute_values(
                    cur,
                    """
                    INSERT INTO documents (title, content, embedding, metadata)
                    VALUES %s
                    """,
                    [(d['title'], d['content'], d['embedding'], Json(d.get('metadata', {})))
                     for d in batch],
                    template="(%s, %s, %s::vector, %s::jsonb)"
                )
            self.conn.commit()
            print(f"Inserted {min(i+batch_size, total)}/{total}")
        return total
    # ==================== Vector search ====================
    def search_similar(self, query_embedding: list, top_k: int = 5,
                       filters: dict = None, min_similarity: float = None):
        """Similarity search."""
        with self.conn.cursor() as cur:
            # Build the query
            sql = """
                SELECT
                    id,
                    title,
                    content,
                    1 - (embedding <=> %s::vector) AS similarity,
                    metadata
                FROM documents
                WHERE 1=1
            """
            params = [query_embedding]
            # Dynamic filter conditions (keys are trusted application constants,
            # not user input, since they are interpolated into the SQL string)
            if filters:
                for key, value in filters.items():
                    if isinstance(value, list):
                        sql += f" AND metadata->>'{key}' = ANY(%s)"
                        params.append(value)
                    else:
                        sql += f" AND metadata->>'{key}' = %s"
                        params.append(value)
            # Similarity threshold
            if min_similarity:
                max_distance = 1 - min_similarity  # cosine distance = 1 - cosine similarity
                sql += " AND embedding <=> %s::vector < %s"
                params.extend([query_embedding, max_distance])
            sql += " ORDER BY embedding <=> %s::vector LIMIT %s"
            params.extend([query_embedding, top_k])
            cur.execute(sql, params)
            results = cur.fetchall()
            return [
                {
                    "id": r[0],
                    "title": r[1],
                    "content": r[2][:200] + "..." if len(r[2]) > 200 else r[2],
                    "similarity": round(r[3], 4),
                    "metadata": r[4]
                }
                for r in results
            ]
    # ==================== Hybrid full-text search ====================
    def hybrid_search(self, query_text: str, query_embedding: list,
                      top_k: int = 10, text_weight: float = 0.3):
        """Combine full-text and vector search (requires pg_trgm)."""
        with self.conn.cursor() as cur:
            cur.execute("""
                WITH vector_scores AS (
                    SELECT
                        id,
                        1 - (embedding <=> %s::vector) AS v_score,
                        ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS v_rank
                    FROM documents
                    ORDER BY embedding <=> %s::vector
                    LIMIT %s * 3
                ),
                text_scores AS (
                    SELECT
                        id,
                        similarity(content, %s) AS t_score,
                        ROW_NUMBER() OVER (ORDER BY similarity(content, %s) DESC) AS t_rank
                    FROM documents
                    WHERE content %% %s  -- trigram similarity operator
                    ORDER BY similarity(content, %s) DESC
                    LIMIT %s * 3
                )
                SELECT
                    COALESCE(v.id, t.id) AS id,
                    d.title,
                    d.content,
                    COALESCE(v.v_score, 0) * (1 - %s) +
                    COALESCE(t.t_score, 0) * %s AS combined_score
                FROM vector_scores v
                FULL OUTER JOIN text_scores t ON v.id = t.id
                JOIN documents d ON d.id = COALESCE(v.id, t.id)
                ORDER BY combined_score DESC
                LIMIT %s;
            """, (
                query_embedding, query_embedding, query_embedding, top_k,
                query_text, query_text, query_text, query_text, top_k,
                text_weight, text_weight, top_k
            ))
            return cur.fetchall()

    def close(self):
        self.conn.close()
# ==================== Complete RAG example ====================
def rag_example():
    # Setup
    client = PgvectorClient()
    client.init_schema()
    # Embedding model
    openai = OpenAI()
    # Sample data
    docs = [
        "PostgreSQL is the world's most advanced open-source relational database",
        "pgvector adds vector similarity search to Postgres",
        "Vector databases are core infrastructure for AI applications",
        "The HNSW index provides efficient approximate nearest-neighbor search"
    ]
    # Embed and insert
    for doc in docs:
        response = openai.embeddings.create(
            model="text-embedding-3-small",
            input=doc
        )
        embedding = response.data[0].embedding
        client.insert_document(
            title=doc[:50],
            content=doc,
            embedding=embedding,
            metadata={"source": "manual", "category": "tech"}
        )
    # Query
    query = "What is a vector database?"
    query_emb = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    results = client.search_similar(
        query_embedding=query_emb,
        top_k=3,
        min_similarity=0.7
    )
    print("Search results:")
    for r in results:
        print(f"  [{r['similarity']:.3f}] {r['content']}")
    client.close()

if __name__ == "__main__":
    rag_example()
4.2 Node.js Integration: pg + pgvector
bash
npm install pg pgvector
javascript
const { Pool } = require('pg');
const pgvector = require('pgvector/pg');

class PgvectorService {
  constructor() {
    this.pool = new Pool({
      host: 'localhost',
      port: 5432,
      database: 'vectordb',
      user: 'dev',
      password: 'devpass',
      max: 20,                  // pool size
      idleTimeoutMillis: 30000
    });
    // Register the vector type on every new connection
    this.pool.on('connect', async (client) => {
      await pgvector.registerType(client);
    });
  }

  async init() {
    const client = await this.pool.connect();
    try {
      // Enable the extension
      await client.query('CREATE EXTENSION IF NOT EXISTS vector');
      // Create the table
      await client.query(`
        CREATE TABLE IF NOT EXISTS items (
          id SERIAL PRIMARY KEY,
          name TEXT NOT NULL,
          embedding vector(1536),
          attributes JSONB DEFAULT '{}',
          created_at TIMESTAMP DEFAULT NOW()
        )
      `);
      // Create the index if it is missing
      const indexExists = await client.query(`
        SELECT 1 FROM pg_indexes
        WHERE indexname = 'items_embedding_idx'
      `);
      if (indexExists.rows.length === 0) {
        await client.query(`
          CREATE INDEX items_embedding_idx
          ON items USING hnsw (embedding vector_cosine_ops)
          WITH (m = 16, ef_construction = 64)
        `);
        console.log('HNSW index created');
      }
    } finally {
      client.release();
    }
  }
  async insertItem(name, embedding, attributes = {}) {
    const query = `
      INSERT INTO items (name, embedding, attributes)
      VALUES ($1, $2, $3)
      RETURNING id
    `;
    // pgvector.toSql serializes a JS array to pgvector's text form "[0.1,0.2,...]"
    const result = await this.pool.query(query, [name, pgvector.toSql(embedding), attributes]);
    return result.rows[0].id;
  }

  async batchInsert(items) {
    const client = await this.pool.connect();
    try {
      await client.query('BEGIN');
      const insertQuery = `
        INSERT INTO items (name, embedding, attributes)
        SELECT * FROM UNNEST($1::text[], $2::vector[], $3::jsonb[])
      `;
      const names = items.map(i => i.name);
      const embeddings = items.map(i => pgvector.toSql(i.embedding));
      const attributes = items.map(i => JSON.stringify(i.attributes || {}));
      await client.query(insertQuery, [names, embeddings, attributes]);
      await client.query('COMMIT');
    } catch (e) {
      await client.query('ROLLBACK');
      throw e;
    } finally {
      client.release();
    }
  }

  async searchSimilar(queryEmbedding, options = {}) {
    const {
      topK = 5,
      minSimilarity = null,
      filterAttributes = {}
    } = options;
    let sql = `
      SELECT
        id,
        name,
        1 - (embedding <=> $1) AS similarity,
        attributes
      FROM items
      WHERE 1=1
    `;
    const params = [pgvector.toSql(queryEmbedding)];
    let paramIdx = 1;
    // Dynamic filters (keys must be trusted constants, not user input)
    for (const [key, value] of Object.entries(filterAttributes)) {
      paramIdx++;
      sql += ` AND attributes->>'${key}' = $${paramIdx}`;
      params.push(value);
    }
    // Similarity threshold (cosine distance = 1 - cosine similarity)
    if (minSimilarity !== null) {
      const maxDistance = 1 - minSimilarity;
      paramIdx++;
      sql += ` AND embedding <=> $1 < $${paramIdx}`;
      params.push(maxDistance);
    }
    paramIdx++;
    sql += ` ORDER BY embedding <=> $1 LIMIT $${paramIdx}`;
    params.push(topK);
    const result = await this.pool.query(sql, params);
    return result.rows;
  }

  async close() {
    await this.pool.end();
  }
}
// Usage example
async function main() {
  const service = new PgvectorService();
  await service.init();
  // Insert a sample row
  await service.insertItem(
    'Test Item',
    Array(1536).fill(0).map(() => Math.random()),
    { category: 'test', priority: 1 }
  );
  // Search
  const results = await service.searchSimilar(
    Array(1536).fill(0).map(() => Math.random()),
    { topK: 3, minSimilarity: 0.5 }
  );
  console.log(results);
  await service.close();
}

main().catch(console.error);
4.3 Connecting GUI Tools
4.3.1 DBeaver / DataGrip
Connection settings:
- Host: localhost
- Port: 5432
- Database: vectordb
- Driver: the standard PostgreSQL JDBC driver
Viewing vector data:
sql
-- DBeaver renders the vector type as a string like "[0.12, -0.34, ...]"
-- A community-maintained pgvector plugin offers nicer formatting
-- Custom preview query
SELECT
    id,
    content,
    LEFT(embedding::text, 50) || '...' AS embedding_preview,
    embedding <=> '[...]'::vector AS distance
FROM documents;
4.3.2 PgAdmin 4
sql
-- Included automatically in the Docker Compose stack above
-- Open http://localhost:8080
-- In the Query Tool, run:
CREATE EXTENSION vector;
-- Table views show the column type as "vector(1536)"
-- The data browser displays vectors in array form
4.3.3 Custom Visualization (Python)
python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

def visualize_vectors(conn_str, sample_size=1000):
    """Visualize the vector distribution after dimensionality reduction."""
    conn = psycopg2.connect(conn_str)
    register_vector(conn)
    # Sample rows
    with conn.cursor() as cur:
        cur.execute("""
            SELECT embedding, metadata->>'category' as category
            FROM documents
            TABLESAMPLE SYSTEM (10)  -- 10% block-level sampling
            LIMIT %s
        """, (sample_size,))
        rows = cur.fetchall()
    conn.close()
    if not rows:
        print("No data")
        return
    vectors = np.array([r[0] for r in rows])
    categories = [r[1] for r in rows]
    # PCA down to 2D
    pca = PCA(n_components=2)
    reduced = pca.fit_transform(vectors)
    # Plot
    plt.figure(figsize=(12, 8))
    for cat in set(categories):
        mask = np.array([c == cat for c in categories])
        plt.scatter(
            reduced[mask, 0],
            reduced[mask, 1],
            label=cat,
            alpha=0.6
        )
    plt.legend()
    plt.title('pgvector embedding space (PCA)')
    plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
    plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
    plt.savefig('pgvector_viz.png')
    plt.show()

# Usage
visualize_vectors("postgresql://dev:devpass@localhost:5432/vectordb")
5. Ecosystem Integrations: LangChain, LlamaIndex, and More
5.1 LangChain Integration
python
from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
# Connection string
CONNECTION_STRING = "postgresql://dev:devpass@localhost:5432/vectordb"
# Initialize
vector_store = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=OpenAIEmbeddings(),
    collection_name="langchain_docs",
    distance_strategy="cosine",     # or "euclidean", "max_inner_product"
    pre_delete_collection=False     # whether to drop an existing collection first
)
# Add documents
documents = [
    Document(
        page_content="pgvector is PostgreSQL's vector extension",
        metadata={"source": "docs", "page": 1}
    ),
    Document(
        page_content="The HNSW index provides efficient similarity search",
        metadata={"source": "docs", "page": 2}
    )
]
vector_store.add_documents(documents)
# Similarity search
results = vector_store.similarity_search(
    "What is an HNSW index?",
    k=3,
    filter={"source": "docs"}  # metadata filtering
)
# Search with scores
results_with_scores = vector_store.similarity_search_with_score(
    "vector database internals",
    k=5
)
for doc, score in results_with_scores:
    print(f"[{score:.4f}] {doc.page_content}")
# As a LangChain retriever
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.8}
)
# Use in a RAG chain
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=retriever
)
answer = qa.invoke({"query": "Explain pgvector's index types"})
5.2 LlamaIndex Integration
python
from llama_index.core import VectorStoreIndex, StorageContext, Document
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
# Configuration
vector_store = PGVectorStore.from_params(
    host="localhost",
    port=5432,
    database="vectordb",
    user="dev",
    password="devpass",
    table_name="llamaindex",
    embed_dim=1536,            # must match the embedding model
    hnsw_kwargs={              # HNSW parameters
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops"
    }
)
# Create the storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Load documents
documents = [
    Document(text="pgvector supports HNSW and IVFFlat indexes"),
    Document(text="PostgreSQL's ACID guarantees keep data consistent")
]
# Build the index (embeds and stores automatically)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=OpenAIEmbedding()
)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("Which indexes does pgvector support?")
print(response)
5.3 SQLAlchemy Integration (ORM)
python
from sqlalchemy import create_engine, Column, Integer, String, DateTime, JSON, text
from sqlalchemy.orm import declarative_base, Session
from pgvector.sqlalchemy import Vector
from datetime import datetime
Base = declarative_base()

class Document(Base):
    __tablename__ = 'documents'
    id = Column(Integer, primary_key=True)
    title = Column(String(500))
    content = Column(String)
    embedding = Column(Vector(1536))      # pgvector column type
    # "metadata" is a reserved attribute on declarative models,
    # so expose the column under a different Python name
    doc_metadata = Column("metadata", JSON)
    created_at = Column(DateTime, default=datetime.utcnow)

# Connect
engine = create_engine(
    "postgresql://dev:devpass@localhost:5432/vectordb",
    connect_args={'options': '-csearch_path=public'}
)
# Create tables (the HNSW index is added separately)
Base.metadata.create_all(engine)
# Add the HNSW index manually (SQLAlchemy does not create it automatically)
with engine.connect() as conn:
    conn.execute(text("""
        CREATE INDEX IF NOT EXISTS idx_documents_embedding
        ON documents USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """))
    conn.commit()
# ORM operations
with Session(engine) as session:
    # Insert
    doc = Document(
        title="Test",
        content="Content",
        embedding=[0.1] * 1536,           # converted automatically
        doc_metadata={"key": "value"}
    )
    session.add(doc)
    session.commit()
    # Vector search (raw SQL; pass the vector in pgvector's text form)
    result = session.execute(text("""
        SELECT id, title, content, embedding <=> :vec AS distance
        FROM documents
        ORDER BY embedding <=> :vec
        LIMIT 5
    """), {"vec": str([0.1] * 1536)})
    for row in result:
        print(row)
6. Production Best Practices
6.1 Performance Tuning
sql
-- ==================== Index tuning ====================
-- 1. Pick the right index type
-- < ~1M rows, frequent updates → HNSW
-- > ~10M rows, bulk loads, few updates → IVFFlat
-- 2. HNSW parameters
-- m: typically 16-64; larger datasets warrant larger m
-- ef_construction: typically 64-200; trades build time for graph quality
-- 3. Query-time parameters
SET hnsw.ef_search = 100;  -- session-level setting (ivfflat.probes is the IVFFlat equivalent)
-- ==================== Query tuning ====================
-- Analyze with EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM documents
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;
-- Expected: Index Scan using documents_embedding_idx
-- Avoid full-table scans: always include a LIMIT, or a sufficiently selective WHERE clause
-- ==================== Write tuning ====================
-- 1. Batch inserts (one multi-row INSERT)
INSERT INTO documents (content, embedding) VALUES
    ('c1', '[...]'), ('c2', '[...]'), ...;
-- 2. Skip WAL during the initial load only (at your own risk)
-- ALTER TABLE documents SET UNLOGGED;
-- after the load: ALTER TABLE documents SET LOGGED;
-- 3. Raise maintenance_work_mem for index builds
SET maintenance_work_mem = '4GB';
CREATE INDEX ... ;  -- more memory means faster builds
-- 4. Build indexes concurrently (no table lock)
CREATE INDEX CONCURRENTLY idx_name ON ...;
6.2 High Availability and Backups
bash
# ==================== Streaming replication ====================
# Primary postgresql.conf
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
synchronous_commit = remote_apply  # synchronous replication
# Standby configuration (postgresql.auto.conf; recovery.conf before PG 12)
primary_conninfo = 'host=primary port=5432 user=replicator password=...'
primary_slot_name = 'replica_1'
# pgvector fully supports streaming replication; indexes replicate automatically
# ==================== Backup strategy ====================
# 1. pg_dump (logical backup, fine for small datasets)
pg_dump -h localhost -U dev vectordb > vectordb_backup.sql
# 2. pg_basebackup (physical backup, recommended)
pg_basebackup -h localhost -D /backup/vectordb -U replicator -v -P -W
# 3. Continuous archiving (point-in-time recovery)
# postgresql.conf:
archive_mode = on
archive_command = 'cp %p /archive/%f'
# 4. Vector-specific tip: at large scale, rebuilding indexes can beat restoring them
# Strategy: back up data only; rebuild indexes after a restore
6.3 Monitoring and Maintenance
sql
-- ==================== Monitoring queries ====================
-- 1. Index usage
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,        -- number of index scans
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname LIKE '%embedding%'
ORDER BY idx_scan DESC;
-- 2. Table size and bloat
SELECT
    relname AS table_name,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    n_live_tup AS live_tuples,
    n_dead_tup AS dead_tuples,
    round(100 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
WHERE relname = 'documents';
-- 3. Index size
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE indexrelname = 'documents_embedding_idx';
-- ==================== Maintenance ====================
-- 1. Refresh statistics
ANALYZE documents;
-- 2. Clean up dead tuples
VACUUM ANALYZE documents;
-- 3. Rebuild a bloated index
REINDEX INDEX CONCURRENTLY documents_embedding_idx;
-- 4. Reclaim table bloat (needs extra disk space)
ALTER TABLE documents SET (fillfactor = 85);  -- leave room for in-page updates
VACUUM FULL documents;  -- use with care: holds a long exclusive lock
6.4 Security Practices
sql
-- ==================== Access control ====================
-- 1. Read-only role
CREATE ROLE vector_read;
GRANT CONNECT ON DATABASE vectordb TO vector_read;
GRANT USAGE ON SCHEMA public TO vector_read;
GRANT SELECT ON documents TO vector_read;
-- 2. Application role (least privilege)
CREATE ROLE app_writer;
GRANT INSERT, UPDATE, DELETE ON documents TO app_writer;
-- do not grant TRUNCATE or DROP
-- 3. Row-level security (RLS) for multi-tenant isolation
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents
    USING (metadata->>'tenant_id' = current_setting('app.current_tenant'));
-- The application sets the tenant context per session
SET app.current_tenant = 'tenant_123';
-- ==================== Encryption ====================
-- 1. Enforce SSL connections
-- postgresql.conf:
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'
ssl_ca_file = 'root.crt'
-- 2. Column-level encryption (sensitive metadata)
-- using the pgcrypto extension
CREATE EXTENSION pgcrypto;
-- Encrypt on insert (pgp_sym_encrypt returns bytea, so the target
-- column must be bytea rather than JSONB)
INSERT INTO documents (content, metadata)
VALUES (
    'sensitive content',
    pgp_sym_encrypt('{"secret": "data"}', 'encryption-key')
);
-- Decrypt on read
SELECT
    content,
    pgp_sym_decrypt(metadata::bytea, 'encryption-key')::jsonb
FROM documents;
7. Choosing Between pgvector, Pinecone, and Milvus
| Dimension | pgvector | Pinecone | Milvus |
|---|---|---|---|
| Deployment complexity | ⭐ Very low (Postgres extension) | ⭐⭐ Low (SaaS) | ⭐⭐⭐⭐ High (K8s cluster) |
| Operational cost | Reuse your DBA team | None (fully managed) | Needs dedicated ops |
| Scale ceiling | ~10M vectors (single node) | ~1B (auto-scaling) | ~100B (distributed) |
| Query performance | Moderate (single-threaded per query) | High (purpose-built) | Very high (parallel queries) |
| Feature richness | Core ANN plus the SQL ecosystem | Moderate (managed constraints) | Very high (GPU, sparse vectors, etc.) |
| Transactions | ⭐⭐⭐⭐⭐ Full ACID | ⭐⭐ Eventually consistent | ⭐⭐⭐ Tunable consistency |
| Hybrid queries | ⭐⭐⭐⭐⭐ Native SQL | ⭐⭐⭐ Metadata filtering | ⭐⭐⭐⭐ Expression filtering |
| Cost at scale | Low (self-hosted hardware) | High (usage-based) | Medium (self-managed, tunable) |
| Best fit | Existing Postgres, small/medium scale, complex transactions | Fast start, no ops team | Very large scale, peak performance |
Decision tree:
Already running PostgreSQL with < 10M vectors?
├─ Yes → pgvector (zero marginal cost)
└─ No → Need complex SQL transactions?
    ├─ Yes → Milvus plus an external transactional store, or a split-database architecture
    └─ No → Need to launch immediately with no ops team?
        ├─ Yes → Pinecone
        └─ No → Milvus (better long-term cost)
8. Series Wrap-up
| Database | Core strength | Best scenario | One-line takeaway |
|---|---|---|---|
| Pinecone | Zero-ops SaaS, minutes to launch | MVPs, quick validation, small/medium scale | "Fastest to start" |
| Milvus | Open-source and distributed, scales to tens of billions | Large-scale production, GPU acceleration, custom needs | "Strongest scaling" |
| pgvector | Native SQL, ACID transactions, zero new infrastructure | Existing Postgres, complex business queries, small/medium scale | "Simplest integration" |
Where the technology is heading:
- Short term: pgvector spreads rapidly and becomes a de facto Postgres standard
- Medium term: Milvus/Zilliz dominate large-scale, cloud-native deployments
- Long term: multi-modal fusion (vector + full-text + graph) and AI-native databases (self-tuning, learned indexes)