Chapter 6: Vector Embedding Techniques and Semantic Search
Preface
Hi everyone, I'm Jixiaoyu (鲫小鱼), a front-end engineer who doesn't write front-end code. I enjoy sharing knowledge beyond the front end and helping fellow pixel-pushers break out of the pixel-pushing circle. Follow me on my WeChat official account 《鲫小鱼不正经》. Likes, bookmarks, and follows are all appreciated!
🎯 Learning Objectives
- Understand how text embeddings work and the common similarity metrics (cosine, dot product, L2)
- Master the engineering pipeline of document loading, cleaning, chunking, and vectorization
- Integrate vector databases such as Chroma, Pinecone, and Weaviate with LangChain.js
- Build retrievers with TopK, MMR, metadata filtering, and hybrid search
- Ship a semantic-search API and front end (mobile-friendly) in Next.js
- Complete the "enterprise document semantic search + quick Q&A" project, with production-grade optimization and error handling
📖 Theory: Vectors and Semantic Similarity
6.1 What Is an Embedding?
- Maps natural language (words, keywords, sentences, paragraphs) into a high-dimensional dense vector space
- Semantically similar texts end up "closer" to each other in that space
- Typical uses: semantic search, RAG, recommendation recall, clustering, deduplication, anomaly detection
6.2 Common Similarity Metrics
- Cosine similarity: measures the angle between vectors; the most common choice, with range [-1, 1]
- Dot product: sensitive to vector magnitude; the metric some embedding models recommend
- Euclidean distance (L2): smaller distance means more similar
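For concreteness, all three metrics fit in a few lines of TypeScript (a minimal sketch; in practice the vector database computes these for you):

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Euclidean (L2) distance: smaller means more similar.
function l2Distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// Cosine similarity: angle between vectors, in [-1, 1]; 1 = same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}
```

A useful fact: for unit-normalized vectors (which many embedding models return), cosine similarity and dot product produce identical rankings.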
6.3 Text Chunking and Context Windows
- Common chunking strategies for LLM/RAG: fixed length, overlapping windows, alignment to sentence/paragraph boundaries
- Goal: keep each retrieved chunk semantically complete while controlling token cost
- Chunks too large → coarse retrieval; too small → fragmented meaning; tune with evaluation
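The fixed-size splitter later in this chapter ignores boundaries; boundary alignment can be sketched like this (splitting on paragraphs and greedily packing them under a size cap; sentence-level alignment follows the same pattern):

```typescript
// Split on blank lines (paragraph boundaries), then greedily pack whole
// paragraphs into chunks of at most maxLen characters, so no chunk ever
// cuts a paragraph in half.
function splitByParagraph(text: string, maxLen = 800): string[] {
  const paras = text.split(/\n{2,}/).map(p => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";
  for (const p of paras) {
    if (current && current.length + p.length + 2 > maxLen) {
      chunks.push(current); // current chunk is full; start a new one
      current = p;
    } else {
      current = current ? current + "\n\n" + p : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Caveat: a single paragraph longer than `maxLen` still becomes one oversized chunk; fall back to fixed-size splitting with overlap for those cases.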
6.4 Retriever Strategies
- TopK nearest neighbors: the baseline (sort by similarity, take the top K)
- MMR (Maximal Marginal Relevance): balances relevance against diversity to reduce redundancy
- Metadata filter: filter on structured metadata (time, author, type, department, ...)
- Rerank: reorder candidates with a stronger model (Cross-Encoder/LLM)
- Hybrid search: fuse BM25 (keyword) with vector search for robustness
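The MMR idea is just a greedy loop over raw vectors and fits in about twenty lines (a sketch; many LangChain.js vector stores also expose an MMR search method directly):

```typescript
// Greedy MMR: at each step pick the candidate maximizing
//   lambda * sim(query, doc) - (1 - lambda) * max sim(doc, alreadySelected)
// so later picks are penalized for being similar to earlier ones.
function cosine(a: number[], b: number[]): number {
  const dot = (x: number[], y: number[]) => x.reduce((s, v, i) => s + v * y[i], 0);
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

function mmr(queryVec: number[], candidates: number[][], k = 5, lambda = 0.7): number[] {
  const selected: number[] = []; // indices into candidates, in pick order
  const remaining = new Set(candidates.map((_, i) => i));
  while (selected.length < k && remaining.size > 0) {
    let best = -1, bestScore = -Infinity;
    for (const i of remaining) {
      const relevance = cosine(queryVec, candidates[i]);
      const redundancy = selected.length
        ? Math.max(...selected.map(j => cosine(candidates[i], candidates[j])))
        : 0;
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; best = i; }
    }
    selected.push(best);
    remaining.delete(best);
  }
  return selected;
}
```

With `lambda = 1` this degenerates to plain TopK; lowering it trades relevance for diversity.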
🛠️ Environment and Dependencies
```bash
# Core dependencies
npm i @langchain/core @langchain/community @langchain/openai
# Optional: local vector store (Chroma)
npm i chromadb
# Optional: Pinecone/Weaviate clients (illustrative)
# npm i @pinecone-database/pinecone weaviate-ts-client
# Dev tooling
npm i -D tsx typescript @types/node dotenv
```
Example `.env`:
```bash
OPENAI_API_KEY=sk-xxx
# PINECONE_API_KEY=...
# WEAVIATE_HOST=...
```
💻 Loading, Cleaning, and Chunking Documents
6.5 Document Loading (example: local Markdown; PDF/URL as extensions)
```typescript
// File: src/ch06/loaders.ts
import fs from "node:fs/promises";
import path from "node:path";

export type RawDoc = { id: string; text: string; meta?: Record<string, any> };

export async function loadMarkdownDir(dir: string): Promise<RawDoc[]> {
  const files = await fs.readdir(dir);
  const docs: RawDoc[] = [];
  for (const f of files) {
    if (!f.endsWith(".md")) continue;
    const full = path.join(dir, f);
    const text = await fs.readFile(full, "utf8");
    docs.push({ id: f, text, meta: { source: full, type: "md" } });
  }
  return docs;
}
// TODO: extend for PDF/URL as needed (e.g. pdf-parse / cheerio scraping); omitted here
```
6.6 Cleaning and Chunking
```typescript
// File: src/ch06/chunk.ts
export type Chunk = { id: string; text: string; meta?: Record<string, any> };

export function clean(text: string): string {
  return text
    .replace(/\r/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .replace(/[\t\u00A0]+/g, " ")
    .trim();
}

export function splitIntoChunks(
  text: string,
  chunkSize = 800,
  overlap = 100
): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < text.length) {
    out.push(text.slice(i, i + chunkSize));
    i += chunkSize - overlap;
  }
  return out;
}

export function makeChunks(
  docs: { id: string; text: string; meta?: Record<string, any> }[],
  chunkSize = 800,
  overlap = 100
): Chunk[] {
  const chunks: Chunk[] = [];
  for (const d of docs) {
    const t = clean(d.text);
    const parts = splitIntoChunks(t, chunkSize, overlap);
    parts.forEach((p, idx) => {
      chunks.push({
        id: `${d.id}#${idx}`,
        text: p,
        meta: { ...(d.meta || {}), chunkIndex: idx },
      });
    });
  }
  return chunks;
}
```
🔢 Vectorization and Storage (Chroma Example)
6.7 Quick Start: OpenAI Embeddings + Chroma
```typescript
// File: src/ch06/chroma-basic.ts
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { makeChunks } from "./chunk";
import { loadMarkdownDir } from "./loaders";
import * as dotenv from "dotenv";
dotenv.config();

export async function buildChromaFromDir(dir = "./docs") {
  const raw = await loadMarkdownDir(dir);
  const chunks = makeChunks(raw, 800, 120);
  const texts = chunks.map(c => c.text);
  // Carry the stable chunk id in the metadata so hits can be traced back
  const metadatas = chunks.map(c => ({ ...c.meta, id: c.id }));
  const db = await Chroma.fromTexts(
    texts,
    metadatas,
    new OpenAIEmbeddings({ model: "text-embedding-3-small" }),
    { collectionName: "docs" }
  );
  return db;
}

if (require.main === module) {
  buildChromaFromDir().then(() => console.log("Chroma ready"));
}
```
6.8 Querying and Similarity Search
```typescript
// File: src/ch06/chroma-query.ts
import { buildChromaFromDir } from "./chroma-basic";

export async function searchChroma(q: string) {
  const db = await buildChromaFromDir();
  const docs = await db.similaritySearch(q, 5); // TopK = 5
  for (const d of docs) {
    console.log("[hit]", d.metadata?.id, d.metadata?.source, "\n", d.pageContent.slice(0, 120), "...\n");
  }
}

if (require.main === module) searchChroma("How do I build a chain with LangChain.js?");
```
6.9 Filtered and MMR Retrieval
```typescript
// File: src/ch06/chroma-advanced.ts
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

export async function chromaAdvanced(q: string) {
  const db = await Chroma.fromExistingCollection(
    new OpenAIEmbeddings({ model: "text-embedding-3-small" }),
    { collectionName: "docs" }
  );
  // Filter: e.g. only retrieve content with type = "md"
  const filter = { type: "md" } as any;
  // Over-fetch TopK, then apply a diversity pass
  const results = await db.similaritySearch(q, 8, filter);
  // Simplified stand-in for MMR: deduplicate by source chunk.
  // Real MMR greedily removes redundancy using the vectors themselves.
  const unique: any[] = [];
  const seen = new Set<string>();
  for (const r of results) {
    const k = r.metadata?.source + "#" + r.metadata?.chunkIndex;
    if (!seen.has(k)) { seen.add(k); unique.push(r); }
    if (unique.length >= 5) break;
  }
  return unique;
}

if (require.main === module) {
  chromaAdvanced("performance optimization essentials").then(r => console.log("hits:", r.length));
}
```
☁️ Pinecone/Weaviate (Interface Sketches)
6.10 Pinecone
```typescript
// File: src/ch06/pinecone.ts
// Sketch only: for real integration see PineconeStore in @langchain/community
// import { Pinecone } from "@pinecone-database/pinecone";
// import { PineconeStore } from "@langchain/community/vectorstores/pinecone";

export async function pineconeDemo() {
  // const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
  // const index = pc.Index("my-index");
  // const store = await PineconeStore.fromTexts([...], [...], embeddings, { pineconeIndex: index });
  // const res = await store.similaritySearch("query", 5);
}
```
6.11 Weaviate
```typescript
// File: src/ch06/weaviate.ts
// Sketch only: for real integration see the community WeaviateStore implementation
// import weaviate from "weaviate-ts-client";
// import { WeaviateStore } from "@langchain/community/vectorstores/weaviate";

export async function weaviateDemo() {
  // const client = weaviate.client({ host: process.env.WEAVIATE_HOST! });
  // const store = await WeaviateStore.fromTexts([...], [...], embeddings, { client, indexName: "docs" });
  // const res = await store.similaritySearch("query", 5);
}
```
🔎 Building Retrievers and Hybrid Search
6.12 A Basic Retriever
```typescript
// File: src/ch06/retriever-basic.ts
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

export type Retrieved = { text: string; source?: string; score?: number; meta?: any };

export async function createRetriever() {
  const store = await Chroma.fromExistingCollection(
    new OpenAIEmbeddings({ model: "text-embedding-3-small" }),
    { collectionName: "docs" }
  );
  return async (q: string, k = 5, filter?: any): Promise<Retrieved[]> => {
    const hits = await store.similaritySearchWithScore(q, k, filter);
    return hits.map(([d, s]) => ({ text: d.pageContent, source: d.metadata?.source, score: s, meta: d.metadata }));
  };
}
```
6.13 Hybrid: Keyword (BM25) + Vector
```typescript
// File: src/ch06/hybrid.ts
import { createRetriever } from "./retriever-basic";

// Toy stand-in for BM25: count occurrences of the raw query.
// The query is escaped so regex metacharacters cannot break the search.
function keywordSearch(query: string, corpus: { id: string; text: string }[], k = 5) {
  const q = query.toLowerCase().replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const scored = corpus.map(d => ({ d, s: (d.text.toLowerCase().match(new RegExp(q, "g")) || []).length }));
  return scored.sort((a, b) => b.s - a.s).slice(0, k).map(x => x.d);
}

export async function hybridSearch(query: string) {
  const retr = await createRetriever();
  const local = [
    { id: "k1", text: "LangChain provides Runnable/Prompt/Agent, etc." },
    { id: "k2", text: "Vector search is often combined with BM25 for robustness" },
  ];
  const [vecHits, kwHits] = await Promise.all([
    retr(query, 5),
    Promise.resolve(keywordSearch(query, local, 3)),
  ]);
  // Naive fusion: treat the vector score as a distance (smaller = closer)
  // and give keyword hits a flat weight.
  const items = [
    ...vecHits.map(h => ({ text: h.text, score: (1 - (h.score ?? 0)) * 0.7, source: h.source })),
    ...kwHits.map(h => ({ text: h.text, score: 0.3, source: "keyword" })),
  ];
  return items.sort((a, b) => b.score - a.score).slice(0, 5);
}

if (require.main === module) {
  hybridSearch("vector search").then(r => console.log(r));
}
```
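The fixed 0.7/0.3 weighting above is the simplest possible fusion. A common score-free alternative is Reciprocal Rank Fusion (RRF), which uses only ranks, so heterogeneous keyword and vector scores never need to be calibrated against each other (a sketch; the constant 60 is the conventional default):

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (c + rank).
// A doc ranked well in several lists accumulates a higher fused score.
function rrfFuse(resultLists: string[][], c = 60, topN = 5): string[] {
  const scores = new Map<string, number>();
  for (const list of resultLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (c + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([id]) => id);
}
```

To use it with the code above, pass the ranked id lists from `keywordSearch` and the vector retriever as `resultLists`.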
6.14 Rerank
```typescript
// File: src/ch06/rerank.ts
// Toy implementation: in practice use a Cross-Encoder or a reranker from
// @langchain/community (e.g. Cohere, Jina).
export async function simpleRerank(query: string, candidates: { text: string }[]) {
  const terms = query.toLowerCase().split(/\s+/);
  return candidates
    .map(c => ({ c, s: terms.reduce((acc, t) => acc + (c.text.toLowerCase().includes(t) ? 1 : 0), 0) }))
    .sort((a, b) => b.s - a.s)
    .map(x => x.c);
}
```
🤖 A Quick RAG Q&A Chain (retrieve → fuse → answer)
6.15 Assembling the QA Chain
```typescript
// File: src/ch06/rag-qa.ts
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { JsonOutputParser } from "@langchain/core/output_parsers";
import { createRetriever } from "./retriever-basic";

// Note: literal braces in a prompt template must be escaped as {{ and }},
// otherwise the template engine treats them as variables.
const answerPrompt = ChatPromptTemplate.fromMessages([
  ["system", `You are a reliable technical assistant. Answer ONLY from the retrieved snippets.
If the material is insufficient, say "I don't know".
Output strictly:
{{
  "answer": string,
  "quotes": string[]
}}`],
  ["human", `Question: {question}\nRetrieved snippets:\n{chunks}\nAnswer:`],
]);

export async function buildQA() {
  const retr = await createRetriever();
  const llm = new ChatOpenAI({ temperature: 0 });
  const parser = new JsonOutputParser<{ answer: string; quotes: string[] }>();
  return async (q: string) => {
    const hits = await retr(q, 6);
    const merged = hits.map(h => `- ${h.text.replace(/\n/g, " ").slice(0, 400)}...`).join("\n");
    const res = await answerPrompt
      .pipe(llm)
      .pipe(parser)
      .invoke({ question: q, chunks: merged });
    return { ...res, rawHits: hits };
  };
}

if (require.main === module) {
  (async () => {
    const qa = await buildQA();
    const out = await qa("How should I design a chunking strategy for vector retrieval?");
    console.log(out);
  })();
}
```
🌐 Shipping in Next.js: Semantic Search API and UI
6.16 API (Route Handler)
```typescript
// File: src/app/api/search/route.ts
import { NextRequest } from "next/server";
import { hybridSearch } from "@/src/ch06/hybrid";

// The Chroma/OpenAI clients rely on Node APIs, so use the Node.js runtime
// rather than "edge".
export const runtime = "nodejs";

export async function POST(req: NextRequest) {
  const { q } = await req.json();
  const items = await hybridSearch(q);
  return Response.json({ ok: true, items });
}
```
6.17 Front-End Page (mobile-friendly)
```tsx
// File: src/app/search/page.tsx
"use client";
import { useState } from "react";

export default function SearchPage() {
  const [q, setQ] = useState("");
  const [items, setItems] = useState<any[]>([]);
  const [loading, setLoading] = useState(false);
  const [err, setErr] = useState("");

  const go = async () => {
    try {
      setLoading(true); setErr("");
      const res = await fetch("/api/search", { method: "POST", body: JSON.stringify({ q }) });
      const data = await res.json();
      setItems(data.items || []);
    } catch (e: any) {
      setErr(e.message);
    } finally {
      setLoading(false);
    }
  };

  return (
    <main className="mx-auto max-w-screen-sm p-4">
      <div className="flex gap-2">
        <input className="flex-1 border rounded px-3 py-2" placeholder="Enter search keywords" value={q} onChange={e => setQ(e.target.value)} />
        <button className="px-4 py-2 bg-blue-600 text-white rounded" onClick={go} disabled={loading}>Search</button>
      </div>
      {err && <p className="text-red-600 mt-2">{err}</p>}
      <ul className="mt-4 space-y-3">
        {items.map((it, i) => (
          <li key={i} className="border rounded p-3">
            <div className="text-sm text-gray-500">Source: {it.source || "hybrid"}</div>
            <div className="mt-1 whitespace-pre-wrap break-words leading-relaxed">{it.text}</div>
          </li>
        ))}
      </ul>
    </main>
  );
}
```
🚀 Project: Enterprise Document Semantic Search + Quick Q&A
6.18 Scenario and Goals
- Scenario: a company has documents from many sources (requirements, designs, APIs, weekly reports, FAQs); employees want to search for answers "as if asking a colleague"
- Goals:
  - High-quality recall (hybrid search + MMR + rerank)
  - Traceability (return cited sources)
  - Mobile-friendly (responsive UI)
  - Recoverable errors (empty retrieval, retry on timeout, graceful degradation)
6.19 Directory Layout
```bash
src/
  ch06/
    loaders.ts           # document loading
    chunk.ts             # cleaning and chunking
    chroma-basic.ts      # vector store
    retriever-basic.ts
    hybrid.ts
    rag-qa.ts
  app/
    api/search/route.ts
    api/qa/route.ts
    search/page.tsx
    qa/page.tsx
```
6.20 QA API (RAG-backed)
```typescript
// File: src/app/api/qa/route.ts
import { NextRequest } from "next/server";
import { buildQA } from "@/src/ch06/rag-qa";

// Node.js runtime: the underlying vector-store clients do not run on Edge.
export const runtime = "nodejs";

export async function POST(req: NextRequest) {
  const { q } = await req.json();
  const qa = await buildQA();
  try {
    const out = await qa(q);
    return Response.json({ ok: true, data: out });
  } catch (e: any) {
    return Response.json({ ok: false, message: e.message }, { status: 500 });
  }
}
```
6.21 Front-End QA Page
```tsx
// File: src/app/qa/page.tsx
"use client";
import { useState } from "react";

export default function QA() {
  const [q, setQ] = useState("");
  const [data, setData] = useState<any>(null);
  const [loading, setLoading] = useState(false);
  const [err, setErr] = useState("");

  const ask = async () => {
    try {
      setLoading(true); setErr("");
      const res = await fetch("/api/qa", { method: "POST", body: JSON.stringify({ q }) });
      const json = await res.json();
      if (!json.ok) throw new Error(json.message || "unknown");
      setData(json.data);
    } catch (e: any) {
      setErr(e.message);
    } finally { setLoading(false); }
  };

  return (
    <main className="mx-auto max-w-screen-sm p-4">
      <div className="flex gap-2">
        <input className="flex-1 border rounded px-3 py-2" placeholder="Ask a question, e.g. How does MMR work in vector retrieval?" value={q} onChange={e => setQ(e.target.value)} />
        <button className="px-4 py-2 bg-black text-white rounded" onClick={ask} disabled={loading}>Ask</button>
      </div>
      {err && <p className="text-red-600 mt-2">{err}</p>}
      {data && (
        <section className="mt-4 space-y-3">
          <h2 className="font-semibold">Answer</h2>
          <div className="whitespace-pre-wrap leading-relaxed">{data.answer}</div>
          <h3 className="font-semibold mt-2">Citations</h3>
          <ul className="list-disc pl-5">
            {data.quotes?.map((t: string, i: number) => (<li key={i}>{t}</li>))}
          </ul>
        </section>
      )}
    </main>
  );
}
```
6.22 Production-Grade Optimization
- Batch vectorization: embed N texts per request with `embeddings.embedDocuments` to cut request overhead
- Caching: cache vectors for identical text (key = text fingerprint) to avoid recomputation
- Filtering and permissions: filter metadata by department/role before retrieval to prevent unauthorized access
- Degradation: fall back to keyword search when the vector store is unavailable; skip reranking when the reranker fails
- Logging and monitoring: track TopK hit rate, QA hit rate, average latency, token cost
- Cost control: pick an appropriate embedding model (e.g. text-embedding-3-small vs. large)
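The fingerprint-cache item can be sketched as a wrapper around any batch embedding function (`EmbedFn` and the in-memory `Map` are illustrative assumptions; in production you would wrap `embeddings.embedDocuments` and back the cache with Redis or similar):

```typescript
import { createHash } from "node:crypto";

type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Cache vectors keyed by a SHA-256 fingerprint of the text, so identical
// chunks are embedded at most once; cache misses still go out in one batch.
function withEmbeddingCache(embed: EmbedFn): EmbedFn {
  const cache = new Map<string, number[]>();
  return async (texts) => {
    const keys = texts.map(t => createHash("sha256").update(t).digest("hex"));
    // Collect unique uncached texts in first-seen order.
    const missing: { key: string; text: string }[] = [];
    const seen = new Set<string>();
    keys.forEach((k, i) => {
      if (!cache.has(k) && !seen.has(k)) {
        seen.add(k);
        missing.push({ key: k, text: texts[i] });
      }
    });
    if (missing.length > 0) {
      const vecs = await embed(missing.map(m => m.text)); // one batched call
      missing.forEach((m, i) => cache.set(m.key, vecs[i]));
    }
    return keys.map(k => cache.get(k)!);
  };
}
```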
6.23 Error-Handling Checklist
- Load failures: path/permission problems → return a specific error plus troubleshooting hints
- Empty retrieval: suggest narrowing the query and show example questions
- Timeouts: retry with exponential backoff, plus a "retry" affordance in the UI
- Output structure: when JSON parsing of the QA output fails, fall back to showing the raw text
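The retry-with-exponential-backoff item amounts to a small helper (a sketch; `withRetry` is a hypothetical name, and attempts/base delay should be tuned to your latency budget):

```typescript
// Retry an async operation with exponential backoff: wait baseMs after the
// first failure, then baseMs * 2, * 4, ... Re-throws the last error when
// all attempts are exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseMs = 200
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      lastErr = e;
      if (i < attempts - 1) {
        await new Promise(r => setTimeout(r, baseMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```

Usage: wrap the flaky call, e.g. `await withRetry(() => retr(q, 5))`.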
📈 Evaluation and Quality
- Build a golden set (real, frequent questions + reference answers + cited sources)
- Metrics: TopK recall, MRR, rerank precision, QA BLEU/ROUGE, human ratings
- A/B tests: chunk size/overlap, TopK, MMR on/off, different embedding models
- Production replay: locate and fix "no answer / low satisfaction" cases (adjust chunking / enrich documents / tune retrieval)
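The two most mechanical metrics above, Recall@K and MRR, are easy to compute over a golden set (a sketch assuming one relevant document id per query):

```typescript
type EvalCase = { relevantId: string; rankedIds: string[] };

// Recall@K: fraction of queries whose relevant doc appears in the top K.
function recallAtK(cases: EvalCase[], k: number): number {
  const hits = cases.filter(c => c.rankedIds.slice(0, k).includes(c.relevantId));
  return hits.length / cases.length;
}

// MRR: mean of 1 / rank of the relevant doc (contributes 0 if absent).
function mrr(cases: EvalCase[]): number {
  const sum = cases.reduce((acc, c) => {
    const idx = c.rankedIds.indexOf(c.relevantId);
    return acc + (idx >= 0 ? 1 / (idx + 1) : 0);
  }, 0);
  return sum / cases.length;
}
```

Run these for each A/B variant (chunk size, TopK, MMR on/off) and compare before shipping a configuration change.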
📚 Further Resources
- LangChain.js: https://js.langchain.com/
- Chroma vector database: https://docs.trychroma.com/
- Pinecone: https://docs.pinecone.io/
- Weaviate: https://weaviate.io/developers/weaviate
- Surveys of dense retrieval theory (search for "dense retrieval survey")
✅ Chapter Summary
- Covered the core concepts of vectorization, chunking strategies, and similarity metrics
- Integrated vector databases with LangChain.js and implemented TopK/MMR/filtered/hybrid search
- Built the "enterprise document semantic search + quick Q&A" project end to end
- Learned how to do production performance optimization, error fallbacks, and quality evaluation
🎯 Next Chapter
In the next chapter, "RAG (Retrieval-Augmented Generation): Architecture Design and Implementation", we will:
- Break RAG down into its retrieval, fusion, and generation layers
- Add reranking, citation extraction, and fact checking to reduce hallucinations
- Build a scalable RAG pipeline with monitoring and evaluation
Thanks for reading! Follow me on my WeChat official account 《鲫小鱼不正经》. Likes, bookmarks, and follows are all appreciated!