Usage Guide
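A vector database is a specialized data store that indexes and retrieves information based on vector representations. These vectors, called embeddings, capture the semantic meaning of the data that was embedded.
Vector databases are often used to search unstructured data such as text, images, and audio, retrieving relevant information by semantic similarity rather than exact keyword matching.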
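To make "semantic similarity" concrete, here is a minimal sketch of how a nearest match can be picked by cosine similarity. The three-dimensional vectors and the names `toy_embeddings` and `most_similar` are made up for illustration; a real embedding model returns vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; the values are invented for this example.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def most_similar(query_vec, store):
    """Return the key in `store` whose vector is closest to `query_vec`."""
    return max(store, key=lambda name: cosine_similarity(query_vec, store[name]))


# Pretend this vector is the embedding of a query about cats.
print(most_similar([1.0, 0.0, 0.0], toy_embeddings))
```

A vector database does the same kind of comparison, but over an index so that it does not have to scan every stored vector.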

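This article shows how to build an intelligent vector database with LangChain + PGVector. LangChain is a framework for developing large language model (LLM) applications and provides a modular toolchain; PGVector is a vector extension for PostgreSQL that supports efficient storage and retrieval of high-dimensional vectors. Together they enable semantic search, question answering, and similar scenarios.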
Tip: this article is best followed in Jupyter notebooks; working through the steps in an interactive environment is a good way to understand them.
I. Configure the LLM environment variables
python
import os
os.environ["OPENAI_BASE_URL"] = "https://dashscope.aliyuncs.com/compatible-mode/v1"  # Alibaba Cloud Model Studio (Bailian) API
os.environ["OPENAI_API_KEY"] = "sk-xxx"  # How to get an API key: https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key
os.environ["DASHSCOPE_API_KEY"] = "sk-xxx"
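Before calling the model it can help to confirm that the variables are actually set, and to avoid printing the key in full. The following is only a sketch; `missing_env_vars` and `mask_secret` are hypothetical helper names, not part of LangChain or the DashScope SDK.

```python
import os

# Variables the OpenAI-compatible client in this article relies on.
REQUIRED_VARS = ("OPENAI_BASE_URL", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def mask_secret(value, visible=3):
    """Keep only the first few characters of a secret, for safe logging."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("API key in use:", mask_secret(os.environ["OPENAI_API_KEY"]))
```

In a shared notebook, reading the key with `getpass.getpass()` instead of writing it into a cell is another option.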
Example LLM call:
python
from langchain.chat_models import init_chat_model
llm = init_chat_model("qwen-plus", model_provider="openai")
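# The prompt is in Chinese: "Write a SolidWorks macro script that draws a cylinder, 30 mm high with a 10 mm radius"
llm.invoke("编写SolidWorks宏脚本,绘制一个圆柱体,高30mm,半径10mm")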
text
AIMessage(content='在SolidWorks中,可以使用VBA(Visual Basic for Applications)编写宏脚本来自动化创建几何体。以下是一个用于绘制圆柱体的VBA宏脚本示例,该圆柱体的高度为30mm,半径为10mm。\n\n### 宏脚本代码\n\n```vba\n\' SolidWorks 宏:创建一个高30mm、半径10mm的圆柱体\nOption Explicit\n\nSub main()\n Dim swApp As SldWorks.SldWorks\n Dim swModel As SldWorks.ModelDoc2\n Dim swPart As SldWorks.PartDoc\n Dim swSketchSegment As SldWorks.SketchSegment\n Dim boolstatus As Boolean\n Dim longstatus As Long, longwarnings As Long\n \n \' 初始化 SolidWorks 应用程序\n Set swApp = Application.SldWorks\n Set swModel = swApp.NewPart\n Set swPart = swModel\n \n \' 设置单位为毫米\n swModel.SetLengthUnit 0 \' 0 表示毫米\n \n \' 创建草图并进入草图模式\n boolstatus = swModel.Extension.SelectByRay(Array(0, 0, 0), Array(0, 0, 1), 0, 0, 0, False, 0, Nothing, 0)\n boolstatus = swModel.SketchManager.InsertSketch(True)\n \n \' 在草图中绘制一个圆\n Set swSketchSegment = swModel.SketchManager.CreateCircle(0, 0, 0, 10, 0, 0) \' 圆心 (0,0,0),半径 10mm\n \n \' 退出草图模式\n boolstatus = swModel.SketchManager.ExitSketch\n \n \' 使用拉伸特征创建圆柱体\n Dim swFeature As SldWorks.Feature\n Dim swBody As SldWorks.Body2\n Set swFeature = swModel.FeatureManager.FeatureExtrusion2(True, False, False, 0, 0, 30, 0, False, False, False, False, 0, 0, False, False, False, False, True, True, False)\n \n \' 更新模型\n swModel.ForceRebuild3 False\n \n \' 提示完成\n MsgBox "圆柱体已创建!高度:30mm,半径:10mm", vbInformation, "SolidWorks 宏"\nEnd Sub\n```\n\n---\n\n### 代码说明\n\n1. **初始化 SolidWorks 应用程序**:\n - `Set swApp = Application.SldWorks` 获取当前 SolidWorks 实例。\n - `Set swModel = swApp.NewPart` 创建一个新的零件文件。\n\n2. **设置单位**:\n - `swModel.SetLengthUnit 0` 将长度单位设置为毫米(`0` 表示毫米)。\n\n3. **创建草图并绘制圆**:\n - 使用 `swModel.SketchManager.InsertSketch` 进入草图模式。\n - 使用 `swModel.SketchManager.CreateCircle` 绘制一个圆,圆心位于 `(0,0,0)`,半径为 `10mm`。\n\n4. **退出草图模式**:\n - 使用 `swModel.SketchManager.ExitSketch` 退出草图模式。\n\n5. **创建拉伸特征**:\n - 使用 `swModel.FeatureManager.FeatureExtrusion2` 方法将草图中的圆拉伸成圆柱体,拉伸高度为 `30mm`。\n\n6. 
**更新模型并提示用户**:\n - 调用 `swModel.ForceRebuild3` 强制重建模型。\n - 使用 `MsgBox` 显示操作完成的消息。\n\n---\n\n### 如何运行宏\n\n1. 打开 SolidWorks。\n2. 点击菜单栏中的 **工具 > 宏 > 新建**。\n3. 将上述代码粘贴到 VBA 编辑器中。\n4. 点击运行按钮(绿色三角形)执行宏。\n\n运行后,您将在 SolidWorks 中看到一个高 30mm、半径 10mm 的圆柱体。', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 903, 'prompt_tokens': 31, 'total_tokens': 934, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}}, 'model_name': 'qwen-plus', 'system_fingerprint': None, 'id': 'chatcmpl-d6faa355-65a2-9130-bb11-af9a547a8448', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--c9146b1b-f45d-4177-8467-a74caa492688-0', usage_metadata={'input_tokens': 31, 'output_tokens': 903, 'total_tokens': 934, 'input_token_details': {'cache_read': 0}, 'output_token_details': {}})
II. Embedding model and vector store
python
pip install -qU langchain-openai
Use the latest text-embedding-v4 embedding model:
python
from langchain_openai import OpenAIEmbeddings
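# check_embedding_ctx_length=False sends the raw text instead of token arrays, which this OpenAI-compatible endpoint rejects
embeddings = OpenAIEmbeddings(model="text-embedding-v4", check_embedding_ctx_length=False)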
Besides OpenAIEmbeddings (most major vendors offer an OpenAI-compatible endpoint), you can also use DashScopeEmbeddings (the vendor's own SDK):
python
pip install dashscope
python
from langchain_community.embeddings import DashScopeEmbeddings
embeddings = DashScopeEmbeddings(model="text-embedding-v4")
Use pgvector as the vector store:
shell
# Start a postgres container with the pgvector extension
sudo docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16

Initialize the pg vector store:
python
pip install -qU langchain_postgres
python
from langchain_postgres import PGVector
# See docker command above to launch a postgres instance with pgvector enabled.
connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain" # Uses psycopg3!
collection_name = "my_docs"
vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
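The connection string above follows the pattern `postgresql+psycopg://user:password@host:port/database`. If the password contains characters such as `@` or `/`, it has to be URL-encoded first. A minimal sketch, where `build_pg_connection` is a hypothetical helper and not part of langchain_postgres:

```python
from urllib.parse import quote_plus


def build_pg_connection(user, password, host, port, database, driver="psycopg"):
    """Assemble a SQLAlchemy-style PostgreSQL URL, URL-encoding the credentials."""
    return (
        f"postgresql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Same values as the docker command in this article (host port 6024).
connection = build_pg_connection("langchain", "langchain", "localhost", 6024, "langchain")
print(connection)
```

The resulting string can be passed as the `connection` argument exactly like the hand-written one.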
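III. Read the txt files and split the documents
The data folder in the current directory contains 812 txt documents. Next we read and parse them.

First, look at the metadata of one document: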
python
from langchain_community.document_loaders import TextLoader
raw_documents = TextLoader('data/Combine_Bodies_Example_CSharp.txt').load()
raw_documents[0]
text
Document(metadata={'source': 'data/Combine_Bodies_Example_CSharp.txt'}, page_content='SOLIDWORKS API Help\nCombine Bodies Example (C#)\nThis example shows how to combine bodies in a multibody part.\n//-------------------------------------------------------------\n// Preconditions:\n// 1. Verify that the part document to open exists.\n// 2. Open the Immediate window.\n//\n// Postconditions:\n// 1. Opens the specified part document.\n// 2. Selects two solid bodies.\n// 3. Inserts a combine feature using the two selected\n// bodies.\n// 4. Prints the type of combine feature to the Immediate\n// window.\n// 5. Examine the Immediate window.\n//\n// NOTE: Because the part document is used elsewhere, do not\n// save changes.\n//--------------------------------------------------------------\n \nusing SolidWorks.Interop.sldworks;\nusing SolidWorks.Interop.swconst;\nusing System.Runtime.InteropServices;\nusing System;\nusing System.Diagnostics;\n \nnamespace CombineBodiesCSharp.csproj\n{\n public partial class SolidWorksMacro\n { \n public void Main()\n {\n ModelDoc2 swModel = default(ModelDoc2);\n ModelDocExtension swModelDocExt = default(ModelDocExtension);\n FeatureManager swFeatureMgr = default(FeatureManager);\n Feature swFeature = default(Feature);\n CombineBodiesFeatureData swCombineBodiesFeatureData = default(CombineBodiesFeatureData);\n string fileName = null;\n bool status = false;\n int errors = 0;\n int warnings = 0;\n \n fileName = "C:\\\\Users\\\\Public\\\\Documents\\\\SOLIDWORKS\\\\SOLIDWORKS 2018\\\\samples\\\\tutorial\\\\multibody\\\\multi_inter.sldprt";\n swModel = (ModelDoc2)swApp.OpenDoc6(fileName, (int)swDocumentTypes_e.swDocPART, (int)swOpenDocOptions_e.swOpenDocOptions_Silent, "", ref errors, ref warnings);\n \n swModelDocExt = (ModelDocExtension)swModel.Extension;\n status = swModelDocExt.SelectByID2("Extrude-Thin1", "SOLIDBODY", 0, 0, 0, true, 0, null, 0);\n status = swModelDocExt.SelectByID2("Boss-Extrude1", "SOLIDBODY", 0, 0, 0, true, 0, null, 
0);\n swModel.ClearSelection2(true);\n status = swModelDocExt.SelectByID2("Extrude-Thin1", "SOLIDBODY", 0, 0, 0, false, 2, null, 0);\n status = swModelDocExt.SelectByID2("Boss-Extrude1", "SOLIDBODY", 0, 0, 0, true, 2, null, 0);\n swFeatureMgr = (FeatureManager)swModel.FeatureManager;\n swFeature = (Feature)swFeatureMgr.InsertCombineFeature((int)swBodyOperationType_e.SWBODYADD, null, null);\n \n swCombineBodiesFeatureData = (CombineBodiesFeatureData)swFeature.GetDefinition();\n status = swCombineBodiesFeatureData.AccessSelections(swModel, null);\n //swCombineBodiesOperationType_e:\n // swCombineBodiesOperationAdd = 0\n // swCombineBodiesOperationCommon = 2\n // swCombineBodiesOperationSubract = 1\n Debug.Print("Type of combine feature: " + swCombineBodiesFeatureData.OperationType);\n swCombineBodiesFeatureData.ReleaseSelectionAccess();\n }\n \n /// <summary>\n /// The SldWorks swApp variable is pre-assigned for you.\n /// </summary>\n public SldWorks swApp;\n }\n}\n')
python
import os
# Recursively collect every file path under `path`
def recursive_listdir(path):
    dirlist = []
    for entry in os.listdir(path):
        full_path = os.path.join(path, entry)
        if os.path.isdir(full_path):
            # Descend into the subdirectory and keep the files found there
            dirlist.extend(recursive_listdir(full_path))
        else:
            # print("Processing: " + full_path)
            dirlist.append(full_path)
    return dirlist
python
dirlist = recursive_listdir('data')
len(dirlist)
812
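As an alternative sketch, the standard library's `os.walk` does the recursion itself and makes it easy to keep only `.txt` files. `list_files` is a hypothetical name, and the temporary directory below stands in for the `data` folder so the example runs anywhere.

```python
import os
import tempfile


def list_files(root, suffix=".txt"):
    """Return the sorted paths of all files under `root` that end with `suffix`."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


# Demo on a throwaway directory tree instead of the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    for rel in ("a.txt", os.path.join("nested", "b.txt"), "notes.md"):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as fh:
            fh.write("demo")
    found = [os.path.relpath(p, tmp) for p in list_files(tmp)]
    print(found)
```

Filtering by suffix keeps stray files such as notebook checkpoints out of the vector store.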
python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
def doc2vectorstore(txt):
    # Load the document, split it into chunks, embed each chunk and load it into the vector store.
    raw_documents = TextLoader(txt).load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    all_splits = text_splitter.split_documents(raw_documents)
    # Index chunks
    _ = vector_store.add_documents(documents=all_splits)
python
for item in dirlist:
    print(item)
    doc2vectorstore(item)
text
data/Get_Wizard_Hole_Standards_Data_Example_CSharp.txt
...
data/Create_TaskPaneView_Add-in_Example_CSharp.txt
Errors encountered along the way:
Error 1: BadRequestError: Error code: 400 - {'error': {'code': 'InvalidParameter', 'param': None, 'message': '<400> InternalError.Algo.InvalidParameter: Value error, contents is neither str nor list of str.: input.contents', 'type': 'InvalidParameter'}, 'id': 'f1161a31-f4b0-9865-9de6-f51f2ebb6c5f', 'request_id': 'f1161a31-f4b0-9865-9de6-f51f2ebb6c5f'}
Fix:
Option 1: set check_embedding_ctx_length=False when initializing OpenAIEmbeddings.
Option 2: use DashScopeEmbeddings; see developer.aliyun.com/article/166...
python
# pip install dashscope
from langchain_community.embeddings import DashScopeEmbeddings
embeddings = DashScopeEmbeddings(
    model="text-embedding-v2",
    # other params...
)
text = "This is a test document."
query_result = embeddings.embed_query(text)
print("Embedding length: ", len(query_result), sep='')
doc_results = embeddings.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!",
    ])
print("Number of embeddings: ", len(doc_results), ", embedding length: ", len(doc_results[0]), sep='')

Model Studio (Bailian) error code reference: help.aliyun.com/zh/model-st...

Error 2:
text
BadRequestError: Error code: 400 - {'error': {'code': 'InvalidParameter', 'param': None, 'message': 'Range of input length should be [1, 8192]', 'type': 'InvalidParameter'}, 'id': '433b3853-b915-9746-bc80-ee14586f7d7f', 'request_id': '433b3853-b915-9746-bc80-ee14586f7d7f'}
According to the official documentation, the input length exceeded the model's limit.
Fix: split the TXT file into several parts before parsing it.
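One way to do that split is sketched below. It cuts on line boundaries and hard-splits any single line that is still too long. `split_by_length` is a hypothetical helper, and it counts characters, which is an assumption: the unit behind the 8192 limit in the error message is not stated here and is likely tokens, so `max_chars` should be set conservatively.

```python
def split_by_length(text, max_chars=8000, separator="\n"):
    """Split `text` into non-empty pieces of at most `max_chars` characters,
    preferring to cut at `separator` boundaries."""
    chunks, current = [], ""
    for line in text.split(separator):
        # A single line longer than the limit has to be cut inside the line.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        candidate = line if not current else current + separator + line
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks


sample = "\n".join(f"line {i}" for i in range(100))
parts = split_by_length(sample, max_chars=50)
print(len(parts), max(len(p) for p in parts))
```

Each piece can then be wrapped in a Document and added to the vector store separately; LangChain's own text splitters are another option when a separator-aware split with overlap is needed.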


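IV. Query the vector store
Filtering support
The vector store supports a set of filters that can be applied to the metadata fields of the documents.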
Operator | Meaning/Category |
---|---|
$eq | Equality (==) |
$ne | Inequality (!=) |
$lt | Less than (<) |
$lte | Less than or equal (<=) |
$gt | Greater than (>) |
$gte | Greater than or equal (>=) |
$in | Special-cased (in) |
$nin | Special-cased (not in) |
$between | Special-cased (between) |
$like | Text (like) |
$ilike | Text (case-insensitive like) |
$and | Logical (and) |
$or | Logical (or) |
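To show what these operators mean, here is a small in-memory evaluator that applies a filter dictionary to one metadata dictionary. It is an illustration of the semantics only, not PGVector's implementation (which translates filters to SQL). The function names, the `size_kb` field, the inclusive `$between`, and the treatment of a bare value as `$eq` are assumptions of this sketch.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% = any run, _ = one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def _like(value, pattern, flags=0):
    return re.fullmatch(like_to_regex(pattern), value, re.DOTALL | flags) is not None


COMPARATORS = {
    "$eq": lambda v, arg: v == arg,
    "$ne": lambda v, arg: v != arg,
    "$lt": lambda v, arg: v < arg,
    "$lte": lambda v, arg: v <= arg,
    "$gt": lambda v, arg: v > arg,
    "$gte": lambda v, arg: v >= arg,
    "$in": lambda v, arg: v in arg,
    "$nin": lambda v, arg: v not in arg,
    "$between": lambda v, arg: arg[0] <= v <= arg[1],
    "$like": lambda v, arg: _like(v, arg),
    "$ilike": lambda v, arg: _like(v, arg, re.IGNORECASE),
}


def matches(metadata, flt):
    """Return True if `metadata` satisfies every condition in `flt`."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in cond):
                return False
        else:
            if key not in metadata:
                return False
            if not isinstance(cond, dict):
                cond = {"$eq": cond}  # bare value treated as equality
            value = metadata[key]
            if not all(COMPARATORS[op](value, arg) for op, arg in cond.items()):
                return False
    return True


meta = {"source": "data/Combine_Bodies_Example_CSharp.txt", "size_kb": 3}
print(matches(meta, {"source": {"$like": "data%"}}))
print(matches(meta, {"$and": [{"size_kb": {"$between": [1, 5]}},
                              {"source": {"$ilike": "%csharp.txt"}}]}))
```

The same dictionaries are what gets passed as the `filter` argument in the queries of this section.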
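1. Query directly
Run a similarity search and filter by metadata:
- query: the search text, here the phrase "virtual assembly"
- k=2: return the 2 most similar results
- filter: a metadata filter; here it matches every file whose source starts with data, i.e. everything under the data directory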
python
results = vector_store.similarity_search(
    query="virtual assembly",
    k=2,
    filter={"source": {"$like": "data%"}}
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
text
* SOLIDWORKS API Help
Insert and Save Virtual Assembly Example (C#)
...
} [{'source': 'data/Insert_and_Save_Virtual_Assembly_Example_CSharp.txt'}]
* SOLIDWORKS API Help
Insert New Instance of Virtual Component (C#)
...
} [{'source': 'data/Insert_New_Instance_of_Virtual_Component_Example_CSharp.txt'}]
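Run a similarity search and get the corresponding scores. With PGVector the score appears to be a distance (cosine distance by default), so a smaller value means a closer match: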
python
results = vector_store.similarity_search_with_score(
    query="dimensions",
    k=1,
    filter={"source": {"$like": "data%"}}
)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")
text
* [SIM=0.597795] SOLIDWORKS API Help
Get Whether Linear Dimension Is Foreshortened Example (C#)
...
} [{'source': 'data/Get_Whether_Linear_Dimension_Is_Foreshortened_Example_CSharp.txt'}]
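2. Query by converting to a retriever
Convert the vector store into a retriever so that it can be used in a chain.
- search_type="similarity_score_threshold" selects similarity search with threshold filtering
- score_threshold: 0.45 sets the minimum similarity score to 0.45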
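The effect of the threshold can be illustrated without a database. The sketch below assumes relevance scores normalized so that higher means more relevant, and keeps only results at or above the threshold; `filter_by_threshold`, its `k` parameter, and the sample scores are hypothetical and only mirror the idea, not the retriever's internals.

```python
def filter_by_threshold(scored_docs, score_threshold, k=4):
    """Keep at most `k` (doc, score) pairs with score >= score_threshold, best first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Invented scores for three candidate documents.
candidates = [("doc_a", 0.80), ("doc_b", 0.30), ("doc_c", 0.50)]
print(filter_by_threshold(candidates, score_threshold=0.45))
```

A higher threshold returns fewer but more relevant documents; if it is set too high the retriever may return nothing at all.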
python
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.45},
)
retriever.invoke("Linear Dimension")
text
[Document(id='b345c841-e6d8-4386-91ab-3ad2cdd01111', metadata={'source': 'data/Get_Whether_Linear_Dimension_Is_Foreshortened_Example_CSharp.txt'}, page_content='SOLIDWORKS API Help\nGet Whether Linear Dimension Is Foreshortened Example (C#)\nThis example shows how to get whether a linear dimension is foreshortened.\n//------------------------------------------------------------------\n// Preconditions:\n// 1. Open public_documents\\samples\\tutorial\\api\\chair.slddrw.\n// 2. Click Tools > Options > Document Properties > Dimensions > Linear.\n// 3. Click ANSI in Base linear dimension standard.\n// 4. Verify that the following check box and option are selected in\n// Foreshortened:\n// * Automatic \n// * Zigzag \n// 5. Click OK.\n// 6. Dimension an outside linear edge.\n//\n// Postconditions:\n// 1. Gets whether the dimension is foreshortened.\n// 2. Examine the Immediate window and drawing.\n//\n// NOTES:\n// * Foreshortened dimensions are only valid for\n// linear dimensions and only when the detailing standard\n// is ANSI.\n// * Because the part and drawing are used elsewhere, do not\n// save changes.\n//-------------------------------------------------------------------\nusing SolidWorks.Interop.sldworks;\nusing SolidWorks.Interop.swconst;\nusing System.Runtime.InteropServices;\nusing System;\nusing System.Diagnostics;\n \nnamespace Macro1CSharp.csproj\n{\n public partial class SolidWorksMacro\n {\n \n \n public void Main()\n {\n \n ModelDoc2 swModel = default(ModelDoc2);\n SelectionMgr swSelectionMgr = default(SelectionMgr);\n DisplayDimension swDisplayDimension = default(DisplayDimension);\n \n swModel = (ModelDoc2)swApp.ActiveDoc;\n swSelectionMgr = (SelectionMgr)swModel.SelectionManager;\n swDisplayDimension = (DisplayDimension)swSelectionMgr.GetSelectedObject6(1, -1);\n Debug.Print("Foreshortened dimension? 
" + swDisplayDimension.Foreshortened);\n \n swModel.ClearSelection2(true);\n \n }\n \n /// <summary>\n /// The SldWorks swApp variable is pre-assigned for you.\n /// </summary>\n public SldWorks swApp;\n }\n}')]
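V. About inaccurate query results
The most likely cause is the embedding model, so consider switching to a stronger one. The second is the search conditions: each vector database has its own advanced search features, so learn them from the official documentation and apply them through filter to improve search accuracy.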
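To judge whether a different embedding model or filter actually helps, a handful of queries with known correct documents can be scored before and after the change. The sketch below computes a simple hit rate at k. `hit_rate_at_k`, `fake_search`, and the file names are hypothetical; in practice `search` could wrap `vector_store.similarity_search(query, k=k)` and return the `source` value of each result's metadata.

```python
def hit_rate_at_k(search, labelled_queries, k=2):
    """Fraction of queries whose expected source appears in the top-k results.

    `search(query, k)` must return a list of source identifiers.
    """
    if not labelled_queries:
        raise ValueError("labelled_queries must not be empty")
    hits = sum(
        1 for query, expected in labelled_queries if expected in search(query, k)
    )
    return hits / len(labelled_queries)


# Stand-in for a real vector store: fixed, invented results per query.
FAKE_RESULTS = {
    "virtual assembly": ["data/a.txt", "data/b.txt"],
    "dimensions": ["data/c.txt", "data/d.txt"],
}


def fake_search(query, k):
    return FAKE_RESULTS.get(query, [])[:k]


labelled = [
    ("virtual assembly", "data/a.txt"),
    ("dimensions", "data/d.txt"),
    ("sketch", "data/e.txt"),
]
print(hit_rate_at_k(fake_search, labelled, k=2))
```

Running the same labelled queries against each configuration gives a number to compare instead of relying on impressions from a few manual searches.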