LangChain: Retrieving YouTube Video Transcripts

Task Overview

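Goal: starting from the existing script Langchian检索YouTube视频字幕.py, build a step-by-step tutorial that shows how to retrieve video transcripts with a Qwen model and a Chroma vector store.
Core capabilities: loading a persisted vector database, using structured LLM output to generate a retrieval instruction, and running a similarity search with optional conditions.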

Task Details

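Vector store: Chroma, with DashScopeEmbeddings as the embedding model and ./chroma_data_dir as the data directory.
LLM: ChatOpenAI pointed at DashScope's OpenAI-compatible endpoint for Qwen; the system prompt turns a natural-language question into a database query.
Data structure: the Search Pydantic model holds query and publish_year and constrains the format of the LLM output.
Privacy: the api_key in the examples is the placeholder YOUR_DASHSCOPE_API_KEY; in real use, read the key from a secure configuration source.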
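
The privacy note can be put into practice with a small helper. This is a minimal sketch, assuming the key is stored in an environment variable named DASHSCOPE_API_KEY; both that variable name and the helper `load_api_key` are assumptions for illustration and are not part of the original script.

```python
import os


def load_api_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    # Reject both an unset variable and the tutorial placeholder.
    if not key or key == "YOUR_DASHSCOPE_API_KEY":
        raise RuntimeError(f"Set the {env_var} environment variable before running the script.")
    return key
```

With this in place, `qwen_api_key = load_api_key()` could replace the hard-coded placeholder, so the key never appears in the source file.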

Solution Logic

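Initialize the Qwen LLM and the DashScope embeddings, pointing the LLM at DashScope's OpenAI-compatible endpoint.
Load the persisted Chroma vector store through persist_directory, so the corpus does not need to be rebuilt.
Build the chain from ChatPromptTemplate and RunnablePassthrough, and use with_structured_output(Search) to obtain a structured retrieval instruction.
In retrieval, build a Chroma filter ({"$eq": year}) when publish_year is set, then call similarity_search.
Connect user question → LLM instruction → vector retrieval as new_chain, which finally returns a list of titles and years for display or further processing.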
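
The filter step above can be isolated and checked without a vector store. The following is a minimal sketch: `SearchStub` is a hypothetical dataclass standing in for the Pydantic Search model, and `build_filter` is an illustrative helper name, neither of which appears in the original script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchStub:
    """Stand-in for the Pydantic Search model (hypothetical, for illustration)."""
    query: str
    publish_year: Optional[int] = None


def build_filter(search: SearchStub) -> Optional[dict]:
    """Return a Chroma-style metadata filter, or None when no year was requested."""
    if search.publish_year is None:
        return None
    return {"publish_year": {"$eq": search.publish_year}}


print(build_filter(SearchStub(query="RAG", publish_year=2023)))  # {'publish_year': {'$eq': 2023}}
print(build_filter(SearchStub(query="build RAG agent")))         # None
```

Returning None when no year is given means the similarity search runs unfiltered, which matches the behaviour of retrieval in the script.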

Key Code Snippets

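from langchain_chroma import Chroma
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_openai import ChatOpenAI

# Placeholder only; load the real key from a secure source.
qwen_api_key = 'YOUR_DASHSCOPE_API_KEY'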

model = ChatOpenAI(
    model='qwen-turbo',
    api_key=qwen_api_key,
    base_url='https://dashscope.aliyuncs.com/compatible-mode/v1'
)

embeddings = DashScopeEmbeddings(
    model='text-embedding-v1',
    dashscope_api_key=qwen_api_key
)

vectorstore = Chroma(
    persist_directory='./chroma_data_dir',
    embedding_function=embeddings
)

class Search(BaseModel):
    # query is required; publish_year is set only when the question names a year
    query: str = Field(..., description='Similarity search query applied to video transcripts.')
    publish_year: Optional[int] = Field(None, description='Year video was published')

# prompt is the ChatPromptTemplate defined in the full code
chain = {'question': RunnablePassthrough()} | prompt | model.with_structured_output(Search)

def retrieval(search: Search) -> list[Document]:
    _filter = None
    if search.publish_year is not None:
        # "$eq" is Chroma's equality operator for metadata filters
        _filter = {'publish_year': {"$eq": search.publish_year}}
    return vectorstore.similarity_search(search.query, filter=_filter)

new_chain = chain | retrieval
print([(doc.metadata['title'], doc.metadata['publish_year']) for doc in new_chain.invoke('RAG tutorial')])

Text Flowchart

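Load configuration: read the API key and initialize the Qwen model and the DashScope embeddings.
Load the vector store: open the local Chroma data through persist_directory so it is ready for retrieval.
Build the chain: combine Prompt → Runnable → LLM into a structured-output chain that always produces a Search object.
Generate the retrieval instruction: the LLM turns the user's question into a query and an optional publish_year.
Apply the filter: if a year is present, build the {"$eq": year} filter; otherwise query without a filter.
Run the similarity search: call vectorstore.similarity_search to fetch the relevant transcript chunks.
Assemble the output: return a list of (title, publish_year) pairs for display or further analysis.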
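
The flow above can be traced end to end with standard-library stand-ins. This is only a sketch of the data flow: `fake_llm` replaces the structured-output LLM with a regex that pulls out a year, and `fake_similarity_search` replaces the embedding search with a shared-word count. All names and sample documents here are hypothetical and do not come from the original script.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Search:
    query: str
    publish_year: Optional[int] = None


def fake_llm(question: str) -> Search:
    """Stand-in for the structured-output LLM: extract a 4-digit year if one is present."""
    match = re.search(r"\b(19|20)\d{2}\b", question)
    year = int(match.group()) if match else None
    return Search(query=question, publish_year=year)


def fake_similarity_search(docs, query, year=None, k=2):
    """Stand-in for vectorstore.similarity_search: filter by year, rank by shared words."""
    words = set(query.lower().split())
    candidates = [d for d in docs if year is None or d.metadata["publish_year"] == year]
    ranked = sorted(
        candidates,
        key=lambda d: len(words & set(d.page_content.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def run(question, docs):
    """Question -> retrieval instruction -> filtered search -> (title, year) pairs."""
    search = fake_llm(question)
    hits = fake_similarity_search(docs, search.query, search.publish_year)
    return [(d.metadata["title"], d.metadata["publish_year"]) for d in hits]


DOCS = [
    Doc("building a rag agent step by step", {"title": "RAG agents", "publish_year": 2024}),
    Doc("rag from scratch overview", {"title": "RAG basics", "publish_year": 2023}),
    Doc("streaming events from chains", {"title": "Streaming", "publish_year": 2023}),
]

print(run("videos on rag published in 2023", DOCS))
```

In the real chain, the two stand-ins correspond to model.with_structured_output(Search) and vectorstore.similarity_search; the surrounding control flow is the same.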

Summary

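Structured output keeps the retrieval instruction stable and easy to interpret.
Combining DashScope Qwen with a persisted Chroma store lets the existing vector database be reused directly, so the transcripts do not need to be split again.
Metadata filtering supports conditional retrieval on fields such as the publish year, which makes it possible to narrow the tutorial videos precisely.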
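
Filtering on the year only works if publish_year was written into each document's metadata before indexing. Below is a minimal sketch of that derivation, assuming publish_date is a string in the '%Y-%m-%d %H:%M:%S' format used by the ingestion code; `add_publish_year` is a hypothetical helper name.

```python
import datetime


def add_publish_year(metadata: dict) -> dict:
    """Return a copy of the metadata with an integer publish_year derived from publish_date."""
    published = datetime.datetime.strptime(metadata["publish_date"], "%Y-%m-%d %H:%M:%S")
    return {**metadata, "publish_year": published.year}


print(add_publish_year({"title": "demo", "publish_date": "2023-05-01 00:00:00"}))
```

Storing the year as an integer matters because the {"$eq": year} filter compares it against the integer publish_year produced by the Search model.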

Full Code

import datetime  # used by the commented-out ingestion code
from typing import Optional

from langchain_chroma import Chroma
from langchain_community.embeddings import DashScopeEmbeddings  # DashScope (Qwen) embedding model
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter  # used by the commented-out ingestion code
from pydantic import BaseModel, Field


# Qwen (Tongyi Qianwen) API key
qwen_api_key = 'YOUR_DASHSCOPE_API_KEY'  # replace with your DashScope API key

# Create the Qwen LLM
# Available models: qwen-turbo, qwen-plus, qwen-max, qwen-max-longcontext
model = ChatOpenAI(
    model='qwen-turbo',  # switch to qwen-plus or qwen-max if needed
    api_key=qwen_api_key,
    base_url='https://dashscope.aliyuncs.com/compatible-mode/v1'
)

embeddings = DashScopeEmbeddings(
    model='text-embedding-v1',  # DashScope embedding model
    dashscope_api_key=qwen_api_key
)

persist_dir = './chroma_data_dir'  # directory that holds the vector database

# 1. Fetch the transcripts from YouTube first, then persist them locally.
# A few YouTube video links
# urls = [
#     "https://www.youtube.com/watch?v=HAn9vnJy6S4",
#     "https://www.youtube.com/watch?v=dA1cHGACXCo",
#     "https://www.youtube.com/watch?v=ZcEMLz27sL4",
#     "https://www.youtube.com/watch?v=hvAPnpSfSGo",
# ]

# # List of Document objects
# docs = []
# from langchain_community.document_loaders import YoutubeLoader

# for url in urls:
#     # Each YouTube video yields one Document.
#     # add_video_info=True is needed for the title and publish_date metadata used below.
#     docs.extend(YoutubeLoader.from_youtube_url(url, add_video_info=True).load())

# print(len(docs))
# print(docs[0])

# # Add extra metadata to each doc: the year the video was published

# for doc in docs:
#     doc.metadata['publish_year'] = datetime.datetime.strptime(
#         doc.metadata['publish_date'], '%Y-%m-%d %H:%M:%S'
#     ).year

# print(docs[0].metadata)
# # Transcript of the first video
# print(docs[0].page_content[:500])

# text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=30)
# split_doc = text_splitter.split_documents(docs)
# 2. Persist the vector database
# vectorstore = Chroma.from_documents(split_doc, embeddings, persist_directory=persist_dir)  # also writes the vector database to disk


# Load the vector database from disk
vectorstore = Chroma(persist_directory=persist_dir, embedding_function=embeddings)

result = vectorstore.similarity_search_with_score('how do I build a RAG agent')
print(result[0])
print(result[0][0].metadata['publish_year'])

from langchain_core.prompts import ChatPromptTemplate

system = """You are an expert at converting user questions into database queries. \
You have access to a database of tutorial videos about a software library for building LLM-powered applications. \
Given a question, return a list of database queries optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)


from langchain_core.runnables import RunnablePassthrough

# Pydantic model

class Search(BaseModel):
    """
    Data model for a retrieval instruction.
    """
    # Similarity query text and optional publish year
    query: str = Field(..., description='Similarity search query applied to video transcripts.')
    publish_year: Optional[int] = Field(None, description='Year video was published')

chain = {'question': RunnablePassthrough()} | prompt | model.with_structured_output(Search)

resp1 = chain.invoke('how do i build a RAG agent?')
# print(resp1)
# query='build RAGagent' publish_year=None
resp2 = chain.invoke('videos on RAG published in 2023')
# print(resp2)
# query='RAG' publish_year=2023

# Up to this point, the chain has produced the instruction to run against the vector database.

# Execute the search according to that instruction.

def retrieval(search: Search) -> list[Document]:
    _filter = None
    if search.publish_year is not None:
        # Build a filter condition when publish_year is present.
        # "$eq" is Chroma's equality operator for metadata filters.
        _filter = {'publish_year': {"$eq": search.publish_year}}
    return vectorstore.similarity_search(search.query, filter=_filter)

new_chain = chain | retrieval

# result = new_chain.invoke('videos on RAG published in 2023')
result = new_chain.invoke('RAG tutorial')
print([(doc.metadata['title'], doc.metadata['publish_year']) for doc in result])
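
Because the store is built from transcript chunks rather than whole videos, the final list may repeat the same (title, publish_year) pair when several chunks of one video match. A minimal sketch of collapsing those repeats follows; `unique_videos` is a hypothetical helper and the sample pairs are made up for illustration.

```python
def unique_videos(pairs):
    """Drop repeated (title, publish_year) pairs while keeping first-seen order."""
    # dict keys are unique and preserve insertion order, so this keeps the ranking.
    return list(dict.fromkeys(pairs))


hits = [("RAG basics", 2023), ("RAG basics", 2023), ("RAG agents", 2024)]
print(unique_videos(hits))  # [('RAG basics', 2023), ('RAG agents', 2024)]
```

Applied to the script, the list comprehension in the last line could be passed through `unique_videos` before printing.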