Building Your Knowledge Base with InternLM and LangChain

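This article uses InternLM and LangChain to build a knowledge base on the InternStudio compute platform.

InternStudio (OpenAIDE)[1] is a cloud-based integrated development environment for algorithm developers and researchers. Through its Container Instances, Image Center, Distributed Training, and Public Datasets modules, it provides the three essentials of deep learning model training (compute, algorithms, and data), making algorithm development simpler and more convenient. In my view, InternStudio is the solid foundation of the full-chain open-source ecosystem around the InternLM (书生·浦语) large models, and it lets us move from theory to hands-on practice on a development machine.

InternLM (书生·浦语)[2] is a multilingual foundation model at the hundred-billion-parameter scale that can handle English, Chinese, code, and more. It performs well on language understanding and reasoning tasks. InternLM also provides chat versions of the model that can hold high-quality, safe, and ethical conversations with people.

LangChain[3] offers flexible abstractions and a rich toolkit that let developers build context-aware, reasoning applications on top of large language models (LLMs).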

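Preparation

First, register on the InternStudio compute platform and create a development machine. I already created one in an earlier hands-on article. If you run into problems, feel free to contact me, or consult the Q&A document: InternLM (书生·浦语) LLM Bootcamp Q&A document[4].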

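Create and activate the Conda environment

bash # enter bash
/root/share/install_conda_env_internlm_base.sh InternLM # create the environment (PyTorch 2.0.1 by default)
conda activate InternLM # activate the environment

Install Python dependencies

Upgrade pip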

python -m pip install --upgrade pip
pip install modelscope==1.9.5
pip install transformers==4.35.2
pip install streamlit==1.24.0
pip install sentencepiece==0.1.99
pip install accelerate==0.24.1
pip install langchain==0.0.292
pip install gradio==4.4.0
pip install chromadb==0.4.15
pip install sentence-transformers==2.2.2
pip install unstructured==0.10.30
pip install markdown==3.3.7
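
Because every dependency above is pinned, a quick check that the environment really matches can save debugging time later. The following is a minimal sketch using only the standard library. The helper name `check_versions` is made up for this illustration; the version table is copied from the pip commands above.

```python
from importlib import metadata

# Versions pinned in the pip commands above.
PINNED = {
    "modelscope": "1.9.5",
    "transformers": "4.35.2",
    "streamlit": "1.24.0",
    "sentencepiece": "0.1.99",
    "accelerate": "0.24.1",
    "langchain": "0.0.292",
    "gradio": "4.4.0",
    "chromadb": "0.4.15",
    "sentence-transformers": "2.2.2",
    "unstructured": "0.10.30",
    "markdown": "3.3.7",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means the installed versions match the pins. The `get_version` parameter exists only so the function can be exercised without touching the real environment.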
Model download

Use the model already available in the local environment

mkdir -p /root/data/model/Shanghai_AI_Laboratory
cp -r /root/share/temp/model_repos/internlm-chat-7b /root/data/model/Shanghai_AI_Laboratory/internlm-chat-7b

Alternatively, download it from ModelScope or HuggingFace, for example from ModelScope:

python -c "from modelscope import snapshot_download; model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-chat-7b', cache_dir='/root/data/model', revision='v1.0.3')"

Embedding model download

Here we use huggingface_hub to download the Sentence-Transformers model

pip install -U huggingface_hub
python -c "import os; os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'; os.system('huggingface-cli download --resume-download sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 --local-dir /root/data/model/sentence-transformer')"

The hf-mirror.com mirror site is used here

(Figure: Sentence-Transformers)
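
The one-liner above relies on `os.system`, which does not raise if the download fails. As a hedged alternative, the same command can be run through `subprocess` with the mirror endpoint passed via the environment. The function names here are hypothetical; the CLI flags, repository id, and target directory are the ones from the command above.

```python
import os
import subprocess


def build_download_command(repo_id, local_dir):
    """Build the huggingface-cli invocation used in the one-liner above."""
    return [
        "huggingface-cli", "download", "--resume-download",
        repo_id, "--local-dir", local_dir,
    ]


def mirror_env(endpoint="https://hf-mirror.com", base_env=None):
    """Return a copy of the environment with HF_ENDPOINT pointing at the mirror."""
    env = dict(os.environ if base_env is None else base_env)
    env["HF_ENDPOINT"] = endpoint
    return env


def download(repo_id, local_dir):
    """Run the download; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_download_command(repo_id, local_dir),
                   env=mirror_env(), check=True)
```

Calling `download("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", "/root/data/model/sentence-transformer")` would reproduce the step above, with the difference that a failed download surfaces as an exception.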

Download NLTK resources

Clone them from a Gitee mirror so that the download does not fail because of network restrictions

cd /root
git clone https://gitee.com/yzy0612/nltk_data.git --branch gh-pages
cd nltk_data
mv packages/* ./
cd tokenizers
unzip punkt.zip
cd ../taggers
unzip averaged_perceptron_tagger.zip

(Figure: NLTK)
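
After the unzip steps, the two resources should sit in their own subdirectories of `/root/nltk_data`. The sketch below checks for that layout. It assumes each archive unpacks into a directory named after itself (`punkt`, `averaged_perceptron_tagger`), and the helper name is made up for this example.

```python
from pathlib import Path

# Directories expected after the unzip steps above, relative to nltk_data.
EXPECTED = ("tokenizers/punkt", "taggers/averaged_perceptron_tagger")


def missing_nltk_resources(nltk_root, expected=EXPECTED):
    """Return the expected resource directories that do not exist under nltk_root."""
    root = Path(nltk_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_nltk_resources("/root/nltk_data")
    print("missing:", missing if missing else "none")
```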

Data collection (using InternLM-related open-source materials as an example)

Go to the data disk

cd /root/data

Clone the following open-source repositories

git clone --depth=1 https://github.com/InternLM/tutorial
git clone --depth=1 https://gitee.com/open-compass/opencompass.git
git clone --depth=1 https://gitee.com/InternLM/lmdeploy.git
git clone --depth=1 https://gitee.com/InternLM/xtuner.git
git clone --depth=1 https://gitee.com/InternLM/InternLM-XComposer.git
git clone --depth=1 https://gitee.com/InternLM/lagent.git
git clone --depth=1 https://gitee.com/InternLM/InternLM.git

(Figure: data collection)
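
Once the repositories are cloned, the files that will feed the knowledge base have to be picked out. Here is a minimal sketch of that step. It assumes only Markdown and plain-text files are of interest, which this article does not state explicitly, and `collect_files` is a hypothetical helper name.

```python
import os


def collect_files(root_dir, suffixes=(".md", ".txt")):
    """Recursively list files under root_dir whose names end with one of suffixes."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith(suffixes):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    for path in collect_files("/root/data/tutorial"):
        print(path)
```

Returning a sorted list keeps the order stable between runs, which makes it easier to compare two builds of the database.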

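Building the knowledge base

We follow three steps: data processing, building a custom LangChain LLM class (based on InternLM-Chat-7B), and building a chat application with Gradio. Later applications of our own can follow the same approach: swapping in different data and a different model yields an LLM application for a specific scenario. For the complete code, see internlm-langchain-demo[5]. The code is thoroughly commented, so this article covers only the key points.

Data processing (create_db.py)

When using LangChain, we build a vector database and persist it to disk to make retrieval over the corpus more efficient and more accurate. A vector database represents documents as vectors in a vector space, and it quickly finds the document chunks relevant to a user's input by computing the similarity between vectors. Building it requires splitting the corpus into chunks, embedding them, and indexing them, which costs time and compute, so saving the finished database to disk avoids repeating this work on every run. At load time we only need to read the persisted database from disk, run a similarity search on the user's input to obtain the relevant chunks, and pass those chunks to the LLM to get the final answer. Here we use the open-source vector database Chroma[6].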


# First import the required third-party libraries
# ...
from langchain.vectorstores import Chroma
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
# ...

# Load the open-source embedding model
embeddings = HuggingFaceEmbeddings(model_name="/root/data/model/sentence-transformer")

# Build the vector database
# Define the persistence path
persist_directory = 'data_base/vector_db/chroma'
# Build the database from the split documents
# (split_docs comes from the loading and splitting code elided above)
vectordb = Chroma.from_documents(
    documents=split_docs,
    embedding=embeddings,
    persist_directory=persist_directory
)
# Persist the vector database to disk
vectordb.persist()

(Figure: create_db.py)

We only need to run python create_db.py once to create the vector database.

(Figure: building the persistent vector database)
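
To make the idea of similarity retrieval concrete, here is a toy sketch in plain Python. It is not how Chroma is implemented: the three-dimensional vectors are invented for the example, and cosine similarity is used purely for illustration (the metric a real vector database uses depends on its configuration).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _score, doc_id in scored[:k]]


# Made-up embeddings for three document chunks.
docs = {
    "install": [1.0, 0.0, 0.0],
    "finetune": [0.0, 1.0, 0.0],
    "deploy": [0.7, 0.0, 0.7],
}
print(top_k([1.0, 0.0, 0.1], docs, k=2))  # prints ['install', 'deploy']
```

In the real pipeline the vectors come from the embedding model, and the database uses an index rather than scoring every document in a loop, but the ranking principle is the same.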

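Building the InternLM LLM class (LLM.py)

LangChain supports custom LLMs, which is what we usually mean by using a local large model. A custom LLM only needs to implement two required items:

A _call method that takes a string and some optional stop words, and returns a string.
A _llm_type property that returns a string. It is used only for logging.

It also supports one optional item:

An _identifying_params property, used to help print the custom LLM class.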

...

from langchain.llms.base import LLM
from typing import Any, List, Optional
from langchain.callbacks.manager import CallbackManagerForLLMRun

...

```
class InternLM_LLM(LLM):
    # Custom LLM class based on a local InternLM model.
    # self.model and self.tokenizer are set up in code not shown in this excerpt.
    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              run_manager: Optional[CallbackManagerForLLMRun] = None,
              **kwargs: Any):
        # Override the call function
        system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
"""
        messages = [(system_prompt, '')]
        response, history = self.model.chat(self.tokenizer, prompt, history=messages)
        return response

    @property
    def _llm_type(self) -> str:
        return "InternLM"
```

The implementation of a custom LLM class differs somewhat across LangChain versions; see the official LangChain documentation: Custom LLM[7]

(Figure: LLM.py)

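Building the app with Gradio (run_gradio.py)

The retrieval QA chain is a core LangChain module. Given a user query, it retrieves relevant documents from the vector store and uses the language model to generate an answer. Building a retrieval QA chain takes the following steps:

Create the vector store. We can use the Chroma module that LangChain provides, or choose another suitable vector database.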

...

from langchain.vectorstores import Chroma
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

...

Load the database:
```
  vectordb = Chroma(
      persist_directory=persist_directory,  # the persistence path defined earlier
      embedding_function=embeddings
  )
```
Create the retrieval QA chain. We can use the RetrievalQA module that LangChain provides, or define the chain type and parameters ourselves.

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
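
Conceptually, a retrieval QA chain of this kind retrieves documents, places them into a single prompt together with the question, and calls the model. The sketch below imitates that flow in plain Python so the data flow is visible. It is not LangChain's implementation: `answer_with_retrieval`, the `retrieve` and `llm` callables, and the prompt wording are all assumptions made for this example. Only the output keys `result` and `source_documents` are chosen to mirror what RetrievalQA returns when `return_source_documents=True`.

```python
from typing import Callable, Dict, List

# Hypothetical prompt; the real QA_CHAIN_PROMPT is defined in run_gradio.py.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)


def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
    template: str = PROMPT_TEMPLATE,
) -> Dict[str, object]:
    """Retrieve documents, put them all into one prompt, and call the model."""
    documents = retrieve(question)
    prompt = template.format(context="\n\n".join(documents), question=question)
    return {"result": llm(prompt), "source_documents": documents}


if __name__ == "__main__":
    fake_retrieve = lambda q: ["InternLM is developed by Shanghai AI Laboratory."]
    fake_llm = lambda prompt: f"(model saw {len(prompt)} characters)"
    print(answer_with_retrieval("Who develops InternLM?", fake_retrieve, fake_llm))
```

Swapping `fake_retrieve` for a function backed by the vector database and `fake_llm` for the local model gives the same structure as the chain above.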

Create the Gradio app:


#...
import gradio as gr
#...
block = gr.Blocks()
with block as demo:
  #...
gr.close_all()
# Launch a new Gradio app with sharing enabled, using the PORT1 environment variable as the server port.
# demo.launch(share=True, server_port=int(os.environ['PORT1']))
# Launch directly
demo.launch()

(Figure: run_gradio.py)

We only need to run python run_gradio.py to deploy our own dedicated knowledge base.

(Figure: run gradio)
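
The part elided inside the Blocks context typically wires a text box and a chat window to a handler function. Below is a sketch of what such a handler might look like, kept free of Gradio so it can be tested on its own. The name `respond`, the `answer_fn` parameter, and the history format (a list of question/answer pairs) are assumptions for this example, not the code from run_gradio.py.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def respond(question: str, history: History,
            answer_fn: Callable[[str], str]) -> Tuple[str, History]:
    """Handle one chat turn: return an empty textbox value and the updated history."""
    if not question or not question.strip():
        # Ignore empty input and leave the history unchanged.
        return "", history
    try:
        answer = answer_fn(question)
    except Exception as error:
        # Show the failure in the chat instead of crashing the app.
        answer = f"Error: {error}"
    return "", history + [(question, answer)]


if __name__ == "__main__":
    _, chat = respond("What is InternLM?", [], lambda q: "A language model.")
    print(chat)
```

In the real app, `answer_fn` would call the retrieval QA chain and return its answer text.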

References

[1] InternStudio (OpenAIDE): https://studio.intern-ai.org.cn/
[2] InternLM (书生·浦语): https://internlm.org/
[4] InternLM (书生·浦语) LLM Bootcamp Q&A document: https://cguue83gpz.feishu.cn/docx/Noi7d5lllo6DMGxkuXwclxXMn5f
[6] Chroma: https://www.trychroma.com/
