Environment:
conda create -n rag_new python=3.10 -y
conda activate rag_new
pip install langchain
pip install langchain-community
pip install langchain-core
pip install unstructured
pip install "unstructured[md]"
pip install "unstructured[image]"
pip install "unstructured[ppt]"
pip install pytesseract  # OCR text recognition
pip install python-magic-bin  # file type detection
pip install chardet  # character encoding detection
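After the installs finish, a quick import check can confirm that the packages are visible inside the rag_new environment. The following is a minimal sketch and not part of the original notes: the helper name missing_modules is hypothetical, and the import names in the list (for example magic for python-magic-bin) are assumed to be the usual ones for these packages.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed above.
    wanted = ["langchain", "langchain_community", "langchain_core",
              "unstructured", "pytesseract", "magic", "chardet"]
    missing = missing_modules(wanted)
    print("missing:", missing if missing else "none")
```

Run it with the rag_new environment activated; any name it reports as missing points at an install that did not succeed.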
Run the code:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
loader = DirectoryLoader("./数据",
                         silent_errors=True,
                         loader_kwargs={'autodetect_encoding': True})
docs = loader.load()
print(len(docs))
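Because silent_errors=True lets the loader skip files that fail to load, the printed count can be lower than the number of files in the folder. The following is a minimal sketch, using only the standard library, that counts the files by extension so the two numbers can be compared. The helper name count_by_suffix is hypothetical, and it only approximates the loader's file matching by skipping hidden files and directories.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(root):
    """Count visible files under `root` by lowercase file extension."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        rel_parts = path.relative_to(root).parts
        # Skip hidden files and anything inside a hidden directory.
        if path.is_file() and not any(part.startswith(".") for part in rel_parts):
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


if __name__ == "__main__":
    data_dir = Path("./数据")  # same folder as in the loader call above
    if data_dir.is_dir():
        for suffix, n in sorted(count_by_suffix(data_dir).items()):
            print(suffix, n)
```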
Running the code raises an error.
Solution:
1. Run the code below to find the nltk data path
import nltk
# Show the path
print(nltk.data.find(''))
Output:
C:\Users\17662\AppData\Roaming\nltk_data
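- Extract the contents of D:\人工智能2024\大模型应用开发 RAG实战课\所需软件\nltk_data.zip into the C:\Users\17662\AppData\Roaming\nltk_data directory shown above
- Copy punkt.zip and punkt_tab.zip from D:\人工智能2024\大模型应用开发 RAG实战课\所需软件\ into the C:\Users\17662\AppData\Roaming\nltk_data\tokenizers directory, and extract both there
The result looks like this: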
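The two manual steps above can also be scripted. This is a minimal sketch using only the standard library, not the method from the original notes. The function name install_nltk_archives and its parameters are hypothetical, and it assumes the archives are laid out so that extracting them in place gives the same result as the manual steps.

```python
import shutil
import zipfile
from pathlib import Path


def install_nltk_archives(src_dir, nltk_data_dir):
    """Unpack nltk_data.zip, then copy and unpack the punkt archives."""
    src_dir, nltk_data_dir = Path(src_dir), Path(nltk_data_dir)
    nltk_data_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: extract nltk_data.zip into the nltk_data directory.
    with zipfile.ZipFile(src_dir / "nltk_data.zip") as zf:
        zf.extractall(nltk_data_dir)

    # Step 2: copy both tokenizer archives into tokenizers/ and extract them there.
    tokenizers_dir = nltk_data_dir / "tokenizers"
    tokenizers_dir.mkdir(exist_ok=True)
    for name in ("punkt.zip", "punkt_tab.zip"):
        copied = Path(shutil.copy2(src_dir / name, tokenizers_dir / name))
        with zipfile.ZipFile(copied) as zf:
            zf.extractall(tokenizers_dir)
    return tokenizers_dir
```

Call it with the folder that holds the three zip files as the first argument and the nltk_data path printed in step 1 as the second.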