Table of Contents
- [Error: ImportError: failed to find libmagic. Check your installation](#error-importerror-failed-to-find-libmagic-check-your-installation)
- Fixed it with a method found online
- [Still failing: LookupError: Resource punkt not found.](#still-failing-lookuperror-resource-punkt-not-found)
- Downloading nltk_data
- [Another error: AttributeError: 'tuple' object has no attribute 'page_content'](#another-error-attributeerror-tuple-object-has-no-attribute-page_content)
- Suspecting the imports, I changed them
- Success at last!
## Error: ImportError: failed to find libmagic. Check your installation
```bash
Traceback (most recent call last):
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 22, in <module>
    split_data = main_embedding()
                 ^^^^^^^^^^^^^^^^
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 11, in main_embedding
    data = loader.load()  # load the data
           ^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_core\document_loaders\base.py", line 30, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_unstructured\document_loaders.py", line 150, in lazy_load
    yield from load_file(f=self.file, f_path=self.file_path)
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_unstructured\document_loaders.py", line 184, in lazy_load
    else self._elements_json
         ^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_unstructured\document_loaders.py", line 203, in _elements_json
    return self._convert_elements_to_dicts(self._elements_via_local)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_unstructured\document_loaders.py", line 221, in _elements_via_local
    return partition(
           ^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\partition\auto.py", line 186, in partition
    file_type = detect_filetype(
                ^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\file_utils\filetype.py", line 100, in detect_filetype
    return _FileTypeDetector.file_type(ctx)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\file_utils\filetype.py", line 133, in file_type
    return cls(ctx)._file_type
           ^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\file_utils\filetype.py", line 143, in _file_type
    if file_type := self._file_type_from_guessed_mime_type:
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\file_utils\filetype.py", line 183, in _file_type_from_guessed_mime_type
    mime_type = self._ctx.mime_type
                ^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\utils.py", line 155, in __get__
    value = self._fget(obj)
            ^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\file_utils\filetype.py", line 364, in mime_type
    import magic
  File "D:\mydatapro\venv_net\Lib\site-packages\magic\__init__.py", line 209, in <module>
    libmagic = loader.load_lib()
               ^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\magic\loader.py", line 49, in load_lib
    raise ImportError('failed to find libmagic. Check your installation')
ImportError: failed to find libmagic. Check your installation
```
## Fixed it with a method found online
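The post doesn't record which fix was applied, so the following is an assumption, not the author's exact steps. The usual cause on Windows is that the `magic` package cannot find the native libmagic library, which Windows does not ship; the common workaround is to install the `python-magic-bin` wheel, which bundles a prebuilt libmagic DLL:

```shell
# python-magic needs the native libmagic library at import time.
# On Windows, the python-magic-bin wheel ships a prebuilt DLL with it.
pip uninstall -y python-magic
pip install python-magic-bin
```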
## Still failing: LookupError: Resource punkt not found.
```bash
Traceback (most recent call last):
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 22, in <module>
    split_data = main_embedding()
                 ^^^^^^^^^^^^^^^^
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 11, in main_embedding
    data = loader.load()  # load the data
           ^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_core\document_loaders\base.py", line 30, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_unstructured\document_loaders.py", line 150, in lazy_load
    yield from load_file(f=self.file, f_path=self.file_path)
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_unstructured\document_loaders.py", line 184, in lazy_load
    else self._elements_json
         ^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_unstructured\document_loaders.py", line 203, in _elements_json
    return self._convert_elements_to_dicts(self._elements_via_local)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_unstructured\document_loaders.py", line 221, in _elements_via_local
    return partition(
           ^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\partition\auto.py", line 415, in partition
    elements = partition_text(
               ^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\partition\text.py", line 102, in partition_text
    return _partition_text(
           ^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\documents\elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\file_utils\filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\file_utils\filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\partition\text.py", line 181, in _partition_text
    file_content = _split_by_paragraph(
                   ^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\partition\text.py", line 361, in _split_by_paragraph
    _split_content_to_fit_max(
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\partition\text.py", line 393, in _split_content_to_fit_max
    sentences = sent_tokenize(content)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\unstructured\nlp\tokenize.py", line 131, in sent_tokenize
    return _sent_tokenize(text)
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\nltk\data.py", line 750, in load
    opened_resource = _open(resource_url)
                      ^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\nltk\data.py", line 876, in _open
    return find(path_, path + [""]).open()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\nltk\data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\shuhu/nltk_data'
    - 'D:\\mydatapro\\venv_net\\nltk_data'
    - 'D:\\mydatapro\\venv_net\\share\\nltk_data'
    - 'D:\\mydatapro\\venv_net\\lib\\nltk_data'
    - 'C:\\Users\\shuhu\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************
```
## Downloading nltk_data
- The network was unstable, so the download took a long time; I also set the environment variable.
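When the downloader keeps timing out, the data can also be fetched manually (from https://www.nltk.org/data.html or a mirror) and pointed to explicitly. A minimal sketch of the environment-variable approach; the `D:\nltk_data` path is an example, not the one from this post:

```python
import os

# Manual workaround: unzip punkt.zip so that
#   D:\nltk_data\tokenizers\punkt\english.pickle
# exists, then tell NLTK where to look *before* importing it.
# NLTK reads NLTK_DATA at import time and adds it to its search path.
os.environ["NLTK_DATA"] = r"D:\nltk_data"

# Afterwards, `import nltk; nltk.data.find("tokenizers/punkt/english.pickle")`
# should resolve without any network access.
```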
## Another error: AttributeError: 'tuple' object has no attribute 'page_content'
- This function wasn't written by me; it comes straight from the official code.
```bash
D:\mydatapro\venv_net\Lib\site-packages\langchain_core\_api\deprecation.py:141: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 0.3.0. An updated version of the class exists in the langchain-huggingface package and should be used instead. To use it run `pip install -U langchain-huggingface` and import as `from langchain_huggingface import HuggingFaceEmbeddings`.
  warn_deprecated(
INFO: Use pytorch device_name: cpu
INFO: Load pretrained SentenceTransformer: F:\\moka-ai_m3e-base
Traceback (most recent call last):
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 24, in <module>
INFO: Use pytorch device_name: cpu
INFO: Load pretrained SentenceTransformer: F:\\moka-ai_m3e-base
Traceback (most recent call last):
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 24, in <module>
INFO: Load pretrained SentenceTransformer: F:\\moka-ai_m3e-base
Traceback (most recent call last):
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 24, in <module>
Traceback (most recent call last):
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 24, in <module>
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 24, in <module>
    split_data = main_embedding()
    split_data = main_embedding()
                 ^^^^^^^^^^^^^^^^
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 18, in main_embedding
  File "D:\mydatapro\myweb\AutoTokenizer.py", line 18, in main_embedding
    db = FAISS.from_documents(embeddings, split_data)  # build the vector store
    db = FAISS.from_documents(embeddings, split_data)  # build the vector store
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_core\vectorstores\base.py", line 831, in from_documents
    texts = [d.page_content for d in documents]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mydatapro\venv_net\Lib\site-packages\langchain_core\vectorstores\base.py", line 831, in <listcomp>
    texts = [d.page_content for d in documents]
               ^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'page_content'
```
- No idea why the error output contains so many repeated lines; could it be the network? I don't really understand.
## Suspecting the imports, I changed them
I only changed the imports, so I can only put the failure down to the network or a library problem. These are all the imports I ended up using:
```python
from langchain_unstructured import UnstructuredLoader                # load documents
from langchain_text_splitters import RecursiveCharacterTextSplitter  # split documents
from langchain_huggingface import HuggingFaceEmbeddings              # embed text
from langchain_community.vectorstores import FAISS                   # vector store
```
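One detail worth noting, which is my reading of the traceback rather than anything stated in the post: `FAISS.from_documents` (inherited from `VectorStore.from_documents`) takes the documents first and the embedding model second, i.e. `FAISS.from_documents(split_data, embeddings)`, while the failing call above passes `embeddings` first. Iterating over the wrong object can then yield tuples instead of `Document`s, which is exactly where this error message comes from. A stdlib-only sketch of the mechanism (`Document` here is a stand-in, not the real LangChain class):

```python
from dataclasses import dataclass

# Minimal stand-in for langchain_core.documents.Document.
@dataclass
class Document:
    page_content: str

docs = [Document("chunk one"), Document("chunk two")]

# Essentially what VectorStore.from_documents does internally:
texts = [d.page_content for d in docs]  # works when documents come first
assert texts == ["chunk one", "chunk two"]

# With the arguments swapped, the "documents" iterable isn't Documents at all;
# iterating the wrong object can yield tuples, and then:
try:
    [d.page_content for d in [("field", "value")]]
except AttributeError as err:
    print(err)  # 'tuple' object has no attribute 'page_content'
```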
## Success at last!
```bash
(venv_net) PS D:\mydatapro\myweb> python AutoTokenizer.py
INFO: Use pytorch device_name: cpu
INFO: Load pretrained SentenceTransformer: F:\\moka-ai_m3e-base
INFO: Loading faiss with AVX2 support.
INFO: Successfully loaded faiss with AVX2 support.
[Document(metadata={'source': './dataset/test.txt', 'file_directory': './dataset', 'filename': 'test.txt', 'last_modified': '2024-08-16T16:11:37', 'languages': ['zho'], 'filetype': 'text/plain', 'category': 'Title', 'element_id': '2ec66fdb03bd40ec722fd30005d3739a'}, page_content='国家建立的负责收集和保存本国出版物,担负国家总书库职能的图书馆。'), Document(metadata={'source': './dataset/test.txt', 'file_directory': './dataset', 'filename': 'test.txt', 'last_modified': '2024-08-16T16:11:37', 'languages': ['zho'], 'filetype': 'text/plain', 'category': 'Title', 'element_id': '39a938c715ce1a4b38af2b878c2d29d4'}, page_content='国家图书馆一般除收藏本国出版物外,还收藏大量外文出版物 (包括有关本国的外文书刊), 并负责编制国家书目和联合目录。'), Document(metadata={'source': './dataset/test.txt', 'file_directory': './dataset', 'filename': 'test.txt', 'last_modified': '2024-08-16T16:11:37', 'languages': ['zho'], 'filetype': 'text/plain', 'category': 'Title', 'element_id': '2ddfef3787246755bfd1955ef3eacb54'}, page_content='国家图书馆是一个国家 图书事业的推动者,是面向全国的中心图书馆,既是全国的藏书中心、馆际互借中心、国际书刊交换中心,'), Document(metadata={'source': './dataset/test.txt', 'file_directory': './dataset', 'filename': 'test.txt', 'last_modified': '2024-08-16T16:11:37', 'languages': ['zho'], 'filetype': 'text/plain', 'category': 'Title', 'element_id': 'ca80db5e9d73b32e59eb3dc122b274c6'}, page_content='也是全国的书目 和图书馆学研究的中心。')]
```