Introduction
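Because data can come from many places, not every reader is built in. Instead, you can download readers from LlamaHub, the LlamaIndex registry of data connectors (https://docs.llamaindex.ai/en/stable/understanding/loading/llamahub/).
LlamaHub (Llama Hub) offers a wide range of open-source data connectors that plug easily into any LlamaIndex application (plus Agent Tools and Llama Packs). This article introduces some usage patterns and a selection of the available connectors.
LlamaHub is an ecosystem focused on connecting large language models (LLMs) to a variety of knowledge and data sources. It provides data loaders, tools, datasets, and other practical components, with the aim of simplifying data integration.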
Address: Llama Hub

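**Core features and components:** LlamaHub's core components include data loaders (such as CSVReader, DocxReader, and ConfluenceReader), tools (such as the Google Calendar tool), and datasets (such as paulgrahamessaydataset). These components support many data sources (such as Google Docs, Notion, and databases) and can be used with frameworks such as LlamaIndex and LangChain to build data agents or retrieval-augmented generation (RAG) applications.
Usage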
pip install llama-index-readers-file
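After installing, it can be handy to confirm that the package is importable before running the examples. The following is a minimal sketch that uses only the standard library. The helper name `is_importable` is hypothetical, and the module path `llama_index.readers.file` is taken from the import statement used in this article.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the import path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package of a dotted name cannot be imported.
        return False


if not is_importable("llama_index.readers.file"):
    print("Missing package; run: pip install llama-index-readers-file")
```

A check like this only tells you that the module can be found; it does not verify that optional dependencies of individual readers are present.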
Code
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import (
    DocxReader,
    HWPReader,
    PDFReader,
    EpubReader,
    FlatReader,
    HTMLTagReader,
    ImageCaptionReader,
    ImageReader,
    ImageVisionLLMReader,
    IPYNBReader,
    MarkdownReader,
    MboxReader,
    PptxReader,
    PandasCSVReader,
    VideoAudioReader,
    UnstructuredReader,
    PyMuPDFReader,
    ImageTabularChartReader,
    XMLReader,
    PagedCSVReader,
    CSVReader,
    RTFReader,
)
# PDF Reader with `SimpleDirectoryReader`
parser = PDFReader()
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Docx Reader example
parser = DocxReader()
file_extractor = {".docx": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# HWP Reader example
parser = HWPReader()
file_extractor = {".hwp": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Epub Reader example
parser = EpubReader()
file_extractor = {".epub": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Flat Reader example
parser = FlatReader()
file_extractor = {".txt": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# HTML Tag Reader example
parser = HTMLTagReader()
file_extractor = {".html": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Image Reader example
parser = ImageReader()
file_extractor = {
    ".jpg": parser,
    ".jpeg": parser,
    ".png": parser,
}  # Add other image formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# IPYNB Reader example
parser = IPYNBReader()
file_extractor = {".ipynb": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Markdown Reader example
parser = MarkdownReader()
file_extractor = {".md": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Mbox Reader example
parser = MboxReader()
file_extractor = {".mbox": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Pptx Reader example
# Basic usage - extracts text, tables, charts, and speaker notes
parser = PptxReader()
# Advanced usage - control parsing behavior (this replaces the basic parser above)
parser = PptxReader(
    extract_images=True,  # Enable image captioning
    context_consolidation_with_llm=True,  # Use LLM for content synthesis
    num_workers=4,  # Parallel processing
    batch_size=10,  # Slides processed per worker batch
    raise_on_error=True,  # Raise a value error if file parsing fails
)
file_extractor = {".pptx": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Pandas CSV Reader example
parser = PandasCSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# PyMuPDF Reader example
parser = PyMuPDFReader()
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# XML Reader example
parser = XMLReader()
file_extractor = {".xml": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# Paged CSV Reader example
parser = PagedCSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
# CSV Reader example
parser = CSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
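Every example above follows the same pattern: a dictionary maps a file extension to a parser, and the directory reader picks the parser by suffix. The following is a simplified sketch of that dispatch idea using only the standard library. It is not LlamaIndex's implementation; `Parser`, `default_parser`, `upper_parser`, and `load_directory` are hypothetical names introduced for illustration, and the fallback behavior for unmapped extensions is an assumption.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Callable, Dict, List

# Hypothetical parser type: takes a file path, returns a list of text chunks.
Parser = Callable[[Path], List[str]]


def default_parser(path: Path) -> List[str]:
    """Fallback used when no parser is registered for the file's suffix."""
    return [path.read_text(encoding="utf-8")]


def upper_parser(path: Path) -> List[str]:
    """Toy custom parser: upper-cases the content so its effect is visible."""
    return [path.read_text(encoding="utf-8").upper()]


def load_directory(directory: str, file_extractor: Dict[str, Parser]) -> List[str]:
    """Parse each file in a directory with the parser mapped to its suffix."""
    chunks: List[str] = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # Look up the parser by lower-cased suffix; fall back to the default.
        parser = file_extractor.get(path.suffix.lower(), default_parser)
        chunks.extend(parser(path))
    return chunks


with TemporaryDirectory() as tmp:
    (Path(tmp) / "a.md").write_text("hello", encoding="utf-8")
    (Path(tmp) / "b.txt").write_text("world", encoding="utf-8")
    # Only ".md" is mapped, so "b.txt" goes through the default parser.
    result = load_directory(tmp, {".md": upper_parser})
```

Because a dictionary holds one value per key, mapping the same extension twice (for example `.pdf` to both PDFReader and PyMuPDFReader) keeps only the last parser assigned.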
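Database connector
In this example, LlamaIndex downloads and installs a connector named DatabaseReader, which runs a query against a SQL database and returns each row of the results as a Document: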
import os
from llama_index.readers.database import DatabaseReader
# Connection settings are read from environment variables
reader = DatabaseReader(
    scheme=os.getenv("DB_SCHEME"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASS"),
    dbname=os.getenv("DB_NAME"),
)
# Each row returned by the query becomes one Document
query = "SELECT * FROM users"
documents = reader.load_data(query=query)
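To make the "one row, one Document" idea concrete without a running database server, here is a minimal sketch using the standard-library `sqlite3` module and an in-memory database. The `Document` dataclass and the `rows_to_documents` helper are hypothetical stand-ins, and the `column: value` text layout is an assumption for illustration; it is not necessarily the exact format DatabaseReader produces.

```python
import sqlite3
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """Hypothetical stand-in for a LlamaIndex Document: just a text field."""
    text: str


def rows_to_documents(conn: sqlite3.Connection, query: str) -> List[Document]:
    """Run a query and turn every result row into one Document."""
    cursor = conn.execute(query)
    # cursor.description holds one tuple per column; item 0 is the column name.
    columns = [col[0] for col in cursor.description]
    return [
        Document(text=", ".join(f"{c}: {v}" for c, v in zip(columns, row)))
        for row in cursor
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])
documents = rows_to_documents(conn, "SELECT * FROM users ORDER BY id")
conn.close()
```

Selecting only the columns you need, rather than `SELECT *`, keeps each resulting document smaller and more focused.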
There are hundreds of connectors available on LlamaHub!