HTML Document Loaders in LangChain

https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/html

This content is based on LangChain's official documentation (langchain.com.cn) and explains, in simplified terms, two HTML loaders: tools that extract text and metadata from HTML files into LangChain Document objects. The source code, examples, and outputs are reproduced from the original documentation.

Key Note: HTML (HyperText Markup Language) is the standard language for web documents. LangChain's HTML loaders strip away HTML tags to extract usable text, with optional metadata (e.g., page title).

1. What Are HTML Loaders?

HTML loaders convert raw HTML files into structured Document objects for LangChain workflows.

Core function: Extract text content from HTML (removing tags like <h1>, <p>) and attach metadata (e.g., file source).
Two supported loaders:
- UnstructuredHTMLLoader: Simple loader for basic text extraction.
- BSHTMLLoader: Uses the BeautifulSoup4 library to extract text + page title (stored in metadata).
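
To make the core function concrete, here is a minimal sketch of tag stripping using only Python's standard-library `html.parser`. It is not how either LangChain loader is implemented. The `TextAndTitleParser` class and the `load_html_string` helper are hypothetical names introduced for illustration, and a plain dict stands in for a Document object.

```python
from html.parser import HTMLParser


class TextAndTitleParser(HTMLParser):
    """Collects visible text chunks and the <title> text (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def load_html_string(html, source):
    """Hypothetical helper: returns a dict shaped like a Document."""
    parser = TextAndTitleParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n\n".join(parser.chunks),
        "metadata": {"source": source, "title": parser.title},
    }


sample = (
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)
doc = load_html_string(sample, "example_data/fake-content.html")
print(doc["page_content"])
print(doc["metadata"])
```

The tags are gone from `page_content`, while the file path and page title travel alongside the text as metadata, which is the same shape the two loaders produce.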

2. Prerequisites

For BSHTMLLoader, install the BeautifulSoup4 library first (required for HTML parsing):
```bash
pip install beautifulsoup4
```
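
Before running BSHTMLLoader, it can help to confirm that the parsing library is importable. The sketch below uses only the standard library. `has_module` is a hypothetical helper name, and `bs4` is the import name that the beautifulsoup4 package installs.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if has_module("bs4"):
    print("beautifulsoup4 is available; BSHTMLLoader can be used.")
else:
    print("beautifulsoup4 is missing; run: pip install beautifulsoup4")
```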

3. Loader 1: UnstructuredHTMLLoader (Basic Text Extraction)

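This loader extracts plain text from HTML without collecting page-level metadata such as the title. It relies on the separate `unstructured` package, which must be installed for the loader to run.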

Step 3.1: Import the Loader

```python
# Newer LangChain releases expose this class from langchain_community.document_loaders
from langchain.document_loaders import UnstructuredHTMLLoader
```

Step 3.2: Initialize and Load the HTML File

```python
# Initialize loader with the path to your HTML file
loader = UnstructuredHTMLLoader("example_data/fake-content.html")

# Load the HTML into a list of Document objects
data = loader.load()
```
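
The path passed to the loader must point to an existing file. The original page does not show the contents of `fake-content.html`, so the sketch below writes an assumed version, inferred from the loader outputs shown in this document (a title, one heading, one paragraph). Treat the exact markup as a guess.

```python
from pathlib import Path

# Assumed file contents, inferred from the loader outputs in this document.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>Test Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>
"""

# Create the example_data directory if needed, then write the sample file.
path = Path("example_data") / "fake-content.html"
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(SAMPLE_HTML, encoding="utf-8")
print(f"Wrote {path}")
```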

Step 3.3: View the Result

```python
data
```

Output (as in the original documentation):

```python
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]
```

4. Loader 2: BSHTMLLoader (Text + Title Extraction)

This loader uses BeautifulSoup4 to extract both text content and the HTML page's title (stored in the title field of metadata).

Step 4.1: Import the Loader

```python
from langchain.document_loaders import BSHTMLLoader
```

Step 4.2: Initialize and Load the HTML File

```python
# Initialize loader with the path to your HTML file
loader = BSHTMLLoader("example_data/fake-content.html")

# Load the HTML into a list of Document objects
data = loader.load()
```

Step 4.3: View the Result

```python
data
```

Output (as in the original documentation):

```python
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
```
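
The BSHTMLLoader output above keeps the blank lines left behind where tags were removed. If that is unwanted, a small post-processing step can collapse them. This is a sketch, not part of LangChain: `collapse_blank_lines` is a hypothetical helper, and the input string is copied from the output above.

```python
def collapse_blank_lines(text):
    """Drop empty lines and trim whitespace around each remaining line."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


raw = "\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n"
cleaned = collapse_blank_lines(raw)
print(repr(cleaned))
# 'Test Title\nMy First Heading\nMy first paragraph.'
```

With a real result, the same function could be applied to `data[0].page_content`. Note that the title text remains in the content, because BSHTMLLoader includes it both there and in the metadata.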

Key Takeaways

- UnstructuredHTMLLoader: Extracts basic text from HTML (no title metadata).
- BSHTMLLoader: Requires BeautifulSoup4, extracts text + page title (stored in metadata["title"]).
- Both loaders return Document objects with page_content (extracted text) and metadata["source"] (file path).