https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/html
HTML Document Loaders in LangChain
This content is based on LangChain's official documentation (langchain.com.cn) and explains, in simplified terms, two HTML loaders: tools that extract text and metadata from HTML files into LangChain Document objects. The loader code and example outputs are reproduced from that documentation.
Key Note: HTML (HyperText Markup Language) is the standard language for web documents. LangChain's HTML loaders strip away HTML tags to extract usable text, with optional metadata (e.g., page title).
1. What Are HTML Loaders?
HTML loaders convert raw HTML files into structured Document objects for LangChain workflows.
- Core function: Extract text content from HTML (removing tags like <h1>, <p>) and attach metadata (e.g., file source).
- Two supported loaders:
  - UnstructuredHTMLLoader: Simple loader for basic text extraction.
  - BSHTMLLoader: Uses the BeautifulSoup4 library to extract text + page title (stored in metadata).
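To make the "strip tags, keep text, attach metadata" idea concrete, here is a minimal sketch that uses only Python's standard library. It is not how either LangChain loader is implemented; the class `TextAndTitleParser`, the function `html_to_document`, and the inline sample HTML are hypothetical names and data chosen for illustration, and the result is a plain dict rather than a real Document object.

```python
from html.parser import HTMLParser


class TextAndTitleParser(HTMLParser):
    """Collect visible text chunks and the <title> text (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []        # text found outside <title>
        self.title = None       # text found inside <title>, if any
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def html_to_document(html, source):
    """Return a plain dict shaped like a Document: text plus metadata."""
    parser = TextAndTitleParser()
    parser.feed(html)
    parser.close()
    metadata = {"source": source}
    if parser.title is not None:
        metadata["title"] = parser.title
    return {"page_content": "\n\n".join(parser.chunks), "metadata": metadata}


# Hypothetical sample input, kept inline so the sketch is self-contained
sample_html = (
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)
doc = html_to_document(sample_html, "example_data/fake-content.html")
print(doc["page_content"])
print(doc["metadata"])
```

The real loaders delegate parsing to dedicated libraries, which handle malformed markup and edge cases that this sketch ignores.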
2. Prerequisites
- For BSHTMLLoader, install the BeautifulSoup4 library first (required for HTML parsing):
bash
pip install beautifulsoup4
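Before running the BSHTMLLoader example, it can help to confirm that the dependency is importable. The following is a small sketch using the standard library; the helper name `is_installed` is hypothetical. Note that the beautifulsoup4 package is imported under the module name `bs4`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# The beautifulsoup4 distribution is imported as "bs4"
if is_installed("bs4"):
    print("beautifulsoup4 is available")
else:
    print("beautifulsoup4 is missing; run: pip install beautifulsoup4")
```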
3. Loader 1: UnstructuredHTMLLoader (Basic Text Extraction)
This loader extracts plain text from HTML. Its metadata records only the source file path, not the page title.
Step 3.1: Import the Loader
python
from langchain.document_loaders import UnstructuredHTMLLoader
Step 3.2: Initialize and Load the HTML File
python
# Initialize loader with the path to your HTML file
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
# Load the HTML into a Document object
data = loader.load()
Step 3.3: View the Result
python
data
Output (Exact as Original):
python
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]
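The documentation does not show the contents of example_data/fake-content.html. To try the loaders yourself, a file along the following lines should produce similar text; its markup is an assumption reconstructed from the example outputs in this document (the heading, the paragraph, and the "Test Title" title reported by BSHTMLLoader). The helper `write_sample` is a hypothetical name, and the sketch writes into a temporary directory so it does not overwrite anything.

```python
from pathlib import Path
import tempfile

# Assumed file contents, reconstructed from the example outputs
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Test Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""


def write_sample(base_dir):
    """Create example_data/fake-content.html under base_dir and return its path."""
    path = Path(base_dir) / "example_data" / "fake-content.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(SAMPLE_HTML, encoding="utf-8")
    return path


sample_path = write_sample(tempfile.mkdtemp())
print(sample_path)
```

Pass the printed path (or a copy of the file placed at example_data/fake-content.html) to either loader in place of the path used in the examples.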
4. Loader 2: BSHTMLLoader (Text + Title Extraction)
This loader uses BeautifulSoup4 to extract both text content and the HTML page's title (stored in the title field of metadata).
Step 4.1: Import the Loader
python
from langchain.document_loaders import BSHTMLLoader
Step 4.2: Initialize and Load the HTML File
python
# Initialize loader with the path to your HTML file
loader = BSHTMLLoader("example_data/fake-content.html")
# Load the HTML into a Document object
data = loader.load()
Step 4.3: View the Result
python
data
Output (Exact as Original):
python
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
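The page_content returned by BSHTMLLoader keeps the blank lines left behind by the removed tags. If that whitespace is unwanted downstream, one possible cleanup is sketched below; it is an optional post-processing step, not part of the loader. The helper name `collapse_blank_lines` is hypothetical, and the input string is copied from the output above.

```python
def collapse_blank_lines(text):
    """Strip each line and drop the empty ones, keeping one item per line."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# page_content copied from the BSHTMLLoader output above
raw = "\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n"
cleaned = collapse_blank_lines(raw)
print(cleaned)
```

With a loaded document, the same function could be applied to `data[0].page_content`.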
Key Takeaways
- UnstructuredHTMLLoader: Extracts basic text from HTML (no title metadata).
- BSHTMLLoader: Requires BeautifulSoup4, extracts text + page title (stored in metadata["title"]).
- Both loaders return Document objects with page_content (extracted text) and metadata["source"] (file path).