HTML Document Loaders in LangChain

https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/html

HTML Document Loaders in LangChain

This content is based on LangChain's official documentation (langchain.com.cn) and explains two HTML loaders ---tools to extract text and metadata from HTML files into LangChain Document objects---in simplified terms. It strictly preserves original source codes, examples, and knowledge points without arbitrary additions or modifications.

Key Note: HTML (HyperText Markup Language) is the standard language for web documents. LangChain's HTML loaders strip away HTML tags to extract usable text, with optional metadata (e.g., page title).

1. What Are HTML Loaders?

HTML loaders convert raw HTML files into structured Document objects for LangChain workflows.

  • Core function: Extract text content from HTML (removing tags like <h1>, <p>) and attach metadata (e.g., file source).
  • Two supported loaders:
    • UnstructuredHTMLLoader: Simple loader for basic text extraction.
    • BSHTMLLoader: Uses the BeautifulSoup4 library to extract text + page title (stored in metadata).

2. Prerequisites

  • For BSHTMLLoader, install the BeautifulSoup4 library first (required for HTML parsing):

    bash 复制代码
    pip install beautifulsoup4

3. Loader 1: UnstructuredHTMLLoader (Basic Text Extraction)

This loader extracts plain text from HTML, ignoring complex metadata (e.g., page title).

Step 3.1: Import the Loader

python 复制代码
from langchain.document_loaders import UnstructuredHTMLLoader

Step 3.2: Initialize and Load the HTML File

python 复制代码
# Initialize loader with the path to your HTML file
loader = UnstructuredHTMLLoader("example_data/fake-content.html")

# Load the HTML into a Document object
data = loader.load()

Step 3.3: View the Result

python 复制代码
data

Output (Exact as Original):

python 复制代码
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]

4. Loader 2: BSHTMLLoader (Text + Title Extraction)

This loader uses BeautifulSoup4 to extract both text content and the HTML page's title (stored in the title field of metadata).

Step 4.1: Import the Loader

python 复制代码
from langchain.document_loaders import BSHTMLLoader

Step 4.2: Initialize and Load the HTML File

python 复制代码
# Initialize loader with the path to your HTML file
loader = BSHTMLLoader("example_data/fake-content.html")

# Load the HTML into a Document object
data = loader.load()

Step 4.3: View the Result

python 复制代码
data

Output (Exact as Original):

python 复制代码
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]

Key Takeaways

  • UnstructuredHTMLLoader: Extracts basic text from HTML (no title metadata).
  • BSHTMLLoader: Requires BeautifulSoup4, extracts text + page title (stored in metadata["title"]).
  • Both loaders return Document objects with page_content (extracted text) and metadata["source"] (file path).
相关推荐
just小千1 小时前
HTML进阶——常用标签及其属性
前端·html
惜.己1 小时前
html笔记(一)
前端·笔记·html
曹牧1 小时前
HTML实体名称
前端·html
guangzan1 小时前
在 React 中重拾原生 HTML 属性
html
AI大模型3 小时前
5步构建企业级RAG应用:Dify与LangChain v1.0集成实战
langchain·llm·agent
坚持就完事了4 小时前
CSS-5:盒子模型
前端·css·html
FreeCode5 小时前
LangSmith本地部署LangGraph应用
python·langchain·agent
橙 子_5 小时前
我本以为代码是逻辑,直到遇见了HTML的“形”与“意”【一】
前端·html
坚持就完事了7 小时前
CSS-3:背景设置
前端·css·html