HTML Document Loaders in LangChain

https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/html

HTML Document Loaders in LangChain

This content is based on LangChain's official documentation (langchain.com.cn) and explains two HTML loaders ---tools to extract text and metadata from HTML files into LangChain Document objects---in simplified terms. It strictly preserves original source codes, examples, and knowledge points without arbitrary additions or modifications.

Key Note: HTML (HyperText Markup Language) is the standard language for web documents. LangChain's HTML loaders strip away HTML tags to extract usable text, with optional metadata (e.g., page title).

1. What Are HTML Loaders?

HTML loaders convert raw HTML files into structured Document objects for LangChain workflows.

  • Core function: Extract text content from HTML (removing tags like <h1>, <p>) and attach metadata (e.g., file source).
  • Two supported loaders:
    • UnstructuredHTMLLoader: Simple loader for basic text extraction.
    • BSHTMLLoader: Uses the BeautifulSoup4 library to extract text + page title (stored in metadata).

2. Prerequisites

  • For BSHTMLLoader, install the BeautifulSoup4 library first (required for HTML parsing):

    bash 复制代码
    pip install beautifulsoup4

3. Loader 1: UnstructuredHTMLLoader (Basic Text Extraction)

This loader extracts plain text from HTML, ignoring complex metadata (e.g., page title).

Step 3.1: Import the Loader

python 复制代码
from langchain.document_loaders import UnstructuredHTMLLoader

Step 3.2: Initialize and Load the HTML File

python 复制代码
# Initialize loader with the path to your HTML file
loader = UnstructuredHTMLLoader("example_data/fake-content.html")

# Load the HTML into a Document object
data = loader.load()

Step 3.3: View the Result

python 复制代码
data

Output (Exact as Original):

python 复制代码
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]

4. Loader 2: BSHTMLLoader (Text + Title Extraction)

This loader uses BeautifulSoup4 to extract both text content and the HTML page's title (stored in the title field of metadata).

Step 4.1: Import the Loader

python 复制代码
from langchain.document_loaders import BSHTMLLoader

Step 4.2: Initialize and Load the HTML File

python 复制代码
# Initialize loader with the path to your HTML file
loader = BSHTMLLoader("example_data/fake-content.html")

# Load the HTML into a Document object
data = loader.load()

Step 4.3: View the Result

python 复制代码
data

Output (Exact as Original):

python 复制代码
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]

Key Takeaways

  • UnstructuredHTMLLoader: Extracts basic text from HTML (no title metadata).
  • BSHTMLLoader: Requires BeautifulSoup4, extracts text + page title (stored in metadata["title"]).
  • Both loaders return Document objects with page_content (extracted text) and metadata["source"] (file path).
相关推荐
奇舞精选7 小时前
写 HTML 就能做视频?HeyGen 开源的这个工具有点意思
html·agent
秦歌66610 小时前
LangChain-9-多agent电商助手案例
langchain
河阿里11 小时前
HTML5标准完全教学手册
前端·html·html5
之歆12 小时前
Day03_HTML 列表、表格、表单完整指南(下)
android·javascript·html
Lumos_77712 小时前
Linux -- exec 进程替换
linux·运维·chrome
索木木14 小时前
ShortCut MoE模型分析
前端·html
之歆16 小时前
Day02_HTML 标签详解:从基础到实践的完整指南
css·html·div
Arvid16 小时前
LangGraph 记忆系统设计实战
langchain
We་ct17 小时前
HTML5 原生拖拽 API 实战案例与拓展避坑
前端·html·api·html5·拖拽