HTML Document Loaders in LangChain

https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/html

HTML Document Loaders in LangChain

This content is based on LangChain's official documentation (langchain.com.cn) and explains two HTML loaders ---tools to extract text and metadata from HTML files into LangChain Document objects---in simplified terms. It strictly preserves original source codes, examples, and knowledge points without arbitrary additions or modifications.

Key Note: HTML (HyperText Markup Language) is the standard language for web documents. LangChain's HTML loaders strip away HTML tags to extract usable text, with optional metadata (e.g., page title).

1. What Are HTML Loaders?

HTML loaders convert raw HTML files into structured Document objects for LangChain workflows.

  • Core function: Extract text content from HTML (removing tags like <h1>, <p>) and attach metadata (e.g., file source).
  • Two supported loaders:
    • UnstructuredHTMLLoader: Simple loader for basic text extraction.
    • BSHTMLLoader: Uses the BeautifulSoup4 library to extract text + page title (stored in metadata).

2. Prerequisites

  • For BSHTMLLoader, install the BeautifulSoup4 library first (required for HTML parsing):

    bash 复制代码
    pip install beautifulsoup4

3. Loader 1: UnstructuredHTMLLoader (Basic Text Extraction)

This loader extracts plain text from HTML, ignoring complex metadata (e.g., page title).

Step 3.1: Import the Loader

python 复制代码
from langchain.document_loaders import UnstructuredHTMLLoader

Step 3.2: Initialize and Load the HTML File

python 复制代码
# Initialize loader with the path to your HTML file
loader = UnstructuredHTMLLoader("example_data/fake-content.html")

# Load the HTML into a Document object
data = loader.load()

Step 3.3: View the Result

python 复制代码
data

Output (Exact as Original):

python 复制代码
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]

4. Loader 2: BSHTMLLoader (Text + Title Extraction)

This loader uses BeautifulSoup4 to extract both text content and the HTML page's title (stored in the title field of metadata).

Step 4.1: Import the Loader

python 复制代码
from langchain.document_loaders import BSHTMLLoader

Step 4.2: Initialize and Load the HTML File

python 复制代码
# Initialize loader with the path to your HTML file
loader = BSHTMLLoader("example_data/fake-content.html")

# Load the HTML into a Document object
data = loader.load()

Step 4.3: View the Result

python 复制代码
data

Output (Exact as Original):

python 复制代码
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]

Key Takeaways

  • UnstructuredHTMLLoader: Extracts basic text from HTML (no title metadata).
  • BSHTMLLoader: Requires BeautifulSoup4, extracts text + page title (stored in metadata["title"]).
  • Both loaders return Document objects with page_content (extracted text) and metadata["source"] (file path).
相关推荐
AIkk861 小时前
班级群学习资料分享指南:工具推荐与实践
大数据·人工智能·html
兆。1 小时前
简历高光_Agent_RAG项目描述
人工智能·langchain
加点油。。。。1 小时前
【1.Obsidian渲染html文件】
前端·html·obsidian
leikooo2 小时前
LangChain4j 调用 DeepSeek 工具时报 400?用 pi 抓包定位,同包覆盖修复 reasoning_content
langchain·deepseek
wuhen_n3 小时前
RAG 入门:检索增强生成核心原理
前端·人工智能·typescript·langchain·ai编程
晚笙coding3 小时前
从零讲透 LangChain 提示词模板:不只是 Prompt,而是“可复用的 AI 指令工厂”
人工智能·langchain·prompt
HackTwoHub4 小时前
WEB扫描器Invicti-Professional-V26.50.0(自动化爬虫扫描)更新
前端·人工智能·chrome·爬虫·web安全·网络安全·自动化
晚笙coding4 小时前
从零讲透 LangChain 输出格式化:让模型真的“能用”
java·开发语言·langchain
程序员小羊!4 小时前
01HTML预备知识
前端·html
颜酱4 小时前
LangChain 调大模型:模板拼接 + invoke / stream / batch
python·langchain