HTML Document Loaders in LangChain

https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/html

This content is based on LangChain's official documentation (langchain.com.cn) and explains, in simplified terms, two HTML loaders: tools that extract text and metadata from HTML files into LangChain Document objects. The source code and example outputs from the original documentation are preserved.

Key Note: HTML (HyperText Markup Language) is the standard language for web documents. LangChain's HTML loaders strip away HTML tags to extract usable text, with optional metadata (e.g., page title).

1. What Are HTML Loaders?

HTML loaders convert raw HTML files into structured Document objects for LangChain workflows.

  • Core function: Extract text content from HTML (removing tags like <h1>, <p>) and attach metadata (e.g., file source).
  • Two supported loaders:
    • UnstructuredHTMLLoader: Simple loader for basic text extraction.
    • BSHTMLLoader: Uses the BeautifulSoup4 library to extract text + page title (stored in metadata).
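
To make the idea of "strip the tags, keep the text, record metadata" concrete, here is a minimal sketch that uses only Python's standard library. It is an illustration of the concept, not how either LangChain loader is implemented; the names TextAndTitleParser and html_to_record are hypothetical, and the result is a plain dict rather than a real Document.

```python
from html.parser import HTMLParser


class TextAndTitleParser(HTMLParser):
    """Collects visible text and the <title> text from an HTML string."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self.in_title:
            self.title = text
        else:
            self.chunks.append(text)


def html_to_record(html, source):
    """Return a dict shaped like a Document: page_content plus metadata."""
    parser = TextAndTitleParser()
    parser.feed(html)
    parser.close()
    metadata = {"source": source}
    if parser.title is not None:
        metadata["title"] = parser.title
    return {"page_content": "\n\n".join(parser.chunks), "metadata": metadata}


sample = (
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)
record = html_to_record(sample, "example_data/fake-content.html")
print(record["page_content"])
print(record["metadata"])
```

Unlike BSHTMLLoader, this sketch keeps the title out of the text and stores it only in the metadata; that is a simplification chosen for the example.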

2. Prerequisites

  • For BSHTMLLoader, install the BeautifulSoup4 library first (required for HTML parsing):

    bash
    pip install beautifulsoup4
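
The examples in the following sections read example_data/fake-content.html. If you do not have that file, the sketch below creates a stand-in. Its contents are an assumption reconstructed from the outputs shown later in this guide, not a copy of the original example file, and write_sample is a hypothetical helper name.

```python
from pathlib import Path

# Assumed contents: reconstructed from the example outputs in this guide,
# not copied from the original example file.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Test Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""


def write_sample(path="example_data/fake-content.html"):
    """Create the parent folder if needed and write the sample HTML file."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(SAMPLE_HTML, encoding="utf-8")
    return target


sample_path = write_sample()
print(sample_path)
```

Run it once from the directory where you plan to run the loader examples, so the relative path resolves to the same file.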

3. Loader 1: UnstructuredHTMLLoader (Basic Text Extraction)

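This loader extracts plain text from HTML and records only the file path as metadata; it does not capture the page title. It relies on the separate unstructured package, which must be installed (pip install unstructured) before the loader can be used.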

Step 3.1: Import the Loader

python
from langchain.document_loaders import UnstructuredHTMLLoader

Step 3.2: Initialize and Load the HTML File

python
# Initialize loader with the path to your HTML file
loader = UnstructuredHTMLLoader("example_data/fake-content.html")

# Load the HTML file; load() returns a list of Document objects
data = loader.load()

Step 3.3: View the Result

python
data

Output (as shown in the original documentation):

python
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]
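
The result is a list, so in practice you usually loop over it and read page_content and metadata from each item. The sketch below shows one way to do that. To stay runnable without LangChain installed, it uses a SimpleNamespace stand-in whose values are copied from the output above; describe is a hypothetical helper name, and with the real loader you would pass the list returned by loader.load().

```python
from types import SimpleNamespace


def describe(docs):
    """Return one (source, character count, first line) tuple per document."""
    rows = []
    for doc in docs:
        stripped = doc.page_content.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        rows.append((doc.metadata.get("source"), len(doc.page_content), first_line))
    return rows


# Stand-in for the list returned by loader.load() (an assumption for this sketch).
docs = [
    SimpleNamespace(
        page_content="My First Heading\n\nMy first paragraph.",
        metadata={"source": "example_data/fake-content.html"},
    )
]

for source, size, first_line in describe(docs):
    print(source, size, first_line)
```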

4. Loader 2: BSHTMLLoader (Text + Title Extraction)

This loader uses BeautifulSoup4 to extract both text content and the HTML page's title (stored in the title field of metadata).

Step 4.1: Import the Loader

python
from langchain.document_loaders import BSHTMLLoader

Step 4.2: Initialize and Load the HTML File

python
# Initialize loader with the path to your HTML file
loader = BSHTMLLoader("example_data/fake-content.html")

# Load the HTML file; load() returns a list of Document objects
data = loader.load()

Step 4.3: View the Result

python
data

Output (as shown in the original documentation):

python
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
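
The page_content above keeps the blank lines left behind where tags were removed. If that extra whitespace is unwanted, a small post-processing step can tidy it. The following is a sketch of one possible cleanup, not something BSHTMLLoader does itself; tidy_text is a hypothetical helper, and the raw string is copied from the output above.

```python
def tidy_text(text):
    """Trim each line, drop empty lines, and join the rest with single newlines."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


raw = "\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n"
cleaned = tidy_text(raw)
print(cleaned)
```

With real loader output, the same function could be applied to the page_content of each returned document. Note that this also removes paragraph breaks, so it may not suit workflows that split text on blank lines.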

Key Takeaways

  • UnstructuredHTMLLoader: Extracts basic text from HTML (no title metadata).
  • BSHTMLLoader: Requires BeautifulSoup4, extracts text + page title (stored in metadata["title"]).
  • Both loaders return a list of Document objects, each with page_content (extracted text) and metadata["source"] (file path).