HTML Document Loaders in LangChain

https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/html

HTML Document Loaders in LangChain

This content is based on LangChain's official documentation (langchain.com.cn) and explains two HTML loaders ---tools to extract text and metadata from HTML files into LangChain Document objects---in simplified terms. It strictly preserves original source codes, examples, and knowledge points without arbitrary additions or modifications.

Key Note: HTML (HyperText Markup Language) is the standard language for web documents. LangChain's HTML loaders strip away HTML tags to extract usable text, with optional metadata (e.g., page title).

1. What Are HTML Loaders?

HTML loaders convert raw HTML files into structured Document objects for LangChain workflows.

  • Core function: Extract text content from HTML (removing tags like <h1>, <p>) and attach metadata (e.g., file source).
  • Two supported loaders:
    • UnstructuredHTMLLoader: Simple loader for basic text extraction.
    • BSHTMLLoader: Uses the BeautifulSoup4 library to extract text + page title (stored in metadata).

2. Prerequisites

  • For BSHTMLLoader, install the BeautifulSoup4 library first (required for HTML parsing):

    bash 复制代码
    pip install beautifulsoup4

3. Loader 1: UnstructuredHTMLLoader (Basic Text Extraction)

This loader extracts plain text from HTML, ignoring complex metadata (e.g., page title).

Step 3.1: Import the Loader

python 复制代码
from langchain.document_loaders import UnstructuredHTMLLoader

Step 3.2: Initialize and Load the HTML File

python 复制代码
# Initialize loader with the path to your HTML file
loader = UnstructuredHTMLLoader("example_data/fake-content.html")

# Load the HTML into a Document object
data = loader.load()

Step 3.3: View the Result

python 复制代码
data

Output (Exact as Original):

python 复制代码
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]

4. Loader 2: BSHTMLLoader (Text + Title Extraction)

This loader uses BeautifulSoup4 to extract both text content and the HTML page's title (stored in the title field of metadata).

Step 4.1: Import the Loader

python 复制代码
from langchain.document_loaders import BSHTMLLoader

Step 4.2: Initialize and Load the HTML File

python 复制代码
# Initialize loader with the path to your HTML file
loader = BSHTMLLoader("example_data/fake-content.html")

# Load the HTML into a Document object
data = loader.load()

Step 4.3: View the Result

python 复制代码
data

Output (Exact as Original):

python 复制代码
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]

Key Takeaways

  • UnstructuredHTMLLoader: Extracts basic text from HTML (no title metadata).
  • BSHTMLLoader: Requires BeautifulSoup4, extracts text + page title (stored in metadata["title"]).
  • Both loaders return Document objects with page_content (extracted text) and metadata["source"] (file path).
相关推荐
苏打水com9 小时前
第十五篇:Day43-45 前端性能优化进阶——从“可用”到“极致”(对标职场“高并发场景优化”需求)
前端·css·vue·html·js
苏打水com9 小时前
第十六篇:Day46-48 前端安全进阶——从“漏洞防范”到“安全体系”(对标职场“攻防实战”需求)
前端·javascript·css·vue.js·html
JANG102410 小时前
【Linux】进程
linux·网络·chrome
油丶酸萝卜别吃11 小时前
修改chrome配置,关闭跨域校验
前端·chrome
大模型教程11 小时前
使用Langchain4j和Ollama3搭建RAG系统
langchain·llm·ollama
私人珍藏库12 小时前
[吾爱大神原创工具] FlowMouse - 心流鼠标手势 v1.0【Chrome浏览器插件】
前端·chrome·计算机外设
Elwin Wong12 小时前
本地运行LangChain Agent用于开发调试
人工智能·langchain·大模型·llm·agent·codingagent
Можно13 小时前
ES6 Map 全面解析:从基础到实战的进阶指南
前端·javascript·html
BD_Marathon13 小时前
【JavaWeb】乱码问题_HTML_Tomcat日志_sout乱码问题
java·tomcat·html
小肖爱笑不爱笑13 小时前
2025/12/16 HTML CSS
java·开发语言·css·html·web