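LangChain Automation Tool Integration Guide: For Scraper Developers
Introduction
Web scraping and browser automation are common ways to collect data. LangChain integrates with several scraping tools, so you can fetch page content and hand it to a large language model in the same pipeline.
1. Comparison of Mainstream Web Scraping Tools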
| Tool/Platform | Type | Free? | Free Quota | API Key Required? | Official Site |
|---|---|---|---|---|---|
| Beautiful Soup | HTML parsing | Completely free | Unlimited | No | https://www.crummy.com/software/BeautifulSoup/ |
| Playwright | Browser automation | Completely free | Unlimited | No | https://playwright.dev |
| Requests | HTTP requests | Completely free | Unlimited | No | https://requests.readthedocs.io |
| Apify | Cloud scraping platform | Free tier | $5 of credit per month | Yes | https://apify.com |
| AgentQL | AI-enhanced scraping | Free tier | Free trial | Yes | https://agentql.com |
| FireCrawl | AI web scraping | Free tier | Free trial | Yes | https://firecrawl.dev |
| Scrapy | Crawling framework | Completely free | Unlimited | No | https://scrapy.org |
2. Basic Web Scraping (All Free, No API Key Required)
2.1 Requests + Beautiful Soup
The classic approach for scraping static pages.
python
import requests
from bs4 import BeautifulSoup
from langchain_core.documents import Document

# Request the page (the timeout keeps the script from hanging on a slow server)
url = "https://example.com"
response = requests.get(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    },
    timeout=10,
)
response.raise_for_status()  # Fail early on 4xx/5xx responses

# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Extract the content
h1 = soup.find("h1")
title = h1.get_text(strip=True) if h1 else "Untitled"
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# Create a LangChain Document
doc = Document(
    page_content=f"Title: {title}\n\nContent: {' '.join(paragraphs)}",
    metadata={"source": url, "title": title},
)
print(f"Scrape complete: {doc.page_content[:300]}...")
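2.2 The LangChain Wrapper: BSHTMLLoader
python
import requests
from langchain_community.document_loaders import BSHTMLLoader

# Fetch the page first and save it to disk (BSHTMLLoader reads from a file path)
url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()
with open("page.html", "w", encoding="utf-8") as f:
    f.write(response.text)

# Load it with the LangChain loader.
# The loader's default parser is lxml; html.parser avoids the extra dependency.
loader = BSHTMLLoader(
    "page.html",
    open_encoding="utf-8",
    bs_kwargs={"features": "html.parser"},
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
print(f"Content preview: {documents[0].page_content[:200]}...")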
2.3 Playwright (Dynamic Pages with JavaScript Support)
Use Playwright to scrape dynamic pages that need JavaScript rendering.
Install (the default HTML evaluator of PlaywrightURLLoader also needs the unstructured package):
pip install playwright unstructured
playwright install chromium
Code example:
python
from langchain_community.document_loaders import PlaywrightURLLoader

# Scrape dynamic pages
loader = PlaywrightURLLoader(
    urls=[
        "https://example.com/article-1",
        "https://example.com/article-2",
    ],
    headless=True,  # Headless mode (no visible browser window)
    remove_selectors=["script", "style", "nav", "footer"],  # Drop unwanted elements
)
documents = loader.load()
for i, doc in enumerate(documents):
    print(f"\n=== Document {i + 1}: {doc.metadata.get('source', 'Unknown')} ===")
    print(f"Content: {doc.page_content[:200]}...")
Using the Playwright API directly (more flexible):
python
from playwright.sync_api import sync_playwright
from langchain_core.documents import Document

url = "https://example.com"

with sync_playwright() as p:
    # Launch the browser
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Open the page
    page.goto(url)

    # Wait for dynamic content to load. A fixed wait is the simplest option;
    # page.wait_for_load_state("networkidle") is usually more reliable.
    page.wait_for_timeout(2000)

    # Take a screenshot (optional)
    # page.screenshot(path="screenshot.png")

    # Extract the content
    title = page.title()
    content = page.inner_text("body")

    # Click an element (for pages that need interaction)
    # page.click("button.load-more")

    browser.close()

# Create a LangChain Document
doc = Document(
    page_content=content,
    metadata={"source": url, "title": title},
)
print(f"Dynamic page scraped: {doc.page_content[:200]}...")
2.4 Scrapy (Professional Crawling Framework)
Install:
pip install scrapy
Code example:
python
import scrapy
from scrapy.crawler import CrawlerProcess
from langchain_core.documents import Document


class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = [
        "https://example.com/news/1",
        "https://example.com/news/2",
    ]
    # Class-level list that collects the scraped documents,
    # so it can be read after the crawl finishes
    documents = []

    def parse(self, response):
        title = response.css("h1::text").get()
        content = " ".join(response.css("p::text").getall())
        doc = Document(
            page_content=content,
            metadata={"source": response.url, "title": title},
        )
        self.documents.append(doc)
        yield {"title": title, "content": content[:100]}


# Run the spider (process.start() blocks until the crawl is done)
process = CrawlerProcess(settings={
    "USER_AGENT": "Mozilla/5.0",
    "LOG_LEVEL": "INFO",
    "DOWNLOAD_DELAY": 1,  # Wait 1 second between requests to avoid being blocked
})
process.crawl(NewsSpider)
process.start()

print(f"\nCrawl complete, {len(NewsSpider.documents)} documents collected")
for doc in NewsSpider.documents[:2]:
    print(f"Title: {doc.metadata.get('title')}")
    print(f"Content: {doc.page_content[:150]}...\n")
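3. Cloud Scraping Services (API Key Required)
3.1 Apify
Get an API key:
- Sign up at https://apify.com
- Create a token under Settings → Personal API Tokens
Set the environment variable (Windows command shown; setx only affects terminals opened afterwards, and on macOS/Linux you would use export instead):
setx APIFY_API_TOKEN "xxxxxxxxxxxxxxxxxxxxxxxx"
Code example:
python
import os
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# Requires the apify-client package (pip install apify-client)
apify = ApifyWrapper(apify_api_token=os.environ.get("APIFY_API_TOKEN"))

# Call a prebuilt crawler Actor.
# call_actor() returns a dataset loader, not a list of documents.
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://example.com"}],
        # Page cap; confirm the field name against the Actor's input schema
        "maxCrawlPages": 3,
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("text", ""),
        metadata={"source": item.get("url", "")},
    ),
)
documents = loader.load()

print(f"Scraped {len(documents)} pages")
for doc in documents[:2]:
    print(f"\nSource: {doc.metadata.get('source')}")
    print(f"Content: {doc.page_content[:200]}...")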
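The examples in this section read their credentials with `os.environ.get`, which quietly returns `None` when the variable is missing, so a misconfigured shell only shows up later as an authentication error. A minimal sketch of a fail-fast alternative follows; `require_env` is a hypothetical helper name, not part of LangChain or Apify.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration: it treats an unset variable and an
    empty or whitespace-only value the same way.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Set it, then open a new terminal before running the script."
        )
    return value
```

Calling `require_env("APIFY_API_TOKEN")` in place of `os.environ.get("APIFY_API_TOKEN")` would stop the script at startup with a readable message instead of failing inside the API client.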
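3.2 FireCrawl (AI-Enhanced Scraping)
Get an API key:
- Sign up at https://firecrawl.dev
- Copy your API key
Set the environment variable (Windows command shown; on macOS/Linux you would use export instead):
setx FIRECRAWL_API_KEY "xxxxxxxxxxxxxxxxxxxxxxxx"
Code example:
python
import os
from langchain_community.document_loaders import FireCrawlLoader

# Requires the firecrawl-py package (pip install firecrawl-py)

# Scrape a single page
loader = FireCrawlLoader(
    api_key=os.environ.get("FIRECRAWL_API_KEY"),
    url="https://example.com",
    mode="scrape",  # Modes: scrape (single page) / crawl (whole site)
)
documents = loader.load()
print(f"Scraped {len(documents)} documents")
for doc in documents:
    print(f"\nSource: {doc.metadata.get('sourceURL')}")
    print(f"Content: {doc.page_content[:200]}...")

# Whole-site crawl mode
crawl_loader = FireCrawlLoader(
    api_key=os.environ.get("FIRECRAWL_API_KEY"),
    url="https://example.com",
    mode="crawl",
    params={"limit": 10},  # At most 10 pages; accepted keys may vary by firecrawl-py version
)
crawl_documents = crawl_loader.load()
print(f"Crawled {len(crawl_documents)} documents")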
4. Hands-On: An Intelligent Web Page Information Extraction System
python
import requests
from bs4 import BeautifulSoup
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain_core.prompts import ChatPromptTemplate
# Import path used by LangChain 0.x; check your installed version if it fails
from langchain.agents import create_tool_calling_agent, AgentExecutor


# === 1. Define the scraping tool (free, no API key required) ===
@tool
def fetch_webpage(url: str) -> str:
    """Fetch the content of the web page at the given URL to get up-to-date page information."""
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        soup = BeautifulSoup(response.text, "html.parser")
        # Remove scripts and styles
        for tag in soup(["script", "style"]):
            tag.decompose()
        title_tag = soup.find("title")
        title = title_tag.get_text(strip=True) if title_tag else "Untitled"
        text = soup.get_text(separator="\n", strip=True)
        # Cap the text at 3,000 characters to stay within the model's context
        return f"Source: {url}\nTitle: {title}\n\nContent:\n{text[:3000]}"
    except requests.RequestException as e:
        # Return the error as text so the agent can report it instead of crashing
        return f"Fetch failed: {e}"


# === 2. Initialize the LLM (local Ollama, free) ===
# Replace the model tag with one you have pulled locally that supports tool calling
llm = ChatOpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
    model="qwen3.5:4b",
    temperature=0.3,
)

# === 3. Create the agent ===
tools = [fetch_webpage]
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a web page information extraction assistant.
When the user asks about the content of a web page, use the fetch_webpage tool to fetch it,
then extract the information the user needs from the fetched content and answer in a clear format."""),
    ("user", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# === 4. Test run ===
if __name__ == "__main__":
    test_queries = [
        "Fetch https://example.com and summarize the main content of the page",
    ]
    for query in test_queries:
        print(f"\n{'=' * 60}")
        print(f"Query: {query}")
        print("=" * 60 + "\n")
        result = executor.invoke({"input": query})
        print(f"\nAnswer: {result['output']}")
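Because the agent decides which URL to pass to `fetch_webpage`, it may be worth checking the URL before requesting it. The sketch below is one possible guard, not part of the original system: `is_safe_url` is a hypothetical helper that accepts only http/https URLs and rejects localhost and private or loopback IP literals. It does not resolve DNS, so a hostname that points at an internal address would still pass.

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url: str) -> bool:
    """Return True if the URL looks like a public http(s) address.

    Hypothetical helper: checks the scheme and the literal host only.
    """
    parsed = urlparse(url.strip())
    if parsed.scheme.lower() not in ALLOWED_SCHEMES:
        return False

    host = (parsed.hostname or "").lower()
    if not host or host == "localhost":
        return False

    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so it is an ordinary hostname; accept it.
        return True

    # Reject addresses that only make sense inside the local machine or network.
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_unspecified
    )
```

Inside the tool, a first line such as `if not is_safe_url(url): return "Fetch failed: URL not allowed"` would keep the agent from being steered toward local services like the Ollama endpoint used above.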
5. Hands-On: Batch News Scraping and Summarization
python
import requests
from bs4 import BeautifulSoup
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser


class NewsCrawler:
    def __init__(self):
        # Local Ollama model; replace the tag with one you have pulled
        self.llm = ChatOpenAI(
            base_url="http://localhost:11434/v1/",
            api_key="ollama",
            model="qwen3.5:4b",
            temperature=0.3,
        )
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        # Summary prompt template
        self.summary_prompt = PromptTemplate.from_template(
            "Write a summary of 100 words or fewer for the following article.\n\n"
            "Article:\n{content}\n\n"
            "Summary:"
        )
        # Build the processing chain: prompt -> model -> plain string
        self.summary_chain = self.summary_prompt | self.llm | StrOutputParser()

    def fetch_article(self, url: str) -> Document:
        """Fetch a single article."""
        print(f"Fetching: {url}")
        response = requests.get(url, headers=self.headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        h1 = soup.find("h1")
        title = h1.get_text(strip=True) if h1 else "Untitled"
        # Locate the article body (adjust the selectors for the actual site)
        article_body = soup.find("article") or soup.find(class_="content") or soup
        paragraphs = article_body.find_all("p")
        content = "\n".join(p.text.strip() for p in paragraphs if p.text.strip())
        return Document(
            page_content=content,
            metadata={"source": url, "title": title},
        )

    def summarize(self, doc: Document) -> str:
        """Generate a summary with the LLM."""
        # Cap the content length so it does not exceed the model's context
        content = doc.page_content[:3000]
        return self.summary_chain.invoke({"content": content})

    def process_urls(self, urls: list[str]) -> list[dict]:
        """Process a batch of URLs, skipping any that fail."""
        results = []
        for url in urls:
            try:
                doc = self.fetch_article(url)
                summary = self.summarize(doc)
                results.append({
                    "title": doc.metadata["title"],
                    "source": url,
                    "summary": summary,
                })
                print(f"✓ Done: {doc.metadata['title']}")
            except Exception as e:
                print(f"✗ Failed {url}: {e}")
        return results


# Usage example
if __name__ == "__main__":
    crawler = NewsCrawler()
    # Article URLs to scrape (examples)
    urls = [
        "https://example.com/news/ai-breakthrough",
        "https://example.com/news/llm-benchmark",
    ]
    # Batch processing
    print("Starting scrape and summarization...")
    results = crawler.process_urls(urls)
    # Print the results
    print("\n" + "=" * 60)
    print("Results:")
    print("=" * 60)
    for i, result in enumerate(results, 1):
        print(f"\n{i}. {result['title']}")
        print(f"   Source: {result['source']}")
        print(f"   Summary: {result['summary']}")
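The `summarize` method above cuts the article at a fixed 3,000 characters, which can end the input mid-sentence. As a hedged alternative, the sketch below keeps whole paragraphs until the limit would be exceeded; `truncate_at_paragraph` is a hypothetical helper and assumes paragraphs are separated by single newlines, as they are in `fetch_article`.

```python
def truncate_at_paragraph(text: str, max_chars: int = 3000) -> str:
    """Keep whole paragraphs until adding the next one would exceed max_chars.

    Hypothetical helper: paragraphs are assumed to be newline-separated.
    """
    if len(text) <= max_chars:
        return text

    kept = []
    used = 0
    for paragraph in text.split("\n"):
        # Every paragraph after the first also costs one joining newline.
        cost = len(paragraph) + (1 if kept else 0)
        if used + cost > max_chars:
            break
        kept.append(paragraph)
        used += cost

    # Fall back to a hard cut if even the first paragraph is too long.
    return "\n".join(kept) if kept else text[:max_chars]
```

In `summarize`, `doc.page_content[:3000]` could then be replaced by `truncate_at_paragraph(doc.page_content, 3000)` so the model always receives complete paragraphs.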
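6. Quick Selection Guide
| Scenario | Recommended Tool | Free? | Pros | Cons |
|---|---|---|---|---|
| Static pages | Requests + BS4 | Free | Simple and fast | No JavaScript support |
| Dynamic pages | Playwright | Free | Renders JavaScript | Heavyweight |
| Large-scale crawling | Scrapy | Free | High performance, extensible | Learning curve |
| Cloud scraping service | Apify | Free tier | No infrastructure to maintain | Paid beyond the free tier |
| AI-enhanced scraping | FireCrawl | Free tier | Intelligent extraction | Higher cost |
| Simple scraping | FireCrawl / Playwright | Free / free tier | Works out of the box | Limited features |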
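7. FAQ
Q: Why use LangChain instead of calling Requests directly?
A: LangChain's value is in passing scraped content straight into LLM processing, so scraping and AI processing live in one pipeline.
Q: Can scraping get me blocked by a website?
A: It can. Follow robots.txt, keep a reasonable interval between requests (for example, 1 second or more), and send an appropriate User-Agent.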
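The two habits in this answer, checking robots.txt and spacing out requests, can be sketched with the standard library alone. The code below is an illustration under stated assumptions: `build_robot_rules` and `PoliteScheduler` are hypothetical names, and the robots.txt text is passed in as a string so the example runs without network access (in practice you would download it from the site first).

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteScheduler:
    """Hypothetical helper: checks robots.txt rules and enforces a minimum delay."""

    def __init__(self, rules: RobotFileParser, user_agent: str, min_delay: float = 1.0):
        self.rules = rules
        self.user_agent = user_agent
        # Honor a Crawl-delay directive if the site declares a longer one.
        self.min_delay = max(min_delay, rules.crawl_delay(user_agent) or 0)
        self._last_request = None

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self.rules.can_fetch(self.user_agent, url)

    def wait(self) -> None:
        """Sleep just long enough to keep min_delay between consecutive requests."""
        if self._last_request is not None:
            remaining = self.min_delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()
```

A crawl loop would call `scheduler.allowed(url)` to skip disallowed pages and `scheduler.wait()` immediately before each request.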
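Q: How do I scrape content behind a login?
A: Playwright can set cookies and script a login flow, and Apify also supports authentication settings.
Q: Are the free tools enough for production?
A: For small and medium workloads, yes. For large-scale, high-frequency scraping, a cloud service is the better choice.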
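LangChain's automation tools make web scraping straightforward:
- Plenty of free options: Requests, BS4, Playwright, and Scrapy are all completely free
- Flexible integration: scraped content feeds directly into LLM processing
- Professional support: Apify and FireCrawl offer enterprise-grade scraping services
- No API key for the basics: the core tools run locally and need no registration
Recommended learning path: start with Requests + Beautiful Soup for static pages, bring in Playwright when you hit dynamic pages, and consider cloud services only when you need scale.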
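One rough way to decide when to move from the static approach to Playwright is to check how much visible text the raw HTML actually contains. The sketch below is a heuristic under stated assumptions, not a reliable detector: `probably_needs_js` is a hypothetical helper, and the 200-character threshold is an arbitrary starting point you would tune per site.

```python
from html.parser import HTMLParser


class _VisibleTextCounter(HTMLParser):
    """Counts text characters that sit outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def probably_needs_js(html: str, min_visible_chars: int = 200) -> bool:
    """Guess whether a page is an empty shell that is filled in by JavaScript.

    Heuristic only: returns True when the static HTML has very little visible text.
    """
    counter = _VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

If `probably_needs_js(response.text)` returns True for a page fetched with Requests, that is a hint to retry it with Playwright rather than proof that the page is dynamic.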