Extracting Data from Web Pages with Crawl4AI: CSS, XPath, and LLM Routes Tested

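Last week I helped a friend build a competitor price monitoring system. The requirement was clear: crawl 20 e-commerce pages every day and extract product names and prices into JSON for a database. I wrote a first version with BeautifulSoup, about 200 lines of code, and it blew up after two days: the page structure changed and every selector stopped working.

Then I switched to Crawl4AI, an LLM-friendly crawler framework with 33k stars on GitHub. The v0.8.x releases split data extraction into three routes: CSS Schema, XPath Schema, and LLM extraction. I ran all three, hit a fair number of pitfalls, and wrote them down here.

Environment setup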

pip install crawl4ai
crawl4ai-setup   # installs Chromium automatically, roughly 200MB

Verify the install:

import crawl4ai
print(crawl4ai.__version__)  # 0.8.x

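If you run in Docker, there is an official image: docker pull unclecode/crawl4ai:latest. It saves you from wrestling with Chromium dependencies.

Route 1: CSS Schema extraction (zero LLM cost)

This is the approach I recommend most. For pages with a regular structure (product lists, article lists, tabular data), it is all you need.

The core idea: define a JSON schema that tells Crawl4AI which CSS selector each product sits under, and where the name and price are inside it.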

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_products():
    schema = {
        "name": "Products",
        "baseSelector": "div.product-card",
        "fields": [
            {
                "name": "title",
                "selector": "h3.product-title",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "span.price",
                "type": "text"
            },
            {
                "name": "link",
                "selector": "a.product-link",
                "type": "attribute",
                "attribute": "href"
            },
            {
                "name": "image",
                "selector": "img.product-img",
                "type": "attribute",
                "attribute": "src"
            }
        ]
    }

    strategy = JsonCssExtractionStrategy(schema, verbose=True)
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=strategy,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )
        if result.success:
            data = json.loads(result.extracted_content)
            print(f"Extracted {len(data)} products")
            if data:  # a selector mismatch yields an empty list, so guard the index
                print(json.dumps(data[0], indent=2, ensure_ascii=False))

asyncio.run(extract_products())

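A few key points:

baseSelector determines how many records you get. It points at the container element that repeats on the page. Get it wrong and you get 0 results with no error message. The first time, I wrote div.product_card instead of div.product-card, got an empty array back, and spent half an hour before noticing I had mixed up the underscore and the hyphen.

The type field takes several values: text (text content), attribute (an HTML attribute), html (the element's HTML), regex (regex match), and nested (a single nested sub-object), plus list and nested_list for repeated items. In practice text and attribute cover most cases.

Performance: extracting 200 product entries takes about 50ms, with no LLM call and no cost.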
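Because a wrong baseSelector fails silently, it can help to turn an empty result into a loud error before anything reaches the database. The sketch below is plain Python that runs on the JSON string from result.extracted_content; load_items and min_items are hypothetical names of my own, not part of Crawl4AI.

```python
import json


def load_items(extracted_content, min_items=1):
    """Parse the extracted JSON string and fail loudly if too few items came back."""
    items = json.loads(extracted_content) if extracted_content else []
    if not isinstance(items, list):
        items = [items]  # tolerate a single object instead of a list
    if len(items) < min_items:
        raise ValueError(
            f"expected at least {min_items} item(s), got {len(items)}; "
            "check baseSelector against the live DOM"
        )
    return items


# A well-formed payload passes through unchanged.
ok = load_items('[{"title": "Widget", "price": "$9.99"}]')

# An empty array (what a mistyped baseSelector produces) raises instead of passing silently.
try:
    load_items("[]")
    empty_rejected = False
except ValueError:
    empty_rejected = True
```

In a daily job, letting this exception propagate means a broken selector shows up as a failed run rather than as a day of missing prices.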

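Handling nested structures

E-commerce pages are often nested: one product has several reviews, and each review has a username, a rating, and a body. Since each product has multiple reviews, use the nested_list type (plain nested yields a single sub-object):

schema = {
    "name": "ProductsWithReviews",
    "baseSelector": "div.product-card",
    "fields": [
        {"name": "title", "selector": "h3.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {
            "name": "reviews",
            "selector": "div.review-item",
            "type": "nested_list",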
            "fields": [
                {"name": "user", "selector": "span.reviewer", "type": "text"},
                {"name": "rating", "selector": "span.stars", "type": "text"},
                {"name": "content", "selector": "p.review-text", "type": "text"}
            ]
        }
    ]
}

Each product object then carries a reviews array. That is much cleaner than BeautifulSoup code with loops inside loops.
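If the destination is a relational table, the nested result usually has to be flattened into one row per review. A minimal sketch, assuming the extracted data is a list of product dicts that each carry a reviews list as described above; flatten_reviews and the product_title column are hypothetical names.

```python
def flatten_reviews(products):
    """Turn nested product/review dicts into one flat row per review."""
    rows = []
    for product in products:
        # A product without reviews contributes no rows.
        for review in product.get("reviews") or []:
            rows.append({
                "product_title": product.get("title"),
                "user": review.get("user"),
                "rating": review.get("rating"),
                "content": review.get("content"),
            })
    return rows


sample = [
    {"title": "Kettle", "price": "$25", "reviews": [
        {"user": "ann", "rating": "5", "content": "Boils fast"},
        {"user": "bob", "rating": "3", "content": "Loud"},
    ]},
    {"title": "Mug", "price": "$8", "reviews": []},
]
rows = flatten_reviews(sample)
```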

Route 2: XPath Schema extraction

Some pages have HTML that does not suit CSS selectors, such as tables without class names or deeply nested XML-style tags. XPath is a better fit there.

from crawl4ai import JsonXPathExtractionStrategy

schema = {
    "name": "TableData",
    "baseSelector": "//table[@id='data-table']/tbody/tr",
    "fields": [
        {
            "name": "company",
            "selector": ".//td[1]",
            "type": "text"
        },
        {
            "name": "revenue",
            "selector": ".//td[2]",
            "type": "text"
        },
        {
            "name": "growth",
            "selector": ".//td[3]",
            "type": "text"
        }
    ]
}

strategy = JsonXPathExtractionStrategy(schema, verbose=True)

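XPath's strength is positional precision. .//td[1] picks the first column by position, with no dependence on class names. For table-heavy pages such as financial reports or government open data, XPath is handier than CSS.

Pitfall: don't drop the leading // from the XPath baseSelector. I wrote table[@id='data']/tbody/tr (missing the //) and got an lxml parsing error straight away, with a message that was not very clear.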
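To see how a row-level base selector and per-row relative selectors fit together, here is a small offline sketch that uses the standard library's xml.etree.ElementTree as a stand-in. That is an assumption for illustration only: ElementTree supports just a subset of XPath and requires paths on an element to start with ".", whereas the strategy above takes a // base selector. The table contents are made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup mirroring the schema above.
HTML = """
<html><body>
  <table id="data-table">
    <tbody>
      <tr><td>Acme</td><td>120</td><td>8%</td></tr>
      <tr><td>Globex</td><td>95</td><td>-2%</td></tr>
    </tbody>
  </table>
</body></html>
"""

BASE = ".//table[@id='data-table']/tbody/tr"  # one match per record
FIELDS = {"company": ".//td[1]", "revenue": ".//td[2]", "growth": ".//td[3]"}

root = ET.fromstring(HTML.strip())
rows = []
for tr in root.findall(BASE):
    # Each field selector is evaluated relative to the current row.
    rows.append({name: tr.findtext(path) for name, path in FIELDS.items()})
```

The base selector decides how many records come back, and the positional td[n] selectors pull columns without relying on any class name.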

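Route 3: LLM extraction (the weapon for unstructured content)

CSS and XPath share a premise: the page structure is regular. With news body text, forum posts, or product descriptions, which are unstructured, selectors don't work. That is when you bring in an LLM.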

import asyncio
import json
import os
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

class NewsItem(BaseModel):
    headline: str = Field(description="News headline")
    summary: str = Field(description="Summary, within 100 words")
    sentiment: str = Field(description="Sentiment: positive/negative/neutral")
    key_entities: list[str] = Field(description="Key entities mentioned")

async def extract_news():
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token=os.getenv("OPENAI_API_KEY")
        ),
        schema=NewsItem.model_json_schema(),
        extraction_type="schema",
        instruction="Extract every news item from the page content. Each item needs a headline, a summary within 100 words, a sentiment, and key entities.",
        chunk_token_threshold=2000,
        overlap_rate=0.1,
        apply_chunking=True,
        input_format="fit_markdown",
        extra_args={"temperature": 0.0, "max_tokens": 2000}
    )

    config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        # fit_markdown is only produced when a content filter is attached
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter()
        ),
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        if result.success:
            items = json.loads(result.extracted_content)
            print(f"Extracted {len(items)} news items")
            llm_strategy.show_usage()  # print token usage

asyncio.run(extract_news())

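A few parameters to watch:

Set input_format to fit_markdown. The default is markdown (the raw Markdown), but fit_markdown, which has been filtered by PruningContentFilter, is cleaner: navigation bars, footers, ads, and similar noise are gone. It is only populated when a content filter is configured on the markdown generator. Token usage drops by around 40%.

Don't set chunk_token_threshold too high. I tried 8000, and GPT-4o-mini's miss rate on long text rose noticeably. 2000-3000 is a fairly stable range. Set overlap_rate to 0.1 so that content at chunk boundaries is not cut off.

Set temperature to 0.0. Extraction needs determinism, not creativity.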
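To make the interaction between chunk_token_threshold and overlap_rate concrete, here is a small illustration of sliding-window chunking with overlap. It is a sketch of the general idea, not Crawl4AI's internal chunker, which may count tokens and place boundaries differently; chunk_with_overlap is a hypothetical helper.

```python
def chunk_with_overlap(tokens, threshold, overlap_rate):
    """Split tokens into windows of at most `threshold`, each sharing a tail with the next."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    overlap = int(threshold * overlap_rate)
    step = max(threshold - overlap, 1)  # always advance by at least one token
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + threshold])
        if start + threshold >= len(tokens):
            break  # this window already reached the end
    return chunks


tokens = list(range(25))
chunks = chunk_with_overlap(tokens, threshold=10, overlap_rate=0.2)
```

With these toy numbers every window repeats the last two tokens of the previous one, so an item that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.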

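Cost comparison

I ran the same page (the Hacker News front page, 30 posts) through all three methods:

Method	Time	Cost	Accuracy
CSS Schema	1.2s	$0	100% (when the structure matches)
XPath Schema	1.3s	$0	100% (when the structure matches)
LLM (gpt-4o-mini)	4.8s	~$0.003	93% (occasionally misses or adds items)

The conclusion is clear: where CSS/XPath works, don't reach for an LLM. Save the LLM for genuinely unstructured content.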
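For a rough sense of scale, the per-page figure can be projected onto a daily job. This is back-of-the-envelope arithmetic, assuming every page costs about the same as the Hacker News run above (real pages will vary with length); monthly_llm_cost is a hypothetical helper.

```python
from decimal import Decimal


def monthly_llm_cost(cost_per_page, pages_per_day, days=30):
    """Project a per-page LLM cost (given as a string, in dollars) over a month."""
    return Decimal(cost_per_page) * pages_per_day * days


# 20 pages a day, as in the price-monitoring job from the introduction.
small_job = monthly_llm_cost("0.003", pages_per_day=20)

# The same per-page cost at 1000 pages a day.
large_job = monthly_llm_cost("0.003", pages_per_day=1000)
```

At 20 pages a day this comes to about $1.80 a month, and at 1000 pages a day about $90, so the cost argument matters mostly at volume; the accuracy gap applies at any scale.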

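An advanced trick: let the LLM generate the schema for you

Crawl4AI (v0.8 included) has a feature where an LLM analyzes the page HTML once and generates a CSS schema automatically. After that you reuse the schema for every extraction, with no further LLM calls.

import json
import os
from crawl4ai import JsonCssExtractionStrategy, LLMConfig

# A sample of the page HTML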
sample_html = """
<div class="job-listing">
  <h2 class="job-title">后端工程师</h2>
  <span class="company">字节跳动</span>
  <span class="salary">25-50K</span>
  <span class="location">北京</span>
</div>
<div class="job-listing">
  <h2 class="job-title">算法工程师</h2>
  <span class="company">阿里巴巴</span>
  <span class="salary">30-60K</span>
  <span class="location">杭州</span>
</div>
"""

# One-time cost: the LLM analyzes the HTML structure and generates the schema
schema = JsonCssExtractionStrategy.generate_schema(
    sample_html,
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY")
    )
)

print(json.dumps(schema, indent=2, ensure_ascii=False))

# Save the schema and reuse it from now on without calling the LLM
with open("job_schema.json", "w", encoding="utf-8") as f:
    json.dump(schema, f, ensure_ascii=False, indent=2)

# Later extractions: zero cost
strategy = JsonCssExtractionStrategy(schema)

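The idea is to combine the LLM's one-off understanding with the zero-cost repeatability of CSS extraction. When you batch-crawl 1000 pages with the same structure, you pay for the LLM only once.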
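The "pay once" part depends on actually reusing the saved file on later runs. A minimal sketch of a load-or-generate cache in plain Python: load_or_create_schema is a hypothetical helper, and fake_generate stands in for the LLM-backed schema generation so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def load_or_create_schema(path, generate):
    """Return the cached schema at `path`, calling `generate()` only on a cache miss."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate()
    path.write_text(json.dumps(schema, ensure_ascii=False, indent=2), encoding="utf-8")
    return schema


calls = []  # records each invocation of the (fake) generator


def fake_generate():
    """Stand-in for the LLM call; a real version would invoke the schema generator."""
    calls.append(1)
    return {"name": "Jobs", "baseSelector": "div.job-listing", "fields": []}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "job_schema.json"
    first = load_or_create_schema(cache_file, fake_generate)   # miss: generates and saves
    second = load_or_create_schema(cache_file, fake_generate)  # hit: reads the file
```

When the site's layout changes, deleting the cached file is enough to trigger one fresh generation.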

Handling dynamic pages

Many modern pages render their content with JavaScript. Crawl4AI has built-in wait strategies:

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=strategy,
    wait_for="css:div.product-card:nth-child(20)",  # wait until the 20th product appears
    js_code="window.scrollTo(0, document.body.scrollHeight);",  # simulate scrolling to trigger loading
    delay_before_return_html=2.0,  # wait an extra 2 seconds
)

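wait_for supports two forms:

css:selector - waits until an element matching the CSS selector appears
js:expression - waits until the JavaScript expression returns a truthy value

One pitfall I hit: a page used a virtual list (only the DOM for the visible area is rendered), so even after scrolling to the bottom there were only 10 DOM nodes. In that situation CSS extraction never gets the full data. In the end I used js_code to inject a script that read the virtual list's full data from memory, bypassing the rendering layer.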
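One detail worth knowing about the css: form: :nth-child(20) matches only when the card is the 20th child of its parent, counting every sibling element, which is not quite the same as "at least 20 cards exist". A count-based js: condition expresses that directly. The helper below just builds the string; wait_for_count is a hypothetical name, and the arrow-function form is my assumption about what the js: prefix accepts.

```python
import json


def wait_for_count(selector, minimum):
    """Build a `js:` wait condition that passes once `minimum` elements match `selector`."""
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    # json.dumps quotes and escapes the selector safely for embedding in JavaScript.
    return (
        "js:() => document.querySelectorAll("
        f"{json.dumps(selector)}).length >= {int(minimum)}"
    )


condition = wait_for_count("div.product-card", 20)
```

The resulting string can be passed as the wait_for value in place of the css: selector.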

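Hands-on: building a tech blog aggregator

Let's tie the pieces together. Goal: fetch the latest posts from several tech blogs (two are configured below) and output unified JSON.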

import asyncio
import json
from crawl4ai import (
    AsyncWebCrawler, BrowserConfig, CrawlerRunConfig,
    CacheMode, JsonCssExtractionStrategy
)

BLOG_CONFIGS = [
    {
        "url": "https://engineering.fb.com",
        "schema": {
            "name": "MetaEngBlog",
            "baseSelector": "article.post-card",
            "fields": [
                {"name": "title", "selector": "h2 a", "type": "text"},
                {"name": "link", "selector": "h2 a", "type": "attribute", "attribute": "href"},
                {"name": "date", "selector": "time", "type": "text"},
                {"name": "tags", "selector": "span.tag", "type": "text"}
            ]
        }
    },
    {
        "url": "https://netflixtechblog.com",
        "schema": {
            "name": "NetflixTechBlog",
            "baseSelector": "div.post-item",
            "fields": [
                {"name": "title", "selector": "h3 a", "type": "text"},
                {"name": "link", "selector": "h3 a", "type": "attribute", "attribute": "href"},
                {"name": "summary", "selector": "p.preview", "type": "text"}
            ]
        }
    }
]

async def aggregate_blogs():
    browser_cfg = BrowserConfig(headless=True)
    all_posts = []

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        for blog in BLOG_CONFIGS:
            strategy = JsonCssExtractionStrategy(blog["schema"])
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=strategy,
            )
            result = await crawler.arun(url=blog["url"], config=config)
            if result.success:
                posts = json.loads(result.extracted_content)
                for p in posts:
                    p["source"] = blog["schema"]["name"]
                all_posts.extend(posts)
                print(f"{blog['schema']['name']}: {len(posts)} posts")

    # Save to a file
    with open("tech_blogs.json", "w", encoding="utf-8") as f:
        json.dump(all_posts, f, ensure_ascii=False, indent=2)
    print(f"Total: {len(all_posts)} posts")

asyncio.run(aggregate_blogs())

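This script can go straight into cron. Crawl once a day, store the data as JSON, hook up a notification script that pushes to a Feishu group or email, and you have a simple tech intelligence system.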
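For a daily run, the notification step normally wants only the posts that were not there yesterday. A minimal sketch of merging a fresh crawl into the stored list, assuming each post dict has a link field as in the schemas above; merge_posts is a hypothetical helper.

```python
def merge_posts(existing, fresh, key="link"):
    """Return (merged, added): existing posts plus fresh ones whose key is not yet seen."""
    seen = {post.get(key) for post in existing}
    added = []
    for post in fresh:
        value = post.get(key)
        if value and value not in seen:  # skip posts without a link and duplicates
            seen.add(value)
            added.append(post)
    return existing + added, added


stored = [{"title": "Old post", "link": "/a"}]
crawled = [
    {"title": "Old post", "link": "/a"},
    {"title": "New post", "link": "/b"},
    {"title": "New post (duplicate)", "link": "/b"},
]
merged, new_posts = merge_posts(stored, crawled)
```

Only new_posts would go to the notification script, while merged is what gets written back to the JSON file.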

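Pitfall checklist

Chromium download fails. Network issues in mainland China can make crawl4ai-setup hang. Fix: download Chromium manually and put it under ~/.cache/crawl4ai/, or use the Docker image.
Extraction returns an empty array. First confirm the selectors in the browser's developer tools. Crawl4AI does not raise an error when selectors don't match; it just returns an empty result.
Dynamic pages yield no data. Add wait_for and delay_before_return_html. If that still fails, set BrowserConfig(headless=False) to run headed and see what the page actually renders.
LLM extraction is unstable. If the same page gives different results across two runs, set temperature to 0.0 and make the instruction more precise. If that is still not enough, use extraction_type="schema" with a Pydantic model to constrain the output format.
High memory usage. AsyncWebCrawler shares one browser instance by default. When batch-crawling a few hundred pages, the browser's memory footprint keeps growing. Recreating the crawler instance every 50 pages helps.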
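The "recreate every 50 pages" advice boils down to splitting the URL list into batches and opening a fresh crawler per batch. The batching itself is plain Python; the sketch below shows it with made-up URLs, and the commented loop indicates where the crawler from the earlier examples would go. batched is a hypothetical helper.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


urls = [f"https://example.com/p/{i}" for i in range(120)]
batch_sizes = [len(batch) for batch in batched(urls, 50)]

# Intended use, inside an async function (not executed here):
#
# for batch in batched(urls, 50):
#     async with AsyncWebCrawler() as crawler:  # fresh browser for each batch
#         for url in batch:
#             result = await crawler.arun(url=url, config=config)
```

Leaving the async with block closes the browser, so whatever memory it accumulated is released before the next batch starts.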

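When to use Crawl4AI, and when not to

Good fits:

You need structured data extracted from web pages for an LLM or a database
Pages are rendered dynamically with JavaScript
You need Markdown conversion (as a RAG data source)
You need to batch-extract pages that share a structure

Poor fits:

Fetching HTML from simple static pages (requests + BeautifulSoup is lighter)
Complex flows that need a logged-in session (native Playwright scripts are more flexible)
High-concurrency, large-scale crawling (Scrapy's scheduler is more mature)

Crawl4AI positions itself as a "data-layer tool for LLM applications". If you are building RAG, a knowledge base, competitor monitoring, or content aggregation, it saves a lot of work. The three extraction routes (CSS for structured pages, XPath for tables, LLM for unstructured content) cover most data-source scenarios.

Project: github.com/unclecode/c... Docs: docs.crawl4ai.com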