Regular Expressions in Web Crawlers: Techniques for Matching HTML and JSON

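In crawler development, regular expressions are a powerful tool for pulling the information we need out of complex text. They are useful for both HTML pages and JSON data. This article looks at how regular expressions are used in crawlers: how to match HTML and JSON data, and how greedy and non-greedy matching differ.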

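1. Regular Expression Basics

A regular expression is a pattern language for describing text; it is used to match and manipulate strings. Some commonly used syntax:

Match any character: . (except line breaks)
Match a digit: \d
Match a non-digit: \D
Match a letter: [a-zA-Z]
Match repeated characters: * (zero or more times), + (one or more times), ? (zero or one time)
Match specific characters: [] (for example, [abc] matches a, b, or c)
Match a range: - (for example, [a-z] matches lowercase letters)
Match the start and end: ^ (start of string), $ (end of string)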
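
The syntax listed above can be tried out directly in Node.js. The snippet below is a minimal sketch; the sample strings and the `checks` object are made up for illustration and are not part of the original article.

```javascript
// Each property exercises one item from the syntax list above.
const checks = {
  dot: /^a.c$/.test('abc'),                  // . matches any single character
  dotNewline: /^a.c$/.test('a\nc'),          // ...but not a newline
  digits: 'order 2024-10'.match(/\d+/g),     // \d+ finds runs of digits
  nonDigits: 'a1b2'.match(/\D/g),            // \D finds everything that is not a digit
  letters: 'Node 20'.match(/[a-zA-Z]+/)[0],  // the first run of ASCII letters
  star: /^ab*$/.test('a'),                   // * allows zero occurrences
  plus: /^ab+$/.test('a'),                   // + needs at least one occurrence
  optional: /^colou?r$/.test('color'),       // ? makes the u optional
  charClass: /^[abc]$/.test('b'),            // one character from the set
  anchors: /^\d+$/.test('123abc'),           // ^...$ must cover the whole string
};

console.log(checks);
```

Here `plus` and `anchors` come out `false`, which shows how a quantifier or an anchor can reject an input that otherwise looks close to the pattern.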

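2. Matching HTML with Regular Expressions

HTML pages have a complex structure with many tags and attributes. Regular expressions can help extract specific content from them. The patterns in this section assume the markup looks exactly like the samples; a change in attribute order or quoting will stop them from matching.

2.1 Extracting the content of an HTML tag

Suppose we have a simple HTML page: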

<div class="content">
  <h1 class="title">欢迎来到爬虫世界</h1>
  <p class="description">这是一个关于爬虫的教程。</p>
  <a href="https://example.com">访问示例网站</a>
</div>

2.1.1 Extracting the title

const html = `
  <div class="content">
    <h1 class="title">欢迎来到爬虫世界</h1>
    <p class="description">这是一个关于爬虫的教程。</p>
    <a href="https://example.com">访问示例网站</a>
  </div>
`;

const titleRegex = /<h1 class="title">(.*?)<\/h1>/;
const match = html.match(titleRegex);
if (match) {
  console.log('Extracted title:', match[1]); // Output: Extracted title: 欢迎来到爬虫世界
}
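
Because `.` does not match line breaks, a pattern like the one above finds nothing when the heading text sits on its own line, which is common in real pages. The sketch below compares three variants on a hypothetical multi-line sample (`multiLineHtml` is an assumption, not part of the page above); the `s` flag requires an ES2018-capable engine.

```javascript
// Hypothetical sample: the same <h1>, but with its text on its own line.
const multiLineHtml = '<h1 class="title">\n  Welcome\n</h1>';

// Without the s flag, . stops at the line break, so nothing matches.
const withoutFlag = multiLineHtml.match(/<h1 class="title">(.*?)<\/h1>/);

// With the s (dotAll) flag, . also matches line terminators.
const withFlag = multiLineHtml.match(/<h1 class="title">(.*?)<\/h1>/s);

// Equivalent pattern for engines that predate the s flag.
const withCharClass = multiLineHtml.match(/<h1 class="title">([\s\S]*?)<\/h1>/);

console.log(withoutFlag);             // null
console.log(withFlag[1].trim());      // Welcome
console.log(withCharClass[1].trim()); // Welcome
```

The `trim()` call removes the indentation and line breaks that the capture group picks up along with the text.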

2.1.2 Extracting the link

const linkRegex = /<a href="(.*?)">(.*?)<\/a>/;
const linkMatch = html.match(linkRegex);
if (linkMatch) {
  console.log('Extracted link:', linkMatch[1]); // Output: Extracted link: https://example.com
  console.log('Extracted link text:', linkMatch[2]); // Output: Extracted link text: 访问示例网站
}
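
The link pattern above only matches when `href` is the sole attribute and is written with double quotes. As a hedged sketch, the variant below tolerates extra attributes and either quote style by using named capture groups and a backreference; `linkPattern` and `extractLink` are hypothetical names introduced here, and the sample tags are invented.

```javascript
// Assumed variations: extra attributes around href, and either quote style.
// \k<quote> requires the closing quote to be the same character as the opening one.
const linkPattern =
  /<a\b[^>]*?\shref\s*=\s*(?<quote>["'])(?<href>.*?)\k<quote>[^>]*>(?<text>.*?)<\/a>/i;

// Returns { href, text } for the first link found, or null if there is none.
function extractLink(html) {
  const match = html.match(linkPattern);
  return match ? { href: match.groups.href, text: match.groups.text } : null;
}

console.log(extractLink(`<a class="btn" href='https://example.com/docs' target="_blank">Docs</a>`));
console.log(extractLink('<a href="https://example.com">Home</a>'));
console.log(extractLink('<p>no link here</p>'));
```

This is still a pattern match rather than a parser: nested tags inside the link text or a `>` inside an attribute value would need a real HTML parser.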

2.2 Extracting multiple HTML elements

Suppose we have a news list:

<div class="news-list">
  <div class="news-item">
    <h3 class="news-title"><a href="/news/1">新闻标题 1</a></h3>
    <p class="news-summary">新闻摘要 1</p>
    <span class="news-time">2024-10-01</span>
  </div>
  <div class="news-item">
    <h3 class="news-title"><a href="/news/2">新闻标题 2</a></h3>
    <p class="news-summary">新闻摘要 2</p>
    <span class="news-time">2024-10-02</span>
  </div>
</div>

2.2.1 Extracting all news titles and links

const newsHtml = `
  <div class="news-list">
    <div class="news-item">
      <h3 class="news-title"><a href="/news/1">新闻标题 1</a></h3>
      <p class="news-summary">新闻摘要 1</p>
      <span class="news-time">2024-10-01</span>
    </div>
    <div class="news-item">
      <h3 class="news-title"><a href="/news/2">新闻标题 2</a></h3>
      <p class="news-summary">新闻摘要 2</p>
      <span class="news-time">2024-10-02</span>
    </div>
  </div>
`;

const newsRegex = /<h3 class="news-title"><a href="(.*?)">(.*?)<\/a><\/h3>/g;
const newsList = [];
let newsMatch;

while ((newsMatch = newsRegex.exec(newsHtml)) !== null) {
  newsList.push({
    link: newsMatch[1],
    title: newsMatch[2]
  });
}

console.log('Extracted news list:', newsList);

Output:

Extracted news list: [
  { link: '/news/1', title: '新闻标题 1' },
  { link: '/news/2', title: '新闻标题 2' }
]
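
The `exec` loop above can also be written with `String.prototype.matchAll` and named capture groups, which avoids the mutable loop variable. This is a minimal sketch on a shortened, hypothetical copy of the news markup (`sampleNewsHtml`, `newsPattern`, and `items` are names introduced for illustration); `matchAll` requires the `g` flag and a reasonably recent runtime such as Node.js 12 or later.

```javascript
// Shortened stand-in for the news list, with invented titles.
const sampleNewsHtml = `
  <h3 class="news-title"><a href="/news/1">News title 1</a></h3>
  <h3 class="news-title"><a href="/news/2">News title 2</a></h3>
`;

// Named groups replace the positional indexes 1 and 2.
const newsPattern =
  /<h3 class="news-title"><a href="(?<link>.*?)">(?<title>.*?)<\/a><\/h3>/g;

// matchAll yields one match object per occurrence; Array.from maps each to a plain object.
const items = Array.from(sampleNewsHtml.matchAll(newsPattern), (m) => ({
  link: m.groups.link,
  title: m.groups.title,
}));

console.log(items);
```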

2.3 Extracting image links from HTML

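// Named imagesHtml so it does not clash with the html constant declared in 2.1.1
const imagesHtml = `
  <div class="images">
    <img src="image1.jpg" alt="图片 1">
    <img src="image2.png" alt="图片 2">
  </div>
`;

// The g flag lets exec() walk through every <img> tag in turn
const imgRegex = /<img src="(.*?)" alt=".*?">/g;
const images = [];
let imgMatch;

while ((imgMatch = imgRegex.exec(imagesHtml)) !== null) {
  images.push(imgMatch[1]);
}

console.log('Extracted image links:', images); // Output: Extracted image links: [ 'image1.jpg', 'image2.png' ]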
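
The image pattern above expects `src` to come first and `alt` to follow immediately. The sketch below relaxes that assumption so `src` can appear anywhere in the tag, while the leading `\s` keeps attributes such as `data-src` from matching. `imgSrcPattern`, `extractImageSources`, and `galleryHtml` are hypothetical names and data, and the pattern assumes double-quoted attribute values.

```javascript
// Assumed variation: src may appear anywhere inside the tag.
// \s before src keeps attributes such as data-src from matching.
const imgSrcPattern = /<img\b[^>]*?\ssrc\s*=\s*"([^"]*)"[^>]*>/gi;

// Returns the src value of every <img> tag found in the given HTML string.
function extractImageSources(html) {
  return Array.from(html.matchAll(imgSrcPattern), (m) => m[1]);
}

const galleryHtml = `
  <img alt="Logo" src="logo.svg">
  <img src="photo.jpg" alt="Photo" width="200">
  <img class="lazy" data-src="placeholder.gif" src="real.jpg">
`;

console.log(extractImageSources(galleryHtml));
```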

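3. Matching JSON with Regular Expressions

JSON has a comparatively simple structure, but sometimes we only need a specific field from a JSON string, especially one buried in a nested structure.

3.1 Extracting a specific field from JSON

Suppose we have a JSON string: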

{
  "user": {
    "name": "张三",
    "profile": {
      "age": 25,
      "email": "zhangsan@mail.com"
    }
  },
  "posts": [
    {
      "id": 1,
      "title": "我的第一篇博客"
    },
    {
      "id": 2,
      "title": "爬虫入门教程"
    }
  ]
}

3.1.1 Extracting the user's email

const jsonStr = `
  {
    "user": {
      "name": "张三",
      "profile": {
        "age": 25,
        "email": "zhangsan@mail.com"
      }
    },
    "posts": [
      {
        "id": 1,
        "title": "我的第一篇博客"
      },
      {
        "id": 2,
        "title": "爬虫入门教程"
      }
    ]
  }
`;

const emailRegex = /"email"\s*:\s*"([^"]+)"/;
const emailMatch = jsonStr.match(emailRegex);
if (emailMatch) {
  console.log('Extracted email:', emailMatch[1]); // Output: Extracted email: zhangsan@mail.com
}

3.1.2 Extracting all blog post titles

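// Named postTitleRegex so it does not clash with the titleRegex constant declared in 2.1.1
const postTitleRegex = /"title"\s*:\s*"([^"]+)"/g;
const titles = [];
let titleMatch;

while ((titleMatch = postTitleRegex.exec(jsonStr)) !== null) {
  titles.push(titleMatch[1]);
}

console.log('Extracted post titles:', titles); // Output: Extracted post titles: [ '我的第一篇博客', '爬虫入门教程' ]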
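
One limitation of the `[^"]+` capture used above is that it stops at the first double quote, including an escaped one inside the value. The sketch below shows the problem on a made-up post title and two ways around it: an escape-aware pattern, and parsing the whole document with `JSON.parse` when the text is valid JSON. All names and data here are assumptions for illustration.

```javascript
// Build the JSON text with JSON.stringify so the escaping is guaranteed valid.
const jsonText = JSON.stringify({
  posts: [{ id: 1, title: 'Regex "basics"' }],
});

// The simple pattern stops at the first quote, even an escaped one,
// so the capture is cut short and ends with a stray backslash.
const naive = jsonText.match(/"title"\s*:\s*"([^"]+)"/)[1];

// Escape-aware pattern: any run of ordinary characters or backslash escapes.
// The capture is still raw JSON string content, so JSON.parse decodes it.
const raw = jsonText.match(/"title"\s*:\s*"((?:[^"\\]|\\.)*)"/)[1];
const decoded = JSON.parse(`"${raw}"`);

// Parsing the whole document avoids the problem entirely.
const parsed = JSON.parse(jsonText).posts.map((post) => post.title);

console.log(naive);
console.log(decoded);
console.log(parsed);
```

When the complete response is valid JSON, the `JSON.parse` route is usually the simpler choice; the regex route is mainly useful when only a fragment of JSON is available, for example inside a larger page.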

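4. Greedy and Non-Greedy Matching

Greedy and non-greedy (lazy) quantifiers determine how much text a match covers. A greedy quantifier matches as many characters as possible, while a non-greedy one matches as few as possible.

4.1 Greedy mode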

const greedyHtml = '<div class="content"><p>段落 1</p><p>段落 2</p></div>';
// .* is greedy: it runs on to the last </p> in the string
const greedyRegex = /<p>(.*)<\/p>/;
const greedyMatch = greedyHtml.match(greedyRegex);
if (greedyMatch) {
  console.log('Greedy match:', greedyMatch[1]);
  // Output: Greedy match: 段落 1</p><p>段落 2
}

4.2 Non-greedy mode

const lazyHtml = '<div class="content"><p>段落 1</p><p>段落 2</p></div>';
// .*? is non-greedy: it stops at the first </p> it can reach
const nonGreedyRegex = /<p>(.*?)<\/p>/;
const lazyMatch = lazyHtml.match(nonGreedyRegex);
if (lazyMatch) {
  console.log('Non-greedy match:', lazyMatch[1]);
  // Output: Non-greedy match: 段落 1
}

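In practice, the non-greedy form is usually the better fit for extracting a specific element, because it stops at the first possible closing delimiter instead of running on to the last one, which avoids over-matching.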
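
To see the difference side by side, the sketch below runs a greedy, a non-greedy, and a negated-character-class pattern against the same hypothetical string. The negated class `[^<]*` is a common alternative to `.*?` when the content cannot contain nested tags; the variable names and sample data are assumptions for illustration.

```javascript
const sample = '<p>one</p><p>two</p>';

// Greedy: .* expands to the last </p>, swallowing the tags in between.
const greedy = sample.match(/<p>(.*)<\/p>/)[1];

// Non-greedy: .*? stops at the first </p>.
const lazy = sample.match(/<p>(.*?)<\/p>/)[1];

// Negated class: cannot cross a < at all, so it also stops at the first </p>.
const negated = sample.match(/<p>([^<]*)<\/p>/)[1];

// With the g flag, the non-greedy pattern yields each paragraph separately.
const allParagraphs = Array.from(sample.matchAll(/<p>(.*?)<\/p>/g), (m) => m[1]);

console.log(greedy);        // one</p><p>two
console.log(lazy);          // one
console.log(negated);       // one
console.log(allParagraphs); // [ 'one', 'two' ]
```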

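5. Summary

Regular expressions are a practical tool in crawler development for extracting information from HTML and JSON data. Choosing between greedy and non-greedy matching gives precise control over how much text a pattern captures and makes extraction more accurate.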