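Python Web Scraping Basics Day 2 - Introduction to HTML Parsing (with BeautifulSoup)

Python Stage 2 - Web Scraping Basics

🎯 Today's Goals

Learn to parse HTML page content with BeautifulSoup
Get comfortable with the common tag selectors and attribute extraction methods
Lay the groundwork for extracting structured data in later scraping work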

📘 Lesson Details

📦 Installing BeautifulSoup and Related Libraries
```
pip install beautifulsoup4
pip install lxml
```
`lxml` is a fast third-party parser; you can also use `html.parser`, which ships with Python and needs no extra install.
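
If you are not sure whether `lxml` is installed, one option is to fall back to the built-in parser. The snippet below is a minimal sketch rather than part of the original lesson: `make_soup` is a hypothetical helper name, and the approach assumes BeautifulSoup raises `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when available, otherwise with html.parser.

    make_soup is a hypothetical helper for illustration; bs4 does not provide it.
    """
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the parser bundled with Python
        return BeautifulSoup(html, "html.parser")


# Made-up HTML, used only to show that parsing works with either parser
soup = make_soup("<html><head><title>Demo</title></head><body><p>Hi</p></body></html>")
print(soup.title.text)
```

The two parsers can build slightly different trees for malformed HTML, so it is usually safer to pick one and use it consistently within a project.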

🍜 Basic Usage

```python
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
html = requests.get(url).text  # download the page and keep its HTML as a string

soup = BeautifulSoup(html, 'lxml')  # 'html.parser' also works here
```

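🔍 Common Ways to Find Tags

| Operation | Example | Notes |
| --- | --- | --- |
| Get the page title | `soup.title.text` | Text inside the `<title>` tag |
| Get the first tag of a kind | `soup.p` | The first `<p>` tag |
| Get all tags of a kind | `soup.find_all("p")` | Returns a list |
| Find tags by class | `soup.find_all(class_="info")` | `class` is a Python keyword, so the argument is `class_` |
| Find a tag by id | `soup.find(id="main")` | Returns the first match, or `None` if there is none |
| Use a CSS selector | `soup.select("div > p.info")` | Returns a list |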
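
To see these lookups side by side, here is a small sketch that runs each of them against a hand-written HTML string. The HTML is made up for illustration, and `html.parser` is used so the example needs nothing beyond `beautifulsoup4`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="main">
      <p class="info">first</p>
      <p>second</p>
    </div>
    <p class="info">third</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.text                                  # text of <title>
first_p = soup.p.text                                    # first <p> in the document
all_p = [p.text for p in soup.find_all("p")]             # every <p>
info = [p.text for p in soup.find_all(class_="info")]    # tags with class="info"
main = soup.find(id="main")                              # the tag with id="main"
nested = [p.text for p in soup.select("div > p.info")]   # p.info directly inside a div
missing = soup.find(id="sidebar")                        # no such id, so None

print(title, first_p, all_p, info, main.name, nested, missing)
```

Note that `find()` gives back a single tag or `None`, while `find_all()` and `select()` always give back a list, which may be empty.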

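🧲 Getting Attributes and Text

```python
tag = soup.find("a")
print(tag['href'])      # read an attribute (raises KeyError if it is missing)
print(tag.get('href'))  # same value, but returns None if the attribute is missing
print(tag.text)         # text content of the tag
```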
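
The two ways of reading an attribute behave differently when the attribute is absent. The sketch below shows this on a made-up HTML fragment, along with `get_text(strip=True)` for trimming whitespace.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second <a> has no href attribute
html = '<p><a class="nav" href="/page/2/">  Next page </a><a>no link</a></p>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

href = first["href"]                  # attribute exists, so indexing is fine
safe = second.get("href")             # attribute missing, .get() returns None
fallback = second.get("href", "#")    # .get() also accepts a default value
raw = first.text                      # text with surrounding whitespace kept
clean = first.get_text(strip=True)    # text with whitespace stripped

try:
    second["href"]                    # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(href, safe, fallback, repr(raw), clean, raised)
```

When scraping real pages, where some links may lack an `href`, `.get()` is usually the more forgiving choice.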

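💡 Today's Exercise

1. Fetch the page https://quotes.toscrape.com/ with requests.

2. Use BeautifulSoup to complete the following tasks:

Extract the text of every quote (inside `<span class="text">`)
Extract the author's name (`<small class="author">`)
Extract the tags of each quote (class="tag")

3. Output the information for each quote as a dictionary: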

```python
{
  "quote": "...",
  "author": "...",
  "tags": ["tag1", "tag2"]
}
```

Exercise script:

```python
# quotes_scraper.py

import requests
from bs4 import BeautifulSoup

def fetch_quotes():
    url = "https://quotes.toscrape.com/"
    headers = {
        'User-Agent': 'Mozilla/5.0'
    }

    # A timeout keeps the script from hanging if the server does not respond
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        print(f"Request failed, status code: {response.status_code}")
        return []  # empty list, so the caller can still iterate safely

    soup = BeautifulSoup(response.text, "lxml")

    # Each quote sits in its own <div class="quote"> block
    quote_blocks = soup.find_all("div", class_="quote")

    quotes_data = []

    for block in quote_blocks:
        quote_text = block.find("span", class_="text").text.strip()
        author = block.find("small", class_="author").text.strip()
        tags = [tag.text.strip() for tag in block.find_all("a", class_="tag")]

        quotes_data.append({
            "quote": quote_text,
            "author": author,
            "tags": tags
        })

    return quotes_data

if __name__ == "__main__":
    data = fetch_quotes()
    for i, quote in enumerate(data, start=1):
        print(f"\nQuote {i}:")
        print(f"Text: {quote['quote']}")
        print(f"Author: {quote['author']}")
        print(f"Tags: {', '.join(quote['tags'])}")
```

The output is:

```

Quote 1:
Text: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world

Quote 2:
Text: "It is our choices, Harry, that show what we truly are, far more than our abilities."
Author: J.K. Rowling
Tags: abilities, choices

Quote 3:
Text: "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles

Quote 4:
Text: "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
Author: Jane Austen
Tags: aliteracy, books, classic, humor

Quote 5:
Text: "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
Author: Marilyn Monroe
Tags: be-yourself, inspirational

Quote 6:
Text: "Try not to become a man of success. Rather become a man of value."
Author: Albert Einstein
Tags: adulthood, success, value

Quote 7:
Text: "It is better to be hated for what you are than to be loved for what you are not."
Author: André Gide
Tags: life, love

Quote 8:
Text: "I have not failed. I've just found 10,000 ways that won't work."
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased

Quote 9:
Text: "A woman is like a tea bag; you never know how strong it is until it's in hot water."
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt

Quote 10:
Text: "A day without sunshine is like, you know, night."
Author: Steve Martin
Tags: humor, obvious, simile
```
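
The exercise script assumes every quote block contains both a text and an author element. If one is missing, `find()` returns `None` and the `.text` access raises `AttributeError`. A more defensive variant could look like the sketch below; `text_or_default` is a hypothetical helper name, and the HTML is made up so the example runs without network access.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_name, default=""):
    """Return the stripped text of the first match, or default if absent.

    Hypothetical helper for illustration; it is not part of bs4.
    """
    found = parent.find(name, class_=class_name)
    return found.get_text(strip=True) if found is not None else default


# Made-up block in which the author element is deliberately missing
html = '<div class="quote"><span class="text">Sample quote.</span></div>'
block = BeautifulSoup(html, "html.parser").find("div", class_="quote")

quote_text = text_or_default(block, "span", "text")
author = text_or_default(block, "small", "author", default="unknown")

print(quote_text, "/", author)
```

Whether to substitute a default or skip the incomplete block entirely depends on how the data will be used later.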

📎 Tips

For complex pages, inspect the structure first with the browser's developer tools (F12).
Try using CSS selectors through `select()`, for example: `soup.select('.quote span.text')`
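
As a rough illustration of the `select()` tip, the same extraction can be written with CSS selectors only. This is a sketch under assumptions: the HTML below is hand-written to imitate one quote block, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating a single quote block (an assumption, not the live page)
html = """
<div class="quote">
  <span class="text">Sample quote.</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/one/">one</a>
    <a class="tag" href="/tag/two/">two</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

quotes = []
for block in soup.select("div.quote"):
    quotes.append({
        # select_one() returns the first match, like find()
        "quote": block.select_one("span.text").get_text(strip=True),
        "author": block.select_one("small.author").get_text(strip=True),
        # select() returns every match, like find_all()
        "tags": [a.get_text(strip=True) for a in block.select("a.tag")],
    })

print(quotes)
```

Selectors copied from the browser's developer tools can often be pasted into `select()` with little change, which is why the two tips work well together.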

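📝 Today's Summary

Got a first working grasp of how BeautifulSoup parses HTML and how to use it
Can extract text and attributes, and look up elements by class and id
Can do basic structured data extraction, which paves the way for storing scraped data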