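Python Stage 2: Introduction to Web Scraping
🎯 Today's Goals
- Learn to parse HTML page content with BeautifulSoup
- Master the common tag selectors and ways to extract attributes
- Lay the groundwork for extracting structured data in later scraping work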
📘 Lesson Content in Detail
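📦 Installing BeautifulSoup and Related Libraries
```bash
pip install beautifulsoup4
pip install lxml
```
`lxml` is a fast parser. You can also use `html.parser`, which ships with Python and needs no extra installation.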
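Since either parser works, a script can pick one at runtime. The sketch below is one possible approach, not something the lesson requires: the helper name `make_soup` is made up for this illustration, and it relies on BeautifulSoup raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    """Parse with lxml when it is installed, otherwise use html.parser.

    `make_soup` is a hypothetical helper name used only in this example.
    """
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the standard-library parser needs no extra package.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello, parser!</p>")
print(soup.p.text)
```

The two parsers can build slightly different trees for incomplete HTML (for example, `lxml` wraps a fragment in `<html>` and `<body>`), so it is usually safer to settle on one parser per project.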
🍜 Basic Usage
```python
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')  # 'html.parser' also works
```
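🔍 Common Ways to Find Tags
| Operation | Example | Notes |
|---|---|---|
| Get the title text | `soup.title.text` | Text of the `<title>` tag |
| Get the first matching tag | `soup.p` | The first `<p>` tag |
| Get all tags of one type | `soup.find_all("p")` | Returns a list |
| Find tags by class | `soup.find_all(class_="info")` | `class` is a Python keyword, so the argument is named `class_` |
| Find a tag by id | `soup.find(id="main")` | Returns the first match, or `None` if nothing matches |
| Use a CSS selector | `soup.select("div > p.info")` | Returns a list |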
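To see these lookups side by side, here is a minimal sketch that runs each of them against a small inline document. The HTML is a made-up sample for illustration only, and `html.parser` is used so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, written only for this illustration.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="main">
      <p class="info">First paragraph</p>
      <p>Second paragraph</p>
    </div>
    <p class="info">Outside the div</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.text               # text of the <title> tag
first_p = soup.p                           # first <p> in document order
all_p = soup.find_all("p")                 # every <p> tag
info_tags = soup.find_all(class_="info")   # class_ because class is a keyword
main_div = soup.find(id="main")            # a single tag, or None
nested_info = soup.select("div > p.info")  # only <p class="info"> directly inside a <div>

print(title_text)
print(first_p.text)
print(len(all_p), len(info_tags), len(nested_info))
print(main_div.name)
```

Note that `find_all(class_="info")` matches both `info` paragraphs, while the selector `div > p.info` only matches the one that is a direct child of a `<div>`.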
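🧲 Getting Attributes and Text
```python
tag = soup.find("a")
print(tag['href'])      # get an attribute (raises KeyError if it is missing)
print(tag.get('href'))  # same value, but returns None if it is missing
print(tag.text)         # get the text content
```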
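The difference between the two attribute lookups only shows up when the attribute is missing. The following sketch illustrates that with two made-up `<a>` tags (the HTML is an assumption for demonstration, not taken from any real page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link has an href, the second does not.
html = '<a href="/page/2/" class="next btn">Next <span>page</span></a><a>No link</a>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

href_by_index = first["href"]          # "/page/2/"
href_by_get = first.get("href")        # same value
missing = second.get("href")           # None, no exception
fallback = second.get("href", "#")     # a default can be supplied

try:
    second["href"]                     # square brackets raise on a missing attribute
except KeyError:
    missing_raises = True
else:
    missing_raises = False

classes = first["class"]               # multi-valued attribute, returned as a list
link_text = first.text                 # text of the tag and all its children

print(href_by_index, href_by_get, missing, fallback)
print(missing_raises, classes, link_text)
```

When scraping real pages, where some tags may lack an attribute, `tag.get(...)` is usually the more forgiving choice.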
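💡 Today's Practice Task
1. Use requests to fetch the page https://quotes.toscrape.com/
2. Use BeautifulSoup to complete the following:
   - Extract the text of every quote (inside `<span class="text">`)
   - Extract the author name (`<small class="author">`)
   - Extract the tags of each quote (class="tag")
3. Output each quote as a dictionary:
```python
{ "quote": "...", "author": "...", "tags": ["tag1", "tag2"] }
```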
Practice script:
```python
# quotes_scraper.py
import requests
from bs4 import BeautifulSoup


def fetch_quotes():
    """Fetch the first page of quotes and return a list of dictionaries."""
    url = "https://quotes.toscrape.com/"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        print(f"Request failed, status code: {response.status_code}")
        return []  # empty list, so the caller can still iterate safely

    soup = BeautifulSoup(response.text, "lxml")

    # Find every quote block
    quote_blocks = soup.find_all("div", class_="quote")
    quotes_data = []
    for block in quote_blocks:
        quote_text = block.find("span", class_="text").text.strip()
        author = block.find("small", class_="author").text.strip()
        tags = [tag.text.strip() for tag in block.find_all("a", class_="tag")]
        quotes_data.append({
            "quote": quote_text,
            "author": author,
            "tags": tags,
        })
    return quotes_data


if __name__ == "__main__":
    data = fetch_quotes()
    for i, quote in enumerate(data, start=1):
        print(f"\nQuote {i}:")
        print(f"Text: {quote['quote']}")
        print(f"Author: {quote['author']}")
        print(f"Tags: {', '.join(quote['tags'])}")
```
The output is:
```text
Quote 1:
Text: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world

Quote 2:
Text: "It is our choices, Harry, that show what we truly are, far more than our abilities."
Author: J.K. Rowling
Tags: abilities, choices

Quote 3:
Text: "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles

Quote 4:
Text: "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
Author: Jane Austen
Tags: aliteracy, books, classic, humor

Quote 5:
Text: "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
Author: Marilyn Monroe
Tags: be-yourself, inspirational

Quote 6:
Text: "Try not to become a man of success. Rather become a man of value."
Author: Albert Einstein
Tags: adulthood, success, value

Quote 7:
Text: "It is better to be hated for what you are than to be loved for what you are not."
Author: André Gide
Tags: life, love

Quote 8:
Text: "I have not failed. I've just found 10,000 ways that won't work."
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased

Quote 9:
Text: "A woman is like a tea bag; you never know how strong it is until it's in hot water."
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt

Quote 10:
Text: "A day without sunshine is like, you know, night."
Author: Steve Martin
Tags: humor, obvious, simile
```
📎 Tips
- For complex pages, inspect the structure first with the browser's developer tools (F12).
- Try using CSS selectors through select(): soup.select('.quote span.text')
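As a rough illustration of the second tip, the sketch below applies `select()` and `select_one()` to a small inline document. The HTML is a simplified, hand-written imitation of a quote block (the real page contains more markup and different text), so treat the structure as an assumption.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical imitation of two quote blocks.
html = """
<div class="quote">
  <span class="text">Quote one</span>
  <span>by <small class="author">Author A</small></span>
  <div class="tags"><a class="tag" href="/tag/x/">x</a><a class="tag" href="/tag/y/">y</a></div>
</div>
<div class="quote">
  <span class="text">Quote two</span>
  <span>by <small class="author">Author B</small></span>
  <div class="tags"><a class="tag" href="/tag/z/">z</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One selector collects every quote text on the page.
texts = [span.get_text(strip=True) for span in soup.select(".quote span.text")]

# Selecting per block keeps each quote together with its own author and tags.
records = []
for block in soup.select("div.quote"):
    records.append({
        "quote": block.select_one("span.text").get_text(strip=True),
        "author": block.select_one("small.author").get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in block.select("a.tag")],
    })

print(texts)
print(records)
```

`select_one()` returns the first match or `None`, which makes it a CSS-selector counterpart to `find()`.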
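📝 Today's Summary
- Gained a first working knowledge of how BeautifulSoup parses pages and how to use it
- Can extract text and attributes, and locate elements by class and id
- Can perform basic "structured data extraction", which prepares the way for storing scraped data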