Web Crawler Study Summary

Over the past few classes, we covered the basics of web crawling.

Below is my summary of what I learned:

1. Getting to Know Crawlers: Starting the Data-Scraping Journey

1.1 What Is a Web Crawler

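A web crawler is like a tireless "data porter": it automatically fetches information from the internet according to predefined rules. A search engine's crawler, for example, fetches page content to supply the data behind user searches, while an e-commerce crawler can collect product prices, reviews, and similar information. A crawler sends requests to a web server, receives data in formats such as HTML or JSON, and then parses it to extract the useful parts.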
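The "parse and extract" half of that workflow can be sketched without touching the network. The snippet below is a minimal illustration that uses only the standard library; the sample HTML and JSON strings, the helper names `extract_title` and `extract_prices`, and the JSON layout (an `items` list with `price` fields) are all assumptions made up for the example, standing in for whatever a real server would return.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html_text):
    """Parse an HTML string and return its page title."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()


def extract_prices(json_text):
    """Parse a JSON string and return the price of every item (assumed layout)."""
    payload = json.loads(json_text)
    return [item["price"] for item in payload["items"]]


# Hypothetical responses standing in for what a server might send back
SAMPLE_HTML = "<html><head><title> Demo Shop </title></head><body><p>Hi</p></body></html>"
SAMPLE_JSON = '{"items": [{"name": "pen", "price": 1.5}, {"name": "book", "price": 12.0}]}'

print(extract_title(SAMPLE_HTML))
print(extract_prices(SAMPLE_JSON))
```

In a real crawler the two sample strings would be replaced by the body of an HTTP response.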

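1.2 Types of Crawlers

General-purpose crawlers: crawl web pages broadly, like the search engine crawlers of Baidu and Google, to build large web index databases.

Focused crawlers: concentrate on a specific topic or domain, for example fetching only the sports stories from news sites.

Incremental crawlers: fetch only updated or newly added data, and are often used to monitor changes in a site's content.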
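One common way to make a crawler incremental is to remember a fingerprint of each page and skip pages whose fingerprint has not changed. The sketch below assumes the page bodies have already been downloaded and are passed in as a dictionary; the function names and the in-memory `seen` store are illustrative choices, and a real crawler would likely persist the fingerprints to disk or a database.

```python
import hashlib


def fingerprint(body):
    """Return a stable hash of the page body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def pages_to_process(pages, seen):
    """Return the URLs that are new or whose content changed.

    pages: dict mapping URL -> page body fetched in this run
    seen:  dict mapping URL -> fingerprint from earlier runs (updated in place)
    """
    changed = []
    for url, body in pages.items():
        digest = fingerprint(body)
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed


seen = {}
first_run = pages_to_process(
    {"https://www.example.com/a": "v1", "https://www.example.com/b": "v1"}, seen
)
second_run = pages_to_process(
    {
        "https://www.example.com/a": "v1",  # unchanged, skipped
        "https://www.example.com/b": "v2",  # updated
        "https://www.example.com/c": "v1",  # newly added
    },
    seen,
)
print(first_run)
print(second_run)
```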

2. Setting Up the Environment: Getting Ready to Run Crawlers

2.1 Choosing a Programming Language

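Python is a popular choice for learning to write crawlers because of its concise syntax and rich ecosystem of third-party libraries. Before starting, install a Python environment: download the installer for your operating system from the official Python website (python.org), and on Windows remember to tick "Add Python to PATH" during installation.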

2.2 Installing the Required Libraries

requests: sends HTTP requests. Install it by running pip install requests on the command line. For example:


import requests
response = requests.get('https://www.example.com')
print(response.text)

BeautifulSoup: parses HTML and XML documents. Install it with pip install beautifulsoup4. For example:


from bs4 import BeautifulSoup
import requests
response = requests.get('https://www.example.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)

lxml: a fast parser that supports XPath syntax. Install it with pip install lxml. For example, to extract data with XPath:


from lxml import etree
import requests
response = requests.get('https://www.example.com')
html = etree.HTML(response.text)
data = html.xpath('//div[@class="content"]/text()')
print(data)

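3. Core Crawler Techniques: Mastering the Keys to Data Scraping

3.1 HTTP Protocol Basics

HTTP is the "language" a crawler uses to talk to a web server. You should know the common request methods, such as GET (retrieve a resource) and POST (submit data), and understand what request header fields such as User-Agent (identifies the client) and Cookie (maintains session state) do. For example, to make a request look like it comes from a browser, set the request headers like this: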


import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get('https://www.example.com', headers=headers)
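The GET example covers User-Agent; a POST request with a Cookie header can be sketched in a similar way. To keep the snippet runnable offline, it uses the standard library's `urllib.request.Request` to build the request without sending it. The login URL, the form fields, the shortened User-Agent string, and the cookie value are all made up for illustration. With requests, the roughly equivalent call would be `requests.post(url, data=form, headers=headers)`.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, form, cookie):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode(form).encode("utf-8")  # e.g. b"user=alice&page=1"
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",  # identifies the client
        "Cookie": cookie,  # carries session state from an earlier response
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical endpoint and values, for illustration only
req = build_post_request(
    "https://www.example.com/login",
    {"user": "alice", "page": 1},
    "sessionid=abc123",
)
print(req.get_method())
print(req.data)
print(req.get_header("Cookie"))
```

Passing the request to `urllib.request.urlopen(req)` would actually send it; that step is left out here so the example does not depend on a live server.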

3.2 Data Parsing Techniques

Regular expressions: extract data by writing matching rules. For example, to extract all links from a page:


import re
import requests
response = requests.get('https://www.example.com')
links = re.findall(r'href="(.*?)"', response.text)
print(links)
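The raw href values returned by the regular expression are often relative paths, page anchors, or duplicates. As a rough follow-up sketch, the helper below (a hypothetical name, `absolute_links`) resolves each href against the page URL with `urllib.parse.urljoin`, skips anchors and non-HTTP schemes, and removes duplicates while keeping order. The sample HTML is invented, and the pattern still only matches double-quoted href attributes, as in the example above.

```python
import re
from urllib.parse import urljoin

HREF_PATTERN = re.compile(r'href="(.*?)"')


def absolute_links(base_url, html_text):
    """Return unique absolute URLs found in href attributes, in page order."""
    seen = set()
    result = []
    for href in HREF_PATTERN.findall(html_text):
        # Skip in-page anchors and links that are not web pages
        if href.startswith(("#", "javascript:", "mailto:")):
            continue
        url = urljoin(base_url, href)  # resolves relative paths against the page URL
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


SAMPLE = (
    '<a href="/news/1.html">A</a>'
    '<a href="news/2.html">B</a>'
    '<a href="#top">Top</a>'
    '<a href="https://other.example.org/x">C</a>'
    '<a href="/news/1.html">A again</a>'
)
links = absolute_links("https://www.example.com/index/", SAMPLE)
for link in links:
    print(link)
```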

XPath: locates data using the XML Path Language. For example, to extract the text of every paragraph on a page:


from lxml import etree
import requests
response = requests.get('https://www.example.com')
html = etree.HTML(response.text)
paragraphs = html.xpath('//p/text()')
print(paragraphs)
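To experiment with path expressions and attribute predicates offline, the standard library's `xml.etree.ElementTree` can stand in for lxml on small, well-formed samples. This is only a sketch under two assumptions: ElementTree implements a limited subset of XPath (for instance, it has no `text()` step, so the text is read from each element's `.text` attribute instead), and it requires well-formed markup, whereas `etree.HTML` in lxml tolerates messy real-world HTML. The sample document is invented.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<html>
  <body>
    <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
    <div class="sidebar"><p>Ad text.</p></div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# ".//p" matches every <p> at any depth, similar to "//p" in full XPath
all_paragraphs = [p.text for p in root.findall(".//p")]

# An attribute predicate narrows the match to one container
content_paragraphs = [p.text for p in root.findall(".//div[@class='content']/p")]

print(all_paragraphs)
print(content_paragraphs)
```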

CSS selectors: like XPath, they select HTML elements. The following uses BeautifulSoup with a CSS selector to extract the text inside every div whose class is "article":


from bs4 import BeautifulSoup
import requests
response = requests.get('https://www.example.com')
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.select('div.article')
for item in data:
    print(item.text)

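4. Crawlers in Practice: From Theory to Practice

4.1 Scraping a Simple Static Page

Take scraping the chapter titles from a novel site as an example:

Analyze the page structure to find the HTML tags and attributes that hold the titles.
Send the request with requests and parse the response with BeautifulSoup.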


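import requests
from bs4 import BeautifulSoup

url = 'https://www.example-novel.com'
# A timeout keeps the script from hanging if the server never responds
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')
# '.chapter-title' is an example selector; use the one found when analyzing the page
titles = soup.select('.chapter-title')
for title in titles:
    print(title.text.strip())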

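4.2 Scraping Dynamic Pages

For pages that load data dynamically with JavaScript, you can use the Selenium library together with a browser driver such as ChromeDriver. For example, to scrape a page whose images are loaded dynamically:

Install Selenium with pip install selenium, download the ChromeDriver version that matches your browser, and add its location to the system PATH environment variable. (Selenium 4.6 and later can usually find or download a matching driver automatically through Selenium Manager.)
Write the code: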


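from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
url = 'https://www.example-image.com'
try:
    driver.get(url)
    time.sleep(3)  # wait for the page to finish loading
    # Selenium 4 removed find_elements_by_css_selector; use find_elements with By
    images = driver.find_elements(By.CSS_SELECTOR, 'img')
    for img in images:
        print(img.get_attribute('src'))
finally:
    driver.quit()  # always close the browser, even if an error occurs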

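5. Facing the Challenges: Solving Common Crawler Problems

5.1 Handling Anti-Crawling Measures

IP bans: use a pool of proxy IPs and rotate addresses regularly. You can get IPs from a proxy service provider or use an open-source proxy-harvesting tool.

CAPTCHAs: simple ones can be recognized with OCR; for complex ones, a third-party CAPTCHA-solving platform such as Chaojiying (超级鹰) can be used.

Request rate limits: set a reasonable interval between requests and avoid sending many requests in a short time.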
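Two of these measures, proxy rotation and request pacing, can be sketched with the standard library alone. The `ProxyRotator` class and `polite_delay` function below are hypothetical helpers, and the proxy addresses are placeholders from a documentation-only IP range. The dictionary returned by `next_proxies` follows the shape that requests accepts for its `proxies=` argument. The `sleep` and `rng` parameters exist only so the delay can be exercised without actually waiting.

```python
import itertools
import random
import time


class ProxyRotator:
    """Hands out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("proxy pool is empty")
        self._cycle = itertools.cycle(proxy_urls)

    def next_proxies(self):
        """Return a mapping usable as requests.get(..., proxies=...)."""
        proxy = next(self._cycle)
        return {"http": proxy, "https": proxy}


def polite_delay(base_seconds, jitter_seconds, sleep=time.sleep, rng=random.random):
    """Pause for base_seconds plus a random extra of up to jitter_seconds."""
    delay = base_seconds + jitter_seconds * rng()
    sleep(delay)
    return delay


# Placeholder proxy addresses; a real pool would come from a proxy provider
rotator = ProxyRotator(["http://192.0.2.10:8080", "http://192.0.2.11:8080"])

for page in range(1, 4):
    proxies = rotator.next_proxies()
    # The no-op sleep keeps this demo instant; drop the argument to really wait
    waited = polite_delay(1.0, 0.5, sleep=lambda seconds: None)
    print(page, proxies["http"], round(waited, 2))
```

Adding random jitter rather than a fixed interval is a design choice that makes the request pattern less regular; whether it is needed depends on the target site.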

5.2 Data Storage

Store the scraped data in a file or a database:

Saving to a CSV file:


import csv
data = [['Title 1', 'Content 1'], ['Title 2', 'Content 2']]
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)
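Scraped records are often dictionaries rather than plain lists, and a header row makes the file easier to read back later. The sketch below shows one possible variant with `csv.DictWriter` and `csv.DictReader`. It writes to an in-memory `io.StringIO` buffer so it leaves no file behind; the field names, the sample rows, and the `rows_to_csv` helper are assumptions for illustration.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")  # newline="" lets the csv module control line endings
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"title": "Title 1", "content": "Content 1"},
    {"title": "Title 2", "content": "Content 2, with a comma"},
]
csv_text = rows_to_csv(rows, ["title", "content"])
print(csv_text)

# Reading the text back yields the same records; the quoted comma survives
parsed = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(parsed[1]["content"])
```

To write to disk instead, the same writer calls work on a file opened with `open('data.csv', 'w', newline='', encoding='utf-8')`.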

Saving to a MySQL database: first install the mysql-connector-python library (pip install mysql-connector-python), then connect to the database and insert the data:


import mysql.connector

mydb = mysql.connector.connect(
    host="localhost",
    user="yourusername",
    password="yourpassword",
    database="yourdatabase"
)

mycursor = mydb.cursor()
sql = "INSERT INTO yourtable (title, content) VALUES (%s, %s)"
val = [('Title 1', 'Content 1'), ('Title 2', 'Content 2')]
mycursor.executemany(sql, val)
mydb.commit()
mycursor.close()
mydb.close()
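The same parameterized `executemany` pattern can be tried without a MySQL server by swapping in SQLite, which ships with Python as the `sqlite3` module. This is a substitute for experimentation, not the MySQL setup above: SQLite uses `?` placeholders where mysql-connector-python uses `%s`, and `INSERT OR REPLACE` is SQLite-specific syntax. The `articles` table, its columns, and the `save_articles` helper are assumptions for the example. Using the title as primary key is one simple way to avoid duplicate rows when the crawler runs again.

```python
import sqlite3


def save_articles(conn, rows):
    """Insert (title, content) pairs, replacing rows that share a title."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (title TEXT PRIMARY KEY, content TEXT)"
    )
    # Placeholders keep scraped text from being interpreted as SQL
    conn.executemany(
        "INSERT OR REPLACE INTO articles (title, content) VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
save_articles(conn, [("Title 1", "Content 1"), ("Title 2", "Content 2")])
# A second run with an updated page overwrites the old row instead of duplicating it
save_articles(conn, [("Title 2", "Content 2 (updated)")])

stored = conn.execute("SELECT title, content FROM articles ORDER BY title").fetchall()
print(stored)
conn.close()
```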