Python Web Scraping from Basics to Practice: A Complete Walkthrough with Project Examples

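I. Introduction

In today's highly digitized world, data has become a key asset. The internet holds a vast amount of information, and extracting data from web pages efficiently and automatically has become a real need in many industries. With its concise syntax and rich ecosystem, Python is a popular first choice for writing web crawlers.

This article takes you from the underlying principles to building hands-on projects, covering the complete workflow of scraping web data with Python. It is suitable for readers with no programming background as well as those with some experience.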

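II. How Web Crawlers Work

1. What is a web crawler?

A web crawler is a program that visits web pages by imitating a browser and extracts the information it needs. The basic steps are:

Send a request (Request)
Receive the response (Response)
Extract the data (parse HTML, JSON, and so on)
Save the data (to a file, a database, and so on)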
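
The four steps above can be sketched as three small functions. This is a minimal illustration that uses only the standard library so it runs without extra installs; the names `fetch`, `LinkParser`, `extract_links`, and `save` are hypothetical and not part of the original article. The demo at the bottom runs only the parsing step on a hand-written snippet, so no network access is needed.

```python
import json
import urllib.request
from html.parser import HTMLParser


def fetch(url: str) -> str:
    """Steps 1-2: send a GET request and return the response body as text."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Step 3: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html: str) -> list[str]:
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def save(links: list[str], path: str) -> None:
    """Step 4: store the extracted data as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(links, f, ensure_ascii=False, indent=2)


# Offline demo of the parsing step on a hand-written snippet.
sample_html = '<p><a href="/a">A</a> <a href="/b">B</a> <a>no href</a></p>'
links = extract_links(sample_html)
print(links)
```

Against a live page, the same pieces would chain together as `save(extract_links(fetch("https://www.example.com")), "links.json")`.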

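2. HTTP Request Basics

Common protocols: HTTP/HTTPS

Common request methods:

GET: retrieve page content (the most common)
POST: submit data to the server
HEAD/PUT/DELETE: mostly used when working with APIs

Example request (request line and headers):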

GET / HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0
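
To see how the request line and headers above map onto code, the sketch below builds the same request with the standard library's `urllib.request.Request` and inspects it without sending anything. Note that `urllib` stores header names in capitalized form (`User-agent`), which is why the lookup uses that spelling.

```python
import urllib.request

# Build (but do not send) a request equivalent to the raw example above.
req = urllib.request.Request(
    "https://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0"},
)

method = req.get_method()                   # "GET" when no request body is attached
path = req.selector                         # the path part of the request line
host = req.host                             # the value that goes into the Host header
user_agent = req.get_header("User-agent")   # urllib capitalizes stored header names

print(method, path, host, user_agent)
```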

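III. Core Libraries

Library	Purpose
`requests`	Sends HTTP requests and fetches page content
`bs4` (BeautifulSoup)	Parses HTML/XML documents
`lxml`	A faster parser (supports XPath)
`re`	Regular expressions for pattern-based string matching
`selenium`	Browser automation (for JavaScript-rendered pages)
`aiohttp`	Asynchronous HTTP client (suited to large-scale crawling)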

IV. Hands-On Basics: Scraping the Douban Books Top 250

1. Install dependencies

pip install requests beautifulsoup4

2. Example code

import requests
from bs4 import BeautifulSoup

url = 'https://book.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0'
}

res = requests.get(url, headers=headers, timeout=10)
res.raise_for_status()  # fail fast on 4xx/5xx responses
soup = BeautifulSoup(res.text, 'html.parser')

# Each book on the list page sits in a <tr class="item"> row
books = soup.find_all('tr', class_='item')
for book in books:
    # Collapse the line breaks and extra spaces inside the title link
    title = ' '.join(book.find('div', class_='pl2').a.text.split())
    rating = book.find('span', class_='rating_nums').text
    print(f"{title} - rating: {rating}")

3. Sample output

活着 - rating: 9.4
百年孤独 - rating: 9.2
...

V. Advanced Extraction: XPath and Regular Expressions

1. Using lxml + XPath

from lxml import etree
html = etree.HTML(res.text)
titles = html.xpath('//div[@class="pl2"]/a/text()')
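
To try the same kind of path expression without a network request or an lxml install, here is a small sketch using the standard library's `xml.etree.ElementTree`. It is a substitute for illustration, not the lxml API: it supports only a subset of XPath (no `text()` step, for example) and requires well-formed markup. The snippet and its titles are made up.

```python
import xml.etree.ElementTree as ET

# A hand-written, well-formed fragment shaped like the list page rows.
snippet = """
<table>
  <tr class="item"><td><div class="pl2"><a href="/subject/1/"> Book One </a></div></td></tr>
  <tr class="item"><td><div class="pl2"><a href="/subject/2/"> Book Two </a></div></td></tr>
</table>
""".strip()

root = ET.fromstring(snippet)

# Same idea as //div[@class="pl2"]/a, written in ElementTree's XPath subset.
anchors = root.findall(".//div[@class='pl2']/a")
titles = [a.text.strip() for a in anchors]   # text comes from the element itself
hrefs = [a.get("href") for a in anchors]

print(titles, hrefs)
```

Real pages are rarely well-formed XML, which is why the article's lxml `etree.HTML` parser is the practical choice for scraped HTML.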

2. Extracting links with a regular expression

import re
# Escape the dots so "." matches a literal dot instead of any character
urls = re.findall(r'href="(https://book\.douban\.com/subject/\d+/)"', res.text)
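
The same link often appears several times on a page, so a common follow-up is to deduplicate while keeping the original order. The sketch below runs the pattern on a hand-written string; the subject IDs are invented for the demo.

```python
import re

# Hand-written sample markup; the IDs are placeholders, not real book pages.
html = '''
<a href="https://book.douban.com/subject/1000001/">A</a>
<a href="https://book.douban.com/subject/1000002/">B</a>
<a href="https://book.douban.com/subject/1000001/">A again</a>
<a href="https://book.douban.com/tag/fiction">tag page</a>
'''

# Group 1 captures the full URL, group 2 the numeric subject ID.
pattern = re.compile(r'href="(https://book\.douban\.com/subject/(\d+)/)"')

seen = {}
for m in pattern.finditer(html):
    seen.setdefault(m.group(2), m.group(1))  # keep the first URL per subject ID

subject_ids = list(seen)          # dicts preserve insertion order
unique_urls = list(seen.values())
print(subject_ids)
```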

VI. Project: Scraping the Zhihu Hot List and Saving It to Excel

1. Target site

Zhihu Hot List: www.zhihu.com/hot

2. Implementation

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.zhihu.com/hot'
headers = {'User-Agent': 'Mozilla/5.0'}

res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')

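# Note: if the page requires login or is rendered by JavaScript,
# this selector may match nothing and the sheet will be empty
titles = soup.select('.HotList-itemTitle')
data = [{'rank': i + 1, 'title': title.text} for i, title in enumerate(titles)]

# Save to Excel (pandas needs the openpyxl package to write .xlsx files)
df = pd.DataFrame(data)
df.to_excel('zhihu_hot.xlsx', index=False)
print('Hot list saved')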

VII. Handling Anti-Scraping Measures: Rate Limits and Bans

1. Add headers to mimic a browser

headers = {'User-Agent': 'Mozilla/5.0'}

2. Use a proxy

proxies = {
    'http': 'http://123.123.123.123:8888',
    'https': 'http://123.123.123.123:8888'
}
requests.get(url, headers=headers, proxies=proxies)

3. Space out requests

import random
import time
time.sleep(1 + random.random())  # pause for a random 1 to 2 seconds
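
In a real crawl the pause sits inside the request loop. The sketch below wraps that idea in two hypothetical helpers, `polite_delay` and `crawl_politely` (neither is from the original article). The demo passes a stand-in `fetch` function and very short delays so it finishes quickly; with real requests you would keep the defaults or go slower.

```python
import random
import time


def polite_delay(base: float = 1.0, jitter: float = 1.0) -> float:
    """Return a pause length drawn uniformly from [base, base + jitter]."""
    return base + random.uniform(0, jitter)


def crawl_politely(urls, fetch, base=1.0, jitter=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to wait before the first request
            time.sleep(polite_delay(base, jitter))
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function and tiny delays.
pages = crawl_politely(["u1", "u2", "u3"], fetch=str.upper, base=0.01, jitter=0.01)
print(pages)
```

Randomizing the interval, rather than sleeping a fixed amount, makes the traffic pattern less regular while still capping the average request rate.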

4. Use Selenium for JavaScript-rendered pages

from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.taobao.com')
html = driver.page_source
driver.quit()

VIII. Multithreaded and Asynchronous Crawlers

1. Multithreaded crawler (suited to I/O-bound work)

import threading

import requests

def crawl(url):
    res = requests.get(url, timeout=10)
    print(url, len(res.text))

urls = ['https://example.com/page1', 'https://example.com/page2']
threads = [threading.Thread(target=crawl, args=(url,)) for url in urls]

for t in threads:
    t.start()
for t in threads:
    t.join()
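
Starting one thread per URL works for a handful of pages but does not scale to long URL lists. One alternative, not used in the original article, is a bounded thread pool from `concurrent.futures`. In the sketch below, `fake_fetch` is a stand-in for `requests.get` so the example runs offline; swap in a real request function for actual crawling.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url: str) -> tuple[str, int]:
    """Stand-in for requests.get(url): sleeps briefly, returns (url, body length)."""
    time.sleep(0.05)  # simulate network latency
    body = f"<html>{url}</html>"
    return url, len(body)


urls = [f"https://example.com/page{i}" for i in range(1, 6)]

# max_workers caps how many requests are in flight at once.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fake_fetch, urls))  # results come back in input order

for url, size in results:
    print(url, size)
```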

2. Asynchronous crawler (aiohttp + asyncio recommended)

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as res:
        html = await res.text()
        print(url, len(html))

async def main():
    urls = ['https://example.com'] * 10
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)

asyncio.run(main())
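
`asyncio.gather` launches every task at once, which can overwhelm a site when the URL list is long. A common way to cap concurrency is an `asyncio.Semaphore`. The sketch below is an offline illustration: `fake_fetch` replaces the aiohttp call with `asyncio.sleep`, and the `state` dictionary exists only to show that the limit holds.

```python
import asyncio


async def fake_fetch(url: str, sem: asyncio.Semaphore, state: dict) -> int:
    """Stand-in for an aiohttp request; tracks how many run at the same time."""
    async with sem:  # at most `limit` coroutines get past this point at once
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulate waiting on the network
        state["active"] -= 1
        return len(url)


async def main(urls, limit=3):
    sem = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    sizes = await asyncio.gather(*(fake_fetch(u, sem, state) for u in urls))
    return sizes, state["peak"]


sizes, peak = asyncio.run(main(["https://example.com"] * 10, limit=3))
print(len(sizes), peak)
```

With aiohttp, the `async with sem:` block would wrap the `session.get(url)` call in the same way.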

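IX. Storage Options for Scraped Data

Storage	Best suited for
CSV / Excel	Lightweight data storage
SQLite	Lightweight local database
MySQL / Postgres	Shared data management in team projects
MongoDB	JSON-like, loosely structured data

Example: saving to CSV: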

df.to_csv('data.csv', index=False)
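
For the SQLite option in the table, here is a minimal sketch using the standard library's `sqlite3` module. The table name `hot_list`, its columns, and the sample rows are assumptions made up for the demo. It uses an in-memory database so nothing is written to disk.

```python
import sqlite3

# Placeholder rows standing in for scraped (position, title) pairs.
rows = [
    (1, "Sample title A"),
    (2, "Sample title B"),
]

conn = sqlite3.connect(":memory:")  # use a file path such as "crawl.db" to persist
conn.execute(
    "CREATE TABLE IF NOT EXISTS hot_list (pos INTEGER PRIMARY KEY, title TEXT NOT NULL)"
)

# Parameterized queries avoid quoting problems in scraped text.
# INSERT OR REPLACE keeps re-runs from failing on rows that already exist.
conn.executemany("INSERT OR REPLACE INTO hot_list (pos, title) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT pos, title FROM hot_list ORDER BY pos").fetchall()
print(stored)
conn.close()
```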

X. Full Project: Scraping Job Listings and Storing Them

1. Target site: the 51job search page

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,Python,2,1.html'
headers = {'User-Agent': 'Mozilla/5.0'}

res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')

# Note: the real page structure is fairly complex; inspect it with DevTools to find the tag paths

2. Extract fields and write to CSV

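# `jobs` is the list of job nodes selected from `soup`; the selector and the
# field lookups below are left as placeholders because they depend on the page
with open('jobs.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'company', 'location', 'salary'])

    for job in jobs:
        title = ...
        company = ...
        location = ...
        salary = ...
        writer.writerow([title, company, location, salary])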

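XI. Legal Compliance and Scraping Ethics

Web crawlers are powerful, but they should not be abused. When writing a crawler, be sure to:

Follow the rules in the site's robots.txt file
Limit your request rate so your traffic does not resemble an attack
Avoid scraping personal or otherwise sensitive data
Avoid unauthorized commercial use (for example, scraping a paid API)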
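
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a hand-written robots.txt so it runs offline; the rules and the user agent name `MyCrawler` are invented for the demo.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one is served at https://<site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed_public = rp.can_fetch("MyCrawler", "https://www.example.com/books/")
allowed_private = rp.can_fetch("MyCrawler", "https://www.example.com/private/data")
delay = rp.crawl_delay("MyCrawler")  # seconds to wait between requests, if declared

print(allowed_public, allowed_private, delay)
```

Against a live site, you would call `rp.set_url("https://www.example.com/robots.txt")` followed by `rp.read()` instead of `parse`, then check `can_fetch` before each request.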

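XII. Summary and Next Steps

This article walked through the basic principles of web scraping with Python, the commonly used libraries, ways to handle anti-scraping measures, and several project examples. Scraping skills give you an efficient way to collect data and a solid foundation for data analysis, natural language processing, business intelligence, and similar work.

Suggested next steps:

Build larger projects with the Scrapy framework
Use asynchronous techniques for high-throughput crawlers
Combine a database, a crawler, and visualization into a data platform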