DataWhale: Web Crawling from Scratch (Part 3: Advanced Crawling Techniques)

Advanced Crawling Techniques

1. Scraping Dynamic Web Pages

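Using Selenium: Selenium is a browser automation tool, originally built for testing, that supports multiple browsers and simulates a user's actions in the browser. This makes it suitable for scraping pages rendered by JavaScript. For example, you can have Selenium open a page, wait for it to finish loading, and then read the page content.
Using a headless browser: a headless browser is a browser without a graphical user interface. It runs in the background, simulates user actions, and retrieves the content of dynamic pages. Common choices are headless Chrome and headless Firefox.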

import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Configure headless Chrome
options = webdriver.ChromeOptions()
options.add_argument('--headless')     # run without a visible window
options.add_argument('--disable-gpu')  # disable GPU acceleration

# Start the WebDriver
driver = webdriver.Chrome(options=options)

try:
    # Open the target page
    url = 'http://qingfeng.nb'
    driver.get(url)

    # Give JavaScript time to render (tune the delay for the real page)
    time.sleep(3)

    # Grab the rendered HTML
    html = driver.page_source
finally:
    # Always close the browser, even if loading fails
    driver.quit()

# Parse the page with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Extract the data (adjust the selector to the real page structure)
data = []
table = soup.find('table', {'class': 'your_table_class'})
if table is not None:  # soup.find returns None when nothing matches
    for row in table.find_all('tr'):
        cols = [col.text.strip() for col in row.find_all('td')]
        data.append(cols)

# Print the extracted rows
for row in data:
    print(row)
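
A fixed `time.sleep(3)` is either wasted time on a fast page or too short on a slow one. Selenium ships explicit waits (`WebDriverWait`) for this purpose. The plain-Python sketch below only illustrates the idea behind them: poll a condition until it holds or a timeout expires. The helper name `wait_until`, its parameters, and the stand-in condition are made up for illustration and are not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration; not a Selenium API.
    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Stand-in for "the table has appeared in the page":
# it becomes true on the third check.
checks = {'count': 0}


def table_is_present():
    checks['count'] += 1
    return checks['count'] >= 3


wait_until(table_is_present, timeout=2.0, poll_interval=0.01)
print('condition met after', checks['count'], 'checks')
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, 'table.your_table_class')`, which returns an empty (falsy) list until the element exists.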

2. Getting Around Anti-Scraping Measures

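Using proxies: sending requests through a proxy server hides the crawler's real IP address and reduces the chance of being banned by the site. You can use free or paid proxies, and build a proxy pool to make the crawler more stable and efficient.
Changing the User-Agent: the User-Agent is one of the request headers a browser sends to the server, identifying the browser type and version. By changing it, the crawler can present itself as different browsers and is less likely to be identified as a crawler.
Setting a reasonable request interval: imitate a normal user by spacing requests out. This avoids putting excessive load on the site's server and lowers the risk of being banned.
Using CAPTCHA-solving tools: for sites protected by CAPTCHAs, CAPTCHA-solving tools or services (such as human-powered solving platforms) can handle challenges like CAPTCHA and reCAPTCHA.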

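import requests
from fake_useragent import UserAgent

# Generate a random User-Agent with the fake_useragent library
ua = UserAgent()
headers = {
    'User-Agent': ua.random
}

# Route traffic through a proxy (replace with your own proxy addresses)
proxies = {
    'http': 'http://192.168.1.1:8100',
    'https': 'https://192.168.1.1:8101'
}

# Send the request; the timeout keeps a dead proxy from hanging the script
url = 'http://qingfeng.nb'
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    print('Request succeeded')
    # Process the response body
    print(response.text)
else:
    print('Request failed, status code:', response.status_code)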
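
The example above uses a single proxy and no pause between requests. As a rough sketch of the proxy-pool and request-interval points, the code below rotates through a small pool in round-robin order and computes a randomized delay. The proxy addresses, the function names `next_proxies` and `polite_delay`, and the delay values are all assumptions chosen for illustration.

```python
import itertools
import random
import time

# Placeholder proxy addresses for illustration only; replace with proxies you control.
PROXY_POOL = [
    'http://192.168.1.1:8100',
    'http://192.168.1.2:8100',
    'http://192.168.1.3:8100',
]

# itertools.cycle yields the proxies in order, forever.
_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxies():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return {'http': proxy, 'https': proxy}


def polite_delay(base=1.0, jitter=0.5):
    """Return a randomized delay in seconds, between base and base + jitter."""
    return base + random.uniform(0, jitter)


for page in range(1, 5):
    proxies = next_proxies()
    # Tiny values so the demo finishes quickly; real crawls need longer pauses.
    delay = polite_delay(base=0.01, jitter=0.01)
    print(f'page {page}: proxy={proxies["http"]}, sleeping {delay:.3f}s')
    time.sleep(delay)
    # A real crawler would now call something like:
    # requests.get(url, headers=headers, proxies=proxies, timeout=10)
```

Randomizing the interval makes the traffic pattern less regular than a fixed sleep. A production pool would also need to drop proxies that stop responding, which this sketch leaves out.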

3. Optimizing Crawler Performance

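Using multiple processes or threads: handling several requests at the same time improves the crawler's speed and throughput. Python's multiprocessing and threading modules provide processes and threads.
Using an asynchronous crawler: an asynchronous crawler issues many requests without waiting for each response before sending the next, which improves crawling efficiency. Libraries such as asyncio and aiohttp can be used to build one.
Using a cache: for data that is requested repeatedly, store the results already fetched in a cache so the same request is not sent twice.
Limiting the crawl rate: cap the request frequency to avoid putting excessive load on the site's server. This also makes the crawler more stable.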

import requests
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

# Fetch and parse a single page
def fetch_page(url):
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the data (adjust the selector to the real page structure)
    table = soup.find('table', {'class': 'your_table_class'})
    if table is None:  # no matching table on this page
        return None
    data = []
    for row in table.find_all('tr'):
        cols = [col.text.strip() for col in row.find_all('td')]
        data.append(cols)
    return data

# Crawl several pages concurrently with ThreadPoolExecutor
urls = ['http://qingfeng.nb']
results = []
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch_page, url) for url in urls]
    for future in futures:
        result = future.result()
        if result:
            results.append(result)

# Print the crawl results
for result in results:
    print(result)
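
The thread-pool example covers concurrency but not the caching and rate-limiting points from the list. A minimal sketch of both follows: a wrapper that remembers pages it has already fetched and enforces a minimum gap between real requests. The class name `CachingFetcher`, its parameters, and the `fake_fetch` stand-in (used here instead of a network call) are assumptions for illustration.

```python
import time


class CachingFetcher:
    """Wrap a fetch function with an in-memory cache and a minimum gap
    between real requests. Hypothetical helper for illustration."""

    def __init__(self, fetch, min_interval=1.0):
        self._fetch = fetch
        self._min_interval = min_interval
        self._cache = {}
        self._last_request = None
        self.real_requests = 0

    def get(self, url):
        # Serve repeated URLs from the cache without touching the network.
        if url in self._cache:
            return self._cache[url]
        # Otherwise wait until min_interval has passed since the last real request.
        if self._last_request is not None:
            wait = self._min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
        body = self._fetch(url)
        self._last_request = time.monotonic()
        self.real_requests += 1
        self._cache[url] = body
        return body


def fake_fetch(url):
    """Stand-in for requests.get(url).text so the sketch runs offline."""
    return f'<html>content of {url}</html>'


fetcher = CachingFetcher(fake_fetch, min_interval=0.05)
for url in ['http://qingfeng.nb/a', 'http://qingfeng.nb/b', 'http://qingfeng.nb/a']:
    fetcher.get(url)
print('real requests:', fetcher.real_requests)
```

This sketch is single-threaded. Sharing one instance across the worker threads of a `ThreadPoolExecutor` would require a lock around `get`.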

4. Data Storage and Management

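Using a database: store the scraped data in a database so it can be queried and analyzed later. Common choices include MySQL and MongoDB.
Data cleaning and preprocessing: before storing the data, clean and preprocess it to remove useless information and duplicates, so the stored data is accurate and consistent.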

import sqlite3
import requests
from bs4 import BeautifulSoup

# Open the database connection
conn = sqlite3.connect('stock_data.db')
cursor = conn.cursor()

# Create the table
cursor.execute('''
CREATE TABLE IF NOT EXISTS stock_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    column1 TEXT,
    column2 TEXT,
    column3 TEXT
)
''')

# Send the request and parse the data
url = 'http://qingfeng.nb'
response = requests.get(url, timeout=10)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    data = []
    table = soup.find('table', {'class': 'your_table_class'})
    rows = table.find_all('tr') if table is not None else []
    for row in rows:
        cols = [col.text.strip() for col in row.find_all('td')]
        # Keep only rows that match the table's three columns; header rows
        # (<th> only) yield an empty list and would make the INSERT fail
        if len(cols) == 3:
            data.append(cols)

    # Insert the rows into the database
    cursor.executemany('INSERT INTO stock_data (column1, column2, column3) VALUES (?, ?, ?)', data)
    conn.commit()
    print('Data saved to the database')
else:
    print('Request failed, status code:', response.status_code)

# Close the database connection
conn.close()
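
The storage example writes rows as scraped. To illustrate the cleaning and deduplication point, here is a small sketch that strips whitespace, drops incomplete rows, removes duplicates, and lets a `UNIQUE` constraint guard against re-inserting the same row on a later run. The function name `clean_rows`, the sample rows, and the in-memory database are made up for illustration; only the `stock_data` table and column names come from the example above.

```python
import sqlite3


def clean_rows(rows, expected_cols=3):
    """Strip whitespace, drop rows with the wrong column count or empty
    cells, and remove duplicates while keeping first-seen order."""
    seen = set()
    cleaned = []
    for row in rows:
        cells = tuple(cell.strip() for cell in row)
        if len(cells) != expected_cols or not all(cells):
            continue
        if cells in seen:
            continue
        seen.add(cells)
        cleaned.append(cells)
    return cleaned


# Made-up sample rows standing in for scraped table cells.
raw = [
    [],                         # header row: no <td> cells
    ['  AAA ', '10.5', 'up'],
    ['AAA', '10.5', 'up'],      # duplicate once whitespace is stripped
    ['BBB', '', 'down'],        # missing value
    ['CCC', '7.2', 'down'],
]

rows = clean_rows(raw)

conn = sqlite3.connect(':memory:')  # in-memory database so the sketch leaves no file behind
conn.execute('''
CREATE TABLE stock_data (
    column1 TEXT,
    column2 TEXT,
    column3 TEXT,
    UNIQUE (column1, column2, column3)
)
''')
# INSERT OR IGNORE skips rows that would violate the UNIQUE constraint.
conn.executemany('INSERT OR IGNORE INTO stock_data VALUES (?, ?, ?)', rows)
conn.executemany('INSERT OR IGNORE INTO stock_data VALUES (?, ?, ?)', rows)  # simulated second run
conn.commit()
stored = conn.execute(
    'SELECT column1, column2, column3 FROM stock_data ORDER BY column1'
).fetchall()
conn.close()
print(stored)
```

Deduplicating in Python handles repeats within one crawl, while the database constraint handles repeats across crawls.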

5. Advanced Crawling Strategies

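Distributed crawlers: deploying the crawler on multiple servers makes large-scale crawling possible and improves efficiency and stability.
Cloud crawlers: the computing power and storage offered by cloud platforms allow large-scale crawling to run more efficiently.
AI-assisted crawlers: machine learning algorithms can identify and extract specific information, making the crawler smarter.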

import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Fetch and parse a single page asynchronously
async def fetch_page(session, url):
    async with session.get(url) as response:
        if response.status != 200:
            return None
        html = await response.text()
        soup = BeautifulSoup(html, 'html.parser')
        # Extract the data (adjust the selector to the real page structure)
        table = soup.find('table', {'class': 'your_table_class'})
        if table is None:  # no matching table on this page
            return None
        data = []
        for row in table.find_all('tr'):
            cols = [col.text.strip() for col in row.find_all('td')]
            data.append(cols)
        return data

# Crawl asynchronously with asyncio and aiohttp
async def main():
    urls = ['http://qingfeng.nb']
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    # Print the crawl results
    for result in results:
        if result:
            print(result)

# Run the asynchronous entry point
asyncio.run(main())
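
The async example runs on one machine. One building block of the distributed crawling idea from this section is deciding which server handles which URL. The sketch below shows one possible approach, assuming a fixed number of workers: hash each URL to a worker index so that every URL is crawled by exactly one worker. The function names and the worker count are hypothetical, and a real deployment would also need a shared queue and failure handling, which are outside this sketch.

```python
import hashlib


def assign_worker(url, num_workers):
    """Map a URL to a worker index in range(num_workers) using a stable hash.

    hashlib is used instead of the built-in hash(), whose output for strings
    differs between Python processes by default.
    """
    digest = hashlib.sha256(url.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % num_workers


def partition(urls, num_workers):
    """Split URLs into one list per worker, skipping duplicate URLs."""
    shards = [[] for _ in range(num_workers)]
    seen = set()
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        shards[assign_worker(url, num_workers)].append(url)
    return shards


# Ten distinct page URLs plus one duplicate.
urls = [f'http://qingfeng.nb/page/{i}' for i in range(10)]
urls.append('http://qingfeng.nb/page/0')

shards = partition(urls, num_workers=3)
for index, shard in enumerate(shards):
    print(f'worker {index}: {len(shard)} URLs')
```

Because the assignment depends only on the URL, every server computes the same answer without coordination. The trade-off is that changing the number of workers reassigns most URLs.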

6. Other Tips

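Learn HTML and CSS: a solid understanding of page structure helps you locate and extract data more accurately.
Get comfortable with regular expressions: regular expressions are a powerful text-processing tool for extracting data with complex structure.
Follow crawler etiquette: respect the site's terms of use, avoid consuming excessive site resources, and make sure your crawling is legal and compliant.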
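
One concrete way to act on the etiquette point is to check a site's robots.txt before fetching a URL. The sketch below uses the standard library's `urllib.robotparser` on made-up robots.txt content, so it runs without network access. The rules, the crawler name `my-crawler`, and the URL paths are assumptions for illustration.

```python
from urllib import robotparser

# Made-up robots.txt content for illustration; a real crawler loads it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = 'my-crawler'  # hypothetical crawler name

for url in ['http://qingfeng.nb/index.html', 'http://qingfeng.nb/private/data.html']:
    verdict = 'allowed' if parser.can_fetch(USER_AGENT, url) else 'disallowed'
    print(url, '->', verdict)

# Crawl-delay, when the site declares one, suggests the pause between requests.
print('crawl delay:', parser.crawl_delay(USER_AGENT))
```

To read the rules from a live site instead, call `parser.set_url('http://qingfeng.nb/robots.txt')` followed by `parser.read()` in place of `parser.parse(...)`.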