A Python Multithreaded Data-Crawling Template

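A recent project required crawling a large number of sites. Early testing only needed small-batch runs, but as the volume grew the number of threads had to grow with it, so the crawler had to be multithreaded and easy to scale at any time. Python was my first choice because of its rich library support.

Below is an example of a multithreaded data-crawling application written in Python. It uses the threading and queue modules for multithreading and sends HTTP requests with the requests library.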

import threading
import queue
import requests
import time
from urllib.parse import urlparse
import csv
import os

# Configuration
MAX_THREADS = 5  # maximum number of crawler threads
REQUEST_TIMEOUT = 10  # request timeout (seconds)
OUTPUT_FILE = 'crawled_data.csv'  # output file name
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

class CrawlerThread(threading.Thread):
    def __init__(self, url_queue, data_queue, lock):
        threading.Thread.__init__(self)
        self.url_queue = url_queue
        self.data_queue = data_queue
        self.lock = lock
        self.session = requests.Session()

    def run(self):
        while True:
            try:
                # Get a URL from the queue; exit if none arrives within 3 seconds
                url = self.url_queue.get(timeout=3)
                self.crawl(url)
                self.url_queue.task_done()
            except queue.Empty:
                break

    def crawl(self, url):
        try:
            # Send the HTTP request
            response = self.session.get(
                url,
                headers=HEADERS,
                timeout=REQUEST_TIMEOUT,
                allow_redirects=True
            )
            response.raise_for_status()  # raise on HTTP error status codes (4xx/5xx)

            # Extract the domain
            domain = urlparse(url).netloc

            # Put the result on the data queue
            self.data_queue.put({
                'url': url,
                'domain': domain,
                'status': response.status_code,
                'length': len(response.content),
                'success': True
            })
            
            print(f"[SUCCESS] Crawled: {url} ({response.status_code})")
            
        except Exception as e:
            self.data_queue.put({
                'url': url,
                'error': str(e),
                'success': False
            })
            print(f"[ERROR] Failed {url}: {str(e)}")

def data_writer(data_queue, lock, filename):
    """Writer thread: drains data_queue into the CSV file."""
    fieldnames = ['url', 'domain', 'status', 'length', 'error', 'success']

    # Create the CSV file and write the header row
    with lock:
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()

    while True:
        try:
            data = data_queue.get(timeout=5)
        except queue.Empty:
            # Nothing to write yet; keep polling. This thread is a daemon, and
            # main() returns once data_queue.join() has seen every row written.
            continue

        with lock:
            with open(filename, 'a', newline='', encoding='utf-8') as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                # Fill fields missing from this record with empty strings
                row = {key: data.get(key, '') for key in fieldnames}
                writer.writerow(row)

        data_queue.task_done()

def main(urls):
    # Create the queues
    url_queue = queue.Queue()
    data_queue = queue.Queue()
    lock = threading.Lock()

    # Fill the URL queue
    for url in urls:
        url_queue.put(url)

    # Start the crawler threads
    crawlers = []
    for _ in range(min(MAX_THREADS, len(urls))):
        t = CrawlerThread(url_queue, data_queue, lock)
        t.daemon = True
        t.start()
        crawlers.append(t)

    # Start the writer thread
    writer_thread = threading.Thread(
        target=data_writer,
        args=(data_queue, lock, OUTPUT_FILE),
        daemon=True
    )
    writer_thread.start()

    # Wait until every URL has been processed and every result written
    url_queue.join()
    data_queue.join()

    print(f"\nCrawl finished! Results saved to {OUTPUT_FILE}")

if __name__ == "__main__":
    # Example URL list; replace with the URLs you actually need to crawl
    urls_to_crawl = [
        'https://www.example.com',
        'https://www.google.com',
        'https://www.github.com',
        'https://www.python.org',
        'https://www.wikipedia.org',
        'https://www.openai.com',
        'https://www.amazon.com',
        'https://www.microsoft.com'
    ]
    
    print(f"Crawling {len(urls_to_crawl)} URLs...")
    start_time = time.time()

    main(urls_to_crawl)

    duration = time.time() - start_time
    print(f"Total time: {duration:.2f} seconds")

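Main features:

1. Multithreaded architecture:

A fixed set of worker threads (the CrawlerThread class) consumes a shared URL queue, much like a thread pool
A separate writer thread, so that frequent file I/O does not block the crawler threads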

2. Components:

URL queue: holds the URLs waiting to be crawled
Data queue: collects the crawl results
Thread lock: keeps file writes safe

3. Error handling:

Request timeouts
HTTP error status codes
Exceptions are caught and recorded

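4. Data recorded:

Successful requests: URL, domain, status code, content length
Failed requests: the error message
CSV output (easy to import into Excel or a database)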

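5. Configuration options:

Maximum number of threads (tune to network conditions and the target site)
Request timeout
Custom request headers (to mimic a browser)
Output file name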
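
The pieces listed above (worker threads, a URL queue, a data queue, and a single writer thread) can be sketched in a self-contained form that runs without network access. This is a minimal sketch under stated assumptions, not the program above: fake_fetch is a hypothetical stand-in for the HTTP request, _STOP is a hypothetical sentinel that tells the writer when to stop, and results go to an in-memory buffer instead of a file.

```python
import csv
import io
import queue
import threading

_STOP = object()  # hypothetical sentinel marking the end of the result stream


def worker(url_queue, data_queue, fetch):
    """Pull URLs until the queue is drained, pushing one result dict per URL."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        try:
            data_queue.put({'url': url, 'length': fetch(url), 'success': True})
        except Exception as exc:  # record the failure instead of killing the thread
            data_queue.put({'url': url, 'error': str(exc), 'success': False})


def writer(data_queue, out):
    """Single consumer: write rows until the sentinel arrives."""
    fields = ['url', 'length', 'error', 'success']
    csv_writer = csv.DictWriter(out, fieldnames=fields)
    csv_writer.writeheader()
    while True:
        item = data_queue.get()
        if item is _STOP:
            return
        csv_writer.writerow({k: item.get(k, '') for k in fields})


def run_pipeline(urls, fetch, out, max_threads=5):
    """Fill the URL queue, run the workers, then stop the writer cleanly."""
    url_queue, data_queue = queue.Queue(), queue.Queue()
    for url in urls:
        url_queue.put(url)
    writer_thread = threading.Thread(target=writer, args=(data_queue, out))
    writer_thread.start()
    workers = [threading.Thread(target=worker, args=(url_queue, data_queue, fetch))
               for _ in range(max(1, min(max_threads, len(urls))))]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    data_queue.put(_STOP)  # all producers are done, so the writer can stop
    writer_thread.join()


def fake_fetch(url):
    """Stand-in for an HTTP request: returns a fake content length."""
    if 'bad' in url:
        raise ValueError('simulated failure')
    return len(url)


buffer = io.StringIO()
run_pipeline(['http://a.test', 'http://bad.test', 'http://c.test'], fake_fetch, buffer)
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

Because only the writer thread touches the output, this variant needs no lock, and the sentinel lets the writer exit as soon as the last result has been written rather than relying on a timeout.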

Usage:

1. Install the dependency:

pip install requests

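2. Adjust the configuration:

Add the URLs you actually want to crawl to the urls_to_crawl list
Adjust MAX_THREADS (5 to 10 threads is a reasonable starting point)
Change OUTPUT_FILE to change the output path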

3. Run the program (assuming the script is saved as crawler.py):

python crawler.py

Notes:

1. Request rate control:

A delay can be added in CrawlerThread.crawl():

time.sleep(0.5)  # pause for 0.5 seconds
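
A per-thread sleep slows each thread independently, so with several threads the combined request rate is still several times higher than one request per delay. If a global cap is wanted, one option is a limiter shared by all threads. The following is a minimal sketch; the RateLimiter class and its wait() method are hypothetical names, not part of the program above.

```python
import threading
import time


class RateLimiter:
    """Enforce a minimum interval between calls across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic time of the next free slot

    def wait(self):
        """Reserve the next slot, sleep until it starts, and return the delay."""
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self.min_interval
        delay = start - now
        if delay > 0:
            time.sleep(delay)  # sleep outside the lock so other threads can queue up
        return delay


limiter = RateLimiter(min_interval=0.05)
t0 = time.monotonic()
delays = [limiter.wait() for _ in range(3)]
elapsed = time.monotonic() - t0
```

In the crawler, a single limiter instance would be created once and each thread would call limiter.wait() before sending its request.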

2. Proxy support:

Requests can be routed through a proxy, and proxy rotation can be built on the same parameter:

proxies = {'http': 'http://proxy_ip:port', 'https': 'https://proxy_ip:port'}
response = self.session.get(url, proxies=proxies, ...)
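
The snippet above uses one fixed proxy. A rotation scheme can be sketched as a small thread-safe helper that hands out the next proxy on each call. This is only an illustration: the ProxyRotator class is a hypothetical name, the addresses are placeholders, and the same proxy URL is assumed to serve both HTTP and HTTPS traffic.

```python
import itertools
import threading


class ProxyRotator:
    """Cycle through a list of proxy URLs, one per request, across threads."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError('at least one proxy URL is required')
        self._cycle = itertools.cycle(proxy_urls)
        self._lock = threading.Lock()

    def next_proxies(self):
        """Return a dict in the shape that requests accepts for proxies=."""
        with self._lock:
            proxy = next(self._cycle)
        return {'http': proxy, 'https': proxy}


# Placeholder addresses; replace with real proxies
rotator = ProxyRotator(['http://10.0.0.1:8080', 'http://10.0.0.2:8080'])
first = rotator.next_proxies()
second = rotator.next_proxies()
third = rotator.next_proxies()  # wraps around to the first proxy again
```

Each crawler thread would then pass proxies=rotator.next_proxies() to self.session.get().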

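This program provides a basic framework for a multithreaded crawler that can be extended and modified to suit your actual needs. For large-scale crawling, I would personally still recommend the Scrapy framework or an asynchronous I/O library such as aiohttp.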
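
To give a feel for the asynchronous style, here is a minimal sketch that uses only the standard-library asyncio module. It is not an aiohttp example: fake_fetch is a hypothetical stand-in for the real HTTP call, which an aiohttp-based crawler would replace with an actual request, and crawl_all is a name invented for this illustration.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real async HTTP call; returns a fake content length."""
    await asyncio.sleep(0.01)
    if 'bad' in url:
        raise ValueError('simulated failure')
    return len(url)


async def crawl_all(urls, fetch, max_concurrency=5):
    """Fetch all URLs with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def crawl_one(url):
        async with semaphore:
            try:
                return {'url': url, 'length': await fetch(url), 'success': True}
            except Exception as exc:  # record the failure, keep the other tasks going
                return {'url': url, 'error': str(exc), 'success': False}

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(crawl_one(u) for u in urls))


results = asyncio.run(crawl_all(['http://a.test', 'http://bad.test'], fake_fetch))
```

The semaphore plays the role that MAX_THREADS plays in the threaded version: it caps how many requests run at once, but without dedicating an operating-system thread to each one.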