Your First Python Finance Scraper

1. Introduction

As a professional scraping engineer, I will walk you through building your first Python financial-data scraper. Starting from scratch, this article covers the development workflow, key techniques, and best practices for financial scrapers.

2. Environment Setup

2.1 Create a Virtual Environment

bash
# Create the project directory
mkdir python_spider
cd python_spider

# Create a virtual environment
python -m venv venv

# Activate it (Windows)
venv\Scripts\activate

# Activate it (macOS/Linux)
source venv/bin/activate

2.2 Install Dependencies

bash
# Install the core libraries
pip install requests pandas

3. Scraping Fundamentals

3.1 HTTP Basics

  • GET request: retrieves data
  • POST request: submits data
  • Request headers: carry client information (e.g. User-Agent)
  • Status codes: 200 OK, 404 Not Found, 500 Internal Server Error
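These basics map to code in a few lines. The `describe_status` helper below is purely illustrative (it is not part of the final script); the commented-out `requests.get` call marks where a real request would go, since that needs network access.

```python
# Coarse classification of HTTP status codes, as listed above.
def describe_status(code: int) -> str:
    if 200 <= code < 300:
        return "success"        # e.g. 200 OK
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"   # e.g. 404 Not Found
    if 500 <= code < 600:
        return "server error"   # e.g. 500 Internal Server Error
    return "unknown"

# A real GET request would look like this (requires network access):
# import requests
# resp = requests.get("https://example.com", timeout=10)
# print(resp.status_code, describe_status(resp.status_code))

print(describe_status(200))   # success
print(describe_status(404))   # client error
```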

3.2 Data Parsing

  • JSON parsing: consume the structured data an API returns directly
  • Field mapping: rename raw field codes to readable (Chinese) column names
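A minimal sketch of both techniques, using a hand-written JSON payload that mimics the nested `data.diff` structure the Eastmoney endpoint returns; `FIELD_MAP` here is a shortened, illustrative mapping.

```python
import json

# Hand-written stand-in for a real API response body
raw = '{"data": {"diff": [{"f12": "600519", "f14": "贵州茅台", "f2": 1700.0}]}}'

payload = json.loads(raw)            # JSON text -> Python dict
rows = payload["data"]["diff"]       # the list of per-stock records

# Field mapping: rename cryptic fN keys to readable names
FIELD_MAP = {"f12": "code", "f14": "name", "f2": "price"}
records = [{FIELD_MAP[k]: v for k, v in r.items() if k in FIELD_MAP}
           for r in rows]
print(records[0]["code"])   # 600519
```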

3.3 Data Storage

  • CSV files: simple tabular storage
  • JSON files: structured storage
  • Databases: persistent storage with SQLite/MySQL
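The three options can be tried side by side with the standard library alone. Everything below writes to a temporary directory, and the file names are arbitrary.

```python
import csv
import json
import os
import sqlite3
import tempfile

rows = [{"code": "600519", "name": "贵州茅台", "price": 1700.0}]
outdir = tempfile.mkdtemp()

# CSV: flat tabular storage (utf-8-sig adds a BOM so Excel shows Chinese)
csv_path = os.path.join(outdir, "demo.csv")
with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# JSON: preserves nesting and value types
json_path = os.path.join(outdir, "demo.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False)

# SQLite: queryable persistent storage, no server needed
conn = sqlite3.connect(os.path.join(outdir, "demo.db"))
conn.execute("CREATE TABLE stocks (code TEXT, name TEXT, price REAL)")
conn.executemany("INSERT INTO stocks VALUES (:code, :name, :price)", rows)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM stocks").fetchone()[0]
print(count)   # 1
```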

4. Your First Financial Scraper: Fetching Stock Data

4.1 Target Analysis

We will fetch basic stock information from Eastmoney's public API, including:

  • stock code and name
  • latest price, percentage change, and absolute change
  • trading volume and turnover
  • amplitude and turnover rate
  • total and free-float market capitalization

4.2 Technology Choices

  • HTTP client: requests
  • Parsing: JSON structure traversal
  • Storage: pandas + CSV

4.3 Implementation

The full code is in c01_hello.py.

Example parameters for the Eastmoney clist/get endpoint:

  • pn: page number, e.g. 1
  • pz: page size, e.g. 20
  • po: sort direction, 1 = ascending
  • np: response format, 1 = JSON array
  • ut: API token, e.g. bd1d9ddb04089700cf9c27f6f7426281
  • fltt: float formatting, 2
  • invt: list format, 2
  • fid: sort field, e.g. f12
  • fs: market/board filter, e.g. m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23
  • fields: returned fields, e.g. f12,f14,f2,f3,f4,f5,f6,f7,f8,f20,f21
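Assuming these parameters are sent as a query string, the request URL can be assembled with the standard library (values copied from the list above):

```python
from urllib.parse import urlencode

# Build the query string for the clist/get endpoint
params = {
    "pn": "1", "pz": "20", "po": "1", "np": "1",
    "ut": "bd1d9ddb04089700cf9c27f6f7426281",
    "fltt": "2", "invt": "2", "fid": "f12",
    "fs": "m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23",
    "fields": "f12,f14,f2,f3,f4,f5,f6,f7,f8,f20,f21",
}
url = "https://push2.eastmoney.com/api/qt/clist/get?" + urlencode(params)
print(url.split("?")[0])   # https://push2.eastmoney.com/api/qt/clist/get
```

In the full script, `requests.get(url, params=params)` does this encoding automatically; building the URL by hand is only useful for debugging in a browser.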

Common field mappings:

  • f12 → stock code
  • f14 → stock name
  • f2 → latest price
  • f3 → percentage change
  • f4 → absolute change
  • f5 → volume
  • f6 → turnover
  • f7 → amplitude (percentage)
  • f8 → turnover rate (percentage)
  • f20 → total market cap
  • f21 → free-float market cap
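The table above translates directly into a dict. `rename_fields` is an illustrative helper (the full script inlines the same idea), and `f99` stands in for any field we did not ask for.

```python
# Raw fN keys -> the Chinese column names used in the output CSV
FIELD_MAP = {
    "f12": "股票代码", "f14": "股票名称", "f2": "最新价", "f3": "涨跌幅",
    "f4": "涨跌额", "f5": "成交量", "f6": "成交额", "f7": "振幅",
    "f8": "换手率", "f20": "总市值", "f21": "流通市值",
}

def rename_fields(raw: dict) -> dict:
    """Keep only known fN keys and rename them to readable column names."""
    return {FIELD_MAP[k]: v for k, v in raw.items() if k in FIELD_MAP}

row = rename_fields({"f12": "688981", "f14": "中芯国际", "f2": 119.8, "f99": 0})
print(row["股票代码"])   # 688981
```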

5. Scraper Development Workflow

5.1 Steps

  1. Target analysis: identify the endpoint and fields
  2. Request: build the HTTP request and fetch the JSON
  3. Parsing: extract and map the core fields
  4. Cleaning: handle missing values and formats
  5. Storage: save to a CSV file
  6. Error handling: catch network and parsing exceptions
  7. Performance: throttle request frequency and add a retry strategy
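Steps 6 and 7 can be combined into a small retry helper. This is a sketch: `flaky_fetch` simulates an unreliable endpoint instead of making a real network call, and the delays are shortened for demonstration.

```python
import time

def fetch_with_retry(fetch, max_retries=3, base_delay=0.01):
    """Call fetch(); on error, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise                                     # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))       # exponential backoff

calls = {"n": 0}
def flaky_fetch():
    """Fails twice, then succeeds - stands in for an unreliable endpoint."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return {"ok": True}

result = fetch_with_retry(flaky_fetch)
print(result, "after", calls["n"], "attempts")   # {'ok': True} after 3 attempts
```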

5.2 Best Practices

  1. Respect robots.txt: honor each site's crawling rules
  2. Add sensible delays: avoid putting pressure on the server
  3. Set a User-Agent: behave like a normal browser client
  4. Handle exceptions: keep the program robust
  5. Validate data: guarantee data quality
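Practice 1 can be automated with the standard library's `urllib.robotparser`. The robots.txt body below is hand-written for the example; a real crawler would download the site's actual `/robots.txt` first.

```python
from urllib.robotparser import RobotFileParser

# Parse a hand-written robots.txt locally (no network involved)
robots_txt = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) applies the rules for us
print(rp.can_fetch("MySpider", "https://example.com/quotes"))     # True
print(rp.can_fetch("MySpider", "https://example.com/private/x"))  # False
```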

6. Common Problems and Solutions

6.1 Dealing with Anti-Scraping Measures

  • IP bans: rotate through a proxy pool
  • CAPTCHAs: use OCR or a third-party solving service
  • JavaScript rendering: use Selenium or Playwright
  • Rate limiting: add randomized delays
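A sketch of the last countermeasure plus simple header rotation; the User-Agent strings are just sample desktop-browser values, and the 1-3 second range mirrors the delay used in the full script.

```python
import random

# A small pool of sample desktop User-Agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def build_headers() -> dict:
    """Rotate the User-Agent so consecutive requests look less uniform."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(low: float = 1.0, high: float = 3.0) -> float:
    """Return a randomized sleep length for the caller to pass to time.sleep()."""
    return random.uniform(low, high)

headers = build_headers()
print(headers["User-Agent"].startswith("Mozilla/5.0"))  # True
print(1.0 <= polite_delay() <= 3.0)                     # True
```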

6.2 Ensuring Data Quality

  • Field validation: check record completeness
  • De-duplication: avoid duplicate rows
  • Outlier handling: filter implausible values
  • Backups: back up important data regularly
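A minimal cleaning pass covering the first three points, written as a plain function; the price bounds are an assumed sanity range for this example, not an official rule.

```python
def clean_rows(rows):
    """Validate, de-duplicate (by code), and filter obviously bad rows."""
    seen, out = set(), []
    for r in rows:
        if not r.get("code") or r.get("price") is None:
            continue                        # field validation: required values
        if r["code"] in seen:
            continue                        # de-duplication
        if not (0 < r["price"] < 10000):
            continue                        # outlier filter (assumed range)
        seen.add(r["code"])
        out.append(r)
    return out

raw = [
    {"code": "600519", "price": 1700.0},
    {"code": "600519", "price": 1700.0},   # duplicate
    {"code": "000001", "price": -5.0},     # impossible price
    {"code": "", "price": 10.0},           # missing code
]
clean = clean_rows(raw)
print(len(clean))   # 1
```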

7. Where to Go Next

7.1 Techniques

  • Asynchronous scraping: use aiohttp for throughput
  • Distributed scraping: Scrapy-Redis
  • Browser automation: Selenium/Playwright
  • API reverse engineering: analyze JavaScript-driven endpoints
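The idea behind asynchronous scraping can be shown without a live endpoint: `fake_fetch` below uses `asyncio.sleep` as a stand-in for an awaited HTTP call (a real implementation would use aiohttp's `ClientSession` instead).

```python
import asyncio

async def fake_fetch(code: str) -> dict:
    await asyncio.sleep(0.01)   # placeholder for an awaited HTTP request
    return {"code": code, "ok": True}

async def crawl(codes):
    # gather() runs all coroutines concurrently, so total time is roughly
    # one request's latency, not the sum of all requests
    return await asyncio.gather(*(fake_fetch(c) for c in codes))

results = asyncio.run(crawl(["600519", "688981", "000001"]))
print(len(results))   # 3
```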

7.2 Applying the Data

  • Analysis: explore the data with pandas
  • Visualization: matplotlib/plotly
  • Machine learning: build predictive models
  • Real-time monitoring: build a data monitoring system
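As a taste of the analysis step, here is a tiny pass over three rows (names and percentage changes taken from the sample CSV later in this article):

```python
# Find the biggest gainer and the average percentage change
rows = [
    {"name": "五矿新能", "pct_change": 12.96},
    {"name": "厦钨新能", "pct_change": 4.82},
    {"name": "科思科技", "pct_change": -1.81},
]
top = max(rows, key=lambda r: r["pct_change"])
avg = sum(r["pct_change"] for r in rows) / len(rows)
print(top["name"])     # 五矿新能
print(round(avg, 2))   # 5.32
```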

8. Summary

This first Python financial scraper covers the core workflow of scraper development: environment setup, target analysis, implementation, and data storage. Through this example you should now be able to:

  1. send basic HTTP requests
  2. parse JSON API responses (this tutorial uses a JSON endpoint rather than HTML scraping)
  3. extract and clean data
  4. write results to files
  5. handle exceptions and design for robustness

Remember: scraping is not only about the technology; compliance and data quality matter just as much. In real projects, always follow applicable laws and regulations and each site's terms of service.

9. References

  1. Requests official documentation
  2. Pandas official documentation
  3. HTTP status code reference

4.4 Running and Output

bash
python c01_hello.py

  • Running the script produces stock_data.csv (UTF-8-SIG encoded)
  • Logs are written to stock_spider.log
  • The console prints a collection summary and a preview of the first 5 rows

Full code

python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
First Python financial-data scraper - stock data collection

Features:
1. Fetch the stock list from the Eastmoney API
2. Parse basic stock information (code, name, price, change, etc.)
3. Clean and format the data
4. Save the results to a CSV file
5. Full exception handling and logging

Author: a professional scraping engineer
Created: 2025
Version: v1.0
"""

import requests
import pandas as pd
import time
import random
import logging
from typing import List, Dict, Optional

# Configure logging to both a file and the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('stock_spider.log', encoding='utf-8'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)


class StockSpider:
    """Stock data spider"""
    
    def __init__(self):
        """Initialize the spider configuration"""
        # Target URL - Eastmoney stock list API
        self.base_url = "https://push2.eastmoney.com/api/qt/clist/get"
        
        # Request headers that mimic a normal browser
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Referer': 'http://quote.eastmoney.com/center/gridlist.html',
        }
        
        # Request timeout (seconds)
        self.timeout = 10
        
        # Delay between requests (seconds), sampled once per run,
        # to avoid hammering the server
        self.delay = random.uniform(1.0, 3.0)
        
        logger.info("Stock spider initialized")
    
    def make_request(self, url: str, params: Optional[Dict] = None) -> Optional[Dict]:
        """
        Send an HTTP GET request and return the parsed JSON body.
        
        Args:
            url: target URL
            params: query parameters
            
        Returns:
            dict: parsed JSON response, or None on failure
        """
        try:
            # Random delay to avoid triggering anti-scraping measures
            time.sleep(self.delay)
            
            logger.info(f"Requesting URL: {url}")
            
            # Send the GET request
            response = requests.get(
                url,
                headers=self.headers,
                params=params,
                timeout=self.timeout
            )
            
            # Raise an exception for non-2xx status codes
            response.raise_for_status()
            
            # Not required for .json(), but keeps response.text readable
            response.encoding = 'utf-8'
            
            logger.info(f"Request succeeded, status code: {response.status_code}")
            return response.json()
            
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            return None
        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            return None
    
    def parse_stock_data(self, data: Dict) -> List[Dict]:
        """
        Extract stock records from the parsed JSON response.
        
        Args:
            data: JSON dict returned by the API
            
        Returns:
            List[Dict]: list of stock records
        """
        stocks: List[Dict] = []
        try:
            if not data or 'data' not in data or not data['data']:
                return stocks
            diff = data['data'].get('diff') or []
            for item in diff:
                try:
                    # Column names stay in Chinese so they match the CSV output
                    stock_data = {
                        '股票代码': str(item.get('f12') or '').strip(),   # stock code
                        '股票名称': str(item.get('f14') or '').strip(),   # stock name
                        '最新价': self._clean_number(item.get('f2')),     # latest price
                        '涨跌幅': self._clean_number(item.get('f3')),     # pct change
                        '涨跌额': self._clean_number(item.get('f4')),     # abs change
                        '成交量': self._clean_number(item.get('f5')),     # volume
                        '成交额': self._clean_number(item.get('f6')),     # turnover
                        '振幅': self._clean_number(item.get('f7')),       # amplitude
                        '换手率': self._clean_number(item.get('f8')),     # turnover rate
                        '总市值': self._clean_number(item.get('f20')),    # total market cap
                        '流通市值': self._clean_number(item.get('f21')),  # float market cap
                        '采集时间': time.strftime('%Y-%m-%d %H:%M:%S')    # collected at
                    }
                    stocks.append(stock_data)
                except Exception as e:
                    logger.warning(f"Failed to parse a record: {e}")
                    continue
            logger.info(f"Parsed {len(stocks)} stock records")
        except Exception as e:
            logger.error(f"Failed to parse data: {e}")
        return stocks
    
    def _clean_number(self, text, is_percent: bool = False) -> Optional[float]:
        """
        Normalize a raw value into a float.
        
        Args:
            text: raw value (string or number)
            is_percent: whether the value is a percentage
            
        Returns:
            float: converted value, or None if conversion fails
        """
        if text is None:
            return None
        try:
            if isinstance(text, (int, float)):
                return float(text)
            s = str(text).strip()
            if not s or s == '-':
                return None
            # Strip '%' and thousands separators; convert the Chinese
            # magnitude suffixes 亿 (1e8) and 万 (1e4) to scientific notation
            cleaned = s.replace('%', '').replace(',', '').replace('亿', 'e8').replace('万', 'e4')
            value = float(cleaned)
            if is_percent and '%' in s:
                value = value / 100.0
            return value
        except (ValueError, TypeError):
            logger.warning(f"Could not convert number: {text}")
            return None
    
    def save_to_csv(self, data: List[Dict], filename: str = 'stock_data.csv'):
        """
        Save records to a CSV file.
        
        Args:
            data: records to save
            filename: output file name
        """
        try:
            if not data:
                logger.warning("No data to save")
                return
            
            # Build a DataFrame and write it out
            df = pd.DataFrame(data)
            
            df.to_csv(
                filename,
                index=False,          # drop the row index
                encoding='utf-8-sig', # BOM so Excel displays Chinese correctly
                quoting=1             # csv.QUOTE_ALL: quote every field
            )
            
            logger.info(f"Data saved to {filename}, {len(data)} records")
            
            # Print a collection summary
            print("\n=== Collection Summary ===")
            print(f"Collected at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
            print(f"Stocks: {len(data)}")
            print(f"Output file: {filename}")
            
            if len(data) > 0:
                print("\nFirst 5 rows:")
                print(df.head().to_string(index=False))
            
        except Exception as e:
            logger.error(f"Failed to save CSV: {e}")
    
    def run(self):
        """Main entry point for the spider"""
        logger.info("Starting stock spider")
        
        try:
            # Query parameters for the Eastmoney JSON API
            params = {
                'pn': '1',
                'pz': '20',
                'po': '1',
                'np': '1',
                'ut': 'bd1d9ddb04089700cf9c27f6f7426281',
                'fltt': '2',
                'invt': '2',
                'fid': 'f12',
                'fs': 'm:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23',
                'fields': 'f12,f14,f2,f3,f4,f5,f6,f7,f8,f20,f21'
            }
            
            # Fetch the JSON payload
            data = self.make_request(self.base_url, params)
            if not data:
                logger.error("Failed to fetch data")
                return
            
            stock_data = self.parse_stock_data(data)
            
            if not stock_data:
                logger.warning("No stock records parsed")
                return
            
            # Save the records to a CSV file
            self.save_to_csv(stock_data)
            
            logger.info("Spider run complete")
            
        except KeyboardInterrupt:
            logger.info("Interrupted by user")
        except Exception as e:
            logger.error(f"Spider run failed: {e}")


def main():
    """Program entry point"""
    print("=" * 60)
    print("        Python Financial Data Spider - Stock Collection")
    print("=" * 60)
    print("What this program does:")
    print("• Fetches live stock data from Eastmoney")
    print("• Parses basic stock information and saves it to CSV")
    print("• Includes full error handling and logging")
    print("• Follows scraping ethics and applicable regulations")
    print("=" * 60)
    
    # Create the spider instance
    spider = StockSpider()
    
    # Run the spider
    spider.run()
    
    print("\n" + "=" * 60)
    print("Done!")
    print("See stock_data.csv for the collected data")
    print("See stock_spider.log for detailed logs")
    print("=" * 60)


if __name__ == "__main__":
    # Program entry
    main()

Scraped data

csv
"股票代码","股票名称","最新价","涨跌幅","涨跌额","成交量","成交额","振幅","换手率","总市值","流通市值","采集时间"
"689009","九号公司-WD","60.06","-0.03","-0.02","28701.0","172355676.0","1.48","0.52","43098847472.0","33218698253.0","2025-11-13 10:26:21"
"688981","中芯国际","119.8","0.14","0.17","161767.0","1925898746.0","2.01","0.81","958411133733.0","239547593370.0","2025-11-13 10:26:21"
"688819","天能股份","35.38","3.97","1.35","31263.0","109227460.0","5.05","0.32","34392898000.0","34392898000.0","2025-11-13 10:26:21"
"688800","瑞可达","75.8","0.15","0.11","60873.0","461235153.0","5.56","2.96","15590114593.0","15590114593.0","2025-11-13 10:26:21"
"688799","华纳药厂","51.0","1.94","0.97","14050.0","70906365.0","3.62","1.07","6697320000.0","6697320000.0","2025-11-13 10:26:21"
"688798","艾为电子","78.39","-0.62","-0.49","11935.0","93311189.0","1.93","0.88","18274953776.0","10640912584.0","2025-11-13 10:26:21"
"688793","倍轻松","30.47","-0.23","-0.07","3587.0","10927514.0","1.47","0.42","2618756917.0","2618756917.0","2025-11-13 10:26:21"
"688789","宏华数科","76.72","-0.7","-0.54","2737.0","20984898.0","1.16","0.15","13767506191.0","13767506191.0","2025-11-13 10:26:21"
"688788","科思科技","65.6","-1.81","-1.21","7539.0","49942234.0","4.01","0.48","10290961165.0","10290961165.0","2025-11-13 10:26:21"
"688787","海天瑞声","107.58","1.03","1.1","3532.0","37874168.0","2.45","0.59","6489782864.0","6489782864.0","2025-11-13 10:26:21"
"688786","悦安新材","28.02","0.47","0.13","4132.0","11588119.0","1.79","0.29","4029276084.0","4029276084.0","2025-11-13 10:26:21"
"688783","西安奕材-U","27.62","2.71","0.73","90876.0","245398065.0","3.87","5.52","111524036000.0","4546737725.0","2025-11-13 10:26:21"
"688779","五矿新能","8.89","12.96","1.02","1059212.0","905284519.0","15.63","5.49","17150758830.0","17150758830.0","2025-11-13 10:26:21"
"688778","厦钨新能","77.05","4.82","3.54","37059.0","281562734.0","5.48","0.73","38886447945.0","38886447945.0","2025-11-13 10:26:21"
"688777","中控技术","49.96","1.03","0.51","25343.0","126287186.0","1.13","0.32","39527828769.0","39129759329.0","2025-11-13 10:26:21"
"688776","国光电气","101.5","0.5","0.5","32622.0","338982185.0","6.83","3.01","11000917029.0","11000917029.0","2025-11-13 10:26:21"
"688775","影石创新","265.81","-0.95","-2.56","5044.0","132662631.0","2.38","1.65","106589810000.0","8107622056.0","2025-11-13 10:26:21"
"688772","珠海冠宇","25.99","2.0","0.51","85094.0","217072951.0","3.89","0.75","29422469437.0","29422469437.0","2025-11-13 10:26:21"
"688768","容知日新","42.95","-0.99","-0.43","3460.0","14889130.0","1.29","0.4","3779115223.0","3747969430.0","2025-11-13 10:26:21"
"688767","博拓生物","43.07","-1.42","-0.62","4945.0","21454613.0","2.7","0.33","6431786695.0","6431786695.0","2025-11-13 10:26:21"