File and Data Processing: Efficient CSV/JSON/Excel/Parquet Operations and Memory Optimization

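Table of Contents

- [1. The real problem: the method, not the data size](#1. The real problem: the method, not the data size)
- [2. CSV processing: two philosophies, streaming and chunking](#2. CSV processing: two philosophies, streaming and chunking)
  - [2.1 Row-by-row streaming with csv.reader](#2.1 Row-by-row streaming with csv.reader)
  - [2.2 Pandas chunksize: the sweet spot of chunked iteration](#2.2 Pandas chunksize: the sweet spot of chunked iteration)
- [3. Pandas memory optimization: declare precise data types](#3. Pandas memory optimization: declare precise data types)
- [4. Large JSON files: streaming parsing and fast serialization](#4. Large JSON files: streaming parsing and fast serialization)
  - [4.1 ijson: SAX-style streaming JSON parsing](#4.1 ijson: SAX-style streaming JSON parsing)
  - [4.2 orjson: a step change in serialization performance](#4.2 orjson: a step change in serialization performance)
- [5. Batch Excel processing: make good use of read-only mode](#5. Batch Excel processing: make good use of read-only mode)
- [6. Parquet columnar storage: the ultimate format for data analysis](#6. Parquet columnar storage: the ultimate format for data analysis)
- [7. Choosing a compression format: trading speed against size](#7. Choosing a compression format: trading speed against size)
- [8. Parallel file processing: using multiple CPU cores](#8. Parallel file processing: using multiple CPU cores)
- [9. Data quality validation: the last line of defense before loading](#9. Data quality validation: the last line of defense before loading)
- [10. In practice: an end-to-end pipeline for a 500MB sales CSV](#10. In practice: an end-to-end pipeline for a 500MB sales CSV)
- Summary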

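1. The real problem: the method, not the data size

Handed a 500MB CSV file, the reflex is to run pd.read_csv("sales_2024.csv") and wait for Pandas to load it. Then the Python process in Task Manager climbs from 200MB to 3.2GB and the terminal prints MemoryError. At that moment the problem is not that "this laptop doesn't have enough memory"; it is that loading the whole table at once was a strategy with room for improvement from the very start.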
[Figure: choosing a read strategy for a 500MB CSV file. Full-table load (Pandas read_csv): about 2.1GB of memory, OOM risk. Chunked iteration (chunksize=50000): about 180MB, runs stably. Streaming (csv.reader, row by row): under 10MB, runs stably but the aggregation logic must be written by hand.]

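The choice among the three strategies depends not on which one is "more advanced" but on what the task at hand needs. A full-table load suits exploratory analysis (when the data is smaller than a quarter of available memory), chunked iteration suits work that needs grouped aggregation, and streaming suits jobs that only filter and need no global aggregation.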
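The rule of thumb above can be written down as a small decision function. This is only a sketch: `choose_strategy`, `ReadStrategy`, and the idea of passing available memory in as a plain number are illustrative assumptions, not part of any library.

```python
from enum import Enum


class ReadStrategy(Enum):
    FULL_LOAD = "full-table load"
    CHUNKED = "chunked iteration"
    STREAMING = "row streaming"


def choose_strategy(file_bytes: int, available_bytes: int, needs_groupby: bool) -> ReadStrategy:
    """Hypothetical helper encoding the rule of thumb from the text."""
    # Full load only when the data is under a quarter of available memory.
    if file_bytes < available_bytes / 4:
        return ReadStrategy.FULL_LOAD
    # Grouped aggregation is much easier to express on DataFrame chunks.
    if needs_groupby:
        return ReadStrategy.CHUNKED
    # Pure filtering or running totals: stream one row at a time.
    return ReadStrategy.STREAMING


MB = 1024 ** 2
GB = 1024 ** 3
print(choose_strategy(500 * MB, 1 * GB, needs_groupby=True).value)
```

The caller decides how to measure available memory; the function only captures the ordering of the three checks.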

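2. CSV processing: two philosophies, streaming and chunking

2.1 Row-by-row streaming with csv.reader

The reader in the standard library's csv module is an iterator that yields one row at a time, so memory usage stays at the level of a single row. csv.DictReader works the same way and returns each row as a dict:

python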

import csv
from collections import Counter

category_counter = Counter()
total_revenue = 0.0
row_count = 0

# newline="" is what the csv module documentation recommends for files handed to a reader
with open("sales_2024.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Streaming filter: only process "completed" orders
        if row["status"] != "completed":
            continue
        category_counter[row["category"]] += 1
        total_revenue += float(row["revenue"])
        row_count += 1

print(f"Completed orders: {row_count}")
print(f"Total revenue: {total_revenue:.2f}")
print(f"Category distribution: {category_counter.most_common(5)}")

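This approach is very efficient for pure filtering and running totals. But if the task is a multi-column grouped aggregation (like SQL's GROUP BY category, region), the complexity of csv.reader code rises sharply, because the grouping dictionary and the aggregation logic have to be written by hand.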
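To make that cost concrete, here is a minimal sketch of a hand-written GROUP BY category, region on top of csv.DictReader. The inline sample data and the `region` column are assumptions for illustration; `group_revenue` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for sales_2024.csv.
SAMPLE = """order_id,status,category,region,revenue
1,completed,books,east,10.0
2,completed,books,east,5.5
3,refunded,books,west,7.0
4,completed,toys,west,20.0
"""


def group_revenue(lines):
    """Stream rows, keeping one running [sum, count] pair per (category, region)."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in csv.DictReader(lines):
        if row["status"] != "completed":
            continue
        bucket = totals[(row["category"], row["region"])]
        bucket[0] += float(row["revenue"])
        bucket[1] += 1
    return {key: (total, count) for key, (total, count) in totals.items()}


result = group_revenue(io.StringIO(SAMPLE))
for (category, region), (total, count) in sorted(result.items()):
    print(category, region, total, count)
```

Memory grows with the number of distinct groups rather than the number of rows, but every extra aggregate (min, max, average) means another hand-maintained field in the bucket.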

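2.2 Pandas chunksize: the sweet spot of chunked iteration

chunksize is the main tool Pandas offers for large files. Passing it makes read_csv return an iterator of DataFrame chunks, so only one chunk is held in memory at a time:

python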

import pandas as pd

chunk_results = []
total_rows = 0
max_chunk_mb = 0

for chunk in pd.read_csv("sales_2024.csv", chunksize=50000):
    # Memory held by this chunk's DataFrame (not the whole process)
    mem_mb = chunk.memory_usage(deep=True).sum() / 1024 / 1024
    max_chunk_mb = max(max_chunk_mb, mem_mb)

    # Partial aggregate for the current chunk
    agg = (
        chunk[chunk["status"] == "completed"]
        .groupby("category")
        .agg(total_revenue=("revenue", "sum"), order_count=("order_id", "count"))
    )
    chunk_results.append(agg)
    total_rows += len(chunk)

# Merge the partial results: sums and counts can simply be added across chunks
final = pd.concat(chunk_results).groupby("category").sum()
print(f"Done: {total_rows} rows, largest chunk {max_chunk_mb:.0f} MB")

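Choosing the chunksize value takes some care: too small (say 1000) means many more read calls and repeated per-chunk groupby overhead, while too large (say 200000) starts to approach a full-table load. A rule of thumb is 50000 to 100000 rows. At that granularity a single chunk of most CSV files stays under roughly 50MB (the exact figure depends on column count and string widths), and the number of reads stays reasonable.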
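Rather than guessing, the chunk size can be derived from a small sample of the file. The sketch below is one possible approach, not a Pandas feature: `estimate_chunksize` is a hypothetical helper, and the 50MB target simply mirrors the figure mentioned above.

```python
import io

import pandas as pd


def estimate_chunksize(source, target_mb: float = 50.0, sample_rows: int = 1000) -> int:
    """Hypothetical helper: rows per chunk so that one chunk is near target_mb."""
    # Read only the first rows and measure their real in-memory footprint.
    sample = pd.read_csv(source, nrows=sample_rows)
    bytes_per_row = sample.memory_usage(deep=True).sum() / max(len(sample), 1)
    return max(1, int(target_mb * 1024 ** 2 / bytes_per_row))


# Synthetic CSV text standing in for a real file path.
csv_text = "order_id,status,revenue\n" + "".join(
    f"{i},completed,{i * 1.5}\n" for i in range(2000)
)
rows = estimate_chunksize(io.StringIO(csv_text), target_mb=50)
print(f"suggested chunksize: {rows}")
```

The estimate assumes the first rows are representative; files whose later rows carry much longer strings would need a larger sample or a safety margin.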

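3. Pandas memory optimization: declare precise data types

The data types Pandas infers by default are often much larger than necessary. A typical example: a column inferred as int64 whose actual values range from 0 to 100 fits in int8, which cuts its memory usage by 87.5% (8 bytes per value down to 1).

python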

import pandas as pd

# Before: the dtypes Pandas infers by default
df = pd.read_csv("sales_2024.csv")
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")
# Output: ~2100 MB

# After: dtypes specified explicitly
# Note: plain integer dtypes raise on missing values; such columns need a float or nullable dtype
df_optimized = pd.read_csv("sales_2024.csv", dtype={
    "order_id": "int32",           # default int64 → int32, 50% smaller
    "customer_id": "int32",
    "product_id": "int32",
    "status": "category",          # repeated strings → category, 90%+ smaller
    "category": "category",
    "region": "category",
    "payment_method": "category",
    "revenue": "float32",          # default float64 → float32, 50% smaller (about 7 significant digits)
    "quantity": "int8",            # quantities 1-100, int8 is enough
    "rating": "int8",              # ratings 1-5, int8 is enough
})
print(df_optimized.memory_usage(deep=True).sum() / 1024**2, "MB")
# Output: ~310 MB
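When the value ranges are not known ahead of time, pd.to_numeric with its downcast argument can pick the narrowest numeric dtype from the data itself. A minimal sketch on a tiny made-up frame; `downcast_numeric` is a hypothetical helper, and it needs the data already loaded, so it fits a sample or a single chunk rather than the whole file:

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Made-up values chosen to land in different integer widths.
df = pd.DataFrame({
    "quantity": [1, 5, 100],
    "order_id": [1, 70_000, 2_000_000],
    "revenue": [9.99, 120.5, 3000.0],
})


def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: shrink each numeric column to the narrowest safe dtype."""
    out = frame.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


slim = downcast_numeric(df)
print(slim.dtypes)
```

The dtypes found this way can then be copied into the dtype mapping passed to read_csv, which is where the saving actually matters.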

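The category type converts a string column into integer codes and keeps a lookup table from code to original value. The more repetition in the column, the better it works: if 1 million rows hold only three distinct values ("completed", "refunded", "processing"), category replaces 1 million string objects with 1 million int8 codes, and memory drops from about 60MB to about 1MB.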
[Figure: object dtype vs. category dtype. With object, every row stores the full string (completed, completed, refunded, completed). With category, rows store integer codes (0, 0, 1, 0) plus a lookup table (0: completed, 1: refunded).]

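Two limitations of the category type are worth noting. First, category columns with different category sets do not combine cleanly: the integer codes of the two columns do not mean the same thing, so pd.concat falls back to a non-categorical dtype (typically object) unless the categories are unified first. Second, it is a poor fit for columns with mostly unique values (such as an order_id stored as text): the lookup table then holds nearly every string anyway, and its overhead can exceed that of the original string objects.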
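The first limitation is easy to reproduce. The sketch below uses two made-up monthly Series and shows one way around it, pandas.api.types.union_categoricals, which builds a single Categorical over the union of both category sets:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two made-up months whose status columns have different category sets.
jan = pd.Series(["completed", "refunded"], dtype="category")
feb = pd.Series(["completed", "pending"], dtype="category")

# Plain concat: the category sets differ, so the categorical dtype is lost.
naive = pd.concat([jan, feb], ignore_index=True)
print("concat dtype:", naive.dtype)

# Unifying the categories first keeps the compact representation.
merged = pd.Series(union_categoricals([jan, feb]))
print("union dtype:", merged.dtype)
print("categories:", list(merged.cat.categories))
```

When the full set of values is known up front, declaring one shared pd.CategoricalDtype for every chunk avoids the problem entirely.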

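4. Large JSON files: streaming parsing and fast serialization

4.1 ijson: SAX-style streaming JSON parsing

The standard library's json.load() reads the whole JSON file and builds the complete Python object tree in memory. For JSON files of a few hundred MB (such as log files exported from an API), ijson's incremental parsing avoids loading everything at once:

python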

import ijson

# File content: [{"id": 1, "name": "..."}, {"id": 2, ...}, ...]
with open("api_logs_2024.json", "rb") as f:
    records = ijson.items(f, "item")  # parse the array's objects one at a time
    large_orders = []
    for record in records:
        # ijson returns non-integer numbers as decimal.Decimal by default
        if record.get("amount", 0) > 10000:
            large_orders.append({
                "id": record["id"],
                "amount": record["amount"],
                "created_at": record["created_at"],
            })

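ijson's design resembles a SAX parser for XML: it never builds the full document tree in memory, but works from a stream of parse events. ijson.items(f, "item") tells the parser to watch each element of the root array; the elements come back one by one through an iterator, and the parser keeps only the current element's data.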

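4.2 orjson: a step change in serialization performance

When results need to be written out as JSON, the standard library's json.dumps() can become the bottleneck. orjson is a fast JSON library written in Rust; for native Python types it serializes roughly 3 to 5 times faster than the standard library, as in the rough timings below:

python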

import json

import orjson

data = [{"id": i, "name": f"product_{i}", "tags": ["a", "b", "c"]} for i in range(100000)]

# Standard library
_ = json.dumps(data)           # ~450ms

# orjson (returns bytes; decode when a str is needed)
result = orjson.dumps(data)    # ~100ms
text = result.decode("utf-8")

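orjson has further advantages: it serializes datetime objects natively to RFC 3339 strings (compatible with ISO 8601); it can serialize numpy arrays directly, with no .tolist() conversion, when option=orjson.OPT_SERIALIZE_NUMPY is passed; and it can emit object keys in sorted order, which makes diffs easier, when option=orjson.OPT_SORT_KEYS is passed. Neither numpy support nor key sorting is enabled by default.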

ujson (UltraJSON) is another option, with performance close to orjson's but less robust handling of datetime and Decimal types. orjson is the recommended default for new projects.

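5. Batch Excel processing: make good use of read-only mode

In its default mode openpyxl loads the entire .xlsx file into memory, and a 50MB Excel file can balloon to 200-500MB once loaded. read_only=True mode reads rows lazily, in a streaming fashion:

python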

from openpyxl import load_workbook
import csv

wb = load_workbook("sales_report.xlsx", read_only=True, data_only=True)
ws = wb.active

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in ws.iter_rows(values_only=True):
        writer.writerow(row)

wb.close()

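Key parameters:

read_only=True: parses the worksheet lazily instead of building a full in-memory model of every cell, which can cut memory usage by around 10x
data_only=True: reads the cached result of each formula rather than the formula itself (the value stored when the file was last saved by Excel)
iter_rows(values_only=True): returns tuples of values instead of Cell objects, reducing object creation overhead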

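Once an Excel file exceeds 100MB, openpyxl's performance degrades noticeably. Treat that threshold as a signal to switch strategy: convert the Excel file to CSV first and process that, or move to a more efficient reader such as pyxlsb (for the binary .xlsb format) or calamine (a Rust engine).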

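6. Parquet columnar storage: the ultimate format for data analysis

Parquet is a columnar storage format designed for analytical (OLAP) queries. Unlike CSV's row-oriented layout, Parquet stores each column's data separately, which makes it possible to read only the columns you need and to filter rows at the file level.

python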

import pyarrow.csv as pv  # must be imported explicitly; `import pyarrow` alone does not load it
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# CSV → Parquet conversion
table = pv.read_csv("sales_2024.csv")
pq.write_table(table, "sales_2024.parquet", compression="zstd")

# Column pruning: read only the 3 columns needed (instead of all 15)
columns = ["order_id", "revenue", "category"]
df = pq.read_table("sales_2024.parquet", columns=columns).to_pandas()

# Predicate pushdown: filter at the file level, reading only row groups that can match
dataset = ds.dataset("sales_2024.parquet")
filtered = dataset.to_table(
    filter=ds.field("revenue") > 1000,
    columns=["order_id", "category", "revenue"],
)

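Predicate pushdown is Parquet's core performance advantage. Internally a Parquet file is organized into row groups, and each row group stores min/max statistics for every column over its range of rows. With the filter revenue > 1000, PyArrow reads each row group's statistics and skips outright any group whose maximum is 500; that group is never read into memory. Groups that might contain matches are read and then filtered row by row. When the data is sorted or partitioned on the filter column (for example, partitioned by date), predicate pushdown can cut the amount of data scanned from the full file to 1-5%.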
[Figure: row-group skipping for the filter revenue > 1000. The Parquet file (8 row groups) is opened and the row-group statistics are read. RG1: revenue 0-500, RG2: revenue 300-2000, RG3: revenue 800-5000. RG1 is skipped (max 500 < 1000) and never read into memory; RG2 is read.]
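The skipping logic itself is simple enough to model in a few lines of plain Python. This is a toy model, not PyArrow's implementation: `RowGroup` and `scan_greater_than` are hypothetical names, the values are made up to match the ranges in the figure, and real Parquet readers take the min/max from the file footer rather than computing them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group holding one revenue column."""
    name: str
    values: List[float]

    @property
    def lo(self) -> float:
        return min(self.values)  # real files store this statistic in metadata

    @property
    def hi(self) -> float:
        return max(self.values)


def scan_greater_than(groups, threshold):
    """Return matching values plus which groups were read or skipped."""
    rows, read, skipped = [], [], []
    for group in groups:
        if group.hi <= threshold:
            # No value in this group can satisfy value > threshold: skip it unread.
            skipped.append(group.name)
            continue
        read.append(group.name)
        # A group that may contain matches still needs row-level filtering.
        rows.extend(v for v in group.values if v > threshold)
    return rows, read, skipped


groups = [
    RowGroup("RG1", [0, 120, 500]),
    RowGroup("RG2", [300, 950, 2000]),
    RowGroup("RG3", [800, 1500, 5000]),
]
rows, read, skipped = scan_greater_than(groups, 1000)
print("skipped:", skipped, "read:", read, "rows:", rows)
```

The model also shows why sorting matters: if high and low revenues are mixed evenly across groups, every group's maximum exceeds the threshold and nothing can be skipped.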

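7. Choosing a compression format: trading speed against size

Four mainstream compression formats, each with a different emphasis (the ratios are typical values and vary with the data):

Format	Compression ratio	Compression speed	Decompression speed	Best for
gzip	Medium (5:1)	Slow	Medium	Long-term archiving, compatibility first
snappy	Lower (3:1)	Fast	Fastest	Real-time data transfer
zstd	High (6:1)	Medium	Fast	Analytics storage (recommended)
lz4	Lower (3:1)	Fastest	Fastest	In-memory cache compression

Specify the compression format when writing a Parquet file:

python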

pq.write_table(table, "data.parquet", compression="zstd", compression_level=3)

For batch compression of CSV files, the Python standard library is enough:

python

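import gzip
import lzma
import shutil

# Assumes the uncompressed input is sales.csv.
# gzip: stream the file through the compressor instead of holding it all in memory
with open("sales.csv", "rb") as src, gzip.open("sales.csv.gz", "wb", compresslevel=6) as dst:
    shutil.copyfileobj(src, dst)

# xz/lzma: usually the smaller output of the two, but slower to compress
with open("sales.csv", "rb") as src, lzma.open("sales.csv.xz", "wb") as dst:
    shutil.copyfileobj(src, dst)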

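Recommended strategy: use snappy or lz4 for real-time processing (such as log streams); use zstd for data lakes and OLAP queries (high compression ratio with fast decompression); use gzip for archives shared across teams (best compatibility).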
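Ratios and speeds depend heavily on the data, so it is worth measuring on a sample of your own files before committing to a format. A minimal sketch using only the two codecs that ship with the standard library (gzip and lzma); `compare_codecs` is a hypothetical helper and the sample bytes are synthetic:

```python
import gzip
import lzma
import time


def compare_codecs(raw: bytes) -> dict:
    """Hypothetical helper: compression ratio and wall time per stdlib codec."""
    codecs = {
        "gzip": lambda data: gzip.compress(data, compresslevel=6),
        "lzma": lzma.compress,
    }
    report = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        packed = compress(raw)
        report[name] = {
            "ratio": len(raw) / len(packed),
            "seconds": time.perf_counter() - start,
        }
    return report


# Synthetic CSV-like bytes; replace with the first few MB of a real file.
sample = b"order_id,status,revenue\n" + b"".join(
    f"{i},completed,{i * 1.5}\n".encode("utf-8") for i in range(5000)
)
report = compare_codecs(sample)
for name, stats in report.items():
    print(f"{name}: ratio {stats['ratio']:.1f}:1 in {stats['seconds'] * 1000:.1f} ms")
```

snappy, zstd, and lz4 need third-party packages, but the same measuring loop applies once their compress functions are added to the dictionary.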

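8. Parallel file processing: using multiple CPU cores

Processing many independent files is the easiest case to parallelize: the files have no dependencies on one another, which makes them a natural fit for multiprocessing:

python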

from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

import pandas as pd


def process_file(filepath: Path) -> dict:
    """Process a single CSV file and return its statistics."""
    df = pd.read_csv(filepath, usecols=["category", "revenue"], dtype={
        "category": "category",
        "revenue": "float32",
    })
    return {
        "file": filepath.name,
        "rows": len(df),
        "total_revenue": float(df["revenue"].sum()),
        "categories": df["category"].nunique(),
    }


# The guard is required on platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
    csv_files = list(Path("data/2024/").glob("*.csv"))
    results = []

    with ProcessPoolExecutor(max_workers=4) as executor:
        future_to_file = {executor.submit(process_file, f): f for f in csv_files}
        for future in as_completed(future_to_file):
            results.append(future.result())

    for r in results:
        print(f"{r['file']}: {r['rows']} rows, revenue {r['total_revenue']:.0f}, {r['categories']} categories")

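Key optimizations:

usecols parses and keeps only the needed columns; the CSV is still read from disk in full, but parsing work and memory drop by 60-80% when most columns are skipped
dtype specifies types at read time, avoiding the double cost of letting Pandas infer first and convert afterward
max_workers should not exceed the CPU core count (os.cpu_count()); beyond that, context switching cancels out the gains from parallelism. Each worker also holds its own DataFrame, so total memory grows with the number of workers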
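Core count is one ceiling on the number of workers; memory is another, since every worker process loads its own file. A minimal sketch of combining both limits; `pick_workers` and its parameters are hypothetical, and the per-file memory figure has to come from your own measurements:

```python
import os
from typing import Optional


def pick_workers(n_files: int, per_file_mb: float, memory_budget_mb: float,
                 cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: worker count bounded by cores, files, and memory."""
    cpus = cpu_count or os.cpu_count() or 1
    # How many DataFrames of the expected size fit into the budget at once.
    by_memory = max(1, int(memory_budget_mb // per_file_mb))
    return max(1, min(cpus, n_files, by_memory))


# 12 files of about 300MB each in memory, 1000MB budget, 8 cores: memory is the limit.
print(pick_workers(n_files=12, per_file_mb=300, memory_budget_mb=1000, cpu_count=8))
```

The result would be passed as max_workers to ProcessPoolExecutor in place of a hard-coded number.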

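9. Data quality validation: the last line of defense before loading

Once processing is done, data quality should be verified before the data is loaded into the database. Pandera provides schema validation based on type annotations:

python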

import pandera as pa  # note: this alias collides with the pyarrow `pa` used in other sections
from pandera.typing import Series

class SalesSchema(pa.DataFrameModel):
    order_id: Series[int] = pa.Field(ge=1, unique=True)
    customer_id: Series[int] = pa.Field(ge=1)
    product_id: Series[int] = pa.Field(ge=1)
    status: Series[str] = pa.Field(isin=["completed", "refunded", "pending"])
    category: Series[str]
    revenue: Series[float] = pa.Field(ge=0)
    quantity: Series[int] = pa.Field(ge=1, le=100)
    order_date: Series[pa.DateTime]

    class Config:
        coerce = True  # convert types automatically (e.g. str → datetime)

# df_cleaned is assumed to be the cleaned DataFrame produced by the earlier steps
try:
    SalesSchema.validate(df_cleaned, lazy=True)
    print("Validation passed")
except pa.errors.SchemaErrors as e:
    print(f"Validation failed: {e.failure_cases}")
    e.failure_cases.to_csv("validation_errors.csv")

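lazy=True makes Pandera collect every error and report them all at once instead of stopping at the first one. This matters especially for batch processing: a single run reveals all the data quality problems, with no "fix one, run again" loop.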

10. In practice: an end-to-end pipeline for a 500MB sales CSV

The techniques above combine into one complete processing pipeline:

python

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CHUNKSIZE = 50000
COLUMNS = ["order_id", "customer_id", "status", "category", "revenue", "quantity", "order_date"]
DTYPE = {
    "order_id": "int32",
    "customer_id": "int32",
    "status": "category",
    "category": "category",
    "revenue": "float32",
    "quantity": "int8",
}

results = []
max_chunk_mb = 0

for i, chunk in enumerate(pd.read_csv(
    "sales_2024.csv",
    chunksize=CHUNKSIZE,
    usecols=COLUMNS,
    dtype=DTYPE,
    parse_dates=["order_date"],
)):
    # Memory monitoring: size of the current chunk's DataFrame
    mem = chunk.memory_usage(deep=True).sum() / 1024**2
    max_chunk_mb = max(max_chunk_mb, mem)

    # Cleaning (done per chunk, so the full table is never loaded)
    # Drop non-positive revenue and refunded orders in one pass
    keep = (chunk["revenue"] > 0) & (chunk["status"] != "refunded")
    chunk = chunk.loc[keep].assign(
        month=lambda d: d["order_date"].dt.to_period("M"),  # derived column
        quantity=lambda d: d["quantity"].astype("int32"),   # widen before summing
    )

    # Partial aggregates: keep sums and counts only, because per-chunk means cannot be merged
    agg = chunk.groupby(["month", "category"], observed=True).agg(
        total_revenue=("revenue", "sum"),
        order_count=("order_id", "count"),
        total_quantity=("quantity", "sum"),
    )
    results.append(agg)

    if (i + 1) % 20 == 0:
        print(f"Processed about {(i + 1) * CHUNKSIZE} rows, largest chunk {max_chunk_mb:.0f} MB")

# Merge the partial results, then derive the average from the merged totals
final = pd.concat(results).groupby(["month", "category"], observed=True).sum().reset_index()
final["avg_quantity"] = final["total_quantity"] / final["order_count"]

# Write the result as Parquet
table = pa.Table.from_pandas(final)
pq.write_table(table, "sales_2024_agg.parquet", compression="zstd")

print("Done")
print("  Input: sales_2024.csv (500 MB)")
print("  Output: sales_2024_agg.parquet (about 8 MB after compression)")
print(f"  Largest chunk in memory: {max_chunk_mb:.0f} MB (about 2100 MB for the unoptimized full load)")

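Starting from a 500MB CSV, the pipeline applies type optimization, chunked processing, cleaning, and aggregation, and finally writes an aggregated Parquet report of about 8MB after compression, with peak memory falling from 2100MB to under 200MB. None of this needs Spark or a database, only an ordinary Python environment with pandas and PyArrow and a sound engineering strategy. The strategies for choosing between CSV chunking and the Parquet format are analyzed in detail in the earlier sections.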
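One detail deserves care whenever per-chunk aggregates are merged: sums and counts can be added across chunks, but averages cannot. A small sketch on two made-up chunks of unequal size shows the difference:

```python
import pandas as pd

# Two made-up chunks with different row counts for the same category.
chunk1 = pd.DataFrame({"category": ["books", "books", "books"], "quantity": [1, 2, 3]})
chunk2 = pd.DataFrame({"category": ["books"], "quantity": [10]})


def partial(chunk: pd.DataFrame) -> pd.DataFrame:
    """Per-chunk aggregate that is safe to merge later: sum and count only."""
    return chunk.groupby("category").agg(
        qty_sum=("quantity", "sum"),
        n=("quantity", "count"),
    )


# Correct: add the sums and counts, then divide once at the end.
merged = pd.concat([partial(chunk1), partial(chunk2)]).groupby("category").sum()
merged["avg_quantity"] = merged["qty_sum"] / merged["n"]

# Incorrect: adding per-chunk means (2.0 + 10.0) is not the overall mean.
wrong = pd.concat(
    [c.groupby("category")["quantity"].mean() for c in (chunk1, chunk2)]
).groupby("category").sum()

print("merged average:", merged.loc["books", "avg_quantity"])
print("sum of chunk means:", wrong["books"])
```

The same reasoning applies to any statistic that is not additive (median, distinct count, standard deviation): carry the additive ingredients through the chunks and compute the final figure once.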

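Summary

For large-file processing, "get a more powerful machine" is the last resort, not the first. The right order of strategies is: first constrain data types (dtype), then read in chunks (chunksize), and only then parallelize (ProcessPoolExecutor). Each step has a quantifiable memory benefit:

Optimization step	Peak memory (500MB CSV)	Notes
Unoptimized (full-table load)	~2100 MB	Pandas default inferred types
dtype optimization	~310 MB	int64→int32, str→category
chunksize + dtype	~180 MB	Chunked iteration
Parquet + predicate pushdown	~120 MB	Skips row groups that cannot match the filter

File processing is an everyday task in Python development and the root of many performance problems. With type optimization, chunking strategy, and format selection in hand, an ordinary laptop can handle GB-scale datasets.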
