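Python Concurrency in Practice: Threads, Coroutines, and Processes Explained
🎯 Audience: backend engineers with Java / Go concurrency experience who find the GIL and coroutines confusing after switching to Python
⏱️ Reading time: about 60 minutes
💬 In one sentence: from how the GIL works under the hood to full usage of the three concurrency tools, covering synchronization primitives, async patterns, and inter-process communication, with data-engineering scenarios throughout
Moving from Java or Go to concurrent programming in Python, almost everyone hits the same pitfall: the multithreaded version of a CPU-bound job runs no faster than the single-threaded one, and sometimes slower.
That is not your fault; it is a consequence of how CPython is designed.
But the conclusion "Python multithreading is useless" is only half right. Once you understand where the GIL's boundaries are and pick the right tool, concurrent Python can still power high-performance systems.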
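1. What makes Python concurrency "special": recalibrating first
1.1 The GIL: the global interpreter lock you cannot avoid
The GIL (Global Interpreter Lock) is a global mutex inside the CPython interpreter. It enforces a single rule: at any moment, only one thread may execute Python bytecode.
Why does the GIL exist? CPython manages memory with reference counting: every object records how many references point to it and is freed as soon as that count drops to zero. Updating a reference count is not an atomic operation, so concurrent updates from several threads could corrupt the count, leading to memory leaks or double-free crashes. The GIL is a blunt but effective fix: if threads never run Python bytecode at the same time, they cannot race on reference counts.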
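To make the reference counts that the GIL protects visible, here is a minimal sketch using `sys.getrefcount()`. The variable names are illustrative, and the absolute numbers vary between CPython versions, so only the differences matter.

```python
import sys

payload = [1, 2, 3]

# getrefcount() includes the temporary reference held by its own argument
before = sys.getrefcount(payload)

alias = payload                      # a second name bound to the same list
after_alias = sys.getrefcount(payload)

del alias                            # dropping the name decrements the count
after_del = sys.getrefcount(payload)

print(before, after_alias, after_del)
```

Each binding and unbinding is a read-modify-write on the object's counter, which is exactly the kind of update that would race without the GIL.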
The two threading models compare as follows:
[Figure: thread scheduling in Python vs. Java. Python multithreading (pseudo-parallel): threads 1, 2, and 3 contend for the GIL, which only one thread holds at a time, so only one CPU core runs Python code at any moment. Java multithreading (truly parallel): threads 1, 2, and 3 each execute Java code on CPU cores 1, 2, and 3 respectively.]
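🤔 My take: Java threads are truly parallel; several threads can run on different CPU cores at the same time. Python threads are closer to "taking turns": they queue up for one lock, and at any moment only one of them is actually executing Python code.
1.2 How the GIL is released: the check interval (switch interval)
A thread does not hold the GIL forever. CPython has a forced-release mechanism, originally called the check interval:
- Before Python 3.2: every 100 bytecode instructions, the interpreter checked whether to hand the GIL to another thread
- Python 3.2 and later: the check is time-based, with a default switch interval of 5 ms (read it with `sys.getswitchinterval()`, change it with `sys.setswitchinterval()`)
```python
import sys

sys.getswitchinterval()  # 0.005 (5 ms)

# Experiment: lower the interval so threads switch more often (for debugging; don't do this in production)
sys.setswitchinterval(0.001)  # 1 ms
```
Besides the switch interval, the GIL is also released voluntarily in two cases:
- During I/O: a thread that issues a network request or reads/writes a file releases the GIL while it waits, so other threads can run
- Inside C extensions that release the GIL: the C code behind NumPy matrix operations, `time.sleep()`, `hashlib` (when hashing larger buffers), and similar calls releases the GIL while it runs
⚠️ An easy point to get wrong: NumPy's C-level computation releases the GIL, so NumPy-heavy work combined with threads can pay off. A CPU-bound loop written in pure Python never releases the GIL voluntarily, so threads do not help there.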
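```python
import threading
import time

import numpy as np


# NumPy matrix multiply: runs in C and releases the GIL, so threads can overlap
def numpy_work():
    a = np.random.rand(2000, 2000)
    np.dot(a, a)


# Pure-Python loop: holds the GIL the whole time, so threads cannot overlap
def pure_python_work():
    result = 0
    for i in range(10_000_000):
        result += i
    return result


# Try it yourself: the threaded numpy_work is usually faster than the serial one
# (less so if your BLAS build already uses every core); pure_python_work is not.
```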
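1.3 The three concurrency tools at a glance
| Tool | Concurrency model | Best for | Bypasses the GIL? |
|---|---|---|---|
| `threading` | OS threads taking turns under the GIL | I/O-bound work, legacy synchronous code | ❌ (but the GIL is released during I/O waits) |
| `asyncio` | Single-threaded event loop, cooperatively scheduled coroutines | I/O-bound work, high concurrency | ❌ (single thread, so no need) |
| `multiprocessing` | Multiple processes, each with its own GIL | CPU-bound work, data parallelism | ✅ (separate processes) |
📝 All example code in this article targets Python 3.11+. Some newer features (`TaskGroup`, `ExceptionGroup`, `asyncio.timeout`) require Python 3.11 or later.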
2. threading: multithreading in detail
The most common I/O-bound tasks in data engineering are pulling data from external APIs in bulk, reading and writing files concurrently, and running database queries in parallel. What they share is that most of the time is spent waiting rather than computing.
2.1 The life cycle of a Thread
[Figure: thread state diagram. Thread() creates a thread in the New state; start() moves it to Ready; acquiring the GIL moves it to Running; losing the GIL (switch interval) moves it back to Ready; I/O wait / sleep / Lock moves it to Blocked; I/O completion / wake-up / lock acquired moves it back to Ready; when run() returns the thread is Finished.]
```python
import threading
import time


def worker(name: str, duration: float):
    print(f"[{name}] started, thread ID: {threading.current_thread().ident}")
    time.sleep(duration)
    print(f"[{name}] finished")


# Basic usage
t = threading.Thread(target=worker, args=("task-A", 1.0), daemon=True)
t.start()
t.join(timeout=5)  # wait at most 5 seconds
print(f"thread still alive: {t.is_alive()}")


# Subclassing Thread (handy when the thread needs to carry more state)
class DataFetchThread(threading.Thread):
    def __init__(self, source: str):
        super().__init__(daemon=True)
        self.source = source
        self.result = None
        self.error = None

    def run(self):
        try:
            # Simulate fetching data
            time.sleep(0.5)
            self.result = {"source": self.source, "data": [1, 2, 3]}
        except Exception as e:
            self.error = e


threads = [DataFetchThread(f"source_{i}") for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for t in threads:
    if t.error:
        print(f"❌ {t.source}: {t.error}")
    else:
        print(f"✅ {t.source}: {t.result}")
```
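💡 Daemon threads: with `daemon=True`, the threads are terminated abruptly when the main program exits, so you do not have to join them. That suits background jobs (heartbeats, log flushing, and so on), but note that they get no chance to clean up.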
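When a background thread does need to clean up, one common alternative is a non-daemon thread plus a stop flag. The sketch below assumes a hypothetical `log_flusher` loop and a `stop_requested` event; it is one possible pattern, not the only one.

```python
import threading
import time

stop_requested = threading.Event()   # hypothetical shutdown flag
flushed_batches = []


def log_flusher(interval: float = 0.05):
    """Background loop that exits cleanly once stop_requested is set."""
    batch = 0
    # wait() doubles as an interruptible sleep: it returns True as soon as the flag is set
    while not stop_requested.wait(timeout=interval):
        batch += 1
        flushed_batches.append(batch)
    flushed_batches.append("final flush")   # cleanup step a killed daemon thread could skip


flusher = threading.Thread(target=log_flusher)   # non-daemon on purpose
flusher.start()
time.sleep(0.2)
stop_requested.set()     # ask the thread to stop
flusher.join(timeout=2)  # then wait for it to finish its cleanup
print(flusher.is_alive(), flushed_batches[-1])
```

The trade-off is that the main program now has to remember to set the flag and join, which is the price of a guaranteed cleanup step.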
2.2 ThreadPoolExecutor: the recommended way to use a thread pool
Managing `Thread` objects by hand gets tedious. `concurrent.futures.ThreadPoolExecutor` is the higher-level, recommended approach:
```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests  # pip install requests

DATA_SOURCES = [
    {"name": "淘宝", "url": "https://httpbin.org/delay/1"},
    {"name": "京东", "url": "https://httpbin.org/delay/1"},
    {"name": "拼多多", "url": "https://httpbin.org/delay/1"},
    {"name": "抖音", "url": "https://httpbin.org/delay/1"},
    {"name": "快手", "url": "https://httpbin.org/delay/1"},
]


def fetch_sales_data(source: dict) -> dict:
    """Synchronously fetch sales data from one source."""
    start = time.time()
    response = requests.get(source["url"], timeout=10)
    elapsed = time.time() - start
    return {
        "source": source["name"],
        "status": response.status_code,
        "elapsed": round(elapsed, 2),
    }


def fetch_all_threading(sources: list, max_workers: int = 5) -> list:
    """Fetch all sources concurrently with a thread pool."""
    results = []
    start = time.time()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() schedules a task and returns a Future
        future_to_source = {
            executor.submit(fetch_sales_data, source): source
            for source in sources
        }
        # as_completed yields futures in completion order, not submission order
        for future in as_completed(future_to_source):
            source = future_to_source[future]
            try:
                result = future.result()
                results.append(result)
                print(f"✅ {source['name']} done in {result['elapsed']}s")
            except Exception as e:
                print(f"❌ {source['name']} failed: {e}")
                results.append({"source": source["name"], "error": str(e)})
    total = round(time.time() - start, 2)
    print(f"\nTotal: {total}s (about 5s if run serially)")
    return results
```
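Future objects in detail
`executor.submit()` returns a `Future`, a handle to the task that is more flexible than the bare result:
```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def slow_task(x: int) -> int:
    time.sleep(x)
    if x == 3:
        raise ValueError("x must not be 3")
    return x * 10


# Callback: invoked once the future finishes. It runs in the worker thread that
# completed the task (or immediately in the caller if the task is already done),
# so keep it thread-safe.
def on_done(future: Future):
    if future.exception():
        print(f"callback: task failed -> {future.exception()}")
    else:
        print(f"callback: task succeeded -> {future.result()}")


with ThreadPoolExecutor(max_workers=4) as executor:
    futures: list[Future] = [executor.submit(slow_task, i) for i in range(5)]

    # Check status right away (non-blocking)
    for f in futures:
        print(f.running(), f.done())  # running? finished?

    for f in futures:
        f.add_done_callback(on_done)

    # result() blocks until the task finishes; a timeout is optional
    try:
        result = futures[0].result(timeout=5)
    except TimeoutError:  # same class as concurrent.futures.TimeoutError on 3.11+
        print("timed out")
    except Exception as e:
        print(f"task raised: {e}")
```
💡 `as_completed` is more flexible than `executor.map`: it yields tasks in completion order, so you can handle whichever finishes first without waiting for the slowest. `executor.map` yields results in submission order; if the first task is the slowest, every later result waits behind it even though those tasks may already have finished.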
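The ordering difference between `executor.map` and `as_completed` is easy to see without any network access. This is a minimal sketch in which `simulated_fetch` is a hypothetical stand-in for an API call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulated_fetch(delay: float) -> float:
    """Hypothetical stand-in for an API call that takes `delay` seconds."""
    time.sleep(delay)
    return delay


delays = [0.4, 0.05, 0.2]   # the first task is the slowest

with ThreadPoolExecutor(max_workers=3) as executor:
    # map(): results come back in submission order
    map_order = list(executor.map(simulated_fetch, delays))

    # submit() + as_completed(): results come back as each task finishes
    futures = [executor.submit(simulated_fetch, d) for d in delays]
    completion_order = [f.result() for f in as_completed(futures)]

print(map_order)          # [0.4, 0.05, 0.2]
print(completion_order)   # typically fastest first
```

With `map`, nothing can be consumed until the 0.4 s task returns; with `as_completed`, the 0.05 s result is normally available first.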
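2.3 Synchronization primitives: locks, semaphores, events
When threads share state, synchronization primitives are what keep race conditions out.
Lock and RLock
```python
import threading

# ---- Lock: mutual exclusion ----
lock = threading.Lock()
results = []


def do_something(data):
    """Hypothetical placeholder for the real per-item processing."""
    return data


def good_worker(data):
    processed = do_something(data)
    with lock:  # enter the critical section
        results.append(processed)
    # leaving the with block releases the lock automatically


# ❌ Bad: treating a plain Lock as if it were reentrant
def bad_nested(lock):
    with lock:
        with lock:  # Lock is not reentrant; this deadlocks!
            pass


# ✅ Good: use RLock (reentrant lock) when you need to re-acquire
rlock = threading.RLock()


def safe_nested():
    with rlock:
        with rlock:  # the same thread may acquire an RLock repeatedly without deadlocking
            pass
```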
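As a concrete use of `Lock`, the sketch below protects a shared counter that four threads increment. The names `counter_lock` and `add_many` are illustrative.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times: int):
    """Increment the shared counter `times` times under the lock."""
    global counter
    for _ in range(times):
        # counter += 1 is a read-modify-write; the lock makes it one indivisible step
        with counter_lock:
            counter += 1


workers = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(counter)  # 40000
```

The GIL alone does not make `counter += 1` safe, because a thread switch can land between the read and the write; the lock removes that window.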
Semaphore: capping concurrency
A Semaphore generalizes a Lock: it lets up to N threads into the critical section at once. A common use is throttling:
```python
import threading
import time

# Allow at most 3 threads to call the external API at once (to avoid being rate-limited)
api_semaphore = threading.Semaphore(3)


def call_external_api(task_id: int):
    with api_semaphore:
        print(f"task {task_id} calling the API")
        time.sleep(0.5)  # simulate API latency
        print(f"task {task_id} done")


threads = [threading.Thread(target=call_external_api, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Watch the output: at most 3 tasks are ever inside the API call at the same time
```
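Rather than eyeballing the output, the cap can be checked by recording the peak number of threads inside the guarded section. This sketch assumes a hypothetical limit `MAX_IN_FLIGHT` and a separate lock for the bookkeeping counters.

```python
import threading
import time

MAX_IN_FLIGHT = 3                       # hypothetical limit for this sketch
gate = threading.Semaphore(MAX_IN_FLIGHT)
stats_lock = threading.Lock()
in_flight = 0
peak_in_flight = 0


def guarded_call():
    global in_flight, peak_in_flight
    with gate:
        with stats_lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(0.05)                # simulated API latency
        with stats_lock:
            in_flight -= 1


threads = [threading.Thread(target=guarded_call) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak_in_flight)  # never more than MAX_IN_FLIGHT
```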
Event: signaling between threads
```python
import threading
import time

# Scenario: once the data is loaded, tell every waiting processor thread to start
data_ready = threading.Event()
shared_data = None


def data_loader():
    global shared_data
    print("loading data...")
    time.sleep(2)
    shared_data = [1, 2, 3, 4, 5]
    data_ready.set()  # signal: the data is ready
    print("data loaded, signal sent")


def data_processor(name: str):
    print(f"[{name}] waiting for data...")
    data_ready.wait()  # block until the signal arrives
    print(f"[{name}] got the signal, processing: {shared_data}")


loader = threading.Thread(target=data_loader)
processors = [threading.Thread(target=data_processor, args=(f"processor-{i}",)) for i in range(3)]
for p in processors:
    p.start()
loader.start()
loader.join()
for p in processors:
    p.join()
```
Condition: fine-grained waiting on a condition
A Condition is a more powerful primitive: it lets threads wait until some condition holds, which makes it a good fit for the producer-consumer pattern:
```python
import threading
import time
from collections import deque


class BoundedQueue:
    """Bounded queue for the producer-consumer pattern."""

    def __init__(self, maxsize: int):
        self.maxsize = maxsize
        self._queue = deque()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            while len(self._queue) >= self.maxsize:
                print("queue full, producer waiting...")
                self._cond.wait()  # release the lock and wait
            self._queue.append(item)
            self._cond.notify_all()  # tell consumers there is new data

    def get(self):
        with self._cond:
            while len(self._queue) == 0:
                print("queue empty, consumer waiting...")
                self._cond.wait()
            item = self._queue.popleft()
            self._cond.notify_all()  # tell producers there is free space
            return item


# Usage
q = BoundedQueue(maxsize=3)


def producer():
    for i in range(6):
        q.put(i)
        print(f"produced: {i}")
        time.sleep(0.1)


def consumer(name: str):
    for _ in range(3):
        item = q.get()
        print(f"[{name}] consumed: {item}")
        time.sleep(0.3)


t_prod = threading.Thread(target=producer)
t_cons1 = threading.Thread(target=consumer, args=("consumer-A",))
t_cons2 = threading.Thread(target=consumer, args=("consumer-B",))
t_prod.start(); t_cons1.start(); t_cons2.start()
t_prod.join(); t_cons1.join(); t_cons2.join()
```
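The manual `while ...: wait()` loop can also be written with `Condition.wait_for()`, which re-checks a predicate after every wake-up and accepts a timeout. The `Mailbox` class below is a hypothetical example, not part of the standard library.

```python
import threading
from collections import deque


class Mailbox:
    """Hypothetical one-way channel built on Condition.wait_for()."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def send(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()          # wake one waiting receiver

    def receive(self, timeout: float = 1.0):
        with self._cond:
            # wait_for re-checks the predicate after every wake-up, replacing
            # the manual `while not ready: cond.wait()` loop
            ready = self._cond.wait_for(lambda: len(self._items) > 0, timeout=timeout)
            if not ready:
                return None              # timed out with nothing to read
            return self._items.popleft()


box = Mailbox()
received = []
reader = threading.Thread(target=lambda: received.append(box.receive()))
reader.start()
box.send("batch-1")
reader.join()
print(received)  # ['batch-1']
```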
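2.4 Thread-local storage: threading.local
Sometimes every thread needs its own independent copy of a variable (a database connection, for instance). That is what `threading.local` is for:
```python
import threading
import time

# Scenario: each thread keeps its own database connection, so no connection is shared across threads
thread_local = threading.local()


def get_db_connection():
    """Return the current thread's connection, creating it on first use."""
    if not hasattr(thread_local, "db_conn"):
        # Each thread creates its own connection the first time it calls this
        thread_local.db_conn = f"Connection-{threading.current_thread().name}"
        print(f"[{threading.current_thread().name}] new connection: {thread_local.db_conn}")
    return thread_local.db_conn


def worker_task(task_id: int):
    conn = get_db_connection()   # this thread's connection
    time.sleep(0.1)
    conn2 = get_db_connection()  # same thread, so the existing connection is reused
    assert conn is conn2         # within one thread it is always the same object
    print(f"task {task_id} uses connection: {conn}")


threads = [threading.Thread(target=worker_task, args=(i,), name=f"Thread-{i}") for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Output: each thread creates its connection exactly once, without interfering with the others
```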
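The isolation works in both directions: an attribute set in one thread is simply absent in another. A minimal sketch, with `context` and `tag` as illustrative names:

```python
import threading

context = threading.local()
context.tag = "main"            # visible only in the main thread
seen_in_worker = []


def inspect_context():
    # The worker gets a fresh namespace: the main thread's attribute is absent here
    seen_in_worker.append(hasattr(context, "tag"))
    context.tag = "worker"      # does not overwrite the main thread's value


t = threading.Thread(target=inspect_context)
t.start()
t.join()

print(seen_in_worker, context.tag)  # [False] main
```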
2.5 Thread-safe queues
`queue.Queue` is a thread-safe queue with its own internal locking, and it is the first choice for distributing tasks among threads:
```python
import queue
import threading
import time

NUM_CONSUMERS = 3


def producer(q: queue.Queue, items: list, num_consumers: int):
    for item in items:
        q.put(item)
        print(f"produced: {item}")
        time.sleep(0.05)
    for _ in range(num_consumers):
        q.put(None)  # one sentinel per consumer, telling it to stop


def consumer(q: queue.Queue, name: str):
    while True:
        try:
            item = q.get(timeout=3)  # wait at most 3 seconds
        except queue.Empty:
            print(f"[{name}] timed out, exiting")
            break
        if item is None:
            q.task_done()  # sentinels count as queue items too; skipping this would hang join()
            break
        # Process the task
        time.sleep(0.1)
        print(f"[{name}] processed: {item}")
        q.task_done()  # mark the item as done


task_queue = queue.Queue(maxsize=10)
tasks = list(range(20))
prod = threading.Thread(target=producer, args=(task_queue, tasks, NUM_CONSUMERS))
cons_list = [threading.Thread(target=consumer, args=(task_queue, f"consumer-{i}")) for i in range(NUM_CONSUMERS)]
prod.start()
for c in cons_list:
    c.start()
prod.join()
for c in cons_list:
    c.join()
task_queue.join()  # returns once task_done() has been called for every item that was put
print("all tasks processed")
```
3. asyncio: asynchronous programming in detail
3.1 How the event loop schedules work
asyncio uses a completely different concurrency model: a single thread plus cooperative scheduling.
The key players:
- Event loop: the scheduler; it keeps checking which coroutines are ready to continue
- Coroutine: a function defined with `async def` that can voluntarily yield control at an `await`
- Task: a coroutine handed to the event loop; after `create_task()` it is scheduled to run right away
- Future: represents the eventual result of an asynchronous operation; Task is a subclass of Future
[Figure: sequence diagram with participants Event Loop (single thread), coroutine: fetch 淘宝 data, coroutine: fetch 京东 data, coroutine: fetch 拼多多 data, and network I/O. At t=0 ms, 5 coroutines are submitted together; each one runs, issues a non-blocking HTTP request, and yields control at `await`. At t≈1000 ms the responses arrive one after another (C2's response, then C1's) and each coroutine resumes to process its response. Total time ≈ the slowest endpoint.]
The key insight: a coroutine gives up control voluntarily at `await` instead of being preempted by the operating system. That is what "cooperative scheduling" means, and it is lighter than multithreading because switching between coroutines involves no OS scheduler and no thread context switch.
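Cooperative scheduling can be observed directly by logging the order in which two coroutines run. In this minimal sketch, `step_through` is a hypothetical coroutine and `asyncio.sleep(0)` serves as a bare yield point.

```python
import asyncio

trace = []


async def step_through(name: str, steps: int):
    for i in range(steps):
        trace.append(f"{name}{i}")
        # sleep(0) is a bare yield point: control returns to the event loop,
        # which runs the next ready coroutine before resuming this one
        await asyncio.sleep(0)


async def main():
    await asyncio.gather(step_through("A", 2), step_through("B", 2))


asyncio.run(main())
print(trace)  # ['A0', 'B0', 'A1', 'B1']
```

The switches happen only at the `await` lines, which is why the interleaving is deterministic here, unlike with preemptively scheduled threads.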
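3.2 async/await basics
```python
import asyncio


# async def defines a coroutine function; calling it returns a coroutine object (nothing runs yet)
async def say_hello(name: str, delay: float):
    await asyncio.sleep(delay)  # yield control so the event loop can run other coroutines
    print(f"Hello, {name}!")
    return f"done: {name}"


# Three ways to run a coroutine:
# 1. asyncio.run(): the program entry point; creates and runs the event loop
async def main():
    # 2. await: wait for one coroutine directly (sequential)
    result = await say_hello("Alice", 1.0)

    # 3. create_task(): schedule coroutines to run concurrently (submitted without waiting)
    task1 = asyncio.create_task(say_hello("Bob", 0.5))
    task2 = asyncio.create_task(say_hello("Charlie", 0.3))

    # Wait for every task to finish; results keep argument order, not completion order
    results = await asyncio.gather(task1, task2)
    print(results)  # ['done: Bob', 'done: Charlie']


asyncio.run(main())
```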
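3.3 gather, wait, and TaskGroup
asyncio offers several ways to run coroutines concurrently, each with different error-handling semantics:
```python
import asyncio


async def risky_task(name: str, should_fail: bool):
    await asyncio.sleep(0.5)
    if should_fail:
        raise ValueError(f"{name} failed")
    return f"{name} succeeded"


async def demo_gather():
    # gather: waits for everything; with return_exceptions=True one failure does not interrupt the others
    results = await asyncio.gather(
        risky_task("A", False),
        risky_task("B", True),  # this one fails
        risky_task("C", False),
        return_exceptions=True,  # exceptions come back as ordinary results instead of being raised
    )
    for r in results:
        if isinstance(r, Exception):
            print(f"❌ failed: {r}")
        else:
            print(f"✅ {r}")


async def demo_wait():
    # wait: finer control; return on "first completed", "first exception", or "all completed"
    tasks = [asyncio.create_task(risky_task(f"task-{i}", i == 2)) for i in range(5)]
    done, pending = await asyncio.wait(
        tasks,
        return_when=asyncio.FIRST_EXCEPTION,  # return as soon as any task raises
    )
    print(f"done: {len(done)}, pending: {len(pending)}")
    # Retrieve exceptions so they are not reported as "never retrieved"
    for t in done:
        if t.exception() is not None:
            print(f"failed: {t.exception()}")
    # Cancel whatever has not finished
    for t in pending:
        t.cancel()


async def demo_taskgroup():
    # TaskGroup (Python 3.11+): the recommended newer API; if any task fails, the rest are cancelled
    try:
        async with asyncio.TaskGroup() as tg:
            t1 = tg.create_task(risky_task("X", False))
            t2 = tg.create_task(risky_task("Y", True))  # fails
            t3 = tg.create_task(risky_task("Z", False))
        # Execution reaches this point only if every task succeeded
    except* ValueError as eg:
        # except* handles an ExceptionGroup (Python 3.11+ syntax)
        print(f"{len(eg.exceptions)} task(s) failed:")
        for e in eg.exceptions:
            print(f"  - {e}")


asyncio.run(demo_gather())
```
🤔 Which one? On Python 3.11+ prefer `TaskGroup`: the semantics are clearer, and a failure in any child task cancels the others automatically, so you never end up with tasks still running that you do not know about. On older versions use `gather(return_exceptions=True)`.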
3.4 Timeouts and task cancellation
```python
import asyncio


async def slow_api_call(name: str) -> str:
    await asyncio.sleep(5)
    return f"data from {name}"


async def fetch_with_timeout():
    # Option 1: asyncio.wait_for with a timeout
    try:
        result = await asyncio.wait_for(
            slow_api_call("淘宝"),
            timeout=2.0,
        )
    except asyncio.TimeoutError:
        print("timed out, falling back to a default")
        result = {}

    # Option 2: asyncio.timeout (Python 3.11+), the more Pythonic form.
    # The TimeoutError is raised when the `async with` block exits, so the
    # try/except has to wrap the whole block rather than sit inside it.
    try:
        async with asyncio.timeout(2.0):
            result = await slow_api_call("京东")
    except asyncio.TimeoutError:
        print("京东 timed out")


async def demo_cancel():
    """Cancel a task explicitly."""
    task = asyncio.create_task(slow_api_call("拼多多"))
    await asyncio.sleep(1)  # wait 1 second
    task.cancel()  # request cancellation
    try:
        await task  # wait for the cancellation to take effect
    except asyncio.CancelledError:
        print("task was cancelled")
        # Clean up here, then decide whether to re-raise


asyncio.run(fetch_with_timeout())
asyncio.run(demo_cancel())
```
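Cleanup can also live inside the cancelled task itself. A common pattern is `try`/`finally` around the awaited work, sketched here with a hypothetical `export_rows` job and a `log` list that records the order of events.

```python
import asyncio

log = []


async def export_rows():
    """Hypothetical long-running job that must release a resource when cancelled."""
    log.append("opened")
    try:
        await asyncio.sleep(10)      # cancellation is delivered at this await
        log.append("exported")
    finally:
        log.append("closed")         # runs on success, error, and cancellation alike


async def main():
    task = asyncio.create_task(export_rows())
    await asyncio.sleep(0.05)        # let the task start and reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")


asyncio.run(main())
print(log)  # ['opened', 'closed', 'cancelled']
```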
3.5 asyncio's synchronization primitives
asyncio has its own set of synchronization primitives. They are used much like their threading counterparts, but they are coroutine-safe, not thread-safe:
```python
import asyncio

# ---- asyncio.Lock: mutual exclusion ----
lock = asyncio.Lock()
counter = 0


async def increment():
    global counter
    async with lock:
        current = counter
        await asyncio.sleep(0)  # yield control to simulate a race
        counter = current + 1


# ---- asyncio.Semaphore: throttling ----
# Scenario: bulk API calls with at most 10 requests in flight at once
# (a concurrency cap, not a per-second rate limit)
rate_limiter = asyncio.Semaphore(10)


async def controlled_request(session, url: str):
    # `session` is expected to be an aiohttp.ClientSession
    async with rate_limiter:  # beyond 10 concurrent requests, later coroutines wait here
        async with session.get(url) as resp:
            return await resp.json()


# ---- asyncio.Queue: passing data between coroutines ----
async def producer(q: asyncio.Queue):
    for i in range(10):
        await q.put(i)
        print(f"produced: {i}")
        await asyncio.sleep(0.1)
    await q.put(None)  # sentinel


async def consumer(q: asyncio.Queue, name: str):
    while True:
        item = await q.get()
        if item is None:
            await q.put(None)  # pass the sentinel on to the other consumers
            break
        print(f"[{name}] consumed: {item}")
        await asyncio.sleep(0.2)
        q.task_done()


async def pipeline():
    q = asyncio.Queue(maxsize=5)  # bounded: a producer that runs ahead has to wait
    await asyncio.gather(
        producer(q),
        consumer(q, "consumer-A"),
        consumer(q, "consumer-B"),
    )


asyncio.run(pipeline())
```
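Even on a single thread, an `await` between a read and a write opens a window for lost updates. The sketch below runs the same read-yield-write increment with and without `asyncio.Lock`; `run_increments` is a hypothetical helper written for this comparison.

```python
import asyncio


async def run_increments(use_lock: bool, workers: int = 10) -> int:
    """Run `workers` read-yield-write increments and return the final counter."""
    state = {"counter": 0}
    lock = asyncio.Lock()

    async def unsafe_increment():
        current = state["counter"]
        await asyncio.sleep(0)            # other coroutines run here and read the same value
        state["counter"] = current + 1

    async def safe_increment():
        async with lock:                  # the read, the yield, and the write stay together
            await unsafe_increment()

    step = safe_increment if use_lock else unsafe_increment
    await asyncio.gather(*(step() for _ in range(workers)))
    return state["counter"]


without_lock = asyncio.run(run_increments(use_lock=False))
with_lock = asyncio.run(run_increments(use_lock=True))
print(without_lock, with_lock)  # 1 10
```

Without the lock, all ten coroutines read 0 before any of them writes, so nine updates are lost.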
3.6 async for and async with
Python's special methods have asynchronous counterparts for use in async code:
```python
import asyncio


# ---- async with: asynchronous context manager ----
class AsyncDBConnection:
    async def __aenter__(self):
        print("connecting to the database asynchronously")
        await asyncio.sleep(0.1)  # simulate connection setup time
        return self

    async def __aexit__(self, *args):
        print("closing the database connection")
        await asyncio.sleep(0.05)

    async def query(self, sql: str):
        await asyncio.sleep(0.1)
        return [{"id": 1, "name": "test"}]


async def fetch_orders():
    async with AsyncDBConnection() as conn:
        result = await conn.query("SELECT * FROM orders")
        return result


# ---- async for: asynchronous iterator ----
class AsyncPagedAPI:
    """Simulated paginated API; each page costs one network request."""

    def __init__(self, total_pages: int):
        self.total_pages = total_pages
        self.current_page = 0

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.current_page >= self.total_pages:
            raise StopAsyncIteration
        self.current_page += 1
        await asyncio.sleep(0.1)  # simulate a network request
        return {"page": self.current_page, "data": list(range(10))}


async def fetch_all_pages():
    all_data = []
    async for page in AsyncPagedAPI(total_pages=5):
        print(f"fetched page {page['page']}, {len(page['data'])} rows")
        all_data.extend(page["data"])
    return all_data


# ---- Async generator (a more concise way to write the same thing) ----
async def async_page_generator(total_pages: int):
    for page in range(1, total_pages + 1):
        await asyncio.sleep(0.1)
        yield {"page": page, "data": list(range(10))}


async def use_async_generator():
    async for page in async_page_generator(5):
        print(f"page {page['page']}: {len(page['data'])} rows")


asyncio.run(use_async_generator())
```
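Just as an async generator shortens the iterator class, `contextlib.asynccontextmanager` can shorten the context-manager class. This is a sketch with a hypothetical `db_connection` helper; a real one would call an actual driver where the sleeps are.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def db_connection(name: str):
    """Hypothetical connection helper; a real one would call a driver here."""
    events.append(f"connect {name}")
    await asyncio.sleep(0.01)            # simulated connection setup
    try:
        yield {"name": name}             # the value bound by `as`
    finally:
        await asyncio.sleep(0.01)        # simulated teardown
        events.append(f"close {name}")


async def main():
    async with db_connection("orders") as conn:
        events.append(f"query on {conn['name']}")


asyncio.run(main())
print(events)  # ['connect orders', 'query on orders', 'close orders']
```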
3.7 asyncio in practice: fetching from multiple data sources concurrently
```python
import asyncio
import time

import aiohttp  # pip install aiohttp

DATA_SOURCES = [
    {"name": "淘宝", "url": "https://httpbin.org/delay/1"},
    {"name": "京东", "url": "https://httpbin.org/delay/1"},
    {"name": "拼多多", "url": "https://httpbin.org/delay/1"},
    {"name": "抖音", "url": "https://httpbin.org/delay/1"},
    {"name": "快手", "url": "https://httpbin.org/delay/1"},
]


async def fetch_with_retry(
    session: aiohttp.ClientSession,
    source: dict,
    max_retries: int = 3,
    timeout: float = 5.0,
) -> dict:
    """Fetch one source with retries and a per-attempt timeout."""
    for attempt in range(max_retries):
        try:
            async with asyncio.timeout(timeout):
                async with session.get(source["url"]) as response:
                    data = await response.json()
                    return {
                        "source": source["name"],
                        "status": response.status,
                        "data": data,
                    }
        except asyncio.TimeoutError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # exponential backoff
                print(f"⚠️ {source['name']} timed out, retrying in {wait_time}s (retry {attempt + 1})")
                await asyncio.sleep(wait_time)
            else:
                return {"source": source["name"], "error": f"timed out after {max_retries} attempts"}
        except aiohttp.ClientError as e:
            return {"source": source["name"], "error": str(e)}


async def fetch_all_async(
    sources: list,
    concurrency: int = 5,  # cap on concurrent requests, to avoid being rate-limited
) -> list:
    """Fetch all sources concurrently, with a concurrency limit."""
    semaphore = asyncio.Semaphore(concurrency)
    start = time.time()

    async def fetch_with_limit(session, source):
        async with semaphore:
            return await fetch_with_retry(session, source)

    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_with_limit(session, source) for source in sources]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    total = round(time.time() - start, 2)
    successes = sum(1 for r in results if isinstance(r, dict) and "error" not in r)
    print(f"\nTotal: {total}s, {successes}/{len(sources)} succeeded")
    return results


if __name__ == "__main__":
    asyncio.run(fetch_all_async(DATA_SOURCES))
```
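3.8 threading vs asyncio, side by side
| Dimension | threading | asyncio |
|---|---|---|
| Concurrency model | Multiple threads, preemptively switched by the OS | Single thread, cooperative switching |
| Memory overhead | Roughly 1-8 MB of stack reserved per thread | Coroutines are very light, on the order of a few KB each |
| Max concurrency | Bounded by OS thread limits (typically a few hundred in practice) | Tens of thousands of concurrent tasks in principle |
| Thread safety | Locks required for shared state | Single thread; no races between two `await` points |
| Legacy synchronous code | ✅ Works as-is | ❌ Must be rewritten as async |
| Debugging difficulty | High (race conditions are hard to reproduce) | Lower (execution order is more predictable) |
| Best for | Codebases with many legacy synchronous libraries | New projects, high-concurrency I/O |
A simple rule of thumb:
- You rely on synchronous libraries such as `requests` or `pymysql` and do not plan to rewrite them soon → use `threading`
- It is a new project, or you are willing to use async libraries such as `aiohttp` or `asyncpg` → use `asyncio`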
4. multiprocessing: processes in depth
4.1 Why neither threads nor coroutines suit CPU-bound work
Consider a CPU-bound task: heavy computation over a large volume of data (for example data cleaning or feature engineering in ETL).
python
import threading
import time


def cpu_intensive(n: int) -> int:
    """Simulate a CPU-bound computation."""
    result = 0
    for i in range(n):
        result += i * i
    return result


# Serial
start = time.time()
for _ in range(4):
    cpu_intensive(10_000_000)
print(f"Serial: {time.time() - start:.2f}s")

# Threads (almost no speedup: the GIL keeps the 4 threads from running in parallel)
start = time.time()
threads = [threading.Thread(target=cpu_intensive, args=(10_000_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threads: {time.time() - start:.2f}s")  # about the same as serial, sometimes slower
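4.2 Multiple processes: each process has its own GIL
multiprocessing sidesteps the GIL in the most direct way: it starts several processes, each with its own Python interpreter and its own GIL, so they run truly in parallel on multiple CPU cores.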
[Figure: the main process (GIL_1) hands work to a pool of four worker processes; worker processes 1-4 each hold their own GIL (GIL_2 to GIL_5) and each run on their own CPU core (cores 1-4).]
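
To see the effect described in 4.2 on your own machine, here is a minimal sketch that runs the same kind of pure-Python loop as in section 4.1 first serially and then in a process pool. The helper names `run_serial` and `run_processes` and the workload size are assumptions made for illustration; actual timings depend on core count and on process startup cost.

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor


def cpu_intensive(n: int) -> int:
    """Pure-Python CPU-bound loop (sum of squares below n)."""
    result = 0
    for i in range(n):
        result += i * i
    return result


def run_serial(n: int, jobs: int) -> list:
    return [cpu_intensive(n) for _ in range(jobs)]


def run_processes(n: int, jobs: int) -> list:
    # Each worker process has its own interpreter and therefore its own GIL
    workers = min(jobs, os.cpu_count() or 1)
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(cpu_intensive, [n] * jobs))


if __name__ == "__main__":
    n, jobs = 2_000_000, 4
    start = time.perf_counter()
    serial = run_serial(n, jobs)
    print(f"serial:    {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    parallel = run_processes(n, jobs)
    print(f"processes: {time.perf_counter() - start:.2f}s")

    assert serial == parallel  # same results, different wall-clock time
```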
4.3 Pool: the most commonly used process-pool interface
multiprocessing.Pool offers a richer interface than ProcessPoolExecutor:
python
import multiprocessing
import os
import time


def process_chunk(chunk: list) -> dict:
    """Process one batch of data; each worker process runs this independently."""
    result = sum(x ** 2 for x in chunk)
    return {"pid": os.getpid(), "chunk_size": len(chunk), "result": result}


# Task functions must be defined at module level, before the pool is created,
# so that worker processes can look them up by name.
def process_with_config(chunk: list, config: dict) -> dict:
    return {"chunk_size": len(chunk), "config": config}


if __name__ == "__main__":
    data = list(range(1_000_000))
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None

    # Split the data into at most cpu_count chunks (ceiling division avoids a tiny leftover chunk)
    chunk_size = -(-len(data) // cpu_count)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with multiprocessing.Pool(processes=cpu_count) as pool:
        # map: blocks until everything is done; results come back in input order
        start = time.time()
        results = pool.map(process_chunk, chunks)
        print(f"map took {time.time() - start:.2f}s, {len(results)} results")

        # imap: lazy version; results are yielded one at a time, which saves memory
        for result in pool.imap(process_chunk, chunks):
            pass  # results can be consumed as they arrive

        # starmap: for task functions that take several arguments
        task_args = [(chunk, {"debug": True}) for chunk in chunks]
        results = pool.starmap(process_with_config, task_args)

        # apply_async: submit a single task without blocking
        async_result = pool.apply_async(process_chunk, args=(data[:100],))
        result = async_result.get(timeout=10)  # wait for the result
        print(f"Single-task result: {result}")
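
Pool has two further knobs that matter when there are many tiny tasks: imap_unordered, which yields results as soon as they are ready rather than in input order, and the chunksize argument, which batches tasks per message to a worker. The sketch below is illustrative only; `word_count`, `total_words`, and the chosen chunksize are assumptions, not values from the article.

```python
import multiprocessing


def word_count(line: str) -> int:
    """A deliberately tiny task: count whitespace-separated tokens."""
    return len(line.split())


def total_words(lines: list, processes: int = 2, chunksize: int = 256) -> int:
    with multiprocessing.Pool(processes=processes) as pool:
        # chunksize groups many small tasks into one message per worker round trip;
        # imap_unordered yields each result when it is ready, in no particular order,
        # which is fine here because addition does not care about order
        return sum(pool.imap_unordered(word_count, lines, chunksize=chunksize))


if __name__ == "__main__":
    lines = ["a b c", "d e", ""] * 10_000
    print(total_words(lines))
```

With chunksize left at its default of 1 for imap_unordered, every line would be pickled and sent on its own, so the inter-process overhead could easily outweigh the work itself.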
4.4 ProcessPoolExecutor: the more modern interface
concurrent.futures.ProcessPoolExecutor shares its interface with ThreadPoolExecutor and is the better choice for new code:
python
import os
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


def process_single_file(file_path: str) -> dict:
    """
    ETL logic for a single data file.
    Each worker process runs this independently of the others.
    """
    start = time.time()
    path = Path(file_path)
    row_count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            cleaned = line.strip()
            if cleaned:
                _ = sum(ord(c) ** 2 for c in cleaned[:50])  # simulated CPU work
                row_count += 1
    return {
        "file": path.name,
        "rows": row_count,
        "elapsed": round(time.time() - start, 3),
        "pid": os.getpid(),
    }


def parallel_etl(file_paths: list[str], max_workers: int | None = None) -> list[dict]:
    if max_workers is None:
        max_workers = os.cpu_count()
    results = []
    start = time.time()
    print(f"🚀 Starting {max_workers} worker processes for {len(file_paths)} files...")
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(process_single_file, fp): fp
            for fp in file_paths
        }
        # as_completed yields futures in completion order, not submission order
        for future in as_completed(future_to_file):
            try:
                result = future.result()
                results.append(result)
                print(f"✅ {result['file']} (PID={result['pid']}, {result['rows']} rows, {result['elapsed']}s)")
            except Exception as e:
                print(f"❌ {future_to_file[future]} failed: {e}")
    print(f"\nTotal time: {round(time.time() - start, 2)}s")
    return results
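4.5 Inter-process communication
The price of multiple processes: they do not share memory by default, so anything passed between them has to be serialized with pickle.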
[Figure: process 1 → pickle serialization → Queue / Pipe → pickle deserialization → process 2]
python
import ctypes
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Array, Pipe, Process, Queue, Value, shared_memory

import numpy as np


# ---- Queue: the most general option, many-to-many ----
def worker_queue(q: Queue, data: list):
    result = sum(x ** 2 for x in data)
    q.put({"result": result, "count": len(data)})


# ---- Pipe: point-to-point, faster than Queue ----
def worker_pipe(conn):
    data = conn.recv()  # receive data
    result = sum(x ** 2 for x in data)
    conn.send(result)  # send the result back
    conn.close()


# ---- Shared memory: for large data, avoids pickle overhead ----
def worker_shared_mem(shm_name: str, shape: tuple):
    """The worker attaches to the segment by name and reads it without copying."""
    existing_shm = shared_memory.SharedMemory(name=shm_name)
    array = np.ndarray(shape, dtype=np.float64, buffer=existing_shm.buf)
    result = float(np.sum(array))  # read in place; the array itself is never pickled
    del array  # drop the NumPy view before closing the segment
    existing_shm.close()
    return result


if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker_queue, args=(q, list(range(1000))))
    p.start()
    print(q.get())  # {"result": ..., "count": 1000}; read before join(), which can deadlock on large payloads
    p.join()

    parent_conn, child_conn = Pipe()
    p = Process(target=worker_pipe, args=(child_conn,))
    p.start()
    parent_conn.send(list(range(1000)))  # send data to the child process
    result = parent_conn.recv()  # receive the result
    p.join()

    # Option 1: Value and Array (primitive C types)
    shared_counter = Value(ctypes.c_int, 0)
    shared_array = Array(ctypes.c_double, [1.0, 2.0, 3.0])

    # Option 2: shared_memory (Python 3.8+, a good fit for NumPy arrays)
    shape = (1000, 1000)
    nbytes = shape[0] * shape[1] * np.dtype(np.float64).itemsize  # 8 MB
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    try:
        np_array = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)
        np_array[:] = np.random.rand(*shape)  # write the data

        # Pass only the segment name to the worker, never the data itself
        with ProcessPoolExecutor(max_workers=4) as executor:
            future = executor.submit(worker_shared_mem, shm.name, shape)
            print(f"Shared-memory result: {future.result():.2f}")
        del np_array
    finally:
        # Clean up
        shm.close()
        shm.unlink()  # free the shared memory segment
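
Value and Array are created above but not exercised. As a minimal sketch of how a shared Value is typically updated from several processes, the following increments one counter under the Value's built-in lock; `add_many` and the iteration counts are hypothetical names and numbers chosen for illustration.

```python
import ctypes
from multiprocessing import Process, Value


def add_many(counter, n: int) -> None:
    """Increment a shared counter n times."""
    for _ in range(n):
        # counter.value += 1 is a read-modify-write, so hold the Value's lock
        with counter.get_lock():
            counter.value += 1


if __name__ == "__main__":
    counter = Value(ctypes.c_int, 0)  # stored in shared memory, visible to all workers
    workers = [Process(target=add_many, args=(counter, 10_000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)
```

Without get_lock(), increments from different processes can interleave and some updates may be lost, which is the same race that threads face on shared state.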
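4.6 Process pool caveats
python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor

# create_connection below is a placeholder for your own connection factory

# ---- Caveat 1: task functions must be picklable ----
# ❌ Bad: a lambda cannot be pickled
with ProcessPoolExecutor() as executor:
    future = executor.submit(lambda x: x ** 2, 5)
    future.result()  # the pickling error is raised here, not by submit()


# ✅ Correct: use a plain module-level function
def square(x):
    return x ** 2


with ProcessPoolExecutor() as executor:
    executor.submit(square, 5)


# ❌ Bad: submitting a bound method from inside the instance (self may not be picklable)
class Processor:
    def __init__(self, config):
        self.config = config
        self.db = create_connection()  # not picklable

    def run(self, data):
        with ProcessPoolExecutor() as executor:
            executor.submit(self.process, data)  # pickles self, db included → fails

    def process(self, data):
        return data


# ✅ Correct: pass picklable data and open the connection inside the worker
def process_with_config(data, config: dict):
    conn = create_connection(config)  # each worker process builds its own connection
    result = conn.process(data)
    conn.close()
    return result


# ---- Caveat 2: process start method ----
# Windows and macOS (since Python 3.8) default to spawn; Linux defaults to fork
# through Python 3.13 and to forkserver from 3.14
# spawn is safer (the parent's state is not copied) but starts more slowly
# fork is fast, but copies locks, file handles, etc., which can deadlock
if __name__ == "__main__":  # this guard is required under spawn!
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(4) as pool:
        results = pool.map(square, range(10))

# ---- Caveat 3: more processes is not always better ----
# CPU-bound: process count = number of CPU cores is a good default
cpu_workers = os.cpu_count()
# Going beyond the core count adds context-switch overhead and tends to be slower

# ❌ Bad
with ProcessPoolExecutor(max_workers=100) as executor:  # 100 processes on an 8-core machine is wasteful
    pass

# ✅ Correct
with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
    pass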
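
The start method can also be chosen per executor instead of globally. A minimal sketch, assuming the standard mp_context parameter of ProcessPoolExecutor; the helper name `squares_with` is hypothetical.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def square(x: int) -> int:
    return x ** 2


def squares_with(method: str, values: list) -> list:
    """Run square() in a pool whose workers are started with the given method."""
    ctx = multiprocessing.get_context(method)  # e.g. "spawn", "fork", "forkserver"
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        return list(executor.map(square, values))


if __name__ == "__main__":
    print("available:", multiprocessing.get_all_start_methods())
    print("default:  ", multiprocessing.get_start_method())
    print(squares_with("spawn", list(range(10))))
```

Passing a context keeps the choice local to one pool, which is convenient in libraries that should not change the process-wide default.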
5. Hybrid patterns and choosing a tool
5.1 asyncio + ProcessPoolExecutor: async coordination plus CPU parallelism
Real data pipelines usually contain both I/O-bound and CPU-bound stages:
[Figure: multiple data sources → asyncio concurrent fetch (I/O-bound) → raw data → ProcessPool parallel cleaning (CPU-bound) → cleaned data → asyncio concurrent write (I/O-bound) → data warehouse]
python
import asyncio
import os
from concurrent.futures import ProcessPoolExecutor

import aiohttp


# CPU-bound: data cleaning (runs in worker processes)
def clean_and_transform(raw_data: dict) -> dict:
    """Runs in a separate process, so it can make full use of multiple cores."""
    source = raw_data["source"]
    records = raw_data.get("records", [])
    cleaned = []
    for record in records:
        # Heavy computation: deduplication, format conversion, outlier handling
        processed = {
            "id": record.get("id"),
            "value": sum(ord(c) for c in str(record)) % 1000,
            "source": source,
        }
        cleaned.append(processed)
    return {"source": source, "count": len(cleaned), "data": cleaned}


# I/O-bound: asynchronous fetch (runs on the event loop's thread)
async def fetch_data(session: aiohttp.ClientSession, source: dict) -> dict:
    async with session.get(source["url"]) as resp:
        # The response body is ignored here; the records are simulated
        return {
            "source": source["name"],
            "records": [{"id": i, "val": i * 2} for i in range(100)],
        }


async def data_pipeline(sources: list) -> list:
    """
    Full data pipeline:
    1. asyncio fetches the data sources concurrently (I/O-bound)
    2. ProcessPoolExecutor cleans the data in parallel (CPU-bound)
    """
    # Inside a coroutine, get_running_loop() is preferred over get_event_loop()
    loop = asyncio.get_running_loop()

    # Stage 1: concurrent fetch (I/O-bound)
    print("📥 Fetching data concurrently...")
    async with aiohttp.ClientSession() as session:
        fetch_tasks = [fetch_data(session, s) for s in sources]
        raw_results = await asyncio.gather(*fetch_tasks)
    print(f"✅ Fetch complete, {len(raw_results)} data sources")

    # Stage 2: parallel cleaning (CPU-bound)
    print("⚙️ Cleaning data in parallel...")
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        # run_in_executor submits a synchronous function to the pool and returns an awaitable Future
        clean_tasks = [
            loop.run_in_executor(executor, clean_and_transform, raw)
            for raw in raw_results
        ]
        cleaned_results = await asyncio.gather(*clean_tasks)

    total_rows = sum(r["count"] for r in cleaned_results)
    print(f"✅ Cleaning complete, {total_rows} records processed")
    return cleaned_results


if __name__ == "__main__":
    sources = [
        {"name": f"数据源_{i}", "url": "https://httpbin.org/json"}
        for i in range(5)
    ]
    asyncio.run(data_pipeline(sources))
💡 loop.run_in_executor(executor, func, *args) is the bridge between asyncio and process/thread pools. It submits a synchronous function to the executor, returns an awaitable Future, and leaves the event loop unblocked.
5.2 threading + asyncio: running synchronous blocking calls from async code
Sometimes synchronous code cannot be rewritten (legacy libraries, third-party SDKs), yet it still has to be called from async code:
python
import asyncio

import requests  # synchronous library


async def fetch_with_sync_lib(url: str) -> str:
    """Call a synchronous blocking function from asyncio without stalling the loop."""
    loop = asyncio.get_running_loop()
    # With executor=None, run_in_executor uses the loop's default thread pool
    response = await loop.run_in_executor(
        None,          # None selects the default thread pool
        requests.get,  # synchronous function
        url,           # its positional argument
    )
    return response.text


# Run several synchronous calls concurrently
async def fetch_multiple(urls: list) -> list:
    loop = asyncio.get_running_loop()
    tasks = [
        loop.run_in_executor(None, requests.get, url)
        for url in urls
    ]
    responses = await asyncio.gather(*tasks)
    return [r.text for r in responses]


asyncio.run(fetch_multiple(["https://httpbin.org/get"] * 5))
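
On Python 3.9 and later, asyncio.to_thread offers a shorter spelling of the same idea. The sketch below assumes a stand-in blocking function (`blocking_call`, hypothetical, which just sleeps) so that it runs without network access; `call_many` is also an illustrative name.

```python
import asyncio
import time


def blocking_call(tag: str, delay: float = 0.1) -> str:
    """Stand-in for a synchronous SDK call (hypothetical); it blocks its thread."""
    time.sleep(delay)
    return f"{tag}: done"


async def call_many(tags: list) -> list:
    # asyncio.to_thread (Python 3.9+) runs each call in the default thread pool
    return await asyncio.gather(*(asyncio.to_thread(blocking_call, t) for t in tags))


if __name__ == "__main__":
    start = time.perf_counter()
    print(asyncio.run(call_many(["a", "b", "c"])))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Unlike run_in_executor, to_thread also accepts keyword arguments for the wrapped function, which saves a functools.partial in many cases.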
5.3 Decision tree for choosing a tool
[Figure: decision tree, reconstructed from the diagram labels]
- What type of task is it?
  - I/O-bound (network / files / database): is there a lot of legacy synchronous code (requests, pymysql, etc.)?
    - Yes → threading / ThreadPoolExecutor
    - No, new project → asyncio with aiohttp / asyncpg
  - Both I/O and CPU work? → asyncio + ProcessPoolExecutor (hybrid mode)
  - CPU-bound (numerical computation / data processing)
    - Pure-Python computation → multiprocessing / ProcessPoolExecutor
    - NumPy / SciPy computation → NumPy's C layer releases the GIL, so ThreadPoolExecutor also works
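5.4 Performance reference
Taking "run 10 I/O tasks of 1 second each concurrently" as the example:
| Approach | Theoretical time | Memory overhead | Suitable task count |
|---|---|---|---|
| Serial | ~10s | Very low | Not concurrent |
| threading (10 threads) | ~1s | Medium (~80 MB of reserved stack at 8 MB per thread; resident memory is usually far lower) | Tens to hundreds |
| asyncio | ~1s | Very low (<5 MB) | Hundreds to tens of thousands |
| multiprocessing (4 processes) | ~3s | High (a full interpreter per process; startup also costs about 50-100 ms each) | Poor fit for I/O-bound work |
Taking "run 4 CPU-bound tasks of 2 seconds each in parallel on an 8-core machine" as the example:
| Approach | Theoretical time | Notes |
|---|---|---|
| Serial | ~8s | Baseline |
| threading (4 threads) | ~8s | No gain because of the GIL |
| multiprocessing (4 processes) | ~2s | True parallelism, about a 4x speedup |
| asyncio | ~8s | Single-threaded; no help for CPU-bound work |
📊 Measurement advice: starting a process has a fixed cost (roughly 50-100 ms), so for very small workloads multiprocessing can end up slower. Rule of thumb: it pays off once a single task runs for more than about 200 ms.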
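
The I/O figures above are theoretical, so it is worth measuring on your own setup. Here is a minimal sketch for the serial versus asyncio rows, assuming simulated I/O (asyncio.sleep) and a delay scaled down from 1 s to 0.1 s so that it finishes quickly; `io_task`, `run_serial`, and `run_concurrent` are illustrative names.

```python
import asyncio
import time


async def io_task(delay: float) -> None:
    await asyncio.sleep(delay)  # simulated I/O wait


async def run_serial(n: int, delay: float) -> float:
    start = time.perf_counter()
    for _ in range(n):
        await io_task(delay)  # one after another: about n * delay in total
    return time.perf_counter() - start


async def run_concurrent(n: int, delay: float) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(io_task(delay) for _ in range(n)))  # waits overlap
    return time.perf_counter() - start


if __name__ == "__main__":
    n, delay = 10, 0.1  # scaled down from the 10 x 1 s example
    print(f"serial:     {asyncio.run(run_serial(n, delay)):.2f}s")
    print(f"concurrent: {asyncio.run(run_concurrent(n, delay)):.2f}s")
```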
6. Common pitfalls and best practices
Pitfall 1: calling synchronous blocking I/O inside async def
python
import asyncio

import aiohttp
import requests


# ❌ Bad: blocks the whole event loop, so every coroutine stalls
async def bad_fetch(url: str):
    response = requests.get(url)  # synchronous and blocking! The event loop cannot schedule anything else
    return response.text


# ✅ Option 1: switch to an asynchronous library
async def good_fetch_aiohttp(url: str):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()


# ✅ Option 2: push the synchronous call onto a thread pool with run_in_executor
async def good_fetch_executor(url: str):
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(None, requests.get, url)
    return response.text
Pitfall 2: asyncio.sleep vs time.sleep
python
import asyncio
import time


# ❌ Bad: time.sleep is synchronous and blocks the event loop
async def bad_wait():
    time.sleep(1)  # the whole event loop is frozen for 1 second; no other coroutine can run


# ✅ Correct: asyncio.sleep yields control
async def good_wait():
    await asyncio.sleep(1)  # the event loop can schedule other coroutines during this second
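Pitfall 3: forgetting await, so the coroutine never runs
python
# fetch_data stands for any coroutine function (async def) defined elsewhere

# ❌ Bad: without await, a coroutine object is created but never executed
async def main():
    result = fetch_data()  # only creates the coroutine object; Python emits "RuntimeWarning: coroutine ... was never awaited" when it is discarded
    print(result)  # prints <coroutine object fetch_data at 0x...>


# ✅ Correct
async def main():
    result = await fetch_data()
    print(result)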
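Pitfall 4: a create_task() task garbage-collected before it finishes
python
import asyncio

# some_coroutine stands for any coroutine function defined elsewhere

# ❌ Bad: nothing references the task after creation, so it may be garbage-collected mid-flight
async def bad():
    asyncio.create_task(some_coroutine())  # no reference kept!


# ✅ Correct: keep a reference to the task, or manage it with a TaskGroup
background_tasks = set()


async def good():
    task = asyncio.create_task(some_coroutine())
    background_tasks.add(task)  # the event loop only holds a weak reference, so keep a strong one
    task.add_done_callback(background_tasks.discard)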
Pitfall 5: ProcessPoolExecutor's pickle restrictions
python
from concurrent.futures import ProcessPoolExecutor

# create_db_connection, config and data are placeholders for your own code


class DataProcessor:
    def __init__(self, config):
        self.config = config
        self.db_conn = create_db_connection()  # ❌ a database connection cannot be pickled!

    def process(self, data):
        return self.db_conn.query(data)


processor = DataProcessor(config)

# ❌ Submitting a bound method pickles self, db_conn included; the pickling error surfaces from future.result()
with ProcessPoolExecutor() as executor:
    executor.submit(processor.process, data)


# ✅ Correct: create the connection inside the worker process
def process_with_connection(data, config):
    conn = create_db_connection(config)  # each worker process opens its own connection
    result = conn.query(data)
    conn.close()
    return result


with ProcessPoolExecutor() as executor:
    executor.submit(process_with_connection, data, config)
Pitfall 6: inconsistent lock ordering causes deadlock
python
import threading
lock_a = threading.Lock()
lock_b = threading.Lock()
# ❌ Bad: the two threads take the locks in opposite order and can deadlock
def thread1():
    with lock_a:
        with lock_b:  # waits for lock_b (which thread2 may already hold)
            pass
def thread2():
    with lock_b:
        with lock_a:  # waits for lock_a (which thread1 may already hold)
            pass
# ✅ Good: always acquire locks in the same order
def thread1_safe():
    with lock_a:
        with lock_b:
            pass
def thread2_safe():
    with lock_a:  # same order as thread1_safe
        with lock_b:
            pass
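When many call sites take the same locks, a fixed order is easier to enforce through a helper than by convention. A minimal sketch, assuming the locks are distinct objects: `acquire_in_order` (a hypothetical helper, not part of `threading`) sorts them by `id()` so every caller ends up with the same order regardless of argument order.

```python
import threading
from contextlib import ExitStack, contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire distinct locks in one global order (sorted by id); release in reverse."""
    with ExitStack() as stack:
        for lock in sorted(locks, key=id):
            stack.enter_context(lock)
        yield


def run_demo(iterations: int = 500) -> int:
    lock_a, lock_b = threading.Lock(), threading.Lock()
    counter = 0

    def worker(first, second):
        nonlocal counter
        for _ in range(iterations):
            # Callers pass the locks in opposite orders; the helper normalizes them.
            with acquire_in_order(first, second):
                counter += 1

    threads = [
        threading.Thread(target=worker, args=(lock_a, lock_b)),
        threading.Thread(target=worker, args=(lock_b, lock_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_demo())
```

Passing the same non-reentrant lock twice would still block forever, so the helper is only suitable for distinct locks.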
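Pitfall 7: fork safety in multiprocessing
python
import multiprocessing
# ❌ fork copies the parent's memory, but only the forking thread survives in the child.
# If another parent thread holds a lock at that moment (connection pool, logging lock, ...),
# the child inherits the lock in its locked state with no thread left to release it → deadlock.
# fork was the Linux default through Python 3.13; macOS has defaulted to spawn since Python 3.8.
# ✅ Set the start method explicitly (consistent across platforms)
if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # safe, but slower than fork (each worker re-imports the main module)
    # Or choose it per pool through a context; ProcessPoolExecutor accepts mp_context=ctx
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(4) as pool:
        results = pool.map(process_func, data_list)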
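To see what a given interpreter supports, and to pin `spawn` for a single executor without touching the global setting, something like the following can be used. It is a sketch: `describe_start_methods` and `make_spawn_executor` are hypothetical helper names.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def describe_start_methods() -> dict:
    """Report the start methods this platform offers and which one is fixed so far."""
    return {
        # None means no start method has been fixed yet in this process.
        "fixed": multiprocessing.get_start_method(allow_none=True),
        "available": multiprocessing.get_all_start_methods(),
    }


def make_spawn_executor(max_workers: int = 2) -> ProcessPoolExecutor:
    """Build an executor whose workers are always started with spawn."""
    ctx = multiprocessing.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)


if __name__ == "__main__":
    print(describe_start_methods())
```

Using a context keeps the choice local to one pool, which matters in libraries where calling `set_start_method` globally would surprise the host application.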
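Pitfall 8: more threads or processes is not always better
python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
# ❌ Bad: blindly starting a large number of processes/threads
with ProcessPoolExecutor(max_workers=100) as executor:  # 100 processes on an 8-core machine: much of the time goes to context switching
    pass
with ThreadPoolExecutor(max_workers=1000) as executor:  # 1000 threads: heavy memory and scheduling overhead
    pass
# ✅ CPU-bound: process count = number of CPU cores
with ProcessPoolExecutor(max_workers=cpu_count) as executor:
    pass
# ✅ I/O-bound: more threads are fine, but there is still a ceiling
# ThreadPoolExecutor's default is min(32, os.cpu_count() + 4) (Python 3.8+)
io_workers = min(32, cpu_count * 4)  # rule of thumb
with ThreadPoolExecutor(max_workers=io_workers) as executor:
    pass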
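`os.cpu_count()` reports the machine's cores, which can overstate what the process may actually use when CPU affinity is restricted. A rough sizing helper might look like this; the function names and the `* 4` multiplier are assumptions carried over from the rule of thumb above, not fixed rules.

```python
import os


def available_cpus() -> int:
    """CPUs this process may run on; falls back to the machine's core count."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0)) or 1
    return os.cpu_count() or 1


def suggest_workers(kind: str) -> int:
    """Suggest a pool size for 'cpu'-bound or 'io'-bound work."""
    cpus = available_cpus()
    if kind == "cpu":
        return cpus               # one process per usable core
    if kind == "io":
        return min(32, cpus * 4)  # more threads than cores, but capped
    raise ValueError(f"unknown workload kind: {kind!r}")


if __name__ == "__main__":
    print(suggest_workers("cpu"), suggest_workers("io"))
```

Treat the result as a starting point and measure under real load before settling on a number.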
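7. Python 3.13's no-GIL mode (outlook)
Python 3.13 introduced an experimental free-threaded mode: building CPython with the --disable-gil configure option removes the GIL:
bash
# Install a free-threaded build of Python 3.13 (experimental)
# pyenv install 3.13t   (the "t" suffix marks a free-threaded build)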
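python
import sys
# Check whether the GIL is active: False means the interpreter is running free-threaded
print(sys._is_gil_enabled())  # available in Python 3.13+ only
# In free-threaded mode, CPU-bound threads can run truly in parallel
import threading
import time
results = []
def cpu_task():
    result = sum(i * i for i in range(10_000_000))
    results.append(result)
# In free-threaded mode this code can make use of multiple cores
start = time.perf_counter()
threads = [threading.Thread(target=cpu_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Elapsed: {time.perf_counter() - start:.2f}s")  # free-threaded with 4+ idle cores: approaches 1/4 of the serial time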
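⚠️ Note: as of 3.13, free-threaded mode is experimental and has known issues:
- Single-threaded code runs noticeably slower. The CPython documentation cites roughly 40% overhead on the pyperformance suite for 3.13, largely because the specializing adaptive interpreter is disabled in that build; the stated goal is 10% or less
- Some C extensions are not yet compatible
- Not recommended for production
PEP 703 is being rolled out in phases. In 3.13 the free-threaded build is experimental; in 3.14 it is officially supported but still optional (PEP 779); making it the default is a later phase with no committed release yet. If that happens, it will be a major turning point for concurrent programming in Python.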
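Because `sys._is_gil_enabled()` does not exist before 3.13, a version-tolerant check is handy in code that must run on several interpreters. A small sketch, with `gil_status` as a hypothetical helper name:

```python
import sys
import sysconfig


def gil_status() -> dict:
    """Report whether this is a free-threaded build and whether the GIL is active now."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled exists only on Python 3.13+; older versions always have a GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


if __name__ == "__main__":
    print(gil_status())
```

The two values can differ: a free-threaded build may still run with the GIL enabled, for example when it is turned back on at startup.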
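8. Putting it all together: a complete data collection pipeline
Combining everything above, here is a near-production collection system for multiple data sources:
python
"""
Scenario: data collection pipeline for an e-commerce BI system
1. Pull the day's orders from several channels concurrently (asyncio, I/O-bound)
2. Clean each channel's data and extract features in parallel (ProcessPool, CPU-bound)
3. Use a semaphore to cap concurrency so the downstream API is not overwhelmed
4. Record every error per channel without affecting the other channels
"""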
import asyncio
import os
import random
import time
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
from typing import Optional
# ---- Data models ----
@dataclass
class RawOrder:
    order_id: str
    channel: str
    amount: float
    items: list
@dataclass
class PipelineResult:
    channel: str
    success: bool
    record_count: int = 0
    error: Optional[str] = None
    elapsed: float = 0.0
# ---- CPU-bound: data cleaning (runs in worker processes) ----
def clean_orders(raw_orders: list[dict], channel: str) -> dict:
    """
    Data cleaning: deduplicate, normalize formats, filter outliers.
    This is CPU-bound work and runs in a separate process.
    """
    seen_ids = set()
    cleaned = []
    for order in raw_orders:
        # Deduplicate
        if order["order_id"] in seen_ids:
            continue
        seen_ids.add(order["order_id"])
        # Filter outliers
        if order["amount"] <= 0 or order["amount"] > 100000:
            continue
        # Format conversion (simulated CPU work)
        features = {
            "order_id": order["order_id"],
            "channel": channel,
            "amount": round(order["amount"], 2),
            "item_count": len(order.get("items", [])),
            # Simulated feature extraction; str hashes are salted per interpreter, so this value is not stable across runs
            "amount_hash": hash(str(order["amount"])) % 10000,
        }
        cleaned.append(features)
    return {
        "channel": channel,
        "original_count": len(raw_orders),
        "cleaned_count": len(cleaned),
        "data": cleaned,
    }
# ---- I/O-bound: data fetching (coroutine) ----
async def fetch_channel_data(
    channel: str,
    semaphore: asyncio.Semaphore,
    timeout: float = 10.0
) -> list[dict]:
    """
    Simulate pulling data from one e-commerce channel.
    In a real system, replace the body with an aiohttp call.
    """
    async with semaphore:
        try:
            async with asyncio.timeout(timeout):  # requires Python 3.11+
                # Simulate network latency
                await asyncio.sleep(random.uniform(0.5, 2.0))
                # Simulate an occasional failure
                if random.random() < 0.1:
                    raise ConnectionError(f"{channel} API returned 500")
                # Simulate the returned order data
                orders = [
                    {
                        "order_id": f"{channel}-{i:06d}",
                        "channel": channel,
                        "amount": random.uniform(10, 5000),
                        "items": [f"item_{j}" for j in range(random.randint(1, 5))],
                    }
                    for i in range(random.randint(100, 1000))
                ]
                return orders
        except TimeoutError as exc:
            raise TimeoutError(f"{channel} request timed out") from exc
# ---- Main pipeline ----
async def run_pipeline(channels: list[str]) -> list[PipelineResult]:
    loop = asyncio.get_running_loop()
    results = []
    semaphore = asyncio.Semaphore(3)  # at most 3 requests in flight at once
    start = time.time()
    # Stage 1: fetch all channels concurrently
    print(f"📥 Fetching data from {len(channels)} channels...")
    fetch_tasks = {
        channel: asyncio.create_task(fetch_channel_data(channel, semaphore))
        for channel in channels
    }
    raw_data_map: dict[str, Optional[list]] = {}
    for channel, task in fetch_tasks.items():
        try:
            raw_data_map[channel] = await task
            print(f"  ✅ {channel}: fetched {len(raw_data_map[channel])} records")
        except Exception as e:
            raw_data_map[channel] = None
            results.append(PipelineResult(channel=channel, success=False, error=str(e)))
            print(f"  ❌ {channel}: {e}")
    # Drop the channels that failed
    successful_data = {k: v for k, v in raw_data_map.items() if v is not None}
    print(f"\n⚙️ Cleaning data from {len(successful_data)} channels in parallel...")
    # Stage 2: parallel cleaning (CPU-bound)
    # max(1, ...) keeps max_workers valid even if every fetch failed; os.cpu_count() can return None
    max_workers = max(1, min(os.cpu_count() or 1, len(successful_data)))
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        clean_tasks = [
            loop.run_in_executor(executor, clean_orders, raw_data, channel)
            for channel, raw_data in successful_data.items()
        ]
        cleaned_results = await asyncio.gather(*clean_tasks, return_exceptions=True)
    # gather preserves order, so each result lines up with its channel name
    for channel, result in zip(successful_data, cleaned_results):
        if isinstance(result, Exception):
            print(f"  ❌ {channel}: cleaning failed: {result}")
            results.append(PipelineResult(channel=channel, success=False, error=str(result)))
        else:
            original = result["original_count"]
            ratio = result["cleaned_count"] / original * 100 if original else 0.0
            print(f"  ✅ {channel}: {original} → {result['cleaned_count']} records (kept {ratio:.0f}%)")
            results.append(PipelineResult(
                channel=channel,
                success=True,
                record_count=result["cleaned_count"],
                elapsed=round(time.time() - start, 2),
            ))
    total = round(time.time() - start, 2)
    success_count = sum(1 for r in results if r.success)
    print(f"\n🎉 Pipeline finished! Total time {total}s, {success_count}/{len(channels)} channels succeeded")
    return results
if __name__ == "__main__":
    channels = ["Taobao", "JD.com", "Pinduoduo", "Douyin", "Kuaishou", "Vipshop", "Suning"]
    asyncio.run(run_pipeline(channels))
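The semaphore's effect in the fetch stage can be checked in isolation. The sketch below uses a hypothetical `fake_request` coroutine in place of a real API call and records the highest number of requests that were in flight at the same time.

```python
import asyncio


async def measure_peak(n_tasks: int = 10, limit: int = 3) -> int:
    """Run n_tasks fake requests under a semaphore and return the peak concurrency."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_request():
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulated network wait
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(n_tasks)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(measure_peak()))
```

All ten tasks are created up front, yet no more than `limit` of them are ever inside the `async with` block together, which is what protects the downstream API.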
9. Summary
- Python concurrency
  - GIL
    - Protects reference counting
    - 5 ms switch interval
    - Released during I/O and inside C extensions
    - 3.13 no-GIL experiment
  - threading
    - Thread lifecycle
    - ThreadPoolExecutor
    - Lock / RLock
    - Semaphore for rate limiting
    - Event for signaling
    - Condition for conditional waits
    - threading.local for per-thread state
    - queue.Queue, a thread-safe queue
  - asyncio
    - Event Loop
    - async/await
    - Task / Future
    - gather / TaskGroup
    - Timeout control
    - Semaphore for limiting concurrency
    - async for / async with
    - asyncio.Queue
  - multiprocessing
    - Pool.map / starmap
    - ProcessPoolExecutor
    - Queue / Pipe communication
    - shared_memory
    - spawn vs fork
  - Hybrid mode
    - asyncio + ProcessPool
    - run_in_executor
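| Tool | Core mechanism | Best for | Poor fit | Key parameter |
|---|---|---|---|---|
| threading | OS threads taking turns under the GIL | I/O-bound work in legacy synchronous code | CPU-bound computation | max_workers = min(32, cpu+4) |
| asyncio | Single-threaded event loop | High-concurrency I/O, new projects | Legacy synchronous code, CPU-bound work | Semaphore to cap concurrency |
| multiprocessing | Multiple processes, each with its own GIL | CPU-bound work, data parallelism | High-concurrency I/O, frequent inter-process communication | max_workers = cpu_count() |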
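🎯 One-line summary: the GIL means Python threads take turns rather than run in parallel; coroutines replace preemptive switching with voluntary yielding and are highly efficient for high-concurrency I/O; when pure-Python code needs true parallel computation on a standard build, multiple processes are the way around the GIL. Python 3.13's no-GIL experiment is quietly starting to change that picture.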