Concurrency in Python: Process Pools

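Summary: Python's concurrent.futures module provides ProcessPoolExecutor for managing a pool of pre-instantiated worker processes. Because each worker is a separate process with its own interpreter, ProcessPoolExecutor is not constrained by the GIL and suits CPU-bound tasks better than a thread pool. It can be instantiated directly or used as a context manager, and it accepts single tasks through submit() or batches through map(). In the comparison in this article, counting down from 8 million took 1.55 seconds under ProcessPoolExecutor and 3.84 seconds under ThreadPoolExecutor. Choose the executor type according to whether the workload is CPU-bound or I/O-bound.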

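A process pool is created and used in much the same way as a thread pool. It is a group of pre-instantiated, idle processes that stand ready to be given work. When there are many tasks to run, creating a pool is preferable to instantiating a new process for each task, because the cost of starting a process is paid once per worker rather than once per task.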
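To illustrate the point about reuse, here is a minimal sketch. It assumes a hypothetical helper, worker_pid(), that is not part of the original examples; each task simply reports the ID of the process that ran it, so the output shows eight tasks being served by no more than two worker processes.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def worker_pid(_):
    """Hypothetical task: return the ID of the process that ran it."""
    return os.getpid()


def main():
    # Eight tasks go to a pool capped at two workers, so the tasks
    # must share processes instead of each getting a fresh one.
    with ProcessPoolExecutor(max_workers=2) as executor:
        pids = list(executor.map(worker_pid, range(8)))
    print("tasks run:", len(pids))
    print("distinct worker processes:", len(set(pids)))


if __name__ == "__main__":
    main()
```

The number of distinct process IDs never exceeds max_workers, however many tasks are submitted.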

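The concurrent.futures Module

The Python standard library includes the concurrent.futures module, introduced in Python 3.2, which gives developers a high-level interface for launching asynchronous tasks. It is an abstraction layer on top of Python's threading and multiprocessing modules and offers a uniform interface for running tasks on either a pool of threads or a pool of processes.

The sections that follow look at the classes this module provides.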

The Executor Class

Executor is an abstract class in the concurrent.futures module. It cannot be used directly; the work is done through one of its two concrete subclasses:

ThreadPoolExecutor

ProcessPoolExecutor
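Because both subclasses implement the same Executor interface, code written against one can usually run on the other. The sketch below is illustrative only: cube() and run_with() are hypothetical names introduced here, and the worker count of 2 is an arbitrary choice.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cube(n):
    """Hypothetical task used only for this illustration."""
    return n ** 3


def run_with(executor_cls, values):
    """Run cube() over values with whichever Executor subclass is passed in."""
    with executor_cls(max_workers=2) as executor:
        futures = [executor.submit(cube, v) for v in values]
        # Collect results in submission order.
        return [f.result() for f in futures]


def main():
    values = [1, 2, 3]
    print(run_with(ThreadPoolExecutor, values))
    print(run_with(ProcessPoolExecutor, values))


if __name__ == "__main__":
    main()
```

Both calls print [1, 8, 27]. The one practical difference to keep in mind is that a process pool has to pickle the function and its arguments to send them to a worker, so they must be picklable.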

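ProcessPoolExecutor: A Concrete Subclass

ProcessPoolExecutor is one of the concrete subclasses of Executor. It uses multiprocessing to provide a pool of processes to which tasks can be submitted. The pool assigns each submitted task to an available process and schedules it to run.

How to Create a ProcessPoolExecutor

With concurrent.futures and its concrete Executor subclass, creating a process pool is straightforward. First construct a ProcessPoolExecutor with the number of worker processes you need; if max_workers is omitted, it defaults to the number of processors on the machine rather than to a fixed value. Then submit the tasks to the pool.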
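As a small sketch of the two ways to size a pool, the following contrasts the default with an explicit max_workers. The double() task and the specific numbers are assumptions made for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def double(n):
    """Hypothetical task used only for this illustration."""
    return 2 * n


def main():
    # With no argument, the pool is sized from the machine's processor count.
    print("processors reported by os.cpu_count():", os.cpu_count())
    with ProcessPoolExecutor() as default_pool:
        print(default_pool.submit(double, 21).result())

    # An explicit max_workers caps the pool at that many processes.
    with ProcessPoolExecutor(max_workers=2) as small_pool:
        print(small_pool.submit(double, 4).result())


if __name__ == "__main__":
    main()
```

An explicit cap is useful when each worker consumes a lot of memory or when the machine is shared with other workloads.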

Example: Using ProcessPoolExecutor

We reuse the example from the thread pool discussion; the only change is that ThreadPoolExecutor is replaced with ProcessPoolExecutor.

main.py

from concurrent.futures import ProcessPoolExecutor
from time import sleep

def task(message):
    # Simulate two seconds of work, then hand the message back.
    sleep(2)
    return message

def main():
    executor = ProcessPoolExecutor(5)
    # submit() returns a Future immediately; the task runs in a worker process.
    future = executor.submit(task, "Completed")
    print(future.done())
    sleep(2)
    print(future.done())
    # result() blocks until the task has finished.
    print(future.result())

if __name__ == '__main__':
    main()

Running the code produces the following output:

False
False
Completed

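Explanation: The example creates a ProcessPoolExecutor with 5 processes and submits one task, which sleeps for 2 seconds and then returns the given message. The first call to done() returns False because the task has only just been submitted. In this run the second call, made after the main process slept for 2 seconds, also returned False, most likely because starting the worker process adds overhead on top of the task's own 2-second sleep. The call to result() on the future object then blocks until the task finishes and returns its value, Completed.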

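Instantiating ProcessPoolExecutor as a Context Manager

Another way to instantiate ProcessPoolExecutor is with a context manager. It works the same as the approach above; the main advantages are more concise, readable code and the fact that the pool is shut down automatically when the block exits.

The statement has the following form, with tasks submitted inside the block:

with ProcessPoolExecutor(max_workers=5) as executor: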
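As a rough sketch of what the context manager saves, the two hypothetical functions below do the same work. The first manages the pool's lifetime by hand with try/finally and shutdown(); the second lets the with statement call shutdown(wait=True) on exit. The greet() task and the function names are assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def greet(name):
    """Hypothetical task used only for this illustration."""
    return "Hello, " + name


def run_with_explicit_shutdown():
    executor = ProcessPoolExecutor(max_workers=2)
    try:
        return executor.submit(greet, "explicit").result()
    finally:
        # Without this call the worker processes linger until interpreter exit.
        executor.shutdown(wait=True)


def run_with_context_manager():
    # Leaving the with block shuts the pool down, even if an exception is raised.
    with ProcessPoolExecutor(max_workers=2) as executor:
        return executor.submit(greet, "context manager").result()


def main():
    print(run_with_explicit_shutdown())
    print(run_with_context_manager())


if __name__ == "__main__":
    main()
```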

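Example: Using ProcessPoolExecutor with a Context Manager

For ease of comparison, we again reuse the example from the thread pool discussion. The script imports the concurrent.futures module and defines a load_url() function that fetches a given URL. It then creates a ProcessPoolExecutor with 5 processes, uses it as a context manager, and retrieves each future's outcome with the result() method.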

main.py

import concurrent.futures
import urllib.request

URLS = [
    'http://www.foxnews.com/',
    'http://www.cnn.com/',
    'http://europe.wsj.com/',
    'http://www.bbc.co.uk/',
    'http://some-made-up-domain.com/'
]

def load_url(url, timeout):
    # Fetch the page and return its raw bytes.
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        # Map each future back to the URL it is fetching.
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        # as_completed() yields futures in the order they finish.
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page is %d bytes' % (url, len(data)))

if __name__ == '__main__':
    main()

Running the script produces the following output:

'http://www.cnn.com/' generated an exception: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1081)>
'http://europe.wsj.com/' generated an exception: cannot pickle 'BufferedReader' instances
'http://some-made-up-domain.com/' page is 56053 bytes
'http://www.bbc.co.uk/' page is 758529 bytes
'http://www.foxnews.com/' page is 783055 bytes

The "cannot pickle" message is specific to process pools: return values and exceptions have to be pickled to travel from the worker back to the parent process, and here the exception raised in the worker likely carried an open response object that cannot be pickled.

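Using Executor.map()

Python's built-in map() applies a function to every element of an iterable. Executor.map() works the same way, except that each call is submitted to the ProcessPoolExecutor as an independent job; the results come back in the same order as the inputs.

The following example shows how it is used.

Example: Using Executor.map()

We reuse the Executor.map() example from the thread pool discussion. Here map() applies the square() function to every number in the values list.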
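A minimal sketch of such a call might look like the following. The contents of VALUES and the max_workers=3 setting are assumptions chosen for illustration, not values taken from the original script.

```python
from concurrent.futures import ProcessPoolExecutor

VALUES = [2, 3, 4, 5]  # assumed sample inputs


def square(n):
    return n * n


def main():
    with ProcessPoolExecutor(max_workers=3) as executor:
        # map() submits one square() call per element and yields the
        # results in input order, regardless of which worker finishes first.
        results = list(executor.map(square, VALUES))
    for value, result in zip(VALUES, results):
        print(value, "squared is", result)


if __name__ == "__main__":
    main()
```

Wrapping the call in list() inside the with block collects every result before the pool shuts down.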


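When to Use ProcessPoolExecutor and When to Use ThreadPoolExecutor

Having covered both Executor subclasses, ThreadPoolExecutor and ProcessPoolExecutor, we need to be clear about when each one applies:

For CPU-bound tasks, choose ProcessPoolExecutor.
For I/O-bound tasks, choose ThreadPoolExecutor.

With ProcessPoolExecutor there is no need to worry about Python's global interpreter lock (GIL), because each worker is a separate process with its own interpreter. For CPU-bound work this typically lets it finish sooner than ThreadPoolExecutor. For I/O-bound work, the extra cost of starting processes and pickling data usually makes threads the better fit.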
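For the I/O-bound half of this rule, a small sketch can show why threads are enough. It uses time.sleep() as a stand-in for a network or disk wait, which is an assumption made for illustration; fake_io() and timed_run() are hypothetical names. A blocking wait releases the GIL, so the waits overlap even inside a single process.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    """Stand-in for an I/O wait: sleep, then return the delay."""
    time.sleep(delay)
    return delay


def timed_run(delays):
    """Run one fake_io() call per delay in its own thread and time the batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        results = list(executor.map(fake_io, delays))
    return results, time.perf_counter() - start


def main():
    delays = [0.2, 0.2, 0.2, 0.2]
    results, elapsed = timed_run(delays)
    print("sum of individual waits: {:.1f}s".format(sum(results)))
    print("wall time with threads:  {:.2f}s".format(elapsed))


if __name__ == "__main__":
    main()
```

The wall time should land close to the longest single wait rather than the sum of all of them, which is the behavior that makes a thread pool a good match for I/O-bound work.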

The two scripts below make the difference concrete.

Example: ProcessPoolExecutor Versus ThreadPoolExecutor

ProcessPoolExecutor version (main.py)

import time
import concurrent.futures

value = [8000000, 7000000]

def counting(n):
    # CPU-bound busy loop; returns how long the countdown took.
    start = time.time()
    while n > 0:
        n -= 1
    return time.time() - start

def main():
    start = time.time()
    # Each countdown runs in its own process, so the two can run in parallel.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for number, time_taken in zip(value, executor.map(counting, value)):
            print('Number: {} Time taken: {}'.format(number, time_taken))
    print('Total time taken: {}'.format(time.time() - start))

if __name__ == '__main__':
    main()

Running the code produces the following output:

Number: 8000000 Time taken: 1.5509998798370361
Number: 7000000 Time taken: 1.3259999752044678
Total time taken: 2.0840001106262207

Example: The ThreadPoolExecutor Version

main.py

import time
import concurrent.futures

value = [8000000, 7000000]

def counting(n):
    # CPU-bound busy loop; returns how long the countdown took.
    start = time.time()
    while n > 0:
        n -= 1
    return time.time() - start

def main():
    start = time.time()
    # Both countdowns run as threads in one process and contend for the GIL.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for number, time_taken in zip(value, executor.map(counting, value)):
            print('Number: {} Time taken: {}'.format(number, time_taken))
    print('Total time taken: {}'.format(time.time() - start))

if __name__ == '__main__':
    main()

Running the code produces the following output:

Number: 8000000 Time taken: 3.8420000076293945
Number: 7000000 Time taken: 3.6010000705718994
Total time taken: 3.8480000495910645

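The outputs of the two programs show the difference for CPU-bound work clearly. With ProcessPoolExecutor the two countdowns took about 1.55 and 1.33 seconds and the whole run took about 2.08 seconds. With ThreadPoolExecutor each countdown took more than 3.6 seconds and the whole run took about 3.85 seconds, because the two threads compete for the GIL instead of running in parallel.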