Multi-GPU Parallel Processing [Task Assignment, Process Scheduling, Resource Management, Load Balancing]

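1. Multi-GPU parallel processing design
Design idea: Process tasks in parallel across multiple GPUs, with each GPU running its own task independently, to speed up overall processing.
Mechanisms:
Process isolation: A separate worker process is created for each GPU with multiprocessing.Process.
GPU resource restriction: Each worker sets the CUDA_VISIBLE_DEVICES environment variable so that it, and the subprocesses it launches, can access only its assigned GPU.
Task mutual exclusion: Each GPU has its own Lock object, and its worker runs tasks one after another (subprocess.run blocks until the task ends), so only one task runs on a given GPU at a time.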
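The GPU restriction above can be sketched as follows. Instead of mutating os.environ inside the worker, this minimal example passes a per-child environment to subprocess.run, which has the same effect on the launched command. The helper name visible_devices_seen_by_child is hypothetical, and the child only echoes the variable, so no GPU is needed to try it.

```python
import os
import subprocess
import sys


def visible_devices_seen_by_child(gpu_id):
    """Launch a child process pinned to one GPU and return what it sees.

    Hypothetical helper for illustration: the child only prints its
    CUDA_VISIBLE_DEVICES value, it does not touch any GPU.
    """
    # Copy the parent's environment and override a single variable,
    # so the parent process itself is left unchanged.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    child_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for gpu_id in range(2):  # assumes two GPUs, as in the script
        print(gpu_id, "->", visible_devices_seen_by_child(gpu_id))
```

A CUDA program started this way enumerates its single visible GPU as device 0, whatever the physical index is.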
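2. Dynamic task assignment and load balancing
Design idea: Place tasks on a shared queue that the workers pull from, so that work is spread evenly and the load stays balanced.
Mechanisms:
Task queue: A shared queue created with Manager().Queue() lets multiple processes put and get tasks safely.
Device ID calculation: The calculate_device_id function takes the MD5 hash of the video and image file paths modulo the number of GPUs to choose the GPU a task is assigned to, which spreads tasks roughly evenly.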
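To see how evenly a hash-based rule spreads work, the sketch below reimplements the idea with the GPU count as an explicit parameter and tallies the assignments for a batch of made-up file names. The file names, the num_gpus parameter, and the assignment_counts helper are assumptions for illustration. The split is only statistically even and is fixed in advance: it does not react to how long individual tasks take.

```python
import hashlib
from collections import Counter


def calculate_device_id(vid_file, img_file, num_gpus):
    """Map a (video, image) pair to a GPU index in [0, num_gpus)."""
    digest = hashlib.md5(f"{vid_file}{img_file}".encode()).hexdigest()
    return int(digest, 16) % num_gpus


def assignment_counts(pairs, num_gpus):
    """Count how many pairs land on each GPU index."""
    return Counter(calculate_device_id(v, i, num_gpus) for v, i in pairs)


if __name__ == "__main__":
    # Made-up file names, purely for illustration
    pairs = [(f"{v}.mp4", f"{i}.jpg") for v in range(20) for i in range(50)]
    print(assignment_counts(pairs, num_gpus=2))
```

Because the hash depends only on the file paths, the same pair always maps to the same index across runs.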
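3. Inter-process communication and synchronization
Design idea: Make communication between processes safe and avoid data races and deadlocks.
Mechanisms:
Atomic task retrieval: A Lock object guards the task retrieval step so that fetching a task is atomic.
Process synchronization: task_queue.join() blocks the main process until every queued task has been marked done with task_queue.task_done(), so the main process does not exit before the subtasks finish.
Graceful exit: None sentinels placed on the queue tell the worker processes that they can exit safely.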
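The join and sentinel handshake described above only works if every retrieved item is matched by a task_done() call. The following minimal sketch shows the pattern using threads and queue.Queue as stand-ins for the worker processes and Manager().Queue(), which proxies a queue.Queue and exposes the same methods. The function names and the squaring "task" are hypothetical.

```python
import queue
import threading


def worker(task_queue, results):
    while True:
        item = task_queue.get()          # blocks until an item is available
        try:
            if item is None:             # sentinel: no more work
                return
            results.append(item * item)  # stand-in for the real task
        finally:
            task_queue.task_done()       # lets task_queue.join() make progress


def run_all(items, num_workers=2):
    task_queue = queue.Queue()
    results = []
    for item in items:
        task_queue.put(item)

    threads = [
        threading.Thread(target=worker, args=(task_queue, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()

    task_queue.join()                    # returns once every task is marked done
    for _ in threads:
        task_queue.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_all([1, 2, 3, 4]))
```

With a blocking get() and one sentinel per worker, the workers never need to poll for Empty. Without the task_done() calls, join() would block forever.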
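4. Exception handling and resource management
Design idea: Handle exceptions and manage resources efficiently.
Mechanisms:
Exception handling: The worker function uses try-except to catch the Empty exception, which covers the case where the queue is empty.
Resource saving: Tasks whose output file already exists are skipped, which avoids reprocessing and saves compute.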
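The skip-if-exists check can be isolated into a small generator, sketched below under the same naming scheme as the script (video stem, image stem, model ID). The helper names output_path_for and pending_jobs are hypothetical, and the demo uses a temporary directory with made-up file names.

```python
import os
import tempfile


def output_path_for(vid_file, img_file, output_dir, model_id):
    """Build the output path '<video>_<image>_<model>.mp4'."""
    vid_stem = os.path.splitext(os.path.basename(vid_file))[0]
    img_stem = os.path.splitext(os.path.basename(img_file))[0]
    return os.path.join(output_dir, f"{vid_stem}_{img_stem}_{model_id}.mp4")


def pending_jobs(video_files, image_files, output_dir, model_id):
    """Yield (video, image, output) for every pair not yet processed."""
    for vid_file in video_files:
        for img_file in image_files:
            out = output_path_for(vid_file, img_file, output_dir, model_id)
            if not os.path.exists(out):
                yield vid_file, img_file, out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        # Pretend that the pair (1.mp4, a.jpg) was processed in an earlier run
        open(os.path.join(out_dir, "1_a_c.mp4"), "w").close()
        for job in pending_jobs(["1.mp4", "2.mp4"], ["a.jpg"], out_dir, "c"):
            print(job)
```

An existence check cannot tell a finished file from one left behind by an interrupted run, so partially written outputs would also be skipped.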
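5. Performance optimization and monitoring
Design idea: Streamline the task-processing flow and give real-time feedback on execution status.
Mechanisms:
Progress monitoring: tqdm.write prints a message to the console as each task starts, giving direct progress feedback.
Efficiency: Sensible task assignment and process design keep all GPUs busy and improve overall throughput.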
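tqdm.write exists so that log lines can be printed without corrupting an active progress bar. When no bar has been created, it acts much like a plain print. The single-process sketch below pairs it with a bar to show the intended use. The run_with_progress helper and the no-op tasks are assumptions for illustration, and output goes to a caller-supplied stream.

```python
import io

from tqdm import tqdm


def run_with_progress(tasks, stream):
    """Run callables one by one, logging each and advancing a progress bar."""
    with tqdm(total=len(tasks), file=stream) as bar:
        for index, task in enumerate(tasks):
            # tqdm.write prints a line without corrupting the active bar
            tqdm.write(f"starting task {index}", file=stream)
            task()
            bar.update(1)


if __name__ == "__main__":
    buffer = io.StringIO()  # capture the output instead of writing to the console
    run_with_progress([lambda: None] * 3, buffer)
    print(buffer.getvalue())
```

In a multi-process setup the bar would have to live in one process, with workers reporting completions back to it.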
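Summary
The key design of this code centers on parallel task processing across multiple GPUs. Process management, resource scheduling, load balancing, and exception handling keep the system efficient and stable, while inter-process communication, synchronization, and the optimizations above further improve overall performance and the user experience.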

# Multi-GPU scheduling
# python multi_swap_10s_v2.py
import os
import subprocess
from tqdm import tqdm
import hashlib
from multiprocessing import Process, Lock, Manager
from queue import Empty  # Raised by get_nowait() when the queue is empty


# Locks for each GPU to ensure only one task runs at a time per GPU
gpu_locks = [Lock(), Lock()]
# A shared queue for all tasks using Manager's Queue
task_queue = Manager().Queue()

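def worker(gpu_id, lock):
    # Restrict this process, and the subprocesses it launches, to one GPU.
    # Note: a CUDA program then sees that single GPU as device 0.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    while True:
        # Fetch the next task while holding the lock
        with lock:
            try:
                cmd = task_queue.get_nowait()
            except Empty:
                # No more tasks available, exit the worker
                break

        if cmd is None:
            # Exit signal from main(): acknowledge it and stop
            task_queue.task_done()
            break

        # Log outside the lock to avoid contention
        tqdm.write(f"GPU {gpu_id} starting task: {' '.join(cmd)}")

        try:
            # Run the task and block until it finishes
            subprocess.run(cmd)
        finally:
            # Mark the task as done so that task_queue.join() in main() can return
            task_queue.task_done()

        # Worker finishes when it exits the loop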

def calculate_device_id(vid_file, img_file):
    # Calculate a hash of the file paths to determine the device ID
    hash_object = hashlib.md5(f"{vid_file}{img_file}".encode())
    hex_dig = hash_object.hexdigest()
    return int(hex_dig, 16) % len(gpu_locks)

def main():
    source_videos_dir = "/home/nvidia/data/video/HDTF/10s"
    source_images_dir = "/home/nvidia/data/image/CelebA-HQ/300/0"
    output_dir = source_images_dir

    video_files_list = [
        os.path.join(source_videos_dir, f)
        for f in os.listdir(source_videos_dir)
        if os.path.isfile(os.path.join(source_videos_dir, f)) and f.endswith('.mp4') and not any(char.isalpha() for char in f.split('.')[0])
    ]

    image_files_list = [
        os.path.join(source_images_dir, f)
        for f in os.listdir(source_images_dir)
        if os.path.isfile(os.path.join(source_images_dir, f)) and f.endswith('.jpg')
    ]

    model_id = 'c'

    # Fill the task queue
    for vid_file in video_files_list:
        for img_file in image_files_list:
            output_video = f"{os.path.splitext(os.path.basename(vid_file))[0]}_{os.path.splitext(os.path.basename(img_file))[0]}_{model_id}.mp4"
            output_video_path = os.path.join(output_dir, output_video)
            
            # Check if the output file already exists
            if not os.path.exists(output_video_path):
                device_id = calculate_device_id(vid_file, img_file)
                cmd = [
                    "python", "multi_face_single_source.py",
                    "--retina_path", "retinaface/RetinaFace-Res50.h5",
                    "--arcface_path", "arcface_model/ArcFace-Res50.h5",
                    "--facedancer_path", "model_zoo/FaceDancer_config_c_HQ.h5",
                    "--vid_path", vid_file,
                    "--swap_source", img_file,
                    "--output", output_video_path,
                    "--compare", "False",
                    "--sample_rate", "1",
                    "--length", "1",
                    "--align_source", "True",
                    "--device_id", str(device_id)
                ]
                task_queue.put(cmd)

    # Create worker processes for each GPU
    workers = []
    for gpu_id in range(len(gpu_locks)):  # Assuming you have 2 GPUs
        p = Process(target=worker, args=(gpu_id, gpu_locks[gpu_id]))
        p.start()
        workers.append(p)

    # Wait for all tasks to be processed
    task_queue.join()

    # Signal workers to exit by adding None to the queue
    # Ensure enough exit signals for all workers
    for _ in workers:
        task_queue.put(None)

    # Wait for all workers to finish
    for p in workers:
        p.join()

if __name__ == '__main__':
    main()

    """
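    In this version I introduced a calculate_device_id function, which computes a hash from the video and image file paths and takes it modulo the number of GPUs to get a device ID.
    This spreads tasks more evenly across the GPUs than relying on list indices alone.
    I also set CUDA_VISIBLE_DEVICES in the worker function. This is not strictly required, but it makes explicit that each worker process sees and uses only the GPU assigned to it, which helps avoid GPU resource conflicts.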
    """