Deploying the Multimodal Video Model Aria in Docker

Motivation

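While browsing Hugging Face I came across a 25.3B-parameter multimodal model that supports both images and video. I happen to have an H20 GPU, so I deployed it to see how well it works. My host machine runs CUDA 12.1, so I deployed in Docker to avoid polluting the host environment. After working through a series of obstacles, such as a Segmentation fault (core dumped) error, I got it running on a single-GPU H20 server with Python 3.10, CUDA 12.4, and Ubuntu 20.04. Image inference uses about 50 GB of GPU memory, and inference on a 5-second video sampled at 20 frames uses about 60 GB.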

Project Overview

rhymes-ai/Aria · Hugging Face

https://github.com/rhymes-ai/Aria

Trying the Online Demo

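The online demo responds quickly and describes the content in detail, including what happens at which point in time. The project introduction lists this capability: "Cutting a long video by scene transitions with timestamps." Isn't that automatic shot splitting? I have a good idea for this, but I'll finish this post first.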

Environment

Docker Environment

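If your host already runs CUDA 12.4 or later, or you can freely upgrade or downgrade CUDA on the host, you can skip this section. Otherwise you will hit the following error: ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/.../.../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12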
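The rule above boils down to a version comparison. The following is a minimal sketch of that check, assuming the required version is 12.4 (inferred from the `__nvJitLinkComplete_12_4` symbol in the error) and that you supply the host version string yourself; the function names are hypothetical and not part of any library.

```python
def parse_cuda_version(text):
    """Turn a version string such as '12.1' or '12.4.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def host_cuda_is_sufficient(host_version, required="12.4"):
    """Return True when the host CUDA version is at least the required one.

    Only major.minor is compared; the 12.4 default is an assumption taken
    from the error message, not an official requirement.
    """
    host = parse_cuda_version(host_version)
    need = parse_cuda_version(required)
    return host[:2] >= need[:2]


# The author's host (12.1) fails the check, so the container route applies.
print(host_cuda_is_sufficient("12.1"))
print(host_cuda_is_sufficient("12.4.1"))
```

If the check returns False and you cannot change the host's CUDA installation, continue with the Docker setup in this section.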


# Add the NVIDIA Docker repository (prerequisite for nvidia-docker2)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
  && curl -fsSL https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install Docker and nvidia-docker
sudo apt-get update
sudo apt-get install -y docker.io
sudo apt-get install -y nvidia-docker2
sudo systemctl start docker
docker --version

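# Configure registry mirrors
# data-root is Docker's data directory; I set it only because my root disk was full, so skip it if you have enough space
sudo vim /etc/docker/daemon.json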
{
        "log-driver": "json-file",
        "log-opts": {
                "max-file": "3",
                "max-size": "10m"
        },
        "registry-mirrors" :[
                "https://hub.rat.dev",
                "https://docker.1panel.live",
                "https://docker.rainbond.cc",
                "https://mirror.ccs.tencentyun.com",
                "http://registry.docker-cn.com",
                "http://docker.mirrors.ustc.edu.cn",
                "http://hub-mirror.c.163.com"
        ],
        "data-root": "/home/docker"
}

# Reload the configuration and restart Docker
sudo systemctl daemon-reload
sudo systemctl restart docker

# Run the cuda:12.4.1 container, choosing which GPU to expose and which path to mount
# The cuda:12.4.1-devel-ubuntu20.04 image includes nvcc and other development tools
docker run -d \
--name aria \
--gpus '"device=3"' \
-v /home:/home \
nvidia/cuda:12.4.1-devel-ubuntu20.04 \
tail -f /dev/null

# Enter the container
docker exec -it aria bash

# Install common tools (a fresh container needs the package index refreshed first)
apt update
apt install -y vim wget git
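A malformed daemon.json can stop Docker from restarting, so it may be worth checking the file before running `systemctl restart docker`. The sketch below is one way to do that with the standard library only. The function name and the specific checks are my own assumptions, not Docker's validation rules.

```python
import json


def check_daemon_config(text):
    """Parse daemon.json text and return a list of problems (empty if none found)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = []
    mirrors = config.get("registry-mirrors", [])
    if not isinstance(mirrors, list):
        problems.append("registry-mirrors must be a list")
    else:
        for url in mirrors:
            # Assumed check: every mirror should be an http(s) URL string.
            if not (isinstance(url, str) and url.startswith(("http://", "https://"))):
                problems.append(f"mirror is not an http(s) URL: {url!r}")

    data_root = config.get("data-root")
    if data_root is not None and not (isinstance(data_root, str) and data_root.startswith("/")):
        problems.append("data-root must be an absolute path")
    return problems


sample = '{"registry-mirrors": ["https://hub.rat.dev"], "data-root": "/home/docker"}'
print(check_daemon_config(sample))
```

In practice you would read the text from /etc/docker/daemon.json and restart Docker only when the returned list is empty.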


# Migrate Docker's data directory
# My disk was full, so I had to move it to another disk; this is a note to myself and you do not need to run it

sudo rsync -aP /var/lib/docker/ /home/docker
docker info | grep "Docker Root Dir"

Conda Environment


# Download Miniconda; some cloud providers cannot reach the Tsinghua mirror, so run whichever one of the two commands works
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

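# Install conda and set up the environment variables; if you let the installer initialize conda automatically, you do not need to edit ~/.bashrc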
sh Miniconda3-latest-Linux-x86_64.sh

# Add the conda initialization block to ~/.bashrc
vim ~/.bashrc

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/xxx/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/xxx/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/xxx/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/xxx/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

# Apply the changes
source ~/.bashrc

Code Environment


# Create the conda environment; Python 3.10 is required, otherwise pip fails with:
# ERROR: Package 'aria' requires a different Python: 3.9.20 not in '>=3.10'
conda create --name aria python=3.10

# Clone the code
git clone https://github.com/rhymes-ai/Aria.git

# Enter the Aria project directory and activate the environment
cd Aria
conda activate aria
pip install -e .  -i https://mirrors.aliyun.com/pypi/simple
pip install grouped_gemm -i https://mirrors.aliyun.com/pypi/simple
pip install flash-attn --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple

Downloading the Model

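The test code can download the model automatically, but I prefer to keep it in a specific directory, so I wrote a download script (saved as download_model.py).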


import argparse
import time
import logging
from huggingface_hub import snapshot_download

# Configure logging
logging.basicConfig(level=logging.INFO)

def download_model(model_name, local_name, max_retries=15, retry_interval=2):
    for attempt in range(1, max_retries + 1):
        try:
            snapshot_download(
                repo_id=model_name,
                ignore_patterns=["*.bin"],
                local_dir=local_name,
                force_download=False
            )
            logging.info("Download successful")
            return
        except Exception as e:
            logging.error(f"Attempt {attempt} failed: {e}")
            if attempt < max_retries:
                time.sleep(retry_interval)
            else:
                logging.critical("Download failed, exceeded maximum retry attempts")

def main():
    parser = argparse.ArgumentParser(description="Download a model from Hugging Face Hub")
    parser.add_argument("--model_name", required=True, help="Name of the model to download")
    parser.add_argument("--local_name", required=True, help="Local directory to save the model")
    args = parser.parse_args()

    download_model(args.model_name, args.local_name)

if __name__ == "__main__":
    main()


# Use a mirror to speed up downloads from mainland China
export HF_ENDPOINT=https://hf-mirror.com

# Run directly from the command line; install any missing dependencies manually
python download_model.py \
--model_name rhymes-ai/Aria \
--local_name /home/models/huggingface/rhymes-ai/Aria

# Running under nohup is recommended (xxxxx stands for the python command above)
export HF_ENDPOINT=https://hf-mirror.com && nohup xxxxx >> download.log 2>&1 &

Image Test

Code


import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Local path of the downloaded model
model_id_or_path = "/home/models/huggingface/rhymes-ai/Aria"

model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)

processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

# Replace this with the URL of your own image
image_path = "https://m207605830-1.jpg"

image = Image.open(requests.get(image_path, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)

Result

Video Test

Code


import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import time
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "/home/models/huggingface/rhymes-ai/Aria"
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

# These imports must come after the model is loaded; otherwise the script fails with Segmentation fault (core dumped)
from decord import VideoReader
from tqdm import tqdm
from typing import List

def load_video(video_file, num_frames=128, cache_dir="/home/lzy/cached_video_frames", verbosity="DEBUG"):
    # Create cache directory if it doesn't exist
    os.makedirs(cache_dir, exist_ok=True)
    video_basename = os.path.basename(video_file)
    cache_subdir = os.path.join(cache_dir, f"{video_basename}_{num_frames}")
    os.makedirs(cache_subdir, exist_ok=True)
    cached_frames = []
    missing_frames = []
    frame_indices = []
    for i in range(num_frames):
        frame_path = os.path.join(cache_subdir, f"frame_{i}.jpg")
        if os.path.exists(frame_path):
            cached_frames.append(frame_path)
        else:
            missing_frames.append(i)
            frame_indices.append(i)
    vr = VideoReader(video_file)
    duration = len(vr)
    fps = vr.get_avg_fps()
    frame_timestamps = [int(duration / num_frames * (i + 0.5)) / fps for i in range(num_frames)]
    if verbosity == "DEBUG":
        print(
            "Already cached {}/{} frames for video {}, enjoy speed!".format(len(cached_frames), num_frames, video_file))
    # If all frames are cached, load them directly
    if not missing_frames:
        return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps
    actual_frame_indices = [int(duration / num_frames * (i + 0.5)) for i in missing_frames]
    missing_frames_data = vr.get_batch(actual_frame_indices).asnumpy()
    for idx, frame_index in enumerate(tqdm(missing_frames, desc="Caching rest frames")):
        img = Image.fromarray(missing_frames_data[idx]).convert("RGB")
        frame_path = os.path.join(cache_subdir, f"frame_{frame_index}.jpg")
        img.save(frame_path)
        cached_frames.append(frame_path)
    cached_frames.sort(key=lambda x: int(os.path.basename(x).split('_')[1].split('.')[0]))
    return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps

def get_placeholders_for_videos(frames: List, timestamps=None):
    contents = []
    if not timestamps:
        for i, _ in enumerate(frames):
            contents.append({"text": None, "type": "image"})
        contents.append({"text": "\n", "type": "text"})
    else:
        for i, (_, ts) in enumerate(zip(frames, timestamps)):
            contents.extend(
                [
                    {"text": f"[{int(ts) // 60:02d}:{int(ts) % 60:02d}]", "type": "text"},
                    {"text": None, "type": "image"},
                    {"text": "\n", "type": "text"}
                ]
            )
    return contents

video_extensions = ('.mp4', '.avi', '.mov')
for root, _, files in os.walk("/home/"):
    for file in files:
        if file.endswith(video_extensions):
            video_path = os.path.join(root, file)
            frames, frame_timestamps = load_video(video_path, num_frames=20)
            ### If you want to insert timestamps for Aria Inputs
            contents = get_placeholders_for_videos(frames, frame_timestamps)
            ### If you DO NOT want to insert frame timestamps for Aria Inputs
            # contents = get_placeholders_for_videos(frames)
            start = time.time()
            messages = [
                {
                    "role": "user",
                    "content": [
                        *contents,
                        {
                            "text": "Describe the video",
                            "type": "text"},
                    ],
                }
            ]

            text = processor.apply_chat_template(messages, add_generation_prompt=True)
            inputs = processor(text=text, images=frames, return_tensors="pt", max_image_size=980)
            inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
            inputs = {k: v.to(model.device) for k, v in inputs.items()}

            with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
                output = model.generate(
                    **inputs,
                    max_new_tokens=2048,
                    stop_strings=["<|im_end|>"],
                    tokenizer=processor.tokenizer,
                    do_sample=False,
                )
                output_ids = output[0][inputs["input_ids"].shape[1]:]
                result = processor.decode(output_ids, skip_special_tokens=True)

            print(result)
            print(time.time() - start)

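I analyze every video under /home/; to analyze a single video, adjust the loop accordingly.
max_image_size can be changed to 490.
Choose num_frames to suit your own videos. Mine are 5 seconds long and I sample 20 frames, which works out to 4 frames per second.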
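To make the num_frames choice concrete, here is a small sketch that reproduces the index and timestamp arithmetic from load_video and the [MM:SS] label format from get_placeholders_for_videos, without needing decord or a real file. The function name sample_frames and the 30 fps source rate are assumptions for illustration only.

```python
def sample_frames(total_frames, fps, num_frames):
    """Pick num_frames evenly spaced frame indices, as load_video does.

    Each index sits at the midpoint of its segment. Timestamps are in
    seconds and labels use the same [MM:SS] format as the prompt.
    """
    indices = [int(total_frames / num_frames * (i + 0.5)) for i in range(num_frames)]
    timestamps = [idx / fps for idx in indices]
    labels = [f"[{int(ts) // 60:02d}:{int(ts) % 60:02d}]" for ts in timestamps]
    return indices, timestamps, labels


# Hypothetical clip: 5 seconds at 30 fps, i.e. 150 frames, sampled at 20 frames.
indices, timestamps, labels = sample_frames(total_frames=150, fps=30.0, num_frames=20)
print(indices[:4], labels[0], labels[-1])
print("frames per second of video:", 20 / 5)
```

Because the labels have one-second resolution, sampling 4 frames per second means about four consecutive frames share the same label. Longer videos with the same num_frames spread the frames further apart.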

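Result

Summary

Aria's GPU memory usage is acceptable, around 60 GB; it appears to use attn_implementation="flash_attention_2" by default.
Compared with Qwen and CPM, Aria can cut a long video by scene transitions with timestamps.
The core dumped error goes away once the imports are reordered so that decord is imported after the model is loaded.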
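The import-order workaround can also be made explicit in code rather than relying on where the import lines sit in the file. The sketch below is only an illustration of that pattern: the helper name is hypothetical, and it uses a fake loader and a standard-library module as stand-ins so that it runs without a GPU or decord.

```python
import importlib


def load_model_then_import(load_model, module_names):
    """Call load_model() first, then import the named modules, in that order."""
    model = load_model()
    modules = {name: importlib.import_module(name) for name in module_names}
    return model, modules


# Stand-ins so the sketch runs anywhere: a fake loader and a stdlib module.
events = []


def fake_loader():
    events.append("model loaded")
    return "fake-model"


model, modules = load_model_then_import(fake_loader, ["json"])
events.append("modules imported")
print(events)
```

In the real script, the loader would wrap the AutoModelForCausalLM.from_pretrained call and the module list would contain "decord".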
