LongVU :Meta AI 的解锁长视频理解模型,利用自适应时空压缩技术彻底改变视频理解方式

Meta AI在视频理解方面取得了令人瞩目的里程碑式成就,推出了LongVU,这是一种开创性的模型,能够理解以前对人工智能系统来说具有挑战性的长视频。 研究论文 "LongVU:用于长视频语言理解的时空自适应压缩 "提出了一种革命性的方法,使人工智能能够有效地处理和理解长达几分钟甚至一小时的视频,而这在以前是无法实现的。

多模态大语言模型(MLLM)在理解和分析视频内容方面取得了可喜的进展。 然而,受限于给定的上下文长度,处理长视频仍然是一项重大挑战。 为了解决这一限制,我们提出了一种时空自适应压缩机制 LongVU,以减少视频标记的数量,同时保留长视频的视觉细节。 我们的想法是利用跨模态查询和帧间依赖关系,自适应地减少视频中的时空冗余。 具体来说,我们利用 DINOv2 特征来删除相似度高的冗余帧。 然后,我们利用文本引导的跨模态查询来选择性地减少帧特征。 此外,我们还根据帧与帧之间的时间依赖关系,对帧进行空间标记缩减。 我们的自适应压缩策略在有限的上下文长度内有效地处理了大量帧,几乎没有损失任何视觉信息。 在各种视频理解基准测试中,我们的 LongVU 始终超越现有方法,尤其是在长达一小时的视频理解任务(如 VideoMME 和 MLVU)中。 在轻量级 LLM 的情况下,我们的 LongVU 还能有效地扩展到更小的规模,并具有最先进的视频理解性能。

LongVU 架构

LongVU 的结构。 给定一个密集采样的视频帧,我们首先利用 DINOv2 去除冗余帧,然后融合 SigLIP 和 DINOv2 的剩余帧特征。 然后,我们通过跨模态查询有选择地减少视觉标记。 最后,我们基于时间依赖性进行空间标记压缩,以进一步满足 LLM 的有限上下文长度。



示例

python 复制代码
# git clone https://github.com/Vision-CAIR/LongVU
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)

model.eval()
video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps),)])
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

Github:https://github.com/Vision-CAIR/LongVU

如何 24GB VRAM 运行

https://github.com/Vision-CAIR/LongVU/issues/6

python 复制代码
# git clone https://github.com/Vision-CAIR/LongVU
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

tokenizer, model, image_processor, context_len = load_pretrained_model(
    "Vision-CAIR/LongVU_Qwen2_7B", 
    model_base=None,
    model_name="cambrian_qwen",
    device="cuda:0"
)

model.eval()
video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
# frame_indices = np.array([i for i in range(0, len(vr), round(fps),)])
num_frames = 1000 if len(vr) > 1000 else len(vr)
frame_indices = np.array([i for i in range(0, num_frames, round(fps),)])

video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
# with torch.inference_mode():
#     output_ids = model.generate(
#         input_ids,
#         images=video,
#         image_sizes=image_sizes,
#         do_sample=False,
#         temperature=0.2,
#         max_new_tokens=128,
#         use_cache=True,
#         stopping_criteria=[stopping_criteria],
#     )
attention_mask = torch.ones_like(input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        images=video,
        image_sizes=image_sizes,
        do_sample=True,
        temperature=0.2,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

输出:

'The video begins with a scene featuring two characters in an animated setting, one dressed in a bright yellow and red outfit with a mask, and the other in a blue and white traditional robe, standing on a rocky terrain with a green, leaf-like structure and a mountainous backdrop. The character in the yellow and red outfit is seen making a gesture with their right hand, while the other character appears to be speaking or reacting to the first character. The scene then transitions to a misty, ethereal environment where the same two characters are now standing on a staircase leading to a building with a golden roof, surrounded by smoke or clouds. The character in the yellow and red outfit is now holding a sword, while the other character is holding a fan, and both are looking up at the building. The scene shifts again to a large, ornate building with a golden roof, where a figure in a white and red outfit is seen descending a staircase, with smaller figures in white and red attire standing on the steps, and a large, white, cloud-like object in the foreground. The final scene shows the same building with the figure in white and red now seated on a golden throne, surrounded by smaller figures in white and red, and a large, white, cloud-like object still in the foreground, suggesting a ceremonial or significant event taking place.'

相关推荐
Morning的呀1 天前
Class48 GRU
人工智能·深度学习·gru
拾零吖1 天前
李宏毅 Deep Learning
人工智能·深度学习·机器学习
华芯邦1 天前
广东充电芯片助力新能源汽车车载系统升级
人工智能·科技·车载系统·汽车·制造
时空无限1 天前
说说transformer 中的掩码矩阵以及为什么能掩盖住词语
人工智能·矩阵·transformer
查里王1 天前
AI 3D 生成工具知识库:当前产品格局与测评总结
人工智能·3d
武子康1 天前
AI-调查研究-76-具身智能 当机器人走进生活:具身智能对就业与社会结构的深远影响
人工智能·程序人生·ai·职场和发展·机器人·生活·具身智能
小鹿清扫日记1 天前
从蛮力清扫到 “会看路”:室外清洁机器人的文明进阶
人工智能·ai·机器人·扫地机器人·具身智能·连合直租·有鹿巡扫机器人
fanstuck1 天前
Prompt提示工程上手指南(六):AI避免“幻觉”(Hallucination)策略下的Prompt
人工智能·语言模型·自然语言处理·nlp·prompt
zhangfeng11331 天前
win7 R 4.4.0和RStudio1.25的版本兼容性以及系统区域设置有关 导致Plots绘图面板被禁用,但是单独页面显示
开发语言·人工智能·r语言·生物信息
DogDaoDao1 天前
神经网络稀疏化设计构架方法和原理深度解析
人工智能·pytorch·深度学习·神经网络·大模型·剪枝·网络稀疏