LongVU: Meta AI Unlocks Long Video Understanding with Adaptive Spatiotemporal Compression

Meta AI has reached a remarkable milestone in video understanding with LongVU, a pioneering model that can make sense of long videos that used to defeat AI systems. The research paper "LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding" introduces an approach that lets AI efficiently process and understand videos lasting several minutes or even an hour, something that was previously out of reach.

Multimodal large language models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context length. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce spatiotemporal redundancy in videos. Specifically, we use DINOv2 features to remove redundant frames that exhibit high similarity. We then apply a text-guided cross-modal query to selectively reduce frame features. In addition, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames within a limited context length, with almost no loss of visual information. Across a variety of video understanding benchmarks, LongVU consistently surpasses existing methods, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
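
To make the temporal reduction step concrete, here is a minimal sketch of similarity-based frame pruning. It is not LongVU's actual code: the pooled per-frame features are assumed to come from a DINOv2-style encoder (random tensors stand in for them below), and prune_redundant_frames and sim_threshold are illustrative names.

python

import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_feats: torch.Tensor, sim_threshold: float = 0.9):
    """Keep a frame only if it differs enough from the last kept frame.

    frame_feats: (num_frames, feat_dim) pooled per-frame features, assumed
    to come from a DINOv2-style encoder. Returns indices of kept frames.
    """
    feats = F.normalize(frame_feats, dim=-1)
    kept = [0]  # always keep the first frame
    for i in range(1, feats.shape[0]):
        # cosine similarity to the most recently kept frame
        sim = (feats[i] @ feats[kept[-1]]).item()
        if sim < sim_threshold:
            kept.append(i)
    return kept

# Toy demo with random features standing in for DINOv2 outputs.
feats = torch.randn(16, 384)
feats[4] = feats[3] + 0.01 * torch.randn(384)  # make frame 4 a near-duplicate
print(prune_redundant_frames(feats))  # frame 4 is dropped, the rest survive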

LongVU Architecture

The architecture of LongVU. Given densely sampled video frames, we first use DINOv2 to remove redundant frames and fuse the remaining frame features from SigLIP and DINOv2. We then selectively reduce visual tokens via a cross-modal query. Finally, we perform spatial token compression based on temporal dependencies to further fit the LLM's limited context length.
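
The cross-modal step can be sketched in the same spirit. The snippet below is a simplification, not LongVU's real module: visual tokens are scored by cosine similarity against a pooled embedding of the user's question, and only the top fraction per frame is kept; select_tokens_by_query and keep_ratio are hypothetical names.

python

import torch
import torch.nn.functional as F

def select_tokens_by_query(vis_tokens, text_query, keep_ratio=0.25):
    """Score visual tokens against a text query and keep the top fraction.

    vis_tokens: (num_frames, tokens_per_frame, dim) frame features
    text_query: (dim,) pooled embedding of the question
    Returns indices of the kept tokens, per frame.
    """
    t = F.normalize(vis_tokens, dim=-1)
    q = F.normalize(text_query, dim=-1)
    scores = t @ q                            # (num_frames, tokens_per_frame)
    k = max(1, int(keep_ratio * vis_tokens.shape[1]))
    return scores.topk(k, dim=-1).indices     # most query-relevant tokens

# Toy demo: 8 frames x 196 tokens x 512 dims, one pooled text embedding.
kept = select_tokens_by_query(torch.randn(8, 196, 512), torch.randn(512))
print(kept.shape)  # torch.Size([8, 49])

The spatial step can be pictured the same way; again this is a hedged sketch rather than the paper's exact mechanism. Within a short window, each frame's tokens are compared position-wise against the window's first (anchor) frame, and tokens that barely changed are dropped because the anchor already carries that information.

python

import torch
import torch.nn.functional as F

def spatial_reduce(window_tokens, sim_threshold=0.95):
    """Drop tokens nearly identical to the same position in the window's
    first (anchor) frame; the anchor itself keeps all of its tokens.

    window_tokens: (window_len, tokens_per_frame, dim)
    Returns per-frame boolean masks of the tokens to keep.
    """
    t = F.normalize(window_tokens, dim=-1)
    # position-wise cosine similarity of each later frame to the anchor
    sims = (t[1:] * t[0]).sum(dim=-1)          # (window_len - 1, tokens)
    masks = torch.ones(window_tokens.shape[:2], dtype=torch.bool)
    masks[1:] = sims < sim_threshold           # keep only changed tokens
    return masks

# Toy demo: a mostly static window keeps few tokens outside the anchor.
win = torch.randn(1, 196, 512).repeat(4, 1, 1)
win[2, :10] += torch.randn(10, 512)            # 10 tokens change in frame 2
print(spatial_reduce(win).sum(dim=-1))         # ~tensor([196, 0, 10, 0])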



Example

python
# git clone https://github.com/Vision-CAIR/LongVU
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

# Load the LongVU (Qwen2-7B backbone) checkpoint from a local directory.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)

model.eval()
video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

# Decode the video on the CPU and sample roughly one frame per second.
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.arange(0, len(vr), round(fps))
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()  # one decoded RGB frame as a numpy array
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]  # original frame (height, width)
# Preprocess frames for the vision encoders and add a batch dimension.
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

# Prepend the image placeholder token and build the chat prompt.
qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Tokenize, mapping the image placeholder to IMAGE_TOKEN_INDEX, and stop
# generation at the conversation separator.
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,  # greedy decoding; the temperature below is then ignored
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
# Decode the generated ids into the final text answer.
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
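
Note that with do_sample=False decoding is greedy, so the temperature=0.2 argument has no effect (recent transformers releases warn about exactly this combination); the low-VRAM variant below switches to sampling so the temperature actually matters.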

GitHub: https://github.com/Vision-CAIR/LongVU

Running it on 24 GB of VRAM

The adapted script below follows the discussion in https://github.com/Vision-CAIR/LongVU/issues/6.

python
# git clone https://github.com/Vision-CAIR/LongVU
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

# Load the checkpoint directly from the Hugging Face Hub onto a single GPU.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "Vision-CAIR/LongVU_Qwen2_7B",
    model_base=None,
    model_name="cambrian_qwen",
    device="cuda:0",
)

model.eval()
video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
# Cap decoding at the first 1,000 frames to bound GPU memory usage,
# still sampling roughly one frame per second.
num_frames = min(len(vr), 1000)
frame_indices = np.arange(0, num_frames, round(fps))

video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
# Compared with the example above: pass an explicit attention mask and
# pad_token_id (silencing transformers warnings), enable sampling, and
# allow a longer answer.
attention_mask = torch.ones_like(input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        images=video,
        image_sizes=image_sizes,
        do_sample=True,
        temperature=0.2,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
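
Compared with the full example, this version loads the checkpoint straight from the Hugging Face Hub, uses at most the first 1,000 decoded frames (still sampled at about one frame per second), which bounds the number of visual tokens and therefore GPU memory, and passes an explicit attention_mask and pad_token_id to avoid transformers warnings. With do_sample=True, the temperature=0.2 setting now takes effect, and max_new_tokens=512 allows a longer answer.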

Output:

'The video begins with a scene featuring two characters in an animated setting, one dressed in a bright yellow and red outfit with a mask, and the other in a blue and white traditional robe, standing on a rocky terrain with a green, leaf-like structure and a mountainous backdrop. The character in the yellow and red outfit is seen making a gesture with their right hand, while the other character appears to be speaking or reacting to the first character. The scene then transitions to a misty, ethereal environment where the same two characters are now standing on a staircase leading to a building with a golden roof, surrounded by smoke or clouds. The character in the yellow and red outfit is now holding a sword, while the other character is holding a fan, and both are looking up at the building. The scene shifts again to a large, ornate building with a golden roof, where a figure in a white and red outfit is seen descending a staircase, with smaller figures in white and red attire standing on the steps, and a large, white, cloud-like object in the foreground. The final scene shows the same building with the figure in white and red now seated on a golden throne, surrounded by smaller figures in white and red, and a large, white, cloud-like object still in the foreground, suggesting a ceremonial or significant event taking place.'
