Computer Vision Arxiv Daily 2025.02.07

1. Image Processing

001 ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms.

002 Variational Control for Guidance in Diffusion Models

We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear and (blind) non-linear inverse problems without requiring additional model training or modifications.

003 CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding.

004 Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More

In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models.

005 PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models

We present the first text-based image editing approach for object parts based on pre-trained diffusion models. To address this, we propose to expand the knowledge of pre-trained diffusion models to allow them to understand various object parts, enabling them to perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step to localize the editing region.

2. Video Processing

001 UniForm: A Unified Diffusion Transformer for Audio-Video Generation

we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs

002 Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression

We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluation in scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining computational efficiency to run in real time.

3. 3D Processing

4. LLM & VLM

001 The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

In this paper, we investigate the internal dynamics of hallucination by examining the tokens logits rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss -- visually grounded tokens gradually become less favored throughout generation, and (2) early excitation -- semantically meaningful tokens achieve peak activation in the layers earlier than the final layer. (3) hidden genuine information -- visually grounded tokens though not being eventually decided still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information.

002 Efficient Few-Shot Continual Learning in Vision-Language Models

we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency.

5. Embodied AI

6. Autonomous Driving

7. Dataset

8. Survey / Book

001 Generative Adversarial Networks Bridging Art and Machine Intelligence

This book begins with a detailed introduction to the fundamental principles and historical development of GANs, contrasting them with traditional generative models and elucidating the core adversarial mechanisms through illustrative Python examples.

相关推荐
俊哥V几秒前
阿里通义千问发布全模态开源大模型Qwen2.5-Omni-7B
人工智能·ai
果冻人工智能6 分钟前
每一条广告都只为你而生: 用 人工智能 颠覆广告行业的下一步
人工智能
掘金安东尼10 分钟前
GPT-4.5 被 73% 的人误认为人类,“坏了?!我成替身了!”
人工智能·程序员
掘金一周42 分钟前
金石焕新程 >> 瓜分万元现金大奖征文活动即将回归 | 掘金一周 4.3
前端·人工智能·后端
白雪讲堂1 小时前
AI搜索品牌曝光资料包(精准适配文心一言/Kimi/DeepSeek等场景)
大数据·人工智能·搜索引擎·ai·文心一言·deepseek
斯汤雷1 小时前
Matlab绘图案例,设置图片大小,坐标轴比例为黄金比
数据库·人工智能·算法·matlab·信息可视化
ejinxian1 小时前
Spring AI Alibaba 快速开发生成式 Java AI 应用
java·人工智能·spring
程序员Linc1 小时前
边缘检测技术现状初探2:多尺度与形态学方法
计算机视觉·边缘检测·形态学
葡萄成熟时_1 小时前
【第十三届“泰迪杯”数据挖掘挑战赛】【2025泰迪杯】【代码篇】A题解题全流程(持续更新)
人工智能·数据挖掘