Computer Vision Arxiv Daily 2025.02.07

1. Image Processing

001 ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms.
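As a rough illustration of what a projection in the attention output space could look like, the sketch below scores each image patch by the dot product between its output vector and a concept's output vector, then reshapes the scores into a 2D map. This is only one reading of the summary above, not the authors' implementation: the function name `saliency_map`, the toy vectors, and the row-major grid layout are all assumptions made for illustration.

```python
def saliency_map(patch_outputs, concept_output, grid_width):
    """Score each patch against one concept and lay the scores out as a grid.

    patch_outputs: list of per-patch vectors (hypothetical attention outputs).
    concept_output: one vector for the concept (hypothetical attention output).
    grid_width: number of patches per image row (row-major order assumed).
    """
    # Dot product between each patch vector and the concept vector.
    scores = [
        sum(p * c for p, c in zip(patch, concept_output))
        for patch in patch_outputs
    ]
    # Reshape the flat list of scores into rows of the patch grid.
    return [scores[i:i + grid_width] for i in range(0, len(scores), grid_width)]


# Toy 2x2 patch grid with 2-dimensional vectors (made-up values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
concept = [1.0, 0.0]
print(saliency_map(patches, concept, grid_width=2))
```

In this toy case, patches whose vectors align with the concept vector receive higher scores, which is the behavior a saliency map is meant to show.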

002 Variational Control for Guidance in Diffusion Models

We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear and (blind) non-linear inverse problems without requiring additional model training or modifications.

003 CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

In this work, we investigate why CLIP exhibits bag-of-words (BoW)-like behavior. We find that the correct attribute-object binding information is already present within the individual text and image modalities; the issue lies instead in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP (LABCLIP), which applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to the correct objects, thereby enhancing its compositional understanding.
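The scoring step described above can be sketched in a few lines: transform the text embedding with a matrix, then take cosine similarity with the image embedding. This is a minimal sketch under stated assumptions, not the paper's code. The names `matvec`, `cosine`, and `lab_score` are hypothetical, the embeddings are toy 2-dimensional vectors, and the matrix `W` is hand-picked here, whereas in practice it would be learned.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lab_score(image_emb, text_emb, W):
    """Cosine similarity after linearly transforming the text embedding."""
    return cosine(image_emb, matvec(W, text_emb))


image_emb = [1.0, 0.0]          # toy image embedding (assumption)
text_emb = [0.0, 1.0]           # toy text embedding (assumption)
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]  # hand-picked stand-in for a learned matrix

print(lab_score(image_emb, text_emb, identity))  # same as plain cosine
print(lab_score(image_emb, text_emb, swap))
```

With the identity matrix the score reduces to ordinary cosine similarity, so the transformation only adds capacity on top of the standard CLIP scoring rule.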

004 Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More

In this work, we thoroughly examine the information loss caused by the patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and observe an intriguing scaling law in patchification: models consistently benefit from decreased patch sizes and attain improved predictive performance until the patch size reaches its minimum of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models.
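The token count in the title follows from simple arithmetic, sketched below. The 224x224 input resolution is an assumption (it is not stated in the summary, but 224 x 224 = 50,176 matches the title), and `num_tokens` is a hypothetical helper name.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches covering a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size evenly")
    per_side = image_size // patch_size
    return per_side * per_side


# Sequence length grows quadratically as the patch size shrinks.
for patch in (16, 8, 4, 2, 1):
    print(f"patch {patch}x{patch}: {num_tokens(224, patch)} tokens")
# patch 16x16: 196 tokens
# patch 8x8: 784 tokens
# patch 4x4: 3136 tokens
# patch 2x2: 12544 tokens
# patch 1x1: 50176 tokens
```

Halving the patch size quadruples the sequence length, which is why pixel-level tokenization is costly for architectures whose compute grows quickly with sequence length.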

005 PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models

We present the first text-based image editing approach for object parts based on pre-trained diffusion models. We propose to expand the knowledge of pre-trained diffusion models so that they understand various object parts and can perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step, which localize the editing region.

2. Video Processing

001 UniForm: A Unified Diffusion Transformer for Audio-Video Generation

We propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs.

002 Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression

We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics, with the goal of generating high-quality data and supporting evaluation when scaling robot learning. Building interactive video world models and policies for robotics is difficult because it requires handling diverse settings while remaining computationally efficient enough to run in real time.

3. 3D Processing

4. LLM & VLM

001 The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

In this paper, we investigate the internal dynamics of hallucination by examining token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss: visually grounded tokens gradually become less favored throughout generation; (2) early excitation: semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information: visually grounded tokens, though not ultimately selected, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information.

002 Efficient Few-Shot Continual Learning in Vision-Language Models

We propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency.

5. Embodied AI

6. Autonomous Driving

7. Dataset

8. Survey / Book

001 Generative Adversarial Networks Bridging Art and Machine Intelligence

This book begins with a detailed introduction to the fundamental principles and historical development of GANs, contrasting them with traditional generative models and elucidating the core adversarial mechanisms through illustrative Python examples.
