Computer Vision Arxiv Daily 2025.02.07

1. Image Processing

001 ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms.

002 Variational Control for Guidance in Diffusion Models

We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear and (blind) non-linear inverse problems without requiring additional model training or modifications.

003 CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding.

004 Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More

In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models.

005 PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models

We present the first text-based image editing approach for object parts based on pre-trained diffusion models. To address this, we propose to expand the knowledge of pre-trained diffusion models to allow them to understand various object parts, enabling them to perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step to localize the editing region.

2. Video Processing

001 UniForm: A Unified Diffusion Transformer for Audio-Video Generation

we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs

002 Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression

We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluation in scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining computational efficiency to run in real time.

3. 3D Processing

4. LLM & VLM

001 The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

In this paper, we investigate the internal dynamics of hallucination by examining the tokens logits rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss -- visually grounded tokens gradually become less favored throughout generation, and (2) early excitation -- semantically meaningful tokens achieve peak activation in the layers earlier than the final layer. (3) hidden genuine information -- visually grounded tokens though not being eventually decided still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information.

002 Efficient Few-Shot Continual Learning in Vision-Language Models

we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency.

5. Embodied AI

6. Autonomous Driving

7. Dataset

8. Survey / Book

001 Generative Adversarial Networks Bridging Art and Machine Intelligence

This book begins with a detailed introduction to the fundamental principles and historical development of GANs, contrasting them with traditional generative models and elucidating the core adversarial mechanisms through illustrative Python examples.

相关推荐
IT_陈寒1 分钟前
React性能优化:这5个Hook技巧让我的组件渲染效率提升50%(附代码对比)
前端·人工智能·后端
Captaincc3 分钟前
9 月 20 日,TRAE Meetup@Guangzhou 相聚羊城
人工智能·后端
霍格沃兹软件测试开发18 分钟前
快速掌握Dify+Chrome MCP:打造网页操控AI助手
人工智能·chrome·dify·mcp
张子夜 iiii28 分钟前
4步OpenCV-----扫秒身份证号
人工智能·python·opencv·计算机视觉
华新嘉华DTC创新营销2 小时前
华新嘉华:AI搜索优化重塑本地生活行业:智能推荐正取代“关键词匹配”
人工智能·百度·生活
SmartBrain3 小时前
DeerFlow 实践:华为IPD流程的评审智能体设计
人工智能·语言模型·架构
l1t4 小时前
利用DeepSeek实现服务器客户端模式的DuckDB原型
服务器·c语言·数据库·人工智能·postgresql·协议·duckdb
寒月霜华5 小时前
机器学习-数据标注
人工智能·机器学习
paid槮6 小时前
机器视觉之图像处理篇
图像处理·opencv·计算机视觉
九章云极AladdinEdu6 小时前
超参数自动化调优指南:Optuna vs. Ray Tune 对比评测
运维·人工智能·深度学习·ai·自动化·gpu算力