Computer Vision arXiv Daily 2025.01.14

1. Image Processing

1-001 MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. We propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images.

1-002 Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models

Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. We propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of large vision-language models (LVLMs), creating a robust system for zero-shot ATR of novel classes and unknown domains.

1-003 Enhancing Image Generation Fidelity via Progressive Prompts

We propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize a powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control.

1-004 Personalized Preference Fine-tuning of Diffusion Models

We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users.

1-005 Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution

We overcome these challenges by developing two novel techniques. Firstly, to generalize Gaussian Splatting (GS) for arbitrary-scale super-resolution (ASR), we elaborately design an architecture to predict the corresponding image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted contiguous Gaussians.
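The idea of sampling discrete RGB values from Gaussians at any output size can be illustrated with a small sketch. This is not the paper's rasterizer: the isotropic Gaussians, the normalized blending, the `render` function, and the hand-picked Gaussian parameters are all assumptions made for illustration, and everything runs on the CPU in plain Python.

```python
import math

def render(gaussians, height, width):
    """Sample an RGB image of any size from 2D Gaussians defined on [0, 1]^2.

    Each Gaussian is (cx, cy, sigma, (r, g, b)). Isotropic Gaussians and
    normalised blending are simplifying assumptions, not the paper's design.
    """
    image = []
    for row in range(height):
        y = (row + 0.5) / height  # pixel centre in normalised coordinates
        line = []
        for col in range(width):
            x = (col + 0.5) / width
            total = 0.0
            rgb = [0.0, 0.0, 0.0]
            for cx, cy, sigma, color in gaussians:
                dist_sq = (x - cx) ** 2 + (y - cy) ** 2
                weight = math.exp(-dist_sq / (2 * sigma ** 2))
                total += weight
                for c in range(3):
                    rgb[c] += weight * color[c]
            # Divide by the weight sum; fall back to black if it is zero.
            line.append(tuple(v / total if total > 0 else 0.0 for v in rgb))
        image.append(line)
    return image

# Hypothetical Gaussians; in the paper they are predicted from the input image.
gaussians = [
    (0.25, 0.5, 0.2, (1.0, 0.0, 0.0)),  # red blob on the left
    (0.75, 0.5, 0.2, (0.0, 0.0, 1.0)),  # blue blob on the right
]
low = render(gaussians, 4, 4)
high = render(gaussians, 10, 10)  # scale factor 2.5, not an integer
```

Because the Gaussians live in normalized coordinates, the same set can be rendered at 4×4 or 10×10 without retraining or resampling, which is the property that makes the representation suit arbitrary scale factors.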

2. Video Processing

2-001 Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss

In this paper, we address the challenge of generating temporally consistent videos with motion guidance. We propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss, the latter being our key innovation. Specifically, we capture the inter-frame feature correlation patterns of intermediate features from a video diffusion model to represent the motion pattern of the reference video.
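A minimal sketch of the inter-frame correlation idea follows. It assumes cosine similarity as the correlation measure and mean squared error as the loss; the abstract specifies neither. The toy feature lists stand in for intermediate features of a video diffusion model, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def frame_correlation(frame_a, frame_b):
    """Correlation between every location of frame_a and every location of frame_b."""
    return [[cosine(u, v) for v in frame_b] for u in frame_a]

def motion_pattern(video):
    """One correlation matrix per pair of consecutive frames."""
    return [frame_correlation(video[t], video[t + 1]) for t in range(len(video) - 1)]

def motion_consistency_loss(reference, generated):
    """Mean squared difference between the two videos' correlation patterns."""
    diffs = [
        (r - g) ** 2
        for ref_m, gen_m in zip(motion_pattern(reference), motion_pattern(generated))
        for ref_row, gen_row in zip(ref_m, gen_m)
        for r, g in zip(ref_row, gen_row)
    ]
    return sum(diffs) / len(diffs)

# Each video is a list of frames; each frame is a list of per-location features.
reference = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]   # the two features swap places
same_motion = [[[2, 0], [0, 3]], [[0, 3], [2, 0]]] # different features, same swap
static = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]      # nothing moves

loss_same = motion_consistency_loss(reference, same_motion)
loss_static = motion_consistency_loss(reference, static)
```

In this toy case the loss is zero for a video whose features differ but move the same way, and it grows for a static video, so the correlation pattern captures motion rather than appearance.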

2-002 SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing

Video editing models have advanced significantly, but evaluating their performance remains challenging. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks.

2-003 IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion

Facial video editing has become increasingly important for content creators, enabling the manipulation of facial expressions and attributes. To address these challenges, we propose a novel facial video editing framework that leverages the rich latent space of pre-trained text-to-image (T2I) diffusion models and fine-tunes them specifically for facial video editing tasks.

3. 3D Processing

3-001 RMAvatar: Photorealistic Human Avatar Reconstruction from Monocular Video Based on Rectified Mesh-embedded Gaussians

We introduce RMAvatar, a novel human avatar representation with Gaussian splatting embedded on a mesh to learn a clothed avatar from a monocular video.

3-002 3DGS-to-PC: Convert a 3D Gaussian Splatting Scene into a Dense Point Cloud or Mesh

In this work we introduce 3DGS-to-PC, a flexible and highly customisable framework that is capable of transforming 3DGS scenes into dense, high-accuracy point clouds.

4. LLM & VLM

4-001 Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.

5. Embodied AI

6. Datasets

6-001 UnCommon Objects in 3D

We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly available collection of high-resolution videos of objects with 3D annotations that ensures full-360° coverage.

6-002 OCORD: Open-Campus Object Removal Dataset

This paper introduces a novel approach to object removal by constructing a high-resolution real-world dataset through long-duration video capture with fixed camera settings.

7. Survey

7-001 The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering

Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP), enabling Artificial Intelligence (AI) systems to answer questions about images. This survey traces the journey of VQA from its early days through major breakthroughs such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods.
