Computer Vision Arxiv Daily 2025.01.14

1. Image Processing

1-001 MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. We propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images.
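The pre-training pipeline itself is described in the paper; purely as an illustration of the generic matching step it builds on, here is a minimal mutual-nearest-neighbour matcher over dense descriptors, with random arrays standing in for cross-modal features (a sketch, not the paper's pre-trained model):

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """Match two sets of L2-normalized descriptors, shapes (N, D) and (M, D),
    keeping only pairs that are each other's nearest neighbour."""
    sim = desc_a @ desc_b.T                    # cosine similarity matrix, (N, M)
    nn_ab = sim.argmax(axis=1)                 # best match in B for each descriptor in A
    nn_ba = sim.argmax(axis=0)                 # best match in A for each descriptor in B
    idx_a = np.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a             # keep mutual agreements only
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)

# Random descriptors standing in for dense cross-modal features.
rng = np.random.default_rng(0)
da = rng.normal(size=(100, 128)); da /= np.linalg.norm(da, axis=1, keepdims=True)
db = rng.normal(size=(120, 128)); db /= np.linalg.norm(db, axis=1, keepdims=True)
print(mutual_nn_matches(da, db))               # (K, 2) array of index pairs
```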

1-002 Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models

Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. We propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains.
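The abstract only names the two components, so here is a minimal sketch of how such a pipeline could be wired together, with hypothetical detector and lvlm_score callables as stand-ins (not the paper's actual models):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Detection:
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2)
    proposal_score: float

def zero_shot_atr(image, candidate_classes: List[str],
                  detector: Callable, lvlm_score: Callable, threshold: float = 0.5):
    """Keep an open-world detection only if the LVLM is confident that its crop
    matches one of the candidate class names."""
    results = []
    for det in detector(image):
        scores = {c: lvlm_score(image, det.box, c) for c in candidate_classes}
        best_class, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score >= threshold:
            results.append((det.box, best_class, best_score))
    return results

# Dummy stand-ins so the sketch runs end to end.
dummy_detector = lambda img: [Detection((10, 10, 50, 50), 0.9)]
dummy_lvlm = lambda img, box, cls: 0.8 if cls == "vehicle" else 0.2
print(zero_shot_atr(None, ["vehicle", "building"], dummy_detector, dummy_lvlm))
```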

1-003 Enhancing Image Generation Fidelity via Progressive Prompts

We propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize a powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control.
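A minimal sketch of the depth-based routing idea described above, assuming a toy stack of cross-attention layers in which shallow layers attend to a low-level prompt embedding and deep layers to a high-level one (illustrative only, not the paper's pipeline):

```python
import torch
import torch.nn as nn

class DepthRoutedCrossAttention(nn.Module):
    """Toy stack of cross-attention layers: shallow layers attend to the
    low-level (detail/style) prompt, deep layers to the high-level (content) prompt."""
    def __init__(self, dim=64, depth=6, split=3, heads=4):
        super().__init__()
        self.split = split  # layers with index >= split count as "deep"
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)
        )

    def forward(self, x, low_level_emb, high_level_emb):
        for i, attn in enumerate(self.layers):
            ctx = high_level_emb if i >= self.split else low_level_emb
            out, _ = attn(query=x, key=ctx, value=ctx)
            x = x + out  # residual update of the image tokens
        return x

x = torch.randn(1, 256, 64)     # image tokens
low = torch.randn(1, 77, 64)    # low-level prompt embedding (details, style)
high = torch.randn(1, 77, 64)   # high-level prompt embedding (content, objects)
print(DepthRoutedCrossAttention()(x, low, high).shape)  # torch.Size([1, 256, 64])
```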

1-004 Personalized Preference Fine-tuning of Diffusion Models

We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users.
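The abstract does not spell out the objective, so the following is only a loose sketch of the underlying idea of user-conditioned preference: several reward heads are mixed with weights predicted from a user embedding (a hypothetical module, not PPD itself):

```python
import torch
import torch.nn as nn

class PersonalizedReward(nn.Module):
    """Toy multi-reward aggregation: each reward head scores an image, and a user
    embedding is mapped to mixture weights over those rewards."""
    def __init__(self, feat_dim=128, user_dim=16, num_rewards=3):
        super().__init__()
        self.reward_heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_rewards))
        self.user_to_weights = nn.Linear(user_dim, num_rewards)

    def forward(self, image_feats, user_emb):
        rewards = torch.cat([h(image_feats) for h in self.reward_heads], dim=-1)  # (B, K)
        weights = torch.softmax(self.user_to_weights(user_emb), dim=-1)           # (B, K)
        return (rewards * weights).sum(dim=-1)                                    # (B,)

model = PersonalizedReward()
print(model(torch.randn(4, 128), torch.randn(4, 16)).shape)  # torch.Size([4])
```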

1-005 Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution

We overcome the challenges of applying Gaussian Splatting (GS) to arbitrary-scale super-resolution (ASR) by developing two novel techniques. Firstly, to generalize GS for ASR, we elaborately design an architecture to predict the corresponding image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted continuous Gaussians.
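As a rough illustration of rendering at arbitrary scale from 2D Gaussians, the sketch below naively evaluates every Gaussian at every pixel of any requested output resolution; the paper's contribution is an efficient differentiable GPU/CUDA rasterizer, which this toy code does not reproduce:

```python
import torch

def splat_2d_gaussians(means, inv_covs, colors, opacities, out_h, out_w):
    """Render an (out_h, out_w, 3) image by evaluating 2D Gaussians at every pixel
    of a normalized grid, so any output resolution samples the same Gaussians."""
    ys = (torch.arange(out_h) + 0.5) / out_h
    xs = (torch.arange(out_w) + 0.5) / out_w
    grid = torch.stack(torch.meshgrid(xs, ys, indexing="xy"), dim=-1)  # (H, W, 2)
    d = grid[:, :, None, :] - means                                    # (H, W, N, 2)
    # Mahalanobis distance d^T Sigma^{-1} d for every pixel/Gaussian pair.
    m = torch.einsum("hwni,nij,hwnj->hwn", d, inv_covs, d)
    w = opacities * torch.exp(-0.5 * m)                                # (H, W, N)
    return (w[..., None] * colors).sum(dim=2) / (w.sum(dim=2, keepdim=True) + 1e-8)

n = 64
means = torch.rand(n, 2)
inv_covs = torch.eye(2).expand(n, 2, 2) * 400.0   # isotropic Gaussians for the toy example
colors = torch.rand(n, 3)
opacities = torch.rand(n)
print(splat_2d_gaussians(means, inv_covs, colors, opacities, 48, 64).shape)  # (48, 64, 3)
```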

2. Video Processing

2-001 Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss

In this paper, we address the challenge of generating temporally consistent videos with motion guidance. We propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss, the latter being our key innovation. Specifically, we capture the inter-frame feature correlation patterns of intermediate features from a video diffusion model to represent the motion pattern of the reference video.
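A minimal sketch of a loss in this spirit, assuming intermediate features of shape (T, C, H, W): it compares inter-frame spatial correlation matrices of the reference and generated features (a simplified reading of the idea, not the exact loss from the paper):

```python
import torch
import torch.nn.functional as F

def motion_consistency_loss(ref_feats, gen_feats):
    """Penalize differences between the inter-frame feature correlation patterns
    of a reference video and a generated video. Inputs: (T, C, H, W) features."""
    def frame_correlations(f):
        t, c, h, w = f.shape
        f = F.normalize(f.reshape(t, c, h * w), dim=1)       # unit-norm per spatial location
        # Correlation between spatial locations of consecutive frames: (T-1, HW, HW)
        return torch.einsum("tcp,tcq->tpq", f[:-1], f[1:])
    return F.mse_loss(frame_correlations(gen_feats), frame_correlations(ref_feats))

ref = torch.randn(8, 32, 16, 16)
gen = torch.randn(8, 32, 16, 16, requires_grad=True)
loss = motion_consistency_loss(ref, gen)
loss.backward()
print(loss.item())
```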

2-002 SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing

Video editing models have advanced significantly, but evaluating their performance remains challenging. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks.
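The aggregation details are defined in the paper; purely as an illustration of combining the three components into one score, here is a trivial weighted sum with made-up weights:

```python
from dataclasses import dataclass

@dataclass
class SSTScores:
    semantic: float   # e.g. VLM-based text/frame agreement in [0, 1]
    spatial: float    # e.g. object-detection-based layout fidelity in [0, 1]
    temporal: float   # e.g. frame-to-frame consistency in [0, 1]

def aggregate(scores: SSTScores, weights=(0.4, 0.3, 0.3)) -> float:
    """Illustrative weighted aggregation; these weights are invented for the example."""
    w_sem, w_spa, w_tmp = weights
    return w_sem * scores.semantic + w_spa * scores.spatial + w_tmp * scores.temporal

print(aggregate(SSTScores(semantic=0.82, spatial=0.74, temporal=0.91)))
```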

2-003 IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion

Facial video editing has become increasingly important for content creators, enabling the manipulation of facial expressions and attributes. To address these challenges, we propose a novel facial video editing framework that leverages the rich latent space of pre-trained text-to-image (T2I) diffusion models and fine-tunes them specifically for facial video editing tasks.

3. 3D Processing

3-001 RMAvatar: Photorealistic Human Avatar Reconstruction from Monocular Video Based on Rectified Mesh-embedded Gaussians

We introduce RMAvatar, a novel human avatar representation with Gaussian splatting embedded on a mesh, learning a clothed avatar from a monocular video.
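As a toy illustration of "Gaussians embedded on a mesh", the sketch below anchors one Gaussian centre per triangle via fixed barycentric coordinates, so the centres track the mesh as it deforms (not RMAvatar's full representation):

```python
import numpy as np

def gaussian_centres_on_mesh(vertices, faces, barycentric):
    """Place one Gaussian centre per face using fixed barycentric coordinates,
    so the splats follow the mesh surface as the vertices move."""
    tri = vertices[faces]                              # (F, 3, 3) triangle corner positions
    return np.einsum("fk,fkd->fd", barycentric, tri)   # (F, 3) weighted corner average

vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [1, 3, 2]])
bary = np.full((len(faces), 3), 1.0 / 3.0)             # centroid of each triangle
print(gaussian_centres_on_mesh(vertices, faces, bary))
```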

3-002 3DGS-to-PC: Convert a 3D Gaussian Splatting Scene into a Dense Point Cloud or Mesh

In this work we introduce 3DGS-to-PC, a flexible and highly customisable framework that is capable of transforming 3DGS scenes into dense, high-accuracy point clouds.
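A naive sketch of the basic conversion idea, sampling points from each Gaussian's distribution in proportion to its opacity (the actual 3DGS-to-PC framework is considerably more elaborate):

```python
import numpy as np

def gaussians_to_points(means, covs, opacities, points_per_gaussian=50, opacity_min=0.05):
    """Sample a point cloud from 3D Gaussians: skip nearly transparent splats and
    draw more samples from higher-opacity ones."""
    rng = np.random.default_rng(0)
    clouds = []
    for mu, cov, alpha in zip(means, covs, opacities):
        if alpha < opacity_min:
            continue                                     # drop low-opacity Gaussians
        n = max(1, int(points_per_gaussian * alpha))     # sample count scales with opacity
        clouds.append(rng.multivariate_normal(mu, cov, size=n))
    return np.concatenate(clouds, axis=0)

rng = np.random.default_rng(1)
g = 20
means = rng.random((g, 3))
covs = np.tile(np.eye(3) * 1e-4, (g, 1, 1))
opacities = rng.random(g)
print(gaussians_to_points(means, covs, opacities).shape)  # (num_points, 3)
```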

4. LLM & VLM

4-001 Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), yet it struggles with complex spatial reasoning tasks. Human cognition, however, extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT), which enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
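A control-flow sketch of interleaving verbal and visual reasoning, with hypothetical gen_text_step and gen_image_step callables standing in for an MLLM's text and image decoding (illustrative only, not MVoT's implementation):

```python
from typing import Callable, List

def mvot_style_trace(question: str,
                     gen_text_step: Callable[[List[object]], str],
                     gen_image_step: Callable[[List[object]], object],
                     max_steps: int = 4):
    """Interleave verbal reasoning steps with image visualizations of the
    intermediate state, feeding both back as context for the next step."""
    trace: List[object] = [question]
    for _ in range(max_steps):
        thought = gen_text_step(trace)          # verbal reasoning step
        trace.append(thought)
        if thought.startswith("ANSWER:"):
            break
        trace.append(gen_image_step(trace))     # visual sketch of the current state
    return trace

# Dummy stand-ins so the control flow can be exercised.
steps = iter(["move the block one cell left", "ANSWER: the block reaches the goal"])
text_fn = lambda ctx: next(steps)
image_fn = lambda ctx: "<image visualization>"
print(mvot_style_trace("Can the block reach the goal?", text_fn, image_fn))
```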

5. Embodied AI

6. Datasets

6-001 UnCommon Objects in 3D

We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly available collection of high-resolution videos of objects with 3D annotations that ensures full 360° coverage.

6-002 OCORD: Open-Campus Object Removal Dataset

This paper introduces a novel approach to object removal by constructing a high-resolution real-world dataset through long-duration video capture with fixed camera settings.

7. Survey

7-001 The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering

Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP), enabling Artificial Intelligence (AI) systems to answer questions about images. This survey traces the journey of VQA from its early days through major breakthroughs such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods.
