Computer Vision Arxiv Daily 2025.01.14

1. Image Processing

1-001 MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. We propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images.

1-002 Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models

Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. we propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains.

1-003 Enhancing Image Generation Fidelity via Progressive Prompts

we propose a coarse - to - fine generation pipeline for regional prompt - following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high - level descriptions of the image (such as content, topic, and objects) and low - level descriptions (such as details and style). Then, we explore the influence of cross - attention layers at different depths. We find that deeper layers are always responsible for high - level content control, while shallow layers handle low - level content control.

1-004 Personalized Preference Fine-tuning of Diffusion Models

We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users.

1-005 Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution

We overcome these challenges by developing two novel techniques. Firstly, to generalize GS for ASR, we elaborately design an architecture to predict the corresponding image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted contiguous Gaussians.

2. Video Processing

2-001 Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss

In this paper, we address the challenge of generating temporally consistent videos with motion guidance. In this work, we propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss, the latter being our key innovation. Specifically, we capture the inter-frame feature correlation patterns of intermediate features from a video diffusion model to represent the motion pattern of the reference video.

2-002 SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing

Video editing models have advanced significantly, but evaluating their performance remains challenging. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks.

2-003 IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion

Facial video editing has become increasingly important for content creators, enabling the manipulation of facial expressions and attributes. To address these challenges, we propose a novel facial video editing framework that leverages the rich latent space of pre-trained text-to-image (T2I) diffusion models and fine-tune them specifically for facial video editing tasks.

3. 3D Processing

3-001 RMAvatar: Photorealistic Human Avatar Reconstruction from Monocular Video Based on Rectified Mesh-embedded Gaussians

We introduce RMAvatar, a novel human avatar representation with Gaussian splatting embedded on mesh to learn clothed avatar from a monocular video.

3-002 3DGS-to-PC: Convert a 3D Gaussian Splatting Scene into a Dense Point Cloud or Mesh

In this work we introduce 3DGS-to-PC, a flexible and highly customisable framework that is capable of transforming 3DGS scenes into dense, high-accuracy point clouds.

4. LLM & VLM

4-001 Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.

5. Embodied AI

6. datasets

6-001 UnCommon Objects in 3D

We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full-360∘ coverage

6-002 OCORD: Open-Campus Object Removal Dataset

this paper introduces a novel approach to object removal by constructing a high-resolution real-world dataset through long-duration video capture with fixed camera settings.

7. Survey

7-001 The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering

Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing(NLP), enabling Artificial Intelligence(AI) systems to answer questions about images. This survey traces the journey of VQA from its early days, through major breakthroughs, such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods.

相关推荐
一次旅行33 分钟前
HyperTool:突破传统工具调用限制,让Agent更高效执行复杂任务
人工智能
陈天伟教授1 小时前
图解人工智能(58)人工智能应用-围棋国手
人工智能·语音识别·机器翻译
闻道参看1 小时前
2026年AI优质企业培训系统综合测评:合规管控/数据量化
人工智能
老虾头1 小时前
科技贴近烟火:本地化 AI,赋能各行各业日常经营
人工智能
毒爪的小新2 小时前
Linux 环境极速部署 vLLM:从零搭建生产级大模型推理服务
linux·人工智能·ai·语言模型·vllm
老大白菜2 小时前
25美元,DIY开源可穿戴智能AI眼镜:Arduino+乐鑫ESP32+DeepSeek项目
人工智能
DreamLife☼2 小时前
OpenBCI-脑机接口在康复医疗中的应用
深度学习·cnn·脑电·康复·fes·openbci·外骨骼
岁月宁静2 小时前
RAG 文档摄入全链路,从原理到生产落地
vue.js·人工智能·python
小和尚同志2 小时前
AI 自动化测试探索(一):Playwright MCP
前端·人工智能·aigc