中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- mHC:流形约束的超连接
- Youtu-LLM:释放轻量级大型语言模型的本机代理潜力
- Mindscape 感知检索增强生成,提高长上下文理解
- 使用基于超图的内存改进多步 RAG,以进行长上下文复杂关系建模
- InsertAnywhere:桥接 4D 场景几何和扩散模型以实现逼真的视频对象插入
- 让它流动:Rock and Roll 上的代理构建,在开放的代理学习生态系统中构建 ROME 模型
- 通过辅助损耗将专家和路由器耦合到专家混合中
- LiveTalk:通过改进的按策略蒸馏实现实时多模式交互式视频扩散
- Yume-1.5:文本控制的交互式世界生成模型
- 动态大概念模型:自适应语义空间中的潜在推理
- Stream-DiffVSR:通过自动回归扩散实现低延迟流视频超分辨率
- DiffThinker:利用扩散模型进行生成多模态推理
- 扩散了解透明度:重新利用视频扩散来实现透明对象深度和法线估计
- 通过协作变压器检测操作系统日志中的点和集体异常的统一框架
- Dream-VL 和 Dream-VLA:具有扩散语言模型骨干的开放视觉语言和视觉语言动作模型
- GaMO:用于稀疏视图 3D 重建的几何感知多视图扩散绘制
- SmartSnap:自我验证代理主动寻找证据
- SpotEdit:扩散变压器中的选择性区域编辑
- UltraShape 1.0:通过可扩展的几何细化生成高保真 3D 形状
- MAI-UI 技术报告:以现实世界为中心的基础 GUI 代理
- 人工智能遇见大脑:从认知神经科学到自主代理的记忆系统
- 评估 RLVR 的参数有效方法
- TimeBill:大型语言模型的时间预算推理
- UniPercept:迈向跨美学、质量、结构和纹理的统一感知级图像理解
- GRAN-TED:为扩散模型生成稳健、对齐且细致的文本嵌入
mHC:流形约束的超连接
- 标题: mHC: Manifold-Constrained Hyper-Connections
- 作者: Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang
- 日期: 2025-12-31
- ArXiv主页 : https://arxiv.org/abs/2512.24880
- 论文链接 : https://arxiv.org/pdf/2512.24880
英文摘要
Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.
中文摘要
最近,以超连接(HC)为代表的研究通过扩大残差流宽度和多样化连接模式,扩展了过去十年建立的普遍存在的残差连接范式。虽然带来了显著的性能提升,但这种多样化从根本上损害了残差连接固有的恒等映射属性,导致严重的训练不稳定和受限的可扩展性,并且还会产生明显的内存访问开销。为了应对这些挑战,我们提出了流形约束超连接(mHC),这是一个通用框架,将 HC 的残差连接空间投影到特定流形上以恢复恒等映射属性,同时结合严格的基础设施优化以确保效率。实证实验表明,mHC 对于大规模训练是有效的,可提供切实的性能改进和卓越的可扩展性。我们预计 mHC 作为 HC 的灵活实用的扩展,将有助于更深入地理解拓扑架构设计,并为基础模型的发展提出有希望的方向。
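The abstract does not say which manifold mHC projects onto. As a rough sketch only, assume it is the set of doubly stochastic matrices (non-negative entries, every row and column summing to 1), reached by Sinkhorn-Knopp normalization. With such a mixing matrix, the sum of the residual streams is unchanged by mixing, which is one way to read "restoring the identity mapping property". The function names `sinkhorn_project` and `mix_streams` are hypothetical and not taken from the paper.

```python
import math
import random


def sinkhorn_project(logits, n_iters=200):
    """Approximately map a square matrix of logits to a doubly stochastic matrix.

    Assumption for illustration: the constraint set is the doubly stochastic
    matrices; the abstract only says "a specific manifold".
    """
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]  # strictly positive entries
    for _ in range(n_iters):
        # Normalize rows so each sums to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Normalize columns so each sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def mix_streams(mix, streams):
    """Each output stream is a weighted combination of the input streams."""
    width = len(streams[0])
    return [
        [sum(mix[i][k] * streams[k][d] for k in range(len(streams))) for d in range(width)]
        for i in range(len(mix))
    ]


random.seed(0)
logits = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
mix = sinkhorn_project(logits)
streams = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]  # 4 streams, width 3
mixed = mix_streams(mix, streams)
```

Because every column of `mix` sums to 1, adding up the rows of `mixed` gives the same vector as adding up the rows of `streams`; an unconstrained mixing matrix would not guarantee this.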
Youtu-LLM:释放轻量级大型语言模型的本机代理潜力
- 标题: Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
- 作者: Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan
- 日期: 2025-12-31
- ArXiv主页 : https://arxiv.org/abs/2512.24618
- 论文链接 : https://arxiv.org/pdf/2512.24618
- 项目链接 : https://youtu-tip.com/#llm
- gitHub仓库 : https://github.com/TencentCloudADP/youtu-tip
英文摘要
We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.
中文摘要
我们推出 Youtu-LLM,这是一种轻量级但功能强大的语言模型,可将高计算效率与原生代理智能相协调。与典型的依赖蒸馏的小模型不同,Youtu-LLM(1.96B)是从头开始预训练的,系统地培养推理和规划能力。关键技术进步如下: (1) 具有长上下文支持的紧凑架构:Youtu-LLM 基于密集的多潜在注意力 (MLA) 架构和新颖的面向 STEM 的词汇构建,支持 128k 上下文窗口。这种设计在最小的内存占用范围内实现了强大的长上下文推理和状态跟踪,使其成为长视野代理和推理任务的理想选择。(2) 原则性的"Commonsense-STEM-Agent"课程:我们策划了约 11T 个 token 的海量语料库,并实施了多阶段的训练策略。通过逐步将预训练数据分布从一般常识转移到复杂的 STEM 和代理任务,我们确保模型获得深层认知能力,而不是表面对齐。(3)可扩展的代理中期训练:特别是对于代理中期训练,我们采用不同的数据构建方案来综合数学、编码和工具使用领域的丰富多样的轨迹。这种高质量的数据使模型能够有效地内化规划和反思行为。广泛的评估表明,Youtu-LLM 为 2B 级以下的 LLM 树立了新的最先进水平。在一般基准上,它实现了与大型模型竞争的性能,而在特定于代理的任务上,它显着超越了现有的 SOTA 基线,这表明轻量级模型可以拥有强大的内在代理能力。
Mindscape 感知检索增强生成,提高长上下文理解
- 标题: Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
- 作者: Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu
- 日期: 2025-12-19
- ArXiv主页 : https://arxiv.org/abs/2512.17220
- 论文链接 : https://arxiv.org/pdf/2512.17220
英文摘要
Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.
中文摘要
人类依靠内容的整体语义表示来理解长而复杂的文本。正如心理学中人类的思维景观感知能力(Mindscape-Aware Capability)所揭示的那样,这种全局视图有助于组织先验知识、解释新信息并整合分散在文档中的证据。当前的检索增强生成(RAG)系统缺乏此类指导,因此难以处理长上下文任务。在本文中,我们提出了 Mindscape-Aware RAG (MiA-RAG),这是第一个为基于 LLM 的 RAG 系统配备明确的全局上下文感知的方法。MiA-RAG 通过分层摘要构建思维景观,并以这一全局语义表示为条件进行检索和生成。这使得检索器能够形成更丰富的查询嵌入,并且生成器能够在连贯的全局上下文中对检索到的证据进行推理。我们在多种长上下文和双语基准上评估 MiA-RAG,考察基于证据的理解和全局意义构建。它始终超越基线,进一步的分析表明,它将局部细节与连贯的全局表示对齐,从而实现更类似于人类的长上下文检索和推理。
使用基于超图的内存改进多步 RAG,以进行长上下文复杂关系建模
- 标题: Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
- 作者: Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu
- 日期: 2025-12-30
- ArXiv主页 : https://arxiv.org/abs/2512.23959
- gitHub仓库 : https://github.com/Encyclomen/HGMem
英文摘要
Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.
中文摘要
多步检索增强生成(RAG)已成为一种广泛采用的策略,用于在需要全局理解和强化推理的任务上增强大型语言模型(LLM)。许多 RAG 系统都包含一个工作内存模块来整合检索到的信息。然而,现有的存储器设计主要用作被动存储,其积累孤立的事实,以压缩冗长的输入并通过推导生成新的子查询。这种静态性质忽视了原始事实之间至关重要的高阶相关性,而这些事实的组合通常可以为后续步骤提供更有力的指导。因此,它们的表征强度以及对多步推理和知识演化的影响有限,导致推理碎片化和扩展上下文中的全局意义构建能力较弱。我们引入了 HGMem,一种基于超图的内存机制,它将内存的概念从简单的存储扩展到动态的、可表达的结构,以实现复杂的推理和全局理解。在我们的方法中,记忆被表示为超图,其超边对应于不同的记忆单元,从而能够在记忆内逐步形成高阶交互。这种机制将围绕焦点问题的事实和思想联系起来,演变成一个集成的、情境化的知识结构,为后续步骤中更深入的推理提供强有力的命题。我们在几个专为全球意义构建而设计的具有挑战性的数据集上评估了 HGMem。大量的实验和深入的分析表明,我们的方法持续改进了多步 RAG,并且在不同的任务中显着优于强大的基线系统。
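To make the data structure concrete, here is a toy sketch of a memory whose units are hyperedges over facts. It is not the HGMem implementation: the class `HypergraphMemory`, its methods, and the merge rule (taking the union of the facts of existing units) are assumptions chosen to illustrate how memory units can grow into higher-order groupings over several retrieval steps.

```python
class HypergraphMemory:
    """Toy memory: facts are nodes, each memory unit is a hyperedge over facts."""

    def __init__(self):
        self.facts = {}  # fact_id -> fact text
        self.units = {}  # unit_id -> {"facts": frozenset of fact ids, "note": str}

    def add_fact(self, fact_id, text):
        self.facts[fact_id] = text

    def add_unit(self, unit_id, fact_ids, note):
        """Create a memory unit (hyperedge) connecting any number of known facts."""
        missing = set(fact_ids) - set(self.facts)
        if missing:
            raise KeyError(f"unknown facts: {sorted(missing)}")
        self.units[unit_id] = {"facts": frozenset(fact_ids), "note": note}

    def merge_units(self, new_id, unit_ids, note):
        """Form a higher-order unit covering every fact of the given units."""
        merged = frozenset().union(*(self.units[u]["facts"] for u in unit_ids))
        self.units[new_id] = {"facts": merged, "note": note}
        return merged

    def units_containing(self, fact_id):
        """List the memory units that a fact participates in."""
        return sorted(u for u, data in self.units.items() if fact_id in data["facts"])


mem = HypergraphMemory()
mem.add_fact("f1", "A founded company X")
mem.add_fact("f2", "Company X acquired company Y")
mem.add_fact("f3", "Company Y holds patent Z")
mem.add_unit("u1", ["f1", "f2"], "A is linked to Y through X")
mem.add_unit("u2", ["f2", "f3"], "X gained patent Z through Y")
merged = mem.merge_units("u3", ["u1", "u2"], "A is indirectly linked to patent Z")
```

Unlike an ordinary graph edge, a unit here can tie together three or more facts at once, which is the property the abstract refers to as higher-order interaction.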
InsertAnywhere:桥接 4D 场景几何和扩散模型以实现逼真的视频对象插入
- 标题: InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
- 作者: Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo
- 日期: 2025-12-19
- ArXiv主页 : https://arxiv.org/abs/2512.17504
- 论文链接 : https://arxiv.org/pdf/2512.17504
- 项目链接 : https://myyzzzoooo.github.io/InsertAnywhere/
- gitHub仓库 : https://github.com/myyzzzoooo/InsertAnywhere
英文摘要
Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D aware mask generation module that reconstructs the scene geometry and propagates user specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object removed video, object present video, and a VLM generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real world scenarios, significantly outperforming existing research and commercial models.
中文摘要
基于扩散的视频生成的最新进展为可控视频编辑开辟了新的可能性,但由于 4D 场景理解有限以及对遮挡和照明效果的处理不足,逼真的视频对象插入 (VOI) 仍然具有挑战性。我们推出了 InsertAnywhere,这是一种新的 VOI 框架,可实现几何一致的对象放置和外观忠实的视频合成。我们的方法从 4D 感知掩模生成模块开始,该模块重建场景几何形状并跨帧传播用户指定的对象放置,同时保持时间连贯性和遮挡一致性。在此空间基础上,我们扩展了基于扩散的视频生成模型,以联合合成插入的对象及其周围的局部变化,例如照明和阴影。为了实现监督训练,我们引入了 ROSE++,这是一种照明感知合成数据集,通过将 ROSE 对象删除数据集转换为对象删除视频、对象存在视频和 VLM 生成的参考图像的三元组而构建。通过大量的实验,我们证明我们的框架可以在不同的现实世界场景中产生几何上合理且视觉上连贯的对象插入,显着优于现有的研究和商业模型。
让它流动:Rock and Roll 上的代理构建,在开放的代理学习生态系统中构建 ROME 模型
- 标题: Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
- 作者: Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao He, Jin Hu, Yijie Hu, Ziyu Jiang, Cheng Li, Xiang Li, Jing Liang, Chonghuan Liu, ZhenDong Liu, Haodong Mi, Yanhu Mo, Junjia Ni, Shixin Pei, Jingyu Shen, XiaoShuai Song, Cecilia Wang, Chaofan Wang, Kangyu Wang, Pei Wang, Tao Wang, Wei Wang, Ke Xiao, Mingyu Xu, Tiange Xu, Nan Ya, Siran Yang, Jianan Ye, Yaxing Zang, Duo Zhang, Junbo Zhang, Boren Zheng, Wanxi Deng, Ling Pan, Lin Qu, Wenbo Su, Jiamang Wang, Wei Wang, Hu Wei, Minggang Wu, Cheng Yu, Bing Zhao, Zhicheng Zheng, Bo Zheng
- 日期: 2025-12-31
- ArXiv主页 : https://arxiv.org/abs/2512.24873
- gitHub仓库 : https://github.com/alibaba/ROLL
英文摘要
Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agent LLMs. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME (ROME is Obviously an Agentic Model), an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-based Policy Alignment (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of the ALE infrastructure.
中文摘要
代理构建(agentic crafting)要求 LLM 在现实环境中通过采取行动、观察结果和迭代地改进工件来进行多轮操作。尽管这一能力十分重要,开源社区仍缺乏一个有原则的端到端生态系统来简化代理开发。我们介绍代理学习生态系统 (ALE),这是一个优化代理 LLM 生产流程的基础设施。ALE 由三个部分组成:ROLL,用于权重优化的后训练框架;ROCK,用于轨迹生成的沙箱环境管理器;iFlow CLI,用于高效上下文工程的代理框架。我们发布了 ROME(ROME is Obviously an Agentic Model),这是一个基于 ALE 的开源代理,在超过一百万条轨迹上训练。我们的方法包括用于合成复杂行为的数据组合协议和一种新颖的策略优化算法,即基于交互的策略对齐(IPA),该算法在语义交互块而不是单个令牌上分配信用,以提高长程训练稳定性。在实证方面,我们在结构化环境中评估 ROME,并推出 Terminal Bench Pro,这是一个在规模和污染控制上有所改进的基准。ROME 在 SWE-bench Verified 和 Terminal Bench 等基准测试中展示了强大的性能,证明了 ALE 基础设施的有效性。
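The abstract says IPA assigns credit over semantic interaction chunks rather than individual tokens, without giving the formula. The sketch below shows only the bookkeeping that idea implies: one credit value per chunk, shared by every token in that chunk. The function `chunk_level_credit`, the chunk lengths, and the credit values are hypothetical; how IPA actually defines chunks or computes their credit is not stated in the abstract.

```python
def chunk_level_credit(chunk_lengths, chunk_credits):
    """Broadcast one credit value per interaction chunk to all of its tokens.

    Assumption for illustration: a trajectory is already split into chunks
    (for example, one tool call plus its observation), and a single scalar
    credit has been computed for each chunk by some other means.
    """
    if len(chunk_lengths) != len(chunk_credits):
        raise ValueError("need exactly one credit value per chunk")
    per_token = []
    for length, credit in zip(chunk_lengths, chunk_credits):
        per_token.extend([credit] * length)
    return per_token


# A trajectory of 5 tokens split into two chunks of 3 and 2 tokens.
token_credits = chunk_level_credit([3, 2], [0.5, -1.0])
```

Every token inside a chunk receives the same signal, so the variance that comes from scoring each token separately is avoided; this is one plausible reading of why chunk-level credit could help long-horizon stability.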
通过辅助损耗将专家和路由器耦合到专家混合中
- 标题: Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
- 作者: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao
- 日期: 2025-12-29
- ArXiv主页 : https://arxiv.org/abs/2512.23447
- 论文链接 : https://arxiv.org/pdf/2512.23447
英文摘要
Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
中文摘要
专家混合 (MoE) 模型缺乏明确的约束来确保路由器的决策与专家的能力保持一致,这最终限制了模型的性能。为了解决这个问题,我们提出了专家-路由器耦合(ERC)损失,这是一种轻量级的辅助损失,它将路由器的决策与专家的能力紧密耦合起来。我们的方法将每个专家的路由器嵌入视为分配给该专家的令牌的代理令牌,并将加入扰动的路由器嵌入输入各个专家以获得内部激活。ERC 损失对这些激活施加了两个约束:(1) 每个专家对自己的代理令牌的激活必须高于对任何其他专家的代理令牌的激活。(2) 每个代理令牌从其对应专家引发的激活必须强于从任何其他专家引发的激活。这些约束共同确保每个路由器嵌入忠实地代表其对应专家的能力,而每个专家专门处理实际路由到它的令牌。ERC 损失计算效率很高,仅在 n^2 个激活上运算,其中 n 是专家的数量。这是一个与批次大小无关的固定成本,不同于之前随令牌数量(通常每批次数百万)增长的耦合方法。通过预训练参数量从 3B 到 15B 的 MoE-LLM 以及在数万亿令牌上的广泛分析,我们证明了 ERC 损失的有效性。此外,ERC 损失可以在训练期间对专家专业化程度进行灵活控制和定量跟踪,为 MoE 提供了宝贵的见解。
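The two constraints can be checked on an n x n table of activations, where entry `act[i][j]` is the activation of expert i on the proxy token of expert j. The sketch below penalizes violations with a hinge; the hinge form, the `margin` parameter, and the name `erc_style_penalty` are assumptions for illustration, since the abstract states the constraints but not the exact loss.

```python
def erc_style_penalty(act, margin=0.0):
    """Average hinge penalty for violating the two coupling constraints.

    act[i][j]: activation of expert i on the proxy token of expert j.
    Assumed loss form (not from the paper): hinge on each pairwise violation.
    """
    n = len(act)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # (1) Expert i should respond more to its own proxy token than to token j.
            total += max(0.0, act[i][j] - act[i][i] + margin)
            # (2) Proxy token j should activate its own expert j more than expert i.
            total += max(0.0, act[i][j] - act[j][j] + margin)
    return total / (2 * n * (n - 1))


coupled = [[2.0, 0.5], [0.3, 1.5]]    # diagonal dominates rows and columns
uncoupled = [[0.5, 2.0], [1.5, 0.3]]  # off-diagonal entries dominate
penalty_coupled = erc_style_penalty(coupled)
penalty_uncoupled = erc_style_penalty(uncoupled)
```

The loops touch only the n^2 entries of `act`, so the cost depends on the number of experts and not on how many tokens are in the batch, matching the fixed-cost claim in the abstract.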
LiveTalk:通过改进的按策略蒸馏实现实时多模式交互式视频扩散
- 标题: LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
- 作者: Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
- 日期: 2025-12-29
- ArXiv主页 : https://arxiv.org/abs/2512.23576
- gitHub仓库 : https://github.com/GAIR-NLP/LiveTalk
英文摘要
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
中文摘要
通过扩散实现实时视频生成对于构建通用多模态交互式人工智能系统至关重要。然而,扩散模型通过迭代过程以双向注意力对所有视频帧同时去噪,这会阻碍实时交互。虽然现有的蒸馏方法可以使模型自回归并减少采样步骤来缓解这种情况,但它们主要关注文本到视频的生成,导致人机交互不自然且效率较低。本文的目标是以多模态上下文(包括文本、图像和音频)为条件的实时交互式视频扩散,以弥补这一差距。鉴于观察到领先的同策略(on-policy)蒸馏方法 Self Forcing 在多模态条件下遇到了挑战(闪烁、黑帧和质量下降等视觉伪影),我们研究了一种改进的蒸馏方案,重点关注条件输入的质量以及同策略优化的初始化和调度。在包括 HDTF、AVSpeech 和 CelebV-HQ 在内的多模态条件(音频、图像和文本)头像视频生成基准上,我们的蒸馏模型与相似或更大尺寸的全步双向基线的视觉质量相当,推理成本和延迟降低了 20 倍。此外,我们将我们的模型与音频语言模型和长视频推理技术 Anchor-Heavy Identity Sinks 相结合,构建了 LiveTalk,一个实时多模态交互式头像系统。在我们整理的多轮交互基准上的系统级评估表明,LiveTalk 在多轮视频一致性和内容质量方面优于最先进的模型(Sora2、Veo3),同时将响应延迟从 1 到 2 分钟缩短到实时生成,从而实现无缝的人机多模态交互。
Yume-1.5:文本控制的交互式世界生成模型
- 标题: Yume-1.5: A Text-Controlled Interactive World Generation Model
- 作者: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
- 日期: 2025-12-26
- ArXiv主页 : https://arxiv.org/abs/2512.22096
- 论文链接 : https://arxiv.org/pdf/2512.22096
- 项目链接 : https://stdstu12.github.io/YUME-Project/
- gitHub仓库 : https://github.com/stdstu12/YUME
英文摘要
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.
中文摘要
最近的方法已经证明了使用扩散模型来生成交互式和可探索的世界的前景。然而,这些方法大多数都面临着严峻的挑战,例如参数规模过大、依赖冗长的推理步骤以及快速增长的历史上下文,这些问题严重限制了实时性能,并且缺乏文本控制的生成能力。为了应对这些挑战,我们提出了 Yume-1.5,这是一种新颖的框架,旨在从单个图像或文本提示生成逼真的、交互式的和连续的世界。Yume-1.5 通过精心设计的框架来实现这一点,该框架支持基于键盘对生成的世界进行探索。该框架由三个核心组件组成:(1)集成了统一上下文压缩和线性注意力的长视频生成框架;(2)基于双向注意力蒸馏和增强文本嵌入方案的实时流式加速策略;(3)用于生成世界事件的文本控制方法。我们在补充材料中提供了代码库。
动态大概念模型:自适应语义空间中的潜在推理
- 标题: Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
- 作者: Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang
- 日期: 2025-12-31
- ArXiv主页 : https://arxiv.org/abs/2512.24617
- 论文链接 : https://arxiv.org/pdf/2512.24617
英文摘要
Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose Dynamic Large Concept Models (DLCM), a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first compression-aware scaling law, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a decoupled μP parametrization that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting (R=4, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a +2.69% average improvement across 12 zero-shot benchmarks under matched inference FLOPs.
中文摘要
尽管语言表现出高度不均匀的信息密度,但大型语言模型(LLM)对所有标记应用统一计算。这种令牌统一机制浪费了本地可预测跨度上的容量,同时将计算分配给语义关键的转换不足。我们提出了动态大型概念模型(DLCM),这是一种分层语言建模框架,它从潜在表示中学习语义边界,并将计算从标记转移到推理更有效的压缩概念空间。DLCM 端到端地发现可变长度概念,而不依赖于预定义的语言单元。分层压缩从根本上改变了缩放行为。我们引入了第一个压缩感知缩放法则,它解开了令牌级容量、概念级推理能力和压缩比,从而在固定 FLOP 下实现有原则的计算分配。为了稳定地训练这种异构架构,我们进一步开发了一种解耦的 μP 参数化,支持跨宽度和压缩机制的零样本超参数传输。在实际设置中(R=4,相当于每个概念平均有 4 个令牌),DLCM 将大约三分之一的推理计算重新分配到更高容量的推理主干中,在匹配的推理 FLOP 下,在 12 个零样本基准测试中实现了 +2.69% 的平均改进。
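The compression ratio R is the average number of tokens per concept. As a minimal sketch, assume some boundary predictor has already marked which tokens start a new concept; the code then pools each variable-length span into one concept vector and reports R. Mean pooling, the helper names, and the hand-written boundary flags are assumptions for illustration; in DLCM the boundaries are learned from latent representations and the pooling may differ.

```python
def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def pool_concepts(token_vecs, starts_concept):
    """Group tokens into variable-length concepts and mean-pool each group.

    starts_concept[t] is True when token t opens a new concept. The flags are
    given by hand here; a learned boundary predictor is assumed to exist.
    """
    concepts, current = [], []
    for vec, starts in zip(token_vecs, starts_concept):
        if starts and current:
            concepts.append(_mean(current))
            current = []
        current.append(vec)
    if current:
        concepts.append(_mean(current))
    return concepts


# Eight 2-dimensional token vectors split into spans of 5 and 3 tokens.
tokens = [[float(t), 1.0] for t in range(8)]
starts = [True, False, False, False, False, True, False, False]
concepts = pool_concepts(tokens, starts)
compression_ratio = len(tokens) / len(concepts)
```

The spans have different lengths (5 and 3), yet the average is four tokens per concept, which corresponds to the R=4 setting quoted in the abstract.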
Stream-DiffVSR:通过自动回归扩散实现低延迟流视频超分辨率
- 标题: Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
- 作者: Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu
- 日期: 2025-12-29
- ArXiv主页 : https://arxiv.org/abs/2512.23709
- 论文链接 : https://arxiv.org/pdf/2512.23709
- 项目链接 : https://jamichss.github.io/stream-diffvsr-project-page/
英文摘要
Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/
中文摘要
基于扩散的视频超分辨率(VSR)方法实现了很强的感知质量,但由于依赖未来帧和昂贵的多步去噪,对于延迟敏感的设置仍然不切实际。我们提出了 Stream-DiffVSR,一种用于高效在线 VSR 的因果条件扩散框架。它严格在过去的帧上运行,结合了用于快速推理的四步蒸馏降噪器、在潜在去噪期间注入运动对齐线索的自回归时间引导 (ARTG) 模块,以及带有时间处理器模块 (TPM) 的轻量级时间感知解码器,可增强细节和时间一致性。Stream-DiffVSR 在 RTX4090 GPU 上处理 720p 帧只需 0.328 秒,显着优于之前基于扩散的方法。与在线 SOTA TMP 相比,它提高了感知质量 (LPIPS +0.095),同时将延迟降低了 130 倍以上。Stream-DiffVSR 实现了基于扩散的 VSR 报告的最低延迟,将初始延迟从 4600 秒以上减少到 0.328 秒,从而使其成为第一个适合低延迟在线部署的扩散 VSR 方法。项目页面:https://jamichss.github.io/stream-diffvsr-project-page/
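The latency figures in the abstract can be related with simple arithmetic. The snippet below uses only the two numbers quoted above (0.328 seconds per 720p frame, and an initial delay of over 4600 seconds for prior diffusion-based VSR); the throughput figure assumes frames are processed strictly one after another with no pipelining, which the abstract does not state.

```python
PER_FRAME_SECONDS = 0.328             # Stream-DiffVSR, 720p frame on an RTX4090 (from the abstract)
PRIOR_INITIAL_DELAY_SECONDS = 4600.0  # "over 4600 seconds", so this is a lower bound

# Assumed sequential processing: throughput is the inverse of per-frame time.
frames_per_second = 1.0 / PER_FRAME_SECONDS

# Ratio between the prior initial delay and the new one (a lower bound).
initial_delay_reduction = PRIOR_INITIAL_DELAY_SECONDS / PER_FRAME_SECONDS

print(f"throughput: about {frames_per_second:.2f} frames per second")
print(f"initial delay reduced by at least {initial_delay_reduction:,.0f}x")
```

Note that this initial-delay ratio is a different quantity from the "over 130x" latency reduction, which the abstract reports relative to the online method TMP.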
DiffThinker:利用扩散模型进行生成多模态推理
- 标题: DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
- 作者: Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, Yu Cheng
- 日期: 2025-12-30
- ArXiv主页 : https://arxiv.org/abs/2512.24165
- 论文链接 : https://arxiv.org/pdf/2512.24165
- 项目链接 : https://diffthinker-project.github.io
- gitHub仓库 : https://github.com/lcqysl/DiffThinker
英文摘要
While recent Multimodal Large Language Models (MLLMs) have attained significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.
中文摘要
虽然最近的多模态大型语言模型(MLLM)在多模态推理方面取得了显著的进步,但它们的推理过程仍然主要以文本为中心,导致在复杂的长程、以视觉为中心的任务中表现欠佳。在本文中,我们建立了一种新颖的生成式多模态推理范式,并介绍了 DiffThinker,一种基于扩散的推理框架。从概念上讲,DiffThinker 将多模态推理重新表述为原生的生成式图像到图像任务,在以视觉为中心的任务中实现卓越的逻辑一致性和空间精度。我们对 DiffThinker 和 MLLM 进行了系统比较,首次深入研究了该范式的内在特征,揭示了四个核心属性:效率、可控性、原生并行性和协作。跨四个领域(顺序规划、组合优化、约束满足和空间配置)的广泛实验表明,DiffThinker 的性能显著优于领先的闭源模型,包括 GPT-5 (+314.2%) 和 Gemini-3-Flash (+111.6%),以及微调的 Qwen3-VL-32B 基线 (+39.0%),凸显生成式多模态推理是以视觉为中心的推理的一种有前途的方法。
扩散了解透明度:重新利用视频扩散来实现透明对象深度和法线估计
- 标题: Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
- 作者: Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
- 日期: 2025-12-29
- ArXiv主页 : https://arxiv.org/abs/2512.23705
- 论文链接 : https://arxiv.org/pdf/2512.23705
- 项目链接 : https://daniellli.github.io/projects/DKT/
- gitHub仓库 : https://github.com/Daniellli/DKT
英文摘要
Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
中文摘要
透明物体对于感知系统来说仍然是出了名的困难:折射、反射和透射打破了立体视觉、ToF 和纯判别式单目深度估计背后的假设,导致空洞和时间上不稳定的估计。我们的主要观察结果是,现代视频扩散模型已经能够合成令人信服的透明现象,这表明它们已经内化了光学规则。我们构建了 TransPhy3D,一个透明/反射场景的合成视频语料库:使用 Blender/Cycles 渲染的 11k 个序列。场景由一组精心整理的类别丰富的静态资产和形状丰富的程序化资产与玻璃/塑料/金属材质搭配而成。我们使用基于物理的光线追踪和 OptiX 去噪来渲染 RGB + 深度 + 法线。从大型视频扩散模型开始,我们通过轻量级 LoRA 适配器学习用于深度(和法线)的视频到视频转换器。在训练过程中,我们在 DiT 主干中拼接 RGB 和(带噪)深度潜变量,并在 TransPhy3D 和现有的逐帧合成数据集上进行联合训练,从而为任意长度的输入视频生成时间一致的预测。由此产生的模型 DKT 在涉及透明度的真实和合成视频基准上实现了零样本 SOTA:ClearPose、DREDS (CatKnown/CatNovel) 和 TransPhy3D-Test。相比强大的图像/视频基线,它提高了准确性和时间一致性,并且其法线变体在 ClearPose 上取得了最佳的视频法线估计结果。紧凑的 1.3B 版本的运行速度约为 0.17 秒/帧。集成到抓取系统后,DKT 的深度估计提高了在半透明、反射和漫反射表面上的成功率,优于之前的估计器。总之,这些结果支持了一个更广泛的主张:"扩散知道透明度。"生成式视频先验可以高效且无需标注地重新利用,形成鲁棒的、时间连贯的感知,以应对具有挑战性的现实世界操作。
通过协作变压器检测操作系统日志中的点和集体异常的统一框架
- 标题: A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers
- 作者: Mohammad Nasirzadeh, Jafar Tahmoresnezhad, Parviz Rashidi-Khazaee
- 日期: 2025-12-29
- ArXiv主页 : https://arxiv.org/abs/2512.23380
- 论文链接 : https://arxiv.org/pdf/2512.23380
- 项目链接 : https://www.alarmif.com
- gitHub仓库 : https://github.com/NasirzadehMoh/CoLog
英文摘要
Log anomaly detection is crucial for preserving the security of operating systems. Depending on the source of log data collection, various information is recorded in logs that can be considered log modalities. In light of this intuition, unimodal methods often struggle by ignoring the different modalities of log data. Meanwhile, multimodal methods fail to handle the interactions between these modalities. Applying multimodal sentiment analysis to log anomaly detection, we propose CoLog, a framework that collaboratively encodes logs utilizing various modalities. CoLog utilizes collaborative transformers and multi-head impressed attention to learn interactions among several modalities, ensuring comprehensive anomaly detection. To handle the heterogeneity caused by these interactions, CoLog incorporates a modality adaptation layer, which adapts the representations from different log modalities. This methodology enables CoLog to learn nuanced patterns and dependencies within the data, enhancing its anomaly detection capabilities. Extensive experiments demonstrate CoLog's superiority over existing state-of-the-art methods. Furthermore, in detecting both point and collective anomalies, CoLog achieves a mean precision of 99.63%, a mean recall of 99.59%, and a mean F1 score of 99.61% across seven benchmark datasets for log-based anomaly detection. The comprehensive detection capabilities of CoLog make it highly suitable for cybersecurity, system monitoring, and operational efficiency. CoLog represents a significant advancement in log anomaly detection, providing a sophisticated and effective solution to point and collective anomaly detection through a unified framework and a solution to the complex challenges automatic log data analysis poses. We also provide the implementation of CoLog at https://github.com/NasirzadehMoh/CoLog.
中文摘要
日志异常检测对于维护操作系统的安全至关重要。根据日志数据收集的来源,日志中记录的各种信息可以被视为不同的日志模态。基于这一直觉,单模态方法常常因忽略日志数据的不同模态而表现不佳。同时,多模态方法无法处理这些模态之间的相互作用。我们将多模态情感分析的思路应用于日志异常检测,提出了 CoLog,一个利用多种模态协作编码日志的框架。CoLog 利用协作 Transformer 和多头印象注意力(multi-head impressed attention)来学习多种模态之间的交互,确保全面的异常检测。为了处理由这些交互引起的异质性,CoLog 引入了一个模态适应层,用于适配来自不同日志模态的表示。这种方法使 CoLog 能够学习数据中细微的模式和依赖关系,从而增强其异常检测能力。大量的实验证明了 CoLog 相对于现有最先进方法的优越性。此外,在检测点异常和集体异常时,CoLog 在基于日志的异常检测的七个基准数据集上实现了 99.63% 的平均精确率、99.59% 的平均召回率和 99.61% 的平均 F1 分数。CoLog 全面的检测能力使其非常适合网络安全、系统监控和运营效率等场景。CoLog 代表了日志异常检测的重大进步,通过统一的框架为点异常和集体异常检测提供了精巧且有效的解决方案,并应对了自动日志数据分析带来的复杂挑战。我们还在 https://github.com/NasirzadehMoh/CoLog 上提供了 CoLog 的实现。
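As a quick sanity check on the reported metrics, the sketch below computes F1 as the harmonic mean of precision and recall. It assumes the standard F1 definition. The abstract reports means over seven datasets, so in general the mean F1 need not equal the F1 of the mean precision and mean recall; here the two happen to agree to two decimal places.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision and mean recall as reported in the abstract.
reported_precision = 0.9963
reported_recall = 0.9959

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.4f}")  # prints 0.9961
```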
Dream-VL 和 Dream-VLA:具有扩散语言模型骨干的开放视觉语言和视觉语言动作模型
- 标题: Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
- 作者: Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
- 日期: 2025-12-27
- ArXiv主页 : https://arxiv.org/abs/2512.22615
- 论文链接 : https://arxiv.org/pdf/2512.22615
- 项目链接 : https://hkunlp.github.io/blog/2025/dream-vlx/
- gitHub仓库 : https://github.com/DreamLM/Dream-VLX
英文摘要
While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as π_0 and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
中文摘要
虽然自回归大型视觉语言模型(VLM)取得了显着的成功,但它们的顺序生成通常限制了它们在复杂视觉规划和动态机器人控制中的功效。在这项工作中,我们研究了在基于扩散的大语言模型(dLLM)上构建视觉语言模型以克服这些限制的潜力。我们推出 Dream-VL,这是一种基于开放扩散的 VLM (dVLM),在之前的 dVLM 中实现了最先进的性能。Dream-VL 可与基于各种基准的开放数据训练的顶级基于 AR 的 VLM 相媲美,但在应用于视觉规划任务时表现出卓越的潜力。在 Dream-VL 的基础上,我们引入了 Dream-VLA,这是一种基于 dLLM 的视觉-语言-动作模型(dVLA),通过对开放机器人数据集的持续预训练而开发。我们证明,这种扩散主干的原生双向性质可以作为 VLA 任务的卓越基础,本质上适合动作分块和并行生成,从而在下游微调中显着加快收敛速度。Dream-VLA 在 LIBERO 上实现了 97.2% 的平均成功率,在 SimplerEnv-Bridge 上实现了 71.4% 的总体平均成功率,在 SimplerEnv-Fractal 上实现了 60.5% 的总体平均成功率,超越了 π_0 和 GR00T-N1 等领先模型。我们还验证了 dVLM 在不同训练目标的下游任务上超越了 AR 基线。我们发布了 Dream-VL 和 Dream-VLA 以促进社区的进一步研究。
GaMO:用于稀疏视图 3D 重建的几何感知多视图扩散绘制
- 标题: GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
- 作者: Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu
- 日期: 2025-12-31
- ArXiv主页 : https://arxiv.org/abs/2512.25073
- 论文链接 : https://arxiv.org/pdf/2512.25073
- 项目链接 : https://yichuanh.github.io/GaMO/
英文摘要
Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a 25× speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/
中文摘要
3D 重建领域的最新进展在从密集的多视图图像中捕获高质量场景方面取得了显着进展,但在输入视图有限时却遇到了困难。已经采用了各种方法来应对这一挑战,包括正则化技术、语义先验和几何约束。最新的基于扩散的方法通过从新的相机姿势生成新颖的视图来增强训练数据,已经证明了显着的改进,超越了早期的正则化和基于先验的技术。尽管取得了这些进展,我们还是发现了这些最先进方法的三个关键局限性:已知视图外围之外的覆盖范围不足、生成的视图之间的几何不一致以及计算成本高昂的管道。我们引入了 GaMO(几何感知多视图外画),这是一个通过多视图外画重新制定稀疏视图重建的框架。GaMO 不是生成新的视点,而是扩展了现有相机姿势的视野,这本质上保持了几何一致性,同时提供了更广泛的场景覆盖范围。我们的方法以零样本方式采用多视图调节和几何感知去噪策略,无需训练。Replica 和 ScanNet++ 上的大量实验证明了跨 3、6 和 9 个输入视图的最先进的重建质量,在 PSNR 和 LPIPS 方面优于现有方法,同时比基于 SOTA 扩散的方法实现了 25 倍的加速,处理时间不到 10 分钟。项目页面:https://yichuanh.github.io/GaMO/
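To make "expanding the field of view from existing camera poses" concrete, the sketch below computes the horizontal field of view under a pinhole camera model. The pinhole model, the function name, and the pixel values are assumptions for illustration only; the abstract does not specify the camera model or image sizes. Keeping the pose and focal length fixed while outpainting a wider canvas yields a wider field of view.

```python
import math


def horizontal_fov_deg(image_width_px: float, focal_length_px: float) -> float:
    """Horizontal field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * focal_length_px)))


# Hypothetical numbers: same camera pose and focal length, wider canvas after outpainting.
focal = 320.0
original_fov = horizontal_fov_deg(640, focal)    # approximately 90 degrees
outpainted_fov = horizontal_fov_deg(960, focal)  # approximately 112.6 degrees

print(f"{original_fov:.1f} -> {outpainted_fov:.1f}")
```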
SmartSnap:自我验证代理主动寻找证据
- 标题: SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
- 作者: Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun
- 日期: 2025-12-26
- ArXiv主页 : https://arxiv.org/abs/2512.22322
- 论文链接 : https://arxiv.org/pdf/2512.22322
- 项目链接 : https://huggingface.co/collections/yolay/smartsnap
- gitHub仓库 : https://github.com/TencentYoutuResearch/SmartSnap
英文摘要
Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, and LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidences. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its accessibility to the online environment to perform self-verification on a minimal, decisive set of snapshots. Such evidences are provided as the sole materials for a general LLM-as-a-Judge verifier to determine their validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergizing between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.
中文摘要
代理强化学习(RL)对于复杂 GUI 任务下自主代理的开发有着巨大的前景,但其可扩展性仍然受到任务完成验证的严重阻碍。现有的任务验证被视为被动的事后过程:验证器(即基于规则的评分脚本、奖励或评论模型以及 LLM-as-a-Judge)分析代理的整个交互轨迹,以确定代理是否成功。处理这种包含无关且嘈杂历史记录的冗长上下文给验证协议带来了挑战,因此导致成本过高和可靠性低。为了克服这一瓶颈,我们提出了 SmartSnap,这是一种从被动的事后验证到代理自身主动的现场自我验证的范式转变。我们引入了自我验证代理,这是一种具有双重任务的新型代理:不仅要完成任务,还要通过精心挑选的快照证据来证明任务已经完成。在我们提出的 3C 原则(完整性、简洁性和创造性)的指导下,代理利用其对在线环境的访问能力,在最少的、决定性的快照集上执行自我验证。这些证据是通用 LLM-as-a-Judge 验证器判断其有效性和相关性的唯一材料。跨模型系列和规模的移动端任务实验表明,我们的 SmartSnap 范式允许以可扩展的方式训练 LLM 驱动的代理,为 8B 和 30B 模型分别带来高达 26.08% 和 16.66% 的性能提升。解决方案寻找和证据寻求之间的协同作用有助于培养高效、自我验证的代理,其性能可与 DeepSeek V3.1 和 Qwen3-235B-A22B 竞争。
SpotEdit:扩散变压器中的选择性区域编辑
- 标题: SpotEdit: Selective Region Editing in Diffusion Transformers
- 作者: Zhibin Qin, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang
- 日期: 2025-12-26
- ArXiv主页 : https://arxiv.org/abs/2512.22323
- 论文链接 : https://arxiv.org/pdf/2512.22323
- 项目链接 : https://biangbiang0321.github.io/SpotEdit.github.io
- gitHub仓库 : https://github.com/Biangbiang0321/SpotEdit
英文摘要
Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.
中文摘要
Diffusion Transformer 模型通过对条件图像进行编码并将其集成到变换器层中,具有显着先进的图像编辑功能。然而,大多数编辑仅涉及修改小区域,而当前的方法在每个时间步统一处理和降噪所有标记,导致冗余计算并可能降低未更改区域的性能。这就提出了一个基本问题:在编辑过程中是否真的有必要重新生成每个区域?为了解决这个问题,我们提出了 SpotEdit,这是一种无需训练的扩散编辑框架,可以有选择地仅更新修改的区域。SpotEdit 包含两个关键组件:SpotSelector 通过感知相似性识别稳定区域,并通过重用条件图像特征来跳过其计算;SpotFusion 通过动态融合机制自适应地将这些功能与编辑后的标记混合在一起,从而保持上下文连贯性和编辑质量。通过减少不必要的计算并在未修改的区域保持高保真度,SpotEdit 实现了高效、精确的图像编辑。
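The selection step described above can be sketched as a per-token similarity test. This is a minimal illustration, not the authors' SpotSelector: cosine similarity stands in for the perceptual similarity measure, and the function names and the 0.95 threshold are hypothetical. Tokens that stay close to the conditional image are treated as stable and could reuse cached features, while the rest are passed on for denoising.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_tokens(cond_tokens, current_tokens, threshold=0.95):
    """Return (stable, to_update) index lists based on token similarity."""
    stable, to_update = [], []
    for i, (cond, cur) in enumerate(zip(cond_tokens, current_tokens)):
        if cosine(cond, cur) >= threshold:
            stable.append(i)      # close to the conditional image: skip computation
        else:
            to_update.append(i)   # changed region: keep denoising this token
    return stable, to_update


cond = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cur = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
print(split_tokens(cond, cur))  # prints ([0, 2], [1])
```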
UltraShape 1.0:通过可扩展的几何细化生成高保真 3D 形状
- 标题: UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement
- 作者: Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan
- 日期: 2025-12-24
- ArXiv主页 : https://arxiv.org/abs/2512.21185
- 论文链接 : https://arxiv.org/pdf/2512.21185
- 项目链接 : https://pku-yuangroup.github.io/UltraShape-1.0/
- gitHub仓库 : https://github.com/PKU-YuanGroup/UltraShape-1.0
英文摘要
In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.
中文摘要
在本报告中,我们介绍了 UltraShape 1.0,这是一种用于高保真 3D 几何生成的可扩展 3D 扩散框架。所提出的方法采用两阶段生成流程:首先合成粗略的全局结构,然后细化以产生详细的高质量几何结构。为了支持可靠的 3D 生成,我们开发了全面的数据处理管道,其中包括新颖的无懈可击的处理方法和高质量的数据过滤。该流程通过删除低质量样本、填充孔洞和加厚薄结构来提高公开可用 3D 数据集的几何质量,同时保留细粒度的几何细节。为了实现细粒度的几何细化,我们将扩散过程中的空间定位与几何细节合成解耦。我们通过在固定空间位置执行基于体素的细化来实现这一点,其中从粗几何导出的体素查询提供通过 RoPE 编码的显式位置锚,允许扩散模型专注于在简化的结构化解决方案空间内合成局部几何细节。我们的模型专门在公开的 3D 数据集上进行训练,尽管训练资源有限,但仍能实现强大的几何质量。广泛的评估表明,UltraShape 1.0 在数据处理质量和几何生成方面与现有开源方法相比具有竞争力。所有代码和训练模型都将发布以支持未来的研究。
MAI-UI 技术报告:以现实世界为中心的基础 GUI 代理
- 标题: MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
- 作者: Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi
- 日期: 2025-12-26
- ArXiv主页 : https://arxiv.org/abs/2512.22047
- 论文链接 : https://arxiv.org/pdf/2512.22047
英文摘要
The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes a new state of the art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
中文摘要
GUI 代理的开发可能会彻底改变下一代人机交互。受这一愿景的推动,我们推出了 MAI-UI,这是一系列涵盖各种规模的基础 GUI 代理,包括 2B、8B、32B 和 235B-A22B 变体。我们确定了实际部署的四个关键挑战:缺乏原生的代理与用户交互、仅 UI 操作的局限、缺乏实用的部署架构以及动态环境中的脆弱性。MAI-UI 通过统一的方法解决了这些问题:自我演化的数据管道,可扩展导航数据以包括用户交互和 MCP 工具调用;原生设备-云协作系统,按任务状态路由执行;以及具有高级优化功能的在线 RL 框架,可扩展并行环境和上下文长度。MAI-UI 在 GUI 定位(grounding)和移动导航方面建立了新的最先进水平。在定位基准测试中,它在 ScreenSpot-Pro 上达到 73.5%,在 MMBench GUI L2 上达到 91.3%,在 OSWorld-G 上达到 70.9%,在 UI-Vision 上达到 49.2%,并在 ScreenSpot-Pro 上超过了 Gemini-3-Pro 和 Seed1.8。在移动 GUI 导航方面,它在 AndroidWorld 上以 76.7% 刷新了 SOTA,超越了 UI-Tars-2、Gemini-2.5-Pro 和 Seed1.8。在 MobileWorld 上,MAI-UI 获得了 41.7% 的成功率,显著优于端到端 GUI 模型,并且与基于 Gemini-3-Pro 的代理框架具有竞争力。我们的在线 RL 实验显示,将并行环境从 32 个扩展到 512 个(+5.2 点)以及将环境步骤预算从 15 增加到 50(+4.3 点),可以获得显著收益。最后,原生设备-云协作系统将端侧性能提升 33%,云模型调用减少 40% 以上,并保护用户隐私。
人工智能遇见大脑:从认知神经科学到自主代理的记忆系统
- 标题: AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents
- 作者: Jiafeng Liang, Hao Li, Chang Li, Jiaqi Zhou, Shixin Jiang, Zekun Wang, Changkai Ji, Zhihao Zhu, Runxuan Liu, Tao Ren, Jinlan Fu, See-Kiong Ng, Xia Liang, Ming Liu, Bing Qin
- 日期: 2025-12-29
- ArXiv主页 : https://arxiv.org/abs/2512.23343
- gitHub仓库 : https://github.com/AgentMemory/Huaman-Agent-Memory
英文摘要
Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesize interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.
中文摘要
记忆是连接过去和未来的关键纽带,为人类和人工智能系统提供了宝贵的概念和经验来应对复杂的任务。最近对自主代理的研究越来越集中于借鉴认知神经科学来设计高效的记忆工作流程。然而,由于跨学科障碍的限制,现有的工作很难吸收人类记忆机制的本质。为了弥补这一差距,我们系统地综合了跨学科的记忆知识,将认知神经科学的见解与 LLM 驱动的代理联系起来。具体来说,我们首先沿着从认知神经科学到 LLM 再到代理的渐进轨迹阐明记忆的定义和功能。然后,我们从生物和人工两个角度对记忆分类、存储机制和完整的管理生命周期进行比较分析。随后,我们回顾了评估代理记忆的主流基准。此外,我们从攻击和防御的双重角度探讨记忆安全。最后,我们展望了未来的研究方向,重点是多模态记忆系统和技能获取。
评估 RLVR 的参数有效方法
- 标题: Evaluating Parameter Efficient Methods for RLVR
- 作者: Qingyu Yin, Yulun Wu, Zhennan Shen, Sunbowen Li, Zhilin Wang, Yanshu Li, Chak Tou Leong, Jiale Kang, Jinjin Gu
- 日期: 2025-12-29
- ArXiv主页 : https://arxiv.org/abs/2512.23165
- 论文链接 : https://arxiv.org/pdf/2512.23165
英文摘要
We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Furthermore, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate our findings. This work provides a definitive guide for advocating for more exploration for parameter-efficient RL methods.
中文摘要
我们在具有可验证奖励的强化学习(RLVR)范式下系统地评估参数高效微调(PEFT)方法。RLVR 通过可验证的反馈来激励语言模型增强其推理能力;然而,虽然 LoRA 等方法很常用,但 RLVR 的最佳 PEFT 架构仍未确定。在这项工作中,我们在数学推理基准上对 DeepSeek-R1-Distill 系列中超过 12 种 PEFT 方法进行了首次综合评估。我们的实证结果通过三个主要发现挑战了标准 LoRA 的默认采用。首先,我们证明 DoRA、AdaLoRA 和 MiSS 等结构变体的性能始终优于 LoRA。其次,我们发现了 SVD 通知的初始化策略(例如 PiSSA、MiLoRA)中的谱崩溃现象,将其失败归因于主成分更新和 RL 优化之间的根本失调。此外,我们的消融表明,极端参数减少(例如,VeRA、Rank-1)严重阻碍了推理能力。我们进一步进行消融研究和缩放实验来验证我们的发现。这项工作为倡导更多探索参数高效的强化学习方法提供了明确的指南。
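For a sense of scale behind "extreme parameter reduction", the sketch below counts the trainable parameters that a standard LoRA adapter adds to one weight matrix: a rank-r update B·A with B of shape d_out × r and A of shape r × d_in. The 4096 × 4096 layer size is a hypothetical example, not a figure from the paper, and the variants named in the abstract (DoRA, VeRA, and others) have different counts that are not modeled here.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of a rank-`rank` LoRA update B @ A for one matrix."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters of the dense weight matrix itself."""
    return d_in * d_out


d_in = d_out = 4096  # hypothetical projection size
for rank in (1, 16):
    share = lora_params(d_in, d_out, rank) / full_params(d_in, d_out)
    print(rank, lora_params(d_in, d_out, rank), f"{share:.4%}")
# prints:
# 1 8192 0.0488%
# 16 131072 0.7812%
```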
TimeBill:大型语言模型的时间预算推理
- 标题: TimeBill: Time-Budgeted Inference for Large Language Models
- 作者: Qi Fan, An Zou, Yehan Ma
- 日期: 2025-12-26
- ArXiv主页 : https://arxiv.org/abs/2512.21859
- 论文链接 : https://arxiv.org/pdf/2512.21859
英文摘要
Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.
中文摘要
大型语言模型 (LLM) 越来越多地部署在时间关键型系统中,例如机器人、自动驾驶、具身智能和工业自动化,其中在给定的时间预算内生成准确的响应对于决策、控制或安全关键型任务至关重要。然而,LLM 的自回归生成过程使得建模和估计端到端执行时间变得具有挑战性。此外,现有的基于固定键值(KV)缓存驱逐比率的高效推理方法难以适应具有不同时间预算的不同任务,其中不恰当的驱逐比率可能导致推理不完整或响应性能下降。在本文中,我们提出了 TimeBill,一种面向 LLM 的新颖的时间预算推理框架,可以平衡推理效率和响应性能。更具体地说,我们提出了一个细粒度的响应长度预测器(RLP)和一个执行时间估计器(ETE)来准确预测 LLM 的端到端执行时间。在此基础上,我们开发了一种时间预算的高效推理方法,该方法根据执行时间预测和给定的时间预算自适应地调整 KV 缓存驱逐比率。最后,通过大量的实验,我们展示了 TimeBill 在提高任务完成率和在各种超时策略下保持响应性能方面的优势。
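The idea of adapting the KV cache eviction ratio to a time budget can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's RLP or ETE: the linear latency model, its coefficients, the candidate ratios, and both function names are hypothetical. The sketch picks the smallest eviction ratio (keeping as much context as possible) whose estimated end-to-end time fits the budget.

```python
def estimate_time(prompt_len, resp_len, ratio,
                  prefill_per_tok=0.0002, decode_base=0.02, decode_per_kv=0.00001):
    """Toy latency model in seconds: prefill plus per-token decode cost
    that grows with the number of KV entries kept after eviction."""
    kept = prompt_len * (1.0 - ratio)
    return prompt_len * prefill_per_tok + resp_len * (decode_base + decode_per_kv * kept)


def pick_eviction_ratio(prompt_len, resp_len, budget_s,
                        candidates=(0.0, 0.25, 0.5, 0.75, 0.9)):
    """Smallest candidate ratio that meets the budget, or None if none does."""
    for ratio in sorted(candidates):
        if estimate_time(prompt_len, resp_len, ratio) <= budget_s:
            return ratio
    return None  # budget cannot be met; an overrun strategy would apply


# Hypothetical request: 4000 prompt tokens, 100 predicted response tokens.
print(pick_eviction_ratio(4000, 100, budget_s=5.0))   # prints 0.5
print(pick_eviction_ratio(4000, 100, budget_s=10.0))  # prints 0.0
print(pick_eviction_ratio(4000, 100, budget_s=1.0))   # prints None
```

In this sketch the response length is passed in directly; in the framework described above it would come from a response length predictor.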
UniPercept:迈向跨美学、质量、结构和纹理的统一感知级图像理解
- 标题: UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
- 作者: Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu
- 日期: 2025-12-25
- ArXiv主页 : https://arxiv.org/abs/2512.21675
- 论文链接 : https://arxiv.org/pdf/2512.21675
- 项目链接 : https://thunderbolt215.github.io/Unipercept-project/
- gitHub仓库 : https://github.com/thunderbolt215/UniPercept
英文摘要
Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, Structure and Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline UniPercept trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, through the introduction of a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.
中文摘要
多模态大语言模型(MLLM)在视觉理解任务(例如视觉基础、分割和字幕)方面取得了显着进展。然而,它们感知感知级图像特征的能力仍然有限。在这项工作中,我们提出了 UniPercept-Bench,这是一个跨三个关键领域的感知级图像理解的统一框架:美学、质量、结构和纹理。我们建立了分层定义系统并构建大规模数据集来评估感知级图像理解。在此基础上,我们开发了一个强大的基线 UniPercept,通过领域自适应预训练和任务对齐 RL 进行训练,从而实现了视觉评分 (VR) 和视觉问答 (VQA) 任务的稳健泛化。UniPercept 在感知级图像理解方面优于现有的 MLLM,并且可以作为文本到图像生成的即插即用奖励模型。这项工作定义了 MLLM 时代的感知级图像理解,并通过引入全面的基准和强大的基线,为推进感知级多模态图像理解提供了坚实的基础。
GRAN-TED:为扩散模型生成稳健、对齐且细致的文本嵌入
- 标题: GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
- 作者: Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing zhang
- 日期: 2025-12-17
- ArXiv主页 : https://arxiv.org/abs/2512.15560
- 论文链接 : https://arxiv.org/pdf/2512.15560
英文摘要
The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, compared with training a diffusion model from scratch, evaluating with TED-6K is about 750× faster. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
中文摘要
文本编码器是文本到图像和文本到视频扩散模型的关键组件,从根本上决定了生成内容的语义保真度。然而,它的发展受到两大挑战的阻碍:缺乏可靠预测下游生成性能的有效评估框架,以及有效适应预训练语言模型进行视觉合成的困难。为了解决这些问题,我们引入了 GRAN-TED,这是一种为扩散模型生成稳健、对齐和细致的文本嵌入的范例。我们的贡献是双重的。首先,我们提出了 TED-6K,这是一种新颖的纯文本基准,可以有效、稳健地评估编码器的表征质量,而无需昂贵的端到端模型训练。我们证明,通过轻量级统一适配器标准化的 TED-6K 性能与编码器在下游生成任务中的有效性密切相关。值得注意的是,在我们的实验设置下,与从头开始训练扩散模型相比,使用 TED-6K 进行评估大约快 750 倍。其次,在这个经过验证的框架的指导下,我们使用新颖的两阶段训练范例开发了一种卓越的文本编码器。此过程涉及多模态大语言模型的初始微调阶段,以获得更好的视觉表示,然后采用分层加权方法来提取更细致和有效的文本特征。我们的实验表明,最终的 GRAN-TED 编码器不仅在 TED-6K 上实现了最先进的性能,而且在文本到图像和文本到视频生成方面也带来了明显的性能提升。我们的 TED-6K 数据集和评估代码可通过以下链接获取:https://anonymous.4open.science/r/GRAN-TED-4FCC/。
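One common way to realize a layer-wise weighting of encoder features is a softmax-normalized scalar mix over per-layer hidden states. The sketch below shows that generic form only; the abstract does not specify the exact weighting scheme, so the function names, the use of softmax, and the single vector per layer are assumptions for illustration. In practice the logits would be learned and the states would be per-token tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_states, layer_logits):
    """Weighted sum of per-layer embedding vectors (all of equal length)."""
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    return [sum(w * state[i] for w, state in zip(weights, layer_states))
            for i in range(dim)]


# Two hypothetical layers with equal logits reduce to a plain average.
states = [[1.0, 2.0], [3.0, 4.0]]
print(mix_layers(states, [0.0, 0.0]))  # prints [2.0, 3.0]
```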