中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- GDPO:多奖励 RL 优化的群体奖励解耦标准化策略优化
- LTX-2:高效联合视听基础模型
- NeoVerse:通过野外单目视频增强 4D 世界模型
- Youtu-Agent:通过自动生成和混合策略优化来扩展代理生产力
- 熵自适应微调:解决自信冲突以减轻遗忘
- InfiniDepth:具有神经隐式场的任意分辨率和细粒度深度估计
- 递归语言模型
- K-EXAONE技术报告
- 不断发展的程序化技能网络
- LLM 能预测自己的失败吗?通过内部电路实现自我意识
- NextFlow:统一顺序建模激活多模态理解和生成
- 驯服幻觉:通过反事实视频生成增强 MLLM 的视频理解
- MOSS Transcribe Diarize:通过说话者分类进行准确转录
- 头像强制:实时交互式头像生成以实现自然对话
- DreamID-V:通过扩散 Transformer 弥合图像到视频的差距,实现高保真换脸
- UniCorn:通过自我生成的监督实现自我改进的统一多模式模型
- RL-AWB:低光夜间场景自动白平衡校正的深度强化学习
- NitroGen:通用游戏代理的开放基础模型
- 嵌套学习:深度学习架构的幻想
- Atlas:为多领域复杂推理编排异构模型和工具
- 可学习的乘数:释放语言模型矩阵层的规模
- MindWatcher:迈向更智能的多模态工具集成推理
- 通过 FusionRoute 进行 token 级 LLM 协作
- MiMo-V2-Flash技术报告
- SimpleMem:LLM 代理的高效终身记忆
- VideoAuto-R1:一次思考,两次回答的视频自动推理
- SenseNova-MARS:通过强化学习增强多模式代理推理和搜索
- SciEvalKit:科学通用智能的开源评估工具包
GDPO:多奖励 RL 优化的群体奖励解耦标准化策略优化
- 标题: GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
- 作者: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
- 日期: 2026-01-08
- ArXiv主页 : https://arxiv.org/abs/2601.05242
- 论文链接 : https://arxiv.org/pdf/2601.05242
- 项目链接 : https://nvlabs.github.io/GDPO/
- gitHub仓库 : https://github.com/NVlabs/GDPO
英文摘要
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
中文摘要
随着语言模型的能力越来越强,用户期望它们不仅能够提供准确的响应,而且能够在各种场景中表现出符合不同人类偏好的行为。为了实现这一目标,强化学习 (RL) 流程已开始整合多种奖励,每种奖励都捕获不同的偏好,以引导模型实现这些所需的行为。然而,最近的工作默认在多奖励设置下应用组相对策略优化(GRPO),而没有检查其适用性。在本文中,我们证明,直接应用 GRPO 对不同的 rollout 奖励组合做归一化,会导致它们坍缩为相同的优势值,从而降低训练信号的分辨率,导致收敛不理想,在某些情况下还会导致训练早期失败。然后,我们引入了群体奖励解耦标准化策略优化(GDPO),这是一种新的策略优化方法,通过解耦各个奖励的归一化来解决这些问题,更忠实地保留它们的相对差异,实现更准确的多奖励优化,同时显著提高训练稳定性。我们在三个任务上比较 GDPO 和 GRPO:工具调用、数学推理和代码推理,评估正确性指标(准确率、bug 比例)和约束遵守指标(格式、长度)。在所有设置中,GDPO 始终优于 GRPO,证明了其对于多奖励强化学习优化的有效性和通用性。
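The collapse described in the abstract can be illustrated with a small numeric sketch. It assumes that "coupled" normalization means summing the rewards per rollout and standardizing the sums within the group, and that "decoupled" normalization means standardizing each reward within the group before summing. The function names, the two binary rewards, and the toy group are hypothetical; the paper's actual formulation may include further steps.

```python
from statistics import mean, pstdev

EPS = 1e-8  # guards against division by zero when a reward is constant in the group


def normalize(values):
    """Standardize a list of scalars to zero mean and unit (population) std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def coupled_advantages(rewards):
    """GRPO-style sketch: sum the rewards per rollout, then normalize the sums."""
    return normalize([sum(r) for r in rewards])


def decoupled_advantages(rewards):
    """Decoupled sketch: normalize each reward column separately, then sum."""
    columns = list(zip(*rewards))
    normalized = [normalize(list(col)) for col in columns]
    return [sum(vals) for vals in zip(*normalized)]


# One group of 4 rollouts, each scored by 2 hypothetical binary rewards
# (for example, correctness and format).
group = [(1, 0), (0, 1), (0, 1), (0, 0)]

coupled = coupled_advantages(group)
decoupled = decoupled_advantages(group)

print("coupled:  ", [round(a, 3) for a in coupled])
print("decoupled:", [round(a, 3) for a in decoupled])
```

In this toy group the first three rollouts have the same reward sum, so the coupled version gives them the same advantage, even though rollout 0 satisfied the rarer reward. The decoupled version keeps rollout 0 apart from rollouts 1 and 2.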
LTX-2:高效联合视听基础模型
- 标题: LTX-2: Efficient Joint Audio-Visual Foundation Model
- 作者: Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman
- 日期: 2026-01-06
- ArXiv主页 : https://arxiv.org/abs/2601.03233
- 论文链接 : https://arxiv.org/pdf/2601.03233
- 项目链接 : https://app.ltx.studio/ltx-2-playground/i2v
- gitHub仓库 : https://github.com/Lightricks/LTX-2
英文摘要
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
中文摘要
最近的文本到视频扩散模型可以生成引人注目的视频序列,但它们仍然是无声的,缺少音频所提供的语义、情感和氛围线索。我们引入了 LTX-2,这是一种开源基础模型,能够以统一的方式生成高质量、时间同步的视听内容。LTX-2 由非对称双流 Transformer 组成,包含 14B 参数的视频流和 5B 参数的音频流,二者通过带有时间位置嵌入的双向音频-视频交叉注意力层以及用于共享时间步条件的跨模态 AdaLN 进行耦合。该架构可以实现统一视听模型的高效训练和推理,同时为视频生成分配比音频生成更多的容量。我们采用多语言文本编码器来实现更广泛的提示(prompt)理解,并引入模态感知的无分类器引导(modality-CFG)机制来改进视听对齐和可控性。除了生成语音之外,LTX-2 还可以生成丰富、连贯的音轨,贴合每个场景的人物、环境、风格和情感,并配有自然的背景音和拟音元素。在我们的评估中,该模型在开源系统中实现了最先进的视听质量和提示遵循度,同时以远低于专有模型的计算成本和推理时间提供与之相当的结果。所有模型权重和代码均公开发布。
NeoVerse:通过野外单目视频增强 4D 世界模型
- 标题: NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
- 作者: Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang
- 日期: 2026-01-01
- ArXiv主页 : https://arxiv.org/abs/2601.00393
- 论文链接 : https://arxiv.org/pdf/2601.00393
- 项目链接 : https://neoverse-4d.github.io
- gitHub仓库 : https://github.com/IamCreateAI/NeoVerse
英文摘要
In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io
中文摘要
在本文中,我们提出了 NeoVerse,一种多功能的 4D 世界模型,能够进行 4D 重建、新轨迹视频生成和丰富的下游应用。我们首先确定当前 4D 世界建模方法中可扩展性的常见限制,该限制是由昂贵且专门的多视图 4D 数据或繁琐的训练预处理引起的。相比之下,我们的 NeoVerse 建立在一个核心理念之上,该理念使整个管道可扩展至各种野外单目视频。具体来说,NeoVerse 具有无姿态前馈 4D 重建、在线单目退化模式模拟和其他良好对齐的技术。这些设计使 NeoVerse 具有多功能性并可推广到各个领域。同时,NeoVerse 在标准重建和生成基准方面实现了最先进的性能。我们的项目页面位于 https://neoverse-4d.github.io
Youtu-Agent:通过自动生成和混合策略优化来扩展代理生产力
- 标题: Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
- 作者: Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun
- 日期: 2025-12-31
- ArXiv主页 : https://arxiv.org/abs/2512.24615
- 论文链接 : https://arxiv.org/pdf/2512.24615
- 项目链接 : https://tencentcloudadp.github.io/youtu-agent/
- gitHub仓库 : https://github.com/TencentCloudADP/youtu-tip
英文摘要
Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose Youtu-Agent, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a Workflow mode for standard tasks and a Meta-Agent mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an Agent Practice module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an Agent RL module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agents in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi-hop QA benchmarks.
中文摘要
现有的大型语言模型(LLM)代理框架面临两个重大挑战:高配置成本和静态能力。构建高质量的代理通常需要在工具集成和提示工程(prompt engineering)方面进行大量的手动工作,而已部署的代理在没有昂贵的微调的情况下很难适应动态环境。为了解决这些问题,我们提出了 Youtu-Agent,这是一个专为 LLM 代理的自动生成和持续演进而设计的模块化框架。Youtu-Agent 具有结构化的配置系统,可解耦执行环境、工具包和上下文管理,从而实现灵活的重用和自动合成。我们引入两种生成范式:用于标准任务的 Workflow 模式和用于复杂的非标准需求的 Meta-Agent 模式,能够自动生成工具代码、提示和配置。此外,Youtu-Agent 建立了混合策略优化系统:(1) Agent Practice 模块,使代理能够在不更新参数的情况下通过上下文内优化来积累经验并提高性能;(2) Agent RL 模块,与分布式训练框架集成,以端到端、大规模的方式实现任何 Youtu-Agent 的可扩展且稳定的强化学习。实验表明,Youtu-Agent 使用开放权重模型在 WebWalkerQA (71.47%) 和 GAIA (72.8%) 上实现了最先进的性能。我们的自动生成流程实现了超过 81% 的工具合成成功率,而 Practice 模块将 AIME 2024/2025 上的性能分别提高了 +2.7% 和 +5.4%。此外,我们的 Agent RL 训练在 7B LLM 上实现了 40% 的加速和稳定的性能提升,在数学和通用/多跳 QA 基准上将编码/推理能力和搜索能力分别最高提升 35% 和 21%。
熵自适应微调:解决自信冲突以减轻遗忘
- 标题: Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
- 作者: Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma
- 日期: 2026-01-05
- ArXiv主页 : https://arxiv.org/abs/2601.02151
- 论文链接 : https://arxiv.org/pdf/2601.02151
- 项目链接 : https://ymxyll.github.io/EAFT/
- gitHub仓库 : https://github.com/hiyouga/LLaMA-Factory
英文摘要
Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model's internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as "Confident Conflicts" tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.
中文摘要
有监督微调(SFT)是领域适应的标准范例,但它经常会带来灾难性遗忘的代价。与此形成鲜明对比的是,同策略强化学习(RL)有效地保留了一般能力。我们研究了这种差异并确定了一个基本的分布差距:虽然 RL 与模型的内部信念保持一致,但 SFT 迫使模型适应外部监督。这种不匹配通常表现为"置信冲突"令牌,其特征是概率低但熵低。在这些情况下,模型对自己的预测非常有信心,但被迫学习不同的基本事实,从而触发破坏性的梯度更新。为了解决这个问题,我们提出了熵自适应微调(EAFT)。与仅依赖预测概率的方法不同,EAFT 利用令牌级熵作为门控机制来区分认知不确定性和知识冲突。这使得模型能够从不确定的样本中学习,同时抑制冲突数据上的梯度。在数学、医学和代理领域对 Qwen 和 GLM 系列(参数范围从 4B 到 32B 参数)进行的广泛实验证实了我们的假设。EAFT 始终与标准 SFT 的下游性能相匹配,同时显着减轻一般能力的下降。
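The entropy gating idea can be sketched for a single token. The sketch below assumes the gate is the predictive entropy divided by its maximum value, log(vocabulary size), and that the gate simply multiplies the token's cross-entropy loss. Both choices, and the function names, are illustrative assumptions rather than the formulation used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_gated_loss(logits, target):
    """Cross-entropy for one token, scaled by normalized predictive entropy.

    Returns (gated_loss, gate, plain_cross_entropy). In a real training loop
    the gate would be treated as a constant (no gradient flows through it).
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    gate = entropy / math.log(len(probs))  # in [0, 1]; 1 means uniform
    ce = -math.log(probs[target])
    return gate * ce, gate, ce


# "Confident conflict": the model is sure about token 0, but the label is token 1.
conflict = entropy_gated_loss([10.0, 0.0, 0.0, 0.0], target=1)

# Plain uncertainty: the model has no preference, so the label is informative.
uncertain = entropy_gated_loss([0.0, 0.0, 0.0, 0.0], target=1)

print("conflict  (gated, gate, ce):", conflict)
print("uncertain (gated, gate, ce):", uncertain)
```

The conflicting token has the larger plain cross-entropy, so it would dominate an ungated update. After gating, its contribution is far smaller than that of the uncertain token, which matches the behaviour the abstract describes.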
InfiniDepth:具有神经隐式场的任意分辨率和细粒度深度估计
- 标题: InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
- 作者: Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, Sida Peng
- 日期: 2026-01-06
- ArXiv主页 : https://arxiv.org/abs/2601.03252
- 论文链接 : https://arxiv.org/pdf/2601.03252
- 项目链接 : https://zju3dv.github.io/InfiniDepth
- gitHub仓库 : https://github.com/zju3dv/InfiniDepth
英文摘要
Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts.
中文摘要
现有的深度估计方法从根本上仅限于预测离散图像网格上的深度。这种表示限制了它们扩展到任意输出分辨率的能力,并阻碍几何细节的恢复。本文介绍了 InfiniDepth,它将深度表示为神经隐式场。通过简单而有效的局部隐式解码器,我们可以在连续的二维坐标上查询深度,从而实现任意分辨率和细粒度的深度估计。为了更好地评估我们方法的能力,我们从五款不同的游戏中构建了高质量的 4K 合成基准,涵盖具有丰富几何和外观细节的多样场景。大量实验表明,InfiniDepth 在相对和度量深度估计任务的合成基准和真实世界基准上均实现了最先进的性能,尤其是在精细细节区域表现出色。它还有利于大视点变化下的新视图合成任务,产生孔洞和伪影更少的高质量结果。
递归语言模型
- 标题: Recursive Language Models
- 作者: Alex L. Zhang, Tim Kraska, Omar Khattab
- 日期: 2025-12-31
- ArXiv主页 : https://arxiv.org/abs/2512.24601
- 论文链接 : https://arxiv.org/pdf/2512.24601
- 项目链接 : https://alexzhang13.github.io/blog/2025/rlm/
- gitHub仓库 : https://github.com/alexzhang13/rlm
英文摘要
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.
中文摘要
我们从推理时扩展(inference-time scaling)的角度研究如何让大型语言模型(LLM)处理任意长的提示。我们提出了递归语言模型(RLM),这是一种通用推理策略,它将长提示视为外部环境的一部分,并允许 LLM 以编程方式检查、分解提示,并在提示片段上递归调用自身。我们发现,RLM 能够成功处理超出模型上下文窗口多达两个数量级的输入,即使对于较短的提示,在四个不同的长上下文任务中,其质量也显著优于基础 LLM 和常见的长上下文脚手架(scaffold),同时每次查询的成本相当(或更低)。
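The recursive decomposition can be sketched with a stand-in for the model. Here `base_model` is a hypothetical stub that counts a query word and refuses any input longer than its window. The fixed split-in-half strategy is also an assumption made for illustration; the abstract describes the LLM itself deciding programmatically how to examine and decompose the prompt.

```python
def base_model(words, query, window):
    """Stand-in for one LLM call: it refuses inputs longer than its window."""
    if len(words) > window:
        raise ValueError("input exceeds context window")
    return sum(1 for w in words if w == query)


def recursive_call(words, query, window):
    """Answer directly if the input fits, otherwise split and recurse."""
    if len(words) <= window:
        return base_model(words, query, window)
    mid = len(words) // 2
    # Each half is handled by a fresh recursive call; counts combine by addition.
    return (recursive_call(words[:mid], query, window)
            + recursive_call(words[mid:], query, window))


# A 1,000-word "prompt" with three needles, and a model window of only 16 words.
prompt = ["hay"] * 1000
for position in (7, 480, 999):
    prompt[position] = "needle"

answer = recursive_call(prompt, "needle", window=16)
print(answer)
```

The stub never sees more than 16 words at once, yet the recursion answers a query over an input roughly two orders of magnitude larger than that window. Counting is a task whose partial answers combine trivially; tasks without such a simple combine step would need a model call to merge the sub-results.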
K-EXAONE技术报告
- 标题: K-EXAONE Technical Report
- 作者: Eunbi Choi, Kibong Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Hyunjik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Jiwon Ham, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Yonghwan Jo, Jiyeon Jung, Naeun Kang, Dohoon Kim, Euisoon Kim, Hayeon Kim, Hyosang Kim, Hyunseo Kim, Jieun Kim, Minu Kim, Myoungshin Kim, Unsol Kim, Youchul Kim, YoungJin Kim, Chaeeun Lee, Chaeyoon Lee, Changhun Lee, Dahm Lee, Edward Hwayoung Lee, Honglak Lee, Jinsang Lee, Jiyoung Lee, Sangeun Lee, Seungwon Lim, Solji Lim, Woohyung Lim, Chanwoo Moon, Jaewoo Park, Jinho Park, Yongmin Park, Hyerin Seo, Wooseok Seo, Yongwoo Song, Sejong Yang, Sihoon Yang, Chang En Yea, Sihyuk Yi, Chansik Yoon, Dongkeun Yoon, Sangyeon Yoon, Hyeongu Yun
- 日期: 2026-01-05
- ArXiv主页 : https://arxiv.org/abs/2601.01739
- gitHub仓库 : https://github.com/LG-AI-EXAONE/K-EXAONE
英文摘要
This technical report presents K-EXAONE, a large-scale multilingual language model developed by LG AI Research. K-EXAONE is built on a Mixture-of-Experts architecture with 236B total parameters, activating 23B parameters during inference. It supports a 256K-token context window and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. We evaluate K-EXAONE on a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities. Across these evaluations, K-EXAONE demonstrates performance comparable to open-weight models of similar size. K-EXAONE, designed to advance AI for a better life, is positioned as a powerful proprietary AI foundation model for a wide range of industrial and research applications.
中文摘要
本技术报告介绍了 LG AI Research 开发的大规模多语言语言模型 K-EXAONE。K-EXAONE 基于 Mixture-of-Experts 架构构建,总参数为 236B,在推理过程中激活 23B 参数。它支持 256K-token 上下文窗口,涵盖六种语言:韩语、英语、西班牙语、德语、日语和越南语。我们在涵盖推理、代理、通用、韩语和多语言能力的综合基准套件上评估 K-EXAONE。在这些评估中,K-EXAONE 表现出与规模相近的开放权重模型相当的性能。K-EXAONE 旨在推进人工智能,让生活更美好,定位为强大的专有人工智能基础模型,适用于广泛的工业和研究应用。
不断发展的程序化技能网络
- 标题: Evolving Programmatic Skill Networks
- 作者: Haochen Shi, Xingdi Yuan, Bang Liu
- 日期: 2026-01-07
- ArXiv主页 : https://arxiv.org/abs/2601.03509
- 论文链接 : https://arxiv.org/pdf/2601.03509
英文摘要
We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1) REFLECT for structured fault localization over skill compositions, (2) progressive optimization with maturity-aware update gating that stabilizes reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation that maintains network compactness. We further show that PSN's learning dynamics exhibit structural parallels to neural network training. Experiments on MineDojo and Crafter demonstrate robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions. We plan to open-source the code.
中文摘要
我们研究开放式具身环境中的持续技能获取,其中代理必须构建、完善和重用不断扩展的可执行技能库。我们介绍了程序化技能网络(PSN),这是一个框架,其中技能是可执行的符号程序,形成一个通过经验演变的组合网络。PSN 定义了通过大型语言模型实例化的三个核心机制:(1) REFLECT,用于对技能组合进行结构化故障定位;(2) 带有成熟度感知更新门控的渐进优化,在稳定可靠技能的同时保持不确定技能的可塑性;(3) 回滚验证下的规范结构重构,保持网络紧凑性。我们进一步表明 PSN 的学习动态与神经网络训练在结构上相似。MineDojo 和 Crafter 上的实验展示了稳健的技能重用、快速适应以及跨开放式任务分布的强大泛化能力。我们计划开源代码。
LLM 能预测自己的失败吗?通过内部电路实现自我意识
- 标题: Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
- 作者: Amirhosein Ghasemabadi, Di Niu
- 日期: 2025-12-23
- ArXiv主页 : https://arxiv.org/abs/2512.20578
- gitHub仓库 : https://github.com/Amirhosein-gh98/Gnosis
英文摘要
Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to generation process and can be extracted efficiently without external supervision.
中文摘要
大型语言模型(LLM)生成流畅且复杂的输出,但常常无法识别自己的错误和幻觉。现有的方法通常依赖于外部评判模型、多样本一致性或基于文本的自我批评,这些方法要么带来额外的计算开销,要么与真实正确性的相关性较弱。我们问:LLM 能否通过在推理过程中检查内部状态来预测自己的失败?我们引入了 Gnosis,一种轻量级的自我意识机制,使冻结的 LLM 能够通过解码来自隐藏状态和注意力模式的信号来执行内在的自我验证。Gnosis 被动地观察内部轨迹,将它们压缩为固定预算的描述符,并以可忽略的推理成本预测正确性,仅添加约 5M 参数,并且其运行与序列长度无关。在数学推理、开放域问答和学术知识基准测试中,以及在参数量从 1.7B 到 20B 的冻结主干上,Gnosis 在准确性和校准方面始终优于强大的内部基线和大型外部评判模型。此外,它能零样本泛化到部分生成的结果,从而能够及早检测失败的轨迹并实现计算感知的控制。这些结果表明,可靠的正确性线索是生成过程所固有的,并且可以在没有外部监督的情况下高效地提取。
NextFlow:统一顺序建模激活多模态理解和生成
- 标题: NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
- 作者: Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu
- 日期: 2026-01-05
- ArXiv主页 : https://arxiv.org/abs/2601.02204
- gitHub仓库 : https://github.com/ByteVisionLab/NextFlow
英文摘要
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
中文摘要
我们提出了 NextFlow,一个统一的仅解码器自回归 Transformer,在 6 万亿个交错的文本-图像离散 token 上进行训练。通过在统一的自回归架构中利用统一的视觉表示,NextFlow 原生激活多模态理解和生成能力,解锁图像编辑、交错内容和视频生成的能力。受模态独特性质的启发(文本是严格顺序的,而图像本质上是分层的),我们对文本保留下一个 token 预测,但对视觉生成采用下一尺度预测。这与传统的光栅扫描方法不同,只需 5 秒即可生成 1024x1024 图像,比同类 AR 模型快几个数量级。我们通过稳健的训练方案解决多尺度生成的不稳定性。此外,我们引入了一种用于强化学习的前缀调优(prefix-tuning)策略。实验表明,NextFlow 在统一模型中实现了最先进的性能,并且在视觉质量方面可与专门的扩散基线相媲美。
驯服幻觉:通过反事实视频生成增强 MLLM 的视频理解
- 标题: Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
- 作者: Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang
- 日期: 2025-12-30
- ArXiv主页 : https://arxiv.org/abs/2512.24271
- 论文链接 : https://arxiv.org/pdf/2512.24271
- 项目链接 : https://amap-ml.github.io/Taming-Hallucinations/
- gitHub仓库 : https://github.com/AMAP-ML/Taming-Hallucinations
英文摘要
Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
中文摘要
多模态大语言模型(MLLM)在视频理解方面取得了显著进展。然而,它们存在一个严重的弱点:过度依赖语言先验,这可能会导致缺乏视觉依据的幻觉,特别是在处理违背常识的反事实视频时。这种限制源于文本和视频之间固有的数据不平衡,由于收集和注释反事实数据的成本高昂,因此很难解决。为了解决这个问题,我们引入了 DualityForge,这是一种新颖的反事实数据合成框架,它采用可控的、基于扩散的视频编辑将现实世界的视频转换为反事实场景。通过将结构化上下文信息嵌入到视频编辑和 QA 生成过程中,该框架会自动生成高质量的 QA 对以及原始-编辑视频对,以进行对比训练。基于此,我们构建了 DualityVidQA,一个旨在减少 MLLM 幻觉的大规模视频数据集。此外,为了充分利用配对数据的对比性质,我们提出了对偶归一化优势训练(DNA-Train),这是一种两阶段 SFT-RL 训练机制,其中 RL 阶段应用成对 $\ell_1$ 优势归一化,从而实现更稳定、更高效的策略优化。DualityVidQA-Test 上的实验表明,我们的方法大大减少了反事实视频上的模型幻觉,比 Qwen2.5-VL-7B 基线相对提高了 24.0%。此外,我们的方法在幻觉和通用基准方面都取得了显著的进步,表明了强大的泛化能力。我们将开源我们的数据集和代码。
MOSS Transcribe Diarize:通过说话者分类进行准确转录
- 标题: MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
- 作者: MOSI. AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
- 日期: 2026-01-04
- ArXiv主页 : https://arxiv.org/abs/2601.01554
- 论文链接 : https://arxiv.org/pdf/2601.01554
- 项目链接 : https://mosi.cn/models/moss-transcribe-diarize
英文摘要
Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.
中文摘要
说话人归属、带时间戳的转录(SATS)旨在转录所说内容并精确确定每个说话人的发言时间,这对于会议转录特别有价值。现有的 SATS 系统很少采用端到端的建模方式,并且进一步受到有限的上下文窗口、较弱的长程说话人记忆以及无法输出时间戳的限制。为了解决这些限制,我们提出了 MOSS Transcribe Diarize,这是一种统一的多模态大语言模型,可以在端到端范式中联合执行说话人归属、带时间戳的转录。MOSS Transcribe Diarize 在大量真实场景数据上训练,并配备了 128k 上下文窗口以支持长达 90 分钟的输入,具有良好的扩展性和稳健的泛化能力。在综合评估中,它在多个公共和内部基准测试中优于最先进的商业系统。
头像强制:实时交互式头像生成以实现自然对话
- 标题: Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
- 作者: Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
- 日期: 2026-01-02
- ArXiv主页 : https://arxiv.org/abs/2601.00664
- 论文链接 : https://arxiv.org/pdf/2601.00664
- 项目链接 : https://taekyungki.github.io/AvatarForcing/
- gitHub仓库 : https://github.com/TaekyungKi/AvatarForcing
英文摘要
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.
中文摘要
说话人头像生成(talking head generation)从静态肖像中创建栩栩如生的头像,用于虚拟通信和内容创建。然而,当前的模型尚未传达真正互动交流的感觉,常常产生缺乏情感参与的单向反应。我们确定了实现真正交互式头像的两个关键挑战:在因果约束下实时生成运动,以及在没有额外标注数据的情况下学习富有表现力的、生动的反应。为了应对这些挑战,我们提出了 Avatar Forcing,这是一种用于交互式头部头像生成的新框架,通过扩散强制(diffusion forcing)对实时的用户与头像交互进行建模。这种设计允许头像以低延迟处理实时多模态输入,包括用户的音频和动作,从而对言语和非言语线索(例如语音、点头和笑声)做出即时反应。此外,我们引入了一种直接偏好优化方法,该方法利用通过丢弃用户条件构建的合成负样本(losing samples),从而实现表达性交互的无标签学习。实验结果表明,我们的框架能够实现低延迟(约 500 毫秒)的实时交互,与基线相比实现了 6.8 倍的加速,并产生具有反应性和表现力的头像运动,在与基线的对比中获得超过 80% 的偏好率。
DreamID-V:通过扩散 Transformer 弥合图像到视频的差距,实现高保真换脸
- 标题: DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
- 作者: Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He
- 日期: 2026-01-04
- ArXiv主页 : https://arxiv.org/abs/2601.01425
- 论文链接 : https://arxiv.org/pdf/2601.01425
- 项目链接 : https://guoxu1233.github.io/DreamID-V/
- gitHub仓库 : https://github.com/bytedance/DreamID-V
英文摘要
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-model conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, which can be seamlessly adapted to various swap-related tasks.
中文摘要
视频换脸 (VFS) 需要将源身份无缝注入目标视频,同时精细保留原始的姿态、表情、光照、背景和动态信息。现有方法难以在保持时间一致性的同时兼顾身份相似度和属性保持。为了应对这一挑战,我们提出了一个全面的框架,将图像换脸 (IFS) 的优势无缝迁移到视频领域。我们首先介绍一种新颖的数据管线 SyncID-Pipe,它预训练一个身份锚定视频合成器 (Identity-Anchored Video Synthesizer),并将其与 IFS 模型相结合,构建双向 ID 四元组以进行显式监督。基于配对数据,我们提出了首个基于扩散 Transformer (Diffusion Transformer) 的框架 DreamID-V,其核心的模态感知条件模块 (Modality-Aware Conditioning) 可有区分地注入多模型条件。同时,我们提出了从合成到真实的课程机制和身份一致性强化学习策略,以增强挑战性场景下的视觉真实感和身份一致性。针对基准有限的问题,我们推出了 IDBench-V,这是一个涵盖多种场景的综合基准。大量实验表明,DreamID-V 优于最先进的方法,并展现出卓越的通用性,可以无缝适配各种交换 (swap) 相关任务。
UniCorn:通过自我生成的监督实现自我改进的统一多模式模型
- 标题: UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
- 作者: Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao
- 日期: 2026-01-06
- ArXiv主页 : https://arxiv.org/abs/2601.03193
- 论文链接 : https://arxiv.org/pdf/2601.03193
- 项目链接 : https://costaliya.github.io/UniCorn.github.io/
- gitHub仓库 : https://github.com/Hungryyan1/UniCorn
英文摘要
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles: Proposer, Solver, and Judge, UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text to Image to Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF(73.8), DPG(86.8), CompBench(88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
中文摘要
虽然统一多模态模型 (UMM) 在跨模态理解方面取得了显著的成功,但它们利用这些内部知识进行高质量生成的能力仍存在明显差距。我们将这种差异形式化为传导性失语 (Conduction Aphasia),即模型能够准确解读多模态输入,却难以将这种理解转化为忠实且可控的合成的现象。为了解决这个问题,我们提出了 UniCorn,这是一个简单而优雅的自我改进框架,无需外部数据或教师监督。通过将单个 UMM 划分为三个协作角色:提议者 (Proposer)、求解者 (Solver) 和评判者 (Judge),UniCorn 通过自我博弈 (self-play) 生成高质量的交互,并采用认知模式重建将潜在理解蒸馏为显式的生成信号。为了验证多模态一致性的恢复,我们引入了 UniCycle,这是一个基于"文本到图像到文本"重建循环的循环一致性基准。大量实验表明,UniCorn 在六个通用图像生成基准上相比基础模型取得了全面且实质性的提升。值得注意的是,它在 TIIF (73.8)、DPG (86.8)、CompBench (88.5) 和 UniCycle 上达到了 SOTA 性能,同时在 WISE 上进一步取得 +5.0、在 OneIG 上取得 +6.5 的大幅提升。这些结果表明,我们的方法在保持稳健理解能力的同时显著增强了 T2I 生成,证明了完全自监督精炼对统一多模态智能的可扩展性。
RL-AWB:低光夜间场景自动白平衡校正的深度强化学习
- 标题: RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
- 作者: Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu
- 日期: 2026-01-08
- ArXiv主页 : https://arxiv.org/abs/2601.05249
- 论文链接 : https://arxiv.org/pdf/2601.05249
- 项目链接 : https://ntuneillee.github.io/research/rl-awb/
- gitHub仓库 : https://github.com/BrianChen1120/RL-AWB
英文摘要
Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiment results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/
中文摘要
由于低光噪声和复杂的光照条件,夜间颜色恒常性仍然是计算摄影中的一个具有挑战性的问题。我们提出了 RL-AWB,这是一种将统计方法与深度强化学习相结合的新型夜间白平衡框架。我们的方法从针对夜间场景定制的统计算法开始,将显著灰度像素检测与新颖的光照估计相结合。在此基础上,我们开发了首个以该统计算法为核心的深度强化学习颜色恒常性方法,它模仿专业的 AWB 调优专家,为每张图像动态优化参数。为了便于跨传感器评估,我们引入了首个多传感器夜间数据集。实验结果表明,我们的方法在低光和光照良好的图像上均具有出色的泛化能力。项目页面:https://ntuneillee.github.io/research/rl-awb/
NitroGen:通用游戏代理的开放基础模型
- 标题: NitroGen: An Open Foundation Model for Generalist Gaming Agents
- 作者: Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, Yisong Yue, Yejin Choi, Yuke Zhu, Linxi "Jim" Fan
- 日期: 2026-01-04
- ArXiv主页 : https://arxiv.org/abs/2601.02427
- 论文链接 : https://arxiv.org/pdf/2601.02427
- 项目链接 : https://nitrogen.minedojo.org/
- gitHub仓库 : https://github.com/MineDojo/NitroGen
英文摘要
We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.
中文摘要
我们推出了 NitroGen,这是一个面向通用游戏智能体的视觉-动作基础模型,在 1,000 多款游戏、共 40,000 小时的游戏视频上训练而成。我们整合了三个关键要素:1)通过从公开的游戏视频中自动提取玩家动作而构建的互联网规模视频-动作数据集,2)可以衡量跨游戏泛化能力的多游戏基准环境,以及 3)通过大规模行为克隆训练的统一视觉-动作模型。NitroGen 在不同领域展现了强大的能力,包括 3D 动作游戏中的战斗遭遇、2D 平台游戏中的高精度控制以及程序生成世界中的探索。它可以有效地迁移到未见过的游戏中,与从头开始训练的模型相比,任务成功率的相对提升最高可达 52%。我们发布数据集、评估套件和模型权重,以推进对通用具身智能体的研究。
嵌套学习:深度学习架构的幻想
- 标题: Nested Learning: The Illusion of Deep Learning Architectures
- 作者: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
- 日期: 2025-12-31
- ArXiv主页 : https://arxiv.org/abs/2512.24695
- 论文链接 : https://arxiv.org/pdf/2512.24695
英文摘要
Despite the recent progresses, particularly in developing Language Models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model with a set of nested, multi-level, and/or parallel optimization problems, each of which with its own context flow. Through the lenses of NL, existing deep learning methods learns from data through compressing their own context flow, and in-context learning naturally emerges in large models. NL suggests a philosophy to design more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL by presenting three core contributions: (1) Expressive Optimizers: We show that known gradient-based optimizers, such as Adam, SGD with Momentum, etc., are in fact associative memory modules that aim to compress the gradients' information (by gradient descent). Building on this insight, we present other more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Taking advantage of NL's insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation for memory system that generalizes the traditional viewpoint of long/short-term memory. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, showing promising results in language modeling, knowledge incorporation, and few-shot generalization tasks, continual learning, and long-context reasoning tasks.
中文摘要
尽管最近取得了进展,特别是在开发语言模型方面,但仍存在根本性挑战和未解答的问题,即此类模型如何不断学习/记忆、自我改进并找到有效的解决方案。在本文中,我们提出了一种新的学习范式,称为嵌套学习(NL),它连贯地表示具有一组嵌套、多级和/或并行优化问题的机器学习模型,每个问题都有自己的上下文流。通过 NL 的视角,现有的深度学习方法通过压缩自身的上下文流来从数据中学习,并且上下文学习自然地出现在大型模型中。NL 提出了一种理念,即设计更多级别、更具表现力的学习算法,从而实现更高阶的上下文学习,并有可能释放有效的持续学习能力。我们通过提出三个核心贡献来倡导 NL:(1)表达优化器:我们证明已知的基于梯度的优化器,例如 Adam、带有 Momentum 的 SGD 等,实际上是旨在压缩梯度信息(通过梯度下降)的关联记忆模块。基于这一见解,我们提出了其他更具表现力的优化器,具有深度记忆和/或更强大的学习规则;(2)自我修改学习模块:利用NL对学习算法的见解,我们提出了一个序列模型,通过学习自己的更新算法来学习如何修改自己;(3)连续记忆系统:我们提出了一种新的记忆系统表述,概括了长/短期记忆的传统观点。将我们的自修改序列模型与连续记忆系统相结合,我们提出了一个名为 Hope 的持续学习模块,在语言建模、知识整合、小样本泛化任务、持续学习和长上下文推理任务方面显示出有希望的结果。
Atlas:为多领域复杂推理编排异构模型和工具
- 标题: Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
- 作者: Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
- 日期: 2026-01-07
- ArXiv主页 : https://arxiv.org/abs/2601.03872
- 论文链接 : https://arxiv.org/pdf/2601.03872
英文摘要
The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
中文摘要
大型语言模型 (LLM) 与外部工具的集成显著扩展了 AI 智能体的能力。然而,随着 LLM 和工具的多样性不断增加,选择最佳的模型-工具组合成为一个高维优化难题。现有方法通常依赖单一模型或固定的工具调用逻辑,无法利用异构模型-工具对之间的性能差异。在本文中,我们提出了 ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation,自适应工具-LLM 对齐与协同调用),这是一种用于跨领域复杂推理中动态工具使用的双路径框架。ATLAS 通过双路径方法运行:(1) 免训练的基于聚类的路由,利用经验先验进行特定领域的对齐;(2) 基于强化学习的多步路由,探索自主轨迹以实现分布外泛化。在 15 个基准上的大量实验表明,我们的方法优于 GPT-4o 等闭源模型,并在分布内 (+10.1%) 和分布外 (+13.1%) 任务上均超过了现有的路由方法。此外,通过编排专门的多模态工具,我们的框架在视觉推理方面也取得了显著提升。
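摘要中"免训练的基于聚类的路由"这一路径,可以用下面的最小草图来帮助理解。需要强调这只是依据摘要文字做出的假设性示意:函数名 `nearest_cluster`、`route`,模型/工具名称以及表中的准确率数字都是为演示虚构的玩具数据,并非 ATLAS 的实际实现或实验结果。

```python
def nearest_cluster(query_vec, centroids):
    """返回与查询向量欧氏距离最近的聚类名称。"""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda name: sq_dist(query_vec, centroids[name]))


def route(query_vec, centroids, prior_accuracy):
    """先确定查询所属聚类,再按该聚类上的经验先验选出最优的(模型, 工具)对。"""
    cluster = nearest_cluster(query_vec, centroids)
    pairs = prior_accuracy[cluster]
    return cluster, max(pairs, key=pairs.get)


# 以下均为演示用的虚构数据:两个聚类中心,以及各聚类上每个(模型, 工具)对的历史准确率
centroids = {"math": [1.0, 0.0], "code": [0.0, 1.0]}
prior_accuracy = {
    "math": {("model_a", "calculator"): 0.82, ("model_b", "python"): 0.74},
    "code": {("model_a", "calculator"): 0.40, ("model_b", "python"): 0.88},
}

cluster, pair = route([0.1, 0.9], centroids, prior_accuracy)
print(cluster, pair)
```

这种做法不需要任何训练,只依赖已有的统计表;摘要中的第二条路径(基于强化学习的多步路由)则用于处理这类先验无法覆盖的分布外查询。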
可学习的乘数:释放语言模型矩阵层的规模
- 标题: Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
- 作者: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
- 日期: 2026-01-08
- ArXiv主页 : https://arxiv.org/abs/2601.04890
- 论文链接 : https://arxiv.org/pdf/2601.04890
- 项目链接 : https://tiiuae.github.io/Falcon-H1/
英文摘要
Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both Adam and Muon optimizers, where it shows improvement in downstream evaluations matching the improvement of the switching from Adam to Muon.
中文摘要
对矩阵层施加权重衰减 (WD) 是大语言模型预训练的标准做法。先前的工作表明,随机梯度噪声会引起权重矩阵 W 的类布朗运动式膨胀,其增长被 WD 抵消,从而形成具有特定权重范数 ||W|| 的 WD-噪声平衡。在这项工作中,我们将该平衡范数视为训练过程的有害产物,并通过引入可学习乘数来学习最优尺度,从而解决这一问题。首先,我们为 W 附加一个可学习的标量乘数,并确认 WD-噪声平衡范数并非最优:学到的尺度会适应数据并提升性能。然后,我们指出各行和各列的范数同样受到类似约束,并通过引入可学习的逐行和逐列乘数来释放它们的尺度。我们的方法可以看作 muP 乘数的一种可学习、更具表达力的推广。它优于经过良好调优的 muP 基线,降低了乘数调优的计算开销,并引出了一些实际问题,例如前向传播的对称性以及所学乘数随宽度的缩放规律。最后,我们在 Adam 和 Muon 两种优化器上验证了可学习乘数,其在下游评估中带来的提升与从 Adam 切换到 Muon 所带来的提升相当。
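逐行、逐列乘数的作用可以用一个纯 Python 的小例子来示意。以下只是一个假设性草图:这里假设有效权重为 diag(r)·W·diag(c),函数名 `scaled_matvec` 和参数名 `row_mult`、`col_mult` 均为演示而设,并非论文或 Falcon-H1 代码中的实际接口;训练时这些乘数会作为参数被学习,此处仅展示前向计算。

```python
def scaled_matvec(W, x, row_mult, col_mult):
    """计算 y = diag(row_mult) @ W @ diag(col_mult) @ x。"""
    # 先用逐列乘数缩放输入的各个分量
    scaled_x = [c * xi for c, xi in zip(col_mult, x)]
    # 再做矩阵-向量乘法,并用逐行乘数缩放每个输出分量
    return [r * sum(w * xi for w, xi in zip(row, scaled_x))
            for r, row in zip(row_mult, W)]


W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]

# 乘数全为 1 时退化为普通的线性层
baseline = scaled_matvec(W, x, [1.0, 1.0], [1.0, 1.0])
# 调整乘数即可独立改变各行、各列的有效尺度,而 W 本身保持不变
rescaled = scaled_matvec(W, x, [2.0, 0.5], [1.0, 3.0])
print(baseline, rescaled)
```

这样一来,即使 W 的范数被权重衰减与梯度噪声的平衡所固定,层的有效尺度仍可以通过乘数自由调整。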
MindWatcher:迈向更智能的多模态工具集成推理
- 标题: MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
- 作者: Jiawei Chen, Xintian Shen, Lihao Zheng, Zhenwei Shao, Hongyuan Zhang, Pengfei Yu, Xudong Rao, Ning Mao, Xiaobo Liu, Lian Wen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Shanshan Li, Zide Liu, Jing Luo, Lifu Mu, Xuhao Pan, Chang Ren, Haoyi Sun, Qian Wang, Wei Wang, Hongfu Yang, Jiqing Zhan, Chunpeng Zhou, Zheng Zhou, Hao Ma, Tao Wei, Pan Zhou, Wei Chen
- 日期: 2025-12-29
- ArXiv主页 : https://arxiv.org/abs/2512.23412
- gitHub仓库 : https://github.com/TIMMY-CHAN/MindWatcher
英文摘要
Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
中文摘要
传统的基于工作流的智能体在解决需要工具调用的现实问题时表现出的智能有限。能够自主推理和调用工具的工具集成推理 (TIR) 智能体正在迅速兴起,成为处理需要与外部环境多步交互的复杂决策任务的有力方法。在这项工作中,我们介绍了 MindWatcher,这是一个集成了交错思考和多模态思维链 (CoT) 推理的 TIR 智能体。MindWatcher 可以自主决定是否以及如何调用各种工具并协调它们的使用,而无需依赖人工提示或工作流。交错思考范式使模型能够在任何中间阶段在思考和工具调用之间切换,而其多模态 CoT 能力允许在推理过程中对图像进行操作,以获得更精确的搜索结果。我们实现了自动化的数据审核和评估管线,辅以人工整理的高质量训练数据集,并构建了一个名为 MindWatcher-Evaluate Bench (MWE-Bench) 的基准来评估其性能。MindWatcher 配备了一整套辅助推理工具,使其能够解决广泛领域的多模态问题。一个大规模、高质量的本地图像检索数据库涵盖汽车、动物、植物等八个类别,使模型尽管规模较小,仍具备稳健的物体识别能力。最后,我们为 MindWatcher 设计了更高效的训练基础设施,提升了训练速度和硬件利用率。实验不仅表明 MindWatcher 凭借更出色的工具调用能力达到或超过了更大或更新模型的性能,还揭示了智能体训练的关键见解,例如智能体强化学习中的"遗传继承"现象。
通过 FusionRoute 进行代币级法学硕士合作
- 标题: Token-Level LLM Collaboration via FusionRoute
- 作者: Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao
- 日期: 2026-01-08
- ArXiv主页 : https://arxiv.org/abs/2601.05106
- 论文链接 : https://arxiv.org/pdf/2601.05106
英文摘要
Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
中文摘要
大型语言模型 (LLM) 在不同领域各有所长。然而,要用单个通用模型在这些领域都取得强劲性能,通常需要将模型扩展到训练和部署成本极其高昂的规模。另一方面,虽然较小的领域专用模型效率高得多,但它们很难泛化到训练分布之外。为了解决这一困境,我们提出了 FusionRoute,这是一个稳健而有效的 token 级多 LLM 协作框架,其中一个轻量级路由器同时 (i) 在每个解码步骤选择最合适的专家,并且 (ii) 贡献一个互补 logit,通过 logit 相加来细化或纠正所选专家的下一个 token 分布。与仅依赖固定专家输出的现有 token 级协作方法不同,我们给出的理论分析表明,纯粹的仅专家路由存在根本性局限:除非较强的全局覆盖假设成立,否则它通常无法实现最优解码策略。通过用可训练的互补生成器增强专家选择,FusionRoute 扩展了有效的策略类,并能在温和条件下恢复最优价值函数。实验上,在 Llama-3 和 Gemma-2 两个系列以及涵盖数学推理、代码生成和指令遵循的多种基准上,FusionRoute 均优于序列级和 token 级协作、模型合并以及直接微调,同时在各领域专家模型各自擅长的任务上仍具竞争力。
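摘要中描述的单步解码过程(选择专家,再通过 logit 相加进行修正)可以粗略示意如下。这是一个基于摘要的假设性草图:函数名 `fused_next_token_dist`、输入的组织方式以及示例中的数值都是为演示而设,并非 FusionRoute 的实际代码;真实系统中路由器的得分和互补 logit 都由训练得到的网络产生。

```python
import math


def softmax(logits):
    """数值稳定的 softmax。"""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_next_token_dist(expert_logits, router_scores, router_logits):
    """选出路由得分最高的专家,并把路由器的互补 logit 加到该专家的 logit 上。"""
    chosen = max(range(len(router_scores)), key=lambda i: router_scores[i])
    fused = [e + r for e, r in zip(expert_logits[chosen], router_logits)]
    return chosen, softmax(fused)


# 演示用的玩具数值:两个专家、词表大小为 3
expert_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
router_scores = [0.9, 0.1]          # 路由器对各专家的打分
router_logits = [-3.0, 0.0, 0.0]    # 路由器贡献的互补 logit

chosen, probs = fused_next_token_dist(expert_logits, router_scores, router_logits)
print(chosen, probs)
```

在这个例子里,被选中的专家原本最偏好 0 号 token,但互补 logit 把它压低了,说明融合后的分布可以偏离任何单个专家的输出;这正对应摘要中"纯粹的仅专家路由存在根本性局限"的论点。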
MiMo-V2-Flash技术报告
- 标题: MiMo-V2-Flash Technical Report
- 作者: Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang, Peidian Li, Qianli Chen, Shaohui Liu, Shihua Yu, Shijie Cao, Shimao Chen, Shouqiu Yu, Shuo Liu, Tianling Zhou, Weijiang Su, Weikun Wang, Wenhan Ma, Xiangwei Deng, Bohan Mao, Bowen Ye, Can Cai, Chenghua Wang, Chengxuan Zhu, Chong Ma, Chun Chen, Chunan Li, Dawei Zhu, Deshan Xiao, Dong Zhang, Duo Zhang, Fangyue Liu, Feiyu Yang, Fengyuan Shi, Guoan Wang, Hao Tian, Hao Wu, Heng Qu, Hongfei Yi, Hongxu An, Hongyi Guan, Xing Zhang, Yifan Song, Yihan Yan, Yihao Zhao, Yingchun Lai, Yizhao Gao, Yu Cheng, Yuanyuan Tian, Yudong Wang, Zhen Tang, Zhengju Tang, Zhengtao Wen, Zhichao Song, Zhixian Zheng, Zihan Jiang, Jian Wen, Jiarui Sun, Jiawei Li, Jinlong Xue, Jun Xia, Kai Fang, Menghang Zhu, Nuo Chen, Qian Tu, Qihao Zhang, Qiying Wang, Rang Li, Rui Ma, Shaolei Zhang, Shengfan Wang, Shicheng Li, Shuhao Gu, Shuhuai Ren, Sirui Deng, Tao Guo, Tianyang Lu, Weiji Zhuang, Weikang Zhang, Weimin Xiong, Wenshan Huang, Wenyu Yang, Xin Zhang, Xing Yong, Xu Wang, Xueyang Xie, Yilin Jiang, Yixin Yang, Yongzhe He, Yu Tu, Yuanliang Dong, Yuchen Liu, Yue Ma, Yue Yu, Yuxing Xiang, Zhaojun Huang, Zhenru Lin, Zhipeng Xu, Zhiyang Chen, Zhonghua Deng, Zihan Zhang, Zihao Yue
- 日期: 2026-01-06
- ArXiv主页 : https://arxiv.org/abs/2601.02780
- 论文链接 : https://arxiv.org/pdf/2601.02780
- 项目链接 : https://mimo.xiaomi.com/blog/mimo-v2-flash
- gitHub仓库 : https://github.com/XiaomiMiMo/MiMo-V2-Flash
英文摘要
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
中文摘要
我们推出 MiMo-V2-Flash,这是一个混合专家 (MoE) 模型,总参数量为 309B,激活参数量为 15B,专为快速、强大的推理和智能体能力而设计。MiMo-V2-Flash 采用混合注意力架构,将滑动窗口注意力 (SWA) 与全局注意力交错排列,滑动窗口为 128 个 token,混合比例为 5:1。该模型使用多 token 预测 (MTP) 在 27 万亿个 token 上进行预训练,原生上下文长度为 32k,随后扩展到 256k。为了高效扩展后训练计算,MiMo-V2-Flash 引入了一种新颖的多教师在线策略蒸馏 (MOPD) 范式。在此框架中,领域专用教师(例如通过大规模强化学习训练得到)提供稠密的 token 级奖励,使学生模型能够完美掌握教师的专长。MiMo-V2-Flash 可与 DeepSeek-V3.2 和 Kimi-K2 等顶级开放权重模型相媲美,而其总参数量分别仅为二者的 1/2 和 1/3。在推理过程中,通过将 MTP 复用为推测解码的草稿模型,MiMo-V2-Flash 在使用三个 MTP 层时实现了最高 3.6 的接受长度和 2.6 倍的解码加速。我们开源了模型权重和三层 MTP 权重,以促进开放研究和社区合作。
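混合注意力的"5:1 比例、128 token 滑动窗口"可以用下面的草图直观理解。注意这里有两处假设:一是把 5:1 理解为每 5 个 SWA 层后接 1 个全局注意力层并周期性重复,二是窗口包含当前 token 本身;具体的层排列和窗口定义摘要并未说明,函数名 `layer_types`、`can_attend` 也是为演示而设。

```python
def layer_types(num_layers, swa_per_global=5):
    """生成层类型序列:每 swa_per_global 个 SWA 层后接 1 个全局层(排列方式为假设)。"""
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa"
            for i in range(num_layers)]


def can_attend(query_pos, key_pos, layer_type, window=128):
    """因果掩码:全局层可看到全部历史位置,SWA 层只能看到最近 window 个 token。"""
    if key_pos > query_pos:
        return False              # 不能关注未来位置
    if layer_type == "global":
        return True
    return query_pos - key_pos < window


pattern = layer_types(12)
print(pattern)
print(can_attend(200, 73, "swa"), can_attend(200, 72, "swa"), can_attend(200, 0, "global"))
```

SWA 层的 KV 缓存大小不随序列增长,只有少数全局层需要保存完整历史,这是此类混合设计在长上下文下节省显存和计算的直观原因。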
SimpleMem:LLM 代理的高效终身记忆
- 标题: SimpleMem: Efficient Lifelong Memory for LLM Agents
- 作者: Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao
- 日期: 2026-01-05
- ArXiv主页 : https://arxiv.org/abs/2601.02553
- 论文链接 : https://arxiv.org/pdf/2601.02553
- 项目链接 : https://aiming-lab.github.io/SimpleMem-Page/
- gitHub仓库 : https://github.com/aiming-lab/SimpleMem
英文摘要
To support reliable long-term interaction in complex environments, LLM agents require memory systems that efficiently manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which applies entropy-aware filtering to distill unstructured interactions into compact, multi-view indexed memory units; (2) Recursive Memory Consolidation, an asynchronous process that integrates related units into higher-level abstract representations to reduce redundancy; and (3) Adaptive Query-Aware Retrieval, which dynamically adjusts retrieval scope based on query complexity to construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.
中文摘要
为了支持复杂环境中可靠的长期交互,LLM 智能体需要能够高效管理历史经验的记忆系统。现有方法要么通过被动的上下文扩展保留完整的交互历史,导致大量冗余;要么依靠迭代推理来过滤噪声,产生高昂的 token 成本。为了应对这一挑战,我们引入了 SimpleMem,一个基于语义无损压缩的高效记忆框架。我们提出了一个三阶段管线,旨在最大化信息密度和 token 利用率:(1) 语义结构化压缩 (Semantic Structured Compression),应用熵感知过滤将非结构化交互提炼成紧凑的多视图索引记忆单元;(2) 递归记忆整合 (Recursive Memory Consolidation),这是一个异步过程,将相关单元整合为更高层的抽象表示以减少冗余;(3) 自适应查询感知检索 (Adaptive Query-Aware Retrieval),根据查询复杂度动态调整检索范围,从而高效地构建精确的上下文。在基准数据集上的实验表明,我们的方法在准确性、检索效率和推理成本方面始终优于基线方法,平均 F1 提高了 26.4%,同时将推理时的 token 消耗最多减少 30 倍,展示了性能与效率之间的出色平衡。代码可在 https://github.com/aiming-lab/SimpleMem 获取。
VideoAuto-R1:一次思考,两次回答的视频自动推理
- 标题: VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
- 作者: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
- 日期: 2026-01-08
- ArXiv主页 : https://arxiv.org/abs/2601.05175
- 论文链接 : https://arxiv.org/pdf/2601.05175
- 项目链接 : https://ivul-kaust.github.io/projects/videoauto-r1/
- gitHub仓库 : https://github.com/IVUL-KAUST/VideoAuto-R1
英文摘要
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
中文摘要
思维链 (CoT) 推理已成为多模态大语言模型在视频理解任务上的有力工具。然而,它相对于直接回答的必要性和优势仍未得到充分探索。在本文中,我们首先证明,对于经 RL 训练的视频模型,直接回答的表现往往与 CoT 相当甚至更好,尽管 CoT 以更高的计算成本产生了逐步分析。受此启发,我们提出了 VideoAuto-R1,这是一个采用"必要时才推理"策略的视频理解框架。在训练过程中,我们的方法遵循"一次思考,两次回答"的范式:模型首先生成一个初始答案,然后进行推理,最后输出一个经过复核的答案。两个答案都通过可验证的奖励进行监督。在推理阶段,模型使用初始答案的置信度分数来决定是否继续进行推理。在视频问答和定位 (grounding) 基准上,VideoAuto-R1 达到了最先进的准确率,同时效率显著提高,将平均响应长度缩短了约 3.3 倍,例如从 149 个 token 减少到仅 44 个 token。此外,我们观察到在感知导向的任务上思考模式的激活率较低,而在推理密集型任务上激活率较高。这表明基于语言的显式推理通常是有益的,但并非总是必要的。
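推理阶段"根据初始答案置信度决定是否继续推理"的控制流大致如下。这是一个假设性草图:阈值 `threshold=0.9`、函数名 `answer_with_optional_reasoning` 以及用回调 `reason_fn` 代表"推理并给出复核答案"的方式都是为演示而设,摘要并未给出置信度的具体计算方法或阈值。

```python
def answer_with_optional_reasoning(initial_answer, confidence, reason_fn, threshold=0.9):
    """置信度足够高时直接返回初始答案,否则调用 reason_fn 进行推理并返回复核后的答案。

    返回 (最终答案, 是否触发了推理)。
    """
    if confidence >= threshold:
        return initial_answer, False
    return reason_fn(initial_answer), True


def toy_reasoner(initial_answer):
    """演示用的占位推理函数:真实系统中这里会生成思维链并输出复核后的答案。"""
    return "B"


# 高置信度:跳过推理,节省 token
print(answer_with_optional_reasoning("A", 0.97, toy_reasoner))
# 低置信度:触发推理,采用复核后的答案
print(answer_with_optional_reasoning("A", 0.42, toy_reasoner))
```

由于大多数感知类问题在第一次回答时置信度就足够高,这种门控方式只在少数样本上付出推理开销,与摘要中平均响应长度大幅缩短的现象相符。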
SenseNova-MARS:通过强化学习增强多模式代理推理和搜索
- 标题: SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
- 作者: Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu
- 日期: 2025-12-30
- ArXiv主页 : https://arxiv.org/abs/2512.24330
英文摘要
While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
中文摘要
虽然视觉语言模型 (VLM) 可以通过智能体式推理解决复杂任务,但其能力在很大程度上仍局限于面向文本的思维链或孤立的工具调用。它们尚未表现出将动态工具操作与连续推理无缝交织所需的类人熟练程度,特别是在需要协调使用搜索和图像裁剪等外部工具的知识密集型、视觉复杂的场景中。在这项工作中,我们介绍了 SenseNova-MARS,这是一个新颖的多模态智能体推理与搜索 (Multimodal Agentic Reasoning and Search) 框架,它通过强化学习 (RL) 赋予 VLM 交错的视觉推理和工具使用能力。具体来说,SenseNova-MARS 动态集成了图像搜索、文本搜索和图像裁剪工具,以应对细粒度和知识密集型的视觉理解挑战。在强化学习阶段,我们提出了批归一化组序列策略优化 (BN-GSPO) 算法,以提高训练稳定性并增强模型有效调用工具和推理的能力。为了全面评估智能体式 VLM 在复杂视觉任务上的表现,我们引入了 HR-MMSearch 基准,这是首个面向搜索的基准,由高分辨率图像以及知识密集型、搜索驱动的问题组成。实验表明,SenseNova-MARS 在开源搜索和细粒度图像理解基准测试中实现了最先进的性能。具体来说,在面向搜索的基准上,SenseNova-MARS-8B 在 MMSearch 上得分 67.84,在 HR-MMSearch 上得分 41.64,超过了 Gemini-3-Flash 和 GPT-5 等专有模型。SenseNova-MARS 提供了有效且稳健的工具使用能力,是迈向智能体式 VLM 的有希望的一步。为了促进该领域的进一步研究,我们将发布所有代码、模型和数据集。
SciEvalKit:科学通用智能的开源评估工具包
- 标题: SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
- 作者: Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
- 日期: 2025-12-26
- ArXiv主页 : https://arxiv.org/abs/2512.22334
- 论文链接 : https://arxiv.org/pdf/2512.22334
- 项目链接 : https://opencompass.org.cn/Intern-Discovery-Eval/rank
- gitHub仓库 : https://github.com/InternScience/SciEvalKit
英文摘要
We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
中文摘要
我们推出了 SciEvalKit,这是一个统一的基准测试工具包,旨在评估跨广泛科学学科和任务能力的科学 AI 模型。与通用评估平台不同,SciEvalKit专注于科学智能的核心能力,包括科学多模态感知、科学多模态推理、科学多模态理解、科学符号推理、科学代码生成、科学假设生成和科学知识理解。它支持六个主要科学领域,从物理和化学到天文学和材料科学。SciEvalKit 构建了专家级科学基准的基础,这些基准来自现实世界的特定领域数据集,确保任务反映真实的科学挑战。该工具包具有灵活、可扩展的评估管道,可以跨模型和数据集进行批量评估,支持自定义模型和数据集集成,并提供透明、可重复和可比较的结果。通过桥接基于能力的评估和学科多样性,SciEvalKit 提供了标准化但可定制的基础设施,以对下一代科学基础模型和智能代理进行基准测试。该工具包是开源的并积极维护,以促进社区驱动的 AI4Science 开发和进步。