The Chinese text was translated with googletrans; where the translation is wrong, the English text takes precedence.
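Contents
- Agentic Reasoning for Large Language Models
- Your Group-Relative Advantage Is Biased
- EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
- LLM-in-Sandbox Elicits General Agentic Intelligence
- Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
- Qwen3-TTS Technical Report
- HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
- The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
- Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey
- RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
- Toward Efficient Agents: Memory, Tool learning, and Planning
- Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
- Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model
- BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
- Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance
- MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
- OmniTransfer: All-in-One Framework for Spatio-Temporal Video Transfer
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
- Think3D: Thinking with Space for Spatial Reasoning
- The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents
- Learning to Discover at Test Time
- Rethinking Video Generation Models for the Real World
- SAMTok: Representing Any Mask with Two Words
- Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
- Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
- GutenOCR: A Grounded Vision-Language Front-End for Documents
- FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts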
Agentic Reasoning for Large Language Models
- Title: Agentic Reasoning for Large Language Models
- Authors: Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, Jingrui He
- Date: 2026-01-18
- arXiv page: https://arxiv.org/abs/2601.12538
- GitHub repository: https://github.com/weitianxin/Awesome-Agentic-Reasoning
English abstract
Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, they struggle in open-ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning, which establishes core single-agent capabilities including planning, tool use, and search in stable environments; self-evolving agentic reasoning, which studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi-agent reasoning, which extends intelligence to collaborative settings involving coordination, knowledge sharing, and shared goals. Across these layers, we distinguish in-context reasoning, which scales test-time interaction through structured orchestration, from post-training reasoning, which optimizes behaviors via reinforcement learning and supervised fine-tuning. We further review representative agentic reasoning frameworks across real-world applications and benchmarks, including science, robotics, healthcare, autonomous research, and mathematics. This survey synthesizes agentic reasoning methods into a unified roadmap bridging thought and action, and outlines open challenges and future directions, including personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance for real-world deployment.
中文摘要
推理(reasoning)是推断、问题求解和决策背后的基础认知过程。虽然大型语言模型 (LLM) 在封闭世界环境中表现出强大的推理能力,但它们在开放式和动态环境中却举步维艰。代理推理将 LLM 重新定义为通过持续交互进行规划、行动和学习的自主代理,标志着范式的转变。在这篇综述中,我们沿着三个互补的维度组织代理推理。首先,我们通过三个层次来描述环境动态:基础代理推理,建立核心的单代理能力,包括稳定环境中的规划、工具使用和搜索;自我进化的代理推理,研究代理如何通过反馈、记忆和适应来完善这些能力;集体多代理推理,将智能扩展到涉及协调、知识共享和共同目标的协作环境。在这些层次中,我们区分了上下文推理和训练后推理,前者通过结构化编排来扩展测试时交互,后者通过强化学习和监督微调来优化行为。我们进一步回顾了现实世界应用和基准中具有代表性的代理推理框架,涵盖科学、机器人、医疗保健、自主研究和数学。这篇综述将代理推理方法综合成一个连接思想和行动的统一路线图,并概述了开放的挑战和未来的方向,包括个性化、长期交互、世界建模、可扩展的多代理训练和现实世界部署的治理。
Your Group-Relative Advantage Is Biased
- Title: Your Group-Relative Advantage Is Biased
- Authors: Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, Yaodong Yang, Jianxin Li, Yikun Ban
- Date: 2026-01-13
- arXiv page: https://arxiv.org/abs/2601.08521
- Paper link: https://arxiv.org/pdf/2601.08521
English abstract
Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
中文摘要
基于验证器奖励的强化学习 (RLVR) 已成为一种广泛使用的方法,用于在推理任务上对大型语言模型进行后训练,基于组的方法(例如 GRPO 及其变体)得到了广泛采用。这些方法依赖于群体相对优势估计来避免使用学习得到的评论家 (critic),但其理论特性仍然知之甚少。在这项工作中,我们揭示了基于群体的强化学习的一个基本问题:群体相对优势估计器相对于真实(预期)优势存在固有的偏差。我们给出了首个理论分析,表明它系统地低估了困难提示的优势,高估了简单提示的优势,导致探索和利用不平衡。为了解决这个问题,我们提出了历史感知自适应难度加权(HA-DW),这是一种自适应重新加权方案,可根据不断变化的难度锚点和训练动态调整优势估计。理论分析以及在五个数学推理基准上的实验都表明,当集成到 GRPO 及其变体中时,HA-DW 能够稳定地提升性能。我们的结果表明,纠正有偏差的优势估计对于稳健且高效的 RLVR 训练至关重要。
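To make the object under analysis concrete, here is a minimal sketch of a GRPO-style group-relative advantage: each sampled response's reward is standardized against the other responses to the same prompt. This is only the baseline estimator the abstract discusses, not HA-DW, whose weighting rule is not given in the abstract. The function name, the `eps` term, and the use of the population standard deviation are assumptions; real implementations differ on these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# A "hard" prompt: only 1 of 8 sampled responses is verified correct.
hard = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# An "easy" prompt: 7 of 8 sampled responses are verified correct.
easy = [1.0] * 7 + [0.0]

hard_adv = group_relative_advantages(hard)
easy_adv = group_relative_advantages(easy)
print([round(a, 3) for a in hard_adv])
print([round(a, 3) for a in easy_adv])
```

The estimate depends only on the handful of rewards sampled for that prompt, which is why its relationship to the true expected advantage is worth analyzing.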
EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
- Title: EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
- Authors: Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, Xipeng Qiu
- Date: 2026-01-22
- arXiv page: https://arxiv.org/abs/2601.15876
- Paper link: https://arxiv.org/pdf/2601.15876
- Project link: https://github.com/meituan/EvoCUA
- GitHub repository: https://github.com/meituan/EvoCUA
English abstract
The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intricate causal dynamics inherent in long-horizon computer tasks. In this work, we introduce EvoCUA, a native computer use agentic model. Unlike static imitation, EvoCUA integrates data generation and policy optimization into a self-sustaining evolutionary cycle. To mitigate data scarcity, we develop a verifiable synthesis engine that autonomously generates diverse tasks coupled with executable validators. To enable large-scale experience acquisition, we design a scalable infrastructure orchestrating tens of thousands of asynchronous sandbox rollouts. Building on these massive trajectories, we propose an iterative evolving learning strategy to efficiently internalize this experience. This mechanism dynamically regulates policy updates by identifying capability boundaries -- reinforcing successful routines while transforming failure trajectories into rich supervision through error analysis and self-correction. Empirical evaluations on the OSWorld benchmark demonstrate that EvoCUA achieves a success rate of 56.7%, establishing a new open-source state-of-the-art. Notably, EvoCUA significantly outperforms the previous best open-source model, OpenCUA-72B (45.0%), and surpasses leading closed-weights models such as UI-TARS-2 (53.1%). Crucially, our results underscore the generalizability of this approach: the evolving paradigm driven by learning from experience yields consistent performance gains across foundation models of varying scales, establishing a robust and scalable path for advancing native agent capabilities.
中文摘要
本地计算机使用代理(CUA)的开发代表了多模式人工智能的重大飞跃。然而,它们的潜力目前受到静态数据扩展的限制。现有的范式主要依赖于静态数据集的被动模仿,难以捕捉长期计算机任务中固有的复杂因果动态。在这项工作中,我们介绍了 EvoCUA,一种本机计算机使用代理模型。与静态模仿不同,EvoCUA 将数据生成和策略优化集成到一个自我维持的进化循环中。为了缓解数据稀缺性,我们开发了一个可验证的合成引擎,可以自动生成各种任务以及可执行的验证器。为了实现大规模经验获取,我们设计了一个可扩展的基础设施,可协调数以万计的异步沙箱部署。基于这些巨大的轨迹,我们提出了一种迭代发展的学习策略,以有效地内化这种经验。该机制通过识别能力边界来动态调节策略更新------强化成功的例程,同时通过错误分析和自我纠正将失败轨迹转化为丰富的监督。OSWorld 基准的实证评估表明,EvoCUA 的成功率达到 56.7%,建立了新的开源最先进水平。值得注意的是,EvoCUA 的性能显着优于之前最好的开源模型 OpenCUA-72B (45.0%),并超过了领先的封闭权重模型,如 UI-TARS-2 (53.1%)。至关重要的是,我们的结果强调了这种方法的普遍性:从经验中学习驱动的不断发展的范式在不同规模的基础模型中产生了一致的性能增益,为提升本地代理能力建立了一条强大且可扩展的路径。
LLM-in-Sandbox Elicits General Agentic Intelligence
- Title: LLM-in-Sandbox Elicits General Agentic Intelligence
- Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
- Date: 2026-01-22
- arXiv page: https://arxiv.org/abs/2601.16206
- Paper link: https://arxiv.org/pdf/2601.16206
- Project link: https://llm-in-sandbox.github.io
- GitHub repository: https://github.com/llm-in-sandbox/llm-in-sandbox
English abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
中文摘要
我们引入了 LLM-in-Sandbox,使 LLM 能够在代码沙箱(即虚拟计算机)中进行探索,以激发非代码领域的通用智能。我们首先证明,强大的 LLM 无需额外训练即可表现出利用代码沙箱执行非代码任务的泛化能力。例如,LLM 会自发访问外部资源以获取新知识,利用文件系统处理长上下文,并执行脚本以满足格式要求。我们进一步表明,这些代理能力可以通过 LLM-in-Sandbox 强化学习(LLM-in-Sandbox-RL)得到增强,该方法仅使用非代理数据来训练模型进行沙箱探索。实验表明,LLM-in-Sandbox 在免训练和后训练两种设置下都实现了涵盖数学、物理、化学、生物医学、长上下文理解和指令遵循的稳健泛化。最后,我们从计算和系统角度分析 LLM-in-Sandbox 的效率,并将其作为 Python 包开源以方便实际部署。
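As a rough illustration of what "exploring within a code sandbox" can look like at the lowest level, the sketch below runs a model-written snippet in a throwaway working directory and returns its output. It uses only the Python standard library and is not the API of the released LLM-in-Sandbox package. The helper name `run_in_sandbox` is hypothetical, and a temporary directory plus a timeout is far from real isolation; a production sandbox would use a container or VM.

```python
import subprocess
import sys
import tempfile

def run_in_sandbox(code, timeout_s=10):
    """Run a Python snippet in a temporary working directory and capture its output."""
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir,          # files the snippet writes land here and are deleted afterwards
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    return proc.returncode, proc.stdout, proc.stderr

# The snippet parks a long input in a file and reports only a summary,
# loosely mimicking "leverage the file system to handle long contexts".
snippet = (
    "from pathlib import Path\n"
    "Path('notes.txt').write_text('x' * 10000)\n"
    "print(len(Path('notes.txt').read_text()))\n"
)
rc, out, err = run_in_sandbox(snippet)
print(rc, out.strip())
```

An agent loop would feed `out` and `err` back to the model as the observation for its next step.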
Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
- Title: Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
- Authors: Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, Yicheng Feng, Zongqing Lu
- Date: 2026-01-19
- arXiv page: https://arxiv.org/abs/2601.12993
- Paper link: https://arxiv.org/pdf/2601.12993
- Project link: https://research.beingbeyond.com/being-h05
- GitHub repository: https://github.com/BeingBeyond/Being-H
English abstract
We introduce Being-H0.5, a foundational Vision-Language-Action (VLA) model designed for robust cross-embodiment generalization across diverse robotic platforms. While existing VLAs often struggle with morphological heterogeneity and data scarcity, we propose a human-centric learning paradigm that treats human interaction traces as a universal "mother tongue" for physical interaction. To support this, we present UniHand-2.0, the largest embodied pre-training recipe to date, comprising over 35,000 hours of multimodal data across 30 distinct robotic embodiments. Our approach introduces a Unified Action Space that maps heterogeneous robot controls into semantically aligned slots, enabling low-resource robots to bootstrap skills from human data and high-resource platforms. Built upon this human-centric foundation, we design a unified sequential modeling and multi-task pre-training paradigm to bridge human demonstrations and robotic execution. Architecturally, Being-H0.5 utilizes a Mixture-of-Transformers design featuring a novel Mixture-of-Flow (MoF) framework to decouple shared motor primitives from specialized embodiment-specific experts. Finally, to make cross-embodiment policies stable in the real world, we introduce Manifold-Preserving Gating for robustness under sensory shift and Universal Async Chunking to universalize chunked control across embodiments with different latency and control profiles. We empirically demonstrate that Being-H0.5 achieves state-of-the-art results on simulated benchmarks, such as LIBERO (98.9%) and RoboCasa (53.9%), while also exhibiting strong cross-embodiment capabilities on five robotic platforms.
中文摘要
我们推出 Being-H0.5,这是一种基础视觉-语言-动作 (VLA) 模型,专为跨不同机器人平台的稳健跨实体泛化而设计。虽然现有的 VLA 经常面临形态异质性和数据稀缺的问题,但我们提出了一种以人为中心的学习范式,将人类交互痕迹视为物理交互的通用"母语"。为了支持这一点,我们推出了 UniHand-2.0,这是迄今为止最大的具体预训练方案,包含 30 个不同机器人实施例的超过 35,000 小时的多模式数据。我们的方法引入了统一动作空间,将异构机器人控制映射到语义对齐的插槽中,使低资源机器人能够从人类数据和高资源平台中引导技能。建立在以人为本的基础上,我们设计了统一的顺序建模和多任务预训练范例,以连接人类演示和机器人执行。在架构上,Being-H0.5 采用混合变压器设计,采用新颖的混合流 (MoF) 框架,将共享电机原语与特定实施例的专家解耦。最后,为了使跨实施例策略在现实世界中稳定,我们引入了流形保留门控(Manifold-Preserving Gating),以实现感知转移下的鲁棒性,并引入通用异步分块(Universal Async Chunking),以在具有不同延迟和控制配置文件的实施例之间实现分块控制的通用化。我们凭经验证明,Being-H0.5 在模拟基准上取得了最先进的结果,例如 LIBERO (98.9%) 和 RoboCasa (53.9%),同时还在五个机器人平台上展示了强大的跨实体能力。
Qwen3-TTS Technical Report
- Title: Qwen3-TTS Technical Report
- Authors: Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
- Date: 2026-01-22
- arXiv page: https://arxiv.org/abs/2601.15621
- GitHub repository: https://github.com/QwenLM/Qwen3-TTS
English abstract
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
中文摘要
在本报告中,我们介绍了 Qwen3-TTS 系列,这是一系列先进的多语言、可控、稳健且支持流式输出的文本转语音模型。Qwen3-TTS 支持最先进的 3 秒语音克隆和基于描述的控制,允许创建全新的语音并对输出语音进行细粒度操作。Qwen3-TTS 经过超过 500 万小时跨越 10 种语言的语音数据的训练,采用双轨 LM 架构进行实时合成,并配有两个语音标记器:1) Qwen-TTS-Tokenizer-25Hz 是一种强调语义内容的单码本编解码器,可与 Qwen-Audio 无缝集成,并通过分块 DiT 实现流式波形重建。2) Qwen-TTS-Tokenizer-12Hz 实现了极端比特率降低和超低延迟流传输,通过其 12.5 Hz、16 层多码本设计和轻量级因果 ConvNet 实现即时首包输出 (97 ms)。大量实验表明,在不同的客观和主观基准测试中(例如,TTS 多语言测试集、InstructTTSEval 和我们的长语音测试集)都具有最先进的性能。为了促进社区研究和开发,我们在 Apache 2.0 许可证下发布了分词器和模型。
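The figures quoted for Qwen-TTS-Tokenizer-12Hz can be turned into a quick back-of-the-envelope calculation. The sketch below assumes that "16-layer multi-codebook" means 16 discrete codes per frame; the codebook size is not stated in the abstract, so the value used for the bitrate is purely hypothetical.

```python
import math

FRAME_RATE_HZ = 12.5  # frame rate quoted in the abstract
NUM_CODEBOOKS = 16    # assumption: "16-layer" means 16 codes per frame

frame_ms = 1000.0 / FRAME_RATE_HZ                    # duration covered by one frame
codes_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS     # discrete codes per second of audio

def bitrate_bps(codebook_size):
    """Bits per second if every code indexes a codebook of the given size."""
    return codes_per_second * math.log2(codebook_size)

HYPOTHETICAL_CODEBOOK_SIZE = 2048  # illustrative only; not stated in the abstract

print(frame_ms)                                  # 80.0
print(codes_per_second)                          # 200.0
print(bitrate_bps(HYPOTHETICAL_CODEBOOK_SIZE))   # 2200.0
```

Under these assumptions one frame spans 80 ms, so the reported 97 ms first-packet latency is on the order of a single frame plus decoding time.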
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
- Title: HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
- Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu
- Date: 2026-01-21
- arXiv page: https://arxiv.org/abs/2601.14724
- Paper link: https://arxiv.org/pdf/2601.14724
- Project link: https://hermes-streaming.github.io/
- GitHub repository: https://github.com/haowei-freesky/HERMES
English abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10× faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
中文摘要
多模态大语言模型 (MLLM) 的最新进展证明了离线视频理解方面的显着改进。然而,将这些功能扩展到流视频输入仍然具有挑战性,因为现有模型很难同时保持稳定的理解性能、实时响应和低 GPU 内存开销。为了应对这一挑战,我们提出了 HERMES,这是一种新颖的免训练架构,用于实时准确地理解视频流。基于机械注意力研究,我们将 KV 缓存概念化为一个分层内存框架,它封装了多个粒度的视频信息。在推理过程中,HERMES 重用紧凑的 KV 缓存,从而在资源限制下实现高效的流式理解。值得注意的是,HERMES 在用户查询到达时不需要辅助计算,从而保证了连续视频流交互的实时响应,与之前的 SOTA 相比,TTFT 速度提高了 10 倍。即使与统一采样相比,视频标记减少了高达 68%,HERMES 在所有基准测试中仍能实现优异或相当的准确度,在流数据集上提升高达 11.4%。
The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
- Title: The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
- Authors: Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
- Date: 2026-01-21
- arXiv page: https://arxiv.org/abs/2601.15165
- Paper link: https://arxiv.org/pdf/2601.15165
- Project link: https://nzl-thu.github.io/the-flexibility-trap
- GitHub repository: https://github.com/LeapLabTHU/JustGRPO
English abstract
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
中文摘要
扩散大型语言模型 (dLLM) 打破了传统 LLM 严格的从左到右的约束,能够以任意顺序生成令牌。直观上,这种灵活性意味着解决方案空间严格超越固定的自回归轨迹,从理论上讲,为数学和编码等一般任务释放了卓越的推理潜力。因此,许多工作都利用强化学习(RL)来激发 dLLM 的推理能力。在本文中,我们揭示了一个反直觉的现实:当前形式的任意顺序生成缩小而不是扩大了 dLLM 的推理边界。我们发现 dLLM 倾向于利用这种顺序灵活性来绕过对于探索至关重要的高不确定性标记,从而导致解决方案空间过早崩溃。这一观察结果挑战了 dLLM 现有 RL 方法的前提,其中相当大的复杂性,例如处理组合轨迹和棘手的可能性,通常致力于保持这种灵活性。我们证明,有意放弃任意顺序并应用标准组相对策略优化(GRPO)可以更好地引发有效推理。我们的方法 JustGRPO 非常简约,但却非常有效(例如,在 GSM8K 上的准确率达到 89.1%),同时完全保留了 dLLM 的并行解码能力。项目页面:https://nzl-thu.github.io/the-flexibility-trap
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
- Title: ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
- Authors: Jie Yang, Honglin Guo, Li Ji, Jiazheng Zhou, Rui Zheng, Zhikai Lei, Shuo Zhang, Zhiheng Xi, Shichun Liu, Yuxin Wang, Bo Wang, Yining Zheng, Tao Gui, Xipeng Qiu
- Date: 2026-01-16
- arXiv page: https://arxiv.org/abs/2601.11077
- Paper link: https://arxiv.org/pdf/2601.11077
- Project link: https://dawning-road.github.io/blog/abc-bench
- GitHub repository: https://github.com/OpenMOSS/ABC-Bench
English abstract
The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires the agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass the external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.
中文摘要
大型语言模型 (LLM) 发展为自主代理,将人工智能编码的范围从本地化代码生成扩展到复杂的、存储库级的、执行驱动的问题解决。然而,当前的基准测试主要评估静态上下文中的代码逻辑,忽略了现实工程的动态、全流程需求,特别是在后端开发中,需要严格的环境配置和服务部署。为了解决这一差距,我们引入了 ABC-Bench,这是一个专门设计用于评估现实的可执行工作流程中的代理后端编码的基准。使用可扩展的自动化管道,我们从开源存储库中策划了涵盖 8 种语言和 19 个框架的 224 项实际任务。与之前的评估不同,ABC-Bench 要求代理管理从存储库探索到实例化容器化服务的整个开发生命周期,并通过外部端到端 API 测试。我们的广泛评估表明,即使是最先进的模型也难以在这些整体任务上提供可靠的性能,这凸显了当前模型功能与实际后端工程需求之间的巨大差距。我们的代码可在 https://github.com/OpenMOSS/ABC-Bench 获取。
Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey
- Title: Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey
- Authors: Caihua Li, Lianghong Guo, Yanlin Wang, Daya Guo, Wei Tao, Zhenyu Shan, Mingwei Liu, Jiachi Chen, Haoyu Song, Duyu Tang, Hongyu Zhang, Zibin Zheng
- Date: 2026-01-15
- arXiv page: https://arxiv.org/abs/2601.11655
- Paper link: https://arxiv.org/pdf/2601.11655
- Project link: https://deepsoftwareanalytics.github.io/Awesome-Issue-Resolution/
- GitHub repository: https://github.com/DeepSoftwareAnalytics/Awesome-Issue-Resolution
English abstract
Issue resolution, a complex Software Engineering (SWE) task integral to real-world development, has emerged as a compelling challenge for artificial intelligence. The establishment of benchmarks like SWE-bench revealed this task as profoundly difficult for large language models, thereby significantly accelerating the evolution of autonomous coding agents. This paper presents a systematic survey of this emerging domain. We begin by examining data construction pipelines, covering automated collection and synthesis approaches. We then provide a comprehensive analysis of methodologies, spanning training-free frameworks with their modular components to training-based techniques, including supervised fine-tuning and reinforcement learning. Subsequently, we discuss critical analyses of data quality and agent behavior, alongside practical applications. Finally, we identify key challenges and outline promising directions for future research. An open-source repository is maintained at https://github.com/DeepSoftwareAnalytics/Awesome-Issue-Resolution to serve as a dynamic resource in this field.
中文摘要
问题解决是现实世界开发中不可或缺的一项复杂的软件工程 (SWE) 任务,已成为人工智能面临的严峻挑战。像 SWE-bench 这样的基准测试的建立表明,这项任务对于大型语言模型来说非常困难,从而显着加速了自主编码代理的发展。本文对这一新兴领域进行了系统调查。我们首先检查数据构建管道,涵盖自动收集和合成方法。然后,我们提供了对方法的全面分析,涵盖免培训框架及其模块化组件以及基于培训的技术,包括监督微调和强化学习。随后,我们讨论数据质量和代理行为的关键分析以及实际应用。最后,我们确定了主要挑战并概述了未来研究的有希望的方向。https://github.com/DeepSoftwareAnalytics/Awesome-Issue-Resolution 维护着一个开源存储库,作为该领域的动态资源。
RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
- Title: RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
- Authors: Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen
- Date: 2026-01-13
- arXiv page: https://arxiv.org/abs/2601.08430
- Paper link: https://arxiv.org/pdf/2601.08430
- Project link: https://huggingface.co/datasets/sojuL/RubricHub_v1
- GitHub repository: https://github.com/teqkilla/RubricHub
English abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing the subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (~110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.
中文摘要
具有可验证奖励的强化学习(RLVR)推动了数学等推理密集型领域的实质性进展。然而,由于缺乏标准答案 (ground truth),优化开放式生成仍然具有挑战性。虽然基于评分标准 (rubric) 的评估提供了结构化的验证代理,但现有方法存在可扩展性瓶颈和标准粗糙的问题,导致监督天花板效应。为了解决这个问题,我们提出了一个自动化的从粗到细的评分标准 (Rubric) 生成框架。通过协同原则引导的合成、多模型聚合和难度演化,我们的方法产生了全面且高度区分的标准,能够捕捉微妙的细微差别。基于这个框架,我们引入了 RubricHub,一个大规模(约 110k)和多领域的数据集。我们通过由基于 Rubric 的拒绝采样微调 (RuFT) 和强化学习 (RuRL) 组成的两阶段训练后管道验证其实用性。实验结果表明,RubricHub 实现了显着的性能提升:我们训练后的 Qwen3-14B 在 HealthBench (69.3) 上实现了最先进 (SOTA) 的结果,超越了 GPT-5 等专有前沿模型。代码和数据将很快发布。
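The idea of rubric-based rejection sampling can be sketched in a few lines: score every candidate response against weighted criteria and keep only those above a threshold for fine-tuning. Everything below is an illustrative assumption, including the `Criterion` structure, the keyword checks standing in for an LLM grader, and the threshold. It is not the RuFT pipeline or the RubricHub data format.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an LLM-based grader

def rubric_score(response: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria that the response satisfies."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total

def rejection_sample(candidates: List[str], rubric: List[Criterion], threshold: float) -> List[str]:
    """Keep only candidates whose rubric score reaches the threshold."""
    return [r for r in candidates if rubric_score(r, rubric) >= threshold]

rubric = [
    Criterion("advises seeing a doctor", 2.0, lambda r: "doctor" in r.lower()),
    Criterion("includes a dosage caveat", 1.0, lambda r: "dose" in r.lower()),
    Criterion("stays under 200 characters", 1.0, lambda r: len(r) < 200),
]
candidates = [
    "Take some rest.",
    "See a doctor if symptoms persist, and do not exceed the labelled dose.",
]
kept = rejection_sample(candidates, rubric, threshold=0.75)
print(kept)
```

Finer-grained criteria spread the scores of otherwise similar responses further apart, which is the sense in which a rubric is "discriminative".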
Toward Efficient Agents: Memory, Tool learning, and Planning
- Title: Toward Efficient Agents: Memory, Tool learning, and Planning
- Authors: Xiaofang Yang, Lijun Li, Heng Zhou, Tong Zhu, Xiaoye Qu, Yuchen Fan, Qianshan Wei, Rui Ye, Li Kang, Yiran Qin, Zhiqiang Kou, Daizong Liu, Qi Li, Ning Ding, Siheng Chen, Jing Shao
- Date: 2026-01-20
- arXiv page: https://arxiv.org/abs/2601.14192
- Paper link: https://arxiv.org/pdf/2601.14192
- Project link: https://efficient-agents.github.io/
- GitHub repository: https://github.com/yxf203/Awesome-Efficient-Agents
English abstract
Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency from three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, steps, etc. Aimed at conducting comprehensive research addressing the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles including but not limited to bounding context via compression and management, designing reinforcement learning rewards to minimize tool invocation, and employing controlled search mechanisms to enhance efficiency, which we discuss in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss the key challenges and future directions, with the goal of providing promising insights.
中文摘要
近年来,人们对将大型语言模型扩展到代理系统越来越感兴趣。虽然代理的有效性不断提高,但对于实际部署至关重要的效率却常常被忽视。因此,本文研究了代理的三个核心组成部分的效率:内存、工具学习和规划,并考虑了延迟、令牌、步骤等成本。为了进行全面的研究来解决代理系统本身的效率,我们回顾了一系列在实现上有所不同但经常收敛于共享高级原则的最新方法,包括但不限于通过压缩和管理来限制上下文,设计强化学习奖励以最大限度地减少工具调用,以及采用受控搜索机制来提高效率,我们将对此进行详细讨论。因此,我们用两种互补的方式来描述效率:在固定成本预算下比较有效性,以及在可比较的有效性水平上比较成本。这种权衡也可以通过有效性和成本之间的帕累托边界来看待。从这个角度来看,我们还通过总结这些组件的评估协议并整合基准和方法研究中常见报告的效率指标来检查以效率为导向的基准。此外,我们还讨论了主要挑战和未来方向,旨在提供有前景的见解。
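The two ways of characterizing efficiency in the abstract, and the Pareto view that ties them together, can be illustrated with a small sketch. The agent names, token costs, and success rates below are made up for illustration; only the comparison logic matters.

```python
def pareto_frontier(points):
    """Return names of points not dominated by any other (lower cost and higher effectiveness are better)."""
    frontier = []
    for name, cost, eff in points:
        dominated = any(
            (c2 <= cost and e2 >= eff) and (c2 < cost or e2 > eff)
            for _, c2, e2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

def best_under_budget(points, budget):
    """Compare effectiveness under a fixed cost budget."""
    affordable = [p for p in points if p[1] <= budget]
    return max(affordable, key=lambda p: p[2])[0]

def cheapest_at_level(points, min_eff):
    """Compare cost at a comparable level of effectiveness."""
    good_enough = [p for p in points if p[2] >= min_eff]
    return min(good_enough, key=lambda p: p[1])[0]

# (name, tokens per task, success rate): hypothetical numbers
agents = [
    ("A", 1000, 0.60),
    ("B", 2000, 0.72),
    ("C", 2500, 0.70),
    ("D", 4000, 0.80),
    ("E", 1000, 0.55),
]
print(pareto_frontier(agents))          # ['A', 'B', 'D']
print(best_under_budget(agents, 2500))  # B
print(cheapest_at_level(agents, 0.70))  # B
```

Agents C and E drop out because another agent is at least as cheap and at least as effective, and strictly better on one of the two.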
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
- Title: Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
- Authors: Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie
- Date: 2026-01-22
- arXiv page: https://arxiv.org/abs/2601.16208
- Paper link: https://arxiv.org/pdf/2601.16208
- Project link: https://rae-dit.github.io/scale-rae/
- GitHub repository: https://github.com/ZitengWangNYU/Scale-RAE
English abstract
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
中文摘要
通过在高维语义潜在空间中进行训练,表示自动编码器 (RAE) 在 ImageNet 上的扩散建模中显示出明显的优势。在这项工作中,我们研究了该框架是否可以扩展到大规模、自由格式的文本到图像(T2I)生成。我们首先通过对网络、合成和文本渲染数据进行训练,将冻结表示编码器 (SigLIP-2) 上的 RAE 解码器扩展到 ImageNet 之外,发现虽然规模提高了总体保真度,但有针对性的数据组合对于文本等特定领域至关重要。然后,我们对最初为 ImageNet 提出的 RAE 设计选择进行严格的压力测试。我们的分析表明,扩展简化了框架:虽然与维度相关的噪声调度仍然至关重要,但诸如宽扩散头和噪声增强解码之类的架构复杂性在规模上提供的优势可以忽略不计。在这个简化的框架上构建,我们对扩散变压器从 0.5B 到 9.8B 参数范围内的 RAE 与最先进的 FLUX VAE 进行了受控比较。在所有模型规模的预训练过程中,RAE 的表现始终优于 VAE。此外,在对高质量数据集进行微调期间,基于 VAE 的模型在 64 个 epoch 后出现灾难性的过拟合,而 RAE 模型在 256 个 epoch 中保持稳定,并始终获得更好的性能。在所有实验中,基于 RAE 的扩散模型表现出更快的收敛速度和更好的生成质量,使 RAE 成为比 VAE 更简单、更强大的基础,可用于大规模 T2I 生成。此外,由于视觉理解和生成都可以在共享表示空间中运行,因此多模态模型可以直接对生成的潜在变量进行推理,为统一模型开辟了新的可能性。
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model
- Title: Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model
- Authors: Chenghao Fan, Wen Heng, Bo Li, Sichen Liu, Yuxuan Song, Jing Su, Xiaoye Qu, Kai Shen, Wei Wei
- Date: 2026-01-22
- arXiv page: https://arxiv.org/abs/2601.15892
- Paper link: https://arxiv.org/pdf/2601.15892
- Project link: https://bytedance-seed.github.io/Stable-DiffCoder/
- GitHub repository: https://github.com/ByteDance-Seed/Stable-DiffCoder
English abstract
Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Moreover, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.
中文摘要
与自回归 (AR) 模型相比,基于扩散的语言模型 (DLLM) 提供非顺序的分块生成和更充分的数据复用,但在可比预算下,现有代码 DLLM 仍然落后于强大的 AR 基线。我们在一项对照研究中重新审视了这一设置,并引入了 Stable-DiffCoder,这是一种复用 Seed-Coder 架构、数据和训练流程的块扩散代码模型。为了实现高效的知识学习和稳定的训练,我们加入了块扩散持续预训练 (CPT) 阶段,并通过定制的预热和分块裁剪的噪声调度对其进行增强。在相同的数据和架构下,Stable-DiffCoder 在一系列广泛的代码基准测试中总体优于对应的 AR 模型。此外,仅依靠 CPT 和监督微调阶段,Stable-DiffCoder 就取得了比众多约 8B 规模的 AR 模型和 DLLM 更强的性能,这表明基于扩散的训练可以使代码建模质量超越单纯的 AR 训练。此外,基于扩散的任意顺序建模改进了面向编辑和推理的结构化代码建模,并通过数据增强使低资源编程语言受益。
BayesianVLA:通过潜在动作查询对视觉语言动作模型进行贝叶斯分解
- 标题: BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
- 作者: Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen
- 日期: 2026-01-21
- ArXiv主页 : https://arxiv.org/abs/2601.15197
- 论文链接 : https://arxiv.org/pdf/2601.15197
- 项目链接 : https://github.com/ZGC-EmbodyAI/BayesianVLA
- gitHub仓库 : https://github.com/ZGC-EmbodyAI/LangForce
英文摘要
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
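As a reading aid, the conditional PMI named in the abstract can be written in terms of the two branches it describes. This is the standard definition of conditional pointwise mutual information, with $a$ the action, $v$ the visual observation, and $\ell$ the language instruction; how the paper weights, estimates, or regularizes this term in its actual training loss is not stated in the abstract and is not assumed here.

```latex
\mathrm{PMI}(a;\, \ell \mid v)
  = \log \frac{\pi(a \mid v, \ell)}{p(a \mid v)}
  = \log \pi(a \mid v, \ell) - \log p(a \mid v)
```

The quantity is large when the instruction makes the action much more likely than vision alone does, and it is zero when the posterior equals the vision-only prior, which is the "vision shortcut" case the abstract says the objective penalizes.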
中文摘要
视觉-语言-动作 (VLA) 模型在机器人操作方面显示出了前景,但通常很难泛化到新指令或复杂的多任务场景。我们发现了当前训练范式中的一个关键缺陷:目标驱动的数据收集会产生数据集偏差。在这样的数据集中,仅凭视觉观察就可以高度准确地预测语言指令,从而导致指令和动作之间的条件互信息消失,我们将这种现象称为信息崩溃 (Information Collapse)。因此,模型退化为仅依赖视觉的策略,忽略语言约束,并在分布外 (OOD) 设置中失败。为了解决这个问题,我们提出了 BayesianVLA,这是一种通过贝叶斯分解强制实现指令遵循的新颖框架。通过引入可学习的潜在动作查询 (Latent Action Queries),我们构建了一个双分支架构,同时估计仅视觉先验 $p(a \mid v)$ 和语言条件后验 $\pi(a \mid v, \ell)$。然后,我们优化策略以最大化动作和指令之间的条件逐点互信息 (PMI)。这个目标有效地惩罚了视觉捷径,并奖励能够明确解释语言命令的动作。无需新数据,BayesianVLA 就显著提高了泛化能力。在 SimplerEnv 和 RoboCasa 上进行的大量实验展示了可观的提升,包括在具有挑战性的 OOD SimplerEnv 基准上提高了 11.3%,验证了我们的方法将语言稳健地落实到动作中的能力。
Paper2Rebuttal:透明作者回复协助的多代理框架
- 标题: Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance
- 作者: Qianli Ma, Chang Guo, Zhiheng Tian, Siyu Wang, Jipeng Xiao, Yuanhao Yue, Zhipeng Zhang
- 日期: 2026-01-20
- ArXiv主页 : https://arxiv.org/abs/2601.14171
- 论文链接 : https://arxiv.org/pdf/2601.14171
- 项目链接 : https://mqleet.github.io/Paper2Rebuttal_ProjectPage/
- gitHub仓库 : https://github.com/AutoLab-SAI-SJTU/Paper2Rebuttal
英文摘要
Writing effective rebuttals is a high-stakes task that demands more than linguistic fluency, as it requires precise alignment between reviewer intent and manuscript details. Current solutions typically treat this as a direct-to-text generation problem, suffering from hallucination, overlooked critiques, and a lack of verifiable grounding. To address these limitations, we introduce RebuttalAgent, the first multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. Our system decomposes complex feedback into atomic concerns and dynamically constructs hybrid contexts by synthesizing compressed summaries with high-fidelity text while integrating an autonomous and on-demand external search module to resolve concerns requiring outside literature. By generating an inspectable response plan before drafting, RebuttalAgent ensures that every argument is explicitly anchored in internal or external evidence. We validate our approach on the proposed RebuttalBench and demonstrate that our pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer review process. Code will be released.
中文摘要
撰写有效的反驳是一项高风险的任务,需要的不仅仅是语言的流畅性,因为它需要审稿人的意图和稿件细节之间的精确一致。当前的解决方案通常将其视为直接文本生成问题,遭受幻觉、忽视批评和缺乏可验证基础的困扰。为了解决这些限制,我们引入了 RebuttalAgent,这是第一个多代理框架,它将反驳生成重新构建为以证据为中心的规划任务。我们的系统将复杂的反馈分解为原子关注点,并通过将压缩摘要与高保真文本合成来动态构建混合上下文,同时集成自主且按需的外部搜索模块来解决需要外部文献的关注点。通过在起草之前生成可检查的应对计划,RebuttalAgent 可确保每个论点都明确锚定在内部或外部证据中。我们在拟议的 RebuttalBench 上验证了我们的方法,并证明我们的管道在覆盖范围、忠诚度和战略一致性方面优于强大的基线,为同行评审过程提供透明且可控的助手。代码将被发布。
MMDeepResearch-Bench:多模式深度研究代理的基准
- 标题: MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
- 作者: Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, Chaofan Tao, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang
- 日期: 2026-01-18
- ArXiv主页 : https://arxiv.org/abs/2601.12346
- 论文链接 : https://arxiv.org/pdf/2601.12346
- 项目链接 : https://mmdeepresearch-bench.github.io/
- gitHub仓库 : https://github.com/AIoT-MLSys-Lab/MMDeepResearch-Bench
英文摘要
Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
中文摘要
深度研究代理 (DRA) 通过多步骤搜索和合成生成引用丰富的报告,但现有基准主要针对纯文本设置或简短的多模式 QA,缺少端到端多模式证据使用。我们引入了 MMDeepResearch-Bench (MMDR-Bench),这是跨 21 个领域的 140 个专家制作的任务的基准,其中每个任务提供一个图像文本包来评估多模态理解和基于引文的报告生成。与之前的设置相比,MMDR-Bench 强调使用明确证据的报告式综合,其中模型必须将视觉工件与来源声明联系起来,并保持叙述、引文和视觉参考之间的一致性。我们进一步提出了一个统一的、可解释的评估管道:用于报告质量的 Formula-LLM 自适应评估 (FLAE)、用于基于引文的证据对齐的可信检索对齐引文评估 (TRACE) 和用于文本视觉完整性的多模态支持对齐完整性检查 (MOSAIC),每个管道都会产生细粒度的信号,支持超出单一总体评分的错误诊断。跨 25 个最先进模型的实验揭示了生成质量、引用规则和多模态基础之间的系统权衡,强调仅靠强大的散文并不能保证忠实的证据使用,而多模态完整性仍然是深度研究代理的关键瓶颈。
OmniTransfer:时空视频传输的一体化框架
- 标题: OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer
- 作者: Pengze Zhang, Yanze Wu, Mengtian Li, Xu Bai, Songtao Zhao, Fulong Ye, Chong Mou, Xinghui Li, Zhuowei Chen, Qian He, Mingyuan Gao
- 日期: 2026-01-20
- ArXiv主页 : https://arxiv.org/abs/2601.14250
- 论文链接 : https://arxiv.org/pdf/2601.14250
- 项目链接 : https://pangzecheung.github.io/OmniTransfer/
- gitHub仓库 : https://github.com/PangzeCheung/OmniTransfer
英文摘要
Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.
中文摘要
视频比图像或文本传达更丰富的信息,捕捉空间和时间动态。然而,大多数现有的视频定制方法依赖于参考图像或特定于任务的时间先验,未能充分利用视频固有的丰富的时空信息,从而限制了视频生成的灵活性和泛化性。为了解决这些限制,我们提出了 OmniTransfer,一个用于时空视频传输的统一框架。它利用跨帧的多视图信息来增强外观一致性,并利用时间线索来实现细粒度的时间控制。为了统一各种视频传输任务,OmniTransfer 采用了三个关键设计: 任务感知位置偏差,自适应地利用参考视频信息来提高时间对齐或外观一致性;参考解耦因果学习将参考分支和目标分支分开,以实现精确的参考传输,同时提高效率;任务自适应多模态对齐使用多模态语义指导来动态区分和处理不同的任务。大量实验表明,OmniTransfer 在外观(ID 和风格)和时间传输(相机运动和视频效果)方面优于现有方法,同时在不使用姿势的运动传输中匹配姿势引导方法,为灵活、高保真视频生成建立了新范例。
定位、引导和改进:大型语言模型中可操作机制可解释性的实用调查
- 标题: Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
- 作者: Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
- 日期: 2026-01-20
- ArXiv主页 : https://arxiv.org/abs/2601.14004
- gitHub仓库 : https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey
英文摘要
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.
中文摘要
机制可解释性 (MI) 已成为揭开大型语言模型 (LLM) 不透明决策过程神秘面纱的重要方法。然而,现有的综述主要将 MI 视为一门观察性科学,总结分析见解,但缺乏可操作干预的系统框架。为了弥补这一差距,我们提出了一项围绕"定位、引导和改进"流程构建的实用综述。我们根据特定的可解释对象对定位(诊断)和引导(干预)方法进行正式分类,以建立严格的干预协议。此外,我们还展示了该框架如何在对齐、能力和效率方面带来切实改进,从而有效地将 MI 落实为一种可操作的模型优化方法。这项工作的精选论文列表可在 https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey 获取。
Think3D:用空间思考进行空间推理
- 标题: Think3D: Thinking with Space for Spatial Reasoning
- 作者: Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu
- 日期: 2026-01-19
- ArXiv主页 : https://arxiv.org/abs/2601.13029
- gitHub仓库 : https://github.com/zhangzaibin/spagent
英文摘要
Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at https://github.com/zhangzaibin/spagent.
中文摘要
对物理世界的理解和推理需要空间智能:解释二维感知之外的几何、透视和空间关系的能力。虽然最近的视觉大型模型 (VLM) 在视觉理解方面表现出色,但它们从根本上来说仍然是 2D 感知器,并且难以进行真正的 3D 推理。我们引入 Think3D,这是一个使 VLM 代理能够利用 3D 空间进行思考的框架。通过利用从图像或视频中恢复点云和相机姿势的 3D 重建模型,Think3D 允许代理通过基于相机的操作和自我/全局视图切换来主动操纵空间,将空间推理转变为交互式 3D 思维链过程。在无需额外训练的情况下,Think3D 显着提高了 GPT-4.1 和 Gemini 2.5 Pro 等高级模型的空间推理性能,在 BLINK Multi-view 和 MindCube 上平均增益 +7.8%,在 VSI-Bench 上平均增益 +4.7%。我们进一步表明,难以进行空间探索的较小模型可以从强化学习策略中受益匪浅,该策略使模型能够选择信息丰富的观点和操作。借助 RL,工具使用带来的收益从 +0.7% 增加到 +6.8%。我们的研究结果表明,免训练、工具增强的空间探索是多模态智能体中实现更灵活和类人 3D 推理的可行途径,从而建立了多模态智能的新维度。代码和权重发布于https://github.com/zhangzaibin/spagent。
毒苹果效应:通过人工智能代理的技术扩展对中介市场进行战略操纵
- 标题: The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents
- 作者: Eilam Shapira, Roi Reichart, Moshe Tennenholtz
- 日期: 2026-01-16
- ArXiv主页 : https://arxiv.org/abs/2601.11496
- 论文链接 : https://arxiv.org/pdf/2601.11496
英文摘要
The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the "Poisoned Apple" effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator's choice of market design in their favor. This strategic release improves the releaser's welfare at the expense of their opponent and the regulator's fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.
中文摘要
人工智能主体融入经济市场从根本上改变了战略互动的格局。我们研究了在三种典型博弈论环境中扩展可用技术集的经济影响:讨价还价(资源分配)、谈判(不对称信息贸易)和说服(战略信息传输)。我们发现,仅仅增加人工智能代表的选择就可以极大地改变均衡收益和监管结果,通常会激励监管机构主动开发和发布技术。相反,我们发现了一种被称为"毒苹果"效应的战略现象:代理人可能会发布一项新技术,但他们和他们的对手最终都不会使用该技术,只是为了操纵监管机构对市场设计的选择,使其有利于他们。这种战略性发布提高了发布者的福利,但牺牲了对手和监管者的公平目标。我们的研究结果表明,静态监管框架很容易受到技术扩展的操纵,因此需要动态的市场设计来适应不断变化的人工智能能力。
学习在测试时发现
- 标题: Learning to Discover at Test Time
- 作者: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
- 日期: 2026-01-22
- ArXiv主页 : https://arxiv.org/abs/2601.16175
- 论文链接 : https://arxiv.org/pdf/2601.16175
- 项目链接 : https://test-time-training.github.io/discover/
- gitHub仓库 : https://github.com/test-time-training/discover
英文摘要
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
中文摘要
我们如何利用人工智能为一个科学问题发现新的最优结果?先前的测试时扩展工作(例如 AlphaEvolve)通过提示冻结的 LLM 来执行搜索。我们在测试时执行强化学习,因此 LLM 可以继续训练,但此时使用的是针对该测试问题的经验。这种形式的持续学习非常特殊,因为它的目标是产生一个出色的解决方案,而不是平均意义上的许多好方案,并且是解决这个问题本身,而不是泛化到其他问题。因此,我们的学习目标和搜索子程序旨在优先考虑最有希望的解决方案。我们将此方法称为"测试时训练发现"(TTT-Discover)。沿用之前的工作,我们专注于具有连续奖励的问题。我们报告了所尝试的每个问题的结果,涉及数学、GPU 内核工程、算法设计和生物学。TTT-Discover 在几乎所有这些问题上都创造了新的最优结果:(i) Erdős 最小重叠问题和一个自相关不等式;(ii) GPUMode 内核竞赛(比现有技术最多快 2 倍);(iii) 过去的 AtCoder 算法竞赛;(iv) 单细胞分析中的去噪问题。我们的解决方案由专家或组织者审查。我们的所有结果都是通过开放模型 OpenAI gpt-oss-120b 实现的,并且可以使用我们公开的代码复现,而之前的最佳结果需要闭源的前沿模型。我们的测试时训练使用 Tinker(Thinking Machines 提供的 API)运行,每个问题的成本仅为几百美元。
重新思考现实世界的视频生成模型
- 标题: Rethinking Video Generation Model for the Embodied World
- 作者: Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou
- 日期: 2026-01-21
- ArXiv主页 : https://arxiv.org/abs/2601.15282
- 论文链接 : https://arxiv.org/pdf/2601.15282
- 项目链接 : https://dagroup-pku.github.io/ReVidgen.github.io/
- gitHub仓库 : https://github.com/DAGroup-PKU/ReVidgen
英文摘要
Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
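The abstract reports a Spearman correlation of 0.96 between benchmark scores and human evaluations. For readers unfamiliar with the statistic, the following is a minimal sketch of how such a rank correlation is computed. The score lists are hypothetical values invented for illustration, not data from the paper, and the function names are our own.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for five video models (illustration only).
benchmark_scores = [0.62, 0.48, 0.75, 0.31, 0.55]
human_scores = [3.9, 3.1, 4.6, 2.0, 3.0]
rho = spearman(benchmark_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because only the ordering matters, a value near 1 means the automatic metric ranks models almost exactly as human raters do, regardless of the scales the two scores use.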
中文摘要
视频生成模型显著推动了具身智能的发展,为生成能够捕捉物理世界中感知、推理和动作的多样化机器人数据开辟了新的可能性。然而,合成准确反映现实世界机器人交互的高质量视频仍然具有挑战性,而标准化基准的缺乏限制了公平比较和进步。为了解决这一差距,我们引入了一个全面的机器人基准 RBench,旨在评估跨五个任务领域和四种不同机器人形态 (embodiment) 的面向机器人的视频生成。它通过可复现的子指标评估任务级别的正确性和视觉保真度,包括结构一致性、物理合理性和动作完整性。对 25 个代表性模型的评估凸显了它们在生成物理上真实的机器人行为方面的重大缺陷。此外,该基准与人类评估的 Spearman 相关系数达到 0.96,验证了其有效性。虽然 RBench 提供了识别这些缺陷的必要视角,但要实现物理真实感,还需要超越评估,解决高质量训练数据严重短缺的问题。在这些见解的驱动下,我们引入了一个精细的四阶段数据流程,构建了 RoVid-X,这是最大的面向视频生成的开源机器人数据集,包含 400 万个带注释的视频片段,涵盖数千个任务,并附有全面的物理属性注释。总的来说,这个评估与数据协同的生态系统为视频模型的严格评估和可扩展训练奠定了坚实的基础,加速了具身人工智能向通用智能的发展。
SAMTok:用两个单词表示任何掩码
- 标题: SAMTok: Representing Any Mask with Two Words
- 作者: Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li
- 日期: 2026-01-22
- ArXiv主页 : https://arxiv.org/abs/2601.16093
- 论文链接 : https://arxiv.org/pdf/2601.16093
- 项目链接 : https://zhouyiks.github.io/projects/SAMTok/
- gitHub仓库 : https://github.com/bytedance/Sa2VA
英文摘要
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
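The abstract says SAMTok uses a residual vector quantizer to turn a mask into two tokens. The sketch below illustrates the general idea of residual vector quantization on a toy 2-D vector, assuming for illustration that each of the two tokens is the index chosen at one quantization stage. The codebooks, the function names, and the input vector are hypothetical; the real mask encoder, decoder, and codebook sizes are not described in the abstract and are not modeled here.

```python
def nearest(codebook, vector):
    """Index of the codebook entry with the smallest squared distance."""
    def dist(code):
        return sum((c - x) ** 2 for c, x in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes what the previous one missed."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen code so the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected code from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy codebooks (hypothetical): a coarse stage and a fine correction stage.
coarse = [[0.0, 0.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, -0.25]]

tokens = rvq_encode([1.2, 0.8], [coarse, fine])
approx = rvq_decode(tokens, [coarse, fine])
print(tokens, approx)
```

The design point is that two small codebooks compose: the second index refines the first, so a pair of discrete tokens can address many more reconstructions than either codebook alone.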
中文摘要
像素级能力对于构建交互式智能系统至关重要。然而,由于复杂的区域级编码器、专门的分割解码器和不兼容的训练目标,像素级多模态 LLM (MLLM) 仍然难以扩展。为了应对这些挑战,我们提出了 SAMTok,这是一种离散掩码标记器,可将任何区域掩码转换为两个特殊标记,并使用这些标记以高保真度重建掩码。通过将掩码视为新的语言标记,SAMTok 使基础 MLLM(例如 QwenVL 系列)能够通过标准的下一个标记预测和简单的强化学习来学习像素级能力,而无需修改架构或设计专门的损失。SAMTok 基于 SAM2 构建,使用掩码编码器和残差向量量化器在 2.09 亿个多样化掩码上进行训练,以生成离散、紧凑且信息丰富的标记。借助 500 万个 SAMTok 格式的掩码理解和生成数据样本,QwenVL-SAMTok 在区域描述、区域 VQA、定位对话 (grounded conversation)、指代分割、场景图解析和多轮交互式分割方面取得了最先进的或相当的结果。我们进一步引入了文本答案匹配奖励,可以实现掩码生成的高效强化学习,在 GRES 和 GCG 基准上带来大幅提升。我们的结果展示了一种可扩展且简单的范式,可以为 MLLM 配备强大的像素级能力。我们的代码和模型已公开。
解锁内隐体验:从文本合成工具使用轨迹
- 标题: Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
- 作者: Zhihao Xu, Rumei Li, Jiahuan Li, Rongxiang Weng, Jingang Wang, Xunliang Cai, Xiting Wang
- 日期: 2026-01-15
- ArXiv主页 : https://arxiv.org/abs/2601.10355
- 论文链接 : https://arxiv.org/pdf/2601.10355
英文摘要
Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ-bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.
中文摘要
使大型语言模型 (LLM) 能够在多轮交互中有效地使用工具,对于构建有能力的自主代理至关重要。然而,获取多样化且真实的多轮工具使用数据仍然是一个重大挑战。在这项工作中,我们提出了一种新颖的基于文本的范式。我们观察到,文本语料库天然包含丰富的多步骤问题解决经验,可以作为多轮工具使用任务的未开发、可扩展且真实的数据源。基于这一见解,我们引入了 GEM,这是一种数据合成流程,通过四个阶段从文本语料库中生成和提取多轮工具使用轨迹:相关性过滤、工作流和工具提取、轨迹落地 (trajectory grounding) 和复杂性细化。为了降低计算成本,我们通过监督微调进一步训练了一个专门的轨迹合成器 (Trajectory Synthesizer)。该模型将复杂的生成流程蒸馏为高效的端到端轨迹生成器。实验表明,我们的 GEM-32B 在 BFCL V3 Multi-turn 基准上取得了 16.5% 的提升。我们的模型部分超过了在 τ-bench(Airline 和 Retail)域内数据上训练的模型的性能,突出了我们基于文本的合成范式所带来的卓越泛化能力。值得注意的是,我们的轨迹合成器达到了完整流程的质量,同时显著降低了推理延迟和成本。
多重思维:通过标记式分支合并进行推理
- 标题: Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
- 作者: Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu
- 日期: 2026-01-13
- ArXiv主页 : https://arxiv.org/abs/2601.08808
- 论文链接 : https://arxiv.org/pdf/2601.08808
- 项目链接 : https://gmlr-penn.github.io/Multiplex-Thinking/
- gitHub仓库 : https://github.com/GMLR-Penn/Multiplex-Thinking
英文摘要
Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
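The branch-and-merge step described in the abstract (sample K candidate tokens, then aggregate their embeddings into one continuous token) can be sketched as follows. This is a toy illustration under stated assumptions: candidates are drawn independently with replacement, the aggregation is a plain average, and the vocabulary, logits, and embeddings are made up. The paper's actual aggregation rule and its RL training are not reproduced here.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multiplex_token(logits, embeddings, k, rng):
    """Sample k token ids from the next-token distribution and average their embeddings.

    The uniform average is an assumption made for this sketch.
    """
    probs = softmax(logits)
    sampled = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [sum(embeddings[t][d] for t in sampled) / k for d in range(dim)]
    return sampled, merged

rng = random.Random(0)
# Hypothetical 3-token vocabulary with 2-D embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Confident model: one logit dominates, so every sample is the same token.
confident_ids, confident_vec = multiplex_token([50.0, 0.0, 0.0], embeddings, 4, rng)

# Uncertain model: flat logits, so the merged vector blends several tokens.
uncertain_ids, uncertain_vec = multiplex_token([0.0, 0.0, 0.0], embeddings, 4, rng)

print(confident_ids, confident_vec)
print(uncertain_ids, uncertain_vec)
```

The confident case collapses to a single ordinary token embedding, which matches the abstract's remark that the multiplex token is nearly discrete when the model is confident, while the uncertain case packs several plausible next steps into one position without lengthening the sequence.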
中文摘要
大型语言模型通常借助思维链 (CoT) 更有效地解决复杂的推理任务,但代价是冗长、低带宽的 token 序列。相比之下,人类通常通过在合理的后续步骤上维持一个分布来进行软性推理。受此启发,我们提出了 Multiplex Thinking,这是一种随机软推理机制,在每个思考步骤中采样 K 个候选 token,并将它们的嵌入聚合成单个连续的 multiplex token。这保留了词表嵌入先验和标准离散生成的采样动态,同时在 multiplex rollout 上导出易于处理的概率分布。因此,multiplex 轨迹可以通过同策略 (on-policy) 强化学习 (RL) 直接优化。重要的是,Multiplex Thinking 是自适应的:当模型有信心时,multiplex token 几乎是离散的,表现得像标准 CoT;当模型不确定时,它可以紧凑地表示多个合理的下一步,而不增加序列长度。在具有挑战性的数学推理基准上,Multiplex Thinking 从 Pass@1 到 Pass@1024 始终优于强大的离散 CoT 和 RL 基线,同时生成更短的序列。代码和检查点可在 https://github.com/GMLR-Penn/Multiplex-Thinking 获取。
GutenOCR:一个基于视觉语言的文档前端
- 标题: GutenOCR: A Grounded Vision-Language Front-End for Documents
- 作者: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew
- 日期: 2026-01-20
- ArXiv主页 : https://arxiv.org/abs/2601.14490
- 论文链接 : https://arxiv.org/pdf/2601.14490
- 项目链接 : https://ocr.roots.ai/
- gitHub仓库 : https://github.com/Roots-Automation/GutenOCR
英文摘要
GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
中文摘要
GutenOCR 是通过微调 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 得到的一系列具备定位能力 (grounded) 的 OCR 前端。由此产生的单检查点视觉语言模型通过一个统一的、基于提示的接口提供阅读、检测和定位功能。这些模型在商业文档、科学文章和合成定位数据上训练,支持带有行级和段落级边界框的全页阅读和局部阅读,以及条件式的"x 在哪里?"查询。我们引入了一种 grounded OCR 评估协议,并表明在 10.5K 个留出的商业和科学页面上,GutenOCR-7B 将其 Qwen2.5-VL-7B 主干的综合 grounded OCR 分数提高了一倍以上(从 0.40 到 0.82)。在 Fox 和 OmniDocBench v1.5 上,我们的方法大幅改进了区域级和行级 OCR 以及文本检测召回率,但也揭示了在页面级线性化、颜色引导 OCR 和公式密集型布局方面的权衡。
FutureOmni:从多模式法学硕士的全模式环境中评估未来预测
- 标题: FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs
- 作者: Qian Chen, Jinlan Fu, Changsong Li, See-Kiong Ng, Xipeng Qiu
- 日期: 2026-01-20
- ArXiv主页 : https://arxiv.org/abs/2601.13836
- 论文链接 : https://arxiv.org/pdf/2601.13836
- 项目链接 : https://openmoss.github.io/FutureOmni
- gitHub仓库 : https://github.com/OpenMOSS/FutureOmni
英文摘要
Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).
中文摘要
尽管多模态大语言模型 (MLLM) 表现出强大的全模态感知能力,但它们根据视听线索预测未来事件的能力在很大程度上仍未得到探索,因为现有基准主要侧重于回顾性理解。为了弥补这一差距,我们引入了 FutureOmni,这是第一个旨在评估基于视听环境的全模态未来预测的基准。被评估的模型需要执行跨模态因果和时间推理,并有效利用内部知识来预测未来事件。FutureOmni 通过可扩展的 LLM 辅助、人在回路 (human-in-the-loop) 的流程构建,包含 8 个主要领域的 919 个视频和 1,034 个多项选择问答对。对 13 个全模态模型和 7 个纯视频模型的评估表明,当前系统在视听未来预测方面存在困难,特别是在语音密集的场景中,其中 Gemini 3 Flash 取得了 64.8% 的最佳准确率。为了缓解这一限制,我们整理了一个 7K 样本的指令微调数据集,并提出了全模态未来预测 (OFF) 训练策略。在 FutureOmni 以及流行的视听和纯视频基准上的评估表明,OFF 增强了未来预测和泛化能力。我们公开发布所有代码 (https://github.com/OpenMOSS/FutureOmni) 和数据集 (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni)。
Terminal-Bench:在命令行界面中针对艰巨、现实的任务对代理进行基准测试
- 标题: Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
- 作者: Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap, Jan-Lucas Uslu, Jeffrey Li, Jianbo Wu, Minghao Yan, Song Bian, Vedang Sharma, Ke Sun, Steven Dillmann, Akshay Anand, Andrew Lanpouthakoun, Bardia Koopah, Changran Hu, Etash Guha, Gabriel H. S. Dreiman, Jiacheng Zhu, Karl Krauth, Li Zhong, Niklas Muennighoff, Robert Amanfu, Shangyin Tan, Shreyas Pimpalgaonkar, Tushar Aggarwal, Xiangning Lin, Xin Lan, Xuandong Zhao, Yiqing Liang, Yuanli Wang, Zilong Wang, Changzhi Zhou, David Heineman, Hange Liu, Harsh Trivedi, John Yang, Junhong Lin, Manish Shetty, Michael Yang, Nabil Omi, Negin Raoof, Shanda Li, Terry Yue Zhuo, Wuwei Lin, Yiwei Dai, Yuxin Wang, Wenhao Chai, Shang Zhou, Dariush Wahdany, Ziyu She, Jiaming Hu, Zhikang Dong, Yuxuan Zhu, Sasha Cui, Ahson Saiyed, Arinbjörn Kolbeinsson, Jesse Hu, Christopher Michael Rytting, Ryan Marten, Yixin Wang, Alex Dimakis, Andy Konwinski, Ludwig Schmidt
- 日期: 2026-01-17
- ArXiv主页 : https://arxiv.org/abs/2601.11868
- gitHub仓库 : https://github.com/laude-institute/terminal-bench
英文摘要
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/ .
中文摘要
AI 智能体可能很快就能在不同领域自主完成有价值的长程任务。当前的基准要么不衡量现实世界的任务,要么难度不足,无法有意义地衡量前沿模型。为此,我们推出了 Terminal-Bench 2.0:一个精心构建的高难度基准,由计算机终端环境中的 89 个任务组成,其灵感来自真实工作流程中的问题。每个任务都具有独特的环境、人工编写的解决方案和用于验证的全面测试。我们表明,前沿模型和智能体在该基准上的得分低于 65%,并进行了错误分析,以确定模型和智能体需要改进的方面。我们在 https://www.tbench.ai/ 上发布了数据集和评估工具,以帮助开发人员和研究人员开展未来的工作。
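The abstract says each task ships with tests for verification. The sketch below shows one way test-based verification could feed a benchmark score. It is not the Terminal-Bench harness or its API: the `Task` class, the "every check must exit with status 0" rule, and the toy checks are all assumptions made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record: a name plus verification commands (argv lists)."""
    name: str
    checks: list = field(default_factory=list)


def task_resolved(task, timeout_s=30.0):
    """Assumed rule: a task is resolved only if it has checks and all exit with 0."""
    if not task.checks:
        return False
    for argv in task.checks:
        try:
            result = subprocess.run(argv, capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:
            return False
    return True


def resolution_rate(tasks):
    """Fraction of tasks whose checks all pass."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(task_resolved(task) for task in tasks) / len(tasks)


# Toy checks that call the current Python interpreter so the sketch runs anywhere.
py = sys.executable
tasks = [
    Task("passes", [[py, "-c", "import sys; sys.exit(0)"]]),
    Task("fails-second-check", [[py, "-c", "pass"], [py, "-c", "import sys; sys.exit(1)"]]),
]
rate = resolution_rate(tasks)
print(f"resolved {rate:.0%} of tasks")
```

In a real setup each check would run inside the task's own isolated environment rather than on the host. That isolation is omitted here to keep the sketch self-contained.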
AgencyBench:在 100 万代币现实世界环境中对自治代理的前沿进行基准测试
- 标题: AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
- 作者: Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu
- 日期: 2026-01-16
- ArXiv主页 : https://arxiv.org/abs/2601.11044
- 论文链接 : https://arxiv.org/pdf/2601.11044
- 项目链接 : https://agencybench.opensii.ai
- gitHub仓库 : https://github.com/GAIR-NLP/AgencyBench
英文摘要
Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
中文摘要
基于大型语言模型 (LLM) 的自主智能体展示了多方面的能力,可以为经济生产做出重大贡献。然而,现有的基准仍然侧重于单一的智能体能力,无法覆盖长程的现实场景。此外,现实任务对人在回路(human-in-the-loop)反馈的依赖造成了可扩展性瓶颈,阻碍了自动化的 rollout 收集和评估。为了弥补这一差距,我们引入了 AgencyBench,这是一个源自日常 AI 使用情况的综合基准,在 32 个现实场景中评估 6 项核心智能体能力,共包含 138 项具有特定查询、可交付成果和评分标准(rubric)的任务。这些场景平均需要 90 次工具调用、100 万个 token 和数小时的执行时间才能解决。为了实现自动化评估,我们采用用户模拟智能体来提供迭代反馈,并使用 Docker 沙箱进行基于评分标准的视觉和功能评估。实验表明,闭源模型的性能明显优于开源模型(48.4% vs 32.1%)。进一步的分析揭示了不同模型在资源效率、反馈驱动的自我纠正和特定工具使用偏好方面的显著差异。最后,我们研究了智能体脚手架(agentic scaffolds)的影响,观察到专有模型在其原生生态系统中表现出更优的性能(例如,通过 Claude-Agent-SDK 运行的 Claude-4.5-Opus),而开源模型则表现出各不相同的性能峰值,这表明存在针对特定执行框架进行优化的可能。AgencyBench 是下一代智能体的关键测试平台,强调了模型架构与智能体框架协同优化的必要性。我们相信这项工作揭示了自主智能体的未来方向,我们在 https://github.com/GAIR-NLP/AgencyBench 发布了完整的基准和评估工具包。
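The abstract describes rubric-based assessment combined with iterative feedback from a simulated user. The following is a minimal sketch of that loop under stated assumptions. The rubric items, the toy agent, the "fraction of rubric items met" score, and the round limit are all hypothetical and do not reproduce the AgencyBench toolkit.

```python
def score(deliverable, rubric):
    """Return (fraction of rubric items met, descriptions of unmet items)."""
    if not rubric:
        raise ValueError("rubric must contain at least one item")
    failed = [desc for desc, check in rubric if not check(deliverable)]
    return (len(rubric) - len(failed)) / len(rubric), failed


def run_with_feedback(agent, rubric, max_rounds=3, pass_threshold=1.0):
    """Let the agent revise its deliverable using simulated-user feedback.

    The simulated user is reduced to its simplest form here: it reports
    the rubric items that are still unmet after each round.
    """
    feedback = []
    history = []
    for _ in range(max_rounds):
        deliverable = agent(feedback)
        current, failed = score(deliverable, rubric)
        history.append(current)
        if current >= pass_threshold:
            break
        feedback = failed
    return history


def make_toy_agent():
    """Toy stand-in for an agent: it addresses one reported item per round."""
    fixed = set()

    def agent(feedback):
        if feedback:
            fixed.add(feedback[0])
        body = "draft"
        if "mentions tests" in fixed:
            body += " with tests"
        if "has a title" in fixed:
            body = "# " + body
        return body

    return agent


# Hypothetical rubric: (description, predicate over the deliverable text).
rubric = [
    ("has a title", lambda d: d.startswith("# ")),
    ("mentions tests", lambda d: "tests" in d),
]
history = run_with_feedback(make_toy_agent(), rubric)
print(history)
```

Keeping the per-round scores, rather than only the final one, is what would allow a comparison of feedback-driven self-correction across models, one of the dimensions the abstract reports.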