中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- 少即是多:微型网络的递归推理
- 代理通过早期经验学习
- 代理上下文工程:不断发展的上下文以实现自我改进的语言模型
- Apriel-1.5-15b-Thinker
- Paper2Video:从科学论文自动生成视频
- MM-HELIX:通过整体平台和自适应混合策略优化促进多模态长链反思推理
- 用于有效规划和工具使用的流程中代理系统优化
- 缓存到缓存:大型语言模型之间的直接语义通信
- Fathom-DeepResearch:解锁 SLM 的长视野信息检索和合成
- 低概率词元在可验证奖励的强化学习中维持探索
- DreamOmni2:基于多模态指令的编辑和生成
- Ming-UniVision:使用统一连续分词器进行联合图像理解和生成
- MemMamba:重新思考状态空间模型中的内存模式
- UniVideo:视频的统一理解、生成和编辑
- VideoCanvas:通过上下文条件从任意时空补丁统一视频完成
- TaTToo:基于工具的思维 PRM,用于表格推理中的测试时间缩放
- 大型推理模型从有缺陷的思维中学习更好的对齐
- 元意识增强推理模型:自对齐强化学习
- Fast-dLLM v2:高效的块扩散大语言模型
- Lumina-DiMOO:用于多模态生成和理解的全方位扩散大语言模型
- 当想法遇到事实:长上下文 LM 的可重用推理
- Video-LMM 后训练:深入研究大型多模态模型的视频推理
- 从什么到为什么:基于证据的化学反应条件推理的多智能体系统
- 免训练组相对策略优化
- MITS:通过逐点互信息增强大语言模型的树搜索推理
- CoDA:通过扩散适应构建代码语言模型
- 对齐华尔兹:联合训练智能体以实现安全协作
- 通过渐进一致性蒸馏的高效多模态大型语言模型
- RLinf-VLA:VLA+RL 训练的统一高效框架
- VChain:视频生成推理的视觉思维链
少即是多:微型网络的递归推理
- 标题: Less is More: Recursive Reasoning with Tiny Networks
- 作者: Alexia Jolicoeur-Martineau
- 日期: 2025-10-06
- ArXiv主页 : https://arxiv.org/abs/2510.04871
- 论文链接 : https://arxiv.org/pdf/2510.04871
- 项目链接 : https://alexiajm.github.io/2025/09/29/tiny_recursive_models.html#
- gitHub仓库 : https://github.com/SamsungSAILMontreal/TinyRecursiveModels
英文摘要
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
中文摘要
分层推理模型(HRM)是一种新颖方法,使用两个以不同频率递归的小型神经网络。这种受生物学启发的方法仅用小模型(27M 参数)在小数据(约 1000 个示例)上训练,就在数独、迷宫和 ARC-AGI 等难题任务上击败了大型语言模型(LLM)。HRM 在用小型网络解决难题方面前景广阔,但其机制尚未得到很好的理解,并且可能不是最优的。我们提出了微型递归模型(TRM),这是一种简单得多的递归推理方法,仅使用一个只有 2 层的微型网络,就取得了显著高于 HRM 的泛化能力。TRM 仅有 7M 参数,在 ARC-AGI-1 上获得 45% 的测试准确率,在 ARC-AGI-2 上获得 8%,以不到对方 0.01% 的参数量超过了大多数大型语言模型(例如 Deepseek R1、o3-mini、Gemini 2.5 Pro)。
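下面给出一个仅作示意的最小 PyTorch 草图,用于说明"单个两层小网络围绕输入、当前答案与潜变量反复递归,再用精炼后的潜变量更新答案"的思路;其中的网络结构、向量维度与递归次数(n_inner、n_outer)均为此处的假设,并非论文的原始实现。

```python
import torch
import torch.nn as nn

class TinyRecursiveSketch(nn.Module):
    """示意用的递归推理草图:单个两层网络反复精炼潜变量 z,再更新答案 y(结构与超参均为假设)。"""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.core = nn.Sequential(nn.Linear(dim * 3, dim), nn.ReLU(), nn.Linear(dim, dim))     # 精炼 z
        self.readout = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim))  # 更新 y

    def forward(self, x, y, z, n_inner: int = 6, n_outer: int = 3):
        for _ in range(n_outer):              # 外层:多轮"答案改进"
            for _ in range(n_inner):          # 内层:围绕 (x, y, z) 递归精炼潜变量 z
                z = self.core(torch.cat([x, y, z], dim=-1))
            y = self.readout(torch.cat([y, z], dim=-1))
        return y

x, y, z = torch.randn(4, 128), torch.zeros(4, 128), torch.zeros(4, 128)
print(TinyRecursiveSketch()(x, y, z).shape)  # torch.Size([4, 128])
```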
代理通过早期经验学习
- 标题: Agent Learning via Early Experience
- 作者: Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu
- 日期: 2025-10-09
- ArXiv主页 : https://arxiv.org/abs/2510.08558
- 论文链接 : https://arxiv.org/pdf/2510.08558
英文摘要
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
中文摘要
语言智能体的长期目标是通过自身经验进行学习和改进,最终在复杂的现实世界任务中超越人类。然而,在许多环境中,用强化学习从经验数据中训练智能体仍然很困难:这些环境要么缺乏可验证的奖励(例如网站),要么需要低效的长程推演(例如多轮工具使用)。因此,当前大多数智能体都依赖对专家数据的监督微调,而这种方式难以扩展且泛化能力较差。这种局限源于专家演示的本质:它们只覆盖狭窄的场景范围,让智能体接触到的环境多样性有限。我们用一种称为"早期经验"的中间范式来解决这一局限:由智能体自身动作产生的交互数据,其中产生的未来状态在没有奖励信号的情况下充当监督。在这一范式下,我们研究了使用此类数据的两种策略:(1)隐式世界建模,利用收集到的状态使策略扎根于环境动态;(2)自我反思,让智能体从自身的次优动作中学习,以改进推理和决策。我们在八个不同的环境和多个模型系列上进行了评估。我们的方法持续提升了有效性和域外泛化能力,凸显了早期经验的价值。此外,在具有可验证奖励的环境中,我们的结果给出了有希望的信号:早期经验为后续强化学习提供了坚实的基础,使其成为模仿学习与完全经验驱动的智能体之间的实用桥梁。
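下面用一个仅作示意的 Python 草图说明"早期经验"两类训练数据的构造方式:以智能体自身采样的动作及其导致的未来状态作为监督,全程不需要奖励信号。其中 propose_actions、env_step 等函数均为假设的占位实现,并非论文接口。

```python
import random

def propose_actions(state, k=3):
    return [f"action_{i}_on_{state}" for i in range(k)]   # 假设:候选动作

def env_step(state, action):
    return f"{state}->{action}"                            # 假设:环境转移,返回下一状态

def build_early_experience(init_states):
    world_modeling, self_reflection = [], []
    for s in init_states:
        candidates = propose_actions(s)
        chosen = random.choice(candidates)                 # 智能体自己采样的动作(可能次优)
        s_next = env_step(s, chosen)
        # (1) 隐式世界建模:监督信号是动作导致的未来状态,不需要奖励
        world_modeling.append({"input": (s, chosen), "target": s_next})
        # (2) 自我反思:把其余候选动作的后果也展开,供模型对比并解释取舍
        outcomes = {a: env_step(s, a) for a in candidates}
        self_reflection.append({"state": s, "chosen": chosen, "alternatives": outcomes})
    return world_modeling, self_reflection

wm, sr = build_early_experience(["s0", "s1"])
print(len(wm), len(sr))
```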
代理上下文工程:不断发展的上下文以实现自我改进的语言模型
- 标题: Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
- 作者: Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun
- 日期: 2025-10-06
- ArXiv主页 : https://arxiv.org/abs/2510.04618
- 论文链接 : https://arxiv.org/pdf/2510.04618
英文摘要
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation -- modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
中文摘要
大型语言模型(LLM)应用(例如智能体和特定领域推理)越来越依赖上下文适应,即用指令、策略或证据修改输入,而不是更新权重。先前的方法提高了可用性,但常常存在简洁性偏差(为了得到简短摘要而丢弃领域洞察),并且会受到上下文坍缩的影响(迭代重写随时间推移不断侵蚀细节)。在 Dynamic Cheatsheet 引入的自适应记忆基础上,我们提出 ACE(代理上下文工程),该框架将上下文视为不断演进的"剧本",通过生成、反思和整理的模块化流程来积累、精炼和组织策略。ACE 通过结构化的增量更新防止坍缩,这些更新既保留详细知识,又能随长上下文模型扩展。在智能体和特定领域的基准测试中,ACE 同时优化离线上下文(例如系统提示)和在线上下文(例如智能体记忆),并始终优于强基线:智能体任务提升 10.6%,金融任务提升 8.6%,同时显著降低适应延迟和推演成本。值得注意的是,ACE 无需带标签的监督,仅利用自然的执行反馈即可有效适应。在 AppWorld 排行榜上,尽管使用的是较小的开源模型,ACE 在总体平均成绩上与排名第一的生产级智能体持平,并在更难的 test-challenge 划分上超越了它。这些结果表明,全面且不断演进的上下文能够以较低的开销实现可扩展、高效且自我改进的 LLM 系统。
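下面是一个仅作示意的草图,说明"把上下文当作条目化剧本进行结构化增量更新"的思路:反思阶段从执行反馈中提炼候选经验,整理阶段只追加新条目或更新已有条目的计数,而不是整段重写,从而避免上下文坍缩。其中 reflect 的逻辑与条目字段均为假设,真实系统中应由 LLM 依据执行反馈生成。

```python
# 剧本(playbook):每个条目记录一条策略及其正/负反馈计数
playbook = []  # {"id": int, "strategy": str, "helpful": int, "harmful": int}

def reflect(trajectory, success: bool):
    """假设:从一次执行轨迹中提炼一条候选经验。"""
    return f"当遇到『{trajectory}』时,{'沿用' if success else '避免'}当前做法"

def curate(playbook, new_strategy: str, success: bool):
    for item in playbook:                       # 已有条目只更新计数,保留细节,避免"上下文坍缩"
        if item["strategy"] == new_strategy:
            item["helpful" if success else "harmful"] += 1
            return
    playbook.append({"id": len(playbook), "strategy": new_strategy,
                     "helpful": int(success), "harmful": int(not success)})

for traj, ok in [("API 分页查询", True), ("API 分页查询", True), ("直接全表拉取", False)]:
    curate(playbook, reflect(traj, ok), ok)

context = "\n".join(f"[{p['id']}] {p['strategy']} (+{p['helpful']}/-{p['harmful']})" for p in playbook)
print(context)  # 作为系统提示或智能体记忆注入
```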
Apriel-1.5-15b-Thinker
- 标题: Apriel-1.5-15b-Thinker
- 作者: Shruthan Radhakrishna, Aman Tiwari, Aanjaneya Shukla, Masoud Hashemi, Rishabh Maheshwary, Shiva Krishna Reddy Malay, Jash Mehta, Pulkit Pattnaik, Saloni Mittal, Khalil Slimi, Kelechi Ogueji, Akintunde Oladipo, Soham Parikh, Oluwanifemi Bamgbose, Toby Liang, Ahmed Masry, Khyati Mahajan, Sai Rajeswar Mudumba, Vikas Yadav, Sathwik Tejaswi Madhusudhan, Torsten Scholak, Sagar Davasam, Srinivas Sunkara, Nicholas Chapados
- 日期: 2025-10-01
- ArXiv主页 : https://arxiv.org/abs/2510.01141
- 论文链接 : https://arxiv.org/pdf/2510.01141
英文摘要
We present Apriel-1.5-15B-Thinker, a 15-billion parameter open-weights multimodal reasoning model that achieves frontier-level performance through training design rather than sheer scale. Starting from Pixtral-12B, we apply a progressive three-stage methodology: (1) depth upscaling to expand reasoning capacity without pretraining from scratch, (2) staged continual pre-training that first develops foundational text and vision understanding, then enhances visual reasoning through targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception, and (3) high-quality text-only supervised fine-tuning on curated instruction-response pairs with explicit reasoning traces spanning mathematics, coding, science, and tool use. Notably, our model achieves competitive results without reinforcement learning or preference optimization, isolating the contribution of our data-centric continual pre-training approach. On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52, matching DeepSeek-R1-0528 despite requiring significantly fewer computational resources. Across ten image benchmarks, its performance is on average within five points of Gemini-2.5-Flash and Claude Sonnet-3.7, a key achievement for a model operating within single-GPU deployment constraints. Our results demonstrate that thoughtful mid-training design can close substantial capability gaps without massive scale, making frontier-level multimodal reasoning accessible to organizations with limited infrastructure. We release the model checkpoint, all training recipes, and evaluation protocols under the MIT license to advance open-source research.
中文摘要
我们推出了 Apriel-1.5-15B-Thinker,这是一个 150 亿参数的开放权重多模态推理模型,它通过训练设计而非单纯的规模来实现前沿水平的性能。从 Pixtral-12B 出发,我们采用渐进的三阶段方法:(1)深度上采样,在无需从头预训练的情况下扩展推理能力;(2)分阶段的持续预训练,先建立基础的文本与视觉理解,再通过针对空间结构、组合理解和细粒度感知的定向合成数据生成来增强视觉推理;(3)在带有显式推理轨迹(涵盖数学、编码、科学和工具使用)的精选指令-响应对上进行高质量的纯文本监督微调。值得注意的是,我们的模型在没有强化学习或偏好优化的情况下取得了有竞争力的结果,从而单独体现了我们以数据为中心的持续预训练方法的贡献。在 Artificial Analysis 智能指数上,Apriel-1.5-15B-Thinker 得分为 52,与 DeepSeek-R1-0528 相当,而所需计算资源显著更少。在十个图像基准测试中,其性能平均与 Gemini-2.5-Flash 和 Claude Sonnet-3.7 相差不到 5 分,这对于一个在单 GPU 部署约束下运行的模型来说是一项关键成就。我们的结果表明,深思熟虑的中期训练设计可以在不依赖大规模扩展的情况下缩小巨大的能力差距,使基础设施有限的组织也能获得前沿级的多模态推理能力。我们在 MIT 许可下发布模型检查点、全部训练配方和评估协议,以推进开源研究。
Paper2Video:从科学论文自动生成视频
- 标题: Paper2Video: Automatic Video Generation from Scientific Papers
- 作者: Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
- 日期: 2025-10-06
- ArXiv主页 : https://arxiv.org/abs/2510.05096
- 论文链接 : https://arxiv.org/pdf/2510.05096
- 项目链接 : https://showlab.github.io/Paper2Video/
- gitHub仓库 : https://github.com/showlab/Paper2Video
英文摘要
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel effective tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
中文摘要
学术演示视频已成为研究交流的重要媒介,但其制作仍然是高度劳动密集型的:一段 2 到 10 分钟的短视频往往需要数小时的幻灯片设计、录制和剪辑。与自然视频不同,演示视频生成面临独特的挑战:输入来自研究论文,包含密集的多模态信息(文本、图形、表格),并且需要协调幻灯片、字幕、语音和真人讲者等多个对齐的通道。为了应对这些挑战,我们推出了 Paper2Video,这是第一个由 101 篇研究论文与作者制作的演示视频、幻灯片和讲者元数据配对组成的基准。我们进一步设计了四个量身定制的评估指标(元相似度、PresentArena、PresentQuiz 和 IP Memory)来衡量视频向观众传达论文信息的效果。在此基础上,我们提出了 PaperTalker,这是第一个用于学术演示视频生成的多智能体框架。它通过新颖而有效的树搜索视觉选择、光标定位、字幕生成、语音合成和讲者头像渲染,将幻灯片生成与有效的布局细化结合起来,同时按幻灯片并行生成以提高效率。在 Paper2Video 上的实验表明,我们的方法生成的演示视频比现有基线更忠实、信息更丰富,朝着自动化、即用型的学术视频生成迈出了切实的一步。我们的数据集、智能体和代码可在 https://github.com/showlab/Paper2Video 获取。
MM-HELIX:通过整体平台和自适应混合策略优化促进多模态长链反思推理
- 标题: MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
- 作者: Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang
- 日期: 2025-10-09
- ArXiv主页 : https://arxiv.org/abs/2510.08540
- 论文链接 : https://arxiv.org/pdf/2510.08540
- 项目链接 : https://mm-helix.github.io/
- gitHub仓库 : https://github.com/PhoenixZ810/MM-HELIX
英文摘要
While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
中文摘要
虽然当前的多模态大型语言模型(MLLM)已在数学和逻辑等推理任务上表现出熟练能力,但它们的长链反思推理能力(解决复杂现实问题的先决条件)在很大程度上仍未得到充分探索。在这项工作中,我们首先进行了广泛的实证调查来评估这种能力。利用精心设计的数据合成引擎,我们构建了 MM-HELIX,这是一个多模态基准,由 42 个需要迭代思考和回溯的高难度合成任务的 1,260 个样本组成。该基准上的实证结果表明,现有 MLLM 在长链反思推理上存在显著的性能缺陷。为了解决这一局限,我们生成训练后数据,并进一步探索利用这些数据的学习范式。我们首先开发了逐步引导式响应生成(Step-Elicited Response Generation)管道,构建了 MM-HELIX-100K,这是一个包含 10 万条高质量反思推理轨迹、用于指令微调阶段的大规模数据集。鉴于标准强化学习由于奖励信号稀疏以及监督微调后的灾难性遗忘而在复杂任务上失败,我们提出了自适应混合策略优化(AHPO),这是一种新颖的训练策略,可以将离线监督和在线优化动态地统一到单一阶段。该策略使模型能够在奖励稀疏时从专家数据中学习,并在熟练后进行独立探索。当应用于 Qwen2.5-VL-7B 基线时,我们的方法在 MM-HELIX 基准上实现了 +18.6% 的准确率提升,并在一般数学和逻辑任务上表现出强大的泛化能力,平均性能增益 +5.7%。我们的工作表明,MLLM 中的反思推理可以被有效地学习和泛化,为开发更强大的 MLLM 铺平了道路。
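下面给出一个仅作示意的损失选择草图,说明自适应混合策略优化(AHPO)的核心思路:整组推演都拿不到奖励时退回到对专家数据的离线监督项,一旦出现奖励信号则切换为带组归一化优势的在线策略梯度项。切换判据、权重 alpha 与损失形式均为此处的简化假设。

```python
def ahpo_step(group_rewards, logp_expert, logp_rollouts, alpha=1.0):
    """仅作示意:根据一组推演的奖励情况,在离线监督项与在线策略梯度项之间切换。"""
    mean_r = sum(group_rewards) / len(group_rewards)
    if mean_r == 0.0:                         # 整组无奖励(稀疏):退化为对专家数据的监督项
        return -alpha * logp_expert
    # 否则使用组内归一化优势的在线策略梯度项(GRPO 风格)
    var = sum((r - mean_r) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5 if var > 0 else 1.0
    advantages = [(r - mean_r) / std for r in group_rewards]
    return -sum(a * lp for a, lp in zip(advantages, logp_rollouts)) / len(group_rewards)

print(ahpo_step([0, 0, 0, 0], logp_expert=-2.3, logp_rollouts=[-5, -4, -6, -7]))  # 稀疏奖励 → 专家监督
print(ahpo_step([1, 0, 1, 0], logp_expert=-2.3, logp_rollouts=[-5, -4, -6, -7]))  # 有奖励 → 在线优化
```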
用于有效规划和工具使用的流程中代理系统优化
- 标题: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
- 作者: Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu
- 日期: 2025-10-07
- ArXiv主页 : https://arxiv.org/abs/2510.05592
- 论文链接 : https://arxiv.org/pdf/2510.05592
- 项目链接 : https://lupantech.github.io/
- gitHub仓库 : https://github.com/lupantech/AgentFlow
英文摘要
Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
中文摘要
结果驱动的强化学习推动了大型语言模型(LLM)推理能力的进步,但流行的工具增强方法只训练单一的整体式策略,在完整上下文中交替进行思考和工具调用;这种方式在长任务周期和多样化工具下扩展性很差,对新场景的泛化能力也很弱。代理系统通过把工作分解到专门模块中提供了一种有前景的替代方案,但大多数仍然是免训练的,或者依赖与多轮交互的实时动态脱节的离线训练。我们引入了 AgentFlow,这是一个可训练的、流程中优化的代理框架,它通过不断演进的记忆协调四个模块(规划器、执行器、验证器、生成器),并直接在多轮循环内优化其规划器。为了在真实环境中进行同策略训练,我们提出了基于流的组精炼策略优化(Flow-GRPO),它通过将多轮优化转化为一系列易于处理的单轮策略更新,来解决长周期、稀疏奖励下的信用分配问题。它将单一的、可验证的轨迹级结果广播到每一轮,使局部规划决策与全局成功保持一致,并利用组归一化优势稳定学习。在十个基准测试中,采用 7B 规模骨干网络的 AgentFlow 优于表现最好的基线:搜索任务平均准确率提升 14.9%,代理任务提升 14.0%,数学任务提升 14.5%,科学任务提升 4.1%,甚至超过了 GPT-4o 等更大的专有模型。进一步的分析证实了流程中优化的好处:规划得到改进、工具调用可靠性增强,并且随模型规模和推理轮数呈现正向扩展。
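下面是一个仅作示意的信用分配草图,说明 Flow-GRPO"把单一的、可验证的轨迹级结果广播到每一轮,并用组归一化优势稳定学习"的做法;数据结构与函数签名均为假设,真实实现还涉及规划器、执行器、验证器、生成器等模块的协调。

```python
def flow_grpo_advantages(group_outcomes, turns_per_traj):
    """group_outcomes: 每条多轮轨迹的可验证最终结果(0/1);
    返回:每条轨迹中每一轮共享的组归一化优势。"""
    mean = sum(group_outcomes) / len(group_outcomes)
    var = sum((o - mean) ** 2 for o in group_outcomes) / len(group_outcomes)
    std = var ** 0.5 if var > 0 else 1.0
    per_traj_adv = [(o - mean) / std for o in group_outcomes]
    # 广播:同一轨迹内的每一轮(单轮策略更新)都使用同一个轨迹级优势
    return [[adv] * n for adv, n in zip(per_traj_adv, turns_per_traj)]

print(flow_grpo_advantages([1, 0, 1, 1], turns_per_traj=[3, 5, 2, 4]))
```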
缓存到缓存:大型语言模型之间的直接语义通信
- 标题: Cache-to-Cache: Direct Semantic Communication Between Large Language Models
- 作者: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
- 日期: 2025-10-03
- ArXiv主页 : https://arxiv.org/abs/2510.03215
- 论文链接 : https://arxiv.org/pdf/2510.03215
- 项目链接 : https://fuvty.github.io/C2C_Project_Page/
- gitHub仓库 : https://github.com/thu-nics/C2C
英文摘要
Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
中文摘要
多 LLM 系统利用多种大型语言模型的互补优势,实现单一模型无法达到的性能和效率提升。在现有设计中,LLM 通过文本进行通信,迫使内部表示被转换为输出词元序列。这个过程既丢失了丰富的语义信息,又带来了逐词元生成的延迟。受这些限制的启发,我们提出疑问:LLM 能否以超越文本的方式进行通信?Oracle 实验表明,丰富 KV-Cache 语义可以在不增加缓存大小的情况下提高响应质量,支持将 KV-Cache 作为模型间通信的有效媒介。因此,我们提出了缓存到缓存(C2C),这是 LLM 之间直接语义通信的新范式。C2C 使用神经网络将源模型的 KV 缓存投影并与目标模型的 KV 缓存融合,以实现直接的语义传递。可学习的门控机制选择能从缓存通信中受益的目标层。与文本通信相比,C2C 利用了两个模型深层的专门语义,同时避免了显式的中间文本生成。实验表明,C2C 的平均准确率比单个模型高 8.5-10.5%,比文本通信范式高约 3.0-5.0%,同时延迟平均加速 2.0 倍。我们的代码可在 https://github.com/thu-nics/C2C 获取。
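下面给出一个仅作示意的 PyTorch 草图,说明"用神经网络把源模型的 KV 缓存投影并与目标模型的 KV 融合,再由可学习门控决定该层是否采用融合结果"的思路;层结构、维度与门控形式均为假设,并非 C2C 的原始实现。

```python
import torch
import torch.nn as nn

class C2CFuserSketch(nn.Module):
    """示意用的缓存融合草图:源 KV 投影到目标空间后与目标 KV 融合,门控控制采用程度。"""
    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, tgt_dim)        # 源 KV → 目标表示空间
        self.fuse = nn.Linear(tgt_dim * 2, tgt_dim)    # 融合网络
        self.gate = nn.Parameter(torch.zeros(1))       # 每层一个可学习门控(假设形式)

    def forward(self, src_kv, tgt_kv):
        fused = self.fuse(torch.cat([self.proj(src_kv), tgt_kv], dim=-1))
        g = torch.sigmoid(self.gate)                   # g 接近 0 时退化为目标模型自身的缓存
        return g * fused + (1 - g) * tgt_kv

src_kv = torch.randn(2, 16, 64)    # [batch, seq, src_dim]
tgt_kv = torch.randn(2, 16, 128)   # [batch, seq, tgt_dim]
print(C2CFuserSketch(64, 128)(src_kv, tgt_kv).shape)  # torch.Size([2, 16, 128])
```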
Fathom-DeepResearch:解锁 SLM 的长视野信息检索和合成
- 标题: Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
- 作者: Shreyas Singh, Kunal Singh, Pradeep Moturi
- 日期: 2025-09-28
- ArXiv主页 : https://arxiv.org/abs/2509.24107
- gitHub仓库 : https://github.com/FractalAIResearchLabs/Fathom-DeepResearch
英文摘要
Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a 5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category while demonstrating strong generalization to diverse reasoning tasks including HLE, AIME-25, GPQA-Diamond, and MedQA.
中文摘要
工具集成推理已成为实现代理应用的关键方向。其中,DeepResearch 代理因其在复杂、开放式信息检索任务上的出色表现而受到广泛关注。我们推出 Fathom-DeepResearch,一个由两个专门模型组成的代理系统。第一个是 Fathom-Search-4B,这是一个基于 Qwen3-4B 训练的 DeepSearch 模型,针对通过实时网络搜索和定向网页查询进行的循证调查进行了优化。它的训练结合了三项进展:(i)DUETQA,一个通过多智能体自博弈生成的 5K 样本数据集,强制要求严格依赖网络搜索并以异构来源为依据;(ii)RAPO,GRPO 的零开销扩展,通过课程剪枝、奖励感知的优势缩放和按提示的重放缓冲区,稳定带可验证奖励的多轮强化学习;(iii)可引导的步骤级奖励,按认知行为和边际效用对每次工具调用进行分类,从而能够显式控制搜索轨迹的广度、深度和范围。这些改进使工具调用在必要时能够可靠地扩展到 20 次以上。第二个是 Fathom-Synthesizer-4B,同样基于 Qwen3-4B 训练,它将多轮 DeepSearch 轨迹转换为结构化、引用密集的 DeepResearch 报告,以进行全面综合。该系统在 DeepSearch 基准测试(SimpleQA、FRAMES、WebWalker、Seal0、MuSiQue)和 DeepResearch-Bench 上进行评估,在开放权重类别中达到了最先进的性能,同时在 HLE、AIME-25、GPQA-Diamond 和 MedQA 等多样化推理任务上表现出强大的泛化能力。
低概率词元在可验证奖励的强化学习中维持探索
- 标题: Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
- 作者: Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou
- 日期: 2025-10-03
- ArXiv主页 : https://arxiv.org/abs/2510.03222
- gitHub仓库 : https://github.com/CarlanLark/Lp-Reg-dev
英文摘要
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term *reasoning sparks*. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
中文摘要
带可验证奖励的强化学习(RLVR)推动了大型语言模型在复杂推理上的进步,但其可扩展性常常受限于一个训练瓶颈:随着策略熵坍缩,性能陷入平台期,这标志着探索能力的丧失。以往的方法通常通过维持较高的策略熵来应对,但支配有意义探索的确切机制仍未得到充分研究。我们的分析表明,不加选择地关注熵有放大无关词元、破坏训练稳定性的风险。本文研究了 RLVR 中的探索动态,并发现了一个关键问题:有价值的低概率探索性词元(我们称之为"推理火花")被逐渐消除。我们发现,这些火花虽然在预训练模型中很丰富,但在 RLVR 过程中由于过度惩罚而被系统性地熄灭,导致探索退化。为了解决这个问题,我们引入了低概率正则化(Lp-Reg)。其核心机制是将策略向一个启发式代理分布正则化。该代理分布通过滤除疑似噪声词元并对剩余候选重新归一化来构建。由此得到的代理分布噪声更小、推理火花的概率被放大,进而作为软正则化目标,通过 KL 散度保护这些有价值的词元不被消除。实验表明,Lp-Reg 能够支撑约 1,000 步的稳定同策略训练,而基线的熵控制方法在这一阶段会崩溃。这种持续的探索带来了最先进的性能,在五个数学基准上取得了 60.17% 的平均准确率,比之前的方法提高了 2.66%。代码可在 https://github.com/CarlanLark/Lp-Reg 获取。
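下面是一个仅作示意的 PyTorch 草图,展示 Lp-Reg 代理分布的构造方式:滤掉低于阈值的疑似噪声词元、对剩余候选重新归一化,并将其作为软正则目标,用 KL 散度约束策略。阈值、KL 方向与系数 beta 均为此处的假设。

```python
import torch
import torch.nn.functional as F

def lp_reg_proxy(probs: torch.Tensor, noise_threshold: float = 1e-3) -> torch.Tensor:
    """仅作示意:滤掉疑似噪声词元后重新归一化,相对放大有价值的低概率"推理火花"。"""
    proxy = probs * (probs >= noise_threshold)
    return proxy / proxy.sum(dim=-1, keepdim=True)

def lp_reg_loss(logits: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    probs = F.softmax(logits, dim=-1)
    proxy = lp_reg_proxy(probs.detach())                           # 代理分布不回传梯度
    kl = (proxy * (proxy.clamp_min(1e-12).log() - probs.log())).sum(-1).mean()
    return beta * kl                                               # 作为软正则项叠加在 RLVR 目标上

logits = torch.randn(4, 1000)                                      # 假设词表大小为 1000
print(lp_reg_loss(logits))
```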
DreamOmni2:基于多模态指令的编辑和生成
- 标题: DreamOmni2: Multimodal Instruction-based Editing and Generation
- 作者: Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
- 日期: 2025-10-08
- ArXiv主页 : https://arxiv.org/abs/2510.06679
- 论文链接 : https://arxiv.org/pdf/2510.06679
- 项目链接 : https://pbihao.github.io/projects/DreamOmni2/index.html
- gitHub仓库 : https://github.com/dvlab-research/DreamOmni2
英文摘要
Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we have proposed comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and codes will be released.
中文摘要
基于指令的图像编辑和主体驱动生成的最新进展引起了广泛关注,但这两项任务在满足实际用户需求方面仍然面临限制。基于指令的编辑仅依赖语言指令,而语言指令通常无法捕获具体的编辑细节,因此需要参考图像。与此同时,主体驱动的生成仅限于组合具体的物体或人物,而忽视了更广泛的抽象概念。为了应对这些挑战,我们提出了两项新任务:基于多模态指令的编辑和生成。这些任务同时支持文本和图像指令,并将范围扩展到具体和抽象概念,极大地增强了它们的实际应用价值。我们推出 DreamOmni2,解决两个主要挑战:数据构建和模型框架设计。我们的数据合成管道包含三个步骤:(1)使用特征混合方法为抽象和具体概念创建提取数据;(2)使用编辑和提取模型生成基于多模态指令的编辑训练数据;(3)进一步应用提取模型为基于多模态指令的编辑创建训练数据。在框架方面,为了处理多图像输入,我们提出了索引编码和位置编码偏移方案,帮助模型区分图像并避免像素混淆。此外,我们引入了 VLM 与生成/编辑模型的联合训练,以更好地处理复杂指令。我们还为这两项新任务提出了全面的基准来推动其发展。实验表明 DreamOmni2 取得了令人瞩目的成果。模型和代码将会发布。
Ming-UniVision:使用统一连续分词器进行联合图像理解和生成
- 标题: Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
- 作者: Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou
- 日期: 2025-10-08
- ArXiv主页 : https://arxiv.org/abs/2510.06590
- 论文链接 : https://arxiv.org/pdf/2510.06590
- 项目链接 : https://inclusionai.github.io/blog/mingtok/
- gitHub仓库 : https://github.com/inclusionAI/Ming-UniVision
英文摘要
Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregrsssive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit community.
中文摘要
视觉标记化仍然是在自回归范式中统一视觉理解和生成的核心挑战。现有方法通常在离散潜在空间中使用标记器来与大型语言模型中的标记对齐,其中量化误差可能会限制语义表达并降低视觉语言理解的能力。为了解决这个问题,我们引入了 MingTok,这是一个新的视觉分词器系列,具有连续的潜在空间,用于统一的自回归生成和理解。虽然理解任务有利于区分高维特征,但生成任务更喜欢紧凑的低级代码。因此,为了协调这些相互竞争的需求,MingTok 采用了三阶段顺序架构,涉及低级编码、语义扩展和视觉重建。在此基础上,Ming-UniVision 消除了对特定任务视觉表示的需求,并将不同的视觉语言任务统一在单一自回归预测范式下。通过将理解和生成表述为共享连续空间中的下一个令牌预测,它无缝支持多轮上下文任务,例如迭代理解、生成和编辑。根据经验,我们发现使用统一的连续视觉表示可以通过理解和生成任务来协调分词器的竞争需求,从而在两个领域实现最先进的性能。我们希望我们的发现能够促进连续域中统一的视觉标记化。发布推理代码和模型权重以造福社区。
MemMamba:重新思考状态空间模型中的内存模式
- 标题: MemMamba: Rethinking Memory Patterns in State Space Model
- 作者: Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun
- 日期: 2025-09-28
- ArXiv主页 : https://arxiv.org/abs/2510.03279
- gitHub仓库 : https://github.com/XuezheMax/megalodon
英文摘要
With the explosive growth of data, long-sequence modeling has become increasingly important in tasks such as natural language processing and bioinformatics. However, existing methods face inherent trade-offs between efficiency and memory. Recurrent neural networks suffer from gradient vanishing and explosion, making them hard to scale. Transformers can model global dependencies but are constrained by quadratic complexity. Recently, selective state-space models such as Mamba have demonstrated high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. In this work, we conduct mathematical derivations and information-theoretic analysis to systematically uncover the memory decay mechanism of Mamba, answering a fundamental question: what is the nature of Mamba's long-range memory and how does it retain information? To quantify key information loss, we further introduce horizontal-vertical memory fidelity metrics that capture degradation both within and across layers. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba, a novel architectural framework that integrates state summarization mechanism together with cross-layer and cross-token attention, which alleviates long-range forgetting while preserving linear complexity. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks such as PG19 and Passkey Retrieval, while delivering a 48% speedup in inference efficiency. Both theoretical analysis and empirical results demonstrate that MemMamba achieves a breakthrough in the complexity-memory trade-off, offering a new paradigm for ultra-long sequence modeling.
中文摘要
随着数据的爆炸式增长,长序列建模在自然语言处理和生物信息学等任务中变得越来越重要。然而,现有方法面临效率与记忆之间的固有权衡。循环神经网络存在梯度消失和梯度爆炸问题,难以扩展。Transformer 可以建模全局依赖,但受到二次复杂度的限制。最近,Mamba 等选择性状态空间模型展示了 O(n) 时间和 O(1) 循环推理的高效率,但其长程记忆呈指数衰减。在这项工作中,我们通过数学推导和信息论分析,系统地揭示了 Mamba 的记忆衰减机制,回答了一个基本问题:Mamba 的长程记忆本质是什么,它如何保留信息?为了量化关键信息的丢失,我们进一步引入了水平-垂直记忆保真度指标,以捕获层内和层间的退化。受人类在阅读长文档时提炼并保留要点信息方式的启发,我们提出了 MemMamba,这是一种新颖的架构框架,它将状态摘要机制与跨层、跨词元注意力相结合,在保持线性复杂度的同时缓解了长程遗忘。MemMamba 在 PG19 和 Passkey Retrieval 等长序列基准上相比现有 Mamba 变体和 Transformer 取得了显著改进,同时推理效率提升 48%。理论分析和实验结果都表明,MemMamba 在复杂度-记忆权衡上取得了突破,为超长序列建模提供了新的范式。
UniVideo:视频的统一理解、生成和编辑
- 标题: UniVideo: Unified Understanding, Generation, and Editing for Videos
- 作者: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
- 日期: 2025-10-09
- ArXiv主页 : https://arxiv.org/abs/2510.08377
- 论文链接 : https://arxiv.org/pdf/2510.08377
- 项目链接 : https://congwei1230.github.io/UniVideo/
- gitHub仓库 : https://github.com/KwaiVGI/UniVideo
英文摘要
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
中文摘要
统一的多模态模型在多模态内容生成和编辑方面已展现出可喜的成果,但仍主要局限于图像领域。在这项工作中,我们提出了 UniVideo,这是一个将统一建模扩展到视频领域的多功能框架。UniVideo 采用双流设计,将用于指令理解的多模态大语言模型(MLLM)与用于视频生成的多模态 DiT(MMDiT)相结合。这种设计能够准确解释复杂的多模态指令,同时保持视觉一致性。在此架构之上,UniVideo 将不同的视频生成和编辑任务统一在单一的多模态指令范式下,并在这些任务上进行联合训练。大量实验表明,UniVideo 在文本/图像到视频生成、上下文视频生成和上下文视频编辑方面达到或超过了最先进的特定任务基线。值得注意的是,UniVideo 的统一设计带来了两种形式的泛化。首先,UniVideo 通过在单条指令中整合多种能力来支持任务组合,例如将编辑与风格迁移相结合。其次,即使没有针对自由形式视频编辑进行显式训练,UniVideo 也能将其编辑能力从大规模图像编辑数据迁移到这一场景,处理诸如对人物进行绿幕抠像或更改视频中材质等未见过的指令。除了这些核心能力之外,UniVideo 还支持基于视觉提示的视频生成,其中 MLLM 解释视觉提示并在合成过程中指导 MMDiT。为了促进未来的研究,我们将发布我们的模型和代码。
VideoCanvas:通过上下文条件从任意时空补丁统一视频完成
- 标题: VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
- 作者: Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue
- 日期: 2025-10-09
- ArXiv主页 : https://arxiv.org/abs/2510.08555
- 论文链接 : https://arxiv.org/pdf/2510.08555
- 项目链接 : https://onevfall.github.io/project_page/videocanvas/
- gitHub仓库 : https://github.com/KwaiVGI/VideoCanvas
英文摘要
We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
中文摘要
我们引入了任意时空视频补全任务:根据用户在任意空间位置和时间戳放置的任意补丁生成视频,就像在视频画布上作画。这种灵活的表述自然地将许多现有的可控视频生成任务(包括首帧图像到视频、修复、扩展和插值)统一在一个单一、连贯的范式之下。然而,实现这一愿景在现代潜在视频扩散模型中面临一个根本障碍:因果 VAE 带来的时间模糊性,即多个像素帧被压缩为单个潜在表示,使得精确的帧级条件控制在结构上变得困难。我们通过 VideoCanvas 解决这一挑战,这是一个新颖的框架,它将上下文条件化(ICC)范式适配到这一细粒度控制任务上,且不引入任何新参数。我们提出了一种解耦空间和时间控制的混合条件化策略:空间放置通过零填充处理,而时间对齐通过时间 RoPE 插值实现,它为每个条件在潜在序列中分配一个连续的分数位置。这解决了 VAE 的时间模糊性,并在冻结的骨干网络上实现了像素帧级的感知控制。为了评估这一新能力,我们构建了 VideoCanvasBench,这是第一个面向任意时空视频补全的基准,涵盖场景内保真度和场景间创造力。实验表明,VideoCanvas 显著优于现有的条件化范式,在灵活且统一的视频生成方面建立了新的技术水平。
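下面用一个仅作示意的草图说明"时间 RoPE 插值"的核心操作:把像素帧时间戳按(假设的)因果 VAE 时间压缩率映射为潜在序列中的连续分数位置,再在这些分数位置上计算 RoPE 相位;压缩率、维度与基数均为假设,未必与论文实现一致。

```python
import torch

def temporal_rope_angles(fractional_pos: torch.Tensor, dim: int = 8, base: float = 10000.0):
    """仅作示意:对连续(分数)时间位置计算 RoPE 相位角,之后照常取 cos/sin 旋转 Q、K。"""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(fractional_pos, inv_freq)        # [n_pos, dim/2]

# 假设因果 VAE 的时间压缩率为 4:像素帧 0..7 映射到潜在位置 0, 0.25, 0.5, ...
compression = 4
pixel_frames = torch.arange(8).float()
fractional_pos = pixel_frames / compression
angles = temporal_rope_angles(fractional_pos)
print(fractional_pos.tolist())
print(angles.shape)   # torch.Size([8, 4])
```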
TaTToo:基于工具的思维 PRM,用于表格推理中的测试时间缩放
- 标题: TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
- 作者: Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
- 日期: 2025-10-07
- ArXiv主页 : https://arxiv.org/abs/2510.06217
- 论文链接 : https://arxiv.org/pdf/2510.06217
英文摘要
Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
中文摘要
过程奖励模型(PRM)最近已成为增强大型推理模型(LRM)推理能力的强大框架,特别是在测试时扩展(TTS)的背景下。然而,它们在表格推理领域监督 LRM 的潜力仍未得到充分探索。通过详细的实证分析,我们发现现有的 PRM 尽管被广泛用于监督纯文本推理步骤,但在子表检索和模式交互等针对表格的操作上表现不佳,导致关键的性能瓶颈。为了解决这一局限,我们提出了 TaTToo,这是一种新颖的、以表格为基础的 PRM 框架,它(i)对表格推理步骤进行显式推理,并且(ii)集成基于工具的验证以提供精确的奖励监督。具体来说,我们首先设计了一个可扩展的数据构建管道,通过将表格验证依据与基于工具的执行相结合,构建了超过 6 万条高质量的步骤级注释。基于收集到的数据,我们采用双阶段范式训练 TaTToo:先进行冷启动监督微调以捕获工具使用推理模式,再通过基于工具的奖励塑造进行强化学习,使我们的模型与基于表格的验证保持一致。我们对新设计的 PRM 带来的策略改进进行了全面评估。在涵盖数值推理、事实核查和数据分析的 5 个高难度表格推理基准中,TaTToo 在推理时将下游策略 LRM 提升了 30.9%,仅用 8B 参数就超越了 Qwen-2.5-Math-PRM-72B 等强大的 PRM 基线,并在不同的 TTS 策略中表现出强大的泛化能力。
大型推理模型从有缺陷的思维中学习更好的对齐
- 标题: Large Reasoning Models Learn Better Alignment from Flawed Thinking
- 作者: ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi
- 日期: 2025-10-01
- ArXiv主页 : https://arxiv.org/abs/2510.00938
- 论文链接 : https://arxiv.org/pdf/2510.00938
- 项目链接 : https://shengyun-peng.github.io/papers/lrm-safety
英文摘要
Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.
中文摘要
大型推理模型(LRM)在给出最终答案之前会通过生成结构化思维链(CoT)来"思考",但它们仍然缺乏围绕安全对齐进行批判性推理的能力,并且当有缺陷的前提被注入其思考过程时很容易产生偏差。我们提出了 RECAP(通过反向对齐预填充实现的鲁棒安全对齐),这是一种用于后训练的有原则的强化学习(RL)方法,可显式地教会模型推翻有缺陷的推理轨迹,并重新导向安全且有用的回应。RECAP 在合成生成的反向对齐 CoT 预填充和标准提示的混合数据上进行训练,除基于人类反馈的常规强化学习(RLHF)外不需要额外的训练成本或修改,并且显著提高了安全性和抗越狱鲁棒性、减少了过度拒绝、保留了核心推理能力,同时保持推理词元预算不变。大量分析表明,经过 RECAP 训练的模型更频繁地进行自我反思,并且在自适应攻击下保持鲁棒,即使在反复尝试推翻其推理后也能保持安全性。
元意识增强推理模型:自对齐强化学习
- 标题: Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning
- 作者: Yoonjeon Kim, Doohyuk Jang, Eunho Yang
- 日期: 2025-09-26
- ArXiv主页 : https://arxiv.org/abs/2510.03259
- 论文链接 : https://arxiv.org/pdf/2510.03259
英文摘要
Recent studies on reasoning models explore the meta-awareness of language models, the ability to know how to think by itself. We argue that large reasoning models lack this meta-awareness property by proving severe misalignment between true rollouts and predicted meta information. We posit that aligning meta-prediction with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and prove that enhanced meta-awareness directly translates to improved accuracy. Unlike existing meta-cognitive reasoning models, our method does not require external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are inspiring: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. More specifically, our method can speed up GRPO training by over 1.28x to reach the same performance, and achieve a 19.3% gain in accuracy on AIME25, and a 6.2 % average gain over six mathematics benchmarks. Training with meta-cognitive guidance enhances out-of-domain generalization, giving a 3.87 % boost on GPQA-Diamond and a 2.08 % overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.
中文摘要
最近关于推理模型的研究探索了语言模型的元意识,即知道自己该如何思考的能力。我们通过证明真实推演(rollout)与预测的元信息之间存在严重错位,论证了大型推理模型缺乏这种元意识属性。我们认为,让元预测与真实推演对齐将带来显著的性能提升。为了验证这一假设,我们设计了一个通过自对齐增强元意识(MASA)的训练管道,并证明增强的元意识可以直接转化为更高的准确率。与现有的元认知推理模型不同,我们的方法不需要外部训练源,而是利用自生成的信号来训练元意识。此外,我们的方法通过以下方式实现高效训练:i)过滤掉要么太简单、要么无法解决的零方差提示;ii)在不太可能得出正确答案时提前截断冗长的推演。结果令人鼓舞:我们的策略在域内任务上同时显著提升了准确率和训练效率,并对域外基准表现出强大的泛化能力。更具体地说,我们的方法可以将 GRPO 训练加速 1.28 倍以上以达到相同的性能,在 AIME25 上取得 19.3% 的准确率提升,在六个数学基准上平均提升 6.2%。使用元认知指导进行训练可增强域外泛化能力,在 GPQA-Diamond 上带来 3.87% 的提升,并在涵盖逻辑、科学和编码领域的 13 个基准上带来 2.08% 的总体准确率提升。
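下面给出一个仅作示意的草图,对应摘要中提到的两个效率手段:过滤零方差提示(全对或全错的提示没有学习信号),以及在元预测的推理长度远超预算时提前截断推演。判定条件与预算数值均为假设。

```python
def filter_zero_variance(prompt_groups):
    """prompt_groups: {提示: [该提示一组推演的 0/1 奖励]};保留方差非零的提示。"""
    kept = {}
    for prompt, rewards in prompt_groups.items():
        if 0 < sum(rewards) < len(rewards):   # 既非全对也非全错 → 有学习信号,保留
            kept[prompt] = rewards
    return kept

def should_cutoff(tokens_so_far: int, predicted_length: int, budget: int = 4096) -> bool:
    """若元预测的推理长度或已生成长度远超预算,提前截断该条推演。"""
    return predicted_length > budget or tokens_so_far > budget

print(filter_zero_variance({"p1": [1, 1, 1], "p2": [1, 0, 1], "p3": [0, 0, 0]}))
print(should_cutoff(tokens_so_far=512, predicted_length=9000))
```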
Fast-dLLM v2:高效的块扩散大语言模型
- 标题: Fast-dLLM v2: Efficient Block-Diffusion LLM
- 作者: Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
- 日期: 2025-09-30
- ArXiv主页 : https://arxiv.org/abs/2509.26328
- 论文链接 : https://arxiv.org/pdf/2509.26328
- 项目链接 : https://nvlabs.github.io/Fast-dLLM/v2/
- gitHub仓库 : https://github.com/NVlabs/Fast-dLLM
英文摘要
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs - marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.
中文摘要
自回归(AR)大语言模型(LLM)在各种自然语言任务中取得了卓越的性能,但其固有的顺序解码限制了推理效率。在这项工作中,我们提出了 Fast-dLLM v2,这是一种精心设计的块扩散语言模型(dLLM),它能高效地将预训练的 AR 模型改造为用于并行文本生成的 dLLM,仅需约 10 亿词元的微调。与 Dream(5800 亿词元)等全注意力扩散 LLM 相比,这意味着训练数据减少了 500 倍,同时保留了原始模型的性能。我们的方法引入了一种新颖的训练配方,将块扩散机制与互补的注意力掩码相结合,在不牺牲 AR 训练目标的情况下实现块级双向上下文建模。为了进一步加速解码,我们设计了一种分层缓存机制:块级缓存存储跨块的历史上下文表示,子块缓存则支持在部分解码的块内高效并行生成。结合我们的并行解码管道,Fast-dLLM v2 在不影响生成质量的情况下,相比标准 AR 解码实现最高 2.5 倍的加速。在多个基准上的大量实验表明,Fast-dLLM v2 在准确率上达到或超过 AR 基线,同时在 dLLM 中提供最先进的效率,标志着快速、准确的 LLM 向实际部署迈出了重要一步。代码和模型将公开发布。
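下面是一个仅作示意的注意力掩码草图,展示常见的块扩散设计:块内双向可见、块间保持因果(只允许关注更早的块);块大小为假设,具体掩码构造未必与 Fast-dLLM v2 的实现一致。

```python
import torch

def block_diffusion_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """仅作示意:块内双向、块间因果的注意力掩码(True 表示允许注意)。"""
    idx = torch.arange(seq_len)
    block_id = idx // block_size
    same_block = block_id[:, None] == block_id[None, :]        # 块内双向可见
    earlier_block = block_id[:, None] > block_id[None, :]      # 只允许关注更早的块
    return same_block | earlier_block

print(block_diffusion_mask(8, block_size=4).int())
```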
Lumina-DiMOO:用于多模态生成和理解的全方位扩散大语言模型
- 标题: Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
- 作者: Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu
- 日期: 2025-10-07
- ArXiv主页 : https://arxiv.org/abs/2510.06308
- 论文链接 : https://arxiv.org/pdf/2510.06308
- 项目链接 : https://synbol.github.io/Lumina-DiMOO/
- gitHub仓库 : https://github.com/Alpha-VLLM/Lumina-DiMOO
英文摘要
We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
中文摘要
我们介绍 Lumina-DiMOO,这是一个用于无缝多模态生成和理解的开源基础模型。Lumina-DiMOO 与之前的统一模型不同,它利用完全离散的扩散建模来处理各种模态的输入和输出。与之前的自回归(AR)或混合 AR-扩散范式相比,这种创新方法使 Lumina-DiMOO 实现了更高的采样效率,并能出色地支持广泛的多模态任务,包括文本到图像生成、图像到图像生成(例如图像编辑、主体驱动生成和图像修复等)以及图像理解。Lumina-DiMOO 在多个基准测试中实现了最先进的性能,超越了现有的开源统一多模态模型。为了促进多模态和离散扩散模型研究的进一步发展,我们向社区发布了代码和检查点。项目页面:https://synbol.github.io/Lumina-DiMOO。
当想法遇到事实:长上下文 LM 的可重用推理
- 标题: When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
- 作者: Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang
- 日期: 2025-10-08
- ArXiv主页 : https://arxiv.org/abs/2510.07499
- 论文链接 : https://arxiv.org/pdf/2510.07499
英文摘要
Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, directly all necessary information. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches, derived from prior problem solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating its broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).
中文摘要
最近的长上下文语言模型(LCLM)可以在单个提示中处理数十万个标记,通过集成大量检索到的文档或在某些情况下直接集成所有必要信息,为知识密集型多跳推理提供新的机会。然而,仅仅将更多文档输入上下文窗口无法捕获证据应如何连接。我们通过思维模板来解决这一差距,思维模板将推理重新构建为可重用的思维缓存,源自先前的问题解决轨迹,构建证据的组合方式并指导事实文档的多跳推理。为了保持这些模板的有效性,我们提出了一种更新策略,通过自然语言反馈迭代地细化从训练数据派生的模板。在不同的基准和 LCLM 系列中,我们的方法在基于检索和无检索的设置中都比强大的基线提供了一致的收益。此外,我们表明优化的模板可以被提炼成更小的开源模型,展示其广泛的适用性和透明的推理重用。我们将我们的框架称为思想模板增强 LCLM (ToTAL)。
Video-LMM 后训练:深入研究大型多模态模型的视频推理
- 标题: Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
- 作者: Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
- 日期: 2025-10-06
- ArXiv主页 : https://arxiv.org/abs/2510.05034
- gitHub仓库 : https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
英文摘要
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
中文摘要
视频理解是计算机视觉中最具挑战性的前沿方向,需要模型对复杂的时空关系、长程依赖和多模态证据进行推理。最近出现的视频大型多模态模型(Video-LMM)将视觉编码器与强大的基于解码器的语言模型相结合,在视频理解任务中展现出卓越的能力。然而,将这些模型从基础感知系统转变为复杂推理引擎的关键阶段,即后训练,在文献中仍然呈碎片化状态。本综述首次全面考察了 Video-LMM 的后训练方法,涵盖三大基本支柱:带思维链的监督微调(SFT)、基于可验证目标的强化学习(RL),以及通过增强推理计算实现的测试时扩展(TTS)。我们提出了一个结构化的分类体系,阐明这些技术的作用、相互联系以及针对视频的适配,并讨论时间定位、时空接地、长视频效率和多模态证据整合等独特挑战。通过对代表性方法的系统分析,我们总结了关键的设计原则、见解和评估协议,同时指出奖励设计、可扩展性和性价比优化方面的关键开放挑战。我们还整理了重要的基准、数据集和指标,以便对后训练的有效性进行严格评估。本综述旨在为研究人员和从业者提供一个推进 Video-LMM 能力的统一框架。更多资源和更新维护于:https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
从什么到为什么:基于证据的化学反应条件推理的多智能体系统
- 标题: From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
- 作者: Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin
- 日期: 2025-09-28
- ArXiv主页 : https://arxiv.org/abs/2509.23768
- 论文链接 : https://arxiv.org/pdf/2509.23768
英文摘要
The chemical reaction recommendation is to select proper reaction condition parameters for chemical reactions, which is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy, while offering falsifiable, human-trustable rationales, which establishes a new paradigm for explainable AI in scientific discovery.
中文摘要
化学反应条件推荐旨在为化学反应选择合适的反应条件参数,这对加速化学科学研究至关重要。随着大型语言模型(LLM)的快速发展,人们越来越有兴趣利用其推理和规划能力来推荐反应条件。尽管已有方法取得了成功,但它们很少解释所推荐反应条件背后的依据,限制了其在高风险科学工作流程中的实用性。在这项工作中,我们提出了 ChemMAS,这是一个将条件预测重构为基于证据的推理任务的多智能体系统。ChemMAS 将任务分解为机理溯源、多通道召回、约束感知的智能体辩论以及理由聚合。每个决策都有基于化学知识和检索到的先例的可解释理由作为支撑。实验表明,ChemMAS 比领域专用基线提升了 20-35%,在 Top-1 准确率上比通用 LLM 高出 10-15%,同时给出可证伪、可被人类信任的理由,为科学发现中的可解释 AI 建立了新范式。
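下面用一个极简的 Python 草图示意摘要中"多通道召回、约束感知辩论、理由聚合"的流程骨架。它只是根据摘要推测的假设性示例,并非 ChemMAS 的真实实现;recall_agent、debate_agent、aggregate 及其中的化学条件均为虚构的占位内容。

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Proposal:
    condition: str   # 反应条件候选,例如催化剂/溶剂/温度组合
    rationale: str   # 支撑该候选的证据性理由
    score: float     # 代理自评的置信度

def recall_agent(reaction: str) -> List[Proposal]:
    """假设性的多通道召回代理:基于先例库返回候选条件(此处为桩数据)。"""
    return [
        Proposal("Pd(PPh3)4 / K2CO3 / THF, 65C", "类似 Suzuki 偶联先例", 0.7),
        Proposal("Pd(OAc)2 / SPhos / dioxane, 100C", "底物位阻较大时的常见条件", 0.6),
    ]

def debate_agent(proposals: List[Proposal], constraints: List[str]) -> List[Proposal]:
    """约束感知辩论的极简近似:淘汰与硬约束冲突的候选。"""
    return [p for p in proposals
            if not any(c.lower() in p.condition.lower() for c in constraints)]

def aggregate(proposals: List[Proposal]) -> Proposal:
    """理由聚合:此处仅按置信度取 Top-1,真实系统会合并多方证据。"""
    return max(proposals, key=lambda p: p.score)

if __name__ == "__main__":
    cands = recall_agent("ArBr + ArB(OH)2 -> biaryl")
    survivors = debate_agent(cands, constraints=["dioxane"])  # 假设禁用二氧六环
    best = aggregate(survivors)
    print(best.condition, "|", best.rationale)
```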
免训练组相关策略优化
- 标题: Training-Free Group Relative Policy Optimization
- 作者: Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
- 日期: 2025-10-09
- ArXiv主页 : https://arxiv.org/abs/2510.08191
- 论文链接 : https://arxiv.org/pdf/2510.08191
英文摘要
Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.
中文摘要
大型语言模型(LLM)智能体的最新进展展现了其有前景的通用能力。然而,由于难以有效整合外部工具和特定的提示策略,它们在专门的真实世界领域中的表现常常下降。虽然已有智能体强化学习等方法被提出来解决这一问题,但它们通常依赖代价高昂的参数更新,例如先进行监督微调(SFT),再通过组相对策略优化(GRPO)的强化学习(RL)阶段来改变输出分布。然而,我们认为 LLM 可以通过将经验知识学习为 token 先验,对输出分布产生类似的效果;这种方式轻量得多,既能应对实际数据稀缺,也能避免常见的过拟合问题。为此,我们提出了免训练组相对策略优化(Training-Free GRPO),这是一种无需任何参数更新即可提升 LLM 智能体性能的高性价比方案。我们的方法在每组 rollout 内部利用组相对的语义优势而非数值优势,在极少量真实标注数据上通过多轮(epoch)学习迭代提炼高质量的经验知识。这些知识作为学习到的 token 先验,在调用 LLM API 时无缝注入以引导模型行为。在数学推理和网络搜索任务上的实验表明,将 Training-Free GRPO 应用于 DeepSeek-V3.1-Terminus 可显著提升域外性能。仅用几十个训练样本和极低的训练成本,Training-Free GRPO 的表现就优于经过微调的小型 LLM。
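以下是对"经验知识作为 token 先验"这一思路的示意性 Python 草图:对每个问题采样一组回答,让模型做组内相对比较并总结经验,再把经验文本拼入后续提示词,全程不做参数更新。这是基于摘要的假设性简化,并非论文的官方算法或接口;call_llm 为虚构的桩函数。

```python
def call_llm(prompt: str) -> str:
    """假设性的 LLM API 桩:真实场景中应替换为实际模型接口调用。"""
    return "stub-response"

def training_free_grpo(questions, n_rollouts=4, n_epochs=3):
    """无参数更新的示意流程:经验知识作为 token 先验注入提示词。
    组内比较使用"语义优势",此处同样以桩调用代替。"""
    experience: list[str] = []   # 学到的经验库(即 token 先验)
    for _ in range(n_epochs):
        for q in questions:
            prior = "\n".join(f"- {e}" for e in experience)
            prompt = f"已知经验:\n{prior}\n\n问题:{q}\n请作答:"
            group = [call_llm(prompt) for _ in range(n_rollouts)]
            # 组内相对比较:让模型自省哪条回答更好、差在哪里,并总结成可复用经验
            critique = call_llm(
                "比较以下针对同一问题的多条回答,总结出一条可复用的经验:\n"
                + "\n---\n".join(group)
            )
            experience.append(critique)
    return experience

# 推理时:把 experience 拼进提示词即可,无需任何梯度更新。
```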
MITS:通过逐点互信息增强法学硕士的树搜索推理
- 标题: MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information
- 作者: Jiaxi Li, Yucheng Shi, Jin Lu, Ninghao Liu
- 日期: 2025-10-04
- ArXiv主页 : https://arxiv.org/abs/2510.03632
- 论文链接 : https://arxiv.org/pdf/2510.03632
英文摘要
Tree search has become a representative framework for test-time reasoning with large language models (LLMs), exemplified by methods such as Tree-of-Thought and Monte Carlo Tree Search that explore multiple reasoning paths. However, it remains difficult to provide instant and reliable quantitative assessments of intermediate reasoning step quality, and extensive path exploration is computationally costly. To address this, we propose Mutual Information Tree Search (MITS), a novel framework that guides reasoning with information-theoretic principles. MITS introduces an effective scoring function based on pointwise mutual information (PMI), which enables step-wise evaluation of reasoning paths and search tree expansion via beam search without expensive look-ahead simulations, achieving superior reasoning performances while maintaining computational efficiency. The framework is complemented by an entropy-based dynamic sampling strategy that adaptively allocates computational resources to uncertain reasoning steps where exploration is most beneficial. For final prediction, MITS employs a weighted voting scheme that combines PMI scores with prediction consensus. Through comprehensive experiments on diverse reasoning benchmarks, MITS consistently surpasses baseline methods, establishing a principled and efficient framework for LLM reasoning.
中文摘要
树搜索已成为大型语言模型(LLM)测试时推理的代表性框架,思维树(Tree-of-Thought)和蒙特卡洛树搜索等探索多条推理路径的方法即是典型例子。然而,对中间推理步骤的质量提供即时且可靠的定量评估仍然困难,且大规模路径探索的计算代价很高。为了解决这一问题,我们提出了互信息树搜索(MITS),一个以信息论原理指导推理的新框架。MITS 引入了基于逐点互信息(PMI)的有效评分函数,能够逐步评估推理路径,并通过束搜索扩展搜索树,无需昂贵的前瞻模拟,在保持计算效率的同时取得更优的推理性能。该框架还配合基于熵的动态采样策略,将计算资源自适应地分配给探索收益最大的不确定推理步骤。在最终预测时,MITS 采用将 PMI 分数与预测共识相结合的加权投票方案。在多个推理基准上的全面实验表明,MITS 始终优于基线方法,为 LLM 推理建立了一个有原则且高效的框架。
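下面给出 PMI 打分与束搜索结合的示意性 Python 草图:每个候选步骤的分值为条件对数概率减去无条件对数概率,束搜索按累计 PMI 保留前 k 条路径。这只是对摘要思路的假设性还原,并非 MITS 的官方实现;expand 回调及示例数值均为虚构。

```python
from typing import Callable, List, Tuple

def pmi_score(logp_step_given_ctx: float, logp_step_uncond: float) -> float:
    """逐点互信息:PMI = log p(step | question, 前序步骤) - log p(step)。
    分值越高,说明该步骤与问题的关联越强,而非模型的通用高频话术。"""
    return logp_step_given_ctx - logp_step_uncond

def beam_search_with_pmi(
    expand: Callable[[List[str]], List[Tuple[str, float, float]]],
    beam_width: int = 3,
    depth: int = 4,
) -> List[str]:
    """假设性的束搜索骨架:expand(路径) 返回 [(候选步骤, 条件 logp, 无条件 logp)]。"""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(depth):
        candidates = []
        for path, score in beams:
            for step, lp_cond, lp_uncond in expand(path):
                candidates.append((path + [step], score + pmi_score(lp_cond, lp_uncond)))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams[0][0]

if __name__ == "__main__":
    # 虚构的 expand:每步固定给出两个候选及其(条件, 无条件)对数概率
    demo = lambda path: [("step-A", -1.0, -3.0), ("step-B", -1.2, -1.3)]
    print(beam_search_with_pmi(demo))
```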
CoDA:通过扩散适应对 LM 进行编码
- 标题: CoDA: Coding LM via Diffusion Adaptation
- 作者: Haolin Chen, Shiyu Wang, Can Qin, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
- 日期: 2025-09-27
- ArXiv主页 : https://arxiv.org/abs/2510.03270
- 论文链接 : https://arxiv.org/pdf/2510.03270
- 项目链接 : https://huggingface.co/Salesforce/CoDA-v0-Instruct
英文摘要
Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On Humaneval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.
中文摘要
扩散语言模型有望提供自回归代码模型所缺乏的双向上下文与填充(infilling)能力,但实用系统仍然较为笨重。我们推出 CoDA,一个在 TPU 上训练、参数量 1.7B 的扩散编码模型,并配备完全开源的训练流水线。CoDA 将大规模扩散预训练与以代码为中心的中期训练和指令微调相结合,支持置信度引导采样,使推理延迟保持竞争力。在 HumanEval、MBPP 和 EvalPlus 上,CoDA-1.7B-Instruct 媲美或超越参数量高达 7B 的扩散模型。我们的发布内容包括模型检查点、评测工具和 TPU 训练流水线,以加速对轻量级扩散编码助手的研究。
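下面是"置信度引导采样"这一解码思路的示意性 Python 草图:从全掩码序列出发,每一轮只确定置信度最高的一批位置,其余位置留待后续轮次。这是对掩码扩散解码的通用化示意,并非 CoDA 的真实采样器;predict 为随机桩函数。

```python
import random

MASK = "<mask>"

def predict(tokens):
    """假设性的去噪模型桩:为每个被掩码的位置返回 (候选 token, 置信度)。"""
    return {i: (f"tok{i}", random.random()) for i, t in enumerate(tokens) if t == MASK}

def confidence_guided_decode(length=16, steps=4):
    """置信度引导的迭代解掩码:每步只确定置信度最高的一批位置。"""
    tokens = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in tokens:
        preds = predict(tokens)
        # 按置信度从高到低,选出本轮要确定的位置
        chosen = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)[:per_step]
        for pos, (tok, _) in chosen:
            tokens[pos] = tok
    return tokens

if __name__ == "__main__":
    print(confidence_guided_decode())
```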
结盟华尔兹:联合培训代理以确保安全合作
- 标题: The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
- 作者: Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
- 日期: 2025-10-09
- ArXiv主页 : https://arxiv.org/abs/2510.08240
- 论文链接 : https://arxiv.org/pdf/2510.08240
英文摘要
Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
中文摘要
要发挥 LLM 的能力,需要在有益与无害之间小心起舞。这在两个相互竞争的挑战之间造成了根本性张力:既容易被对抗攻击诱导生成不安全内容,又倾向于对良性但敏感的提示过度拒绝。当前的方法通常依靠安全防护模型来应对,只要内容包含不安全部分就整体拒绝。这种做法等于直接掐断音乐:它可能加剧过度拒绝,也无法为被拒绝的查询提供细致的指导。为了让模型学会更协调的舞步,我们提出了 WaltzRL,一个新颖的多智能体强化学习框架,它将安全对齐表述为协作的正和博弈。WaltzRL 联合训练一个对话代理和一个反馈代理,并激励后者提出有用的建议,以提升对话代理回复的安全性与有益性。WaltzRL 的核心是动态改进奖励(DIR),它会根据对话代理吸收反馈的程度随时间演化。在推理时,对话代理不安全或过度拒绝的回复会被改进而不是丢弃。反馈代理与对话代理一同部署,仅在需要时自适应介入,从而在安全查询上保持有益性和低延迟。我们在五个不同数据集上的实验表明,与各类基线相比,WaltzRL 显著减少了不安全回复(例如在 WildJailbreak 上从 39.0% 降至 4.6%)和过度拒绝(在 OR-Bench 上从 45.3% 降至 9.9%)。通过让对话代理与反馈代理共同进化并自适应地运用反馈,WaltzRL 在不降低通用能力的前提下增强了 LLM 的安全性,从而推进了有益性与无害性之间的帕累托前沿。
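下面的 Python 草图示意推理时对话代理与反馈代理的协作回路:反馈代理判定回复是否不安全或过度拒绝,若是则给出建议让对话代理改写,否则直接返回。这是基于摘要的假设性骨架,并非 WaltzRL 的训练代码或官方接口;两个 *_agent 函数均为虚构的桩函数。

```python
from typing import Optional, Tuple

def conversation_agent(prompt: str, feedback: Optional[str] = None) -> str:
    """假设性的对话代理桩:若收到反馈,则在反馈指导下重写回答。"""
    return "draft-answer" if feedback is None else "revised-answer"

def feedback_agent(prompt: str, answer: str) -> Tuple[str, str]:
    """假设性的反馈代理桩:返回 (判定标签, 改进建议),
    标签取值为 'ok' / 'unsafe' / 'overrefusal'。"""
    return "ok", ""

def waltz_inference(prompt: str, max_rounds: int = 2) -> str:
    """推理时协作:不安全或过度拒绝的回复被改进,而不是直接丢弃。"""
    answer = conversation_agent(prompt)
    for _ in range(max_rounds):
        label, suggestion = feedback_agent(prompt, answer)
        if label == "ok":   # 反馈代理仅在需要时介入,安全查询保持低延迟
            return answer
        answer = conversation_agent(prompt, feedback=suggestion)
    return answer

if __name__ == "__main__":
    print(waltz_inference("如何安全地处理实验室废液?"))
```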
通过渐进一致性蒸馏的高效多模态大型语言模型
- 标题: Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
- 作者: Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang
- 日期: 2025-10-01
- ArXiv主页 : https://arxiv.org/abs/2510.00515
- 论文链接 : https://arxiv.org/pdf/2510.00515
- 项目链接 : https://zichenwen1.github.io/EPIC
- gitHub仓库 : https://github.com/ZichenWen1/EPIC
英文摘要
Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
中文摘要
视觉 token 在多模态大模型(MLLM)中消耗大量计算资源,显著降低其效率。近期工作尝试在训练期间压缩视觉 token 来提升效率,或是修改模型组件,或是引入额外参数。然而,这些方法往往忽视了压缩带来的学习难度增加:模型的参数空间难以快速适应 token 压缩在特征空间中引起的剧烈扰动。在这项工作中,我们提出通过渐进式一致性蒸馏(EPIC)这一渐进式学习框架来构建高效 MLLM。具体而言,我们将 token 压缩引入的特征空间扰动沿 token 维度和层维度进行分解,分别引入 token 一致性蒸馏和层一致性蒸馏,旨在借助教师模型的指导并遵循渐进式学习轨迹来降低训练难度。大量实验证明了所提框架在有效性、鲁棒性和泛化能力上的优势。
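下面的 PyTorch 草图示意"渐进式压缩 + 一致性蒸馏"的两个要素:保留的视觉 token 比例随训练步数逐渐降低,同时用 KL 散度让压缩输入下的学生输出分布贴近教师分布(假设二者的语言端 logits 形状相同)。这是根据摘要写的假设性示意,并非 EPIC 的官方实现;compress_tokens 按范数选取的策略也只是占位做法。

```python
import torch
import torch.nn.functional as F

def progressive_keep_ratio(step: int, total_steps: int, final_ratio: float = 0.25) -> float:
    """渐进式课程:保留的视觉 token 比例从 1.0 线性退火到 final_ratio。"""
    return 1.0 - (1.0 - final_ratio) * min(step / max(total_steps, 1), 1.0)

def compress_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """极简的 token 压缩占位实现:按 L2 范数保留最显著的前 k 个视觉 token。"""
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = tokens.norm(dim=-1).topk(k, dim=1).indices
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

def consistency_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """一致性蒸馏的占位损失:让压缩输入下的学生分布贴近教师分布。"""
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

if __name__ == "__main__":
    vis = torch.randn(2, 64, 16)   # (batch, 视觉 token 数, 维度)
    ratio = progressive_keep_ratio(step=300, total_steps=1000)
    print(compress_tokens(vis, ratio).shape, round(ratio, 3))
```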
RLinf-VLA:VLA+RL 训练的统一高效框架
- 标题: RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
- 作者: Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang
- 日期: 2025-10-08
- ArXiv主页 : https://arxiv.org/abs/2510.06710
- 论文链接 : https://arxiv.org/pdf/2510.06710
- 项目链接 : https://rlinf.readthedocs.io/en/latest/
- gitHub仓库 : https://github.com/RLinf/RLinf
英文摘要
Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves 98.11% across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.
中文摘要
视觉与语言基础模型的最新进展显著推动了多模态理解、推理和生成,激发了通过视觉-语言-动作(VLA)模型将此类能力扩展到具身场景的浓厚兴趣。然而,大多数 VLA 模型仍采用监督微调(SFT)进行训练,由于误差累积,这类模型难以在分布偏移下泛化。强化学习(RL)通过交互直接优化任务表现,提供了一条有前景的替代路径,但现有尝试仍然零散,缺乏一个统一平台来对不同模型架构和算法设计进行公平、系统的比较。为填补这一空白,我们提出 RLinf-VLA,一个用于 VLA 模型可扩展 RL 训练的统一高效框架。该系统采用高度灵活的资源分配设计,解决了 RL+VLA 训练中渲染、训练与推理一体化的挑战。特别地,针对 GPU 并行的模拟器,RLinf-VLA 实现了一种新颖的混合细粒度流水线分配模式,在训练中取得 1.61x-1.88x 的加速。通过统一接口,RLinf-VLA 无缝支持多种 VLA 架构(如 OpenVLA、OpenVLA-OFT)、多种 RL 算法(如 PPO、GRPO)以及多种模拟器(如 ManiSkill、LIBERO)。在仿真中,单个统一模型在 130 个 LIBERO 任务上达到 98.11%,在 25 个 ManiSkill 任务上达到 97.66%。除实证性能外,我们的研究还提炼出一套将 RL 应用于 VLA 训练的最佳实践,并揭示了这一结合中的新兴规律。此外,我们在真实世界的 Franka 机器人上进行了初步部署,其中经 RL 训练的策略展现出比 SFT 训练更强的泛化能力。我们期望 RLinf-VLA 能成为加速并规范具身智能研究的基础。
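作为对摘要中 GRPO 这类算法的通用性示意(与 RLinf-VLA 框架本身的接口无关),下面的 Python 草图计算组相对优势:同一任务的一组 rollout 互为基线,用组内均值和标准差对回报做归一化。其中的示例回报为虚构数值。

```python
import statistics
from typing import List

def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO 风格的组相对优势:优势 = (回报 - 组均值) / (组标准差 + eps)。
    这只是算法层面的通用示意,并非 RLinf-VLA 的真实接口。"""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]

if __name__ == "__main__":
    # 例如同一条操作指令下采样 4 条轨迹,成功记 1、失败记 0:
    print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))
```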
VChain:视频生成推理的视觉思维链
- 标题: VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
- 作者: Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu
- 日期: 2025-10-06
- ArXiv主页 : https://arxiv.org/abs/2510.05094
- 论文链接 : https://arxiv.org/pdf/2510.05094
- 项目链接 : https://eyeline-labs.github.io/VChain/
- gitHub仓库 : https://github.com/Eyeline-Labs/VChain
英文摘要
Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
中文摘要
最近的视频生成模型可以生成流畅且具有视觉吸引力的片段,但它们往往难以合成具有连贯因果链的复杂动态。准确建模视觉结果及其随时间的状态转变仍是核心挑战。相比之下,大型语言与多模态模型(例如 GPT-4o)表现出强大的视觉状态推理和未来预测能力。为了融合这两方面的优势,我们提出 VChain,一个新颖的推理时视觉思维链框架,它将多模态模型的视觉推理信号注入视频生成。具体来说,VChain 包含一个专用流水线,利用大型多模态模型生成一组稀疏的关键帧作为快照,然后仅在这些关键时刻利用它们来指导预训练视频生成器的稀疏推理时调整。我们的方法调优高效、额外开销极小,并避免了密集监督。在复杂多步骤场景上的大量实验表明,VChain 显著提升了生成视频的质量。
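下面用一个假设性的 Python 草图示意 VChain 流水线的两个阶段:先让多模态大模型推理事件的因果链并产出稀疏关键帧(时间点 + 描述),再仅在这些关键时刻对预训练视频生成器做稀疏的推理时调整。propose_keyframes 与 sparse_tune 均为虚构的占位函数,并非论文的真实接口。

```python
from typing import List, Tuple

def propose_keyframes(prompt: str, num_keyframes: int = 3) -> List[Tuple[float, str]]:
    """假设性接口:让多模态大模型推理事件的因果链,返回 (归一化时间点, 关键帧描述)。"""
    return [
        (0.0, "玻璃杯立在桌边"),
        (0.5, "玻璃杯被碰倒并坠落"),
        (1.0, "玻璃杯落地碎裂"),
    ]

def sparse_tune(video_generator, keyframes: List[Tuple[float, str]]):
    """示意:仅在关键时刻向预训练视频生成器注入视觉约束做稀疏调整,
    其余时间步保持原模型不变(此处只打印计划,不做真实微调)。"""
    for t, desc in keyframes:
        print(f"在 t={t:.2f} 处注入视觉约束:{desc}")
    return video_generator

if __name__ == "__main__":
    plan = propose_keyframes("一只猫把桌边的玻璃杯拨到地上")
    sparse_tune(video_generator=None, keyframes=plan)
```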