中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- 通过循环语言模型扩展潜在推理
- Concerto:联合 2D-3D 自监督学习涌现空间表示
- ReCode:统一计划和行动以实现通用粒度控制
- Kimi Linear:一种富有表现力、高效的注意力架构
- 手动解码的终结:走向真正的端到端语言模型
- Emu3.5:原生多模态模型是世界学习者
- DeepAgent:具有可扩展工具集的通用推理代理
- Tongyi DeepResearch 技术报告
- InteractComp:使用不明确的查询评估搜索代理
- JanusCoder:迈向代码智能的基础性视觉-程序接口
- 视频思考者:通过强化学习激发"用视频思考"
- AgentFold:具有主动上下文管理功能的长期 Web 代理
- 数据代理调查:新兴范式还是夸大的炒作?
- 扩散模型的原理
- FARMER:像素上的流自回归 Transformer
- RoboOmni:全模态环境中的主动机器人操作
- Game-TARS:可扩展通才多模态游戏代理的预训练基础模型
- 通过采样进行推理:您的基本模型比您想象的更聪明
- Agent 能征服网络吗?探索ChatGPT Atlas Agent在网页游戏中的前沿
- 监督强化学习:从专家轨迹到逐步推理
- 工具十项全能:针对多样化、现实和长期任务执行的语言代理基准测试
- 视频即提示:视频生成的统一语义控制
- 使用流程挖掘的推理感知 GRPO
- WorldGrow:生成无限 3D 世界
- 前瞻锚定:在音频驱动的人类动画中保留角色身份
- VITA-E:同时看、听、说、做的自然具身交互
- Uniform Discrete Diffusion with Metric Path for Video Generation
- IGGT:用于语义 3D 重建的基于实例的几何变换器
- 探索机器人控制中扩散模型的条件
通过循环语言模型扩展潜在推理
- 标题: Scaling Latent Reasoning via Looped Language Models
- 作者: Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian
- 日期: 2025-10-29
- ArXiv主页 : https://arxiv.org/abs/2510.25741
- 论文链接 : https://arxiv.org/pdf/2510.25741
- 项目链接 : https://ouro-llm.github.io/
英文摘要
Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model could be found in: http://ouro-llm.github.io.
中文摘要
现代大语言模型 (LLM) 主要通过显式文本生成(例如思维链 CoT)来"思考",这把推理推迟到了后训练阶段,也未能充分利用预训练数据。我们提出并开源了 Ouro(得名于衔尾蛇 Ouroboros),这是一系列预训练的循环语言模型 (LoopLM),它通过 (i) 潜在空间中的迭代计算、(ii) 用于学习深度分配的熵正则化目标,以及 (iii) 扩展到 7.7T token,把推理能力直接构建进预训练阶段。Ouro 1.4B 和 2.6B 模型性能出色,在广泛的基准测试中可与高达 12B 的 SOTA LLM 相媲美。通过对照实验,我们证明这种优势并非源于知识容量的增加,而是源于更强的知识运用能力。我们还表明,LoopLM 产生的推理轨迹比显式 CoT 更与最终输出一致。我们希望这些结果能够展示 LoopLM 作为推理时代一个新的扩展方向的潜力。模型见:http://ouro-llm.github.io。
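下面是一个极简的 PyTorch 示意,用来说明"循环语言模型"的基本思想:同一组共享层被迭代应用多次,并由一个轻量的退出头为每一步给出(可加熵正则的)深度分配分布。其中的层数、门控形式与模型规模均为便于说明的假设,并非论文的具体实现。

```python
import torch
import torch.nn as nn

class ToyLoopLM(nn.Module):
    """极简示意:共享 Transformer 块被循环应用,退出头学习深度分配。"""

    def __init__(self, vocab=1000, d_model=128, n_heads=4, max_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        # 同一个块在每个循环步被重复使用(参数共享)
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=256, batch_first=True
        )
        self.exit_head = nn.Linear(d_model, 1)   # 每个循环步的"此处退出"logit
        self.lm_head = nn.Linear(d_model, vocab)
        self.max_loops = max_loops

    def forward(self, ids):
        h = self.embed(ids)
        exit_logits = []
        for _ in range(self.max_loops):
            h = self.shared_block(h)                      # 潜在空间中的迭代计算
            exit_logits.append(self.exit_head(h[:, -1]))  # 基于末位 token 打分
        # 把各步退出 logit 归一化成深度分布 q(k|x),训练时可对其加熵正则
        q = torch.softmax(torch.cat(exit_logits, dim=-1), dim=-1)  # (batch, max_loops)
        return self.lm_head(h), q

model = ToyLoopLM()
logits, depth_dist = model(torch.randint(0, 1000, (2, 16)))
entropy_bonus = -(depth_dist * depth_dist.clamp_min(1e-9).log()).sum(-1).mean()
print(logits.shape, depth_dist.shape, float(entropy_bonus))
```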
Concerto:联合 2D-3D 自监督学习涌现空间表示
- 标题: Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
- 作者: Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23607
- 论文链接 : https://arxiv.org/pdf/2510.23607
- 项目链接 : https://pointcept.github.io/Concerto/
- gitHub仓库 : https://github.com/Pointcept/Pointcept
英文摘要
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
中文摘要
人类通过多感官协同学习抽象概念,而一旦形成,这类表征通常可以从单一模态中被唤起。受这一原理启发,我们提出 Concerto,一种对人类空间认知概念学习的极简模拟,将 3D 模态内自蒸馏与 2D-3D 跨模态联合嵌入相结合。尽管方法简单,Concerto 仍能学到更连贯、信息量更大的空间特征,零样本可视化也印证了这一点。在 3D 场景感知的线性探测中,它分别比独立的 SOTA 2D 和 3D 自监督模型高出 14.2% 和 4.8%,也优于二者的特征拼接。经过完整微调,Concerto 在多个场景理解基准上刷新了 SOTA(例如在 ScanNet 上取得 80.7% mIoU)。我们还提出了面向"由视频提升得到的点云"空间理解的 Concerto 变体,以及一个把 Concerto 表示线性投影到 CLIP 语言空间的转换器,从而支持开放世界感知。这些结果表明,Concerto 涌现出了几何与语义细粒度一致性更优的空间表示。
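下面是一个示意性的损失组合,用来说明摘要中"3D 模态内自蒸馏 + 2D-3D 跨模态联合嵌入"的大致思路:点云特征的学生/教师自蒸馏损失,加上把配对像素特征与点特征对齐的跨模态损失。特征维度、投影方式与配对关系的获取均为此处演示所做的假设,并非论文实现。

```python
import torch
import torch.nn.functional as F

def intra_modal_self_distill(student_feat, teacher_feat, tau=0.07):
    """3D 模态内自蒸馏(示意):学生分布向(停止梯度的)教师分布看齐。"""
    s = F.log_softmax(student_feat / tau, dim=-1)
    t = F.softmax(teacher_feat.detach() / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

def cross_modal_align(point_feat, pixel_feat):
    """2D-3D 联合嵌入(示意):对已配对的点/像素特征做余弦对齐。"""
    p = F.normalize(point_feat, dim=-1)
    q = F.normalize(pixel_feat, dim=-1)
    return (1.0 - (p * q).sum(-1)).mean()

# 假设已有 N 个点与像素的配对特征(真实方法中来自 3D/2D 主干网络)
N, D = 4096, 256
point_student = torch.randn(N, D, requires_grad=True)
point_teacher = torch.randn(N, D)      # 例如 EMA 教师分支
pixel_feat = torch.randn(N, D)

loss = intra_modal_self_distill(point_student, point_teacher) \
     + cross_modal_align(point_student, pixel_feat)
loss.backward()
print(float(loss))
```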
ReCode:统一计划和行动以实现通用粒度控制
- 标题: ReCode: Unify Plan and Action for Universal Granularity Control
- 作者: Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yingchao Li, Yuyu Luo, Bang Liu, Chenglin Wu
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23564
- gitHub仓库 : https://github.com/FoundationAgents/ReCode
英文摘要
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
中文摘要
现实世界的任务需要在不同粒度上做决策,而人类之所以擅长这一点,是因为其统一的认知表示把规划从根本上理解为一种高层次的行动。然而,当前基于大语言模型 (LLM) 的智能体缺乏这种在不同决策粒度之间流畅切换的关键能力。这一局限源于现有范式在高层规划与低层动作之间强加的刚性分离,它损害了动态适应性并限制了泛化。我们提出 ReCode(递归代码生成),一种通过在单一代码表示中统一规划与动作来解决该局限的新范式。在这种表示中,ReCode 把高层计划视为抽象的占位函数,智能体再将其递归分解为更细粒度的子函数,直到到达原子动作。这种递归方法消解了计划与动作之间的刚性边界,使智能体能够动态控制其决策粒度。此外,递归结构天然会产生丰富的多粒度训练数据,使模型能够学习分层决策过程。大量实验表明,ReCode 在推理性能上显著超越先进基线,并在训练中展现出卓越的数据效率,验证了我们的核心观点:通过递归代码生成统一规划与动作,是实现通用粒度控制的强大而有效的途径。代码见 https://github.com/FoundationAgents/ReCode。
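下面用纯 Python 勾勒"把高层计划当作占位函数、递归展开为更细粒度子函数直至原子动作"的控制流;其中 `llm_decompose` 是一个假设的接口,这里用固定规则代替真实的模型调用,仅用于演示递归结构。

```python
PRIMITIVES = {"click", "type", "read", "submit"}

def llm_decompose(plan: str) -> list[str]:
    """假设的 LLM 调用:把一个抽象计划改写为若干更细的子步骤(此处为硬编码规则)。"""
    table = {
        "book_flight": ["search_flight", "fill_passenger_form", "submit"],
        "search_flight": ["type", "click", "read"],
        "fill_passenger_form": ["type", "type", "click"],
    }
    return table.get(plan, ["read"])

def recode(plan: str, depth: int = 0, max_depth: int = 8) -> list[str]:
    """递归代码生成(示意):计划即抽象函数,展开到原子动作为止。"""
    if plan in PRIMITIVES or depth >= max_depth:
        return [plan]                      # 到达原子动作,直接执行
    actions = []
    for sub in llm_decompose(plan):        # 高层计划 -> 更细粒度的子函数
        actions.extend(recode(sub, depth + 1, max_depth))
    return actions

print(recode("book_flight"))
# ['type', 'click', 'read', 'type', 'type', 'click', 'submit']
```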
Kimi Linear:一种富有表现力、高效的注意力架构
- 标题: Kimi Linear: An Expressive, Efficient Attention Architecture
- 作者: Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du
- 日期: 2025-10-30
- ArXiv主页 : https://arxiv.org/abs/2510.26692
- gitHub仓库 : https://github.com/MoonshotAI/Kimi-Linear
英文摘要
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
中文摘要
我们提出 Kimi Linear,一种混合线性注意力架构,它首次在公平对比下于各种场景(包括短上下文、长上下文以及强化学习 (RL) 扩展场景)中全面超越全注意力。其核心是 Kimi Delta Attention (KDA):一个表达能力强的线性注意力模块,它用更细粒度的门控机制扩展了 Gated DeltaNet,从而能更有效地利用有限的有限状态 RNN 记忆。我们定制的分块算法通过对角加低秩 (DPLR) 转移矩阵的一个特化变体实现了高硬件效率,与一般 DPLR 形式相比大幅减少了计算量,同时与经典 delta 规则保持更高的一致性。我们基于 KDA 与多头潜在注意力 (MLA) 的逐层混合,预训练了一个激活参数 3B、总参数 48B 的 Kimi Linear 模型。实验表明,在完全相同的训练配方下,Kimi Linear 在所有评测任务上都以可观的优势超过完整的 MLA,同时将 KV 缓存使用量最多减少 75%,并在 1M 上下文下实现最高 6 倍的解码吞吐。这些结果表明,Kimi Linear 可以作为全注意力架构的直接替代,在包括更长输入输出的任务上兼具更优的性能与效率。为支持后续研究,我们开源了 KDA 内核与 vLLM 实现,并发布了预训练及指令微调的模型检查点。
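下面是一个玩具化的 numpy 递推,示意"带逐通道门控的 delta 规则线性注意力"这类状态更新的组合方式;它只用于说明门控加 delta 规则的基本形态,并不是 KDA 的精确公式,更不是其分块高效实现。

```python
import numpy as np

def gated_delta_rule(q, k, v, beta, gate):
    """逐步递推(示意):
    S_t = diag(gate_t) @ S_{t-1} + beta_t * k_t (v_t - S_{t-1}^T k_t)^T
    o_t = S_t^T q_t
    q, k: (T, dk), v: (T, dv), beta: (T,), gate: (T, dk),取值在 (0,1) 内做逐通道衰减。"""
    T, dk = q.shape
    dv = v.shape[1]
    S = np.zeros((dk, dv))
    outs = np.zeros((T, dv))
    for t in range(T):
        S = gate[t][:, None] * S                       # 细粒度(逐通道)遗忘
        pred = S.T @ k[t]                              # 当前状态对 v_t 的"预测"
        S = S + beta[t] * np.outer(k[t], v[t] - pred)  # delta 规则:只写入预测误差
        outs[t] = S.T @ q[t]
    return outs

rng = np.random.default_rng(0)
T, dk, dv = 8, 4, 4
q, k, v = rng.normal(size=(T, dk)), rng.normal(size=(T, dk)), rng.normal(size=(T, dv))
beta = rng.uniform(0.1, 1.0, size=T)        # 写入强度
gate = rng.uniform(0.8, 1.0, size=(T, dk))  # 逐通道门控
print(gated_delta_rule(q, k, v, beta, gate).shape)   # (8, 4)
```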
手动解码的终结:走向真正的端到端语言模型
- 标题: The End of Manual Decoding: Towards Truly End-to-End Language Models
- 作者: Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang
- 日期: 2025-10-30
- ArXiv主页 : https://arxiv.org/abs/2510.26697
- gitHub仓库 : https://github.com/Zacks917/AutoDeco
英文摘要
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious, hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set"-a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
中文摘要
LLM 的"端到端"标签其实名不副实。在实践中,它们依赖一个不可微的解码过程,需要费力地手工调节温度 (temperature) 和 top-p 等超参数。本文提出 AutoDeco,一种通过学习控制自身解码策略来实现真正"端到端"生成的新架构。我们在标准 Transformer 上增加轻量级预测头,在每一步随下一 token 的 logits 一起,动态预测与上下文相关的温度和 top-p 值。这种方法把解码变成一个参数化的、token 级的过程,使模型能在一次前向传播中自我调节采样策略。在八个基准上的大量实验表明,AutoDeco 不仅显著优于默认解码策略,还能达到与"在测试集上调参"得到的 oracle 基线相当的性能,而后者是任何静态方法的实际上限。更重要的是,我们发现了一种基于指令的解码控制的涌现能力:模型学会理解自然语言命令(例如"以低随机性生成"),并逐 token 调整其预测的温度与 top-p,为可控、可交互的 LLM 解码开辟了新范式。
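下面的 PyTorch 片段示意摘要描述的思路:在语言模型隐状态上接两个轻量预测头,逐 token 输出 temperature 与 top-p,再用它们完成一次采样;预测头的结构与取值范围是这里为演示所做的假设。

```python
import torch
import torch.nn as nn

class DecodingHeads(nn.Module):
    """在隐状态上预测逐 token 的 temperature 与 top-p(示意)。"""
    def __init__(self, d_model):
        super().__init__()
        self.temp_head = nn.Linear(d_model, 1)
        self.topp_head = nn.Linear(d_model, 1)

    def forward(self, h):
        temperature = 0.1 + 1.9 * torch.sigmoid(self.temp_head(h))  # 约束到 (0.1, 2.0)
        top_p = torch.sigmoid(self.topp_head(h))                    # 约束到 (0, 1)
        return temperature.squeeze(-1), top_p.squeeze(-1)

def sample_with_predicted_params(logits, temperature, top_p):
    """用预测到的参数做一次 top-p 采样。logits: (vocab,);temperature/top_p 为标量。"""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_p, sorted_idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_p, dim=-1) - sorted_p < top_p   # 累计概率 < top_p 的前缀
    keep[0] = True                                             # 至少保留概率最高的 token
    filtered = torch.zeros_like(probs).scatter(0, sorted_idx[keep], sorted_p[keep])
    return torch.multinomial(filtered / filtered.sum(), 1).item()

d_model, vocab = 64, 50
h_last = torch.randn(d_model)            # 假设:主干网络给出的末位隐状态
logits = torch.randn(vocab)              # 假设:同一位置的 next-token logits
heads = DecodingHeads(d_model)
temp, top_p = heads(h_last)
print(sample_with_predicted_params(logits, temp.item(), top_p.item()))
```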
Emu3.5:原生多模态模型是世界学习者
- 标题: Emu3.5: Native Multimodal Models are World Learners
- 作者: Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
- 日期: 2025-10-30
- ArXiv主页 : https://arxiv.org/abs/2510.26583
- 论文链接 : https://arxiv.org/pdf/2510.26583
- 项目链接 : https://emu.world/
- gitHub仓库 : https://github.com/baaivision/Emu3.5
英文摘要
We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
中文摘要
我们提出 Emu3.5,一个大规模多模态世界模型,能够原生地跨视觉与语言预测下一状态。Emu3.5 在超过 10 万亿 token 的视觉-语言交错数据语料上,以统一的下一 token 预测目标进行端到端预训练,这些数据主要来自互联网视频的连续帧及其转写文本。模型天然接受交错的视觉-语言输入,并生成交错的视觉-语言输出。Emu3.5 进一步通过大规模强化学习进行后训练,以增强多模态推理与生成能力。为提高推理效率,我们提出离散扩散适配 (DiDA),将逐 token 解码转换为双向并行预测,在不损失性能的情况下把单张图像的推理速度提升约 20 倍。Emu3.5 展现出强大的原生多模态能力,包括长时程视觉-语言生成、任意输入到图像 (X2I) 生成,以及复杂的富文本图像生成。它还表现出可泛化的世界建模能力,能够在多样的场景与任务中进行时空一致的世界探索和开放世界的具身操作。作为对比,Emu3.5 在图像生成与编辑任务上取得了与 Gemini 2.5 Flash Image (Nano Banana) 相当的性能,并在一系列交错生成任务上展现出更优的结果。我们在 https://github.com/baaivision/Emu3.5 开源 Emu3.5,以支持社区研究。
DeepAgent:具有可扩展工具集的通用推理代理
- 标题: DeepAgent: A General Reasoning Agent with Scalable Toolsets
- 作者: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
- 日期: 2025-10-24
- ArXiv主页 : https://arxiv.org/abs/2510.21618
- gitHub仓库 : https://github.com/RUC-NLPIR/DeepAgent
英文摘要
Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-horizon interactions, particularly the context length explosion from multiple tool calls and the accumulation of interaction history, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. This work takes a step toward more general and capable agents for real-world applications. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
中文摘要
大型推理模型已表现出强大的解决问题的能力,但现实世界的任务通常需要外部工具和长期交互。现有的代理框架通常遵循预定义的工作流程,这限制了自主和全局任务的完成。在本文中,我们介绍了 DeepAgent,这是一种端到端深度推理代理,可以在单个连贯的推理过程中执行自主思考、工具发现和动作执行。为了解决长范围交互的挑战,特别是多个工具调用带来的上下文长度爆炸和交互历史的积累,我们引入了一种自主记忆折叠机制,将过去的交互压缩为结构化的情景、工作和工具记忆,减少错误积累,同时保留关键信息。为了高效、稳定地教授通用工具的使用,我们开发了一种端到端强化学习策略,即 ToolPO,它利用 LLM 模拟的 API 并应用工具调用优势归因来为工具调用令牌分配细粒度的信用。对八个基准测试(包括一般工具使用任务(ToolBench、API-Bank、TMDB、Spotify、ToolHop)和下游应用程序(ALFWorld、WebShop、GAIA、HLE))的广泛实验表明,DeepAgent 在标记工具和开放集工具检索场景中始终优于基线。这项工作朝着更通用、更强大的现实应用代理迈出了一步。代码和演示可在 https://github.com/RUC-NLPIR/DeepAgent 获取。
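下面用纯 Python 勾勒"自主记忆折叠"的一种最简实现思路:当交互历史超出预算时,把较早的原始轨迹压缩成情景记忆、工作记忆与工具记忆三类结构化摘要;其中 `summarize` 是假设的 LLM 摘要接口,这里用拼接加截断代替,数据结构划分也仅为示意。

```python
from dataclasses import dataclass, field

def summarize(texts, limit=120):
    """假设的 LLM 摘要调用,这里仅用拼接加截断代替。"""
    return (" | ".join(texts))[:limit]

@dataclass
class FoldedMemory:
    episodic: list = field(default_factory=list)   # 情景记忆:发生过什么
    working: str = ""                              # 工作记忆:当前子目标与关键结论
    tools: dict = field(default_factory=dict)      # 工具记忆:工具名 -> 使用要点

def fold(history, memory: FoldedMemory, keep_last=4):
    """把 keep_last 步之前的原始交互折叠进结构化记忆,返回压缩后的上下文。"""
    old, recent = history[:-keep_last], history[-keep_last:]
    if old:
        memory.episodic.append(summarize([h["text"] for h in old]))
        memory.working = summarize([h["text"] for h in old if h["role"] == "assistant"])
        for h in old:
            if h["role"] == "tool":
                memory.tools[h["name"]] = summarize([h["text"]])
    return {"memory": memory, "recent": recent}

history = [
    {"role": "assistant", "text": "先检索候选 API", "name": None},
    {"role": "tool", "text": "search 返回 3 个候选", "name": "search"},
    {"role": "assistant", "text": "选择 weather API 并调用", "name": None},
    {"role": "tool", "text": "weather 返回北京 12C", "name": "weather"},
    {"role": "assistant", "text": "汇总答案草稿", "name": None},
]
ctx = fold(history, FoldedMemory(), keep_last=2)
print(list(ctx["memory"].tools), len(ctx["recent"]))
```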
Tongyi DeepResearch 技术报告
- 标题: Tongyi DeepResearch Technical Report
- 作者: Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang
- 日期: 2025-10-28
- ArXiv主页 : https://arxiv.org/abs/2510.24701
- 论文链接 : https://arxiv.org/pdf/2510.24701
- 项目链接 : https://tongyi-agent.github.io/blog
- gitHub仓库 : https://github.com/Alibaba-NLP/DeepResearch
英文摘要
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
中文摘要
我们推出 Tongyi DeepResearch,一个智能体化的大语言模型,专为长时程、深度信息检索型研究任务而设计。为了激发自主深度研究的智能体能力,Tongyi DeepResearch 采用端到端的训练框架,结合智能体化的中期训练 (mid-training) 与后训练 (post-training),实现跨复杂任务的可扩展推理与信息检索。我们设计了一条高度可扩展、完全自动的数据合成管线,无需依赖昂贵的人工标注,并为所有训练阶段提供支撑。通过为每个阶段构建定制环境,我们的系统能在全流程中保持稳定一致的交互。Tongyi DeepResearch 总参数 305 亿,每个 token 仅激活 33 亿参数,在一系列智能体深度研究基准上取得了最先进的性能,包括 Humanity's Last Exam、BrowseComp、BrowseComp-ZH、WebWalkerQA、xbench-DeepSearch、FRAMES 和 xbench-DeepSearch-2510。我们开源了模型、框架与完整方案,为社区提供支持。
InteractComp:使用不明确的查询评估搜索代理
- 标题: InteractComp: Evaluating Search Agents With Ambiguous Queries
- 作者: Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo
- 日期: 2025-10-28
- ArXiv主页 : https://arxiv.org/abs/2510.24668
英文摘要
Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.
中文摘要
语言智能体在网络搜索与信息检索方面展现出巨大潜力。然而,这些搜索智能体假设用户查询是完整且无歧义的,这与现实相悖:用户往往从不完整的查询出发,需要通过交互加以澄清。然而大多数智能体在搜索过程中缺乏交互机制,现有基准也无法评估这一能力。为填补这一空白,我们提出 InteractComp,一个用于评估搜索智能体能否识别查询歧义、并在搜索过程中主动交互以消除歧义的基准。遵循"易于验证、通过交互消歧"的原则,我们采用目标-干扰项方法,构建了横跨 9 个领域、由专家策划的 210 个问题,其歧义只有通过交互才能真正化解。对 17 个模型的评估揭示了惊人的失败:最好的模型在给定完整上下文时可达 71.50% 的准确率,但在本基准上仅有 13.73%,暴露出的是系统性的过度自信而非推理能力不足。强制交互则带来巨大提升,说明现有策略未能调动模型的潜在能力。纵向分析显示,过去 15 个月里搜索性能提升了 7 倍,而交互能力却停滞不前,揭示出一个关键盲点。这种停滞,加上搜索任务天然的即时反馈,使 InteractComp 成为评估和训练搜索智能体交互能力的宝贵资源。代码见 https://github.com/FoundationAgents/InteractComp。
JanusCoder:迈向代码智能的基础性视觉-程序接口
- 标题: JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
- 作者: Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23538
- gitHub仓库 : https://github.com/InternLM/JanusCoder
英文摘要
The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints will are available at https://github.com/InternLM/JanusCoder.
中文摘要
神经代码智能的范围正迅速超越基于文本的源代码,扩展到程序所生成的丰富视觉输出。这一视觉维度对于灵活的内容生成、精确的程序驱动可视化编辑等高级应用至关重要。然而,高质量多模态代码数据的稀缺阻碍了进展,其瓶颈来自数据合成与质量评估上的挑战。为应对这些挑战,我们从数据和建模两个角度做出贡献。我们首先提出一套完整的合成工具包,利用不同数据模态之间的相互协同,高效地构建一个从标准图表到复杂交互式 Web UI 和代码驱动动画的大规模高质量语料库。基于该工具包,我们构建了迄今最大的多模态代码语料库 JanusCode-800K。它支撑了我们的模型 JanusCoder 和 JanusCoderV 的训练,二者建立了一个视觉-程序接口,可以从文本指令、视觉输入或两者的组合生成代码。与为孤立任务构建专用模型的现有做法不同,我们的模型是统一的。在以文本为中心和以视觉为中心的编码任务上的大量实验证明了 JanusCoder 系列的卓越性能,我们 7B 到 14B 规模的模型接近甚至超过商业模型的表现。此外,大量分析为如何让程序逻辑与其视觉表达相协调提供了关键洞见。代码与检查点见 https://github.com/InternLM/JanusCoder。
视频思考者:通过强化学习激发"用视频思考"
- 标题: Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
- 作者: Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, Xuelian Cheng
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23473
- gitHub仓库 : https://github.com/shijian2001/Video-Thinker
英文摘要
Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
中文摘要
图像推理方法的最新进展,尤其是"用图像思考",已在多模态大语言模型 (MLLM) 中取得显著成功;然而,这种动态推理范式尚未扩展到视频推理任务。本文提出 Video-Thinker,它让 MLLM 在推理全程自主调用自身固有的"定位 (grounding)"与"描述 (captioning)"能力来生成推理线索,从而实现"用视频思考"。为激发这一能力,我们构建了 Video-Thinker-10K,一个在思维链推理序列中包含自主工具使用的精选数据集。我们的训练策略先用监督微调 (SFT) 学习推理格式,再用组相对策略优化 (GRPO) 强化这一推理能力。借助该方法,Video-Thinker 使 MLLM 能够自主完成视频推理中的定位与描述任务,无需构建和调用外部工具。大量实验表明,Video-Thinker 在域内任务以及具有挑战性的域外视频推理基准(包括 Video-Holmes、CG-Bench-Reasoning 和 VRBench)上均取得显著性能提升。我们的 Video-Thinker-7B 大幅超越 Video-R1 等现有基线,在 7B 规模的 MLLM 中确立了最先进的性能。
AgentFold:具有主动上下文管理功能的长期 Web 代理
- 标题: AgentFold: Long-Horizon Web Agents with Proactive Context Management
- 作者: Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang
- 日期: 2025-10-28
- ArXiv主页 : https://arxiv.org/abs/2510.24699
- gitHub仓库 : https://github.com/Alibaba-NLP/DeepResearch
英文摘要
LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a `folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
中文摘要
基于 LLM 的网络智能体在信息检索方面前景巨大,但其在长时程任务上的有效性受制于上下文管理中的一个根本性权衡。主流的 ReAct 式智能体会因累积嘈杂的原始历史而陷入上下文饱和,而在每一步都固定地总结完整历史的方法,则可能不可逆地丢失关键细节。针对这些问题,我们提出 AgentFold,一种以主动上下文管理为核心的新型智能体范式,其灵感来自人类"回顾性整合"的认知过程。AgentFold 把上下文视为需要主动塑造的动态认知工作区,而不是被动填充的日志。在每一步,它学习执行一次"折叠"操作,在多个尺度上管理其历史轨迹:既可以做细粒度的压缩以保留重要细节,也可以做深度合并,把整个多步子任务抽象掉。在主流基准上的结果十分醒目:仅用简单的监督微调(不需要持续预训练或强化学习),我们的 AgentFold-30B-A3B 在 BrowseComp 上达到 36.2%,在 BrowseComp-ZH 上达到 47.3%。值得注意的是,这一性能不仅超越或追平了规模大得多的开源模型(如 DeepSeek-V3.1-671B-A37B),也超过了 OpenAI o4-mini 等领先的专有智能体。
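下面是一个纯 Python 示意,对应摘要中"折叠"操作的两种尺度:对最早一条未压缩记录做细粒度压缩(保留细节),或把已完成的多步子任务整体深度合并为一条抽象记录;`condense` 函数与折叠策略的触发方式都是为演示所做的假设。

```python
def condense(step):
    """细粒度压缩(示意):保留单步的关键信息。"""
    return f"[{step['action']}] {step['observation'][:60]}"

def fold_context(trajectory, finished_subtask=None):
    """AgentFold 式折叠(示意):
    - 若某个多步子任务已完成,则把该区间深度合并为一条摘要;
    - 否则只对最早一条未压缩记录做细粒度压缩。"""
    if finished_subtask is not None:
        lo, hi, goal = finished_subtask
        merged = {"action": "SUBTASK_DONE",
                  "observation": f"{goal}: 共 {hi - lo + 1} 步,结论已并入工作记忆"}
        return trajectory[:lo] + [merged] + trajectory[hi + 1:]
    for i, step in enumerate(trajectory):
        if not step.get("condensed"):
            trajectory[i] = {"action": step["action"],
                             "observation": condense(step), "condensed": True}
            break
    return trajectory

traj = [{"action": "search", "observation": "打开搜索引擎并输入查询词,得到 10 条结果"},
        {"action": "open", "observation": "打开第 2 条结果,页面包含目标表格"},
        {"action": "extract", "observation": "抽取表格第 3 行,得到 2021 年的营收数字"}]
traj = fold_context(traj, finished_subtask=(0, 1, "定位包含营收表格的页面"))
print(traj)
```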
数据代理调查:新兴范式还是夸大的炒作?
- 标题: A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
- 作者: Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li, Wei Zhou, Xinyu Liu, Zhangyang Peng, Tianqi Luo, Yu Li, Chengliang Chai, Chong Chen, Shimin Di, Ju Fan, Ji Sun, Nan Tang, Fugee Tsung, Jiannan Wang, Chenglin Wu, Yanwei Xu, Shaolei Zhang, Yong Zhang, Xuanhe Zhou, Guoliang Li, Yuyu Luo
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23587
英文摘要
The rapid advancement of large language models (LLMs) has spurred the emergence of data agents--autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" currently suffers from terminological ambiguity and inconsistent adoption, conflating simple query responders with sophisticated autonomous architectures. This terminological ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth. Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that delineate and trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation. Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
中文摘要
大型语言模型(LLM)的快速发展刺激了数据代理的出现,这是一种旨在协调数据+人工智能生态系统以处理复杂的数据相关任务的自治系统。然而,术语"数据代理"目前存在术语模糊性和不一致的采用,将简单的查询响应器与复杂的自治架构混为一谈。这种术语的模糊性导致了用户期望的不匹配、责任挑战以及行业发展的障碍。受驾驶自动化 SAE J3016 标准的启发,本次调查引入了数据代理的第一个系统分层分类法,包括六个级别,描绘和跟踪自主性的渐进转变,从手动操作 (L0) 到生成、完全自主的数据代理 (L5) 的愿景,从而澄清能力边界和责任分配。通过这个视角,我们对现有研究进行结构化审查,通过增加自主性来安排,包括用于数据管理、准备和分析的专门数据代理,以及针对具有增强自主性的多功能、综合系统的新兴努力。我们进一步分析了推进数据代理的关键演进飞跃和技术差距,特别是正在进行的 L2 到 L3 的过渡,其中数据代理从程序执行演变为自主编排。最后,我们以前瞻性的路线图作为总结,设想主动的生成数据代理的出现。
扩散模型的原理
- 标题: The Principles of Diffusion Models
- 作者: Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon
- 日期: 2025-10-24
- ArXiv主页 : https://arxiv.org/abs/2510.21890
- 论文链接 : https://arxiv.org/pdf/2510.21890
英文摘要
This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the monograph discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.
中文摘要
本专著阐述了指导扩散模型发展的核心原理,追溯其起源,并展示多种形式化表述如何源自共同的数学思想。扩散建模首先定义一个逐渐把数据破坏为噪声的前向过程,通过一系列连续的中间分布把数据分布与一个简单先验联系起来;目标是学习一个反向过程,把噪声变换回数据,同时复现同样的中间分布。我们描述了三种互补的视角:受变分自编码器启发的变分视角,把扩散看作学习逐步去噪;植根于基于能量建模的分数视角,学习随时间演化的数据分布的梯度,指出如何把样本推向更可能的区域;与归一化流相关的基于流的视角,则把生成视为在学到的速度场下沿一条平滑路径把样本从噪声移动到数据。这些视角共享同一个骨架:一个随时间变化的速度场,其流把简单先验输运到数据分布。采样则相当于求解一个微分方程,使噪声沿连续轨迹演化为数据。在此基础上,本专著讨论了可控生成的引导 (guidance)、高效数值求解器,以及受扩散启发、学习任意时刻之间直接映射的流图 (flow-map) 模型。它为具备基本深度学习知识的读者提供了对扩散模型在概念与数学上的扎实理解。
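摘要把三种视角统一为"依时间变化的速度场 + 求解 ODE"。下面用一个一维高斯玩具分布给出可运行的数值示例:对 VP 型前向过程,其边缘分布的分数函数有解析形式,因此可以直接用欧拉法积分概率流 ODE,把噪声样本搬运回数据分布;所有具体数值(mu0、s0、beta 等)仅为演示假设。

```python
import numpy as np

# 玩具设定(假设):一维数据分布为 N(mu0, s0^2),VP 前向过程 beta(t) 取常数。
mu0, s0, beta = 2.0, 0.5, 8.0

def alpha(t):                      # alpha_t = exp(-0.5 * ∫_0^t beta ds)
    return np.exp(-0.5 * beta * t)

def score(x, t):
    """边缘分布 p_t = N(alpha_t*mu0, alpha_t^2*s0^2 + 1 - alpha_t^2) 的解析分数。"""
    var = alpha(t) ** 2 * s0 ** 2 + 1.0 - alpha(t) ** 2
    return -(x - alpha(t) * mu0) / var

def sample_probability_flow(n=20000, steps=500):
    """概率流 ODE 的欧拉积分:dx/dt = -0.5*beta*(x + score(x,t)),从 t=1 积到 t=0。"""
    rng = np.random.default_rng(0)
    x = rng.normal(size=n)          # 先验:近似 N(0, 1)
    dt = -1.0 / steps
    for i in range(steps):
        t = 1.0 + i * dt
        x = x + (-0.5 * beta * (x + score(x, t))) * dt
    return x

samples = sample_probability_flow()
print(samples.mean(), samples.std())   # 应接近 (2.0, 0.5)
```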
FARMER:像素上的流自回归 Transformer
- 标题: FARMER: Flow AutoRegressive Transformer over Pixels
- 作者: Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23588
- 论文链接 : https://arxiv.org/pdf/2510.23588
英文摘要
Directly modeling the explicit likelihood of the raw data distribution is key topic in the machine learning area, which achieves the scaling successes in Large Language Models by autoregressive modeling. However, continuous AR modeling over visual pixel data suffer from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
中文摘要
直接对原始数据分布的显式可能性进行建模是机器学习领域的关键主题,它通过自回归建模在大型语言模型中取得了扩展的成功。然而,基于视觉像素数据的连续 AR 建模面临着极长的序列和高维空间的问题。在本文中,我们提出了 FARMER,这是一种新颖的端到端生成框架,它统一了归一化流(NF)和自回归(AR)模型,用于直接从原始像素进行易于处理的似然估计和高质量图像合成。FARMER 采用可逆自回归流将图像转换为潜在序列,其分布由自回归模型隐式建模。为了解决像素级建模中的冗余和复杂性,我们提出了一种自监督降维方案,将 NF 潜在通道划分为信息组和冗余组,从而实现更有效和高效的 AR 建模。此外,我们设计了一种一步蒸馏方案来显着加快推理速度,并引入一种基于重采样的无分类器引导算法来提高图像生成质量。大量实验表明,与现有基于像素的生成模型相比,FARMER 实现了具有竞争力的性能,同时提供了精确的可能性和可扩展的训练。
RoboOmni:全模态环境中的主动机器人操作
- 标题: RoboOmni: Proactive Robot Manipulation in Omni-modal Context
- 作者: Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu, Junhao Shi, Kexin Huang, Zhaoye Fei, Jingjing Gong, Zuxuan Wu, Yugang Jiang, See-Kiong Ng, Tat-Seng Chua, Xipeng Qiu
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23763
- 论文链接 : https://arxiv.org/pdf/2510.23763
- 项目链接 : https://OpenMOSS.github.io/RoboOmni
- gitHub仓库 : https://github.com/OpenMOSS/RoboOmni
英文摘要
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
中文摘要
多模态大语言模型 (MLLM) 的最新进展推动了面向机器人操作的视觉-语言-动作 (VLA) 模型快速发展。尽管在许多场景中有效,当前方法在很大程度上依赖显式指令,而在现实世界的交互中,人类很少直接下达指令。有效的协作要求机器人主动推断用户意图。在这项工作中,我们提出跨模态上下文指令这一新设定:意图来自口头对话、环境声音和视觉线索,而非显式命令。为应对这一新设定,我们提出 RoboOmni,一个基于端到端全模态大模型的"感知者-思考者-说话者-执行者"框架,统一了意图识别、交互确认与动作执行。RoboOmni 在时空上融合听觉与视觉信号以实现鲁棒的意图识别,同时支持直接的语音交互。针对机器人操作中主动意图识别训练数据的缺失,我们构建了 OmniAction,包含 140k 个交互片段、5k+ 名说话人、2.4k 种事件声音、640 个背景以及六类上下文指令。仿真与真实环境中的实验表明,RoboOmni 在成功率、推理速度、意图识别和主动协助方面均超越基于文本和 ASR 的基线。
Game-TARS:可扩展通才多模态游戏代理的预训练基础模型
- 标题: Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
- 作者: Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, Wanjun Zhong, Zili Li, Yu Wang, Yu Miao, Bo Zhou, Yuanfan Li, Hao Wang, Zhongkai Zhao, Faming Wu, Zhengxuan Jiang, Weihao Tan, Heyuan Yao, Shi Yan, Xiangyang Li, Yitao Liang, Yujia Qin, Guang Shi
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23691
- 论文链接 : https://arxiv.org/pdf/2510.23691
- 项目链接 : https://seed-tars.com/game-tars
英文摘要
We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about 2 times the success rate over the previous sota model on open-world Minecraft tasks, is close to the generality of fresh humans in unseen web 3d games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS benchmarks. Scaling results on training-time and test-time confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.
中文摘要
我们推出 Game-TARS,一个通用游戏智能体,其训练基于一个统一、可扩展、锚定于与人类一致的原生键盘鼠标输入的动作空间。与基于 API 或 GUI 的方法不同,这一范式支持跨操作系统、网页与模拟游戏等异构领域的大规模持续预训练。Game-TARS 在超过 500B token 的多样化轨迹与多模态数据上进行了预训练。关键技术包括用于减轻因果混淆的衰减式持续损失,以及在推理深度与推理成本之间取得平衡的高效稀疏思考 (Sparse-Thinking) 策略。实验表明,Game-TARS 在开放世界 Minecraft 任务上的成功率约为此前 SOTA 模型的 2 倍,在未见过的网页 3D 游戏中接近新手人类的通用性,并在 FPS 基准上优于 GPT-5、Gemini-2.5-Pro 和 Claude-4-Sonnet。训练时与测试时的扩展结果证实,统一的动作空间在扩展到跨游戏与多模态数据时能够持续带来提升。我们的结果表明,简单、可扩展的动作表示加上大规模预训练,为具备广泛计算机使用能力的通用智能体提供了一条有希望的路径。
通过采样进行推理:您的基本模型比您想象的更聪明
- 标题: Reasoning with Sampling: Your Base Model is Smarter Than You Think
- 作者: Aayush Karan, Yilun Du
- 日期: 2025-10-16
- ArXiv主页 : https://arxiv.org/abs/2510.14901
- 论文链接 : https://arxiv.org/pdf/2510.14901
- 项目链接 : https://aakaran.github.io/reasoning_with_sampling/
- gitHub仓库 : https://github.com/aakaran/reasoning-with-sampling
英文摘要
Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilites can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
中文摘要
在以强化学习 (RL) 对大语言模型 (LLM) 进行后训练的推动下,前沿推理模型在众多学科中展现出惊人能力。然而,尽管这一范式大获成功,许多文献都聚焦于辨析哪些行为是 RL 中涌现、而基础模型中并不存在的真正新能力。我们的工作从另一个角度切入,转而追问:不做任何额外训练,仅靠推理时的纯采样,能否从基础模型中引出可与之相当的推理能力?受从"锐化"分布中采样的马尔可夫链蒙特卡罗 (MCMC) 技术启发,我们提出一种利用基础模型自身似然的简单迭代采样算法。在不同的基础模型上,我们的算法在推理上带来显著提升,在 MATH500、HumanEval 和 GPQA 等多种单次任务上几乎追平甚至超过 RL。此外,我们的采样器避免了 RL 后训练特有的多样本多样性坍缩。至关重要的是,该方法不需要训练、精心整理的数据集或验证器,因而有望广泛适用于易验证领域之外。
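下面用一个玩具"语言模型"(二元马尔可夫链)演示这一思路的一种简化形式:目标分布取基础模型似然的幂 p(x)^α(α>1 即"锐化"),提议分布为"固定前缀、用基础模型重采样后缀"。此时 Metropolis-Hastings 接受率简化为条件似然比的 (α-1) 次幂。论文算法的具体细节与此处假设可能不同,这里仅示意原理。

```python
import numpy as np

rng = np.random.default_rng(0)
V, L, ALPHA = 5, 12, 4.0                 # 词表大小、序列长度、锐化指数
P = rng.dirichlet(np.ones(V), size=V)    # 玩具"基础模型":二元转移矩阵
BOS = 0                                  # 固定的起始符

def sample_suffix(prev, length):
    """用基础模型自回归地采样一段后缀,同时返回其条件对数似然。"""
    toks, logp = [], 0.0
    for _ in range(length):
        t = rng.choice(V, p=P[prev])
        logp += np.log(P[prev, t])
        toks.append(int(t)); prev = t
    return toks, logp

def cond_logp(prev, toks):
    return sum(np.log(P[a, b]) for a, b in zip([prev] + toks[:-1], toks))

x, _ = sample_suffix(BOS, L)             # 初始化:整条序列直接从基础模型采样
for _ in range(2000):                    # MH 迭代:目标分布 ∝ p(x)^ALPHA
    c = rng.integers(0, L)               # 固定前缀 x[:c](可为空),重采样其余部分
    prev = BOS if c == 0 else x[c - 1]
    new_suffix, logp_new = sample_suffix(prev, L - c)
    logp_old = cond_logp(prev, x[c:])
    # 接受率 = (p(new)/p(old))^(ALPHA-1):提议密度与目标中的基础似然部分相消
    if np.log(rng.uniform()) < (ALPHA - 1.0) * (logp_new - logp_old):
        x = x[:c] + new_suffix

print(x)  # 经过"锐化"后的高似然序列
```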
Agent 能征服网络吗?探索ChatGPT Atlas Agent在网页游戏中的前沿
- 标题: Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
- 作者: Jingran Zhang, Ning Li, Justin Cui
- 日期: 2025-10-30
- ArXiv主页 : https://arxiv.org/abs/2510.26298
- 论文链接 : https://arxiv.org/pdf/2510.26298
- 项目链接 : https://atlas-game-eval.github.io/
英文摘要
OpenAI's ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas's web interaction capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.
中文摘要
OpenAI 的 ChatGPT Atlas 引入了新的 Web 交互功能,使模型能够分析网页、处理用户意图以及直接在浏览器中执行光标和键盘输入。虽然它的信息检索任务能力已经得到证明,但它在动态、交互式环境中的性能仍然很少被探索。在本研究中,我们使用基于浏览器的游戏作为测试场景,对 Atlas 的网络交互能力进行了早期评估,包括 Google 的 T-Rex Runner、数独、Flappy Bird 和 Stein.world。我们使用游戏中的表现得分作为定量指标来评估不同任务类型的表现。我们的结果表明,Atlas 在数独等逻辑推理任务中表现出色,完成谜题的速度明显快于人类基线,但在需要精确计时和运动控制的实时游戏中表现不佳,常常无法超越最初的障碍。这些发现表明,虽然 Atlas 展示了强大的分析处理能力,但在需要实时交互的动态 Web 环境中仍然存在明显的局限性。我们项目的网站可以在 https://atlas-game-eval.github.io 找到。
监督强化学习:从专家轨迹到逐步推理
- 标题: Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- 作者: Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee
- 日期: 2025-10-29
- ArXiv主页 : https://arxiv.org/abs/2510.25992
- 论文链接 : https://arxiv.org/pdf/2510.25992
英文摘要
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
中文摘要
大型语言模型 (LLM) 常常难以应对需要多步推理的问题。对小规模开源模型而言,当正确解答即便多次尝试也极少被采样到时,可验证奖励强化学习 (RLVR) 便会失效;而监督微调 (SFT) 又容易通过刻板的逐 token 模仿过拟合冗长的示范。为弥补这一空白,我们提出监督强化学习 (SRL),一个把问题求解重新表述为生成一系列逻辑"动作"的框架。SRL 训练模型在执行每个动作之前先生成内部推理独白,并以逐步的方式,依据模型动作与从 SFT 数据集中提取的专家动作之间的相似度提供更平滑的奖励。即使所有 rollout 都不正确,这种监督也能提供更丰富的学习信号,同时鼓励模型在专家示范引导下进行灵活推理。因此,SRL 使小模型能够学会此前 SFT 或 RLVR 都学不会的难题。此外,先用 SRL 初始化训练、再用 RLVR 精炼,可获得最强的整体性能。在推理基准之外,SRL 还能有效泛化到智能体软件工程任务,成为面向推理的 LLM 的一个稳健而通用的训练框架。
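下面的纯 Python 片段示意"逐步的、基于与专家动作相似度的平滑奖励":把模型输出解析成动作序列,与从专家轨迹抽取的动作逐步比对,用字符串相似度给出每步奖励;这里的相似度函数(difflib)与动作表示方式是为演示选择的假设,并非论文的具体打分方式。

```python
from difflib import SequenceMatcher

def step_rewards(model_actions, expert_actions):
    """逐步奖励(示意):第 i 步奖励 = 模型动作与专家动作的相似度,
    即使整条轨迹最终答案错误,也能提供平滑的学习信号。"""
    rewards = []
    for i, expert in enumerate(expert_actions):
        predicted = model_actions[i] if i < len(model_actions) else ""
        rewards.append(SequenceMatcher(None, predicted, expert).ratio())
    return rewards

expert = ["open file solver.py", "locate function simplify", "apply patch to line 42"]
model  = ["open file solver.py", "locate function simplify_expr", "run tests"]
r = step_rewards(model, expert)
print([round(v, 2) for v in r], "mean:", round(sum(r) / len(r), 2))
```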
工具十项全能:针对多样化、现实和长期任务执行的语言代理基准测试
- 标题: The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
- 作者: Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He
- 日期: 2025-10-29
- ArXiv主页 : https://arxiv.org/abs/2510.25726
- 论文链接 : https://arxiv.org/pdf/2510.25726
- 项目链接 : https://toolathlon.xyz/introduction
- gitHub仓库 : https://github.com/hkust-nlp/Toolathlon
英文摘要
Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
中文摘要
现实世界的语言代理必须处理跨不同应用程序的复杂、多步骤的工作流程。例如,代理可以通过与日历和文件系统协调来管理电子邮件,或者监视生产数据库以检测异常并按照操作手册生成报告。然而,现有的语言代理基准通常侧重于狭窄的领域或简化的任务,缺乏评估代理实际性能所需的多样性、现实性和长期复杂性。为了解决这一差距,我们引入了工具十项全能(简称为 Toolathlon),这是语言代理的基准,提供多样化的应用程序和工具、现实的环境设置以及可靠的基于执行的评估。Toolathlon 涵盖 32 个软件应用程序和 604 个工具,从 Google Calendar 和 Notion 等日常平台到 WooCommerce、Kubernetes 和 BigQuery 等专业平台。大多数工具都基于一组高质量的模型上下文协议(MCP)服务器,我们可能已经自行修改或实现了这些服务器。与之前的工作主要确保功能真实性但提供有限的环境状态多样性不同,我们通过真实软件提供真实的初始环境状态,例如包含数十名学生的 Canvas 课程或真实的财务电子表格。该基准测试总共包括 108 个手动来源或制作的任务,平均需要与多个应用程序交互超过 20 轮才能完成。每项任务都可以通过专用的评估脚本进行严格验证。对SOTA模型的综合评价凸显了它们的显着缺点:性能最好的模型Claude-4.5-Sonnet仅实现了38.6%的成功率,平均工具调用次数为20.2次,而顶级的开放权重模型DeepSeek-V3.2-Exp达到了20.1%。我们期望 Toolathlon 能够推动开发更强大的语言代理,以执行现实世界的长期任务。
视频即提示:视频生成的统一语义控制
- 标题: Video-As-Prompt: Unified Semantic Control for Video Generation
- 作者: Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, Qiang Xu
- 日期: 2025-10-23
- ArXiv主页 : https://arxiv.org/abs/2510.20888
- 论文链接 : https://arxiv.org/pdf/2510.20888
- 项目链接 : https://bytedance.github.io/Video-As-Prompt/
- gitHub仓库 : https://github.com/bytedance/Video-As-Prompt
英文摘要
Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.
中文摘要
视频生成中统一且可泛化的语义控制仍是一个关键的开放性挑战。现有方法要么因基于结构的控制强加不恰当的逐像素先验而引入伪影,要么依赖难以泛化的、针对特定条件的微调或特定任务的架构。我们提出视频即提示 (Video-As-Prompt, VAP),一种把该问题重新表述为上下文内生成的新范式。VAP 把一段参考视频直接用作语义提示,通过一个即插即用的混合 Transformer (MoT) 专家来引导冻结的视频扩散 Transformer (DiT)。该架构可防止灾难性遗忘,并由带时间偏置的位置嵌入引导,消除了虚假的映射先验,实现稳健的上下文检索。为支撑这一方法并推动后续研究,我们构建了 VAP-Data,这是目前最大的语义可控视频生成数据集,包含覆盖 100 种语义条件的超过 10 万个配对视频。作为单一的统一模型,VAP 为开源方法确立了新的最先进水平,取得 38.7% 的用户偏好率,可与领先的特定条件商业模型相媲美。VAP 强大的零样本泛化能力及对多种下游应用的支持,标志着向通用、可控视频生成迈出的重要一步。
使用流程挖掘的推理感知 GRPO
- 标题: Reasoning-Aware GRPO using Process Mining
- 作者: Taekhyun Park, Yongjae Lee, Hyerim Bae
- 日期: 2025-10-29
- ArXiv主页 : https://arxiv.org/abs/2510.25065
- gitHub仓库 : https://github.com/Thrillcrazyer/THIP
英文摘要
Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
中文摘要
基于强化学习 (RL) 的后训练对于让大型推理模型 (LRM) 具备多步推理能力至关重要,但当前的奖励方案通常以结果为中心。我们提出 PM4GRPO,一种推理感知的组相对策略优化 (GRPO),在标准的答案/格式奖励之外,加入针对推理过程的信号。为此,我们利用流程挖掘技术计算一个标量的一致性 (conformance) 奖励,衡量策略模型的推理与预训练教师模型的吻合程度。在五个基准上的实证结果表明,PM4GRPO 显著优于现有的基于 GRPO 的后训练方法。这些结果表明,借助流程挖掘实现推理感知的 GRPO,能够有效增强策略模型的推理能力。
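下面用 numpy 勾勒这一思路中的奖励组合与 GRPO 的组内相对优势计算:在答案/格式奖励之外叠加一个"推理轨迹与教师轨迹一致性"的标量奖励。这里的一致性函数用简化的步骤集合重合度代替真实的流程挖掘 conformance checking,所有示例数据均为假设。

```python
import numpy as np

def conformance(trace, teacher_trace):
    """简化的"一致性"奖励:学生推理步骤与教师步骤的重合比例(仅作示意)。"""
    if not teacher_trace:
        return 0.0
    return len(set(trace) & set(teacher_trace)) / len(set(teacher_trace))

def grpo_advantages(rewards):
    """GRPO:对同一问题的一组 rollout 做组内标准化,得到相对优势。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

teacher = ["设未知数", "列方程", "移项", "求解", "验算"]
group = [  # 同一道题的 4 条 rollout:(答案得分, 格式得分, 推理步骤)
    (1.0, 1.0, ["设未知数", "列方程", "求解"]),
    (0.0, 1.0, ["猜测答案"]),
    (1.0, 0.0, ["列方程", "移项", "求解", "验算"]),
    (0.0, 0.0, ["设未知数", "画图"]),
]
rewards = [ans + fmt + conformance(steps, teacher) for ans, fmt, steps in group]
print(np.round(rewards, 2), np.round(grpo_advantages(rewards), 2))
```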
WorldGrow:生成无限 3D 世界
- 标题: WorldGrow: Generating Infinite 3D World
- 作者: Sikuang Li, Chen Yang, Jiemin Fang, Taoran Yi, Jia Lu, Jiazhong Cen, Lingxi Xie, Wei Shen, Qi Tian
- 日期: 2025-10-24
- ArXiv主页 : https://arxiv.org/abs/2510.21682
- 论文链接 : https://arxiv.org/pdf/2510.21682
- 项目链接 : https://world-grow.github.io/
- gitHub仓库 : https://github.com/world-grow/WorldGrow
英文摘要
We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is leveraging strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.
中文摘要
我们应对生成无限可扩展的 3D 世界的挑战 - 具有连贯几何形状和逼真外观的大型连续环境。现有方法面临关键挑战:2D 提升方法面临着不同视图之间几何和外观不一致的问题,3D 隐式表示难以扩展,当前的 3D 基础模型大多以对象为中心,限制了它们在场景级生成中的适用性。我们的主要见解是利用预训练 3D 模型的强大生成先验来生成结构化场景块。为此,我们提出了 WorldGrow,一个用于无界 3D 场景合成的分层框架。我们的方法具有三个核心组件:(1)数据管理管道,提取高质量场景块进行训练,使 3D 结构化潜在表示适合场景生成;(2) 3D 块修复机制,支持上下文感知场景扩展;(3)从粗到细的生成策略,确保全局布局的合理性和局部几何/纹理的保真度。在大规模 3D-FRONT 数据集上进行评估,WorldGrow 在几何重建方面实现了 SOTA 性能,同时独特地支持无限场景生成,具有逼真且结构一致的输出。这些结果凸显了其构建大规模虚拟环境的能力和构建未来世界模型的潜力。
前瞻锚定:在音频驱动的人类动画中保留角色身份
- 标题: Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
- 作者: Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, Jie Shen
- 日期: 2025-10-27
- ArXiv主页 : https://arxiv.org/abs/2510.23581
- 论文链接 : https://arxiv.org/pdf/2510.23581
- 项目链接 : https://lookahead-anchoring.github.io/
- gitHub仓库 : https://github.com/j0seo/lookahead-anchoring
英文摘要
Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: https://lookahead-anchoring.github.io.
中文摘要
音频驱动的人类动画模型在时间自回归生成过程中经常出现身份漂移,即角色随着时间推移逐渐失去其身份。一种解决方案是生成关键帧作为中间时间锚点以防止退化,但这需要额外的关键帧生成阶段,并且会限制自然的运动动态。为了解决这个问题,我们提出了前瞻锚定(Lookahead Anchoring),它利用位于当前生成窗口之后、更未来时间步的关键帧,而非窗口内部的关键帧。这将关键帧从固定边界转变为方向信标:模型在响应即时音频提示的同时不断追寻这些未来锚点,通过持续的引导保持一致的身份。这还支持自关键帧(self-keyframing),即参考图像本身充当前瞻目标,从而完全省去关键帧生成。我们发现,时间前瞻距离自然地控制着表现力与一致性之间的平衡:距离越大,运动自由度越高;距离越小,身份保持越强。应用于最近的三个人类动画模型时,前瞻锚定在唇形同步、身份保留和视觉质量上均表现更优,展示了在多种不同架构上改进的时间条件控制。视频结果可通过以下链接获取:https://lookahead-anchoring.github.io。
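下面用一个假设性的伪接口草图示意"锚点指向当前窗口之后 lookahead 帧处"以及自关键帧(直接用参考图作为锚点)的做法;generate_window 的签名与返回值均为假设,并非论文实现:

```python
def _dummy_generate_window(prev_frames, audio_chunk, anchor):
    """占位:真实系统中由扩散式人像动画模型生成帧,此处仅返回记录条件的字典。"""
    return [{"audio": a, "anchor_time": anchor["time"]} for a in audio_chunk]

def animate(reference_image, audio_frames, window=16, lookahead=32,
            generate_window=_dummy_generate_window):
    """逐窗口自回归生成;锚点不在窗口内部,而是放在窗口结束后再往后 lookahead 帧处。
    自关键帧:直接用参考图像作为锚点图像,无需额外的关键帧生成阶段。"""
    frames = []
    t = 0
    while t < len(audio_frames):
        audio_chunk = audio_frames[t:t + window]
        anchor = {"image": reference_image, "time": t + window + lookahead}
        frames += generate_window(frames[-window:], audio_chunk, anchor)
        t += window
    return frames

# 用法示例:lookahead 越大运动越自由,越小则身份保持越强
clip = animate(reference_image="ref.png", audio_frames=list(range(48)), lookahead=32)
```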
VITA-E:同时看到、听到、说和做的自然体现互动
- 标题: VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
- 作者: Xiaoyu Liu, Chaoyou Fu, Chi Yan, Chu Wu, Haihan Gao, Yi-Fan Zhang, Shaoqi Dong, Cheng Qian, Bin Luo, Xiuyong Yang, Guanwu Li, Yusheng Cai, Yunhang Shen, Deqiang Jiang, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He
- 日期: 2025-10-21
- ArXiv主页 : https://arxiv.org/abs/2510.21817
- 论文链接 : https://arxiv.org/pdf/2510.21817
- 项目链接 : https://ltbai.github.io/VITA-VLA
- gitHub仓库 : https://github.com/VITA-MLLM/VITA
英文摘要
Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a "model-as-controller" paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
中文摘要
当前的视觉-语言-动作(VLA)模型通常受限于僵化、静态的交互范式,既缺乏同时看、听、说和行动的能力,也无法动态处理实时的用户打断。这阻碍了无缝的具身协作,导致用户体验不灵活且响应迟钝。为了解决这些限制,我们引入了 VITA-E,一种专为行为并发和近实时打断设计的新型具身交互框架。我们方法的核心是双模型架构,其中两个并行的 VLA 实例分别作为"活动模型"和"备用模型"运行,使具身智能体能够同时且可中断地观察环境、聆听用户语音、给出口头回应并执行动作,模仿类人的多任务处理能力。我们进一步提出了"模型即控制器"范式,即微调 VLM 以生成充当直接系统级命令的特殊令牌,将模型的推理与系统的行为耦合起来。在实体人形平台上进行的实验表明,VITA-E 能够可靠地处理复杂的交互场景。我们的框架与多种双系统 VLA 模型兼容,在紧急停止和语音打断上实现了极高的成功率,同时还能成功执行并发的语音与动作。这代表着向更自然、更有能力的具身助手迈出了重要一步。
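下面是"双模型 + 模型即控制器"调度逻辑的概念性草图;特殊令牌名称(如 <EMERGENCY_STOP>)与 VLA 实例的接口均为假设,仅示意活动/备用模型切换以及令牌到系统命令的映射:

```python
class DualVLA:
    """'双模型 + 模型即控制器'的调度示意:active/standby 两个 VLA 实例可随时互换。"""

    def __init__(self, active, standby):
        self.active, self.standby = active, standby

    def on_user_interrupt(self, new_instruction):
        # 用户打断时,备用模型立即接管新指令,原活动模型转入备用
        self.active, self.standby = self.standby, self.active
        return self.active.start(new_instruction)          # 假设的接口

    def step(self, observation, audio):
        token = self.active.generate(observation, audio)    # VLM 输出(假设的接口)
        if token == "<EMERGENCY_STOP>":                      # 特殊令牌直接映射为系统级命令
            return {"command": "halt_all_motors"}
        if token == "<SPEAK>":
            return {"command": "tts", "payload": self.active.last_text()}
        return {"command": "act", "payload": token}
```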
Uniform Discrete Diffusion with Metric Path for Video Generation
- 标题: Uniform Discrete Diffusion with Metric Path for Video Generation
- 作者: Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Wenxuan Wang, Chunhua Shen, Shiguang Shan, Zhaoxiang Zhang, Xinlong Wang
- 日期: 2025-10-28
- ArXiv主页 : https://arxiv.org/abs/2510.24717
- 论文链接 : https://arxiv.org/pdf/2510.24717
- 项目链接 : https://bitterdhg.github.io/URSA_page
- gitHub仓库 : https://github.com/baaivision/URSA
英文摘要
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA
中文摘要
连续空间的视频生成发展迅速,而离散方法由于误差累积和长上下文不一致而落后。在这项工作中,我们重新审视离散生成建模,提出带度量路径的均匀离散扩散(URSA),一个简单而强大的框架,弥合了离散方法与连续方法在可扩展视频生成上的差距。URSA 的核心是将视频生成任务表述为对离散时空 token 的迭代全局细化。它集成了两个关键设计:线性化度量路径,以及与分辨率相关的时间步偏移机制。这些设计使 URSA 能够高效扩展到高分辨率图像合成和长时视频生成,同时所需的推理步骤显著减少。此外,我们引入了一种异步时间微调策略,将插值、图像到视频生成等多种任务统一在单个模型中。在具有挑战性的视频和图像生成基准上的大量实验表明,URSA 始终优于现有的离散方法,并取得了与最先进的连续扩散方法相当的性能。代码和模型可在 https://github.com/baaivision/URSA 获取。
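下面的草图示意"与分辨率相关的时间步偏移"以及对离散 token 的迭代全局细化采样循环;偏移公式借用常见的 shift 启发式而非论文原式,model.refine 等接口亦为假设:

```python
import math

def shift_timestep(t, resolution, base_resolution=256):
    """t ∈ [0, 1];分辨率越高偏移越强,让更多采样步停留在高噪声区间(假设的启发式)。"""
    shift = math.sqrt(resolution / base_resolution)
    return shift * t / (1.0 + (shift - 1.0) * t)

def sample_tokens(model, tokens, num_steps=16, resolution=512):
    """迭代全局细化:每一步对整段离散时空 token 重新预测,而非逐 token 自回归。"""
    for i in range(num_steps, 0, -1):
        t = shift_timestep(i / num_steps, resolution)
        tokens = model.refine(tokens, t)   # 假设的模型接口:返回细化后的完整 token 序列
    return tokens
```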
IGGT:用于语义 3D 重建的基于实例的几何变换器
- 标题: IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
- 作者: Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu
- 日期: 2025-10-26
- ArXiv主页 : https://arxiv.org/abs/2510.22706
- 论文链接 : https://arxiv.org/pdf/2510.22706
- 项目链接 : https://lifuguan.github.io/IGGT_official
- gitHub仓库 : https://github.com/lifuguan/IGGT_official
英文摘要
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose InstanceGrounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
中文摘要
人类自然地将 3D 世界的几何结构和语义内容视为相互交织的维度,从而能够对复杂场景形成连贯而准确的理解。然而,大多数现有方法优先训练大型几何模型进行低层次的 3D 重建,而将高层次的空间理解孤立处理,忽视了 3D 场景分析中这两个基本方面之间的关键相互作用,从而限制了泛化能力,导致在下游 3D 理解任务上表现不佳。最近的尝试通过简单地将 3D 模型与特定语言模型对齐来缓解这一问题,但这会把感知能力限制在对齐模型的容量之内,并限制对下游任务的适应性。在本文中,我们提出了基于实例的几何变换器(IGGT),一种端到端的大型统一变换器,用于统一空间重建与实例级上下文理解的知识。具体来说,我们设计了一种 3D 一致的对比学习策略,引导 IGGT 仅通过 2D 视觉输入,编码出同时包含几何结构与实例级聚类的统一表示。这种表示支持将 2D 视觉输入一致地提升为具有明确区分的物体实例的连贯 3D 场景。为了支持这项任务,我们进一步构建了 InsScene-15K,一个包含高质量 RGB 图像、位姿、深度图和 3D 一致的实例级掩码标注的大规模数据集,并配有新颖的数据整理管道。
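下面给出一个跨视角、按 3D 实例分组的对比损失(InfoNCE 风格)示意实现;温度等超参数与具体损失形式均为假设,论文中的 3D 一致对比学习以原文为准:

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(feats, instance_ids, temperature=0.07):
    """feats: (N, D) 来自多视角像素/点的特征;instance_ids: (N,) 对应的 3D 实例编号。
    同一实例(可来自不同视角)的特征互为正样本,其余为负样本。"""
    feats = F.normalize(feats, dim=-1)
    logits = feats @ feats.t() / temperature                       # (N, N) 相似度
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = (instance_ids.unsqueeze(0) == instance_ids.unsqueeze(1)) & ~eye
    logits = logits.masked_fill(eye, float("-inf"))                # 排除自身
    log_prob = logits.log_softmax(dim=-1).masked_fill(eye, 0.0)
    loss = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

# 用法示例:8 个特征、3 个实例
loss = instance_contrastive_loss(torch.randn(8, 16),
                                 torch.tensor([0, 0, 1, 1, 1, 2, 2, 0]))
```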
探索机器人控制中扩散模型的条件
- 标题: Exploring Conditions for Diffusion models in Robotic Control
- 作者: Heeseong Shin, Byeongho Heo, Dongyoon Han, Seungryong Kim, Taekyung Kim
- 日期: 2025-10-17
- ArXiv主页 : https://arxiv.org/abs/2510.15510
- 论文链接 : https://arxiv.org/pdf/2510.15510
- 项目链接 : https://orca-rc.github.io/
英文摘要
While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions - a successful strategy in other vision domains - yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
中文摘要
虽然预训练的视觉表示显著推动了模仿学习,但它们在策略学习期间保持冻结,因此通常与任务无关。在这项工作中,我们探索利用预训练的文本到图像扩散模型,在不微调模型本身的情况下,为机器人控制获取任务自适应的视觉表示。然而,我们发现简单地套用文本条件(这一策略在其他视觉领域很成功)在控制任务中收益甚微,甚至带来负面影响。我们将其归因于扩散模型训练数据与机器人控制环境之间的领域差距,并据此主张条件应当考虑控制所需的特定、动态的视觉信息。为此,我们提出了 ORCA,它引入了适应控制环境的可学习任务提示,以及捕获细粒度、帧级细节的视觉提示。通过用我们新设计的条件来促进任务自适应表示,我们的方法在多个机器人控制基准上实现了最先进的性能,显著超越了以前的方法。
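下面的草图示意如何把可学习任务提示与逐帧视觉提示拼接后,作为冻结扩散骨干的条件来提取表示;frozen_unet、visual_encoder 的接口与各维度均为假设,并非 ORCA 的官方实现:

```python
import torch
import torch.nn as nn

class ConditionedRepr(nn.Module):
    """可学习任务提示 + 逐帧视觉提示作为冻结扩散骨干的条件,其输出特征供策略网络使用。"""

    def __init__(self, frozen_unet, visual_encoder, num_task_tokens=8, dim=768):
        super().__init__()
        self.unet = frozen_unet.eval()                   # 冻结的文生图扩散骨干(假设的模块)
        for p in self.unet.parameters():
            p.requires_grad_(False)
        # 可学习任务提示:随策略训练更新,以适应具体控制环境
        self.task_prompt = nn.Parameter(torch.randn(num_task_tokens, dim) * 0.02)
        self.visual_encoder = visual_encoder             # 把当前帧编码为细粒度提示 token(假设)

    def forward(self, image, noisy_latent, t):
        vis_tokens = self.visual_encoder(image)                        # (B, K, dim)
        task = self.task_prompt.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        cond = torch.cat([task, vis_tokens], dim=1)                    # 拼接为条件序列
        return self.unet(noisy_latent, t, encoder_hidden_states=cond)  # 提取的特征作为视觉表示
```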