【论文速递】2025年第34周(Aug-17-23)(Robotics/Embodied AI/LLM)

中文使用 googletrans 翻译,翻译不对的地方以英文为准

目录

DINOv3
Intern-S1:科学多模态基础模型
Chain-of-Agents(代理链):通过多代理蒸馏与代理强化学习构建端到端代理基础模型
Ovis2.5 技术报告
SSRL:自我搜索强化学习
DeepConf:带置信度的深度思考
DuPO:通过对偶偏好优化实现可靠的LLM自我验证
Thyme:超越图像思考
ComoRAG:认知启发的记忆组织式RAG,用于有状态的长篇叙事推理
FutureX:面向未来预测的LLM代理的先进实时基准测试
MeshCoder:LLM驱动的从点云生成结构化网格代码
Mobile-Agent-v3:面向GUI自动化的基础代理
4DNeX:让前馈式4D生成建模变得简单
BeyondWeb:面向万亿级预训练扩展合成数据的经验教训
LongSplat:面向随手拍摄长视频的鲁棒无位姿3D高斯泼溅
从分数到技能:评估金融大语言模型的认知诊断框架
速度总是获胜:大语言模型高效架构综述
下一个视觉粒度生成
POML:提示编排标记语言
LiveMCP-101:在高难度查询上对支持MCP的代理进行压力测试与诊断
S^2-Guidance:免训练扩散模型的随机自引导
MCP-Universe:使用真实世界模型上下文协议(MCP)服务器对大型语言模型进行基准测试
XQuant:通过KV缓存重物化(Rematerialization)突破LLM推理的内存墙
Tinker:扩散模型送给3D的礼物,无需逐场景优化即可由稀疏输入实现多视角一致编辑
当标点符号很重要:LLM提示鲁棒性方法的大规模比较
NVIDIA Nemotron Nano 2:准确高效的混合Mamba-Transformer推理模型
Waver:迈向栩栩如生的视频生成


DINOv3

英文摘要

Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

中文摘要

自监督学习有望免除人工数据标注的需求,使模型能够轻松扩展到海量数据集与更大的架构。由于不针对特定任务或领域定制,这一训练范式有潜力用同一套算法从多样的数据源(从自然图像到航拍图像)中学习视觉表征。本技术报告介绍了DINOv3,它通过一系列简单而有效的策略,朝实现这一愿景迈出了重要一步。首先,我们通过细致的数据准备、设计与优化,充分利用同时扩大数据集与模型规模带来的收益。其次,我们提出了一种称为Gram锚定(Gram anchoring)的新方法,有效解决了长时间训练中密集特征图退化这一已知但悬而未决的问题。最后,我们采用一系列事后(post-hoc)策略,进一步增强模型在分辨率、模型规模以及与文本对齐方面的灵活性。最终,我们得到一个通用的视觉基础模型,它在广泛的设定下无需微调即可超越各类专用的最先进模型。DINOv3生成的高质量密集特征在各种视觉任务上表现出色,显著超越此前的自监督与弱监督基础模型。我们还发布了DINOv3视觉模型套件,通过为不同资源约束与部署场景提供可扩展的方案,推动各类任务与数据上的最新进展。
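
摘要只提到Gram锚定(Gram anchoring)用于缓解长时间训练中密集特征图退化的问题,并未给出具体公式。下面是按字面意思给出的一个最小示意(PyTorch):用一个早期"锚定"模型的patch特征Gram矩阵来约束当前模型的patch相似度结构。损失形式与变量名均为本文的假设,并非官方实现。

```python
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_feats: torch.Tensor, anchor_feats: torch.Tensor) -> torch.Tensor:
    """student_feats / anchor_feats: (B, N, D) patch features from the current
    model and from an earlier "anchor" checkpoint on the same images.
    Penalizing drift of the pairwise patch-similarity (Gram) structure is one
    way to keep dense feature maps from degrading during long training runs."""
    s = F.normalize(student_feats, dim=-1)
    a = F.normalize(anchor_feats, dim=-1)
    gram_s = s @ s.transpose(1, 2)   # (B, N, N) patch-to-patch similarities
    gram_a = a @ a.transpose(1, 2)
    return F.mse_loss(gram_s, gram_a)

# shape-only usage example with random tensors
loss = gram_anchoring_loss(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
```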


Intern-S1:科学多模态基础模型

  • 标题: Intern-S1: A Scientific Multimodal Foundation Model

  • 作者: Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou

  • 日期: 2025-08-21

  • ArXiv主页 : https://arxiv.org/abs/2508.15763

  • 论文链接 : https://arxiv.org/pdf/2508.15763

  • GitHub仓库 : https://github.com/InternLM/Intern-S1

英文摘要

In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training.On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.

中文摘要

近年来,涌现出大量开源基础模型,它们在一些广受关注的领域取得了显著进展,性能已非常接近闭源模型。然而,在高价值但更具挑战性的科学专业领域,要么仍依赖专家模型,要么通用基础模型的进展明显落后于热门领域,远不足以变革科学研究,开源模型与闭源模型在这些科学领域仍存在巨大差距。为缩小这一差距并向通用人工智能(AGI)更进一步,我们推出Intern-S1,这是一个兼具通用理解与推理能力、并具备分析多种科学模态数据专长的"专业化通才"。Intern-S1是一个多模态混合专家(MoE)模型,激活参数量280亿、总参数量2410亿,在5T token上进行持续预训练,其中超过2.5T token来自科学领域。在后训练阶段,Intern-S1在InternBootCamp中先后进行离线与在线强化学习(RL),我们提出混合奖励(Mixture-of-Rewards,MoR)以协同地在1000多个任务上同时进行RL训练。通过在算法、数据与训练系统上的综合创新,Intern-S1在在线RL训练中取得了顶尖性能。在综合评测基准上,Intern-S1在通用推理任务上与其他开源模型相比具有竞争力,并在科学领域显著超越开源模型,在分子合成规划、反应条件预测、晶体热力学稳定性预测等专业任务上超过闭源的最先进模型。我们的模型可在 https://huggingface.co/internlm/Intern-S1 获取。
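
摘要提到的混合奖励(Mixture-of-Rewards,MoR)用于在1000多个任务上协同进行RL训练,但摘要未给出实现细节。下面是一个示意性草图,仅表达"按任务类型把回复路由到不同的可验证奖励来源,并保持奖励量纲一致"的思路;其中的任务名与函数均为假设,并非Intern-S1的真实接口。

```python
# Illustrative sketch only; task names and reward functions are assumptions.
def exact_match_reward(response: str, reference: str) -> float:
    return 1.0 if response.strip() == reference.strip() else 0.0

def numeric_reward(response: str, reference: str, tol: float = 1e-3) -> float:
    try:
        return 1.0 if abs(float(response) - float(reference)) <= tol else 0.0
    except ValueError:
        return 0.0

REWARD_FNS = {
    "reaction_condition": exact_match_reward,  # verifiable scientific tasks
    "thermo_stability": numeric_reward,
}

def mixture_of_rewards(task_type: str, response: str, reference: str) -> float:
    fn = REWARD_FNS.get(task_type, exact_match_reward)
    return fn(response, reference)             # all rewards kept in [0, 1] so tasks can be mixed
```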


Chain-of-Agents(代理链):通过多代理蒸馏与代理强化学习构建端到端代理基础模型

英文摘要

Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and can not benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models' capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We make the entire research, including the model weights, code for training and evaluation, and the training data, fully open-sourced, which offers a solid starting point for future research on agent models and agentic RL.

中文摘要

大型语言模型(LLM)与多代理系统的最新进展,已在深度研究、vibe coding、数学推理等复杂问题求解任务中展现出卓越能力。然而,现有的多代理系统大多建立在人工提示词/工作流工程与复杂的代理框架之上,计算效率低、能力受限,且无法从以数据为中心的学习中获益。在这项工作中,我们提出代理链(Chain-of-Agents,CoA),这是一种新的LLM推理范式,使单个模型能够像多代理系统那样(即借助多种工具与多个代理进行多轮问题求解)以原生端到端的方式解决复杂问题。在代理链式求解中,模型动态激活不同的工具代理与角色扮演代理,以端到端方式模拟多代理协作。为了在LLM中激发端到端的代理链式求解能力,我们提出了一个多代理蒸馏框架,将最先进的多代理系统蒸馏成代理链轨迹,用于代理式监督微调;随后在可验证的代理任务上进行代理式强化学习,进一步提升模型的代理链式求解能力。我们将得到的模型称为代理基础模型(Agent Foundation Models,AFM)。实证研究表明,AFM在网页代理与代码代理两类设定下的多个基准上均取得新的最先进成绩。我们完全开源了整个研究,包括模型权重、训练与评测代码以及训练数据,为代理模型与代理式RL的后续研究提供了坚实起点。


Ovis2.5 技术报告

  • 标题: Ovis2.5 Technical Report

  • 作者: Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, Shengze Shi, Weihong Zhang, Guodong Zheng, Junpeng Jiang, Sensen Gao, Yi-Feng Wu, Sijia Chen, Yuhui Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

  • 日期: 2025-08-15

  • ArXiv主页 : https://arxiv.org/abs/2508.11737

  • 论文链接 : https://arxiv.org/pdf/2508.11737

  • GitHub仓库 : https://github.com/AIDC-AI/Ovis

英文摘要

We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.

中文摘要

我们推出Ovis2.5,它是Ovis2的继任者,面向原生分辨率视觉感知与强大的多模态推理而设计。Ovis2.5集成了一个原生分辨率视觉Transformer,可按图像的原生可变分辨率进行处理,避免固定分辨率切片带来的退化,同时保留细节与整体布局,这对复杂图表等视觉密集内容至关重要。为了加强推理能力,我们训练模型超越线性思维链,进行包括自我检查与修订在内的反思。这一高级能力在推理时以可选的"思考模式"提供,允许用户以延迟换取在困难输入上的更高准确率。模型通过完整的五阶段课程逐步构建能力:先进行基础的视觉与多模态预训练,再经过大规模指令微调,最后用DPO与GRPO完成对齐与推理增强。为高效扩展这些升级,我们采用多模态数据打包与混合并行,带来了显著的端到端加速。我们发布两个开源模型:Ovis2.5-9B与Ovis2.5-2B。后者延续Ovis2"小模型、大性能"的理念,适合资源受限的端侧场景。在OpenCompass多模态榜单上,Ovis2.5-9B平均得分78.3,较前代Ovis2-8B大幅提升,并在40B参数以下的开源MLLM中取得最先进成绩;Ovis2.5-2B得分73.9,在同规模中创下SOTA。除总分之外,Ovis2.5在STEM基准上取得领先结果,在视觉定位(grounding)与视频任务上表现出强大能力,并在复杂图表分析上达到同规模开源SOTA。


SSRL:自我搜索强化学习

英文摘要

We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.

中文摘要

我们研究了大型语言模型(LLM)作为强化学习(RL)中代理式搜索任务的高效模拟器的潜力,以减少与外部搜索引擎进行昂贵交互的依赖。为此,我们首先通过结构化提示与重复采样来量化LLM的内在搜索能力,并将其称为自我搜索(Self-Search)。结果表明,LLM在推理预算方面表现出很强的扩展行为,在问答基准(包括具有挑战性的BrowseComp任务)上取得很高的pass@k。在这些观察的基础上,我们提出自我搜索强化学习(SSRL),通过基于格式与基于规则的奖励来增强LLM的自我搜索能力。SSRL使模型无需访问外部工具,就能在内部迭代地改进其知识利用。实证评估表明,经SSRL训练的策略模型为搜索驱动的RL训练提供了低成本且稳定的环境,减少了对外部搜索引擎的依赖,并促进了稳健的"模拟到真实"迁移。我们得出以下结论:1)LLM拥有可以被有效激发以实现高性能的世界知识;2)SSRL展示了利用内部知识减少幻觉的潜力;3)SSRL训练的模型无需额外工作即可与外部搜索引擎无缝集成。我们的发现凸显了LLM在支持更可扩展的RL代理训练方面的潜力。
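
摘要提到SSRL使用"基于格式 + 基于规则"的奖励来激发自我搜索能力,但未给出细节。下面是一个最小示意:格式奖励检查轨迹是否暴露出自我搜索结构,规则奖励核对最终答案;标签名与权重均为假设。

```python
import re

def format_reward(text: str) -> float:
    """Reward traces that expose an explicit self-search structure."""
    patterns = (r"<search>.*?</search>", r"<answer>.*?</answer>")
    return 1.0 if all(re.search(p, text, re.S) for p in patterns) else 0.0

def rule_reward(text: str, gold: str) -> float:
    """Rule-based check on the final answer extracted from the trace."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.S)
    pred = m.group(1).strip().lower() if m else ""
    return 1.0 if pred == gold.strip().lower() else 0.0

def ssrl_reward(text: str, gold: str, w_fmt: float = 0.2, w_ans: float = 0.8) -> float:
    return w_fmt * format_reward(text) + w_ans * rule_reward(text, gold)

# usage
print(ssrl_reward("<search>capital of France</search><answer>Paris</answer>", "Paris"))
```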


DeepConf:带置信度的深度思考

英文摘要

Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.

中文摘要

大型语言模型(LLM)借助多数投票的自一致性(self-consistency)等测试时扩展方法,在推理任务上展现出巨大潜力。然而,这类方法往往带来准确率收益递减与高昂的计算开销。为应对这些挑战,我们提出"带置信度的深度思考"(Deep Think with Confidence,DeepConf),这是一种简单而强大的方法,可在测试时同时提升推理效率与性能。DeepConf利用模型内部的置信度信号,在生成过程中或生成之后动态过滤掉低质量的推理轨迹。它不需要额外的模型训练或超参数调优,并可无缝集成到现有的推理服务框架中。我们在多种推理任务与最新开源模型(包括Qwen 3与GPT-OSS系列)上评估了DeepConf。值得注意的是,在AIME 2025等高难度基准上,DeepConf@512取得高达99.9%的准确率,与完全并行思考相比最多可减少84.7%的生成token。
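
DeepConf的核心思路是"用模型内部置信度过滤低质量推理轨迹,再在剩余轨迹上做多数投票"。下面给出一个离线过滤版本的最小示意:以每条轨迹的平均token对数概率作为置信度,按比例保留高置信轨迹;论文实际使用的置信度统计量与过滤策略(例如生成中途过滤)可能与此不同。

```python
from collections import Counter

def deepconf_vote(traces, keep_ratio: float = 0.1) -> str:
    """traces: list of (answer, confidence) pairs, where confidence is a
    model-internal signal such as the mean token log-probability of the trace.
    Keep only the most confident traces, then majority-vote over their answers."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]
    return Counter(ans for ans, _ in kept).most_common(1)[0][0]

# usage
print(deepconf_vote([("42", -0.12), ("41", -0.55), ("42", -0.20), ("40", -0.90)]))
```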


DuPO:通过对偶偏好优化实现可靠的LLM自我验证

英文摘要

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.

中文摘要

我们提出DuPO,一个基于对偶学习的偏好优化框架,可通过广义对偶性生成无需标注的反馈。DuPO解决了两大局限:其一,可验证奖励强化学习(RLVR)依赖昂贵的标签,且仅适用于可验证任务;其二,传统对偶学习只限于严格对偶的任务对(如翻译与回译)。具体而言,DuPO将原任务(primal task)的输入分解为已知与未知两部分,然后构造其对偶任务,利用原任务的输出与已知信息重建未知部分(例如反推数学解题过程以恢复隐藏变量),从而把适用范围扩展到不可逆的任务。该重建的质量作为自监督奖励用于优化原任务,并与LLM用单一模型同时实例化两个任务的能力形成协同。实证上,DuPO在多种任务中取得显著收益:在756个翻译方向上将平均翻译质量提升2.13 COMET;在三个挑战性基准上将数学推理准确率平均提升6.4分;作为推理时重排器(以计算换准确率)再带来9.3分的提升。这些结果使DuPO成为一种可扩展、通用且无需标注的LLM优化范式。
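
按摘要的描述,DuPO的对偶奖励构造方式是:把原任务输入拆成已知/未知两部分,让对偶任务利用原任务输出重建未知部分,重建质量即自监督奖励。下面用摘要中"反推数学题以恢复隐藏变量"的例子给出示意;其中llm是假设的可调用对象,提示词与打分方式亦为假设。

```python
def dupo_dual_reward(llm, known_part: str, hidden_part: str, primal_output: str) -> float:
    """Score a primal solution by how well the dual task reconstructs the
    hidden part of the input from the known part plus the primal output."""
    dual_prompt = (
        f"Known conditions: {known_part}\n"
        f"Proposed solution: {primal_output}\n"
        "Recover the quantity that was hidden from the original problem."
    )
    reconstruction = llm(dual_prompt)  # `llm` is an assumed text -> text callable
    return 1.0 if hidden_part.strip() in reconstruction else 0.0

# usage with a trivial stand-in "model"
fake_llm = lambda prompt: "The hidden value is x = 7."
print(dupo_dual_reward(fake_llm, "2x + 3 = 17", "x = 7", "x = (17 - 3) / 2 = 7"))
```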


Thyme:超越图像思考

英文摘要

Following OpenAI's introduction of the thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing think with images'' approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by a RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.

中文摘要

继OpenAI提出"用图像思考"(thinking with images)的概念之后,近期工作开始探索在推理过程中调用视觉信息,以增强模型在感知与推理任务上的表现。然而据我们所知,目前还没有开源工作能提供像专有模型(o3)那样丰富的能力:既能执行多样的图像操作,又能借助代码同时增强逻辑推理。在本文中,我们朝这个方向做出初步尝试,提出Thyme(Think Beyond Images,超越图像思考)这一新范式,使MLLM能够通过自主生成并执行可执行代码来完成多样的图像处理与数值计算操作,从而超越现有的"用图像思考"方法。这种方式不仅支持丰富的、按需进行的图像操作(如裁剪、旋转、对比度增强),也支持数学计算,同时让模型在决定何时、如何使用这些操作上保持高度自主。我们通过两阶段训练策略激活这一能力:先在精心构建的50万样本数据集上进行SFT以学会代码生成,再通过RL阶段优化决策。在RL阶段,我们人工收集并设计高分辨率问答对以提高学习难度,并提出GRPO-ATS(带自适应温度采样的组相对策略优化)算法,对文本与代码生成采用不同温度,在推理探索与代码执行精度之间取得平衡。我们进行了大量实验分析与消融研究。在近20个基准上的全面评测表明,Thyme带来显著且一致的性能提升,尤其是在高难度的高分辨率感知与复杂推理任务上。
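
Thyme的关键机制是让模型生成可执行的图像处理/计算代码,在沙盒中运行后把结果送回推理过程。下面是一个极简的执行器示意,假设模型输出中包含python代码块、且代码把结果写入result变量;真实系统需要远比这严格的沙盒与安全限制。

```python
import contextlib
import io
import re

from PIL import Image

FENCE = "`" * 3                      # avoid writing a literal code fence inside this snippet
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.S)

def run_model_code(model_output: str, image: Image.Image):
    """Extract the python block emitted by the model, run it with the input
    image bound to `image`, and return the produced `result` plus stdout."""
    m = CODE_RE.search(model_output)
    if m is None:
        return None, ""
    namespace = {"image": image, "Image": Image}
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(m.group(1), namespace)  # e.g. result = image.rotate(90).crop((0, 0, 64, 64))
    return namespace.get("result"), buf.getvalue()
```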


ComoRAG:认知启发的记忆组织式RAG,用于有状态的长篇叙事推理

英文摘要

Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given the LLM's diminished reasoning over extended context and high computational cost, retrieval-based approaches remain a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning. Our code is publicly released at https://github.com/EternityJune25/ComoRAG

中文摘要

长篇故事与小说的叙事理解一直是一个具有挑战性的领域,其原因在于情节错综复杂,人物与实体之间的关系相互纠缠且往往不断演变。鉴于LLM在超长上下文上的推理能力下降以及高昂的计算成本,基于检索的方法在实践中仍然扮演关键角色。然而,传统RAG方法由于其无状态的单步检索流程往往力不从心,常常忽视了在长程上下文中捕捉相互关联关系的动态本质。在这项工作中,我们提出ComoRAG,其核心理念是:叙事推理不是一次性的过程,而是新证据获取与既有知识巩固之间动态演进的相互作用,这类似于人脑在借助记忆相关信号进行推理时的认知过程。具体而言,当遇到推理僵局时,ComoRAG会在与动态记忆工作区交互的同时进行迭代推理循环:在每个循环中,它生成探测性查询以设计新的探索路径,再将检索到的新证据整合进全局记忆池,从而为解决查询逐步形成连贯的上下文。在四个具有挑战性的长上下文叙事基准(20万以上token)上,ComoRAG均优于强RAG基线,相比最强基线的相对提升最高可达11%。进一步分析表明,ComoRAG对需要全局理解的复杂查询尤为有利,为面向有状态推理的检索式长上下文理解提供了一个有原则、受认知启发的范式。我们的代码已公开发布于 https://github.com/EternityJune25/ComoRAG
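
摘要描述的迭代流程是:遇到推理僵局时,生成探测查询、检索新证据、融合进记忆池,然后再次尝试回答。下面是一个忠于这段文字描述的伪实现示意;llm与retriever为假设的可调用对象,提示词仅为示意,并非ComoRAG源码。

```python
def comorag_answer(question: str, llm, retriever, max_cycles: int = 5) -> str:
    """llm: text -> text; retriever: query -> evidence string (both assumed)."""
    memory_pool, answer = [], ""
    for _ in range(max_cycles):
        probe = llm(f"Question: {question}\nMemory: {memory_pool}\n"
                    "Propose a probing query for the evidence that is still missing:")
        evidence = retriever(probe)                     # retrieve new aspects of the narrative
        memory_pool.append(llm(f"Summarize and fuse into memory: {evidence}"))
        answer = llm(f"Question: {question}\nMemory: {memory_pool}\n"
                     "Answer, or reply IMPASSE if the evidence is still insufficient:")
        if "IMPASSE" not in answer:                     # impasse -> run another cycle
            break
    return answer
```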


FutureX:面向未来预测的LLM代理的先进实时基准测试

  • 标题: FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
  • 作者: Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang
  • 日期: 2025-08-16
  • ArXiv主页 : https://arxiv.org/abs/2508.11987
  • 论文链接 : https://arxiv.org/pdf/2508.11987
  • 项目链接 : https://futurex-ai.github.io/

英文摘要

Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including the vulnerability to fake web pages and the temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.

中文摘要

对LLM代理而言,未来预测是一项复杂任务,需要高水平的分析性思维、信息收集、上下文理解以及不确定性下的决策。代理不仅要收集并解读海量动态信息,还要整合多样的数据源、权衡不确定性,并根据新出现的趋势调整预测,正如人类专家在政治、经济、金融等领域所做的那样。尽管其重要性显而易见,但目前尚无用于评估代理未来预测能力的大规模基准,主要原因在于处理实时更新以及及时获取准确答案的困难。为此,我们提出FutureX,一个专为执行未来预测任务的LLM代理设计的动态实时评测基准。FutureX是目前规模最大、最多样化的未来预测实时基准,支持每日实时更新,并通过自动化的问题收集与答案采集流水线消除数据污染。我们评测了25个LLM/代理模型,包括具备推理与搜索能力的模型,以及集成了外部工具(如开源Deep Research Agent与闭源Deep Research模型)的系统。这一全面评测考察了代理在动态环境中的自适应推理与表现。此外,我们对代理在面向未来任务中的失败模式与性能陷阱进行了深入分析,包括对虚假网页的脆弱性以及时间有效性问题。我们的目标是建立一个动态、无污染的评测标准,推动LLM代理在复杂推理与预测性思维上达到专业人类分析师的水平。


MeshCoder:LLM驱动的从点云生成结构化网格代码

英文摘要

Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point cloud into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding.

中文摘要

将3D物体重建为可编辑的程序,对逆向工程、形状编辑等应用至关重要。然而,现有方法通常依赖能力有限的领域特定语言(DSL)与小规模数据集,难以建模复杂的几何与结构。为应对这些挑战,我们提出MeshCoder,一个从点云重建复杂3D物体、生成可编辑Blender Python脚本的新框架。我们开发了一套表达能力强、可合成复杂几何形状的Blender Python API;借助这些API,我们构建了大规模的"物体-代码"配对数据集,其中每个物体的代码被分解为不同的语义部件。随后,我们训练一个多模态大语言模型(LLM),将3D点云翻译为可执行的Blender Python脚本。我们的方法不仅在"形状到代码"的重建任务上取得优异表现,还能通过便捷的代码修改实现直观的几何与拓扑编辑。此外,这种基于代码的表示也增强了LLM在3D形状理解任务中的推理能力。这些贡献共同使MeshCoder成为程序化3D形状重建与理解的强大而灵活的解决方案。
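
为直观起见,下面示意摘要所说"按语义部件组织、可编辑"的Blender Python脚本大致是什么样子。注意:MeshCoder使用的是其自行开发、表达能力更强的形状API,这里仅以Blender自带的bpy基础算子代替,且需在Blender的Python环境中运行。

```python
# Runs inside Blender's Python environment; stock bpy operators are used here
# only as stand-ins for MeshCoder's richer shape APIs.
import bpy

# Part 1: table top, written as a named, directly editable block
bpy.ops.mesh.primitive_cube_add(size=2.0, location=(0.0, 0.0, 1.0))
bpy.context.object.name = "table_top"
bpy.context.object.scale = (1.0, 0.6, 0.05)

# Part 2: one leg; editing radius/depth/location edits the geometry itself
bpy.ops.mesh.primitive_cylinder_add(radius=0.05, depth=1.0, location=(0.9, 0.5, 0.5))
bpy.context.object.name = "table_leg_front_right"
```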


Mobile-Agent-v3:面向GUI自动化的基础代理

英文摘要

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.

中文摘要

本文介绍GUI-Owl,一个基础GUI代理模型,在覆盖桌面与移动环境的十个GUI基准上于开源端到端模型中取得最先进性能,涵盖界面元素定位(grounding)、问答、规划、决策与过程性知识。GUI-Owl-7B在AndroidWorld上达到66.4,在OSWorld上达到29.4。在此基础上,我们提出通用GUI代理框架Mobile-Agent-v3,进一步将成绩提升至AndroidWorld 73.3、OSWorld 37.7,为开源GUI代理框架树立了新的最先进水平。GUI-Owl包含三项关键创新:(1)大规模环境基础设施:覆盖Android、Ubuntu、macOS与Windows的云端虚拟环境,支撑我们的自进化GUI轨迹生产框架。该框架通过自动化查询生成与正确性校验产出高质量交互数据,并利用GUI-Owl迭代优化轨迹,形成自我改进的闭环;它支持多样的数据流水线并减少人工标注。(2)多样的基础代理能力:通过整合UI定位、规划、动作语义与推理模式,GUI-Owl支持端到端决策,也可作为多代理系统中的模块化组件。(3)可扩展的环境强化学习:我们开发了一个支持完全异步训练的可扩展RL框架,以实现与真实环境的对齐;我们还为在线RL引入了轨迹感知的相对策略优化(TRPO),在OSWorld上取得34.9。GUI-Owl与Mobile-Agent-v3已在 https://github.com/X-PLUG/MobileAgent 开源。


4DNeX:让前馈式4D生成建模变得简单

英文摘要

We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.

中文摘要

我们提出4DNeX,这是首个从单张图像生成4D(即动态3D)场景表示的前馈框架。与依赖计算密集型优化或需要多帧视频输入的现有方法不同,4DNeX通过微调预训练的视频扩散模型,实现高效的端到端图像到4D生成。具体而言:1)为缓解4D数据稀缺,我们构建了4DNeX-10M,一个利用先进重建方法生成高质量4D标注的大规模数据集;2)我们引入统一的6D视频表示,对RGB与XYZ序列进行联合建模,便于对外观与几何进行结构化学习;3)我们提出一组简单而有效的适配策略,将预训练视频扩散模型改造用于4D建模。4DNeX能够生成高质量的动态点云,支持新视角视频合成。大量实验表明,4DNeX在效率与泛化性上优于现有4D生成方法,为图像到4D建模提供了可扩展的方案,并为模拟动态场景演化的生成式4D世界模型奠定了基础。


BeyondWeb:面向万亿级预训练扩展合成数据的经验教训

  • 标题: BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
  • 作者: Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
  • 日期: 2025-08-14
  • ArXiv主页 : https://arxiv.org/abs/2508.10975
  • 论文链接 : https://arxiv.org/pdf/2508.10975

英文摘要

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.

中文摘要

大型语言模型(LLM)预训练的最新进展表明,单纯扩大数据量最终会收益递减,撞上"数据墙"。为此,使用合成数据进行预训练成为推动性能前沿的有希望范式。尽管如此,影响合成数据质量的因素仍然缺乏深入理解。在这项工作中,我们提出BeyondWeb,一个为预训练生成高质量合成数据的框架。BeyondWeb显著扩展了传统网络规模数据集的能力:在14项基准评测的平均成绩上,分别比Cosmopedia和Nemotron-CC的高质量合成子集(Nemotron-Synth)这两个最先进的合成预训练数据集高出最多5.1和2.6个百分点(pp)。它的训练效率比开放网络数据高最多7.7倍,比Nemotron-Synth高2.7倍。值得注意的是,在BeyondWeb上训练180B token的3B模型,表现优于在Cosmopedia上以相同token预算训练的8B模型。我们还基于BeyondWeb给出了关于合成预训练数据的若干洞见:其收益从何而来、应改写哪些数据以及如何改写、模型规模与模型家族对数据质量的影响。总的来说,我们的工作表明,生成高质量合成预训练数据没有"银弹";最佳效果需要同时优化许多因素,这是一项需要严谨科学与实践经验的挑战性任务。朴素的做法只能带来有限的改进,且可能代价高昂,而执行得当的方法(如BeyondWeb所示)则能带来变革性的提升。


LongSplat:面向随手拍摄长视频的鲁棒无位姿3D高斯泼溅

英文摘要

LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/

中文摘要

LongSplat针对随手拍摄的长视频(相机运动不规则、相机位姿未知、场景范围大)中新视角合成(NVS)所面临的关键挑战。现有方法常常受困于位姿漂移、几何初始化不准确以及严重的显存限制。为解决这些问题,我们提出LongSplat,一个鲁棒的无位姿3D高斯泼溅(3D Gaussian Splatting)框架,包含:(1)增量式联合优化,同时优化相机位姿与3D高斯,避免局部极小并保证全局一致性;(2)利用学习得到的3D先验的鲁棒位姿估计模块;(3)高效的八叉树锚点构建机制,根据空间密度将稠密点云转换为锚点。在具有挑战性的基准上的大量实验表明,LongSplat取得了最先进的结果,相比此前方法显著提升了渲染质量、位姿精度与计算效率。项目页面:https://linjohnss.github.io/longsplat/


从分数到技能:评估金融大语言模型的认知诊断框架

  • 标题: From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
  • 作者: Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou
  • 日期: 2025-08-19
  • ArXiv主页 : https://arxiv.org/abs/2508.13491
  • 论文链接 : https://arxiv.org/pdf/2508.13491

英文摘要

Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.

中文摘要

大型语言模型(LLM)在金融应用中展现出潜力,但由于现有基准的不足,它们是否适用于这一高风险领域在很大程度上仍未得到验证。现有基准只依赖分数层面的评估,用单一分数概括性能,掩盖了对模型真正掌握了什么、其精确局限何在的细致理解;它们所依赖的数据集也只覆盖了金融概念的狭窄子集,忽略了现实应用所需的其他要素。为弥补这些缺口,我们提出FinCDM,首个面向金融LLM的认知诊断评估框架,可在"知识-技能"层面评估LLM:基于模型在带技能标签任务上的作答模式,识别其具备或缺乏哪些金融技能与知识,而不是给出一个单一的汇总分数。我们构建了CPA-QKA,首个具有认知学依据、源自注册会计师(CPA)考试的金融评估数据集,全面覆盖真实世界的会计与金融技能。该数据集由领域专家严格标注:专家撰写、校验并标注问题,标注者间一致性高,并带有细粒度的知识标签。我们对30个专有、开源及领域专用LLM的大量实验表明,FinCDM能揭示隐藏的知识缺口,识别出税务、监管推理等被传统基准忽视的欠测试领域,并发现模型之间的行为聚类。FinCDM通过支持可解释、技能感知的诊断,为金融LLM评估引入了新范式,有助于更可信、更有针对性的模型开发;所有数据集与评测脚本都将公开发布以支持后续研究。


速度总是获胜:大语言模型高效架构综述

英文摘要

Large Language Models (LLMs) have delivered impressive results in language understanding, generation, reasoning, and pushes the ability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computations and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost the efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above category, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.

中文摘要

大型语言模型(LLM)在语言理解、生成与推理方面取得了令人瞩目的成果,并不断拓展多模态模型的能力边界。Transformer模型作为现代LLM的基础,提供了具有出色扩展特性的强基线。然而,传统Transformer架构需要大量计算,为大规模训练与实际部署带来显著障碍。在这篇综述中,我们系统梳理了旨在突破Transformer固有局限、提升效率的创新LLM架构。从语言建模出发,综述涵盖线性与稀疏序列建模方法、高效的全注意力变体、稀疏混合专家(MoE)、融合上述技术的混合模型架构以及新兴的扩散LLM的背景与技术细节。此外,我们还讨论了这些技术在其他模态上的应用,并思考它们对构建可扩展、资源友好的基础模型的更广泛意义。通过将近期研究归入上述类别,本综述勾勒出现代高效LLM架构的蓝图,希望能推动未来研究迈向更高效、更通用的AI系统。


下一个视觉粒度生成

英文摘要

We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different level of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 -> 3.03, 2.57 ->2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.

中文摘要

我们提出一种新颖的图像生成方法:将图像分解为一个结构化序列,序列中的每个元素具有相同的空间分辨率,但所使用的不同token数量各不相同,从而刻画不同层级的视觉粒度。图像生成通过我们新提出的"下一个视觉粒度"(Next Visual Granularity,NVG)生成框架完成:从空图像出发生成一个视觉粒度序列,以结构化方式从全局布局到精细细节逐步细化。这一迭代过程编码了分层的表示,为多个粒度层级上的生成过程提供细粒度控制。我们在ImageNet数据集上训练了一系列用于类别条件图像生成的NVG模型,并观察到清晰的扩展规律。与VAR系列相比,NVG在FID分数上全面占优(3.30 -> 3.03,2.57 -> 2.44,2.09 -> 2.06)。我们还进行了大量分析以展示NVG框架的能力与潜力。代码与模型将会发布。
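
摘要的关键点是"空间分辨率不变、可用的不同token数逐级增多"。下面用k-means把同一个16x16的潜在网格分别量化到1、4、16、64个码字,直观体会"粒度层级"的含义;这只是帮助理解概念的玩具示例,与NVG实际的tokenizer和生成框架无关。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
feats = rng.normal(size=(16 * 16, 8))      # stand-in for a 16x16 latent feature grid

for k in (1, 4, 16, 64):                   # coarse global layout -> fine detail
    ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
    grid = ids.reshape(16, 16)             # spatial resolution stays 16x16 at every level
    print(f"granularity level k={k}: {len(set(ids))} unique tokens, grid {grid.shape}")
```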


POML:提示编排标记语言

英文摘要

Large Language Models (LLMs) require sophisticated prompting, yet current practices face challenges in structure, data integration, format sensitivity, and tooling. Existing methods lack comprehensive solutions for organizing complex prompts involving diverse data types (documents, tables, images) or managing presentation variations systematically. To address these gaps, we introduce POML (Prompt Orchestration Markup Language). POML employs component-based markup for logical structure (roles, tasks, examples), specialized tags for seamless data integration, and a CSS-like styling system to decouple content from presentation, reducing formatting sensitivity. It includes templating for dynamic prompts and a comprehensive developer toolkit (IDE support, SDKs) to improve version control and collaboration. We validate POML through two case studies demonstrating its impact on complex application integration (PomLink) and accuracy performance (TableQA), as well as a user study assessing its effectiveness in real-world development scenarios.

中文摘要

大型语言模型(LLM)需要精巧的提示词设计,但当前实践在结构化、数据整合、格式敏感性与工具支持方面都面临挑战。现有方法缺乏系统化的方案来组织涉及多种数据类型(文档、表格、图像)的复杂提示,也难以系统地管理呈现方式的变化。为弥补这些不足,我们提出POML(Prompt Orchestration Markup Language,提示编排标记语言)。POML使用基于组件的标记来表达逻辑结构(角色、任务、示例),用专门的标签无缝整合数据,并提供类似CSS的样式系统将内容与呈现解耦,从而降低格式敏感性。它还包含用于动态提示的模板机制,以及完善的开发者工具链(IDE支持、SDK),以改进版本控制与协作。我们通过两个案例研究验证了POML对复杂应用集成(PomLink)与准确率表现(TableQA)的影响,并通过用户研究评估了其在真实开发场景中的有效性。


LiveMCP-101:在高难度查询上对支持MCP的代理进行压力测试与诊断

  • 标题: LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
  • 作者: Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
  • 日期: 2025-08-21
  • ArXiv主页 : https://arxiv.org/abs/2508.15760
  • 论文链接 : https://arxiv.org/pdf/2508.15760

英文摘要

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.

中文摘要

工具调用已成为AI代理与真实世界交互、解决复杂任务的关键能力。尽管模型上下文协议(MCP)为工具集成提供了强大的标准化框架,但在"AI代理能否在真实、动态的场景中使用多样的MCP工具有效完成多步任务"这一问题上,现有基准评测仍存在明显空白。在这项工作中,我们提出LiveMCP-101,一个由101个精心筛选的真实查询组成的基准,这些查询经过迭代的LLM改写与人工审核,需要协同使用多种MCP工具,包括网页搜索、文件操作、数学推理与数据分析。此外,我们提出一种新的评估方法,利用标准执行计划(ground-truth execution plan)而非原始API输出进行评估,更好地反映真实环境不断变化的特性。实验显示,即使是前沿LLM的成功率也低于60%,凸显了工具编排方面的重大挑战。详细的消融与错误分析进一步揭示了不同的失败模式以及token使用上的低效之处,为改进当前模型指出了具体方向。LiveMCP-101为评估真实世界的代理能力设立了严格标准,推动能够通过工具使用可靠执行复杂任务的自主AI系统的发展。


S^2-Guidance:免训练扩散模型的随机自引导

英文摘要

Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S^2-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S^2-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.

中文摘要

无分类器引导(Classifier-free Guidance,CFG)是现代扩散模型中广泛使用的技术,用于提升样本质量与提示词遵循度。然而,通过对具有闭式解的高斯混合建模进行实证分析,我们观察到CFG产生的次优结果与真实分布之间存在差异。模型对这些次优预测的过度依赖,往往导致语义不连贯和低质量输出。为解决这一问题,我们首先通过实验证明:模型自身的子网络可以有效修正其次优预测。基于这一洞见,我们提出S^2-Guidance,一种在前向过程中通过随机丢弃网络模块来构建随机子网络的新方法,从而引导模型避开潜在的低质量预测、走向高质量输出。在文本生成图像与文本生成视频任务上的大量定性与定量实验表明,S^2-Guidance性能优异,始终优于CFG及其他先进的引导策略。我们的代码将会发布。
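
作为背景,标准CFG用无条件与条件噪声预测的外推来构造引导方向(其中 $\epsilon_\theta$ 为噪声预测网络,$c$ 为条件,$w$ 为引导系数):

$$\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$

摘要没有给出S^2-Guidance的具体公式;按其文字描述,它在前向过程中随机丢弃网络模块得到随机子网络的预测,并把该预测当作"潜在低质量方向"的额外信号,使采样远离它,以此改进上述CFG的组合结果。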


MCP-Universe:使用真实世界模型上下文协议(MCP)服务器对大型语言模型进行基准测试

英文摘要

The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.

中文摘要

模型上下文协议(MCP)已成为将大型语言模型连接到外部数据源与工具的变革性标准,并迅速被各大AI提供商与开发平台采用。然而,现有基准过于简单,无法刻画长程推理、庞大而陌生的工具空间等真实应用挑战。为填补这一关键空白,我们提出MCP-Universe,首个专门通过与真实世界MCP服务器交互来评估LLM完成真实且困难任务能力的综合基准。该基准覆盖6个核心领域、共11个不同的MCP服务器:位置导航、仓库管理、金融分析、3D设计、浏览器自动化与网页搜索。为确保评估的严谨性,我们实现了基于执行的评估器,包括检查代理输出格式合规性的格式评估器、针对时间不变内容匹配的静态评估器,以及可为时间敏感任务自动获取实时标准答案的动态评估器。通过对领先LLM的广泛评测,我们发现即使是GPT-5(43.72%)、Grok-4(33.33%)、Claude-4.0-Sonnet(29.44%)等SOTA模型也表现出明显的性能局限。此外,由于输入token数量随交互步数迅速增长,该基准对LLM代理构成显著的长上下文挑战;同时它还带来"未知工具"挑战,因为LLM代理往往并不熟悉MCP服务器的确切用法。值得注意的是,Cursor等企业级代理也未能取得优于标准ReAct框架的表现。除评测之外,我们还开源了带UI支持的可扩展评估框架,方便研究者与从业者无缝接入新的代理与MCP服务器,促进快速演进的MCP生态中的创新。


XQuant:通过KV缓存重物化(Rematerialization)突破LLM推理的内存墙

  • 标题: XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
  • 作者: Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
  • 日期: 2025-08-14
  • ArXiv主页 : https://arxiv.org/abs/2508.10395
  • 论文链接 : https://arxiv.org/pdf/2508.10395

英文摘要

Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate 2times memory savings compared to KV caching. By applying XQuant, we achieve up to sim 7.7times memory savings with <0.1 perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to 10times memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and 12.5times memory savings with only 0.1 perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.

中文摘要

尽管LLM推理已成为许多下游应用的关键负载,但由于巨大的显存占用与带宽需求,高效地进行LLM推理颇具挑战。与此同时,过去几十年里计算能力的增长持续快于内存容量与带宽,这一趋势在现代GPU硬件上依然明显,并进一步加剧了LLM推理的困难。因此,涌现出一类以额外计算换取更少内存操作的新算法。为此,我们提出XQuant,它顺应这一趋势,通过低比特量化实现内存占用数量级的降低,并且相比最先进的KV缓存量化方法具有显著的精度优势。我们的做法是:不再使用标准的KV缓存,而是量化并缓存各层的输入激活X,在推理时再即时重算(rematerialize)出Key与Value。与KV缓存相比,这立即带来2倍的内存节省。应用XQuant后,相比FP16基线,我们实现最多约7.7倍的内存节省,而困惑度退化小于0.1。此外,我们的方法还利用了X值在层与层之间相似这一事实:在此基础上我们提出XQuant-CL,利用X嵌入的跨层相似性进行极致压缩。在不同模型上,XQuant-CL相对FP16基线最多可节省10倍内存而困惑度仅退化0.01,或节省12.5倍内存而困惑度仅退化0.1。XQuant利用硬件平台快速增长的计算能力消除内存瓶颈,在超越最先进KV缓存量化方法的同时,在广泛的模型上取得接近FP16的精度。
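
摘要的核心做法可以用几行代码说明:不缓存K/V,而是缓存(量化后的)层输入X,注意力计算时再用投影权重即时重算K、V;用一个缓存张量替代两个,对应文中"立即节省2倍内存"。以下仅为概念示意,省略了真正的低比特量化与XQuant-CL的跨层压缩,变量名为假设。

```python
import torch

class XCache:
    """Cache layer inputs X instead of K/V and rematerialize them on the fly."""

    def __init__(self):
        self.xs = []                            # per-step inputs; XQuant stores these in low-bit form

    def append(self, x: torch.Tensor) -> None:  # x: (B, 1, D) activation of the newly generated token
        self.xs.append(x)                       # a real implementation would quantize here

    def rematerialize(self, w_k: torch.Tensor, w_v: torch.Tensor):
        x = torch.cat(self.xs, dim=1)           # (B, T, D): one cached tensor instead of two
        return x @ w_k, x @ w_v                 # recompute K and V just-in-time for attention

# shape-only usage
cache = XCache()
for _ in range(4):
    cache.append(torch.randn(1, 1, 64))
k, v = cache.rematerialize(torch.randn(64, 64), torch.randn(64, 64))
```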


Tinker:扩散模型送给3D的礼物,无需逐场景优化即可由稀疏输入实现多视角一致编辑

英文摘要

We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker

中文摘要

我们提出Tinker,一个通用的高保真3D编辑框架,可在单样本与少样本两种设定下工作,且无需任何逐场景微调。不同于此前需要大量逐场景优化来保证多视角一致性、或需要生成几十张一致的编辑后输入视图的技术,Tinker仅凭一两张图像就能提供鲁棒的多视角一致编辑。这一能力来自对预训练扩散模型的再利用,从而释放其潜在的3D感知能力。为推动该方向的研究,我们构建了首个大规模多视角编辑数据集与数据流水线,覆盖多样的场景与风格。在此数据集之上,我们开发的框架无需逐场景训练即可生成多视角一致的编辑视图,它由两个新组件组成:(1)参考式多视角编辑器:实现精确的、由参考图驱动的编辑,并在所有视角下保持一致;(2)任意视角到视频合成器:利用视频扩散的时空先验,即使输入稀疏也能完成高质量的场景补全与新视角生成。通过大量实验,Tinker显著降低了可泛化3D内容创作的门槛,在编辑、新视角合成与渲染增强任务上取得最先进的性能。我们相信Tinker是迈向真正可扩展、零样本3D编辑的关键一步。项目网页:https://aim-uofa.github.io/Tinker


当标点符号很重要:LLM提示鲁棒性方法的大规模比较

英文摘要

Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.

中文摘要

大型语言模型(LLM)对提示词措辞与格式中细微的、非语义的变化高度敏感。在这项工作中,我们在统一的实验框架内,对5种提升提示鲁棒性的方法进行了首次系统评估。我们在Llama、Qwen、Gemma家族的8个模型上,针对Natural Instructions数据集中的52个任务对这些技术进行了基准测试。评估同时覆盖微调与上下文学习(in-context learning)两类范式下的鲁棒性方法,并测试了它们在多种分布偏移下的泛化能力。最后,我们将分析扩展到GPT-4.1与DeepSeek V3,以考察前沿模型当前对格式扰动的鲁棒性。我们的发现为这些鲁棒性方法的相对有效性提供了可操作的见解,帮助从业者在追求真实应用中稳定可靠的LLM表现时做出明智决策。代码:https://github.com/AIRI-Institute/when-punctuation-matters。
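
摘要所说的"非语义格式扰动"可以用下面的小函数来体会:同样的内容,仅改变分隔符、字段标签大小写与标点,然后统计各变体下任务准确率的波动。这只是帮助理解问题设定的示意,并非论文实际使用的扰动集合。

```python
import random

def perturb(prompt_fields, rng=random):
    """prompt_fields: list of (label, text) pairs making up one prompt."""
    sep = rng.choice(["\n", "\n\n", " | ", " -- "])
    colon = rng.choice([":", " :", " -", "="])
    case = rng.choice([str.lower, str.upper, str.title])
    return sep.join(f"{case(k)}{colon} {v}" for k, v in prompt_fields)

# several surface forms of the same task
fields = [("Instruction", "Classify the sentiment."), ("Input", "Great movie!")]
for _ in range(3):
    print(repr(perturb(fields)))
```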


NVIDIA Nemotron Nano 2:准确高效的混合Mamba-Transformer推理模型

  • 标题: NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
  • 作者: NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Luis Vega, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Mark Cai, Markus Kliegl, Marta Stepniewska-Dziubinska, Matvei Novikov, Mehrzad Samadi, Meredith Price, Meriem Boubdir, Michael Boone, Michael Evans, Michal Bien, Michal Zawalski, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Namit Dhameja, Nave Assaf, Negar Habibi, Nidhi Bhatia, Nikki Pope, Nima Tajbakhsh, Nirmal Kumar Juluru, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pablo Ribalta, Padmavathy Subramanian, Parth Chadha, Pavlo Molchanov, Peter Dykas, Peter Jin, Piotr Bialecki, Piotr Januszewski, Pradeep Thalasta, Prashant Gaikwad, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi Mahabadi, Rajen Patel, Ran El-Yaniv, Ranjit Rajan, Ria Cheruvu, Rima Shahbazyan, Ritika Borkar, Ritu Gala, Roger Waleffe, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Sahil Jain, Samuel Kriman, Sanjeev Satheesh, Saori Kaji, Sarah Yurick, Saurav Muralidharan, Sean Narenthiran, Seonmyeong Bak, Sepehr Sameni, Seungju Han, Shanmugam Ramasamy, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shizhe Diao, Shreya Gopal, Shrimai Prabhumoye, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Siddhartha Jain, Somshubra Majumdar, Stefania Alborghetti, Syeda Nahida Akter, Terry Kong, Tim Moon, Tomasz Hliwiak, Tomer Asida, Tony Wang, Twinkle Vashishth, Tyler Poon, Udi Karpas, Vahid Noroozi, Venkat Srinivasan, Vijay Korthikanti, Vikram Fugro, Vineeth Kalluru, Vitaly Kurin, Vitaly Lavrukhin, Wasi Uddin Ahmad, Wei Du, Wonmin Byeon, Ximing Lu, Xin Dong, Yashaswi Karnati, Yejin Choi, Yian Zhang, Ying Lin, Yonggan Fu, Yoshi Suhara, Zhen Dong, Zhiyu Li, Zhongbo Zhu, Zijia Chen
  • 日期: 2025-08-20
  • ArXiv主页 : https://arxiv.org/abs/2508.14444
  • 论文链接 : https://arxiv.org/pdf/2508.14444

英文摘要

We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

中文摘要

我们介绍 Nemotron-Nano-9B-v2,这是一种混合 Mamba-Transformer 语言模型,旨在提高推理(reasoning)工作负载的吞吐量,同时相对同等规模的模型达到最先进的准确率。Nemotron-Nano-9B-v2 建立在 Nemotron-H 架构之上,将常规 Transformer 架构中的大部分自注意力层替换为 Mamba-2 层,从而在生成推理所需的长思考轨迹时获得更快的生成速度。我们首先采用 FP8 训练方案,在 20 万亿 token 上预训练一个 120 亿参数模型(Nemotron-Nano-12B-v2-Base),以此构建 Nemotron-Nano-9B-v2。在对 Nemotron-Nano-12B-v2-Base 进行对齐之后,我们采用 Minitron 策略对模型进行压缩与蒸馏,目标是在单张 NVIDIA A10G GPU(22GiB 显存,bfloat16 精度)上支持多达 128k token 的推理。与现有同等规模的模型(例如 Qwen3-8B)相比,Nemotron-Nano-9B-v2 在推理基准上达到持平或更高的准确率,同时在 8k 输入、16k 输出 token 这类推理场景下实现最高 6 倍的推理吞吐量。我们在 Hugging Face 上发布 Nemotron-Nano-9B-v2、Nemotron-Nano-12B-v2-Base 和 Nemotron-Nano-9B-v2-Base 检查点,以及大部分预训练和后训练数据集。
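
下面是一个示意性的 PyTorch 草图(并非 Nemotron 官方实现,层数与维度均为假设值),用于说明"混合 Mamba-Transformer"的层排布思想:大部分层使用线性递归的序列混合器(此处用 GRU 作为 Mamba-2 的占位替身),只有少数层保留自注意力。

import torch
import torch.nn as nn

class RecurrentMixer(nn.Module):
    """线性递归混合器的占位实现(真实模型中对应 Mamba-2 层)。"""
    def __init__(self, d_model: int):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.rnn(self.norm(x))
        return x + out                       # 残差连接

class AttentionMixer(nn.Module):
    """保留下来的少数自注意力层。"""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class HybridStack(nn.Module):
    """每 attn_every 层放一个注意力层,其余均为递归混合器。"""
    def __init__(self, n_layers=12, d_model=256, attn_every=6):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionMixer(d_model) if (i + 1) % attn_every == 0
            else RecurrentMixer(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

if __name__ == "__main__":
    x = torch.randn(2, 128, 256)             # (batch, seq_len, d_model)
    print(HybridStack()(x).shape)             # torch.Size([2, 128, 256])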


Waver:挥手生成栩栩如生的视频

英文摘要

We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.

中文摘要

我们提出 Waver,一个用于统一图像与视频生成的高性能基础模型。Waver 可以直接生成时长 5 到 10 秒、原生分辨率 720p 的视频,并随后上采样至 1080p。该模型在单一集成框架内同时支持文本到视频(T2V)、图像到视频(I2V)和文本到图像(T2I)生成。我们引入 Hybrid Stream DiT 架构,以增强模态对齐并加速训练收敛。为确保训练数据质量,我们建立了全面的数据整理管线,并人工标注和训练了一个基于 MLLM 的视频质量模型,用于筛选最高质量的样本。此外,我们提供了详细的训练与推理方案,以帮助生成高质量视频。基于这些贡献,Waver 擅长捕捉复杂运动,在视频合成中实现了出色的运动幅度与时间一致性。值得注意的是,在 Artificial Analysis 的 T2V 和 I2V 排行榜上(数据截至 2025-07-30 10:00 GMT+8),它均位列前三,持续优于现有开源模型,并达到或超越最先进的商业方案。我们希望这份技术报告能帮助社区更高效地训练高质量视频生成模型,加速视频生成技术的进步。官方页面:https://github.com/foundationvision/waver。
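
下面用一个极简的示意代码说明"在单一接口内统一 T2I / T2V / I2V"的一种条件组织方式:缺失的参考图像条件用零张量占位,帧数由参数控制。该草图与论文的 Hybrid Stream DiT 架构无关,模型结构与维度均为假设,仅示意多任务条件的组织思路。

import torch
import torch.nn as nn

class UnifiedGenerator(nn.Module):
    """统一处理 T2I / T2V / I2V 的玩具生成器(仅示意接口,非真实模型)。"""
    def __init__(self, d_text=64, d_latent=16, hidden=128):
        super().__init__()
        self.d_latent = d_latent
        self.net = nn.Sequential(
            nn.Linear(d_text + d_latent, hidden), nn.GELU(),
            nn.Linear(hidden, d_latent),
        )

    def forward(self, text_emb, ref_latent=None, n_frames=1):
        # I2V 提供参考图像潜变量;T2V / T2I 用零张量占位
        if ref_latent is None:
            ref_latent = torch.zeros(text_emb.size(0), self.d_latent)
        cond = torch.cat([text_emb, ref_latent], dim=-1)
        frame = self.net(cond)                                # (B, d_latent)
        return frame.unsqueeze(1).repeat(1, n_frames, 1)      # (B, n_frames, d_latent)

if __name__ == "__main__":
    model, text = UnifiedGenerator(), torch.randn(2, 64)
    print(model(text, n_frames=1).shape)                      # T2I: torch.Size([2, 1, 16])
    print(model(text, n_frames=8).shape)                      # T2V: torch.Size([2, 8, 16])
    print(model(text, torch.randn(2, 16), n_frames=8).shape)  # I2V: torch.Size([2, 8, 16])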


GPT-5 实现空间智能了吗?一项实证研究

  • 标题: Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
  • 作者: Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
  • 日期: 2025-08-18
  • ArXiv主页 : https://arxiv.org/abs/2508.13142
  • 论文链接 : https://arxiv.org/pdf/2508.13142

英文摘要

Multi-modal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, which are fundamental capabilities to achieving artificial general intelligence. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models stand on the path toward spatial intelligence. First, we propose a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and discuss the challenges in ensuring fair evaluation. We then evaluate state-of-the-art proprietary and open-source models on eight key benchmarks, at a cost exceeding one billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence, yet (2) still falls short of human performance across a broad spectrum of tasks. Moreover, we (3) identify the more challenging spatial intelligence problems for multi-modal models, and (4) proprietary models do not exhibit a decisive advantage when facing the most difficult problems. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet fail even the most advanced multi-modal models.

中文摘要

近年来,多模态模型取得了显著进步。然而,它们在空间理解与推理方面仍表现出明显的局限,而这正是实现通用人工智能的基础能力。随着据称迄今最强大的 AI 模型 GPT-5 的最新发布,正是检视领先模型在迈向空间智能之路上处于何种位置的时机。首先,我们提出了一个统一现有基准的空间任务综合分类法,并讨论了确保公平评估的挑战。随后,我们在八个关键基准上评估了最先进的专有与开源模型,总计消耗超过十亿 token。我们的实证研究表明:(1)GPT-5 在空间智能上展现出前所未有的实力,但(2)在广泛的任务上仍达不到人类水平。此外,我们(3)识别出对多模态模型更具挑战性的空间智能问题,并发现(4)在面对最困难的问题时,专有模型并不具有决定性优势。另外,我们还在一组对人类而言直观、却让最先进多模态模型也失败的多样化场景上进行了定性评估。
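
下面的示意代码说明按"空间任务分类法"组织多基准评测结果并按类别汇总的一种做法;其中的模型名、类别名与数值均为虚构示例,仅用于说明统一评测的组织方式,并非论文的官方评测代码。

from collections import defaultdict

# (模型, 任务类别, 基准, 准确率)——全部为虚构示例数据
RESULTS = [
    ("model_a", "mental_rotation", "bench1", 0.61),
    ("model_a", "relative_position", "bench2", 0.74),
    ("model_b", "mental_rotation", "bench1", 0.55),
    ("model_b", "relative_position", "bench2", 0.78),
]

def aggregate_by_category(results):
    """按(模型, 类别)对各基准的准确率求平均。"""
    bucket = defaultdict(list)
    for model, category, _bench, acc in results:
        bucket[(model, category)].append(acc)
    return {k: sum(v) / len(v) for k, v in bucket.items()}

if __name__ == "__main__":
    for (model, category), acc in sorted(aggregate_by_category(RESULTS).items()):
        print(f"{model:8s} {category:18s} {acc:.2f}")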


从 AI for Science 到代理科学(Agentic Science):自主科学发现综述

  • 标题: From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery
  • 作者: Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, Zijie Qiu, Xuming He, Qiang Zhang, Chenyu You, Shuangjia Zheng, Ning Ding, Wanli Ouyang, Nanqing Dong, Yu Cheng, Siqi Sun, Lei Bai, Bowen Zhou
  • 日期: 2025-08-18
  • ArXiv主页 : https://arxiv.org/abs/2508.14111
  • 论文链接 : https://arxiv.org/pdf/2508.14111

英文摘要

Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position Agentic Science as a pivotal stage within the broader AI for Science paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research platforms, agentic AI shows capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement -- behaviors once regarded as uniquely human. This survey provides a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials science, and physics. We unify three previously fragmented perspectives -- process-oriented, autonomy-oriented, and mechanism-oriented -- through a comprehensive framework that connects foundational capabilities, core processes, and domain-specific realizations. Building on this framework, we (i) trace the evolution of AI for Science, (ii) identify five core capabilities underpinning scientific agency, (iii) model discovery as a dynamic four-stage workflow, (iv) review applications across the above domains, and (v) synthesize key challenges and future opportunities. This work establishes a domain-oriented synthesis of autonomous scientific discovery and positions Agentic Science as a structured paradigm for advancing AI-driven research.

中文摘要

人工智能(AI)正在重塑科学发现,从专用的计算工具演进为自主的研究伙伴。我们将代理科学(Agentic Science)定位为更广义的 AI for Science 范式中的关键阶段:AI 系统从部分辅助走向完全的科学自主性。依托大语言模型(LLM)、多模态系统和一体化研究平台,代理式 AI 展现出假设生成、实验设计、执行、分析与迭代改进等能力——这些行为曾被认为是人类独有的。本综述面向学科领域,回顾了生命科学、化学、材料科学和物理学中的自主科学发现。我们通过一个连接基础能力、核心过程与学科特定实现的综合框架,统一了此前相互割裂的三种视角——面向过程、面向自主性和面向机制。在此框架之上,我们(i)梳理 AI for Science 的演进,(ii)识别支撑科学自主性的五项核心能力,(iii)将科学发现建模为动态的四阶段工作流,(iv)回顾上述各学科中的应用,并(v)总结关键挑战与未来机遇。这项工作建立了面向学科领域的自主科学发现综述,并将代理科学确立为推进 AI 驱动研究的结构化范式。
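
下面给出一个示意性的闭环草图,把假设生成、实验设计、执行与分析组织为可迭代的工作流。阶段划分依据摘要中列出的能力,属于说明性假设,并非论文定义的官方四阶段工作流;各占位函数仅用于跑通流程,实际系统中由 LLM 代理与实验平台实现。

from typing import Callable

def agentic_discovery_loop(
    generate_hypothesis: Callable[[list], str],
    design_experiment: Callable[[str], dict],
    run_experiment: Callable[[dict], dict],
    analyze: Callable[[dict], tuple],   # 返回 (结论, 是否继续迭代)
    max_rounds: int = 3,
):
    findings = []
    for _ in range(max_rounds):
        hypothesis = generate_hypothesis(findings)    # 阶段 1:基于已有发现提出假设
        plan = design_experiment(hypothesis)           # 阶段 2:设计实验
        observation = run_experiment(plan)             # 阶段 3:执行并收集观测
        conclusion, keep_going = analyze(observation)  # 阶段 4:分析并决定是否迭代
        findings.append(conclusion)
        if not keep_going:
            break
    return findings

if __name__ == "__main__":
    # 用简单占位函数跑通闭环,仅作演示
    out = agentic_discovery_loop(
        generate_hypothesis=lambda f: f"hypothesis_{len(f)}",
        design_experiment=lambda h: {"protocol": h},
        run_experiment=lambda p: {"result": p["protocol"] + "_ok"},
        analyze=lambda obs: (obs["result"], len(obs["result"]) < 20),
    )
    print(out)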

