[Weekly Paper Roundup] 2025 Week 52 (Dec 21-27) (Robotics / Embodied AI / LLM)


DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

  • Title: DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
  • Authors: Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang
  • Date: 2025-12-18
  • arXiv page: https://arxiv.org/abs/2512.16676
  • Paper PDF: https://arxiv.org/pdf/2512.16676
  • Project page: https://github.com/OpenDCAI/DataFlow
  • GitHub repo: https://github.com/OpenDCAI/DataFlow

Abstract

The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1--3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.

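The abstract highlights a PyTorch-style pipeline construction API built from reusable operators. As a rough illustration of that idea only (the operator and class names below are hypothetical, not DataFlow's actual interface; see the GitHub repo for the real API):

```python
# Minimal sketch of a PyTorch-style data-preparation pipeline.
# Operator names (Deduplicate, QualityFilter) are hypothetical;
# see https://github.com/OpenDCAI/DataFlow for DataFlow's real operators.

class Operator:
    """Base class: each operator transforms a list of records."""
    def __call__(self, records):
        raise NotImplementedError

class Deduplicate(Operator):
    def __call__(self, records):
        seen, out = set(), []
        for r in records:
            if r["text"] not in seen:
                seen.add(r["text"])
                out.append(r)
        return out

class QualityFilter(Operator):
    def __init__(self, min_len=20):
        self.min_len = min_len
    def __call__(self, records):
        return [r for r in records if len(r["text"]) >= self.min_len]

class Pipeline:
    """Compose operators sequentially, like nn.Sequential."""
    def __init__(self, *ops):
        self.ops = ops
    def run(self, records):
        for op in self.ops:          # each stage can be inspected in isolation
            records = op(records)
        return records

pipe = Pipeline(Deduplicate(), QualityFilter(min_len=20))
data = [{"text": "short"}, {"text": "a sufficiently long training sample"},
        {"text": "a sufficiently long training sample"}]
print(pipe.run(data))  # one record survives dedup + filtering
```

Composing transformations as first-class objects is what makes the resulting dataflow debuggable and optimizable stage by stage.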


Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

  • Title: Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
  • Authors: Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jinzhe Ma, Wanhao Liu, Yating Liu, Kuo-Cheng Wu, Shengdu Chai, Yizhou Wang, Ouwen Zhangjin, Chen Tang, Shufei Zhang, Wenbo Cao, Junjie Ren, Taoyong Cui, Zhouheng Yao, Juntao Deng, Yijie Sun, Feng Liu, Wangxu Wei, Jingyi Xu, Zhangrui Li, Junchao Gong, Zijie Guo, Zhiyu Yao, Zaoyu Chen, Tianhao Peng, Fangchen Yu, Bo Zhang, Dongzhan Zhou, Shixiang Tang, Jiaheng Liu, Fenghua Ling, Yan Lu, Yuchen Ren, Ben Fei, Zhen Zhao, Xinyu Gu, Rui Su, Xiao-Ming Wu, Weikang Si, Yang Liu, Hao Chen, Xiangchao Yan, Xue Yang, Junchi Yan, Jiamin Wu, Qihao Zheng, Chenhui Li, Zhiqiang Gao, Hao Kong, Junjun He, Mao Su, Tianfan Fu, Peng Ye, Chunfeng Song, Nanqing Dong, Yuqiang Li, Huazhu Fu, Siqi Sun, Lijing Cheng, Jintai Lin, Wanli Ouyang, Bowen Zhou, Wenlong Zhang, Lei Bai
  • Date: 2025-12-18
  • arXiv page: https://arxiv.org/abs/2512.16969
  • Paper PDF: https://arxiv.org/pdf/2512.16969
  • Project page: https://internscience.github.io/SGI-Page/
  • GitHub repo: https://github.com/InternScience/SGI-Bench

Abstract

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answer. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.

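TTRL "optimizes retrieval-augmented novelty rewards at inference" without a reference answer. The paper's exact reward is not spelled out in the abstract; one plausible minimal form, assuming retrieved prior-work abstracts and a bag-of-words cosine similarity, is sketched below.

```python
# Hypothetical sketch of a retrieval-augmented novelty reward:
# the reward is high when the generated hypothesis overlaps little
# with retrieved prior work. The real TTRL reward may differ.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty_reward(hypothesis: str, retrieved: list[str]) -> float:
    h = Counter(hypothesis.lower().split())
    sims = [cosine(h, Counter(r.lower().split())) for r in retrieved]
    return 1.0 - max(sims, default=0.0)  # novel = far from closest prior work

print(novelty_reward("protein folding via diffusion models",
                     ["diffusion models for protein folding",
                      "weather forecasting with transformers"]))
```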


TurboDiffusion: Accelerating Video Diffusion Models by 100-200x

Abstract

We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.

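W8A8 quantization means both weights and activations are stored in 8 bits, so linear layers can run in low-precision integer arithmetic. A minimal sketch of the general technique (symmetric per-tensor int8; TurboDiffusion's production kernels are far more sophisticated):

```python
# Sketch of symmetric per-tensor W8A8 quantization: weights and
# activations are mapped to int8, matmul accumulates in int32,
# and the result is rescaled back to float.
import numpy as np

def quantize_int8(x: np.ndarray):
    scale = max(np.abs(x).max() / 127.0, 1e-8)   # avoid div-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    return qa.astype(np.int32) @ qb.astype(np.int32) * (sa * sb)

rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
print(np.abs(int8_matmul(a, b) - a @ b).max())  # small quantization error
```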


SemanticGen: Video Generation in Semantic Space

Abstract

State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.



Step-DeepResearch Technical Report

  • Title: Step-DeepResearch Technical Report
  • Authors: Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu Liu, Jing Bai, Junlan Liu, Manjiao Liu, Na Wang, Qiuping Wu, Qinxin Du, Shiwei Li, Wen Sun, Yifeng Gong, Yonglin Chen, Yuling Zhao, Yuxuan Lin, Ziqi Ren, Zixuan Wang, Aihu Zhang, Brian Li, Buyun Ma, Kang An, Li Xie, Mingliang Li, Pan Li, Shidong Yang, Xi Chen, Xiaojia Liu, Yuchu Luo, Yuan Song, YuanHao Ding, Yuanwei Liang, Zexi Li, Zhaoning Zhang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
  • Date: 2025-12-23
  • arXiv page: https://arxiv.org/abs/2512.20491
  • Paper PDF: https://arxiv.org/pdf/2512.20491
  • GitHub repo: https://github.com/stepfun-ai/StepDeepResearch

Abstract

As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.

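A "Checklist-style Judger" scores a long research report against explicit criteria rather than one holistic rating. A minimal sketch, assuming a generic `llm(prompt) -> str` completion callable (not StepFun's actual judger prompt or rubric):

```python
# Minimal checklist-style judging sketch. `llm` is an assumed
# text-completion callable; the real Step-DeepResearch judger
# prompt and rubric are not public in this abstract.
def judge_report(report: str, checklist: list[str], llm) -> float:
    passed = 0
    for item in checklist:
        prompt = (f"Report:\n{report}\n\n"
                  f"Criterion: {item}\n"
                  "Does the report satisfy this criterion? Answer YES or NO.")
        if llm(prompt).strip().upper().startswith("YES"):
            passed += 1
    return passed / len(checklist)   # fraction of rubric items satisfied

# Usage with a stub LLM that answers YES to everything:
score = judge_report("...", ["Cites at least 3 sources",
                             "States its conclusion explicitly"],
                     llm=lambda p: "YES")
print(score)  # 1.0
```

Per-criterion binary checks give a denser, more verifiable reward signal than a single scalar judgment, which is plausibly why the report credits the judger with improved robustness.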


PhysBrain: Human Egocentric Data as a Bridge from Vision-Language Models to Physical Intelligence

Abstract

Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally capture rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.

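The pipeline emits "multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency". The record below illustrates what such a sample might look like; every field name is hypothetical, not the released E2E-3M schema:

```python
# Illustrative (hypothetical) shape of one schema-driven VQA record
# distilled from an egocentric clip; the actual E2E-3M fields may differ.
sample = {
    "video": "egocentric_clip_000123.mp4",
    "level": "planning",                      # e.g. perception / state / planning
    "question": "What should the hand do next to open the drawer?",
    "choices": ["pull the handle", "push the drawer", "release the grip"],
    "answer": 0,
    "evidence": {                             # enforced grounding
        "frames": [118, 142],                 # temporal span supporting the answer
        "object_mask_id": "drawer_handle_07",
        "hand_trajectory_3d": "traj_000123_right.npy",
    },
}
assert sample["choices"][sample["answer"]] == "pull the handle"
```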


Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

Abstract

Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.

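Component (iii), "dynamic reasoning depth scaling adapted to degradation intensity", amounts to spending more reasoning compute on harder (more degraded) inputs. A toy sketch of the mapping, with made-up budgets:

```python
# Sketch of dynamic reasoning depth scaling: allocate a larger
# chain-of-thought token budget as estimated degradation grows.
# Thresholds and budgets here are fabricated for illustration.
def reasoning_budget(degradation: float,
                     min_tokens: int = 64, max_tokens: int = 1024) -> int:
    """degradation in [0, 1]: 0 = clean image, 1 = severely degraded."""
    d = min(max(degradation, 0.0), 1.0)
    return int(min_tokens + d * (max_tokens - min_tokens))

for d in (0.0, 0.3, 0.9):
    print(d, reasoning_budget(d))   # 64, 352, 928
```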


Latent Implicit Visual Reasoning

Abstract

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.

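The core mechanism is a set of learnable "visual reasoning tokens" appended to the multimodal sequence, trained without supervision on what they should encode. A minimal PyTorch sketch of that idea (shapes illustrative, not the paper's architecture):

```python
# Sketch of learnable visual reasoning tokens appended to the
# image+text sequence; they attend globally through the transformer
# and are discovered without explicit intermediate supervision.
import torch
import torch.nn as nn

class ReasoningTokens(nn.Module):
    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, length, dim) of image+text embeddings
        b = seq.size(0)
        extra = self.tokens.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([seq, extra], dim=1)  # transformer sees both

seq = torch.randn(2, 100, 768)
print(ReasoningTokens(8, 768)(seq).shape)  # torch.Size([2, 108, 768])
```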


The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

Abstract

Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.

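The paper's spectral analysis boils down to asking how much of a feature map's energy sits in low versus high spatial frequencies. A small probe of this kind (a generic FFT band-energy measurement, not the paper's exact protocol):

```python
# Spectral probe: fraction of a feature map's energy in the low
# spatial-frequency band. Semantic encoders should concentrate
# energy in the low band; pixel encoders retain more high band.
import numpy as np

def band_energy(feat: np.ndarray, low_frac: float = 0.25):
    """feat: (H, W) one channel of an encoder feature map."""
    f = np.fft.fftshift(np.fft.fft2(feat))
    power = np.abs(f) ** 2
    h, w = feat.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    low = power[r <= low_frac * min(h, w) / 2].sum()
    return low / power.sum()      # fraction of energy in the low band

smooth = np.outer(np.hanning(32), np.hanning(32))   # low-frequency-ish map
noise = np.random.default_rng(0).normal(size=(32, 32))
print(band_energy(smooth), band_energy(noise))      # smooth >> noise
```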


Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Abstract

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and raveling out complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policy, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) LLama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning training objective at lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrates the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.

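The decomposition reads off an "internal layer policy" by pushing each layer's residual-stream state through the unembedding matrix and measuring the entropy of the resulting distribution (the logit-lens idea). A minimal sketch with random tensors standing in for a real model's hidden states and unembedding:

```python
# Sketch of internal layer policies: project each layer's residual
# stream through the unembedding matrix, then compute the entropy of
# the per-layer next-token distribution. Random tensors stand in
# for a real LLM's weights here.
import torch

def layer_policy_entropy(hidden: torch.Tensor, unembed: torch.Tensor):
    # hidden:  (n_layers, d_model) residual stream at one position
    # unembed: (d_model, vocab)
    logits = hidden @ unembed                     # (n_layers, vocab)
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(-1)           # entropy per layer

torch.manual_seed(0)
ent = layer_policy_entropy(torch.randn(12, 64), torch.randn(64, 1000))
print(ent)  # in real LLMs: high entropy early, near zero at the top
```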


Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

  • Title: Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
  • Authors: Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherrer, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento
  • Date: 2025-12-23
  • arXiv page: https://arxiv.org/abs/2512.20605
  • Paper PDF: https://arxiv.org/pdf/2512.20605

Abstract

Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.



When Reasoning Meets Its Laws

Abstract

Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothesis that the reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since the question complexity is difficult to quantify in practice, we examine these hypotheses by two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/

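The two tractable properties of the compute law are monotonicity (more complex questions should consume more reasoning tokens) and compositionality (a composed question should cost roughly the sum of its parts). A toy check of both, with fabricated token counts:

```python
# Sketch of checking the compute law's two tractable properties.
# Token counts below are fabricated for illustration only.
import numpy as np
from itertools import combinations

def monotonicity(complexity, tokens):
    """Fraction of question pairs ordered consistently (1.0 = monotone)."""
    pairs = list(combinations(range(len(complexity)), 2))
    ok = sum((complexity[i] < complexity[j]) == (tokens[i] < tokens[j])
             for i, j in pairs)
    return ok / len(pairs)

def compositionality_gap(tok_a, tok_b, tok_ab):
    """Relative deviation from additive compute for composed question A∘B."""
    return abs(tok_ab - (tok_a + tok_b)) / (tok_a + tok_b)

print(monotonicity([1, 2, 3, 4], [120, 260, 410, 700]))  # 1.0
print(compositionality_gap(260, 410, 950))               # ~0.42: not additive
```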


LongVideoAgent: Multi-Agent Reasoning over Long Videos

Abstract

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+ which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.

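Structurally, the system is a step-limited master loop that dispatches to a grounding agent (localize question-relevant segments) and a vision agent (extract textual observations). A minimal sketch with all agent internals stubbed out (the action format is hypothetical):

```python
# Sketch of the step-limited master loop coordinating a grounding
# agent and a vision agent; the action schema is illustrative only.
def answer_question(question, video, ground, observe, master, max_steps=8):
    notes = []
    for _ in range(max_steps):
        action = master(question, notes)          # plan the next tool call
        if action["type"] == "ground":
            notes.append(ground(video, action["query"]))
        elif action["type"] == "observe":
            notes.append(observe(video, action["segment"]))
        else:                                     # action["type"] == "answer"
            return action["answer"]
    return master(question, notes + ["FINAL: answer now"])["answer"]

# Stub usage: a master that grounds once, then answers.
calls = iter([{"type": "ground", "query": "kitchen scene"},
              {"type": "answer", "answer": "the chef"}])
print(answer_question("Who cooks?", "ep01.mp4",
                      ground=lambda v, q: f"segment for '{q}': 12:40-13:10",
                      observe=lambda v, s: "",
                      master=lambda q, n: next(calls)))
```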


Region-Constrained In-Context Generation for Instructional Video Editing

Abstract

The In-context generation paradigm recently has demonstrated strong power in instructional image editing with both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specifying editing regions, the results can suffer from the problem of inaccurate editing regions and the token interference between editing and non-editing areas during denoising. To address these, we present ReCo, a new instructional video editing paradigm that novelly delves into constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo width-wise concatenates source and target video for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, conducting on one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing the modification on editing area and alleviating outside unexpected content generation. The latter suppresses the attention of tokens in the editing region to the tokens in counterpart of the source video, thereby mitigating their interference during novel object generation in target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.

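The latent regularization term pushes source and target latents apart inside the editing region while keeping them close outside it. A toy PyTorch sketch of that objective (margin, weights, and shapes are illustrative, not the paper's values):

```python
# Sketch of ReCo-style latent regularization: encourage change inside
# the editing region, stability outside it. Values are illustrative.
import torch

def latent_regularization(z_src, z_tgt, edit_mask, margin=1.0):
    # z_src, z_tgt: (B, C, H, W) one-step backward denoised latents
    # edit_mask:    (B, 1, H, W), 1 inside the editing region
    diff = (z_src - z_tgt).pow(2).mean(dim=1, keepdim=True)   # (B,1,H,W)
    inside = torch.clamp(margin - diff, min=0)[edit_mask.bool()].mean()
    outside = diff[(1 - edit_mask).bool()].mean()
    return inside + outside

z1, z2 = torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8)
mask = torch.zeros(1, 1, 8, 8); mask[..., :4] = 1
print(latent_regularization(z1, z2, mask))
```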


Learning to Reason in 4D: Dynamic Spatial Understanding for Vision-Language Models

Abstract

Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.



Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience

  • Title: Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
  • Authors: Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, Thomas Hanwen Zhu
  • Date: 2025-12-19
  • arXiv page: https://arxiv.org/abs/2512.17260
  • Paper PDF: https://arxiv.org/pdf/2512.17260
  • GitHub repo: https://github.com/ByteDance-Seed/Seed-Prover

Abstract

Large language models have recently made significant progress to generate rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present Seed-Prover 1.5, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves 88% of PutnamBench (undergraduate-level), 80% of Fate-H (graduate-level), and 33% of Fate-X (PhD-level) problems. Notably, using our system, we solved 11 out of 12 problems from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.

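For readers unfamiliar with Lean, the kind of formal statement such a prover must close looks like the toy Lean 4 examples below (far simpler than Putnam problems, shown only to fix ideas; everything after `:=` is what the prover has to synthesize):

```lean
-- Toy Lean 4 examples (core library only). The statement after ":"
-- is the goal; a prover like Seed-Prover must produce the proof.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

example (n : Nat) : 0 + n = n :=
  Nat.zero_add n
```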


SpatialTree: How Spatial Abilities Scale in MLLMs

Abstract

Cognitive science suggests that spatial ability develops progressively-from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic-negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.



4D-RGPT: Region-Level 4D Understanding via Perceptual Distillation

Abstract

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.



Semantics and Reconstruction Both Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Abstract

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.



DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Abstract

The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.



Are We Evaluating LLM-as-a-Judge Correctly?

Abstract

LLM-as-a-Judge has been widely adopted as an evaluation method and served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge are mainly relying on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.

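Sage's "global logical consistency" is exactly transitivity of pairwise preferences: a judge whose preferences form a cycle (A > B, B > C, C > A) cannot be consistent. A self-contained sketch of such a check:

```python
# Sketch of a global logical consistency (transitivity) check: given a
# judge's pairwise preferences over answers, transitivity fails iff the
# preference digraph contains a cycle.
def is_transitive(prefs: dict[tuple[str, str], str]) -> bool:
    """prefs[(a, b)] = winner of comparing answers a and b."""
    items = {x for pair in prefs for x in pair}
    beats = {x: set() for x in items}
    for (a, b), w in prefs.items():
        beats[w].add(b if w == a else a)
    def has_cycle(x, stack):
        if x in stack:
            return True
        return any(has_cycle(y, stack | {x}) for y in beats[x])
    return not any(has_cycle(x, frozenset()) for x in items)

# A > B, B > C, but C > A: an intransitive ("situational") judge.
print(is_transitive({("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "C"}))  # False
```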


QuCo-RAG: Quantifying Uncertainty from the Pre-Training Corpus for Dynamic Retrieval-Augmented Generation

Abstract

Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.

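The two-stage trigger is easy to state procedurally: retrieve if any entity is long-tail (low corpus frequency), or if an entity pair in the draft never co-occurs in the pre-training corpus. In the sketch below, `corpus_count` is an assumed stand-in for an Infini-gram-style count lookup, not its real interface:

```python
# Sketch of QuCo-RAG's retrieval trigger. `corpus_count` is a
# hypothetical stand-in for an Infini-gram-style count API.
def should_retrieve(entities: list[str], corpus_count, low_freq=100) -> bool:
    # Stage 1 (before generation): long-tail entity check
    if any(corpus_count((e,)) < low_freq for e in entities):
        return True
    # Stage 2 (during generation): zero co-occurrence between entity pairs
    return any(corpus_count((a, b)) == 0
               for i, a in enumerate(entities) for b in entities[i + 1:])

fake_counts = {("Marie Curie",): 50_000, ("polonium",): 9_000,
               ("Marie Curie", "polonium"): 1_200}
count = lambda k: fake_counts.get(k, 0)
print(should_retrieve(["Marie Curie", "polonium"], count))      # False
print(should_retrieve(["Marie Curie", "transformers"], count))  # True
```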


Reinforcement Learning for Self-Improving Agent with Skill Library

  • Title: Reinforcement Learning for Self-Improving Agent with Skill Library
  • Authors: Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong
  • Date: 2025-12-18
  • arXiv page: https://arxiv.org/abs/2512.17102
  • Paper PDF: https://arxiv.org/pdf/2512.17102

Abstract

Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new environments. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. However, current skill library approaches rely primarily on LLM prompting, making consistent skill library implementation challenging. To overcome these challenges, we propose a Reinforcement Learning (RL)-based approach to enhance agents' self-improvement capabilities with a skill library. Specifically, we introduce Skill Augmented GRPO for self-Evolution (SAGE), a novel RL framework that systematically incorporates skills into learning. The framework's key component, Sequential Rollout, iteratively deploys agents across a chain of similar tasks for each rollout. As agents navigate through the task chain, skills generated from previous tasks accumulate in the library and become available for subsequent tasks. Additionally, the framework enhances skill generation and utilization through a Skill-integrated Reward that complements the original outcome-based rewards. Experimental results on AppWorld demonstrate that SAGE, when applied to supervised-finetuned model with expert experience, achieves 8.9% higher Scenario Goal Completion while requiring 26% fewer interaction steps and generating 59% fewer tokens, substantially outperforming existing approaches in both accuracy and efficiency.

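Sequential Rollout walks each rollout through a chain of similar tasks, letting skills validated on earlier tasks accumulate in a library that later tasks can use. A minimal sketch with agent and validator internals stubbed out (not SAGE's actual interfaces):

```python
# Sketch of Sequential Rollout: skills generated on earlier tasks in a
# chain accumulate in a library exposed to later tasks; the collected
# trajectories would then feed the GRPO update. Interfaces are stubs.
def sequential_rollout(task_chain, agent, validate):
    library = []                       # persists across the task chain
    trajectories = []
    for task in task_chain:
        traj, new_skills = agent(task, skills=library)
        library += [s for s in new_skills if validate(s)]
        trajectories.append(traj)
    return trajectories, library

# Stub usage: each task contributes one (always-valid) skill.
agent = lambda t, skills: (f"solved {t} using {len(skills)} skills",
                           [f"skill_{t}"])
_, lib = sequential_rollout(["t1", "t2", "t3"], agent, validate=lambda s: True)
print(lib)  # ['skill_t1', 'skill_t2', 'skill_t3']
```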


WorldWarp: Propagating 3D Geometry via Asynchronous Video Diffusion

Abstract

Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: https://hyokong.github.io/worldwarp-page/.

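The spatio-temporally varying noise schedule assigns full noise to hole pixels (so diffusion generates them) and partial noise to warped pixels (so diffusion only refines them). A toy sketch of per-pixel noising; the mixing rule and levels are illustrative, not ST-Diff's actual sampler schedule:

```python
# Sketch of a spatio-temporally varying noise schedule: hole pixels
# (not covered by the warped 3DGS cache) get full noise, warped
# pixels get partial noise. A simple variance-preserving mix is used
# here; the real ST-Diff schedule may differ.
import torch

def st_diff_noising(warped, hole_mask, t_full=1.0, t_partial=0.3):
    # warped:    (C, H, W) frame rendered from the 3D geometric cache
    # hole_mask: (1, H, W), 1 where warping left no content
    t = hole_mask * t_full + (1 - hole_mask) * t_partial   # per-pixel level
    noise = torch.randn_like(warped)
    return (1 - t).sqrt() * warped + t.sqrt() * noise, t

frame = torch.rand(3, 8, 8)
mask = torch.zeros(1, 8, 8); mask[..., 4:] = 1.0
noisy, t = st_diff_noising(frame, mask)
print(t.unique())  # tensor([0.3000, 1.0000])
```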


Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

  • Title: Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
  • Authors: NVIDIA, Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Damon Mosk-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Faisal Ladhak, Fay Wang, Fei Jia, Felipe Soares, Feng Chen, Ferenc Galko, Frankie Siino, Gal Hubara Agam, Ganesh Ajjanagadde, Gantavya Bhatt, Gargi Prasad, George Armstrong, Gerald Shen, Gorkem Batmaz, Grigor Nalbandyan, Haifeng Qian, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Himanshu Soni, Hiren Upadhyay, Huizi Mao, Huy C Nguyen, Huy Q Nguyen, Iain Cunningham, Ido Shahaf, Igor Gitman, Ilya Loshchilov, Ivan Moshkov, Izzy Putterman, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jian Zhang, Jiaqi Zeng, Jie Lou, Jimmy Zhang, Jining Huang, Joey Conway, Joey Guman, John Kamalu, Johnny Greco, Jonathan Cohen, Joseph Jennings, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kai Xu, Kan Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keshav Santhanam, Kevin Shih, Kezhi Kong, Khushi Bhardwaj, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Li Ding, Lucas Liebenwein, Luis Vega, Maanu Grover, Maarten Van Segbroeck, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Manoj Kilaru, Maor Ashkenazi, Marc Romeijn, Mark Cai, Markus Kliegl, Maryam Moosaei, Matvei Novikov, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Meredith Price, Michael Boone, Michael Evans, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Natalie Hereth, Nave Assaf, Negar Habibi, Neta Zmora, Netanel Haber, Nicola Sessions, Nidhi Bhatia, Nikhil Jukar, Nikki Pope, Nikolai Ludwig, Nima Tajbakhsh, Nirmal Juluru, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Ouye Xie, Parth Chadha, Pasha Shamis, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pinky Xu, Piotr Januszewski, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Qing Miao, Rabeeh Karimi Mahabadi, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Rich Harang, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Ritika Borkar, Ritu Gala, Riyad Islam, Roger Waleffe, Rohit Watve, Roi Koren, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Ryan Timbrook, Sadegh Mahdavi, Sahil Modi, Samuel Kriman, Sanjay Kariyappa, Sanjeev Satheesh, Saori Kaji, Satish Pasumarthi, Sean Narentharen, Sean Narenthiran, Seonmyeong Bak, Sergey Kashirsky, Seth Poulos, Shahar Mor, Shanmugam Ramasamy, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shiqing Fan, Shreya Gopal, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Simeng Sun, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Stefania Alborghetti, Stephen Ge, Sugam Dipak Devare, Sumeet Kumar Barua, Suseella Panguluri, Suyog Gupta, Sweta Priyadarshi, Syeda Nahida Akter, Tan Bui, Teodor-Dumitru Ene, Terry Kong, Thanh Do, Tijmen Blankevoort, Tom Balough, Tomer Asida, Tomer Bar Natan, Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Vahid Noroozi, Venkat Srinivasan, Venmugil Elango, Vijay Korthikanti, Vitaly Kurin, Vitaly Lavrukhin, Wanli Jiang, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wenfei Zhou, Will Jennings, William Zhang, Wojciech Prazuch, Xiaowei Ren, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Yoshi Subara, Yoshi Suhara, Yubo Gao, Zach Moshe, Zhen Dong, Zihan Liu, Zijia Chen, Zijie Yan
  • Date: 2025-12-23
  • arXiv page: https://arxiv.org/abs/2512.20848
  • Paper PDF: https://arxiv.org/pdf/2512.20848

Abstract

We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.



Spatia: Video Generation with Updatable Spatial Memory

Abstract

Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.


