【论文速递】2025年第51周(Dec-14-20)(Robotics/Embodied AI/LLM)

中文使用 googletrans 翻译,翻译不对的地方以英文为准

目录

Kling-Omni 技术报告

  • 标题: Kling-Omni Technical Report
  • 作者: Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu
  • 日期: 2025-12-18
  • ArXiv主页 : https://arxiv.org/abs/2512.16776
  • 论文链接 : https://arxiv.org/pdf/2512.16776

英文摘要

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.

中文摘要

我们提出了 Kling-Omni,这是一个通用生成框架,旨在直接从多模态视觉语言输入合成高保真视频。Kling-Omni 采用端到端的视角,弥合了各类视频生成、编辑和智能推理任务之间的功能割裂,将它们集成到一个整体系统中。与相互割裂的流水线方法不同,Kling-Omni 支持多样化的用户输入,包括文本指令、参考图像和视频上下文,并将它们处理成统一的多模态表示,以提供影院级画质、高度智能化的视频内容创作。为了支持这些能力,我们构建了一个全面的数据系统,作为多模态视频创作的基础。高效的大规模预训练策略和推理基础设施优化进一步增强了该框架。综合评估表明,Kling-Omni 在上下文生成、基于推理的编辑和多模态指令遵循方面表现出卓越的能力。我们相信 Kling-Omni 不仅是一个内容创作工具,更是迈向多模态世界模拟器的关键一步,使模型能够感知、推理、生成动态而复杂的世界并与之交互。


Step-GUI技术报告

  • 标题: Step-GUI Technical Report
  • 作者: Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang
  • 日期: 2025-12-17
  • ArXiv主页 : https://arxiv.org/abs/2512.15431
  • 论文链接 : https://arxiv.org/pdf/2512.15431
  • 项目链接 : https://opengelab.github.io/
  • GitHub仓库 : https://github.com/stepfun-ai/gelab-zero

英文摘要

Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.

中文摘要

多模态大语言模型的最新进展为 GUI 自动化带来了前所未有的机遇。然而,一个根本性的挑战仍然存在:如何在保证标注可靠性的同时高效获取高质量的训练数据?我们引入了由校准步骤奖励系统(Calibrated Step Reward System)驱动的自我进化训练管道,该管道通过轨迹级校准将模型生成的轨迹转换为可靠的训练信号,以低 10-100 倍的成本实现超过 90% 的标注准确率。利用该管道,我们推出了 Step-GUI,这是一个模型系列(4B/8B),可实现最先进的 GUI 性能(8B:AndroidWorld 80.2%、OSWorld 48.5%、ScreenShot-Pro 62.6%),同时保持强大的通用能力。随着 GUI 智能体能力的提升,实际部署需要跨异构设备的标准化接口,同时保护用户隐私。为此,我们提出了 GUI-MCP,这是第一个面向 GUI 自动化的模型上下文协议(Model Context Protocol),其分层架构将低层原子操作与向本地专家模型进行的高层任务委托相结合,使敏感数据保留在设备上,实现高隐私执行。最后,为了评估智能体能否应对真实的日常使用场景,我们引入了 AndroidDaily,这是一个基于现实世界移动设备使用模式的基准,涵盖高频日常场景中的 3146 个静态操作和 235 个端到端任务(8B:静态 89.91%,端到端 52.50%)。我们的工作推进了实用 GUI 智能体的发展,并展示了在日常数字交互中实际部署的强大潜力。
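
下面给出一个极简的概念性代码草图,用于说明摘要中"低层原子操作 + 向本地专家模型进行高层任务委托"的分层思路;其中的类名、函数名与注册方式均为示意性假设,并不代表论文中 GUI-MCP 协议的真实接口定义。

```python
# 分层 GUI 控制接口的概念示意:低层原子操作 + 高层任务委托给设备端专家模型
# (接口与函数名均为假设,不代表 GUI-MCP 协议的实际定义)
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class GUITool:
    name: str
    level: str                      # "atomic"(原子操作)或 "delegate"(任务委托)
    run: Callable[..., Any]

def make_registry(local_agent) -> Dict[str, GUITool]:
    return {
        "tap":  GUITool("tap", "atomic", lambda x, y: f"tap({x},{y})"),
        "type": GUITool("type", "atomic", lambda text: f"type({text!r})"),
        # 高层任务整体交给设备端专家模型执行,敏感截图与文本不出设备
        "do_task": GUITool("do_task", "delegate", lambda goal: local_agent(goal)),
    }

registry = make_registry(local_agent=lambda goal: f"[on-device agent] {goal} -> done")
print(registry["tap"].run(120, 480))
print(registry["do_task"].run("把今晚 8 点的闹钟改到 7 点"))
```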


MMGR:多模态生成推理

英文摘要

Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.

中文摘要

视频基础模型能够生成视觉逼真且时间连贯的内容,但它们作为世界模拟器的可靠性取决于是否捕获了物理、逻辑和空间约束。Frechet Video Distance(FVD)等现有指标强调感知质量,却忽视了推理失败,包括违反因果关系、物理规律和全局一致性。我们引入了 MMGR(多模态生成推理评估与基准),这是一个基于五种推理能力的原则性评估框架:物理、逻辑、3D 空间、2D 空间和时间。MMGR 在三个领域评估生成式推理:抽象推理(ARC-AGI、数独)、具身导航(真实世界 3D 导航与定位)和物理常识(体育运动与组合交互)。MMGR 采用细粒度指标,要求视频和图像生成在整体上正确。我们对领先的视频模型(Veo-3、Sora-2、Wan-2.2)和图像模型(Nano-banana、Nano-banana Pro、GPT-4o-image、Qwen-image)进行了基准测试,揭示了不同领域间显著的性能差距。模型在物理常识任务上取得一定成功,但在抽象推理上表现不佳(ARC-AGI 准确率低于 10%),并且在具身环境中的长时程空间规划上表现吃力。我们的分析指出了当前模型的主要局限,包括过度依赖感知数据、全局状态一致性薄弱,以及奖励视觉合理性而非因果正确性的优化目标。MMGR 提供了统一的诊断基准,以及通往具备推理能力的生成式世界模型的路径。


人工智能代理时代的记忆

  • 标题: Memory in the Age of AI Agents

  • 作者: Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan

  • 日期: 2025-12-15

  • ArXiv主页 : https://arxiv.org/abs/2512.13564

  • 论文链接 : https://arxiv.org/pdf/2512.13564

  • GitHub仓库 : https://github.com/Shichun-Liu/Agent-Memory-Paper-List

英文摘要

Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.

中文摘要

记忆已经成为并将继续是基于基础模型的智能体的核心能力。随着智能体记忆的研究迅速扩展并引起前所未有的关注,该领域也变得越来越碎片化。归入智能体记忆范畴的现有工作往往在动机、实现和评估协议上差异很大,而大量定义松散的记忆术语进一步模糊了概念的清晰度。事实证明,长/短期记忆等传统分类法不足以刻画当代智能体记忆系统的多样性。这项工作旨在呈现当前智能体记忆研究的最新全景。我们首先明确界定智能体记忆的范围,并将其与 LLM 记忆、检索增强生成(RAG)和上下文工程等相关概念区分开来。然后,我们从形式、功能和动态三个统一视角来审视智能体记忆。从形式的角度,我们识别出智能体记忆的三种主要实现方式,即词元级记忆、参数化记忆和潜在记忆。从功能的角度,我们提出了更细粒度的分类法,区分事实记忆、经验记忆和工作记忆。从动态的角度,我们分析记忆如何随时间形成、演化和被检索。为了支持实际开发,我们汇总了记忆基准测试和开源框架的全面总结。除了梳理整合之外,我们还阐述了对新兴研究前沿的前瞻性观点,包括记忆自动化、与强化学习的结合、多模态记忆、多智能体记忆以及可信性问题。我们希望这篇综述不仅可以作为现有工作的参考,还能作为将记忆视为未来智能体设计中第一类原语(first-class primitive)加以重新思考的概念基础。


EgoX:从单个外心视频生成以自我为中心的视频

英文摘要

Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width and channel wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.

中文摘要

以自我为中心的感知使人类能够直接从自身视角体验和理解世界。将外中心(第三人称)视频转换为自我中心(第一人称)视频为沉浸式理解开辟了新的可能性,但由于相机位姿变化极大且视角重叠极小,这一任务仍然极具挑战性。它要求在忠实保留可见内容的同时,以几何一致的方式合成不可见区域。为此,我们提出了 EgoX,这是一个从单个外中心输入生成自我中心视频的新颖框架。EgoX 通过轻量级的 LoRA 适配利用大规模视频扩散模型预训练的时空知识,并引入统一的条件化策略,通过宽度和通道维度的拼接将外中心与自我中心先验结合起来。此外,几何引导的自注意力机制有选择地关注空间相关区域,确保几何连贯性和高视觉保真度。我们的方法实现了连贯且逼真的自我中心视频生成,同时在未见过的和真实场景(in-the-wild)视频上展示了强大的可扩展性和鲁棒性。
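
摘要提到 EgoX 通过"宽度与通道维度拼接"来注入外中心与自我中心先验。下面用 PyTorch 张量操作示意这两种拼接方式本身;张量形状以及哪种先验走哪个维度均为假设,并非论文的实际实现。

```python
# 两种条件拼接方式的示意(形状与条件分配均为假设,非 EgoX 官方实现)
import torch

B, C, T, H, W = 1, 16, 8, 32, 32
ego_latent = torch.randn(B, C, T, H, W)   # 待去噪的自我中心视频潜变量
exo_cond   = torch.randn(B, C, T, H, W)   # 外中心(第三人称)视频条件
ego_prior  = torch.randn(B, C, T, H, W)   # 自我中心先验条件

x_width   = torch.cat([ego_latent, exo_cond], dim=-1)  # 沿宽度维拼接 -> (B, C, T, H, 2W)
x_channel = torch.cat([ego_latent, ego_prior], dim=1)  # 沿通道维拼接 -> (B, 2C, T, H, W)
print(x_width.shape, x_channel.shape)
```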


QwenLong-L1.5:长上下文推理与记忆管理的后训练方案

英文摘要

We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool using, and extended dialogue.

中文摘要

我们推出了 QwenLong-L1.5,该模型通过系统性的后训练创新实现了卓越的长上下文推理能力。QwenLong-L1.5 的关键技术突破如下:(1)长上下文数据合成管道:我们开发了一个系统化的合成框架,可以生成需要在全局分布的证据上进行多跳定位(grounding)的高难度推理任务。通过将文档解构为原子事实及其内在关系,再以程序化方式组合出可验证的推理问题,我们的方法得以大规模构建高质量训练数据,远远超出简单的检索任务,从而实现真正的长程推理能力。(2)面向长上下文训练的稳定强化学习:为了克服长上下文强化学习中的严重不稳定性,我们引入了任务平衡采样和按任务的优势估计以减轻奖励偏差,并提出了自适应熵控制策略优化(AEPO)来动态调节探索与利用的权衡。(3)面向超长上下文的记忆增强架构:考虑到即使扩展后的上下文窗口也无法容纳任意长的序列,我们开发了一个带有多阶段融合强化学习训练的记忆管理框架,将单遍推理与基于记忆的迭代处理无缝结合,用于超过 4M 词元的任务。基于 Qwen3-30B-A3B-Thinking,QwenLong-L1.5 在长上下文推理基准上实现了与 GPT-5 和 Gemini-2.5-Pro 相当的性能,平均比其基线高出 9.90 分。在超长任务(1M~4M 词元)上,QwenLong-L1.5 的记忆代理框架比代理基线提高了 9.48 分。此外,所获得的长上下文推理能力还能迁移到科学推理、记忆工具使用和长对话等通用领域,带来性能提升。
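
针对摘要中"单遍推理与基于记忆的迭代处理相结合"的超长上下文处理方式,下面给出一个概念性的最小草图:逐块读入文档、滚动更新记忆、最后基于记忆作答。其中 llm 接口、提示词和分块大小均为假设,并非 QwenLong-L1.5 的官方记忆管理实现。

```python
# 超长上下文"记忆代理"式迭代处理的概念示意(llm 接口与提示词均为假设)
def answer_with_memory(llm, document, question, chunk_size=32_000):
    memory = ""
    for start in range(0, len(document), chunk_size):
        chunk = document[start:start + chunk_size]
        memory = llm(
            f"已有记忆:\n{memory}\n\n新片段:\n{chunk}\n\n"
            f"请更新与问题相关的记忆要点。问题:{question}"
        )
    # 单遍推理与基于记忆的迭代处理可按任务长度切换;此处仅演示记忆路径
    return llm(f"根据记忆回答问题。\n记忆:\n{memory}\n问题:{question}")

# 用"截取提示词尾部"的假 LLM 演示控制流可以跑通
fake_llm = lambda prompt: prompt[-200:]
print(answer_with_memory(fake_llm, document="A" * 200, question="示例问题", chunk_size=80))
```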


面向生成的视觉分词器的可扩展预训练

英文摘要

The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the pre-training scaling problem and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves 65.8% FID improvement in downstream generation, while conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.

中文摘要

视觉分词器(例如 VAE)潜在空间的质量对现代生成模型至关重要。然而,标准的基于重建的训练范式会产生偏向低层信息的潜在空间,带来一个根本性缺陷:更高的像素级精度并不意味着更高质量的生成。这意味着把大量计算投入到视觉分词器预训练上,很难转化为生成性能的提升。我们将其归结为"预训练缩放问题",并指出必要的转变:要对生成有效,潜在空间必须简洁地表示高层语义。我们提出了 VTP,一个统一的视觉分词器预训练框架,率先对图文对比、自监督和重建损失进行联合优化。我们的大规模研究揭示了两个主要发现:(1)理解能力是生成的关键驱动因素;(2)显著更好的缩放特性,即生成性能能够随着投入到视觉分词器预训练的计算量、参数量和数据量有效扩展。经过大规模预训练后,我们的分词器取得了有竞争力的综合表现(ImageNet 上 78.2 的零样本准确率和 0.36 的 rFID),并且与先进的蒸馏方法相比,生成任务的收敛速度快 4.1 倍。更重要的是,它可以有效扩展:在不修改标准 DiT 训练配置的情况下,仅通过在 VTP 预训练上投入更多 FLOPS,即可在下游生成中带来 65.8% 的 FID 改进,而传统自编码器在 1/10 FLOPS 时就早早停滞。我们的预训练模型可在 https://github.com/MiniMax-AI/VTP 获取。
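
摘要的核心是对图文对比、自监督与重建三类损失做联合优化。下面是一个示意性的损失组合草图(假设对比项采用 InfoNCE、重建项采用 MSE,自监督项以外部传入的标量表示;权重与实现细节均为假设,并非 VTP 的官方训练代码)。

```python
# 图文对比 + 自监督 + 重建三类损失联合优化的概念示意(各项实现与权重均为假设)
import torch
import torch.nn.functional as F

def joint_loss(img_emb, txt_emb, recon, target, ssl_loss, w=(1.0, 1.0, 1.0), tau=0.07):
    # 图文对比(InfoNCE):batch 内配对样本互为正例
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    reconstruction = F.mse_loss(recon, target)       # 像素或潜空间重建
    return w[0] * contrastive + w[1] * ssl_loss + w[2] * reconstruction

loss = joint_loss(torch.randn(8, 512), torch.randn(8, 512),
                  torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32),
                  ssl_loss=torch.tensor(0.5))
print(loss.item())
```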


代理人工智能的适应

  • 标题: Adaptation of Agentic AI

  • 作者: Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han

  • 日期: 2025-12-18

  • ArXiv主页 : https://arxiv.org/abs/2512.16301

  • 论文链接 : https://arxiv.org/pdf/2512.16301

  • GitHub仓库 : https://github.com/pat-jj/Awesome-Adaptation-of-Agentic-AI

英文摘要

Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.

中文摘要

尖端的代理人工智能(Agentic AI)系统建立在基础模型之上,这些模型经过适应后可用于规划、推理以及与外部工具交互,以执行日益复杂和专业的任务。随着这些系统的能力和范围不断扩大,适应(adaptation)成为提升性能、可靠性和泛化能力的核心机制。在本文中,我们将快速扩张的研究版图统一到一个同时涵盖代理适应与工具适应的系统框架中。我们进一步将前者细分为以工具执行为信号和以代理输出为信号的代理适应,将后者细分为与代理无关和由代理监督的工具适应。我们证明,该框架有助于厘清代理人工智能中适应策略的设计空间,使其权衡显式化,并为系统设计中策略的选择或切换提供实用指导。随后,我们回顾了每个类别的代表性方法,分析其优势与局限,并指出关键的开放挑战和未来机遇。总体而言,本文旨在为希望构建更强大、更高效、更可靠的代理人工智能系统的研究人员和从业者提供概念基础和实践路线图。


ReFusion:具有并行自回归解码的扩散大型语言模型

英文摘要

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18times speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33times average speedup.

中文摘要

自回归模型(ARM)受制于缓慢的顺序推理。虽然掩码扩散模型(MDM)提供了一种并行替代方案,但它们存在两个关键缺陷:由于无法使用键值(KV)缓存而带来的高计算开销,以及在难以处理的词元组合空间上学习依赖关系所导致的生成不连贯。为了解决这些限制,我们引入了 ReFusion,这是一种新颖的掩码扩散模型,通过将并行解码从词元级提升到更高的槽(slot)级来实现更优的性能和效率,其中每个槽是一段固定长度的连续子序列。这是通过迭代的"规划-填充"(plan-and-infill)解码过程实现的:基于扩散的规划步骤首先识别一组弱相关的槽,然后自回归填充步骤并行解码这些选定的槽。基于槽的设计一方面在统一的因果框架下解锁了完整的 KV 缓存复用,另一方面将学习复杂度从词元组合空间降低到可控的槽级排列空间。在七个不同基准上的大量实验表明,ReFusion 不仅以 34% 的性能提升和平均超过 18 倍的加速大幅超越了此前的 MDM,还在保持平均 2.33 倍加速的同时缩小了与强自回归模型之间的性能差距。
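
下面用一个极简的 Python 草图示意摘要所述的迭代"规划-填充"槽级解码流程;plan_fn 与 infill_fn 只是占位接口,具体的扩散规划与自回归填充实现均为假设,并非 ReFusion 的官方代码。

```python
# "规划-填充"式槽级并行解码流程的简化示意(plan_fn / infill_fn 为假设接口)
import random

def plan_and_infill(seq, slot_size, plan_fn, infill_fn):
    """seq: 含 None(掩码)占位的 token 列表,长度为 slot_size 的整数倍"""
    slots = [range(i, i + slot_size) for i in range(0, len(seq), slot_size)]
    while any(seq[s[0]] is None for s in slots):
        masked = [s for s in slots if seq[s[0]] is None]
        chosen = plan_fn(seq, masked)                 # 规划:选出一组弱相关的槽
        filled = infill_fn(seq, chosen)               # 填充:各槽内部顺序解码,槽间并行
        for s, toks in zip(chosen, filled):
            for j, t in zip(s, toks):
                seq[j] = t
    return seq

# 用随机"模型"跑通流程的玩具示例
demo = plan_and_infill(
    seq=[None] * 8, slot_size=4,
    plan_fn=lambda seq, masked: masked[:1],           # 每轮只选一个槽
    infill_fn=lambda seq, chosen: [[random.randint(0, 9) for _ in s] for s in chosen],
)
print(demo)
```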


下一个嵌入预测使视觉学习者变得强大

英文摘要

Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.

中文摘要

受到自然语言中生成式预训练成功的启发,我们探究同样的原则能否造就强大的自监督视觉学习器。我们不是训练模型输出供下游使用的特征,而是训练它们生成嵌入以直接执行预测任务。这项工作探索了从学习表征到学习模型的转变。具体来说,模型借助因果掩码和停止梯度(stop gradient),学习在已有图像块(patch)嵌入的条件下预测后续的图像块嵌入,我们将其称为下一个嵌入预测自回归(Next-Embedding Predictive Autoregression, NEPA)。我们证明,一个在 ImageNet-1k 上以下一个嵌入预测为唯一学习目标进行预训练的简单 Transformer 就足够有效:无需像素重建、离散词元、对比损失或任务特定的输出头。这一表述保持了架构的简洁性和可扩展性,无需额外的设计复杂度。NEPA 在各项任务上都取得了优异的成绩:经过微调后,采用 ViT-B 和 ViT-L 主干的模型在 ImageNet-1K 上分别达到 83.8% 和 85.3% 的 top-1 准确率,并能有效迁移到 ADE20K 的语义分割任务。我们相信,基于嵌入的生成式预训练为视觉自监督学习提供了一种简单、可扩展且可能与模态无关的替代方案。
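
摘要把训练目标描述得相当具体:在因果注意力下,用前缀图像块嵌入预测下一个图像块嵌入,目标侧使用停止梯度。下面给出一个可运行的 PyTorch 最小草图(假设采用线性 patch 嵌入与余弦相似度损失;网络规模、损失形式等细节均为假设,并非论文官方实现)。

```python
# 下一个嵌入预测的最小示意(假设:线性 patch 嵌入 + 余弦相似度损失;非论文官方实现)
import torch
import torch.nn as nn
import torch.nn.functional as F

class NEPASketch(nn.Module):
    def __init__(self, dim=256, patch=16, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # 图像 -> patch 嵌入
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)                                   # 预测下一个 patch 嵌入

    def forward(self, images):
        z = self.embed(images).flatten(2).transpose(1, 2)                 # (B, N, dim)
        N = z.size(1)
        causal = torch.triu(torch.ones(N, N, device=z.device), 1).bool()  # 屏蔽未来位置
        h = self.encoder(z, mask=causal)                                  # 因果注意力
        pred = self.head(h[:, :-1])                                       # 用前缀预测下一个嵌入
        target = z[:, 1:].detach()                                        # stop-gradient 目标
        return 1 - F.cosine_similarity(pred, target, dim=-1).mean()

loss = NEPASketch()(torch.randn(2, 3, 64, 64))
print(loss.item())
```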


LLaDA2.0:将扩散语言模型扩展到 100B

  • 标题: LLaDA2.0: Scaling Up Diffusion Language Models to 100B

  • 作者: Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang

  • 日期: 2025-12-10

  • ArXiv主页 : https://arxiv.org/abs/2512.15745

  • 论文链接 : https://arxiv.org/pdf/2512.15745

  • GitHub仓库 : https://github.com/inclusionAI/LLaDA2.0

英文摘要

This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable) and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.

中文摘要

本文介绍了 LLaDA2.0:一组离散扩散大语言模型(dLLM),通过从自回归(AR)模型的系统性转换将总参数规模扩展到 100B,为前沿规模的部署确立了新范式。LLaDA2.0 没有付出从零训练的高昂代价,而是秉承知识继承、渐进适应和效率感知的设计原则,通过一种新颖的三阶段块级 WSD 训练方案,将预训练的 AR 模型无缝转换为 dLLM:在块扩散中逐步增大块大小(预热阶段)、大规模全序列扩散(稳定阶段),以及回退到紧凑块大小的块扩散(衰减阶段)。再经过 SFT 和 DPO 的后训练对齐,我们得到了 LLaDA2.0-mini(16B)和 LLaDA2.0-flash(100B)这两个为实际部署优化的指令微调专家混合(MoE)变体。凭借保留的并行解码优势,这些模型在前沿规模下提供了卓越的性能和效率。两个模型均已开源。


LongVie 2:多模态可控超长视频世界模型

英文摘要

Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach-first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.

中文摘要

在预训练的视频生成系统之上构建视频世界模型,是迈向通用时空智能的重要而富有挑战性的一步。世界模型应具备三个基本属性:可控性、长期视觉质量和时间一致性。为此,我们采取渐进式路线:先增强可控性,再扩展到长时程、高质量的生成。我们提出了 LongVie 2,一个分三个阶段训练的端到端自回归框架:(1)多模态引导,整合稠密与稀疏控制信号,提供隐式的世界级监督并提升可控性;(2)对输入帧进行退化感知训练,弥合训练与长时程推理之间的差距,以保持较高的视觉质量;(3)历史上下文引导,对齐相邻片段间的上下文信息以确保时间一致性。我们进一步推出了 LongVGenBench,这是一个包含 100 个高分辨率一分钟视频的综合基准,涵盖多样的真实与合成环境。大量实验表明,LongVie 2 在长程可控性、时间连贯性和视觉保真度方面达到了最先进水平,并支持长达五分钟的连续视频生成,标志着向统一视频世界建模迈出了重要一步。


WorldPlay:实现实时交互式世界建模的长期几何一致性

英文摘要

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

中文摘要

本文介绍了 WorldPlay,这是一种流式视频扩散模型,能够实现具有长期几何一致性的实时交互式世界建模,化解了限制现有方法的速度与记忆之间的权衡。WorldPlay 依托三项关键创新。1)我们使用双重动作表示(Dual Action Representation),以响应用户键盘和鼠标输入实现稳健的动作控制。2)为了保证长期一致性,我们的重构上下文记忆(Reconstituted Context Memory)动态地从过去帧中重建上下文,并通过时间重构(temporal reframing)让几何上重要但年代久远的帧保持可访问,从而有效缓解记忆衰减。3)我们还提出了上下文强制(Context Forcing),一种专为记忆感知模型设计的新颖蒸馏方法。对齐教师与学生之间的记忆上下文,可以保留学生利用长程信息的能力,在实现实时速度的同时防止误差漂移。综合来看,WorldPlay 能以 24 FPS 生成长时程的 720p 流式视频,一致性出色,与现有技术相比毫不逊色,并在多样场景中展现出强大的泛化能力。项目页面和在线演示请参见:https://3d-models.hunyuan.tencent.com/world/ 和 https://3d.hunyuan.tencent.com/sceneTo3D。


视频现实测试:AI 生成的 ASMR 视频能否欺骗 VLM 和人类?

英文摘要

Recent advances in video generation have produced vivid content that are often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus on classification solely. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-Review evaluation. An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show: The best creator Veo3.1-Fast even fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.

中文摘要

视频生成的最新进展已能产出常常与真实视频难以区分的逼真内容,使得 AI 生成视频的检测成为一个新兴的社会性挑战。此前的 AIGC 检测基准大多评估无音频的视频、面向宽泛的叙事领域,且仅关注分类任务。然而,最先进的视频生成模型能否生成可靠欺骗人类和 VLM 的沉浸式音画配对视频,目前尚不清楚。为此,我们推出了 Video Reality Test,这是一个取材自 ASMR 的视频基准套件,用于在紧密视听耦合下测试感知真实感,具有以下特点:(i)沉浸式 ASMR 视听素材。该基准建立在精心筛选的真实 ASMR 视频之上,聚焦细粒度的动作-物体交互,并在物体、动作和背景上保持多样性。(ii)同行评审式评估。一种对抗性的创作者-评审者协议:视频生成模型作为创作者力图骗过评审者,而 VLM 作为评审者试图识别伪造内容。我们的实验结果表明:最强的创作者 Veo3.1-Fast 甚至骗过了大多数 VLM,最强的评审者(Gemini 2.5-Pro)仅达到 56% 的准确率(随机为 50%),远低于人类专家(81.25%)。加入音频可以提升真伪辨别能力,但水印等表面线索仍会严重误导模型。这些发现勾勒出当前视频生成真实感的边界,并暴露了 VLM 在感知保真度和视听一致性方面的局限。我们的代码可在 https://github.com/video-reality-test/video-reality-test 获取。


Qwen-Image-Layered:通过层分解实现固有的可编辑性

英文摘要

Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released at https://github.com/QwenLM/Qwen-Image-Layered

中文摘要

由于栅格图像的所有视觉内容都融合在单一画布上、彼此纠缠,最近的视觉生成模型在图像编辑时往往难以保持一致性。相比之下,专业设计工具采用分层表示,允许在保持整体一致性的前提下进行局部独立编辑。受此启发,我们提出了 Qwen-Image-Layered,这是一种端到端扩散模型,可将单张 RGB 图像分解为多个语义解耦的 RGBA 图层,从而获得内在的可编辑性:每个 RGBA 图层都可以独立操作而不影响其他内容。为了支持可变长度的分解,我们引入了三个关键组件:(1)RGBA-VAE,用于统一 RGB 与 RGBA 图像的潜在表示;(2)VLD-MMDiT(可变图层分解 MMDiT)架构,能够分解数量可变的图像图层;(3)多阶段训练策略,将预训练的图像生成模型适配为多层图像分解器。此外,为了解决高质量多层训练图像的稀缺问题,我们构建了一条从 Photoshop 文档(PSD)中提取并标注多层图像的管道。实验表明,我们的方法在分解质量上显著超越现有方法,并为一致性图像编辑确立了新的范式。我们的代码和模型发布于 https://github.com/QwenLM/Qwen-Image-Layered
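
为了说明"把单张 RGB 图像分解为多个 RGBA 图层"所对应的逆过程,下面给出标准 alpha-over 合成的最小示例:自底向上逐层合成即可还原出单张 RGB 图像。这只是通用的图层合成公式,与论文模型本身无关。

```python
# RGBA 图层经标准 alpha 合成还原为单张 RGB 图像(通用公式,仅作说明)
import numpy as np

def composite(layers):
    """layers: 按自底向上顺序排列的 (H, W, 4) float 数组列表,取值范围 [0, 1]"""
    h, w, _ = layers[0].shape
    out = np.zeros((h, w, 3))
    for layer in layers:                       # 逐层执行 alpha-over: out = a*fg + (1-a)*out
        rgb, a = layer[..., :3], layer[..., 3:4]
        out = a * rgb + (1.0 - a) * out
    return out

image = composite([np.random.rand(64, 64, 4) for _ in range(3)])
print(image.shape)
```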


Finch:面向以电子表格为中心的企业工作流的财务与会计基准

英文摘要

We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows -- interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max, and GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.

中文摘要

我们引入了财务与会计基准 Finch,用于评估 AI 智能体在真实的企业级专业工作流上的表现,这些工作流交织着数据录入、结构化、格式化、网络搜索、跨文件检索、计算、建模、校验、翻译、可视化和报告等环节。Finch 取材于安然(Enron)的真实企业工作空间(来自 150 名员工的 15,000 份电子表格和 500,000 封电子邮件)以及其他金融机构,保留了真实环境下多模态材料(文本、表格、公式、图表、代码和图像)的原始杂乱性,并覆盖预算、交易和资产管理等不同领域。我们提出了一种将 LLM 辅助发现与专家标注相结合的工作流构建流程:(1)由 LLM 辅助、经专家验证,从真实的电子邮件往来和电子表格文件的版本历史中推导工作流;(2)对工作流进行细致的专家标注,耗费领域专家 700 多个小时。由此得到 172 个复合工作流、384 项任务,涉及 1,710 份电子表格、2,700 万个单元格,以及 PDF 和其他材料,捕捉了真实企业工作本质上杂乱、长时程、知识密集和协作性强的特点。我们对 GPT 5.1、Claude Sonnet 4.5、Gemini 3 Pro、Grok 4 和 Qwen 3 Max 等前沿 AI 系统进行了人工和自动评估:GPT 5.1 Pro 总共耗时 48 小时,却只通过了 38.4% 的工作流,而 Claude Sonnet 4.5 仅通过 25.0%。全面的案例研究进一步揭示了真实企业工作流给 AI 智能体带来的挑战。


NL2Repo-Bench:面向编码代理的长时程代码仓库生成评估

  • 标题: NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

  • 作者: Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang

  • 日期: 2025-12-14

  • ArXiv主页 : https://arxiv.org/abs/2512.12730

  • 论文链接 : https://arxiv.org/pdf/2512.12730

  • GitHub仓库 : https://github.com/multimodal-art-projection/NL2RepoBench

英文摘要

Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.

中文摘要

编码代理的最新进展预示着自主软件开发的快速进步,但现有基准未能严格评估构建完整软件系统所需的长时程能力。此前的评估大多集中在局部代码生成、基于脚手架的补全或短期修复任务上,因而尚不清楚代理能否在真实代码仓库构建所需的长时间跨度上维持连贯的推理、规划和执行。为了填补这一空白,我们提出了 NL2Repo Bench,这是一个专门用于评估编码代理长时程代码仓库生成能力的基准。仅给定一份自然语言需求文档和一个空工作区,代理必须自主设计架构、管理依赖、实现多模块逻辑,并产出一个可完整安装的 Python 库。我们在最先进的开源与闭源模型上的实验表明,长时程代码仓库生成在很大程度上仍未解决:即使是最强的代理,平均测试通过率也低于 40%,且很少能正确完成整个代码仓库。细致的分析揭示了根本性的长时程失败模式,包括过早终止、全局连贯性丧失、脆弱的跨文件依赖,以及在数百个交互步骤上的规划不足。NL2Repo Bench 建立了一个严格、可验证的测试平台来衡量持续的代理能力,并指出长时程推理是下一代自主编码代理的核心瓶颈。


DEER:用扩散模型起草,用自回归模型验证

英文摘要

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafters can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc, will be available at https://czc726.github.io/DEER/

中文摘要

效率是 LLM 驱动的智能体与推理系统面临的关键现实挑战,而它正日益受限于自回归(AR)解码固有的延迟。推测解码通过"草稿-验证"方案降低这一成本,但现有方法依赖 AR 草稿模型(即起草器,drafter),这带来两个根本问题:(1)逐步累积的不确定性导致目标模型与起草器之间的信任逐渐崩塌;(2)AR 起草器本质上仍是顺序解码。二者共同导致加速有限。在本文中,我们表明扩散大语言模型(dLLM)起草器凭借其截然不同的概率建模方式和高效的并行解码策略,可以自然地克服这些问题。基于这一洞察,我们提出了 DEER,一个高效的推测解码框架:用扩散模型起草,用 AR 模型验证。为了获得高质量的草稿,DEER 采用两阶段训练管道使基于 dLLM 的起草器与目标 AR 模型对齐,并进一步采用单步解码来生成较长的草稿段。实验表明,DEER 的草稿接受长度最高可达 32 个词元,远超 EAGLE-3 的 10 个词元。此外,在使用 Qwen3-30B-A3B 的 HumanEval 上,DEER 获得了 5.54 倍加速,而 EAGLE-3 仅为 2.41 倍。代码、模型、演示等将在 https://czc726.github.io/DEER/ 提供。
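
下面是推测解码"草稿-验证"循环的一个简化草图,用以说明"扩散起草、AR 验证"的基本流程:draft_fn 代表一次并行生成整段草稿的起草器,verify_fn 代表目标 AR 模型的一次并行前向。接口形式与贪心匹配的接受准则都是简化假设,并非 DEER 的官方实现。

```python
# 推测解码"草稿-验证"循环的简化示意(贪心匹配接受准则;接口均为假设)
def speculative_decode(draft_fn, verify_fn, prompt, max_new=64, draft_len=16):
    """draft_fn(seq, n): 一次并行生成 n 个草稿 token(如 dLLM 单步解码)
    verify_fn(seq):      目标 AR 模型并行前向,返回每个位置之后的贪心 token"""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        draft = draft_fn(seq, draft_len)            # 起草:单步并行生成长草稿段
        greedy = verify_fn(seq + draft)             # 验证:AR 模型一次前向覆盖所有草稿位置
        accepted = []
        for i, tok in enumerate(draft):             # 接受与 AR 贪心输出一致的前缀
            if tok == greedy[len(seq) + i - 1]:
                accepted.append(tok)
            else:
                accepted.append(greedy[len(seq) + i - 1])  # 首个不一致处改用 AR 的 token
                break
        seq += accepted
    return seq

# 玩具示例:目标"模型"认为正确的下一个 token 恒为 (当前 token + 1) % 10
verify = lambda seq: [(t + 1) % 10 for t in seq]
draft  = lambda seq, n: [(seq[-1] + 1 + i) % 10 for i in range(n)]  # 起草器与目标一致,整段被接受
print(speculative_decode(draft, verify, prompt=[1, 2, 3], max_new=6, draft_len=4))
```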


DentalGPT:激励牙科领域的多模态复杂推理

  • 标题: DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
  • 作者: Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li, Jingyi Liang, Junying Chen, Yunjin Yang, Jiajun You, Shuzhi Deng, Tongfei Wang, Wanting Chen, Chunxiu Hao, Ruiqi Xie, Zhenwei Wen, Xiangyi Feng, Zou Ting, Jin Zou Lin, Jianquan Li, Guangjun Yu, Liangyi Chen, Junwen Wang, Shan Jiang, Benyou Wang
  • 日期: 2025-12-12
  • ArXiv主页 : https://arxiv.org/abs/2512.11558
  • 论文链接 : https://arxiv.org/pdf/2512.11558

英文摘要

Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.

中文摘要

牙科多模态数据的可靠解释对于自动化口腔保健至关重要,但当前的多模态大语言模型(MLLM)难以捕获细粒度的牙科视觉细节,并且缺乏足够的推理能力来进行精确诊断。为了解决这些限制,我们推出了 DentalGPT,这是一种通过高质量领域知识注入和强化学习开发的专业牙科 MLLM。具体来说,迄今为止最大的带标注牙科多模态数据集是通过聚合超过 12 万张牙科图像以及突出诊断相关视觉特征的详细描述而构建的,使其成为迄今为止拥有最广泛牙科图像集合的多模态数据集。在该数据集上的训练显著增强了 MLLM 对牙齿状况的视觉理解,而随后的强化学习阶段进一步增强了其多模态复杂推理的能力。对口内和全景基准以及医学 VQA 基准的牙科子集的综合评估表明,DentalGPT 在疾病分类和牙科 VQA 任务中实现了卓越的性能,尽管只有 7B 参数,但其性能优于许多最先进的 MLLM。这些结果表明,高质量的牙科数据与分阶段的适应相结合,为构建有能力的、领域专业化的牙科 MLLM 提供了有效的途径。


KlingAvatar 2.0技术报告

  • 标题: KlingAvatar 2.0 Technical Report
  • 作者: Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, Haoxian Zhang, Yuanxing Zhang, Yan Zhou
  • 日期: 2025-12-15
  • ArXiv主页 : https://arxiv.org/abs/2512.13313
  • 论文链接 : https://arxiv.org/pdf/2512.13313
  • 项目链接 : https://app.klingai.com/global/ai-human/image/new/

英文摘要

Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.

中文摘要

数字人(avatar)视频生成模型近年来取得了显著进步。然而,此前的工作在生成长时长高分辨率视频方面效率有限,随着视频长度增加会出现时间漂移、画质退化和提示遵循能力下降等问题。为了应对这些挑战,我们提出了 KlingAvatar 2.0,一个在空间分辨率和时间维度上同时进行上采样扩展的时空级联框架。该框架首先生成捕获全局语义和运动的低分辨率蓝图视频关键帧,然后采用首尾帧策略将其细化为高分辨率、时间连贯的子片段,同时保持长视频中的平滑时间过渡。为了增强长视频中的跨模态指令融合与对齐,我们引入了由三个面向不同模态的大语言模型(LLM)专家组成的协同推理导演(Co-Reasoning Director)。这些专家推理各模态的优先级并推断潜在的用户意图,通过多轮对话将输入转化为详细的故事情节。负向导演(Negative Director)进一步细化负向提示以改善指令对齐。在这些组件的基础上,我们扩展了框架以支持绑定特定身份(ID)的多角色控制。大量实验表明,我们的模型有效应对了高效、多模态对齐的长时高分辨率视频生成挑战,带来更高的视觉清晰度、逼真且唇形同步准确的唇齿渲染、强大的身份保持能力,以及连贯的多模态指令遵循。


Scone:通过统一理解生成模型桥接主体驱动图像生成中的合成和区分

英文摘要

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

中文摘要

主体驱动的图像生成已经从单主体合成发展到多主体合成,却忽视了区分(distinction)能力,即当输入包含多个候选主体时识别并生成正确主体的能力。这一局限削弱了其在复杂、真实视觉场景中的有效性。我们提出了 Scone,一种融合合成与区分的统一理解-生成方法。Scone 让理解专家充当语义桥梁,传递语义信息并引导生成专家保持主体身份、尽量减少干扰。两阶段训练方案首先学习合成,然后通过语义对齐和基于注意力的掩码来增强区分能力。我们还推出了 SconeEval,这是一个在多样场景下同时评估合成与区分的基准。实验表明,Scone 在两个基准的合成与区分任务上均优于现有开源模型。我们的模型、基准和训练数据见:https://github.com/Ryann-Ran/Scone。


无错线性注意力是免费午餐:连续时间动力学的精确解

英文摘要

Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution effectively corresponding to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.

中文摘要

线性时间注意力和状态空间模型(SSM)有望解决采用 softmax 注意力的长上下文语言模型中的二次方成本瓶颈。我们提出了无误差线性注意力(EFLA),这是一种数值稳定、完全可并行、更具一般性的 delta 规则表述。具体来说,我们将在线学习更新表述为一个连续时间动力系统,并证明其精确解不仅存在,而且可以在线性时间内完全并行地计算。通过利用动力学矩阵的秩一(rank-1)结构,我们直接推导出精确的闭式解,其效果相当于无限阶的龙格-库塔方法。这种注意力机制在理论上不存在误差累积,能完美刻画连续动力学,同时保持线性时间复杂度。通过一系列广泛的实验,我们表明 EFLA 在含噪环境中表现稳健,在不引入额外参数的情况下,取得了比 DeltaNet 更低的语言建模困惑度和更优的下游基准性能。我们的工作为构建高保真、可扩展的线性时间注意力模型提供了新的理论基础。
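
下面给出一个基于摘要的推演草图,说明"秩一动力学矩阵的精确闭式解"大致是什么形态。这里假设在线更新取标准 delta 规则的连续时间形式;记号与具体公式均为本条目为说明目的所作的假设,并非论文原文。

```latex
% 假设:S 为快速权重(状态)矩阵,k、v 为键/值向量,\beta 为写入强度。
% 连续时间 delta 规则(仅为示意):
\frac{dS}{dt} = \beta\,(v - S k)\,k^{\top} = -\beta\, S\, k k^{\top} + \beta\, v k^{\top}
% 由于 (k k^{\top})^n = \|k\|^{2(n-1)}\, k k^{\top},矩阵指数有闭式:
e^{-\beta t\, k k^{\top}} = I - \frac{1 - e^{-\beta t \|k\|^{2}}}{\|k\|^{2}}\, k k^{\top}
% 因此该线性 ODE 的精确解(不含数值积分误差)为:
S(t) = S_0\left(I - \frac{1 - e^{-\beta t \|k\|^{2}}}{\|k\|^{2}}\, k k^{\top}\right)
       + \frac{1 - e^{-\beta t \|k\|^{2}}}{\|k\|^{2}}\, v k^{\top}
% 当 \beta t 很小时,上式一阶展开即退化为离散 delta 规则 S \leftarrow S + \beta (v - S k) k^{\top}。
```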


使用雅可比强迫进行快速准确的因果并行解码

英文摘要

Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Model, achieves 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.

中文摘要

多词元生成已成为加速基于 Transformer 的大模型推理的一个有前景的范式。近期工作主要探索用于并行解码的扩散大语言模型(dLLM)以降低推理延迟。为了达到自回归(AR)级别的生成质量,许多技术将 AR 模型改造为 dLLM 以实现并行解码。然而,由于预训练与后训练之间的不匹配,它们相对 AR 模型的加速有限。具体而言,后训练中的掩码数据分布与预训练所见的真实数据分布偏差显著,而且 dLLM 依赖双向注意力,这与预训练学到的因果先验相冲突,并阻碍了精确 KV 缓存复用的整合。为了解决这一问题,我们提出了雅可比强迫(Jacobi Forcing),一种渐进式蒸馏范式:模型在自身生成的并行解码轨迹上训练,从而把 AR 模型平滑地转变为高效的并行解码器,同时保留其预训练获得的因果推理特性。在该范式下训练得到的模型(Jacobi Forcing Model)在编码和数学基准上实现了 3.8 倍的实际运行时间加速,而性能损失极小。基于 Jacobi Forcing 模型的轨迹特性,我们引入了带拒绝回收的多块解码,使每次迭代接受的词元数最多提高 4.5 倍、实际加速接近 4.0 倍,以额外计算有效换取更低的推理延迟。我们的代码可在 https://github.com/hao-ai-lab/JacobiForcing 获取。
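
作为背景,下面给出标准雅可比定点迭代解码的一个极简草图:对一段待解码块做并行贪心刷新,直至达到与自回归贪心解码一致的定点。greedy_next_fn 为假设接口;论文的贡献(在自生成轨迹上做渐进蒸馏、拒绝回收等)并不包含在这段示意代码中。

```python
# 标准雅可比定点迭代解码的示意(greedy_next_fn 为假设接口,仅演示解码流程)
def jacobi_decode_block(prefix, block_len, greedy_next_fn, pad_token=0, max_iters=None):
    """greedy_next_fn(tokens): 对整段序列并行前向,返回每个位置之后的贪心 token"""
    block = [pad_token] * block_len                    # 以填充 token 初始化待解码块
    for _ in range(max_iters or block_len):            # 至多 block_len 次迭代必收敛到 AR 结果
        full = prefix + block
        greedy = greedy_next_fn(full)
        new_block = [greedy[len(prefix) + i - 1] for i in range(block_len)]
        if new_block == block:                         # 定点:与自回归贪心解码结果一致
            break
        block = new_block
    return block

# 玩具示例:以 (t + 1) % 10 作为贪心下一个 token 的"模型"
greedy_next = lambda seq: [(t + 1) % 10 for t in seq]
print(jacobi_decode_block(prefix=[7], block_len=5, greedy_next_fn=greedy_next))
```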


HyperVL:适用于边缘设备的高效动态多模态大型语言模型

  • 标题: HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
  • 作者: HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang
  • 日期: 2025-12-16
  • ArXiv主页 : https://arxiv.org/abs/2512.14052
  • 论文链接 : https://arxiv.org/pdf/2512.14052

英文摘要

Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.

中文摘要

当前的多模态大语言模型具备强大的感知与推理能力,但高昂的计算和内存需求使其难以直接部署到端侧环境。虽然小参数模型正逐步获得较强的通用能力,但标准视觉 Transformer(ViT)编码器仍是关键瓶颈,在处理高分辨率输入时存在过高的延迟和内存消耗。为了解决这些挑战,我们推出了 HyperVL,一个专为端侧推理定制的高效多模态大语言模型。HyperVL 采用图像切块策略来控制峰值内存占用,并引入两项新技术:(1)视觉分辨率压缩器(VRC),可自适应预测最佳编码分辨率以消除冗余计算;(2)双一致性学习(DCL),在统一框架内对齐多尺度 ViT 编码器,使其能够在共享 LLM 之下于不同视觉分支间动态切换。大量实验表明,HyperVL 在多个基准上取得了同规模模型中的最先进性能。此外,它在真实移动设备上显著降低了延迟和功耗,展示了其用于端侧多模态推理的实用性。


OpenDataArena:公平开放的后训练数据集价值基准测试平台

英文摘要

The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA--covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points--reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.

中文摘要

大型语言模型(LLM)的快速发展取决于后训练数据集的质量和多样性。然而,一个关键的割裂依然存在:模型被严格地基准测试,而支撑模型的数据仍是一个黑箱,其特点是组成不透明、来源不明确、缺乏系统评估。这种不透明阻碍了可复现性,也模糊了数据特征与模型行为之间的因果联系。为弥合这一差距,我们推出了 OpenDataArena(ODA),一个旨在对后训练数据内在价值进行基准测试的整体性开放平台。ODA 构建了由四大支柱组成的完整生态:(i)统一的训练-评估管道,确保在不同模型(如 Llama、Qwen)和领域之间进行公平、公开的比较;(ii)多维评分框架,沿数十个不同维度刻画数据质量;(iii)交互式数据谱系浏览器,用于可视化数据集谱系并剖析其组成来源;(iv)完全开源的训练、评估和评分工具包,以促进数据研究。基于 ODA 的大量实验(涵盖多个领域的 120 多个训练数据集、22 个基准,由 600 多次训练运行和 4,000 万个处理数据点验证)带来了许多重要洞见。我们的分析揭示了数据复杂度与任务性能之间的固有权衡,通过谱系追踪识别出流行基准中的冗余,并描绘了数据集之间的谱系关系。我们公开全部结果、工具和配置,让高质量的数据评估触手可及。ODA 的愿景并非仅仅扩充一个排行榜,而是推动从试错式的数据筛选走向以数据为中心的 AI 这一原则性科学,为严格研究数据混合规律和基础模型的数据构成策略铺平道路。


通用推理模型

英文摘要

Universal transformers (UTs) have been widely used for complex reasoning tasks such as ARC-AGI and Sudoku, yet the specific sources of their performance gains remain underexplored. In this work, we systematically analyze UT variants and show that improvements on ARC-AGI primarily arise from the recurrent inductive bias and strong nonlinear components of the Transformer, rather than from elaborate architectural designs. Motivated by this finding, we propose the Universal Reasoning Model (URM), which enhances the UT with short convolution and truncated backpropagation. Our approach substantially improves reasoning performance, achieving state-of-the-art 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. Our code is available at https://github.com/zitian-gao/URM.

中文摘要

通用 Transformer(UT)已被广泛用于 ARC-AGI 和数独等复杂推理任务,但其性能提升的具体来源仍未得到充分探索。在这项工作中,我们系统地分析了 UT 的变体,并表明 ARC-AGI 上的改进主要源于 Transformer 的循环归纳偏置和强非线性组件,而非精心的架构设计。受这一发现的启发,我们提出了通用推理模型(URM),它通过短卷积和截断反向传播来增强 UT。我们的方法大大提高了推理性能,在 ARC-AGI 1 上实现了最先进的 53.8% pass@1,在 ARC-AGI 2 上实现了 16.0% pass@1。我们的代码可在 https://github.com/zitian-gao/URM 上获取。
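
下面用 PyTorch 给出一个示意性草图,说明"共享权重的循环 Transformer 块 + 短卷积 + 截断反向传播"这一组合的常见写法;其中 `ShortConvBlock`、递归步数与截断窗口等均为笔者假设,仅帮助理解思路,并非 URM 的官方实现。

```python
# 示意性草图(非 URM 官方实现):共享权重的循环 Transformer 块(UT 式递归),
# 配合沿序列维的"短卷积"混合层,并用 detach() 实现截断反向传播。
import torch
import torch.nn as nn

class ShortConvBlock(nn.Module):
    """核宽为 k 的因果深度卷积,作为"短卷积"的一种常见实现。"""
    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, k, padding=k - 1, groups=dim)

    def forward(self, x):                                   # x: (B, T, D)
        y = self.conv(x.transpose(1, 2))[..., : x.size(1)]  # 截掉因果填充多出的部分
        return x + y.transpose(1, 2)

class RecurrentReasoner(nn.Module):
    def __init__(self, dim: int = 128, n_heads: int = 4):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.mixer = ShortConvBlock(dim)

    def forward(self, x, steps: int = 8, bptt_window: int = 2):
        for t in range(steps):                # 同一个块被反复应用:循环归纳偏置
            if t % bptt_window == 0:
                x = x.detach()                # 截断反向传播:梯度只在最近的窗口内回传
            x = self.shared_block(self.mixer(x))
        return x

x = torch.randn(2, 16, 128)
print(RecurrentReasoner()(x).shape)           # torch.Size([2, 16, 128])
```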


Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

  • 标题: Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
  • 作者: Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo
  • 日期: 2025-12-15
  • ArXiv主页 : https://arxiv.org/abs/2512.13507
  • 论文链接 : https://arxiv.org/pdf/2512.13507
  • 项目链接 : https://seed.bytedance.com/seedance1_5_pro

英文摘要

Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.

中文摘要

视频生成领域的最新进展为统一的视听生成铺平了道路。在这项工作中,我们提出了 Seedance 1.5 pro,这是一个专门为原生音视频联合生成而设计的基础模型。该模型利用双分支 Diffusion Transformer 架构,将跨模态联合模块与专门的多阶段数据管道相结合,实现了出色的视听同步和更优的生成质量。为了确保实用性,我们实施了细致的训练后优化,包括在高质量数据集上的监督微调(SFT)以及使用多维奖励模型的人类反馈强化学习(RLHF)。此外,我们还引入了一个加速框架,可将推理速度提高 10 倍以上。Seedance 1.5 pro 凭借精确的多语言和方言口型同步、动态的电影级运镜控制以及增强的叙事连贯性脱颖而出,使其成为专业级内容创作的强大引擎。Seedance 1.5 pro 现已可在 Volcano Engine 上访问:https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo。
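
摘要中的"双分支 + 跨模态联合模块"可以用双向 cross-attention 来示意:视频分支与音频分支各自做自注意力,再互相做一次交叉注意力以交换信息。下面的 `JointAVBlock` 是笔者的假设性草图,并非 Seedance 1.5 pro 的官方结构。

```python
# 示意性草图(非 Seedance 1.5 pro 官方结构):双分支各自做自注意力,
# 再通过双向 cross-attention 交换信息,近似摘要中的"跨模态联合模块"。
import torch
import torch.nn as nn

class JointAVBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # 视频查询音频
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # 音频查询视频

    def forward(self, v, a):                    # v: (B, Tv, D), a: (B, Ta, D)
        v = v + self.video_attn(v, v, v)[0]     # 视频分支自注意力
        a = a + self.audio_attn(a, a, a)[0]     # 音频分支自注意力
        v = v + self.a2v(v, a, a)[0]            # 视频吸收音频信息
        a = a + self.v2a(a, v, v)[0]            # 音频吸收视频信息,利于音画同步
        return v, a

v, a = torch.randn(1, 64, 256), torch.randn(1, 100, 256)
vo, ao = JointAVBlock()(v, a)
print(vo.shape, ao.shape)
```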


StereoPilot:通过生成先验学习统一且高效的立体转换

英文摘要

The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.

中文摘要

VR 头显和 3D 影院等立体显示设备的快速增长,使得对高质量立体视频内容的需求不断增加。然而,制作 3D 视频仍然成本高昂且复杂,而自动单目到立体的转换则受到多阶段"深度-变形-修复"(Depth-Warp-Inpaint, DWI) 管道的限制。这种范式存在误差传播、深度歧义以及平行式和汇聚式立体配置之间格式不一致的问题。为了应对这些挑战,我们引入了 UniStereo,这是第一个用于立体视频转换的大规模统一数据集,涵盖两种立体格式,以实现公平的基准测试和稳健的模型训练。在此数据集的基础上,我们提出了 StereoPilot,这是一种高效的前馈模型,可以直接合成目标视图,而不依赖显式深度图或迭代扩散采样。StereoPilot 配备了可学习的域切换器和循环一致性损失,可无缝适应不同的立体格式并获得更好的一致性。大量实验表明,StereoPilot 在视觉保真度和计算效率方面均显著优于最先进的方法。项目页面:https://hit-perfect.github.io/StereoPilot/。
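
摘要提到的"可学习域切换器 + 循环一致性损失"可以用如下玩具代码示意:用一个可学习的嵌入表示立体格式(平行式/汇聚式),并通过"左视图生成右视图、再重建回左视图"的误差构造循环一致性损失。网络结构与损失形式均为笔者假设,并非论文的官方实现。

```python
# 玩具级草图(非论文官方实现):用可学习嵌入表示立体格式(0=平行式,1=汇聚式),
# 并用"左 -> 右 -> 左"的重建误差构造循环一致性损失。
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyStereoNet(nn.Module):
    def __init__(self, ch: int = 16):
        super().__init__()
        self.domain_embed = nn.Embedding(2, ch)     # 可学习的"域切换器"
        self.net = nn.Sequential(
            nn.Conv2d(3 + ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, img, domain: int):
        b, _, h, w = img.shape
        d = self.domain_embed(torch.tensor([domain]))
        d = d.view(1, -1, 1, 1).expand(b, -1, h, w)  # 把域嵌入拼接到每个像素位置
        return self.net(torch.cat([img, d], dim=1))

model = ToyStereoNet()
left = torch.rand(2, 3, 64, 64)
right_pred = model(left, domain=0)            # 左视图 -> 生成右视图
left_cycle = model(right_pred, domain=0)      # 再重建回左视图(示意,未区分方向)
cycle_loss = F.l1_loss(left_cycle, left)      # 循环一致性损失
print(cycle_loss.item())
```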


RoboTracer:通过机器人视觉语言模型中的推理掌握空间轨迹

英文摘要

Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.

中文摘要

空间追踪作为机器人的一项基本具身交互能力,本质上具有挑战性,因为它需要多步骤的、以度量为基础的推理,并叠加复杂的空间指代和真实世界的度量测量。然而,现有方法很难完成这种组合式任务。为此,我们提出了 RoboTracer,这是一种具备 3D 感知能力的 VLM,它首先通过通用空间编码器和回归监督解码器同时实现 3D 空间指代和测量,以在监督微调(SFT)期间增强尺度感知。此外,RoboTracer 通过带有度量敏感过程奖励的强化微调(RFT)来推进多步骤的度量推理,监督关键的中间感知线索以准确生成空间轨迹。为了支持 SFT 和 RFT 训练,我们引入了 TraceSpatial,这是一个包含 3000 万个 QA 对的大规模数据集,覆盖室外/室内/桌面场景并支持复杂的推理过程(最多 9 步)。我们进一步提出了 TraceSpatial-Bench,这是一个具有挑战性的基准,填补了空间追踪评估的空白。实验结果表明,RoboTracer 在空间理解、测量和指代方面超越了基线方法,平均成功率为 79.1%,并在 TraceSpatial-Bench 上以较大优势取得 SOTA 性能,准确率超过 Gemini-2.5-Pro 36%。值得注意的是,RoboTracer 可以与各种控制策略集成,在杂乱的真实场景中跨不同机器人(UR5、G1 人形机器人)执行长时程的动态任务。
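
摘要中的"度量敏感的过程奖励"可以用一个极简的打分函数来示意:对推理链中每一步给出的中间度量(如距离估计)按相对误差评分,再与最终结果奖励加权。下面的奖励形式与权重均为笔者假设,并非 RoboTracer 的官方设计。

```python
# 示意性草图:度量敏感的过程奖励;奖励形式与权重均为笔者假设,非论文官方设计。

def metric_step_reward(pred: float, gt: float, tol: float = 0.10) -> float:
    """中间步骤的度量奖励:相对误差为 0 时得 1 分,线性衰减,误差达到 tol 时降为 0。"""
    rel_err = abs(pred - gt) / max(abs(gt), 1e-6)
    return max(0.0, 1.0 - rel_err / tol)

def trajectory_reward(step_preds, step_gts, final_correct: bool,
                      w_process: float = 0.5, w_final: float = 0.5) -> float:
    """把每一步中间度量的奖励取平均,再与最终结果奖励加权求和。"""
    process = sum(metric_step_reward(p, g) for p, g in zip(step_preds, step_gts))
    process /= max(len(step_gts), 1)
    return w_process * process + w_final * float(final_correct)

# 例:推理链中估计了两个中间距离(单位:米),最终轨迹判定为成功
print(trajectory_reward([0.52, 1.90], [0.50, 2.00], final_correct=True))  # ≈ 0.775
```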

