中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- 从代码基础模型到代理和应用程序:代码智能实用指南
- DeepSeek-V3.2:推动开放大型语言模型的前沿
- Z-Image:具有单流扩散 Transformer 的高效图像生成基础模型
- 实时头像:流式传输实时音频驱动的无限长度头像生成
- LongVT:通过原生工具调用激励"用长视频思考"
- DAComp:整个数据智能生命周期中数据代理的基准测试
- Qwen3-VL技术报告
- ToolOrchestra:通过高效的模型和工具编排提升智能
- 使用大型语言模型稳定强化学习:表述与实践
- Envision:因果世界过程洞察的统一理解与生成基准测试
- DeepSeekMath-V2:迈向可自我验证的数学推理
- Nex-N1:通过统一生态系统训练的代理模型,用于大规模环境构建
- TUNA:为原生统一多模态模型驯服统一视觉表示
- 深度研究:系统性综述
- MultiShotMaster:可控多镜头视频生成框架
- PaperDebugger:基于插件的多代理系统,用于编辑器内学术写作、评论和编辑
- 我们距离真正有用的深度研究代理还有多远?
- 以最少的人类监督引导自我进化的大型语言模型
- 视频生成中的重力如何?利用可验证奖励对牛顿定律进行后训练
- MG-Nav:通过稀疏空间内存的双尺度视觉导航
- Skywork-R1V4:通过图像与深度研究的交错思考实现代理式多模态智能
- PretrainZero:强化主动预训练
- REASONEDIT:走向推理增强图像编辑模型
- ARM-Thinker:通过代理工具使用和视觉推理强化多模式生成奖励模型
- Infinity-RoPE:通过自回归自展开生成动作可控的无限视频
- 大规模 Vision Bridge Transformer
- DualCamCtrl:用于几何感知相机控制视频生成的双分支扩散模型
从代码基础模型到代理和应用程序:代码智能实用指南
- 标题: From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
- 作者: Jian Yang, Xianglong Liu, Weifeng Lv, Ken Deng, Shawn Guo, Lin Jing, Yizhi Li, Shark Liu, Xianzhen Luo, Yuyu Luo, Changzai Pan, Ensheng Shi, Yingshui Tan, Renshuai Tao, Jiajun Wu, Xianjie Wu, Zhenhe Wu, Daoguang Zan, Chenchen Zhang, Wei Zhang, He Zhu, Terry Yue Zhuo, Kerui Cao, Xianfu Cheng, Jun Dong, Shengjie Fang, Zhiwei Fei, Xiangyuan Guan, Qipeng Guo, Zhiguang Han, Joseph James, Tianqi Luo, Renyuan Li, Yuhang Li, Yiming Liang, Congnan Liu, Jiaheng Liu, Qian Liu, Ruitong Liu, Tyler Loakman, Xiangxin Meng, Chuang Peng, Tianhao Peng, Jiajun Shi, Mingjie Tang, Boyang Wang, Haowen Wang, Yunli Wang, Fanglin Xu, Zihan Xu, Fei Yuan, Ge Zhang, Jiayi Zhang, Xinhao Zhang, Wangchunshu Zhou, Hualei Zhu, King Zhu, Brown Dai, Aishan Liu, Zhoujun Li, Chenghua Lin, Tianyu Liu, Chao Peng, Kai Shen, Libo Qin, Shuangyong Song, Zizheng Zhan, Jiajun Zhang, Jie Zhang, Zhaoxiang Zhang, Bo Zheng
- 日期: 2025-11-23
- ArXiv主页 : https://arxiv.org/abs/2511.18538
- 论文链接 : https://arxiv.org/pdf/2511.18538
英文摘要
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like Github Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). While the field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
中文摘要
大型语言模型 (LLM) 通过将自然语言描述直接翻译为可运行的功能代码,从根本上改变了自动化软件开发,并通过 Github Copilot (Microsoft)、Cursor (Anysphere)、Trae (ByteDance) 和 Claude Code (Anthropic) 等工具推动商业采用。该领域已经从基于规则的系统显著演进到基于 Transformer 的架构,在 HumanEval 等基准测试上的成功率从个位数提高到超过 95%。在这项工作中,我们提供了关于代码 LLM 的全面综述和实践指南(包含一系列分析与探测实验),系统地审视从数据整理到后训练的完整模型生命周期,涵盖高级提示范式、代码预训练、监督微调、强化学习和自主编码代理。我们分析了通用 LLM(GPT-4、Claude、LLaMA)和代码专用 LLM(StarCoder、Code LLaMA、DeepSeek-Coder 和 QwenCoder)的代码能力,批判性地审视其中的技术、设计决策和权衡。此外,我们阐明了学术研究(例如基准和任务)和现实世界部署(例如与软件相关的代码任务)之间的研究与实践差距,包括代码正确性、安全性、对大型代码库的上下文感知以及与开发工作流程的集成,并将有前景的研究方向映射到实际需求。最后,我们进行了一系列实验,对代码预训练、监督微调和强化学习进行了全面分析,涵盖缩放定律、框架选择、超参数敏感性、模型架构和数据集比较。
DeepSeek-V3.2:推动开放大型语言模型的前沿
- 标题: DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
- 作者: DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M. S. Di, M. Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S. H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Songyang Zhou, Tao Ni, Tao Yun, Tian Pei, Tian Ye, Tianyuan Yue, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjun Gao, Wentao Zhang, Xi Gao, Xiangwen Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyuan Li, Xu Chen, Xuecheng Su, Xuehai Pan, Xuheng Lin, Xuwei Fu, Y. Q. Wang, Yang Zhang, Yanhong Xu, Yanru Ma, Yao Li, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yiliang Xiong, Ying He, Ying Zhou, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuduan Wang, Yue Gong, Yuhan Wu, Yuheng Zou, Yukun Li, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehua Zhao, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhiyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Zizheng Pan, Zongqing Yao, Bei Feng, Hui Li, J. L. Cai, Jiaqi Ni, Lei Xu, Meng Li, Ning Tian, R. J. Chen, R. L. Jin, S. S. Li, Shuang Zhou, Tianyu Sun, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xinnan Song, Xinyi Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Dongjie Ji, Jian Liang, Jianzhong Guo, Jin Chen, Leyi Xia, Miaojun Wang, Mingming Li, Peng Zhang, Ruyi Chen, Shangmian Sun, Shaoqing Wu, Shengfeng Ye, T. Wang, W. L. Xiao, Wei An, Xianzu Wang, Xiaowen Sun, Xiaoxiang Wang, Ying Tang, Yukun Zha, Zekai Zhang, Zhe Ju, Zhen Zhang, Zihua Qu
- 日期: 2025-12-02
- ArXiv主页 : https://arxiv.org/abs/2512.02556
- 论文链接 : https://arxiv.org/pdf/2512.02556
英文摘要
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
中文摘要
我们推出 DeepSeek-V3.2,该模型将高计算效率与卓越的推理和代理性能相结合。DeepSeek-V3.2的关键技术突破如下:(1)DeepSeek稀疏注意力(DSA):我们引入了DSA,一种高效的注意力机制,可以大大降低计算复杂度,同时在长上下文场景中保持模型性能。(2) 可扩展的强化学习框架:通过实施强大的强化学习协议和扩展训练后计算,DeepSeek-V3.2 的性能与 GPT-5 相当。值得注意的是,我们的高计算变体 DeepSeek-V3.2-Speciale 超越了 GPT-5,并表现出与 Gemini-3.0-Pro 相当的推理能力,在 2025 年国际数学奥林匹克(IMO)和国际信息学奥林匹克(IOI)中均获得金牌。(3) 大规模代理任务合成管道:为了将推理集成到工具使用场景中,我们开发了一种新颖的合成管道,可以系统地大规模生成训练数据。这种方法有利于可扩展的代理后训练,在复杂的交互式环境中显着提高泛化性和指令遵循的鲁棒性。
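下面给出一个极简的 top-k 稀疏注意力代码草图,用来直观说明"让每个查询只关注少量最相关键值对"这一稀疏注意力思路;它并非 DeepSeek Sparse Attention (DSA) 的官方实现,为了代码简洁仍然先计算了完整的注意力分数矩阵(真实的 DSA 需要避免这一步才能真正降低长上下文复杂度),函数名与超参数均为示意性假设。
```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=64):
    """示意性稀疏注意力:每个查询只对得分最高的 topk 个键值做注意力。
    仅演示"稀疏化选择"的思路,并非 DSA 官方实现。"""
    # q, k, v: [batch, heads, seq_len, head_dim]
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # [B, H, L, L]
    topk = min(topk, scores.shape[-1])
    kth = scores.topk(topk, dim=-1).values[..., -1:]          # 每行第 topk 大的分数作阈值
    attn = F.softmax(scores.masked_fill(scores < kth, float("-inf")), dim=-1)
    return attn @ v

# 用随机张量检查形状
q = torch.randn(1, 2, 128, 32)
k = torch.randn(1, 2, 128, 32)
v = torch.randn(1, 2, 128, 32)
print(topk_sparse_attention(q, k, v, topk=16).shape)  # torch.Size([1, 2, 128, 32])
```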
Z-Image:具有单流扩散 Transformer 的高效图像生成基础模型
- 标题: Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
- 作者: Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
- 日期: 2025-11-27
- ArXiv主页 : https://arxiv.org/abs/2511.22699
- 论文链接 : https://arxiv.org/pdf/2511.22699
- 项目链接 : https://tongyi-mai.github.io/Z-Image-blog/
- gitHub仓库 : https://github.com/Tongyi-MAI/Z-Image
英文摘要
The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
中文摘要
高性能图像生成模型目前由 Nano Banana Pro 和 Seedream 4.0 等专有系统主导。领先的开源替代方案,包括 Qwen-Image、Hunyuan-Image-3.0 和 FLUX.2,其特点是参数数量庞大(20B 到 80B),这使得它们在消费级硬件上进行推理和微调都不切实际。为了弥补这一差距,我们提出了 Z-Image,这是一种基于可扩展单流扩散 Transformer (S3-DiT) 架构的高效 6B 参数基础生成模型,挑战了"不惜一切代价扩展"范式。通过系统地优化从精心构建的数据基础设施到精简训练课程的整个模型生命周期,我们仅用 31.4 万 H800 GPU 小时(约 63 万美元)就完成了完整的训练流程。我们带有奖励后训练的几步蒸馏方案进一步产生了 Z-Image-Turbo,既能在企业级 H800 GPU 上提供亚秒级推理延迟,又兼容消费级硬件(<16GB 显存)。此外,我们的全方位预训练范式还支持高效训练 Z-Image-Edit,这是一个具有出色指令跟随能力的编辑模型。定性和定量实验均表明,我们的模型在各个维度上都达到或超过了领先竞争对手的性能。最值得注意的是,Z-Image 在照片级真实感图像生成和双语文本渲染方面表现出卓越的能力,其结果可与顶级商业模型相媲美,从而证明在显著降低计算开销的情况下也能取得最先进的结果。我们公开发布代码、权重和在线演示,以促进可访问、预算友好且最先进的生成模型的发展。
实时头像:流式传输实时音频驱动的无限长度头像生成
- 标题: Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
- 作者: Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi
- 日期: 2025-12-04
- ArXiv主页 : https://arxiv.org/abs/2512.04677
- 论文链接 : https://arxiv.org/pdf/2512.04677
- 项目链接 : https://liveavatar.github.io/
- gitHub仓库 : https://github.com/Alibaba-Quark/LiveAvatar
英文摘要
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
中文摘要
现有的基于扩散的视频生成方法从根本上受到顺序计算和长时程不一致的限制,阻碍了它们在实时、流式音频驱动的头像合成中的实际应用。我们推出了 Live Avatar,这是一个算法与系统协同设计的框架,可使用 140 亿参数的扩散模型实现高效、高保真且无限长度的头像生成。我们的方法引入了时间步强制流水线并行(Timestep-forcing Pipeline Parallelism, TPP),这是一种分布式推理范式,将去噪步骤以流水线方式分布到多块 GPU 上,有效打破自回归瓶颈,确保稳定、低延迟的实时流式生成。为了进一步增强时间一致性并减轻身份漂移和颜色伪影,我们提出了滚动 Sink 帧机制(Rolling Sink Frame Mechanism, RSFM),它通过使用缓存的参考图像动态重新校准外观来保持序列保真度。此外,我们利用 Self-Forcing 分布匹配蒸馏,在不牺牲视觉质量的前提下使大规模模型具备因果、可流式的适配能力。Live Avatar 展示了最先进的性能,在 5 块 H800 GPU 上达到 20 FPS 的端到端生成速度,据我们所知,这是首个在此规模下实现实用、实时、高保真头像生成的工作。我们的工作为在工业级长视频合成应用中部署先进扩散模型建立了新范式。
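下面是对"按去噪时间步切分流水线"这一调度思想的玩具级 Python 模拟草图(假设阶段数等于去噪步数、每个阶段对应一块 GPU);它只演示吞吐上的流水线效应,不涉及真实的多卡通信与扩散模型,函数与变量名均为示意性假设,并非 Live Avatar 的官方 TPP 实现。
```python
from collections import deque

def pipeline_denoise(chunks, num_stages, denoise_step):
    """示意性的时间步流水线调度:第 i 个阶段固定负责第 i 个去噪时间步,
    视频块依次流过各阶段,把串行多步去噪变成流水线吞吐(玩具模拟)。"""
    stages = [deque() for _ in range(num_stages)]
    outputs, ticks = [], 0
    pending = deque(chunks)
    while pending or any(stages):
        if stages[-1]:                              # 末级阶段已完成全部时间步,产出成品块
            outputs.append(stages[-1].popleft())
        for i in range(num_stages - 2, -1, -1):     # 中间阶段把块推进到下一阶段
            if stages[i]:
                x = stages[i].popleft()
                stages[i + 1].append(denoise_step(x, step=i + 1))
        if pending:                                 # 新块进入第 0 个阶段
            stages[0].append(denoise_step(pending.popleft(), step=0))
        ticks += 1
    return outputs, ticks  # ticks ≈ len(chunks) + num_stages,而非 len(chunks) * num_stages

# 用法示例:4 个去噪步、8 个视频块,denoise_step 这里仅做占位
outs, ticks = pipeline_denoise(list(range(8)), 4, lambda x, step: x)
print(len(outs), ticks)  # 8 12
```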
LongVT:通过原生工具调用激励"用长视频思考"
- 标题: LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
- 作者: Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
- 日期: 2025-11-25
- ArXiv主页 : https://arxiv.org/abs/2511.20785
- 论文链接 : https://arxiv.org/pdf/2511.20785
- 项目链接 : https://evolvinglmms-lab.github.io/LongVT/
- gitHub仓库 : https://github.com/EvolvingLMMs-Lab/LongVT
英文摘要
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
中文摘要
大型多模态模型 (LMM) 在使用文本思维链进行视频推理方面表现出了巨大的潜力。然而,它们仍然容易产生幻觉,尤其是在处理证据稀疏且时间上分散的长视频时。受人类理解长视频方式的启发(先全局浏览,再仔细查看相关片段的细节),我们引入了 LongVT,这是一个端到端的代理框架,通过交错的多模态工具思维链实现"用长视频思考"。具体来说,我们利用 LMM 固有的时间定位能力作为原生视频裁剪工具,放大特定视频片段并以更细的粒度重新采样视频帧。这种从全局到局部的推理循环会一直持续,直到答案建立在检索到的视觉证据之上。鉴于长视频推理任务缺乏细粒度的问答(QA)数据,我们整理并将发布一个名为 VideoSIAH 的数据套件,以支持训练和评估。具体而言,我们的训练数据集分别包含 247.9K 个用于工具集成冷启动监督微调的样本、1.6K 个用于代理强化学习的样本和 15.4K 个用于代理强化微调的样本。我们的评估基准由 1,280 个问答对组成,这些问答对通过带人在回路验证的半自动数据管道精心构建。凭借精心设计的三阶段训练策略和广泛的实证验证,LongVT 在四个具有挑战性的长视频理解与推理基准上始终优于现有的强大基线。我们的代码、数据和模型检查点可在 https://github.com/EvolvingLMMs-Lab/LongVT 公开获取。
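下面用一个极简的 Python 草图示意摘要中"全局粗看、局部放大、直到答案有视觉证据支撑"的推理循环;其中 crop_and_resample、lmm_step 等接口均为假设的占位实现,仅用于说明交错工具调用的控制流,并非 LongVT 的官方代码。
```python
def crop_and_resample(video, start, end, fps):
    """假想的"原生视频裁剪工具":截取 [start, end] 秒并按给定 fps 重新抽帧。"""
    return {"clip": (start, end), "frames": int((end - start) * fps)}

def think_with_long_video(video, question, lmm_step, max_rounds=4):
    """"用长视频思考"循环的极简示意(非官方实现):先全局粗看,
    再按模型给出的时间定位放大局部片段,直到模型给出有证据支撑的答案。"""
    context = [{"role": "frames", "content": crop_and_resample(video, 0, video["duration"], fps=0.5)}]
    for _ in range(max_rounds):
        action = lmm_step(question, context)        # 返回 {"type": "crop"/"answer", ...}
        if action["type"] == "answer":
            return action["text"], context
        clip = crop_and_resample(video, action["start"], action["end"], fps=2.0)
        context.append({"role": "frames", "content": clip})
    return lmm_step(question, context, force_answer=True)["text"], context

# 玩具策略演示:第一轮要求放大 100~130 秒,第二轮直接作答
def toy_lmm(question, context, force_answer=False):
    if force_answer or len(context) > 1:
        return {"type": "answer", "text": "在约 110 秒处出现目标物体(示例输出)"}
    return {"type": "crop", "start": 100, "end": 130}

print(think_with_long_video({"duration": 3600}, "目标何时出现?", toy_lmm)[0])
```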
DAComp:整个数据智能生命周期中数据代理的基准测试
- 标题: DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
- 作者: Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang, Jianwen Luo, Xin Zou, Ruiyi Yang, Wenbo Shi, Yan Gao, Shizhu He, Zuo Wang, Qian Liu, Yang Wang, Ke Wang, Jun Zhao, Kang Liu
- 日期: 2025-12-03
- ArXiv主页 : https://arxiv.org/abs/2512.04324
- 论文链接 : https://arxiv.org/pdf/2512.04324
- 项目链接 : https://da-comp.github.io/
- gitHub仓库 : https://github.com/ByteDance-Seed/DAComp
英文摘要
Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analytical-ready tables and data analysis that convert those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
中文摘要
现实世界的企业数据智能工作流程包括将原始数据源转化为可分析的表格的数据工程,以及将这些表格转化为面向决策的见解的数据分析。我们引入了 DAComp,这是一个包含 210 项任务的基准,反映了这些复杂的工作流程。数据工程 (DE) 任务需要对工业模式进行存储库级工程,包括从头开始设计和构建多阶段 SQL 管道,以及根据不断变化的需求改进现有系统。数据分析 (DA) 任务提出了开放式业务问题,需要战略规划、通过迭代编码进行探索性分析、解释中间结果以及综合可行的建议。工程任务通过基于执行的多指标评估进行评分。开放式任务由可靠的、经过实验验证的 LLM 评审(LLM-judge)进行评估,该评审以分层的、精心设计的评分细则为指导。我们的实验表明,即使是最先进的代理在 DAComp 上也表现不佳。DE 任务的表现尤其差,成功率低于 20%,暴露出关键瓶颈在于整体管道编排,而不仅仅是代码生成。DA 任务的平均得分也低于 40%,凸显了开放式推理的严重缺陷,并表明工程和分析是截然不同的能力。通过清楚地诊断这些限制,DAComp 提供了严格且现实的测试平台,以推动为企业环境开发真正强大的自主数据代理。我们的数据和代码可在 https://da-comp.github.io 获取
Qwen3-VL技术报告
- 标题: Qwen3-VL Technical Report
- 作者: Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu
- 日期: 2025-11-26
- ArXiv主页 : https://arxiv.org/abs/2511.21631
- 论文链接 : https://arxiv.org/pdf/2511.21631
- 项目链接 : https://chat.qwen.ai
- gitHub仓库 : https://github.com/QwenLM/Qwen3-VL
英文摘要
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
中文摘要
我们推出 Qwen3-VL,这是迄今为止 Qwen 系列中功能最强大的视觉语言模型,在广泛的多模态基准测试中实现了卓越的性能。它原生支持多达 256K 个令牌的交错上下文,无缝集成文本、图像和视频。该模型系列包括密集型 (2B/4B/8B/32B) 和专家混合型 (30B-A3B/235B-A22B) 变体,以适应不同的延迟与质量权衡。Qwen3-VL 提供三个核心支柱:(i) 明显更强的纯文本理解,在某些情况下超越了规模相当的纯文本主干;(ii) 具有针对文本和交错多模态输入的原生 256K 令牌窗口的强大长上下文理解能力,从而能够跨长文档和视频进行忠实保留、检索和交叉引用;(iii) 跨单图像、多图像和视频任务的高级多模态推理,在 MMMU 和视觉数学基准(例如 MathVista 和 MathVision)等综合评估中表现出领先的性能。在架构上,我们引入了三个关键升级:(i) 增强的交错 MRoPE,用于跨图像和视频进行更强的时空建模;(ii) DeepStack 集成,有效利用多层 ViT 特征来加强视觉语言对齐;(iii) 基于文本的视频时间对齐,从 T-RoPE 发展到显式文本时间戳对齐,以实现更精确的时间定位。在相当的令牌预算和延迟约束下,Qwen3-VL 在密集架构和专家混合 (MoE) 架构中都实现了卓越的性能。我们设想 Qwen3-VL 作为现实工作流程中基于图像的推理、代理决策和多模态代码智能的基础引擎。
ToolOrchestra:通过高效的模型和工具编排提升智能
- 标题: ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
- 作者: Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov
- 日期: 2025-11-26
- ArXiv主页 : https://arxiv.org/abs/2511.21689
- 论文链接 : https://arxiv.org/pdf/2511.21689
- 项目链接 : https://research.nvidia.com/labs/lpr/ToolOrchestra/
英文摘要
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
中文摘要
大型语言模型是强大的多面手,但解决诸如人类最后的考试 (Humanity's Last Exam, HLE) 之类的深刻而复杂的问题在概念上仍然具有挑战性,而且计算成本高昂。我们证明,管理其他模型和各种工具的小型协调器既可以推动智能的上限,又可以提高解决困难的代理任务的效率。我们介绍了 ToolOrchestra,这是一种训练协调智能工具的小型编排器的方法。ToolOrchestra 明确使用强化学习,并具有结果、效率和用户偏好感知奖励。使用 ToolOrchestra,我们训练得到了 Orchestrator,这是一个 8B 模型,与以前的工具使用代理相比,它以更低的成本实现了更高的准确性,同时与用户对给定查询应使用哪些工具的偏好保持一致。在 HLE 上,Orchestrator 的得分为 37.1%,优于 GPT-5 (35.1%),同时效率提高了 2.5 倍。在 tau2-Bench 和 FRAMES 上,Orchestrator 大幅超越了 GPT-5,而成本仅为 GPT-5 的 30% 左右。广泛的分析表明,Orchestrator 在多个指标下实现了性能和成本之间的最佳权衡,并且可以稳健地推广到未见过的工具。这些结果表明,使用轻量级编排模型组合多种工具比现有方法更高效、更有效,为实用且可扩展的工具增强推理系统铺平了道路。
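下面给出一个把"结果、效率、用户偏好"三类信号组合成标量奖励的极简示意函数;具体的加权形式、预算归一化方式和参数名都是假设,仅用于说明 ToolOrchestra 所述多目标奖励的大致构造方式,并非论文给出的公式。
```python
def orchestration_reward(correct, cost, budget, used_tools, preferred_tools,
                         w_outcome=1.0, w_eff=0.3, w_pref=0.2):
    """对"结果 + 效率 + 用户偏好"三类奖励做加权组合的示意
    (权重与具体形式均为假设,并非论文原始定义)。"""
    r_outcome = 1.0 if correct else 0.0
    r_eff = max(0.0, 1.0 - cost / budget)              # 花费越少于预算,效率奖励越高
    overlap = len(set(used_tools) & set(preferred_tools))
    r_pref = overlap / max(1, len(used_tools))         # 实际调用与用户偏好工具的重合度
    return w_outcome * r_outcome + w_eff * r_eff + w_pref * r_pref

# 示例:答对、花费 0.4 倍预算、两个工具中有一个符合用户偏好
print(orchestration_reward(True, cost=0.4, budget=1.0,
                           used_tools=["search", "small-llm"],
                           preferred_tools=["search"]))  # 1.28
```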
使用大型语言模型稳定强化学习:表述与实践
- 标题: Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
- 作者: Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, Junyang Lin
- 日期: 2025-12-01
- ArXiv主页 : https://arxiv.org/abs/2512.01374
- 论文链接 : https://arxiv.org/pdf/2512.01374
英文摘要
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
中文摘要
本文提出了一种使用大型语言模型的强化学习 (RL) 的新颖表述,解释了为什么以及在什么条件下可以通过 REINFORCE 等策略梯度方法中的替代 (surrogate) 令牌级目标来优化真实的序列级奖励。具体来说,通过一阶近似,我们表明只有当训练与推理之间的差异和策略陈旧性都最小化时,这一替代目标才会越来越有效。这一见解为几种广泛采用的技术在稳定 RL 训练中的关键作用提供了原则性解释,包括重要性采样校正、裁剪,特别是专家混合 (MoE) 模型的路由重放 (Routing Replay)。通过对 30B MoE 模型总计数十万 GPU 小时的大量实验,我们表明,对于同策略 (on-policy) 训练,带重要性采样校正的基本策略梯度算法实现了最高的训练稳定性。当引入离策略 (off-policy) 更新来加速收敛时,结合裁剪和路由重放对于减轻策略陈旧造成的不稳定至关重要。值得注意的是,一旦训练稳定,无论冷启动初始化如何,长时间的优化都会产生可比的最终性能。我们希望所分享的见解和总结出的稳定 RL 训练配方能够促进未来的研究。
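下面用 PyTorch 写出一个带重要性采样校正与裁剪的令牌级策略梯度替代目标的最小示意(PPO 风格的常见写法),用来对应文中"重要性采样比校正训练-推理差异、裁剪抑制策略陈旧"的讨论;张量形状与优势的广播方式为假设,这只是说明性草图,不是论文的完整训练配方。
```python
import torch

def token_level_pg_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """带重要性采样校正与裁剪的令牌级策略梯度替代目标(示意)。
    logp_new/logp_old: 当前策略与采样策略下各令牌的对数概率,[batch, seq_len]
    advantages: 序列级奖励广播到各令牌后的优势估计,[batch, seq_len]"""
    ratio = torch.exp(logp_new - logp_old)                      # 重要性采样比,校正训练/推理差异
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # 取更保守的一支,抑制策略陈旧带来的偏移

# 形状检查
lp_new = torch.randn(2, 8, requires_grad=True)
lp_old = lp_new.detach() + 0.01 * torch.randn(2, 8)
adv = torch.randn(2, 1).expand(2, 8)
print(token_level_pg_loss(lp_new, lp_old, adv))
```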
Envision:因果世界过程洞察的统一理解与生成基准测试
- 标题: Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
- 作者: Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
- 日期: 2025-12-01
- ArXiv主页 : https://arxiv.org/abs/2512.01816
- 论文链接 : https://arxiv.org/pdf/2512.01816
- 项目链接 : https://opendatalab-raiser.github.io/Envision/
- gitHub仓库 : https://github.com/opendatalab-raiser/Envision
英文摘要
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision-a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) uncovers: specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally-isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling-ultimately limiting world knowledge internalization, generation.
中文摘要
当前的多模态模型旨在通过统一理解和生成来超越单模态表示的局限性,通常使用文本到图像(T2I)任务来校准语义一致性。然而,它们在训练和评估中对静态单图像生成的依赖导致了对静态模式匹配和语义融合的过度拟合,同时从根本上阻碍了它们对随时间展开的动态过程进行建模的能力。为了解决这些限制,我们提出了 Envision,一个用于链式文本到多图像生成的因果事件进展基准。它以世界知识为基础,以时空因果关系为结构,重新组织了现有的评估维度,并包括跨越六个科学和人文领域的 1,000 个四阶段提示。为了将评估从单个图像过渡到连续帧,并评估模型是否真正内化了世界知识,同时遵守因果时间约束,我们引入了 Envision-Score,这是一种集成多维一致性、物理性和美学的整体指标。对 15 个模型(10 个专业 T2I 模型,5 个统一模型)的综合评估发现:专业 T2I 模型表现出熟练的审美渲染能力,但缺乏内在的世界知识。统一的多模态模型弥补了这一差距,在因果叙事连贯性方面始终优于专业模型。然而,即使是这些统一架构也仍然落后于闭源模型,并且难以克服时空一致性这一核心挑战。这表明,对因果隔离的单个图像的关注会阻碍多帧推理和生成,促进静态模式匹配而不是动态世界建模,最终限制世界知识的内化和生成。
DeepSeekMath-V2:迈向可自我验证的数学推理
- 标题: DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
- 作者: Zhihong Shao, Yuxiang Luo, Chengda Lu, Z. Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang
- 日期: 2025-11-27
- ArXiv主页 : https://arxiv.org/abs/2511.22570
英文摘要
Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.
中文摘要
大型语言模型在数学推理方面取得了重大进展,数学推理是人工智能的重要测试平台,如果进一步发展,可能会影响科学研究。通过用奖励正确最终答案的强化学习来扩展推理,大型语言模型在一年内从表现不佳进步到在 AIME 和 HMMT 等定量推理竞赛中达到饱和。然而,这种方法面临着根本的局限性。追求更高的最终答案准确性并不能解决关键问题:正确的答案并不能保证正确的推理。此外,许多数学任务(例如定理证明)需要严格的逐步推导,而不是数字答案,这使得最终答案奖励不适用。为了突破深度推理的极限,我们认为有必要验证数学推理的全面性和严谨性。自我验证对于扩展测试时计算尤其重要,尤其是对于没有已知解的开放问题。为了实现可自我验证的数学推理,我们研究了如何训练准确且忠实的基于 LLM 的验证者来进行定理证明。然后,我们使用验证者作为奖励模型来训练证明生成器,并激励生成器在最终确定证明之前识别并解决其证明中尽可能多的问题。为了在生成器变得更强时保持生成与验证之间的差距,我们建议扩展验证计算以自动标注新的难以验证的证明,从而创建训练数据以进一步改进验证者。我们的最终模型 DeepSeekMath-V2 展示了强大的定理证明能力,在 IMO 2025 和 CMO 2024 上获得了金牌水平的分数,并在扩展测试时计算的情况下在 Putnam 2024 上获得了近乎完美的 118/120。
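下面的 Python 草图示意"生成器给出证明、验证器打分并列出问题、生成器在定稿前迭代修复"的自我核查循环;generate、verify 等接口与停止阈值均为假设,仅用于说明把验证器当作奖励/反馈来源的控制流,并非 DeepSeekMath-V2 的官方实现。
```python
def refine_proof(problem, generate, verify, max_rounds=3, threshold=0.9):
    """"生成器 + 验证器"自我核查循环的示意草图(接口均为假设):
    生成器先给出证明,验证器打分并列出问题,生成器在定稿前尽量修复这些问题。"""
    proof = generate(problem, feedback=None)
    for _ in range(max_rounds):
        score, issues = verify(problem, proof)      # 验证器同时扮演奖励模型的角色
        if score >= threshold or not issues:
            return proof, score
        proof = generate(problem, feedback=issues)  # 依据验证意见修订证明
    return proof, verify(problem, proof)[0]

# 玩具接口演示:第一版证明缺一步,第二版补上
def toy_generate(problem, feedback=None):
    return "完整证明(已补充归纳步骤)" if feedback else "初稿证明(缺少归纳步骤)"
def toy_verify(problem, proof):
    return (0.95, []) if "完整" in proof else (0.4, ["归纳步骤缺失"])

print(refine_proof("证明:对所有 n, 1+2+...+n = n(n+1)/2", toy_generate, toy_verify))
```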
Nex-N1:通过统一生态系统训练的代理模型,用于大规模环境构建
- 标题: Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
- 作者: Nex-AGI Team, Yuxuan Cai, Lu Chen, Qiaoling Chen, Yuyang Ding, Liwen Fan, Wenjie Fu, Yufei Gao, Honglin Guo, Pinxue Guo, Zhenhua Han, Zhengfu He, Hanglei Hu, Kai Hu, Shengjia Hua, Tianyu Huai, Baodai Huang, Li Ji, Zhen Jiang, Zhikai Lei, Bufan Li, Jiahang Lin, Lizhi Lin, Jinxiu Liu, Shichun Liu, Ziming Liu, Yuchen Ni, Pengfang Qian, Yujiong Shen, Qingyun Shi, Wentao Shu, Peng Sun, Yiran Suo, Tian Tang, Boyu Tian, Guoteng Wang, Junzhe Wang, Peixin Wang, Zhiheng Xi, Hang Yan, Jie Yang, Zhixiong Yang, Tianchu Yao, Guangze Ye, Qianxi Yu, Shuo Zhang, Xinyue Zhang, Yiqi Zhang, Jiarong Zhao, Miao Zheng, Rui Zheng, Enyu Zhou, Jiazheng Zhou, Maosen Zhou, Yuhao Zhou, Tao Gui, Yining Zheng, Xinchi Chen, Jie Zhou, Siyuan Feng, Qin Chen, Liang He, Qi Zhang, Xuanjing Huang, Xipeng Qiu
- 日期: 2025-12-04
- ArXiv主页 : https://arxiv.org/abs/2512.04987
- gitHub仓库 : https://github.com/nex-agi/Nex-N1
英文摘要
The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environment for grounded trajectories synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.
中文摘要
大型语言模型(LLM)从被动响应者到自主代理的演变需要学习范式的根本转变:从静态模仿到激励驱动的决策。然而,由于缺乏能够构建高质量交互信号以进行有效策略学习的可扩展基础设施,这种转变受到严重阻碍。为了解决这个问题,我们引入了一种综合方法,旨在系统地扩展交互式环境的多样性和复杂性。我们的方法通过解决三个正交维度来实现这种扩展:(1)复杂性:NexAU,一个灵活的代理框架,支持通过简单的配置构建复杂的代理层次结构;(2)多样性:NexA4A 自动从自然语言生成多样化的代理层次结构,以覆盖无限领域;(3)保真度:NexGAP 通过集成动态真实环境来合成有真实依据的轨迹,弥合了模拟与现实之间的差距。我们在由我们的基础设施构建的多样化且复杂的交互环境中训练 Nex-N1。SWE-bench 和 tau2 等基准测试的实证结果表明,Nex-N1 始终优于 SOTA 开源模型,并在复杂代理任务上取得了与前沿专有模型相竞争的性能。我们开源 Nex 生态系统和模型权重,以促进进一步的研究。
TUNA:为原生统一多模态模型驯服统一视觉表示
- 标题: TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
- 作者: Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong
- 日期: 2025-12-01
- ArXiv主页 : https://arxiv.org/abs/2512.02014
- 论文链接 : https://arxiv.org/pdf/2512.02014
- 项目链接 : https://tuna-ai.org/
- gitHub仓库 : https://github.com/wren93/tuna
英文摘要
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
中文摘要
统一多模态模型(UMM)旨在在单一框架内联合执行多模态理解和生成。我们提出了 TUNA,一种原生 UMM,它通过级联 VAE 编码器和表示编码器来构建统一的连续视觉表示。这种统一的表示空间允许对图像和视频进行端到端处理,以实现理解和生成任务。与之前具有解耦表示的 UMM 相比,TUNA 的统一视觉空间避免了单独编码器引入的表示格式不匹配,在理解和生成方面都优于解耦替代方案。此外,我们观察到,更强的预训练表示编码器在所有多模态任务中始终能产生更好的性能,这凸显了表示编码器的重要性。最后,在这个统一的环境中,对理解和生成数据的联合训练使这两个任务能够相互受益而不是相互干扰。我们在多模态理解和生成基准方面进行的大量实验表明,TUNA 在图像和视频理解、图像和视频生成以及图像编辑方面取得了最先进的结果,展示了其统一表示设计的有效性和可扩展性。
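下面用 PyTorch 给出"VAE 编码器级联表示编码器、得到同时服务理解与生成的统一视觉表示"这一思路的极简示意;网络层、通道数与输出维度均为随意假设,仅用于说明级联结构,并非 TUNA 的官方网络定义。
```python
import torch
import torch.nn as nn

class UnifiedVisualEncoder(nn.Module):
    """"VAE 编码器级联表示编码器"的极简示意(结构与维度均为假设)。"""
    def __init__(self, in_ch=3, latent_ch=16, repr_dim=256):
        super().__init__()
        # 第一级:类 VAE 编码器,把像素压成低分辨率连续潜变量
        self.vae_enc = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_ch, 4, stride=2, padding=1),
        )
        # 第二级:表示编码器,在潜变量之上提取语义特征,供理解与生成共用
        self.repr_enc = nn.Sequential(
            nn.Conv2d(latent_ch, repr_dim, 3, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(8),
        )

    def forward(self, images):
        latents = self.vae_enc(images)                               # 可供生成(扩散/解码)使用
        tokens = self.repr_enc(latents).flatten(2).transpose(1, 2)   # 可供理解(LLM)使用的视觉 token
        return latents, tokens

enc = UnifiedVisualEncoder()
lat, tok = enc(torch.randn(1, 3, 256, 256))
print(lat.shape, tok.shape)  # torch.Size([1, 16, 64, 64]) torch.Size([1, 64, 256])
```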
深度研究:系统性综述
- 标题: Deep Research: A Systematic Survey
- 作者: Zhengliang Shi, Yiqun Chen, Haitao Li, Weiwei Sun, Shiyu Ni, Yougang Lyu, Run-Ze Fan, Bowen Jin, Yixuan Weng, Minjun Zhu, Qiujie Xie, Xinyu Guo, Qu Yang, Jiayi Wu, Jujia Zhao, Xiaqiang Tang, Xinbei Ma, Cunxiang Wang, Jiaxin Mao, Qingyao Ai, Jen-Tse Huang, Wenxuan Wang, Yue Zhang, Yiming Yang, Zhaopeng Tu, Zhaochun Ren
- 日期: 2025-11-24
- ArXiv主页 : https://arxiv.org/abs/2512.02038
- 论文链接 : https://arxiv.org/pdf/2512.02038
- 项目链接 : https://deep-research-survey.github.io/
- gitHub仓库 : https://github.com/mangopy/Deep-Research-Survey
英文摘要
Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.
中文摘要
大型语言模型 (LLM) 已从文本生成器迅速发展为强大的问题解决器。然而,许多开放任务需要批判性思维、多源信息和可验证的输出,这超出了单次提示或标准检索增强生成的范围。最近,大量研究探索了深度研究(Deep Research, DR),旨在将 LLM 的推理能力与搜索引擎等外部工具相结合,从而使其能够充当可完成复杂、开放式任务的研究代理。本综述对深度研究系统进行了全面、系统的概述,包括清晰的路线图、基础组件、实际实施技术、重要挑战和未来方向。具体来说,我们的主要贡献如下:(i)我们形式化了一个三阶段路线图,并将深度研究与相关范式区分开来;(ii)我们介绍了四个关键组件:查询规划、信息获取、记忆管理和答案生成,每个组件都配有细粒度的子分类法;(iii)我们总结了优化技术,包括提示、监督微调和代理强化学习;(iv)我们整合了评估标准与开放挑战,以引导和促进未来的发展。随着深度研究领域不断快速发展,我们致力于持续更新本综述,以反映该领域的最新进展。
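下面的 Python 草图把综述中归纳的四个组件(查询规划、信息获取、记忆管理、答案生成)串成一个最小的深度研究循环,便于理解它们之间的调用关系;plan、search 等函数均为假设的占位接口,仅作流程示意。
```python
def deep_research(question, plan, search, memory_update, synthesize, max_steps=5):
    """四组件深度研究循环的示意草图(接口均为假设)。"""
    memory = []
    for _ in range(max_steps):
        queries = plan(question, memory)              # 1. 查询规划:决定下一批检索什么
        if not queries:                               #    规划器认为证据已足够
            break
        for q in queries:
            docs = search(q)                          # 2. 信息获取:调用搜索等外部工具
            memory = memory_update(memory, q, docs)   # 3. 记忆管理:去重、压缩、入库
    return synthesize(question, memory)               # 4. 答案生成:汇总为带引用的报告

# 玩具组件演示
plan = lambda q, m: [] if m else [q + " 综述", q + " 最新进展"]
search = lambda q: [f"<{q} 的检索结果>"]
memory_update = lambda m, q, docs: m + docs
synthesize = lambda q, m: f"报告({len(m)} 条证据):" + ";".join(m)
print(deep_research("深度研究代理", plan, search, memory_update, synthesize))
```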
MultiShotMaster:可控多镜头视频生成框架
- 标题: MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
- 作者: Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia
- 日期: 2025-12-02
- ArXiv主页 : https://arxiv.org/abs/2512.03041
- 论文链接 : https://arxiv.org/pdf/2512.03041
- 项目链接 : https://tianhao-qi.github.io/Mask2DiTProject/
英文摘要
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
中文摘要
目前的视频生成技术擅长单镜头剪辑,但难以制作叙事性多镜头视频,这需要灵活的镜头安排、连贯的叙事以及超越文本提示的可控性。为了应对这些挑战,我们提出了 MultiShotMaster,一个用于高度可控的多镜头视频生成的框架。我们通过集成 RoPE 的两种新颖变体来扩展预训练的单镜头模型。首先,我们引入多镜头叙事 RoPE,它在镜头切换处施加显式相移,实现灵活的镜头安排,同时保留时间上的叙事顺序。其次,我们设计了时空位置感知 RoPE 来纳入参考令牌和定位(grounding)信号,从而实现具有时空定位的参考注入。此外,为了克服数据稀缺的问题,我们建立了一个自动数据标注管道来提取多镜头视频、字幕、跨镜头定位信号和参考图像。我们的框架利用模型固有的架构属性来支持多镜头视频生成,具有文本驱动的镜头间一致性、带运动控制的主体定制以及背景驱动的场景定制。镜头数量和时长均可灵活配置。大量实验证明了我们框架的优越性能和出色的可控性。
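下面用一个简单的 Python 函数示意"多镜头叙事 RoPE 在镜头切换处加显式相位偏移"的位置索引构造方式:镜头内部位置连续,跨镜头时额外加一个大偏移,再把这些位置喂给标准 RoPE 即可;其中 shot_gap 的取值纯属假设,并非论文实际采用的参数。
```python
def narrative_positions(shot_lengths, shot_gap=1000):
    """构造"镜头内连续、镜头间加大偏移"的位置索引(shot_gap 为假设取值):
    既保留时间叙事顺序,又让模型分清镜头边界。"""
    positions, base = [], 0
    for length in shot_lengths:
        positions.append([base + t for t in range(length)])
        base += length + shot_gap     # 镜头结束后跳过 shot_gap,形成显式相位间隔
    return positions

# 三个镜头分别有 4、3、5 帧
for i, pos in enumerate(narrative_positions([4, 3, 5], shot_gap=100)):
    print(f"shot {i}: {pos}")
# shot 0: [0, 1, 2, 3]
# shot 1: [104, 105, 106]
# shot 2: [207, 208, 209, 210, 211]
```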
PaperDebugger:基于插件的多代理系统,用于编辑器内学术写作、评论和编辑
- 标题: PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing
- 作者: Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He
- 日期: 2025-12-02
- ArXiv主页 : https://arxiv.org/abs/2512.02589
- 论文链接 : https://arxiv.org/pdf/2512.02589
- 项目链接 : https://www.paperdebugger.com/
- gitHub仓库 : https://github.com/PaperDebugger/PaperDebugger
英文摘要
Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writing environment. Enabling such in-editor interaction is technically non-trivial: it requires reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff-based updates, encapsulated within a minimal-intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor-native, agentic writing assistant. More details about this demo and video could be found at https://github.com/PaperDebugger/PaperDebugger.
中文摘要
大型语言模型越来越多地嵌入到学术写作工作流程中,但现有的助手仍然位于编辑器之外,从而阻碍了与文档状态、结构和修订历史的深入交互。这种分离使得无法直接在 LaTeX 编辑器(例如 Overleaf)中支持代理、上下文感知操作。我们推出了 PaperDebugger,这是一个编辑器内、多代理和基于插件的学术写作助手,可将 LLM 驱动的推理直接带入写作环境。实现这种编辑器内交互在技术上并不简单:它需要与编辑器进行可靠的双向同步、细粒度的版本控制和修补、安全状态管理、多代理调度以及与外部工具的可扩展通信。PaperDebugger 通过 Chrome 批准的扩展、Kubernetes 原生编排层以及集成了文献搜索、参考文献查找、文档评分和修订管道的模型上下文协议 (MCP) 工具链来解决这些挑战。我们的演示展示了完全集成的工作流程,包括本地化编辑、结构化审查、并行代理执行和基于差异的更新,封装在最小侵入用户界面 (UI) 中。早期的聚合分析展示了活跃的用户参与度,并验证了编辑器原生的代理写作助手的实用性。有关此演示和视频的更多详细信息,请访问 https://github.com/PaperDebugger/PaperDebugger。
我们距离真正有用的深度研究代理还有多远?
- 标题: How Far Are We from Genuinely Useful Deep Research Agents?
- 作者: Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou
- 日期: 2025-12-01
- ArXiv主页 : https://arxiv.org/abs/2512.01948
英文摘要
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics -- this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
中文摘要
深度研究代理 (DRA) 旨在通过迭代信息检索和综合自动生成分析师级别的报告。然而,大多数现有的 DRA 都是在问答基准上进行验证的,而生成综合报告的研究仍然被忽视。更糟糕的是,当前的报告合成基准受到任务复杂性和主观指标的困扰,既无法反映用户需求,也限制了生成报告的实际效用。为了弥补这些差距,我们推出了细粒度深度研究基准 (FINDER),这是一个增强的基准,由 100 项人工策划的研究任务和 419 个结构化清单项目组成,这些清单项目对报告结构、分析深度和事实依据做了标准化规定。基于主流 DRA 生成的约 1,000 份报告,我们进一步提出深度研究失败分类法 (DEFT),这是首个面向深度研究代理的失败分类法。DEFT 包含 14 种跨越推理、检索和生成的细粒度失败模式,建立在扎根理论之上,并经过人类与 LLM 的联合标注和标注者间一致性验证。我们的实验结果表明,当前的 DRA 并非在任务理解方面存在困难,而是在证据整合、验证以及具备推理韧性的规划方面存在困难。
以最少的人类监督引导自我进化的大型语言模型
- 标题: Guided Self-Evolving LLMs with Minimal Human Supervision
- 作者: Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu
- 日期: 2025-12-02
- ArXiv主页 : https://arxiv.org/abs/2512.02472
- 论文链接 : https://arxiv.org/pdf/2512.02472
英文摘要
AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
中文摘要
长期以来,人工智能的自我进化一直被视为通往超级智能的道路,模型可以从自己的学习经验中自主获取、完善和内化知识。然而在实践中,无引导的自我进化系统往往会随着训练的进行而很快陷入停滞,甚至出现退化。这些失败源于概念漂移、多样性崩溃和错误进化等问题,因为模型会强化自身的偏见并趋向于低熵行为。为了使模型能够以稳定、可控的方式自我进化,同时最大限度地减少对人类监督的依赖,我们引入了 R-Few,这是一种有引导的自我对弈挑战者-求解器框架,通过上下文中的示例引导和混合训练引入轻量级的人类监督。在每次迭代中,挑战者会采样一小部分人类标注的示例来指导合成问题的生成,而求解器则在基于难度的在线课程下,在人类示例与合成示例上进行联合训练。在数学和通用推理基准测试中,R-Few 实现了一致且可迭代的改进。例如,Qwen3-8B-Base 在数学任务上比 R-Zero 提高了 3.0 分,并且达到了与 General-Reasoner 相当的性能,尽管后者是在多 20 倍的人类数据上训练的。消融研究证实了基于示例引导的挑战者训练与基于课程的求解器训练的互补贡献,进一步的分析表明 R-Few 减轻了漂移,产生了更稳定、更可控的共同进化动态。
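下面的 Python 草图示意 R-Few 单轮迭代的大致流程:挑战者以少量人工示例为上下文生成合成题目,求解器在"人工 + 合成"的难度课程上更新;所有接口(challenger、solver、solver_update)与难度筛选区间均为假设,仅作流程示意,并非官方实现。
```python
import random

def r_few_round(human_examples, challenger, solver, solver_update, k=4, n_synthetic=16):
    """R-Few 单轮迭代的示意草图(接口与筛选区间均为假设)。"""
    seeds = random.sample(human_examples, k)                   # 少量人类监督作为锚点
    synthetic = [challenger(seeds) for _ in range(n_synthetic)]
    # 在线难度课程:按当前求解器的通过率挑选"不太易也不太难"的题目
    scored = [(q, solver(q)) for q in synthetic]
    curriculum = [q for q, acc in scored if 0.2 <= acc <= 0.8]
    return solver_update(human_examples + curriculum)          # 人工 + 合成混合训练

# 玩具演示
human = [f"人工题{i}" for i in range(10)]
challenger = lambda seeds: "合成题(仿照:" + seeds[0] + ")"
solver = lambda q: random.random()                             # 用随机数假装通过率
solver_update = lambda data: f"在 {len(data)} 道题上更新求解器"
print(r_few_round(human, challenger, solver, solver_update))
```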
视频生成中的重力如何?利用可验证奖励对牛顿定律进行后训练
- 标题: What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
- 作者: Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, Dimitris Samaras
- 日期: 2025-11-29
- ArXiv主页 : https://arxiv.org/abs/2512.00425
- 论文链接 : https://arxiv.org/pdf/2512.00425
- 项目链接 : https://cvlab-stonybrook.github.io/NewtonRewards
- gitHub仓库 : https://github.com/cvlab-stonybrook/NewtonRewards
英文摘要
Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws-objects float, accelerations drift, and collisions behave inconsistently-revealing a persistent gap between visual realism and physical realism. We propose NewtonRewards, the first physics-grounded post-training framework for video generation based on verifiable rewards. Instead of relying on human or VLM feedback, NewtonRewards extracts measurable proxies from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate NewtonRewards on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, NewtonBench-60K. Across all primitives in visual and physics metrics, NewtonRewards consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.
中文摘要
最近的视频扩散模型可以合成视觉上引人注目的片段,但常常违反基本物理定律(物体漂浮、加速度漂移、碰撞行为不一致),揭示了视觉真实感和物理真实感之间持续存在的差距。我们提出了 NewtonRewards,这是首个基于可验证奖励、以物理为依据的视频生成后训练框架。NewtonRewards 不依赖人类或 VLM 反馈,而是使用冻结的辅助模型从生成的视频中提取可测量的代理量:光流充当速度的代理,而高层外观特征充当质量(mass)的代理。这些代理量使我们能够通过两个互补的奖励显式地施加牛顿结构:强制恒定加速度动力学的牛顿运动学约束,以及防止出现平凡退化解的质量守恒奖励。我们使用新构建的大规模基准 NewtonBench-60K,在五个牛顿运动基元(自由落体、水平/抛物线投掷以及沿斜坡下滑/上滑)上评估 NewtonRewards。在所有基元的视觉和物理指标上,NewtonRewards 都比先前的后训练方法持续提升了物理合理性、运动平滑度和时间连贯性,并在高度、速度和摩擦力发生分布外变化时保持强劲的性能。我们的结果表明,以物理为依据的可验证奖励为物理感知的视频生成提供了一条可扩展的路径。
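下面用 NumPy 给出"以光流为速度代理、惩罚加速度随时间变化"的可验证奖励的极简示意:对匀加速运动(急动度为零)给出接近 1 的奖励,对随机运动给出较低奖励;具体的奖励形式、归一化常数 sigma 与张量布局均为假设,并非 NewtonRewards 论文中的确切定义。
```python
import numpy as np

def newtonian_kinematic_reward(flows, sigma=1.0):
    """恒定加速度约束奖励的示意(形式为假设):
    flows: [T, H, W, 2],相邻帧间的光流;先对空间取平均得到每帧速度,
    再差分得到加速度;加速度越接近常数(急动度越小)奖励越高。"""
    v = flows.reshape(flows.shape[0], -1, 2).mean(axis=1)   # [T, 2] 每帧平均速度
    a = np.diff(v, axis=0)                                  # [T-1, 2] 加速度
    jerk = np.diff(a, axis=0)                               # [T-2, 2] 加速度的变化量
    return float(np.exp(-np.linalg.norm(jerk, axis=-1).mean() / sigma))

# 自由落体式的匀加速运动:速度线性增长 -> 急动度为 0 -> 奖励接近 1
T, H, W = 8, 4, 4
good = np.stack([np.ones((H, W, 2)) * np.array([0.0, 0.5 * t]) for t in range(T)])
bad = np.random.randn(T, H, W, 2)
print(newtonian_kinematic_reward(good), newtonian_kinematic_reward(bad))
```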
MG-Nav:通过稀疏空间内存的双尺度视觉导航
- 标题: MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory
- 作者: Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi
- 日期: 2025-11-27
- ArXiv主页 : https://arxiv.org/abs/2511.22609
- 论文链接 : https://arxiv.org/pdf/2511.22609
英文摘要
We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.
中文摘要
我们提出了 MG-Nav(记忆引导导航),这是一种用于零样本视觉导航的双尺度框架,它将全局的记忆引导规划与局部的几何增强控制相统一。其核心是稀疏空间记忆图(SMG),这是一种紧凑的、以区域为中心的记忆,其中每个节点聚合多视图关键帧和物体语义,在保留视点多样性的同时捕获外观和空间结构。在全局层面,智能体在 SMG 上进行定位,并通过图像到实例的混合检索规划出以目标为条件的节点路径,生成一系列可到达的路径点以进行长程引导。在局部层面,导航基础策略以点目标模式执行这些路径点,并带有障碍物感知控制;当从最后一个节点导航到视觉目标时,则切换到图像目标模式。为了进一步增强视点对齐和目标识别,我们引入了 VGGT-adapter,这是一个基于预训练 VGGT 模型构建的轻量级几何模块,可在共享的 3D 感知空间中对齐观测特征和目标特征。MG-Nav 以不同的频率运行全局规划和局部控制,并使用周期性的重新定位来纠正误差。在 HM3D Instance-Image-Goal 和 MP3D Image-Goal 基准上的实验表明,MG-Nav 实现了最先进的零样本性能,并在动态物体重排和未见过的场景条件下保持鲁棒性。
Skywork-R1V4:通过图像与深度研究的交错思考实现代理式多模态智能
- 标题: Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
- 作者: Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou
- 日期: 2025-12-02
- ArXiv主页 : https://arxiv.org/abs/2512.02395
- 论文链接 : https://arxiv.org/pdf/2512.02395
- 项目链接 : https://docs.skyworkmodel.ai/r1v4/api-reference/completions.html
英文摘要
Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
中文摘要
尽管多模态代理系统最近取得了进展,但现有方法通常将图像操作和网络搜索视为彼此割裂的能力,严重依赖昂贵的强化学习,并且缺乏基于真实工具执行轨迹的规划。为了解决这些限制,我们提出了 Skywork-R1V4,这是一个 30B(A3B)参数的多模态代理模型,它统一了多模态规划、主动图像操作("用图像思考")、深度多模态搜索,以及最关键的、在视觉操作和外部知识检索之间动态交替的交错推理。Skywork-R1V4 仅通过对不到 30,000 条高质量、规划与执行一致的轨迹进行监督微调来训练,并经过逐步一致性过滤的验证,在感知和多模态搜索基准测试中取得了最先进的结果:它在 MMSearch 上得分 66.1,在 FVQA 上得分 67.2,在全部 11 项指标上均超过 Gemini 2.5 Flash。Skywork-R1V4 在推理时展现出涌现的长时程推理能力,能够成功编排 10 次以上的工具调用来解决复杂的多步骤任务。我们的结果表明,仅通过精心整理的监督学习、完全不依赖强化学习,就可以实现复杂的代理式多模态智能。
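下面是一个假设性的 Python 草图(并非 Skywork-R1V4 的实际接口,工具注册表与脚本化的轨迹均为示意),用来说明摘要所描述的交错推理循环:模型在自由推理与工具调用之间交替,工具既可以对图像进行操作("用图像思考"),也可以进行外部检索。

```python
# Hypothetical interleaved think/act loop; the "tools" and trajectory are toys.
def crop_image(image, box):
    """Toy image tool: image is a dict with a 'size' field; box = (x0, y0, x1, y1)."""
    return {"size": (box[2] - box[0], box[3] - box[1]), "parent": image}

def web_search(query):
    """Toy search tool returning canned evidence."""
    return [f"result for: {query}"]

TOOLS = {"crop_image": crop_image, "web_search": web_search}

def run_agent(steps, max_calls=12):
    """Execute a trajectory of interleaved reasoning and tool steps."""
    transcript, calls = [], 0
    for step in steps:
        if step["type"] == "think":
            transcript.append(("think", step["text"]))
        else:  # tool call, alternating visual operations and retrieval
            if calls >= max_calls:
                break
            calls += 1
            out = TOOLS[step["tool"]](*step["args"])
            transcript.append((step["tool"], out))
    return transcript

if __name__ == "__main__":
    image = {"size": (1024, 768)}
    trajectory = [
        {"type": "think", "text": "The answer likely sits in the poster's corner."},
        {"type": "tool", "tool": "crop_image", "args": (image, (800, 600, 1024, 768))},
        {"type": "think", "text": "The cropped logo needs external verification."},
        {"type": "tool", "tool": "web_search", "args": ("logo in cropped region",)},
    ]
    for entry in run_agent(trajectory):
        print(entry)
```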
PretrainZero:强化主动预训练
- 标题: PretrainZero: Reinforcement Active Pretraining
- 作者: Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang
- 日期: 2025-12-03
- ArXiv主页 : https://arxiv.org/abs/2512.03442
- 论文链接 : https://arxiv.org/pdf/2512.03442
英文摘要
Mimicking human behavior to actively learning from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, i.e., software and math, but still rely heavily on verifiable rewards in specific domains, placing a significant bottleneck to extend the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from pretraining corpus, and reason to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3 to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base for 8.43, 5.96 and 10.60 on MMLU-Pro, SuperGPQA and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
中文摘要
模仿人类行为、主动从通用经验中学习并实现通用人工智能一直是人类的梦想。最近基于强化学习(RL)的大型思考模型展示了令人印象深刻的专家级能力(即软件和数学方面),但仍然严重依赖特定领域的可验证奖励,这为扩展通用推理能力的性能边界设置了重大瓶颈。在这项工作中,我们提出了 PretrainZero,一种构建在预训练语料库之上的强化主动学习框架,可将强化学习从特定领域的后训练扩展到通用预训练。PretrainZero 具有以下特点:1)主动预训练:受人类主动学习能力的启发,PretrainZero 学习统一的推理策略,主动从预训练语料中识别合理且信息丰富的内容,并通过强化学习进行推理来预测这些内容。2)自监督学习:在没有任何可验证标签、预训练奖励模型或监督微调的情况下,我们直接使用强化学习在通用维基百科语料库上预训练从 3B 到 30B 的基础模型作为推理器,显著突破了通用推理的验证数据墙。3)验证扩展:通过解决难度不断提高的掩码片段,PretrainZero 大幅增强了预训练基础模型的通用推理能力。在强化预训练中,PretrainZero 将 Qwen3-4B-Base 在 MMLU-Pro、SuperGPQA 和数学平均基准上分别提升了 8.43、5.96 和 10.60。在后训练阶段,预训练模型还可以作为下游 RLVR 任务的推理基础模型。
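下面用一个玩具级 Python 草图说明摘要所描述的自监督目标:在预训练文本中掩盖一个信息量较高的片段,由策略模型推理并预测该片段,再以预测与被掩盖内容的匹配程度作为奖励,无需外部验证器。其中的打分规则(词级 F1)只是示意性选择,不一定是论文使用的奖励形式。

```python
# Toy masked-span reward: no labels, no external verifier -- the hidden span
# itself supplies the supervision signal for the RL rollout.
def mask_span(text, start, length, mask_token="<mask>"):
    words = text.split()
    hidden = words[start:start + length]
    masked = words[:start] + [mask_token] + words[start + length:]
    return " ".join(masked), " ".join(hidden)

def span_reward(prediction, hidden_span):
    """Token-level F1 between the predicted and the hidden span."""
    pred, gold = prediction.lower().split(), hidden_span.lower().split()
    overlap = sum(min(pred.count(w), gold.count(w)) for w in set(pred))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    text = "The Eiffel Tower was completed in 1889 as the entrance to the World's Fair"
    masked, hidden = mask_span(text, start=5, length=2)   # hides "in 1889"
    print(masked)
    print(span_reward("in 1889", hidden))            # 1.0 -> full reward
    print(span_reward("completed in 1889", hidden))  # 0.8 -> partial credit
```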
REASONEDIT:走向推理增强图像编辑模型
- 标题: REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
- 作者: Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
- 日期: 2025-11-27
- ArXiv主页 : https://arxiv.org/abs/2511.22625
- gitHub仓库 : https://github.com/stepfun-ai/Step1X-Edit?tab=readme-ov-file#step1x-edit-v1p2-v12
英文摘要
Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements of ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from the Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
中文摘要
图像编辑模型近来取得了显著进展。一种常见的架构设计是将多模态大语言模型(MLLM)编码器与扩散解码器相结合,如 Step1X-Edit 和 Qwen-Image-Edit 等系统所示,其中 MLLM 对参考图像和指令进行编码,但在训练期间保持冻结。在这项工作中,我们证明解锁 MLLM 的推理能力可以进一步突破编辑模型的边界。具体来说,我们探索了思考和反思这两种推理机制,以增强指令理解和编辑准确性。在此基础上,我们提出的框架使图像编辑在"思考-编辑-反思"循环中进行:思考机制利用 MLLM 的世界知识来解释抽象指令,而反思机制则审查编辑结果、自动纠正非预期的修改,并确定停止轮次。大量实验表明,我们的推理方法带来了显著的性能提升:当从 Step1X-Edit 初始化我们的 DiT 时(ReasonEdit-S),在 ImgEdit(+4.3%)、GEdit(+4.7%)和 Kris(+8.2%)上均有改进;当与 Qwen-Image-Edit 集成时(ReasonEdit-Q),在 GEdit 和 Kris 上也优于之前的开源方法。
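下面给出"思考-编辑-反思"循环的一个假设性 Python 草图:MLLM 先将抽象指令解释为具体编辑计划(思考),扩散编辑器执行该计划,反思步骤再审查结果并决定是修正重试还是停止。三个可调用对象均为占位符,并非论文发布的模型组件。

```python
# Hypothetical thinking-editing-reflection loop; think/edit/reflect are stubs.
def reason_edit_loop(image, instruction, think, edit, reflect, max_rounds=3):
    plan = think(image, instruction)             # world-knowledge interpretation
    current = image
    for _ in range(max_rounds):
        current = edit(current, plan)            # diffusion-decoder edit
        verdict = reflect(current, instruction)  # {"ok": bool, "fix": str}
        if verdict["ok"]:
            break                                # reflection picks the stopping round
        plan = verdict["fix"]                    # correct unintended manipulations
    return current

if __name__ == "__main__":
    # Toy stand-ins operating on strings instead of images, just to run the loop.
    think = lambda img, instr: f"plan({instr})"
    edit = lambda img, plan: img + f" + {plan}"
    reflect = lambda img, instr: {"ok": "refine" in img, "fix": f"refine({instr})"}
    print(reason_edit_loop("photo", "make it look like winter", think, edit, reflect))
```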
ARM-Thinker:通过代理工具使用和视觉推理强化多模式生成奖励模型
- 标题: ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
- 作者: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
- 日期: 2025-12-04
- ArXiv主页 : https://arxiv.org/abs/2512.05111
- 论文链接 : https://arxiv.org/pdf/2512.05111
- 项目链接 : https://github.com/InternLM/ARM-Thinker
- gitHub仓库 : https://github.com/InternLM/ARM-Thinker
英文摘要
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
中文摘要
奖励模型对于使视觉语言系统与人类偏好保持一致至关重要,但当前的方法存在幻觉、视觉定位能力薄弱以及无法使用工具进行验证等问题,限制了它们在复杂多模态推理任务上的可靠性。我们提出了 ARM-Thinker,一种代理式多模态奖励模型,它能自主调用外部工具(例如图像裁剪、文档页面检索),将判断建立在可验证的证据之上,以取代静态、非交互式的奖励评分。这使得模型能够核实细粒度的视觉细节、交叉引用多页证据并验证推理主张,而这些都是现有奖励模型所缺乏的能力。我们通过多阶段强化学习来训练 ARM-Thinker,联合优化工具调用决策和判断准确性。为了评估代理式奖励建模,我们引入了 ARMBench-VL,它包含三个基准,分别评估细粒度视觉定位(图像级工具)、多页文档理解(检索工具)和指令遵循(文本级验证)。ARM-Thinker 在奖励建模基准上平均提升 +16.2%,在工具使用任务上提升 +9.6%,并且在多模态数学和逻辑推理基准上优于各基线。我们的结果表明,代理能力显著提高了奖励模型的准确性和可解释性。
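下面是一个示意性的 Python 草图(函数与接口名称均为我们的假设,并非 ARM-Thinker 的实际 API),用来说明代理式奖励模型的基本流程:在给出评分之前,评审模型可以调用外部工具(如图像裁剪、文档页面检索)收集证据,最终判断需要建立在这些证据之上。

```python
# Illustrative agentic judge: call tools first, then score against the evidence.
def crop(image, box):
    return f"crop of {image} at {box}"

def retrieve_page(doc, page):
    return f"text of {doc} page {page}"

TOOLS = {"crop": crop, "retrieve_page": retrieve_page}

def agentic_judge(question, answer, tool_plan, verify):
    """Run the judge's tool calls, then score the answer against the evidence."""
    evidence = []
    for name, args in tool_plan:                 # tool-calling decisions
        evidence.append(TOOLS[name](*args))
    score = verify(question, answer, evidence)   # judgment grounded in evidence
    return {"score": score, "evidence": evidence}

if __name__ == "__main__":
    # Toy verifier: reward the answer only if the retrieved evidence mentions it.
    verify = lambda q, a, ev: 1.0 if any(a in e for e in ev) else 0.0
    plan = [("retrieve_page", ("manual.pdf", 3))]
    # The toy retrieval string contains "page 3", so the answer "page 3" is supported.
    print(agentic_judge("Where is the warranty stated?", "page 3", plan, verify))
```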
Infinity-RoPE:自回归自推出产生动作可控的无限视频
- 标题: Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
- 作者: Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
- 日期: 2025-11-25
- ArXiv主页 : https://arxiv.org/abs/2511.20649
- 论文链接 : https://arxiv.org/pdf/2511.20649
- 项目链接 : https://infinity-rope.github.io/
英文摘要
Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce ∞-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish ∞-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that ∞-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
中文摘要
当前的自回归视频扩散模型受到三个核心瓶颈的限制:(i)基础模型的 3D 旋转位置嵌入(3D-RoPE)带来的有限时间范围;(ii)在长时段推演(rollout)中保持细粒度动作控制时对提示(prompt)的响应迟缓;以及(iii)无法在单一生成流内实现不连续的电影式转场。我们提出 ∞-RoPE,这是一个统一的推理时框架,通过三个相互关联的组件解决上述全部三个限制:块相对式 RoPE(Block-Relativistic RoPE)、KV Flush 和 RoPE Cut。块相对式 RoPE 将时间编码重新表述为一个移动的局部参考系:每个新生成的潜在块相对于基础模型的最大帧范围进行旋转,而较早的块则向后旋转以保留相对的时间几何结构。这种相对式的表述消除了固定的时间位置,使连续视频生成能够远超基础模型的位置限制。为了在不重新编码的情况下获得细粒度的动作控制,KV Flush 通过仅保留两个潜在帧(全局 sink 帧和最后生成的潜在帧)来刷新 KV 缓存,从而确保对提示的即时响应。最后,RoPE Cut 在时间 RoPE 坐标中引入受控的不连续性,从而在单次连续推演中实现多切镜头的场景转场。这些组件共同使 ∞-RoPE 成为无限时域、可控且具有电影感的视频扩散的免训练基础。综合实验表明,∞-RoPE 在 VBench 总分上始终优于先前的自回归模型。
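下面用一个数值化的玩具示例(我们自己的示意,并非论文代码)说明"移动参考系"的直觉:时间 RoPE 索引被重新分配,使最新生成的潜在块始终落在基础模型的最大时域末端,较早的块则整体后移,因此任何绝对索引都不会超过位置上限;KV Flush 则只保留全局 sink 帧和最近一帧。

```python
# Toy illustration of relative, block-wise temporal indexing and KV flushing.
def block_relativistic_positions(block_lengths, max_horizon):
    """Return per-block temporal indices after generating the last block."""
    positions, end = [], max_horizon
    for length in reversed(block_lengths):
        start = end - length
        positions.append(list(range(start, end)))
        end = start
    return list(reversed(positions))  # oldest block first

def kv_flush(frame_ids):
    """Keep only the global sink frame and the most recent frame in the cache."""
    return [frame_ids[0], frame_ids[-1]] if len(frame_ids) > 1 else frame_ids

if __name__ == "__main__":
    # Three generated blocks of 4 latent frames with a 12-frame positional limit:
    print(block_relativistic_positions([4, 4, 4], max_horizon=12))
    # After a 4th block the oldest indices slide below zero relative to the newest
    # block -- the point is that positions are relative, never exceeding the limit:
    print(block_relativistic_positions([4, 4, 4, 4], max_horizon=12))
    print(kv_flush(["sink", "f1", "f2", "f3"]))  # -> ['sink', 'f3']
```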
大规模 Vision Bridge 变压器
- 标题: Vision Bridge Transformer at Scale
- 作者: Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
- 日期: 2025-11-28
- ArXiv主页 : https://arxiv.org/abs/2511.23199
- 论文链接 : https://arxiv.org/pdf/2511.23199
- 项目链接 : https://yuanshi9815.github.io/ViBT_homepage/
- gitHub仓库 : https://github.com/Yuanshi9815/ViBT
英文摘要
We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
中文摘要
我们推出 Vision Bridge Transformer(ViBT),这是专为条件生成而设计的布朗桥模型的大规模实例。与将噪声转换为数据的传统扩散模型不同,桥模型直接对输入与输出之间的轨迹进行建模,从而形成一种高效的数据到数据转换范式。通过将这些模型扩展到 20B 和 1.3B 参数,我们展示了它们在图像和视频转换任务中的有效性。为了支持这种规模,我们采用 Transformer 架构,并提出了用于稳健训练的方差稳定化速度匹配目标。这些进展共同凸显了扩展桥模型在基于指令的图像编辑和复杂视频转换方面的强大能力。
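下面给出一个基于 NumPy 的通用布朗桥草图(这是我们的一般化表述,论文中方差稳定化目标的具体形式可能不同):直接在源样本 x0 与目标样本 x1 之间插值并加入桥噪声,再回归一个指向目标的速度场。

```python
# Generic Brownian-bridge interpolation and a simple velocity-matching loss.
import numpy as np

def bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * eps."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    return (1 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1 - t)) * eps

def velocity_matching_loss(pred_velocity, x_t, x1, t, eps_clip=1e-3):
    """Regress toward (x1 - x_t) / (1 - t); clipping the denominator near t -> 1
    keeps the target's scale bounded, a stand-in for variance stabilization."""
    target = (x1 - x_t) / max(1.0 - t, eps_clip)
    return float(np.mean((pred_velocity - target) ** 2))

if __name__ == "__main__":
    x0 = np.zeros(4)          # conditioning input (e.g. source latents)
    x1 = np.ones(4) * 2.0     # target output latents
    x_t = bridge_sample(x0, x1, t=0.5)
    print(x_t)
    print(velocity_matching_loss(pred_velocity=np.ones(4), x_t=x_t, x1=x1, t=0.5))
```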
DualCamCtrl:用于几何感知相机控制视频生成的双分支扩散模型
- 标题: DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
- 作者: Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen
- 日期: 2025-11-28
- ArXiv主页 : https://arxiv.org/abs/2511.23127
- 论文链接 : https://arxiv.org/pdf/2511.23127
- 项目链接 : https://soyouthinkyoucantell.github.io/dualcamctrl-page/
- gitHub仓库 : https://github.com/EnVision-Research/DualCamCtrl
英文摘要
This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl-page/
中文摘要
本文提出了 DualCamCtrl,一种用于相机控制视频生成的新型端到端扩散模型。近期的工作通过将相机位姿表示为基于光线的条件来推进这一领域,但它们往往缺乏足够的场景理解和几何感知。DualCamCtrl 针对这一局限引入了双分支框架,相互生成相机一致的 RGB 序列和深度序列。为了协调这两种模态,我们进一步提出了语义引导相互对齐(SIGMA)机制,以语义引导、相互增强的方式执行 RGB-深度融合。这些设计共同使 DualCamCtrl 能够更好地解耦外观建模与几何建模,生成更忠实地遵循指定相机轨迹的视频。此外,我们分析并揭示了深度和相机位姿在不同去噪阶段的不同影响,并进一步证明早期和后期阶段在形成全局结构和细化局部细节方面发挥着互补作用。大量实验表明,DualCamCtrl 实现了更一致的相机控制视频生成,与之前的方法相比,相机运动误差降低了 40% 以上。我们的项目页面:https://soyouthinkyoucantell.github.io/dualcamctrl-page/
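下面是一个假设性的 PyTorch 草图,用来说明双分支思路:RGB 分支与深度分支并行去噪,并通过一个由两种模态共同计算的门控交换特征,作为 SIGMA 机制的极简替代示意;层的尺寸与门控形式均为示意,并非 DualCamCtrl 的真实架构。

```python
# Hypothetical dual-branch block with a mutual, semantics-style gated exchange.
import torch
import torch.nn as nn


class MutualFusionBlock(nn.Module):
    """One fusion step: each branch is modulated by a gate computed from both."""

    def __init__(self, channels):
        super().__init__()
        self.rgb_block = nn.Conv2d(channels, channels, 3, padding=1)
        self.depth_block = nn.Conv2d(channels, channels, 3, padding=1)
        # Gate that looks at both modalities (illustrative stand-in for SIGMA).
        self.gate = nn.Conv2d(2 * channels, 2 * channels, 1)

    def forward(self, rgb_feat, depth_feat):
        rgb_feat, depth_feat = self.rgb_block(rgb_feat), self.depth_block(depth_feat)
        gates = torch.sigmoid(self.gate(torch.cat([rgb_feat, depth_feat], dim=1)))
        g_rgb, g_depth = gates.chunk(2, dim=1)
        # Mutually reinforced exchange: each branch mixes in the other, gated.
        return rgb_feat + g_rgb * depth_feat, depth_feat + g_depth * rgb_feat


if __name__ == "__main__":
    block = MutualFusionBlock(channels=8)
    rgb = torch.randn(1, 8, 16, 16)
    depth = torch.randn(1, 8, 16, 16)
    out_rgb, out_depth = block(rgb, depth)
    print(out_rgb.shape, out_depth.shape)  # both torch.Size([1, 8, 16, 16])
```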