中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- LLM推理中桥接内部概率和自洽性的理论研究
- 通过核心注意力分解进行高效的长上下文语言模型训练
- 每个注意力都很重要:用于长上下文推理的高效混合架构
- LightMem:轻量级、高效的记忆增强生成
- DeepAnalyze:用于自主数据科学的代理大型语言模型
- 世界中的世界:闭环世界中的世界模型
- OmniVinci:增强架构与数据以实现全模态理解 LLM
- BAPO:通过自适应裁剪的平衡策略优化稳定 LLM 的离策略强化学习
- DeepSeek-OCR:上下文光学压缩
- 每一步都在发展:将强化学习扩展到万亿级思维模型
- [人机协作的论文到网页制作,成本不到 0.1 美元](#人机协作的论文到网页制作,成本不到 0.1 美元)
- FineVision:开放数据就是您所需要的
- 语言模型是内射的,因此是可逆的
- Glyph:通过视觉-文本压缩扩展上下文窗口
- UniGenBench++:文本到图像生成的统一语义评估基准
- [NANO3D:一种无需训练、无需掩码的高效 3D 编辑方法](#NANO3D:一种无需训练、无需掩码的高效 3D 编辑方法)
- PICABench:我们距离真实的图像编辑还有多远?
- LoongRL:长上下文高级推理的强化学习
- AdaSPEC:高效推测解码器的选择性知识蒸馏
- [Open-o3 Video:基于显式时空证据的视频推理](#Open-o3 Video:基于显式时空证据的视频推理)
- Chem-R:作为化学家学习推理
- 使用高质量合成数据集扩展基于指令的视频编辑
- [RL 让 MLLM 比 SFT 看得更清楚](#RL 让 MLLM 比 SFT 看得更清楚)
- 扩散语言模型中的注意力汇聚现象
- 没有变分自动编码器的潜在扩散模型
- 通过上下文学习产生的涌现性错位:狭窄的上下文示例可能产生广泛错位的 LLM
- GigaBrain-0:世界模型驱动的视觉-语言-动作模型
- [Skyfall-GS:从卫星图像合成沉浸式 3D 城市场景](#Skyfall-GS:从卫星图像合成沉浸式 3D 城市场景)
- HoloCine:电影多镜头长视频叙事的整体生成
- MoGA:用于端到端长视频生成的混合组注意力
LLM推理中桥接内部概率和自洽性的理论研究
- 标题: A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
- 作者: Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, Xiaoxing Ma
- 日期: 2025-10-17
- ArXiv主页 : https://arxiv.org/abs/2510.15444
- 论文链接 : https://arxiv.org/pdf/2510.15444
- 项目链接 : https://wnjxyk.github.io/RPC
- gitHub仓库 : https://github.com/WNJXYK/RPC
英文摘要
Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.
中文摘要
测试时扩展(test-time scaling)旨在通过增加计算资源来提高大型语言模型(LLM)的推理性能。该领域的一类流行方法是基于采样的测试时扩展方法,即在推理过程中为给定输入生成多条推理路径来增强推理。然而,尽管这类方法在实践中取得了成功,其理论基础仍未得到充分探索。在本文中,我们从置信度估计的角度出发,提出了第一个用于分析基于采样的测试时扩展方法的理论框架。基于该框架,我们分析了两种主流范式:自洽性和困惑度,并揭示了它们的关键局限:自洽性存在较高的估计误差,而困惑度则存在显著的建模误差,且其估计误差的收敛性可能退化。为了解决这些局限,我们提出了 RPC,这是一种混合方法,通过两个关键组件来落实我们的理论洞见:困惑度一致性和推理剪枝。困惑度一致性结合了自洽性和困惑度的优点,在保持建模误差不变的同时,将估计误差的收敛速度从线性提升到指数级;推理剪枝则通过剔除低概率推理路径来防止退化。在七个基准数据集上的理论分析和实证结果均表明,RPC 在降低推理误差方面具有很强的潜力。值得注意的是,RPC 在达到与自洽性相当的推理性能的同时,不仅提升了置信度的可靠性,还将采样成本降低了 50%。代码和资源可在 https://wnjxyk.github.io/RPC 获取。
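下面给出一个极简的 Python 示意,用来说明摘要中"把自洽性投票与路径概率结合,并剪枝低概率推理路径"的思路;函数名、加权方式与剪枝阈值都是为演示而设的假设,并非论文中 RPC 的精确公式。

```python
import math
from collections import defaultdict

def rpc_style_vote(samples, prune_quantile=0.5):
    """samples: [(answer, avg_logprob), ...],即每条推理路径的最终答案与平均对数概率。
    返回 (普通自洽性多数投票答案, 剪枝并按概率加权后的答案)。仅为思路示意,非论文算法。"""
    # 1) 普通自洽性:对答案做简单多数投票
    counts = defaultdict(int)
    for ans, _ in samples:
        counts[ans] += 1
    sc_answer = max(counts, key=counts.get)

    # 2) 推理剪枝(假设规则):丢弃平均对数概率低于分位数阈值的路径
    logps = sorted(lp for _, lp in samples)
    k = int(len(logps) * prune_quantile)
    threshold = logps[min(k, len(logps) - 1)]
    kept = [(a, lp) for a, lp in samples if lp >= threshold]

    # 3) 困惑度一致性(假设形式):按路径概率 exp(logprob) 对投票加权
    weights = defaultdict(float)
    for ans, lp in kept:
        weights[ans] += math.exp(lp)
    rpc_answer = max(weights, key=weights.get)
    return sc_answer, rpc_answer

if __name__ == "__main__":
    demo = [("42", -0.8), ("42", -1.1), ("41", -0.2), ("42", -2.5), ("41", -0.3)]
    print(rpc_style_vote(demo))
```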
通过核心注意力分解进行高效的长上下文语言模型训练
- 标题: Efficient Long-context Language Model Training by Core Attention Disaggregation
- 作者: Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang
- 日期: 2025-10-20
- ArXiv主页 : https://arxiv.org/abs/2510.18121
- 论文链接 : https://arxiv.org/pdf/2510.18121
英文摘要
We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
中文摘要
我们提出了核心注意力分解(CAD),这是一种通过将核心注意力计算 softmax(QK^T)V 与模型的其余部分解耦、并在单独的设备池上执行,来改进长上下文大语言模型训练的技术。在现有系统中,核心注意力与其他层部署在同一位置;在长上下文长度下,其计算量呈二次增长,而其他组件近似线性增长,这会在数据并行和流水线并行组之间造成负载不均衡和掉队者(straggler)。CAD 的可行性基于两个观察。首先,核心注意力是无状态的:它没有可训练参数,只有极少的瞬态数据,因此负载均衡问题就归结为对计算密集型任务的调度。其次,它是可组合的:现代注意力内核在处理由任意长度的 token 级分片融合而成的批次时仍能保持高效率。CAD 将核心注意力划分为 token 级任务,并将它们分派到专用的注意力服务器,由后者动态地重新组批以均衡计算量,而不牺牲内核效率。我们在名为 DistCA 的系统中实现了 CAD,该系统使用乒乓执行方案使通信与计算完全重叠,并在注意力服务器上就地执行以减少内存占用。在 512 个 H200 GPU 和高达 512k token 的上下文长度上,DistCA 将端到端训练吞吐量提高了至多 1.35 倍,消除了数据并行和流水线并行中的掉队者,并实现了近乎完美的计算与内存均衡。
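摘要中"把核心注意力拆成 token 级任务、再动态重组批次以均衡计算"的负载均衡思想,可以用一个简单的贪心装箱来说明。下面的 Python 草图按估算代价(约与 q_len × kv_len 成正比)把任务分配给若干注意力服务器,属于假设性的示意,并非 DistCA 的实际调度器。

```python
import heapq

def dispatch_attention_tasks(tasks, num_servers):
    """tasks: [(task_id, q_len, kv_len), ...];代价近似取 q_len * kv_len(二次项的主导部分)。
    用最小堆贪心地把任务分给当前累计代价最小的服务器,以均衡计算负载。"""
    # 堆元素: (该服务器累计代价, 服务器编号)
    heap = [(0.0, s) for s in range(num_servers)]
    heapq.heapify(heap)
    assignment = {s: [] for s in range(num_servers)}

    # 先处理大任务,贪心均衡效果更好
    for task_id, q_len, kv_len in sorted(tasks, key=lambda t: t[1] * t[2], reverse=True):
        cost = float(q_len * kv_len)
        load, server = heapq.heappop(heap)
        assignment[server].append(task_id)
        heapq.heappush(heap, (load + cost, server))
    return assignment

if __name__ == "__main__":
    # 假想的 token 级注意力分片:不同序列长度导致代价差异巨大
    tasks = [("seq0", 8192, 8192), ("seq1", 1024, 1024), ("seq2", 4096, 4096), ("seq3", 512, 512)]
    print(dispatch_attention_tasks(tasks, num_servers=2))
```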
每个注意力都很重要:用于长上下文推理的高效混合架构
- 标题: Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
- 作者: Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
- 日期: 2025-10-22
- ArXiv主页 : https://arxiv.org/abs/2510.19338
- gitHub仓库 : https://github.com/inclusionAI/linghe
英文摘要
In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library-linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
中文摘要
在本技术报告中,我们介绍了 Ring-linear 模型系列,具体包括 Ring-mini-linear-2.0 和 Ring-flash-linear-2.0。Ring-mini-linear-2.0 包含 16B 参数和 957M 激活,而 Ring-flash-linear-2.0 包含 104B 参数和 6.1B 激活。两种模型均采用混合架构,有效集成了线性注意力和 softmax 注意力,显著减少了长上下文推理场景中的 I/O 和计算开销。与 320 亿参数的稠密模型相比,该系列将推理成本降低至 1/10;与原始 Ring 系列相比,成本也降低了 50% 以上。此外,通过系统地探索混合架构中不同注意力机制之间的比例,我们确定了当前最优的模型结构。同时,借助我们自主研发的高性能 FP8 算子库 linghe,整体训练效率提升了 50%。受益于训练与推理引擎算子之间的高度一致性,模型可以在强化学习阶段进行长期、稳定、高效的优化,在多个具有挑战性的复杂推理基准上始终保持 SOTA 性能。
LightMem:轻量级、高效的记忆增强生成
- 标题: LightMem: Lightweight and Efficient Memory-Augmented Generation
- 作者: Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang
- 日期: 2025-10-21
- ArXiv主页 : https://arxiv.org/abs/2510.18866
- gitHub仓库 : https://github.com/zjunlp/LightMem
英文摘要
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
中文摘要
尽管大型语言模型(LLM)具有卓越的能力,但它们仍难以在动态、复杂的环境中有效利用历史交互信息。记忆系统通过引入持久化的信息存储、检索和利用机制,使 LLM 能够超越无状态交互。然而,现有的记忆系统往往带来大量的时间和计算开销。为此,我们提出了一种名为 LightMem 的新记忆系统,它在记忆系统的性能和效率之间取得了平衡。受人类记忆的 Atkinson-Shiffrin 模型启发,LightMem 将记忆组织为三个互补的阶段。首先,受认知启发的感觉记忆通过轻量级压缩快速过滤不相关的信息,并根据主题对信息进行分组。接下来,主题感知的短期记忆会整合这些基于主题的分组,对内容进行组织和总结,以实现更结构化的访问。最后,带有睡眠期更新的长期记忆采用离线流程,将记忆整合与在线推理解耦。在以 GPT 和 Qwen 为主干的 LongMemEval 上进行的实验表明,LightMem 在准确率上优于强基线(增益最高 10.9%),同时将 token 使用量减少至多 117 倍、API 调用减少至多 159 倍、运行时间减少超过 12 倍。代码可在 https://github.com/zjunlp/LightMem 获取。
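为了更直观地理解摘要中"感觉记忆过滤压缩、主题化短期记忆、睡眠期离线整合"的三阶段流程,下面给出一个极简的 Python 草图;其中的类名、过滤规则与主题分组方式都是为演示而做的假设,并非 LightMem 的真实实现。

```python
from collections import defaultdict

class ToyMemory:
    """示意 LightMem 风格的三阶段记忆流程(假设性实现)。"""

    def __init__(self, min_len=10):
        self.min_len = min_len               # 感觉记忆阶段的轻量过滤阈值
        self.short_term = defaultdict(list)  # 按主题分组的短期记忆
        self.long_term = {}                  # 睡眠期整合后的长期记忆

    def sense(self, messages):
        """阶段1:过滤过短、信息量低的消息,并按粗粒度主题分组。"""
        for topic, text in messages:
            if len(text) >= self.min_len:
                self.short_term[topic].append(text)

    def consolidate_offline(self):
        """阶段3:离线"睡眠期"整合;这里用简单拼接加截断代替真正的摘要模型。"""
        for topic, texts in self.short_term.items():
            self.long_term[topic] = " / ".join(texts)[:200]
        self.short_term.clear()

    def retrieve(self, topic):
        return self.long_term.get(topic, "")

if __name__ == "__main__":
    mem = ToyMemory()
    mem.sense([("travel", "用户计划十月去东京,偏好靠近地铁的酒店"), ("travel", "ok")])
    mem.consolidate_offline()
    print(mem.retrieve("travel"))
```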
DeepAnalyze:用于自主数据科学的代理大型语言模型
- 标题: DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
- 作者: Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, Xiaoyong Du
- 日期: 2025-10-19
- ArXiv主页 : https://arxiv.org/abs/2510.16872
- 论文链接 : https://arxiv.org/pdf/2510.16872
- 项目链接 : https://ruc-deepanalyze.github.io/
- gitHub仓库 : https://github.com/ruc-datalab/DeepAnalyze
英文摘要
Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-toend pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.
中文摘要
自主数据科学,即从原始数据源到分析师级深度研究报告的全流程,一直是一个长期存在的挑战;随着强大的大型语言模型(LLM)的出现,它如今正变得可行。最近基于工作流的数据代理在特定数据任务上显示出了有希望的结果,但由于依赖预定义的工作流,它们在实现完全自主的数据科学方面仍然受到根本限制。在本文中,我们介绍了 DeepAnalyze-8B,这是第一个专为自主数据科学设计的代理型 LLM,能够自动完成从数据源到分析师级深度研究报告的端到端管道。为了解决高复杂度的数据科学任务,我们提出了一种基于课程学习的代理式训练范式,模拟人类数据科学家的学习轨迹,使 LLM 能够在真实环境中逐步获取并整合多种能力。我们还引入了一个基于数据的轨迹合成框架,用于构建高质量的训练数据。通过代理式训练,DeepAnalyze 学会了执行广泛的数据任务,从数据问答和专业分析任务到开放式数据研究。实验表明,仅使用 8B 参数,DeepAnalyze 的性能就优于此前基于最先进专有 LLM 构建的工作流式代理。DeepAnalyze 的模型、代码和训练数据均已开源,为自主数据科学铺平了道路。
世界中的世界:闭环世界中的世界模型
- 标题: World-in-World: World Models in a Closed-Loop World
- 作者: Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen
- 日期: 2025-10-20
- ArXiv主页 : https://arxiv.org/abs/2510.18135
- 论文链接 : https://arxiv.org/pdf/2510.18135
- 项目链接 : https://world-in-world.github.io/
- gitHub仓库 : https://github.com/World-In-World/world-in-world
英文摘要
Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
中文摘要
生成式世界模型(WM)现在可以模拟具有惊人视觉真实感的世界,这自然引出一个问题:它们能否赋予具身智能体用于决策的预测性感知能力。该问题的进展一直受限于碎片化的评测:大多数现有基准采用开环协议,孤立地强调视觉质量,而没有解决具身效用这一核心问题,即 WM 是否真的能帮助智能体完成具身任务?为弥合这一差距,我们推出了 World-in-World,这是第一个在模拟真实"智能体-环境"交互的闭环世界中对 WM 进行基准测试的开放平台。World-in-World 提供统一的在线规划策略和标准化的动作 API,支持将异构 WM 用于决策。我们构建了四个闭环环境来严格评估各类 WM,将任务成功率作为主要指标,超越了对视觉质量的惯常关注;我们还首次给出了具身场景下世界模型的数据扩展规律。我们的研究揭示了三个出人意料的发现:(1) 仅有视觉质量并不能保证任务成功,可控性更为重要;(2) 用动作-观测数据扩展后训练,比升级预训练视频生成器更有效;(3) 分配更多推理时计算可以显著提升 WM 的闭环性能。
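摘要强调的"闭环评测"可以抽象为:智能体在每一步用世界模型向前推演若干候选动作、选取预测收益最高的动作执行,再以任务成功率而非画质来计分。下面的 Python 草图演示这一评测循环的骨架;其中的环境与世界模型接口都是假设的占位实现,并非 World-in-World 的实际 API。

```python
def rollout_score(world_model, obs, action, horizon=3):
    """用世界模型从当前观测向前想象 horizon 步,返回预测累计奖励(假设接口)。"""
    total, state = 0.0, obs
    for _ in range(horizon):
        state, reward = world_model(state, action)
        total += reward
    return total

def closed_loop_episode(env_step, world_model, init_obs, actions, max_steps=20):
    """闭环评测:每一步用世界模型为候选动作打分,执行最优动作,直到任务结束,返回是否成功。"""
    obs, success = init_obs, False
    for _ in range(max_steps):
        best = max(actions, key=lambda a: rollout_score(world_model, obs, a))
        obs, done, success = env_step(obs, best)   # 真实环境反馈
        if done:
            break
    return success

if __name__ == "__main__":
    # 占位的玩具环境和世界模型:状态是一个数字,目标是把它推到 >= 5
    def toy_env(obs, a):
        nxt = obs + a
        return nxt, nxt >= 5, nxt >= 5
    def toy_wm(state, a):
        return state + a, float(a)   # 这个世界模型"认为"动作越大奖励越高
    print(closed_loop_episode(toy_env, toy_wm, init_obs=0, actions=[0, 1, 2]))
```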
OmniVinci:增强架构与数据以实现全模态理解 LLM
- 标题: OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
- 作者: Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
- 日期: 2025-10-17
- ArXiv主页 : https://arxiv.org/abs/2510.15870
- 论文链接 : https://arxiv.org/pdf/2510.15870
- 项目链接 : https://nvlabs.github.io/OmniVinci/
- gitHub仓库 : https://github.com/NVlabs/OmniVinci
英文摘要
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
中文摘要
推进机器智能需要发展跨多种模态进行感知的能力,就像人类感知世界一样。我们推出 OmniVinci,这是一项旨在构建强大、开源的全模态 LLM 的计划。我们仔细研究了模型架构和数据整理方面的设计选择。在模型架构上,我们提出了三项关键创新:(i) OmniAlignNet,用于在共享的全模态潜在空间中加强视觉与音频嵌入之间的对齐;(ii) 时间嵌入分组,用于捕获视觉与音频信号之间的相对时间对齐;(iii) 约束旋转时间嵌入,用于在全模态嵌入中编码绝对时间信息。我们引入了一条数据整理与合成管道,生成了 2400 万条单模态和全模态对话。我们发现,不同模态在感知和推理上相互增强。我们的模型 OmniVinci 优于 Qwen2.5-Omni:在 DailyOmni(跨模态理解)上高出 19.05,在 MMAR(音频)上高出 1.7,在 Video-MME(视觉)上高出 3.9,而训练仅使用了 0.2T token,相比 Qwen2.5-Omni 的 1.2T 减少为六分之一。我们最后展示了全模态能力在机器人、医疗 AI 和智能工厂等下游应用中的优势。
BAPO:通过自适应裁剪的平衡策略优化稳定 LLM 的离策略强化学习
- 标题: BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
- 作者: Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang
- 日期: 2025-10-21
- ArXiv主页 : https://arxiv.org/abs/2510.18927
- 论文链接 : https://arxiv.org/pdf/2510.18927
- 项目链接 : https://github.com/WooooDyy/BAPO
- gitHub仓库 : https://github.com/WooooDyy/BAPO
英文摘要
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
中文摘要
强化学习(RL)最近已成为对齐和强化大型语言模型(LLM)的核心范式。然而,在离策略(off-policy)设置下应用强化学习(即使用来自过去策略的陈旧数据进行训练)虽然能提高样本效率,但仍然具有挑战性:策略熵急剧下降,优化常常变得不稳定,甚至可能崩溃。通过理论和实证分析,我们得到两个关键洞见:(i) 优化存在不平衡,负优势样本主导了策略梯度,抑制了有用行为并带来梯度爆炸的风险;(ii) 我们推导出的熵-裁剪规则(Entropy-Clip Rule)表明,类 PPO 目标中的固定裁剪机制会系统性地阻止增熵更新,从而以牺牲探索为代价推动策略走向过度利用。基于这些洞见,我们提出了带自适应裁剪的平衡策略优化(BAPO),这是一种简单而有效的方法,可以动态调整裁剪上下界,自适应地重新平衡正负贡献、保持熵并稳定 RL 优化。在包括样本重放和部分 rollout 在内的多种离策略场景中,BAPO 实现了快速、稳定且数据高效的训练。在 AIME 2024 和 AIME 2025 基准上,我们的 7B BAPO 模型超越了 SkyWork-OR1-7B 等开源模型,而我们的 32B BAPO 模型不仅在同规模模型中取得了最先进的结果,还优于 o3-mini 和 Gemini-2.5-Flash-Thinking 等领先的专有系统。
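下面用 PyTorch 写一个带不对称裁剪上下界的 PPO 风格代理损失,来说明摘要中"动态调整裁剪范围、重新平衡正负优势贡献"的方向;其中裁剪界的自适应规则是为演示而设的简化假设,并非 BAPO 论文中的具体算法。

```python
import torch

def clipped_surrogate(logp_new, logp_old, adv, clip_pos=0.3, clip_neg=0.2):
    """PPO 风格的裁剪代理损失,但对正/负优势使用不同的裁剪界(示意性假设)。"""
    ratio = torch.exp(logp_new - logp_old)
    # 正优势 token 用更宽的上界,负优势 token 用更紧的下界,以抑制负梯度主导
    clipped = torch.where(
        adv >= 0,
        torch.clamp(ratio, max=1.0 + clip_pos),
        torch.clamp(ratio, min=1.0 - clip_neg),
    )
    return -(torch.min(ratio * adv, clipped * adv)).mean()

def adapt_clip_bounds(adv, clip_pos, clip_neg, target_pos_frac=0.5, step=0.01):
    """简化的自适应规则(假设):若正优势样本占比偏低,则放宽正向裁剪、收紧负向裁剪。"""
    pos_frac = (adv >= 0).float().mean().item()
    if pos_frac < target_pos_frac:
        return clip_pos + step, max(clip_neg - step, 0.05)
    return max(clip_pos - step, 0.1), clip_neg + step

if __name__ == "__main__":
    torch.manual_seed(0)
    logp_new, logp_old = torch.randn(8) * 0.1, torch.randn(8) * 0.1
    adv = torch.randn(8)
    cp, cn = adapt_clip_bounds(adv, 0.3, 0.2)
    print(clipped_surrogate(logp_new, logp_old, adv, cp, cn))
```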
DeepSeek-OCR:上下文光学压缩
- 标题: DeepSeek-OCR: Contexts Optical Compression
- 作者: Haoran Wei, Yaofeng Sun, Yukun Li
- 日期: 2025-10-21
- ArXiv主页 : https://arxiv.org/abs/2510.18234
- 论文链接 : https://arxiv.org/pdf/2510.18234
英文摘要
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
中文摘要
我们提出 DeepSeek-OCR,作为对"通过光学二维映射压缩长上下文"这一可行性的初步研究。DeepSeek-OCR 由两个组件组成:DeepEncoder,以及作为解码器的 DeepSeek3B-MoE-A570M。具体来说,DeepEncoder 作为核心引擎,旨在在高分辨率输入下保持较低的激活量,同时实现高压缩比,以确保视觉 token 数量处于最优且可控的范围。实验表明,当文本 token 数量不超过视觉 token 数量的 10 倍(即压缩比 < 10 倍)时,该模型可以达到 97% 的解码(OCR)精度;即使在 20 倍的压缩比下,OCR 准确率仍能保持在 60% 左右。这为 LLM 中的历史长上下文压缩和记忆遗忘机制等研究方向展示了可观的前景。除此之外,DeepSeek-OCR 还表现出很高的实用价值。在 OmniDocBench 上,它仅使用 100 个视觉 token 就超越了 GOT-OCR2.0(每页 256 个 token),并在使用少于 800 个视觉 token 的情况下超越了 MinerU2.0(平均每页 6000 多个 token)。在生产环境中,DeepSeek-OCR 每天可以为 LLM/VLM 生成 20 万页以上规模的训练数据(单张 A100-40G)。代码和模型权重可在 http://github.com/deepseek-ai/DeepSeek-OCR 公开获取。
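摘要中的"压缩比"可以做一个简单换算:若一页文档约含 2000 个文本 token,视觉编码器只输出 200 个视觉 token,压缩比即为 10 倍,按摘要报告大致对应约 97% 的解码精度;20 倍压缩则约为 60%。下面的小段 Python 只是把这笔账算出来,其中每页文本 token 数为假设值,精度对应关系取自摘要。

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """压缩比 = 文本 token 数 / 视觉 token 数。"""
    return text_tokens / vision_tokens

if __name__ == "__main__":
    # 假设一页约 2000 个文本 token;摘要给出的对应关系:<10x 约 97%,20x 约 60%
    for vision_tokens, reported_acc in [(200, "约 97%"), (100, "约 60%")]:
        r = compression_ratio(2000, vision_tokens)
        print(f"视觉 token = {vision_tokens:4d} -> 压缩比 {r:.0f}x, 摘要报告的 OCR 精度 {reported_acc}")
```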
每一步都在发展:将强化学习扩展到万亿级思维模型
- 标题: Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
- 作者: Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen
- 日期: 2025-10-21
- ArXiv主页 : https://arxiv.org/abs/2510.18855
- gitHub仓库 : https://github.com/inclusionAI/Ring-V2
英文摘要
We present Ring-1T, the first open-source, state-of-the-art thinking model with a trillion-scale parameter. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
中文摘要
我们推出了 Ring-1T,这是第一个开源、最先进的万亿参数级思维模型。它总参数量为 1 万亿,每个 token 激活约 500 亿参数。在万亿参数规模上训练此类模型带来了前所未有的挑战,包括训练与推理的不一致、rollout 处理效率低下,以及 RL 系统中的瓶颈。为了解决这些问题,我们开创了三项相互关联的创新:(1) IcePop 通过 token 级差异屏蔽和裁剪来稳定 RL 训练,解决训练-推理不匹配带来的不稳定问题;(2) C3PO++ 通过动态划分,提高了 token 预算约束下长 rollout 的资源利用率,从而获得较高的时间效率;(3) ASystem,一个高性能 RL 框架,旨在克服阻碍万亿参数模型训练的系统性瓶颈。Ring-1T 在关键基准测试中取得了突破性的结果:AIME-2025 上 93.4、HMMT-2025 上 86.72、CodeForces 上 2088,以及 ARC-AGI-v1 上 55.94。值得注意的是,它在 IMO-2025 上取得了银牌水平的成绩,凸显了其卓越的推理能力。通过向社区发布完整的 1T 参数 MoE 模型,我们为研究社区提供了直接使用前沿推理能力的途径。这一贡献标志着大规模推理智能民主化的一个重要里程碑,并为开源模型性能确立了新的基准。
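摘要提到的 IcePop 通过"token 级差异屏蔽与裁剪"来缓解训练引擎与推理引擎之间的概率失配。下面的 PyTorch 草图演示这一思路的一种最小实现:对训练端与推理端对数概率差异过大的 token 直接在损失中屏蔽;阈值与具体损失形式均为假设,并非论文原式。

```python
import torch

def masked_policy_loss(logp_train, logp_infer, advantages, max_gap=0.5):
    """屏蔽训练/推理对数概率差异超过 max_gap 的 token,再计算策略梯度损失(示意)。"""
    gap = (logp_train - logp_infer).abs()
    keep = (gap <= max_gap).float()            # 1 = 保留该 token, 0 = 屏蔽
    ratio = torch.exp((logp_train - logp_infer).clamp(-max_gap, max_gap))  # 差异同时做裁剪
    per_token = -(ratio * advantages) * keep
    # 归一化时至少按一个 token 计,避免除零
    return per_token.sum() / keep.sum().clamp(min=1.0)

if __name__ == "__main__":
    torch.manual_seed(0)
    lp_t, lp_i = torch.randn(6) * 0.3, torch.randn(6) * 0.3
    adv = torch.randn(6)
    print(masked_policy_loss(lp_t, lp_i, adv))
```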
人机协作的论文到网页制作,成本不到 0.1 美元
- 标题: Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
- 作者: Qianli Ma, Siyu Wang, Yilin Chen, Yinhao Tang, Yixiang Yang, Chang Guo, Bingjie Gao, Zhening Xing, Yanan Sun, Zhipeng Zhang
- 日期: 2025-10-22
- ArXiv主页 : https://arxiv.org/abs/2510.19600
- 论文链接 : https://arxiv.org/pdf/2510.19600
- 项目链接 : https://mqleet.github.io/AutoPage_ProjectPage
- gitHub仓库 : https://github.com/AutoLab-SAI-SJTU/AutoPage
英文摘要
In the quest for scientific progress, communicating research is as vital as the discovery itself. Yet, researchers are often sidetracked by the manual, repetitive chore of building project webpages to make their dense papers accessible. While automation has tackled static slides and posters, the dynamic, interactive nature of webpages has remained an unaddressed challenge. To bridge this gap, we reframe the problem, arguing that the solution lies not in a single command, but in a collaborative, hierarchical process. We introduce AutoPage, a novel multi-agent system that embodies this philosophy. AutoPage deconstructs paper-to-page creation into a coarse-to-fine pipeline from narrative planning to multimodal content generation and interactive rendering. To combat AI hallucination, dedicated "Checker" agents verify each step against the source paper, while optional human checkpoints ensure the final product aligns perfectly with the author's vision, transforming the system from a mere tool into a powerful collaborative assistant. To rigorously validate our approach, we also construct PageBench, the first benchmark for this new task. Experiments show AutoPage not only generates high-quality, visually appealing pages but does so with remarkable efficiency in under 15 minutes for less than $0.1. Code and dataset will be released at https://mqleet.github.io/AutoPage_ProjectPage/.
中文摘要
在追求科学进步的过程中,交流研究与发现本身一样重要。然而,研究人员经常被手动、重复地构建项目网页这类繁琐工作分散精力,而这些网页正是为了让内容密集的论文易于理解。虽然自动化已经解决了静态幻灯片和海报的问题,但网页的动态、交互特性仍然是一个尚未解决的挑战。为了弥合这一差距,我们重新构建了问题,认为解决方案不在于单一命令,而在于协作式、分层的流程。我们推出 AutoPage,一种体现这一理念的新型多代理系统。AutoPage 将论文到网页的创建解构为从叙事规划到多模态内容生成和交互式渲染的从粗到细的管道。为了对抗人工智能的幻觉,专门的"检查器"代理根据源论文验证每一步,而可选的人工检查点确保最终产品与作者的愿景完美契合,将系统从单纯的工具转变为强大的协作助手。为了严格验证我们的方法,我们还构建了 PageBench,这是这项新任务的第一个基准。实验表明,AutoPage 不仅可以生成高质量、具有视觉吸引力的页面,而且可以在 15 分钟内以低于 0.1 美元的成本高效地生成页面。代码和数据集将在 https://mqleet.github.io/AutoPage_ProjectPage/ 发布。
FineVision:开放数据就是您所需要的
- 标题: FineVision: Open Data Is All You Need
- 作者: Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti
- 日期: 2025-10-20
- ArXiv主页 : https://arxiv.org/abs/2510.17269
- 论文链接 : https://arxiv.org/pdf/2510.17269
- 项目链接 : https://huggingface.co/spaces/HuggingFaceM4/FineVision
英文摘要
The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
中文摘要
视觉语言模型(VLM)的进步受到不一致和受污染的公共数据集碎片化的阻碍。我们推出 FineVision,这是一个精心收集、整理和统一的包含 2400 万个样本的语料库 - 同类中最大的开放资源。我们通过半自动化、人机交互管道将 200 多个来源统一为 185 个子集:自动化执行批量摄取和模式映射,而审阅者审核映射和抽查输出,以验证注释的忠实使用、适当的格式和多样性以及安全性;问题会触发有针对性的修复并重新运行。该工作流程进一步在数据源内部和跨数据源之间应用严格的重复数据删除,并根据 66 个公共基准进行净化。FineVision 还包含具有统一操作空间的代理/GUI 任务;审阅者验证模式并检查轨迹样本以确认可执行的保真度。在 FineVision 上训练的模型在广泛的评估套件中始终优于在现有开放混合物上训练的模型,强调了规模、数据卫生和在人工监督下平衡自动化的好处。我们发布语料库和管理工具来加速以数据为中心的 VLM 研究。
语言模型是内射的,因此是可逆的
- 标题: Language Models are Injective and Hence Invertible
- 作者: Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodola'
- 日期: 2025-10-17
- ArXiv主页 : https://arxiv.org/abs/2510.15511
- 论文链接 : https://arxiv.org/pdf/2510.15511
英文摘要
Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
中文摘要
非线性激活和归一化等 Transformer 组件本质上是非内射的,这表明不同的输入可能映射到相同的输出,并阻止从模型表示中精确恢复输入。在本文中,我们对这一观点提出了挑战。首先,我们从数学上证明,将离散输入序列映射到相应的连续表示序列的 Transformer 语言模型是单射的,因此是无损的,这是在初始化时建立并在训练期间保留的属性。其次,我们通过对六种最先进的语言模型进行数十亿次碰撞测试,凭经验证实了这一结果,并且没有观察到碰撞。第三,我们操作单射性:我们引入 SipIt,这是第一个可证明且有效地从隐藏激活重建精确输入文本的算法,建立线性时间保证并在实践中展示精确的可逆性。总的来说,我们的工作将单射性确立为语言模型的基本且可利用的属性,对透明度、可解释性和安全部署具有直接影响。
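摘要中的"碰撞测试"思路很容易复现:随机采样大量离散输入序列,经由同一个确定性映射得到连续表示,再检查是否存在两个不同输入映射到同一表示。下面的 Python 草图用一个随机初始化的小型映射代替真实语言模型来演示这一检测流程,仅作方法示意,与论文的实验设置无关。

```python
import numpy as np

def collision_test(encode, num_samples=5000, seq_len=8, vocab=50, seed=0):
    """随机采样 token 序列,检查不同输入经 encode 后是否产生相同表示(量化到有限精度)。"""
    rng = np.random.default_rng(seed)
    seen = {}
    collisions = 0
    for _ in range(num_samples):
        tokens = tuple(rng.integers(0, vocab, size=seq_len).tolist())
        rep = np.round(encode(tokens), decimals=6).tobytes()  # 有限精度下的表示指纹
        if rep in seen and seen[rep] != tokens:
            collisions += 1
        seen[rep] = tokens
    return collisions

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    emb = rng.normal(size=(50, 16))          # 玩具"模型":词嵌入加位置加权后求和
    pos = rng.normal(size=(8, 1))
    encode = lambda toks: (emb[list(toks)] * pos).sum(axis=0)
    print("碰撞次数:", collision_test(encode))
```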
Glyph:通过视觉-文本压缩扩展上下文窗口
- 标题: Glyph: Scaling Context Windows via Visual-Text Compression
- 作者: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
- 日期: 2025-10-20
- ArXiv主页 : https://arxiv.org/abs/2510.17800
- 论文链接 : https://arxiv.org/pdf/2510.17800
英文摘要
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
中文摘要
大型语言模型(LLM)越来越依赖长上下文建模来完成文档理解、代码分析和多步推理等任务。然而,将上下文窗口扩展到百万 token 级别会带来高昂的计算和内存成本,限制了长上下文 LLM 的实用性。在这项工作中,我们采取了一个不同的视角,即视觉上下文扩展,来应对这一挑战。我们没有继续扩展基于 token 的序列,而是提出了 Glyph:一个将长文本渲染为图像、并用视觉语言模型(VLM)进行处理的框架。这种方法在保留语义信息的同时大幅压缩了文本输入,我们进一步设计了由 LLM 驱动的遗传搜索,以寻找在准确率与压缩率之间取得平衡的最优视觉渲染配置。大量实验表明,我们的方法实现了 3-4 倍的 token 压缩,同时在各种长上下文基准上保持与 Qwen3-8B 等领先 LLM 相当的精度。这种压缩还使预填充和解码速度提高约 4 倍,SFT 训练速度提高约 2 倍。此外,在极端压缩下,一个 128K 上下文的 VLM 可以扩展到处理 100 万 token 级别的文本任务。此外,渲染后的文本数据也有利于文档理解等现实世界的多模态任务。我们的代码和模型发布于 https://github.com/thu-coai/Glyph。
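"把长文本渲染成图片再交给 VLM"这一步本身很直接,下面用 Pillow 给出一个最小示意:把一段长文本按固定宽度换行绘制到图像上,并粗略估算文本 token 数与视觉 token 数(按 ViT 式 patch 划分)之间的压缩比。字体、分辨率与 patch 大小均为假设参数,与 Glyph 的实际渲染配置无关。

```python
import textwrap
from PIL import Image, ImageDraw

def render_text_to_image(text, width=896, line_chars=64, line_height=18):
    """把长文本逐行绘制到白底图像上,返回 PIL Image(示意用默认字体)。"""
    lines = textwrap.wrap(text, width=line_chars)
    height = max(line_height * (len(lines) + 1), line_height * 2)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((8, 4 + i * line_height), line, fill="black")
    return img

def rough_compression(text, img, patch=14, chars_per_token=4):
    """粗略估算:文本 token ≈ 字符数/4,视觉 token ≈ (W/patch)*(H/patch)。"""
    text_tokens = max(len(text) // chars_per_token, 1)
    vision_tokens = (img.width // patch) * (img.height // patch)
    return text_tokens / vision_tokens

if __name__ == "__main__":
    doc = "long context example. " * 400
    img = render_text_to_image(doc)
    img.save("rendered_page.png")
    print(f"估算压缩比: {rough_compression(doc, img):.2f}x")
```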
UniGenBench++:文本到图像生成的统一语义评估基准
- 标题: UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
- 作者: Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
- 日期: 2025-10-21
- ArXiv主页 : https://arxiv.org/abs/2510.18701
- 论文链接 : https://arxiv.org/pdf/2510.18701
- 项目链接 : https://codegoat24.github.io/UniGenBench/
- gitHub仓库 : https://github.com/CodeGoat24/UniGenBench
英文摘要
Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) spans across diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-sourced T2I models, we systematically reveal their strengths and weaknesses across various aspects.
中文摘要
文本到图像 (T2I) 生成的最新进展强调了可靠基准在评估生成的图像如何准确地反映其文本提示的语义方面的重要性。然而,(1)现有基准缺乏提示场景的多样性和多语言支持,而这对于现实世界的适用性至关重要;(2)仅对主要维度进行粗略评估,涵盖的子维度范围较窄,细粒度的子维度评估不足。为了解决这些限制,我们引入了 UniGenBench++,这是一个用于 T2I 生成的统一语义评估基准。具体来说,它由600个提示组成,分层组织,以确保覆盖范围和效率:(1)跨越不同的现实场景,即5个主要提示主题和20个副主题;(2)全面探讨T2I模型在10个主要和27个子评估标准上的语义一致性,每个评估标准评估多个测试点。为了严格评估模型对语言和提示长度变化的鲁棒性,我们提供了每个提示的英文和中文版本的短形式和长形式。利用闭源多模态大语言模型(MLLM)(即 Gemini-2.5-Pro)的一般世界知识和细粒度图像理解能力,开发了一个有效的管道,用于可靠的基准构建和简化的模型评估。此外,为了进一步促进社区使用,我们训练了一个强大的评估模型,可以对 T2I 模型输出进行离线评估。通过对开源和闭源 T2I 模型进行全面的基准测试,我们系统地揭示了它们在各个方面的优势和劣势。
NANO3D:一种无需训练、无需掩码的高效 3D 编辑方法
- 标题: NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks
- 作者: Junliang Ye, Shenghao Xie, Ruowen Zhao, Zhengyi Wang, Hongyu Yan, Wenqiang Zu, Lei Ma, Jun Zhu
- 日期: 2025-10-16
- ArXiv主页 : https://arxiv.org/abs/2510.15019
- 论文链接 : https://arxiv.org/pdf/2510.15019
- 项目链接 : https://jamesyjl.github.io/Nano3D/
英文摘要
3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing datasets Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page:https://jamesyjl.github.io/Nano3D
中文摘要
3D 对象编辑对于游戏、动画和机器人领域的交互式内容创建至关重要,但当前的方法仍然效率低下、不一致,并且常常无法保留未编辑的区域。大多数方法依赖于编辑多视图渲染,然后进行重建,这会引入伪影并限制实用性。为了应对这些挑战,我们提出了 Nano3D,这是一种无需训练的框架,可以在没有掩模的情况下进行精确且连贯的 3D 对象编辑。Nano3D 将 FlowEdit 集成到 TRELLIS 中,以在前视图渲染的指导下执行本地化编辑,并进一步引入区域感知合并策略 Voxel/Slat-Merge,该策略通过确保编辑和未编辑区域之间的一致性来自适应地保持结构保真度。实验表明,与现有方法相比,Nano3D 实现了卓越的 3D 一致性和视觉质量。基于该框架,我们构建了第一个大规模3D编辑数据集Nano3D-Edit-100k,其中包含超过100,000个高质量3D编辑对。这项工作解决了算法设计和数据可用性方面长期存在的挑战,显着提高了 3D 编辑的通用性和可靠性,并为前馈 3D 编辑模型的开发奠定了基础。项目页面:https://jamesyjl.github.io/Nano3D
PICABench:我们距离真实的图像编辑还有多远?
- 标题: PICABench: How Far Are We from Physically Realistic Image Editing?
- 作者: Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu
- 日期: 2025-10-20
- ArXiv主页 : https://arxiv.org/abs/2510.17681
- 论文链接 : https://arxiv.org/pdf/2510.17681
- 项目链接 : https://picabench.github.io/
- gitHub仓库 : https://github.com/Andrew0613/PICABench
英文摘要
Image editing has achieved remarkable progress recently. Modern editing models could already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimension (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc). We further propose the PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large rooms to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
中文摘要
图像编辑最近取得了显着的进步。现代编辑模型已经可以遵循复杂的指令来操纵原始内容。然而,除了完成编辑指令之外,附带的物理效果也是生成真实感的关键。例如,删除一个对象还应该删除它的阴影、反射以及与附近对象的交互。不幸的是,现有的模型和基准主要关注指令完成,而忽略了这些物理影响。那么,目前我们离物理真实的图像编辑还有多远?为了回答这个问题,我们引入了 PICABench,它系统地评估了大多数常见编辑操作(添加、删除、属性更改等)的八个子维度(跨越光学、力学和状态转换)的物理真实感。我们进一步提出了 PICAEval,这是一种可靠的评估协议,它使用 VLM 作为法官,并针对每个案例、区域级别的人工注释和问题进行评估。除了基准测试之外,我们还通过从视频中学习物理并构建训练数据集 PICA-100K 来探索有效的解决方案。在评估了大多数主流模型后,我们发现物理现实主义仍然是一个具有挑战性的问题,有很大的探索空间。我们希望我们的基准和提出的解决方案可以作为未来工作从简单的内容编辑转向物理一致的现实主义的基础。
LoongRL:长上下文高级推理的强化学习
- 标题: LoongRL:Reinforcement Learning for Advanced Reasoning over Long Contexts
- 作者: Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
- 日期: 2025-10-22
- ArXiv主页 : https://arxiv.org/abs/2510.19363
- 论文链接 : https://arxiv.org/pdf/2510.19363
英文摘要
Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
中文摘要
对长上下文进行推理对于大型语言模型至关重要。虽然强化学习(RL)能通过在思维链中激发"顿悟"时刻来增强短上下文推理,但长上下文推理所需的高级思维模式在很大程度上仍未被探索,而且高难度的 RL 数据也很稀缺。在本文中,我们提出了 LoongRL,一种面向高级长上下文推理的数据驱动 RL 方法。LoongRL 的核心是 KeyChain,这是一种数据合成方法:通过插入 UUID 链,把真正的问题隐藏在大量干扰文档之中,从而将短多跳问答转化为高难度的长上下文任务。解决这些任务需要模型逐步追踪正确的链条、识别真正的问题、检索相关事实并在此基础上推理出正确答案。在 KeyChain 数据上进行 RL 训练会涌现出一种"计划-检索-推理-复核"的推理模式,其泛化能力远超训练长度。在 16K 长度上训练的模型可以有效解决 128K 的任务,而无需付出高昂的全长 RL rollout 成本。在 Qwen2.5-7B 和 14B 上,LoongRL 将长上下文多跳问答准确率分别绝对提升了 23.5% 和 21.1%。由此得到的 LoongRL-14B 达到 74.2 分,可与 o3-mini(74.5)和 DeepSeek-R1(74.9)等大得多的前沿模型相媲美。它还改进了长上下文检索,通过了全部 128K"大海捞针"压力测试,并保留了短上下文推理能力。
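摘要中 KeyChain 的核心是在大量干扰文档中插入一条 UUID 链,把真正的问题"藏"在链条末端,迫使模型逐跳追踪。下面的 Python 草图按这一描述合成一个玩具样例;链长、文档数与文本模板都是为演示而设的假设,并非论文的数据构造细节。

```python
import random
import uuid

def make_keychain_sample(question, num_distractors=20, chain_len=4, seed=0):
    """构造一个 KeyChain 风格样例:文档乱序排列,只有沿 UUID 链逐跳追踪才能找到真问题。"""
    random.seed(seed)
    ids = [str(uuid.uuid4()) for _ in range(chain_len + 1)]
    docs = []
    # 链上的每个文档指向下一个 UUID,最后一个文档携带真正的问题
    for i in range(chain_len):
        docs.append(f"[{ids[i]}] 线索:下一个标识符是 {ids[i + 1]}。")
    docs.append(f"[{ids[-1]}] 真正的问题:{question}")
    # 干扰文档:包含无关 UUID 与无关内容
    for _ in range(num_distractors):
        docs.append(f"[{uuid.uuid4()}] 无关内容:{random.randint(0, 9999)}。")
    random.shuffle(docs)
    prompt = "从标识符 {} 开始,沿线索找到真正的问题并回答。\n\n{}".format(ids[0], "\n".join(docs))
    return prompt

if __name__ == "__main__":
    print(make_keychain_sample("2019 年诺贝尔物理学奖得主是谁?")[:600])
```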
AdaSPEC:高效推测解码器的选择性知识蒸馏
- 标题: AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
- 作者: Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
- 日期: 2025-10-22
- ArXiv主页 : https://arxiv.org/abs/2510.19779
- 论文链接 : https://arxiv.org/pdf/2510.19779
- 项目链接 : https://github.com/yuezhouhu/adaspec
- gitHub仓库 : https://github.com/yuezhouhu/adaspec
英文摘要
Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
中文摘要
推测解码(SD)通过使用一个小型草稿模型生成预测、再由更大的目标模型进行验证,来加速大型语言模型推理。SD 的有效性取决于这两个模型之间的对齐程度,而这通常通过知识蒸馏(KD)来增强。然而,传统的 KD 方法旨在最小化草稿模型与目标模型在所有 token 上的 KL 散度,这一目标与 SD 的真正目标(即最大化 token 接受率)并不一致。因此,受容量限制,草稿模型往往难以完全吸收目标模型的知识,导致性能欠佳。为应对这一挑战,我们提出了 AdaSPEC,一种在 KD 过程中引入选择性 token 过滤的新方法。AdaSPEC 利用一个参考模型来识别并过滤掉难以拟合的 token,从而蒸馏出在较简单 token 上与目标模型对齐得更好的草稿模型。这种方法在不损害生成质量的前提下提高了整体 token 接受率。我们在算术推理、指令遵循、代码和摘要等多种任务上,使用 31M/1.4B 和 350M/2.7B 参数的模型配置评估了 AdaSPEC。结果表明,AdaSPEC 始终优于最先进的 DistillSpec 方法,在所有任务上都实现了更高的接受率(最高提升 15%)。代码已在 https://github.com/yuezhouhu/adaspec 公开。
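摘要中的"选择性知识蒸馏"可以概括为:先用参考模型找出草稿模型难以拟合的 token,把它们从蒸馏损失中剔除,再在其余 token 上对齐草稿与目标模型。下面的 PyTorch 草图演示这一 token 级筛选的损失计算方式;筛选准则(按参考模型交叉熵取较小的一部分)是示意性假设,并非论文的原始判据。

```python
import torch
import torch.nn.functional as F

def selective_kd_loss(draft_logits, target_logits, ref_losses, keep_ratio=0.8, tau=1.0):
    """draft/target_logits: [T, V];ref_losses: [T] 为参考模型对每个 token 的交叉熵。
    只在 ref_losses 较小(较易拟合)的 keep_ratio 比例 token 上做 KL 蒸馏。"""
    T = ref_losses.shape[0]
    k = max(int(T * keep_ratio), 1)
    keep_idx = torch.topk(-ref_losses, k).indices             # 损失最小的 k 个 token
    logp_target = F.log_softmax(target_logits[keep_idx] / tau, dim=-1)
    logp_draft = F.log_softmax(draft_logits[keep_idx] / tau, dim=-1)
    # KL(target || draft):让草稿模型在"可学会"的 token 上贴近目标模型
    return F.kl_div(logp_draft, logp_target, log_target=True, reduction="batchmean") * tau * tau

if __name__ == "__main__":
    torch.manual_seed(0)
    T, V = 16, 100
    print(selective_kd_loss(torch.randn(T, V), torch.randn(T, V), torch.rand(T)))
```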
Open-o3 Video:基于显式时空证据的视频推理
- 标题: Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
- 作者: Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
- 日期: 2025-10-23
- ArXiv主页 : https://arxiv.org/abs/2510.20579
- 论文链接 : https://arxiv.org/pdf/2510.20579
- 项目链接 : https://marinero4972.github.io/projects/Open-o3-Video/
- gitHub仓库 : https://github.com/marinero4972/Open-o3-Video
英文摘要
Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
中文摘要
大多数视频推理模型仅生成文本推理轨迹,而不指示关键证据出现的时间和地点。OpenAI-o3 等最新模型引发了人们对以证据为中心的图像推理的广泛兴趣,但将这种能力扩展到视频更具挑战性,因为它需要跨动态场景的联合时间跟踪和空间定位。我们引入了 Open-o3 Video,一种将显式时空证据集成到视频推理中的非代理框架,并仔细收集训练数据并设计训练策略来应对上述挑战。该模型突出显示了关键时间戳、对象和边界框及其答案,使推理能够基于具体的视觉观察。为了实现此功能,我们首先策划和构建两个高质量的数据集,即用于 SFT 的 STGR-CoT-30k 和用于 RL 的 STGR-RL-36k,并带有精心构建的时间和空间注释,因为大多数现有数据集提供视频的时间跨度或图像的空间框,缺乏统一的时空监督和推理跟踪。然后,我们采用冷启动强化学习策略,并具有多个专门设计的奖励,共同鼓励答案准确性、时间对齐和空间精度。在 V-STAR 基准上,Open-o3 Video 实现了最先进的性能,在 Qwen2.5-VL 基准上,mAM 提高了 14.4%,mLGM 提高了 24.2%。在广泛的视频理解基准测试中也观察到了一致的改进,包括 VideoMME、WorldSense、VideoMMMU 和 TVGBench。除了准确性之外,Open-o3 Video 生成的推理轨迹还为测试时间扩展提供有价值的信号,从而实现置信度验证并提高答案可靠性。
Chem-R:作为化学家学习推理
- 标题: Chem-R: Learning to Reason as a Chemist
- 作者: Weida Wang, Benteng Chen, Di Zhang, Wanhao Liu, Shuchen Pu, Ben Gao, Jin Zeng, Lei Bai, Wanli Ouyang, Xiaoyong Wei, Tianshu Yu, Tianfan Fu, Shuzhou Sun, Jiatong Li, Zifu Wang, Yuqiang Li, Shufei Zhang
- 日期: 2025-10-19
- ArXiv主页 : https://arxiv.org/abs/2510.16880
- gitHub仓库 : https://github.com/davidweidawang/Chem-R
英文摘要
Although large language models (LLMs) have significant potential to advance chemical discovery, current LLMs lack core chemical knowledge, produce unreliable reasoning trajectories, and exhibit suboptimal performance across diverse chemical tasks. To address these challenges, we propose Chem-R, a generalizable Chemical Reasoning model designed to emulate the deliberative processes of chemists. Chem-R is trained through a three-phase framework that progressively builds advanced reasoning capabilities, including: 1) Chemical Foundation Training, which establishes core chemical knowledge. 2) Chemical Reasoning Protocol Distillation, incorporating structured, expert-like reasoning traces to guide systematic and reliable problem solving. 3) Multi-task Group Relative Policy Optimization that optimizes the model for balanced performance across diverse molecular- and reaction-level tasks. This structured pipeline enables Chem-R to achieve state-of-the-art performance on comprehensive benchmarks, surpassing leading large language models, including Gemini-2.5-Pro and DeepSeek-R1, by up to 46% on molecular tasks and 66% on reaction tasks. Meanwhile, Chem-R also consistently outperforms the existing chemical foundation models across both molecular and reaction level tasks. These results highlight Chem-R's robust generalization, interpretability, and potential as a foundation for next-generation AI-driven chemical discovery.
中文摘要
尽管大型语言模型(LLM)在推动化学发现方面具有巨大潜力,但当前的 LLM 缺乏核心化学知识、产生不可靠的推理轨迹,并且在各种化学任务中表现欠佳。为了应对这些挑战,我们提出了 Chem-R,这是一种可泛化的化学推理模型,旨在模拟化学家的思考过程。Chem-R 通过一个三阶段框架进行训练,逐步构建高级推理能力,包括:1) 化学基础训练,建立核心化学知识;2) 化学推理协议蒸馏,引入结构化的、类似专家的推理轨迹,以指导系统而可靠的问题求解;3) 多任务组相对策略优化,优化模型以在多种分子级和反应级任务上取得均衡的性能。这一结构化流程使 Chem-R 在综合基准测试中取得了最先进的性能,在分子任务上最多领先 Gemini-2.5-Pro 和 DeepSeek-R1 等领先大型语言模型 46%,在反应任务上最多领先 66%。同时,Chem-R 在分子和反应级任务上也始终优于现有的化学基础模型。这些结果凸显了 Chem-R 强大的泛化能力、可解释性,以及作为下一代 AI 驱动化学发现基础的潜力。
使用高质量合成数据集扩展基于指令的视频编辑
- 标题: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
- 作者: Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
- 日期: 2025-10-17
- ArXiv主页 : https://arxiv.org/abs/2510.15742
- 论文链接 : https://arxiv.org/pdf/2510.15742
- 项目链接 : https://ezioby.github.io/Ditto_page
- gitHub仓库 : https://github.com/EzioBy/Ditto
英文摘要
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
中文摘要
基于指令的视频编辑有望使内容创作民主化,但其进展却因大规模、高质量训练数据的稀缺而受到严重阻碍。我们推出 Ditto,一个旨在应对这一基本挑战的整体框架。Ditto 的核心是一个新颖的数据生成管道,它将领先图像编辑器的创意多样性与上下文视频生成器融合在一起,克服了现有模型的有限范围。为了使这个过程可行,我们的框架通过采用由时间增强器增强的高效、精炼的模型架构来解决令人望而却步的成本质量权衡,同时减少计算开销并提高时间一致性。最后,为了实现完全的可扩展性,整个管道由智能代理驱动,该智能代理可以制作不同的指令并严格过滤输出,从而确保大规模的质量控制。使用该框架,我们投入了超过 12,000 个 GPU 天来构建 Ditto-1M,这是一个包含 100 万个高保真视频编辑示例的新数据集。我们使用课程学习策略在 Ditto-1M 上训练我们的模型 Editto。结果展示了卓越的指令遵循能力,并在基于指令的视频编辑中建立了新的最先进技术。
RL 让 MLLM 比 SFT 看得更清楚
- 标题: RL makes MLLMs see better than SFT
- 作者: Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo
- 日期: 2025-10-18
- ArXiv主页 : https://arxiv.org/abs/2510.16333
- 论文链接 : https://arxiv.org/pdf/2510.16333
- 项目链接 : https://june-page.github.io/pivot/
英文摘要
A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight-namely, the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/
中文摘要
多模态语言模型(MLLM)研究中的一个主流假设是,鉴于 LLM 主干巨大的参数规模和卓越的能力,MLLM 的性能在很大程度上继承自 LLM 主干。这造成了对视觉编码器理解上的空白,而视觉编码器决定了 MLLM 如何感知图像。最近 MLLM 训练范式从监督微调(SFT)向强化学习(RL)的转变,进一步放大了这一疏忽:严重缺乏关于此类训练如何重塑视觉编码器乃至整个 MLLM 的分析。为了解决这个问题,我们首先研究了训练策略对 MLLM 的影响,其中 RL 在与视觉强相关的 VQA 基准上表现出明显优于 SFT 的优势。受此启发,我们通过从 ImageNet 分类与分割到梯度可视化等多种深入实验,对 MLLM 的视觉编码器进行了关键但尚未被充分探索的分析。我们的结果表明,MLLM 的后训练策略(即 SFT 或 RL)不仅会在 MLLM 下游任务上产生不同的结果,还会从根本上重塑 MLLM 底层的视觉表示。具体来说,本研究的关键发现是:与 SFT 相比,RL 产生了更强且定位更精确的视觉表示,从而提升了视觉编码器服务于 MLLM 的能力。随后,我们将这些发现提炼为一个为 MLLM 构建强视觉编码器的简单方案:偏好指导的视觉优化(PIVOT)。当集成到 MLLM 中时,经 PIVOT 训练的视觉编码器性能优于规模更大、训练更充分的同类编码器,而其计算成本不到标准视觉预训练的 1%。这一结果为改进 MLLM 的视觉主干开辟了一条有效且高效的路径。项目页面位于 https://june-page.github.io/pivot/
扩散语言模型中的注意力汇聚现象
- 标题: Attention Sinks in Diffusion Language Models
- 作者: Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, Alessio Devoto
- 日期: 2025-10-17
- ArXiv主页 : https://arxiv.org/abs/2510.15731
- 论文链接 : https://arxiv.org/pdf/2510.15731
英文摘要
Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.
中文摘要
掩码扩散语言模型 (DLM) 最近成为传统自回归模型 (ARM) 的一种有前景的替代方案。DLM 采用带双向注意力的 Transformer 编码器,能够并行生成 token,同时保持有竞争力的性能。尽管它们的效率和效果已被广泛研究,但支配 DLM 的内部机制在很大程度上仍未被探索。在这项工作中,我们对 DLM 的注意力模式进行了实证分析,重点关注注意力汇聚 (attention sink) 现象,这一效应此前已在多种基于 Transformer 的架构中被观察到。我们的研究结果表明,DLM 同样存在 attention sink,但表现出不同的特征。首先,与 ARM 不同,DLM 中 sink 的位置往往会在整个生成过程中移动,呈现出动态行为。其次,ARM 对移除 attention sink 高度敏感,而 DLM 则保持稳健:屏蔽 sink 只会带来轻微的性能下降。这些结果为基于扩散的语言模型的内部运作提供了新的见解,并凸显了它们与自回归模型在注意力分配和利用方式上的根本差异。
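下面给出一个示意性草图(非论文官方实现,PyTorch 假设代码):从一步去噪得到的注意力权重中检测 attention sink,把"被所有查询平均分配到的注意力质量远超均匀基线"的 key 位置视为 sink;函数名 `find_attention_sinks` 与阈值 `ratio` 均为示例假设。在每个生成步上重复调用该函数,即可观察摘要中描述的 sink 位置随生成过程漂移的现象。

```python
# 示意性草图:基于注意力质量阈值的 attention sink 检测
import torch

def find_attention_sinks(attn: torch.Tensor, ratio: float = 5.0):
    """attn: [num_heads, seq_len, seq_len],已做 softmax 的注意力权重。
    返回被判定为 sink 的 key 位置列表。"""
    per_key_mass = attn.mean(dim=(0, 1))   # 每个 key 收到的平均注意力质量
    uniform = 1.0 / attn.shape[-1]         # 均匀注意力基线
    return (per_key_mass > ratio * uniform).nonzero(as_tuple=True)[0].tolist()

if __name__ == "__main__":
    # 用随机构造的注意力权重演示:人为把第 0 个 key 变成 sink
    heads, seq = 8, 64
    logits = torch.randn(heads, seq, seq)
    logits[:, :, 0] += 4.0                 # 让所有查询都偏向第 0 个位置
    attn = logits.softmax(dim=-1)
    print(find_attention_sinks(attn))      # 预期输出包含 0
```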
没有变分自动编码器的潜在扩散模型
- 标题: Latent Diffusion Model without Variational Autoencoder
- 作者: Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
- 日期: 2025-10-17
- ArXiv主页 : https://arxiv.org/abs/2510.15301
- 论文链接 : https://arxiv.org/pdf/2510.15301
- 项目链接 : https://howlin-wang.github.io/svg/
- gitHub仓库 : https://github.com/shiml20/SVG
英文摘要
Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.
中文摘要
基于扩散的视觉生成的最新进展在很大程度上依赖于带变分自动编码器 (VAE) 的潜在扩散模型。虽然这种 VAE+扩散范式在高保真合成方面有效,但其训练效率有限、推理速度慢,且难以迁移到更广泛的视觉任务。这些问题源于 VAE 潜在空间的一个关键局限:缺乏清晰的语义分离和强判别结构。我们的分析证实,这些性质不仅对感知和理解任务至关重要,对潜在扩散模型的稳定、高效训练同样关键。受此启发,我们提出了 SVG,一种无需变分自动编码器的新型潜在扩散模型,它利用自监督表示进行视觉生成。SVG 通过冻结的 DINO 特征构建具有清晰语义判别性的特征空间,同时用一个轻量级残差分支捕获细粒度细节以实现高保真重建。扩散模型直接在这个语义结构化的潜在空间上训练,从而实现更高效的学习。因此,SVG 能够加速扩散训练、支持少步采样,并提升生成质量。实验结果进一步表明,SVG 保留了底层自监督表示的语义与判别能力,为实现任务通用的高质量视觉表示提供了一条有原则的途径。
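以下是一个示意性草图(非 SVG 官方代码,PyTorch 假设实现):用冻结的自监督特征(此处以占位编码器代替 DINO)拼接一个轻量残差分支,构造摘要中所述"语义结构化"的潜在空间;类名 `SemanticLatentEncoder` 与残差分支的具体结构均为示例假设。

```python
# 示意性草图:冻结语义主干 + 轻量残差分支,拼接成扩散模型的潜在空间
import torch
import torch.nn as nn

class SemanticLatentEncoder(nn.Module):
    def __init__(self, frozen_encoder: nn.Module, feat_dim: int, residual_dim: int):
        super().__init__()
        self.frozen = frozen_encoder
        for p in self.frozen.parameters():        # 冻结语义主干
            p.requires_grad_(False)
        # 轻量残差分支:补充重建所需的细粒度信息(结构为假设)
        self.residual = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, residual_dim, 4, stride=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: [B, 3, H, W] -> 潜变量 [B, N, feat_dim + residual_dim]"""
        with torch.no_grad():
            sem = self.frozen(x)                           # [B, N, feat_dim] 语义特征
        res = self.residual(x).flatten(2).transpose(1, 2)  # [B, N, residual_dim] 细节特征
        return torch.cat([sem, res], dim=-1)               # 扩散模型直接在该空间上训练

if __name__ == "__main__":
    # 占位的"冻结编码器":输出 [B, 196, 384] 的随机特征,真实使用时应替换为 DINO
    class DummyDINO(nn.Module):
        def forward(self, x):
            return torch.randn(x.shape[0], (x.shape[-1] // 16) ** 2, 384)

    enc = SemanticLatentEncoder(DummyDINO(), feat_dim=384, residual_dim=32)
    z = enc(torch.randn(2, 3, 224, 224))
    print(z.shape)   # torch.Size([2, 196, 416])
```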
通过情境学习出现的紧急错位:狭窄的情境示例可能会产生广泛错位的法学硕士
- 标题: Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
- 作者: Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov
- 日期: 2025-10-13
- ArXiv主页 : https://arxiv.org/abs/2510.11288
- 论文链接 : https://arxiv.org/pdf/2510.11288
英文摘要
Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous ''persona'', echoing prior results on finetuning-induced EM.
中文摘要
最近的研究表明,窄领域的微调可能会产生广泛失准的 LLM,这一现象被称为涌现性错位 (emergent misalignment, EM)。这些发现虽然令人担忧,但仅限于微调和激活引导,没有涉及上下文学习 (ICL)。因此我们提出问题:EM 是否也会在 ICL 中出现?我们发现确实如此:在三个数据集上,给定 64 个窄领域上下文示例时,三个前沿模型产生广泛错位响应的比例在 2% 到 17% 之间;使用 256 个示例时则高达 58%。我们还通过引出逐步推理(同时保持上下文示例不变)来检查 EM 的机制。对由此得到的思维链进行人工分析表明,67.5% 的错位推理轨迹通过采用鲁莽或危险的"人设"来明确地为有害输出辩护,这与此前关于微调诱发 EM 的结果相呼应。
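作为补充,下面是一个高度简化的示意性流程(非论文官方评测代码):把 k 个窄领域示例拼成上下文,再向模型提出与上下文无关的查询,统计被判为错位的响应比例;其中 `model_generate` 与 `judge_misaligned` 均为占位假设,实际需接入被测 LLM 与安全判别流程或人工标注。

```python
# 示意性草图:窄领域 ICL 示例 -> 无关查询 -> 统计错位响应比例
import random

def build_icl_prompt(narrow_examples: list, query: str) -> str:
    """把 k 个窄领域示例 (question, answer) 拼成上下文,再接一个与之无关的查询。"""
    blocks = [f"User: {q}\nAssistant: {a}" for q, a in narrow_examples]
    return "\n\n".join(blocks + [f"User: {query}\nAssistant:"])

def misalignment_rate(model_generate, judge_misaligned, examples, queries,
                      k: int = 64, trials: int = 100) -> float:
    hits = 0
    for _ in range(trials):
        prompt = build_icl_prompt(random.sample(examples, k), random.choice(queries))
        hits += int(judge_misaligned(model_generate(prompt)))   # 判别响应是否错位
    return hits / trials

if __name__ == "__main__":
    demo_examples = [(f"窄领域问题{i}", f"窄领域回答{i}") for i in range(128)]
    rate = misalignment_rate(
        model_generate=lambda p: "stub response",   # 占位:实际应调用被测 LLM
        judge_misaligned=lambda r: False,           # 占位:实际应为安全判别器
        examples=demo_examples, queries=["一个与上下文无关的普通问题"], k=64, trials=10)
    print(rate)
```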
GigaBrain-0:世界模型驱动的视觉-语言-动作模型
- 标题: GigaBrain-0: A World Model-Powered Vision-Language-Action Model
- 作者: GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu
- 日期: 2025-10-22
- ArXiv主页 : https://arxiv.org/abs/2510.19430
- 论文链接 : https://arxiv.org/pdf/2510.19430
- 项目链接 : https://gigabrain0.github.io/
- gitHub仓库 : https://github.com/open-gigaai/giga-brain-0
英文摘要
Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability, and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
中文摘要
为通用机器人训练视觉-语言-动作 (VLA) 模型通常需要大规模真实世界机器人数据,而这些数据的采集既昂贵又耗时。物理数据采集的低效严重限制了当前 VLA 系统的可扩展性和泛化能力。为了应对这一挑战,我们推出了 GigaBrain-0,一种由世界模型生成数据(例如视频生成、real2real 迁移、人类动作迁移、视角迁移、sim2real 迁移数据)赋能的新型 VLA 基础模型。通过利用世界模型大规模生成多样化数据,GigaBrain-0 显著减少了对真实机器人数据的依赖,同时提升了跨任务泛化能力。我们的方法还通过 RGBD 输入建模和具身思维链 (CoT) 监督进一步提高策略鲁棒性,使模型能够在任务执行过程中推理空间几何、物体状态和长时程依赖关系。这在灵巧操作、长时程任务和移动操作等真实场景中带来了显著的性能提升。大量实验表明,GigaBrain-0 在外观(例如纹理、颜色)、物体摆放和相机视角变化下均表现出卓越的泛化能力。此外,我们还推出了 GigaBrain-0-Small,这是一个经过优化的轻量级变体,旨在 NVIDIA Jetson AGX Orin 等设备上高效运行。
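下面给出一个极简的示意性草图(非 GigaBrain-0 官方实现,模型结构与损失权重均为假设):仅用于说明 RGBD 输入可以在 token 维上拼接,具身思维链 (CoT) 预测与动作预测可以联合监督。

```python
# 示意性草图:RGB 与深度分别 patch 化后拼接,CoT token 与动作联合监督
import torch
import torch.nn as nn

class TinyRGBDVLA(nn.Module):
    def __init__(self, d=128, vocab=1000, action_dim=7):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, d, 16, stride=16)     # RGB patch 化
        self.depth_proj = nn.Conv2d(1, d, 16, stride=16)   # 深度 patch 化
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.cot_head = nn.Linear(d, vocab)          # 预测具身思维链 token
        self.action_head = nn.Linear(d, action_dim)  # 预测动作

    def forward(self, rgb, depth):
        tok = lambda t: t.flatten(2).transpose(1, 2)        # [B, C, H, W] -> [B, N, C]
        x = torch.cat([tok(self.rgb_proj(rgb)), tok(self.depth_proj(depth))], dim=1)
        h = self.backbone(x)
        return self.cot_head(h), self.action_head(h.mean(dim=1))

if __name__ == "__main__":
    model = TinyRGBDVLA()
    rgb, depth = torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64)
    cot_logits, action = model(rgb, depth)
    # 联合损失:CoT 的逐 token 交叉熵 + 动作回归(监督目标此处用随机张量占位)
    cot_target = torch.randint(0, 1000, cot_logits.shape[:2])
    loss = nn.functional.cross_entropy(cot_logits.transpose(1, 2), cot_target) \
         + nn.functional.mse_loss(action, torch.randn(2, 7))
    print(loss.item())
```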
Skyfall-GS:从卫星图像合成沉浸式 3D 城市场景
- 标题: Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
- 作者: Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, Yu-Lun Liu
- 日期: 2025-10-17
- ArXiv主页 : https://arxiv.org/abs/2510.15869
- 论文链接 : https://arxiv.org/pdf/2510.15869
- 项目链接 : https://skyfall-gs.jayinnn.dev/
- gitHub仓库 : https://github.com/jayin92/skyfall-gs
英文摘要
Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing the readily available satellite imagery that supplies realistic coarse geometry and the open-domain diffusion model for creating high-quality close-up appearances. We propose Skyfall-GS, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: https://skyfall-gs.jayinnn.dev/
中文摘要
合成大规模、可探索且几何精确的 3D 城市场景,对于提供沉浸式和具身应用而言是一项具有挑战性但很有价值的任务。挑战在于缺乏用于训练可泛化生成模型的大规模、高质量真实世界 3D 扫描。在本文中,我们另辟蹊径:将可提供真实粗粒度几何的现成卫星影像,与用于生成高质量近景外观的开放域扩散模型协同起来,以创建大规模 3D 场景。我们提出了 Skyfall-GS,这是第一个无需昂贵 3D 标注、支持实时沉浸式 3D 探索的城市街区尺度 3D 场景创建框架。我们设计了一种课程驱动的迭代细化策略,以逐步增强几何完整性和照片级真实的纹理。大量实验表明,与最先进的方法相比,Skyfall-GS 提供了更好的跨视角一致几何和更真实的纹理。项目页面:https://skyfall-gs.jayinnn.dev/
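下面是一个示意性伪代码草图(非 Skyfall-GS 官方实现):勾勒"卫星影像初始化粗几何 → 渲染 → 扩散模型细化外观 → 回流再优化 3DGS"的课程式迭代闭环;`render`、`diffusion_refine`、`optimize_gaussians`、`sample_cameras` 等函数名与课程调度方式均为示例假设。

```python
# 示意性伪代码:课程驱动的迭代细化闭环
def curriculum_refine(scene, diffusion_refine, render, optimize_gaussians,
                      sample_cameras, num_rounds=3):
    for round_idx in range(num_rounds):
        # 课程:视角逐轮从高空俯视降到接近街景的低空视角(具体调度为假设)
        cameras = sample_cameras(altitude_level=num_rounds - round_idx)
        refined_views = []
        for cam in cameras:
            coarse = render(scene, cam)                            # 当前场景的粗渲染
            refined_views.append((cam, diffusion_refine(coarse)))  # 扩散模型补全细节
        scene = optimize_gaussians(scene, refined_views)           # 用细化图像再优化 3DGS
    return scene

if __name__ == "__main__":
    # 用占位函数跑通流程,实际应替换为 3DGS 渲染器与扩散细化器
    final = curriculum_refine(
        scene={"gaussians": 0},
        diffusion_refine=lambda img: img,
        render=lambda scene, cam: f"render@{cam}",
        optimize_gaussians=lambda scene, views: scene,
        sample_cameras=lambda altitude_level: [f"cam_L{altitude_level}_{i}" for i in range(2)])
    print(final)
```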
HoloCine:电影多镜头长视频叙事的整体生成
- 标题: HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
- 作者: Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu
- 日期: 2025-10-23
- ArXiv主页 : https://arxiv.org/abs/2510.20822
- 论文链接 : https://arxiv.org/pdf/2510.20822
- 项目链接 : https://holo-cine.github.io/holocine.mp4
- gitHub仓库 : https://github.com/yihao-meng/HoloCine
英文摘要
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
中文摘要
最先进的文本到视频模型擅长生成孤立的片段,却难以创作连贯的多镜头叙事,而后者正是讲故事的本质。我们用 HoloCine 弥合这一"叙事鸿沟":该模型以整体方式生成整个场景,确保从第一个镜头到最后一个镜头的全局一致性。我们的架构通过窗口交叉注意力 (Window Cross-Attention) 机制实现精确的导演级控制,将文本提示定位到特定镜头;而稀疏的镜头间自注意力模式(镜头内稠密、镜头间稀疏)则保证了分钟级生成所需的效率。除了在叙事连贯性上树立新的最先进水平之外,HoloCine 还展现出引人注目的涌现能力:对角色和场景的持久记忆,以及对电影手法的直观把握。我们的工作标志着从片段合成向自动化电影制作的关键转变,使端到端的电影创作成为可触及的未来。我们的代码位于:https://holo-cine.github.io/。
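下面用一个示意性草图(非 HoloCine 官方实现)说明"镜头内稠密、镜头间稀疏"的自注意力掩码可以如何构造;这里假设跨镜头时只允许关注其他镜头的前 k 个 token,该稀疏模式与函数名 `sparse_inter_shot_mask` 均为示例假设。

```python
# 示意性草图:构造镜头内稠密、镜头间稀疏的布尔注意力掩码
import torch

def sparse_inter_shot_mask(shot_lengths: list, k: int = 4) -> torch.Tensor:
    """shot_lengths: 各镜头的 token 数;返回 [N, N] 布尔掩码,True 表示允许注意。"""
    N = sum(shot_lengths)
    mask = torch.zeros(N, N, dtype=torch.bool)
    starts = torch.tensor([0] + shot_lengths[:-1]).cumsum(0).tolist()
    for s, L in zip(starts, shot_lengths):
        mask[s:s + L, s:s + L] = True            # 镜头内:稠密注意
        mask[:, s:s + min(k, L)] = True          # 镜头间:只暴露每个镜头的前 k 个 token
    return mask

if __name__ == "__main__":
    m = sparse_inter_shot_mask([6, 5, 7], k=2)
    print(m.shape, m.float().mean().item())      # 稀疏率随镜头数增加而提高
```

摘要中的窗口交叉注意力则是另一条轴:把文本提示按镜头切分后,只与对应镜头的视觉 token 做交叉注意,这里不再展开。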
MoGA:用于端到端长视频生成的混合组注意力
- 标题: MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
- 作者: Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao
- 日期: 2025-10-21
- ArXiv主页 : https://arxiv.org/abs/2510.18692
- 论文链接 : https://arxiv.org/pdf/2510.18692
- 项目链接 : https://jiawn-creator.github.io/mixture-of-groups-attention/
- gitHub仓库 : https://github.com/bytedance-fanqie-ai/MoGA
英文摘要
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
中文摘要
使用扩散 Transformer (DiT) 生成长视频的瓶颈在于全注意力的计算量随序列长度二次增长。由于注意力高度冗余,输出实际上由一小部分查询-键 (query-key) 对主导。现有的稀疏方法依赖分块的粗粒度估计,其精度-效率权衡受到块大小的限制。本文提出混合组注意力 (Mixture-of-Groups Attention, MoGA),这是一种高效的稀疏注意力机制,它使用轻量级、可学习的 token 路由器来精确匹配 token,而无需分块估计。通过语义感知的路由,MoGA 能够实现有效的长程交互。作为一种无需定制内核 (kernel-free) 的方法,MoGA 可以与现代注意力技术栈(包括 FlashAttention 和序列并行)无缝集成。在 MoGA 的基础上,我们开发了一个高效的长视频生成模型,能够端到端地生成分钟级、多镜头、24 fps 的 480p 视频,上下文长度约为 580k。在各种视频生成任务上的全面实验验证了我们方法的有效性。
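下面是一个示意性草图(非 MoGA 官方实现):一个轻量可学习路由器把 token 硬指派到若干组,注意力只在组内计算;组数、argmax 硬指派等细节均为示例假设(真实训练还需可微的路由方案与高效的批处理实现),仅用于说明"按语义分组后组内注意"的思路。

```python
# 示意性草图:token 路由器分组 + 组内注意力(逐样本、非高效实现)
import torch
import torch.nn as nn

class MixtureOfGroupsAttention(nn.Module):
    def __init__(self, dim: int, num_groups: int, num_heads: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_groups)   # 可学习 token 路由器
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_groups = num_groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: [B, N, D];为简洁起见逐样本、逐组处理。"""
        group_id = self.router(x).argmax(dim=-1)   # [B, N] 每个 token 的组号(硬指派)
        out = torch.zeros_like(x)
        for b in range(x.shape[0]):
            for g in range(self.num_groups):
                idx = (group_id[b] == g).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                tokens = x[b, idx].unsqueeze(0)                   # 该组内的 token
                attn_out, _ = self.attn(tokens, tokens, tokens)   # 组内全注意力
                out[b, idx] = attn_out.squeeze(0)
        return out

if __name__ == "__main__":
    moga = MixtureOfGroupsAttention(dim=64, num_groups=4)
    y = moga(torch.randn(2, 128, 64))
    print(y.shape)   # torch.Size([2, 128, 64])
```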