中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- Wan-Move:通过潜在轨迹引导生成运动可控视频
- T-pro 2.0:高效的俄语混合推理模型与 Playground
- Visionary:基于 WebGPU 驱动的 Gaussian Splatting 平台构建的世界模型载体
- Native Parallel Reasoner:通过自蒸馏强化学习进行并行推理
- TwinFlow:利用自对抗流实现大型模型的一步生成
- StereoWorld:几何感知单目到立体视频生成
- 超越实部:面向长上下文 LLM 的旋转位置嵌入虚部扩展
- OmniPSD:使用扩散Transformer生成分层 PSD
- 我们准备好在文本转 3D 生成中使用强化学习了吗?渐进式研究
- 用于解决奥林匹克级数学问题的长程推理智能体
- 使用 Temporal Reasoner 进行统一视频编辑
- OneStory:具有自适应内存的连贯多镜头视频生成
- Voxify3D:像素艺术与体积渲染的结合
- EditThinker:解锁任何图像编辑器的迭代推理
- BrainExplore:大规模发现人脑中可解释的视觉表征
- 关于预训练、中期训练与强化学习在推理语言模型上的相互作用
- BEAVER:高效的确定性 LLM 验证器
- OPV:基于结果的过程验证器,用于高效的长思维链验证
- 通过复杂性提升强化学习实现奥林匹克级几何大型语言模型智能体
- DeepCode:开放代理编码
- MoCapAnything:对单目视频中的任意骨骼进行统一 3D 动作捕捉
- 分布匹配变分自动编码器
- 扩展零样本参考到视频生成
- EgoEdit:以自我为中心的视频编辑的数据集、实时流模型和基准
- 通过概念提示绑定从图像和视频组成概念
- 从模仿到判别:迈向增强跨领域推理任务的通用课程优势机制
Wan-Move:通过潜在轨迹引导生成运动可控视频
- 标题: Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
- 作者: Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, Yujiu Yang
- 日期: 2025-12-09
- ArXiv主页 : https://arxiv.org/abs/2512.08765
- 论文链接 : https://arxiv.org/pdf/2512.08765
- 项目链接 : https://wan-move.github.io/
- gitHub仓库 : https://github.com/ali-vilab/Wan-Move
英文摘要
We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame's features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro's commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move's superior motion quality. Code, models, and benchmark data are made publicly available.
中文摘要
我们推出了 Wan-Move,这是一个简单且可扩展的框架,可为视频生成模型带来运动控制。现有的运动可控方法通常存在控制粒度粗、可扩展性有限的问题,导致其输出难以满足实际使用需求。我们通过实现精确且高质量的运动控制来缩小这一差距。我们的核心思想是直接让原始条件特征具备运动感知能力,以指导视频合成。为此,我们首先用稠密点轨迹表示物体运动,从而实现对场景的细粒度控制。然后,我们将这些轨迹投影到潜在空间,并沿每条轨迹传播首帧特征,得到一个对齐的时空特征图,指明每个场景元素应如何运动。该特征图作为更新后的潜在条件,可以自然地集成到现成的图像到视频模型(例如 Wan-I2V-14B)中作为运动引导,无需任何架构改动。它消除了对辅助运动编码器的需求,并使基础模型的微调易于扩展。经过规模化训练,Wan-Move 能生成 5 秒、480p 的视频,用户研究表明其运动可控性可与 Kling 1.5 Pro 的商用 Motion Brush 相媲美。为了支持全面评估,我们进一步设计了 MoveBench,这是一个经过严格整理的基准,涵盖多样的内容类别并采用混合验证的标注,其特点是数据量更大、视频时长更长、运动标注质量高。在 MoveBench 和公开数据集上的大量实验一致表明 Wan-Move 具有更优的运动质量。代码、模型和基准数据均已公开。
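按照摘要描述的思路(沿稠密点轨迹传播首帧潜在特征,得到逐帧对齐的时空特征图),下面给出一个极简的示意实现;其中张量形状、坐标约定等均为假设,并非官方代码:

```python
import torch

def propagate_first_frame_features(f0, trajs):
    """沿稠密点轨迹传播首帧特征,得到逐帧对齐的时空特征图(示意实现)。
    f0:    [C, H, W]    首帧的潜在特征
    trajs: [N, T, 2]    N 条轨迹在 T 帧上的 (x, y) 潜空间坐标(这里假设为整数网格坐标)
    返回:  [T, C, H, W] 第 t 帧特征图:把每个轨迹点在首帧处的特征搬到它第 t 帧的位置
    """
    C, H, W = f0.shape
    N, T, _ = trajs.shape
    out = torch.zeros(T, C, H, W, dtype=f0.dtype)
    x0 = trajs[:, 0, 0].long().clamp(0, W - 1)
    y0 = trajs[:, 0, 1].long().clamp(0, H - 1)
    point_feat = f0[:, y0, x0]                      # [C, N] 在轨迹起点采样首帧特征
    for t in range(T):
        xt = trajs[:, t, 0].long().clamp(0, W - 1)
        yt = trajs[:, t, 1].long().clamp(0, H - 1)
        out[t, :, yt, xt] = point_feat              # 同一特征写入第 t 帧的新位置
    return out
```

得到的特征图即可作为更新后的潜在条件,替换原有的首帧条件输入到现成的图像到视频模型中。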
T-pro 2.0:高效的俄语混合推理模型与 Playground
- 标题: T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
- 作者: Dmitrii Stoianov, Danil Taranets, Olga Tsymboi, Ramil Latypov, Almaz Dautov, Vladislav Kruglikov, Nikita Surkov, German Abramov, Pavel Gein, Dmitry Abulkhanov, Mikhail Gashkov, Viktor Zelenkovskiy, Artem Batalov, Aleksandr Medvedev, Anatolii Potapov
- 日期: 2025-12-11
- ArXiv主页 : https://arxiv.org/abs/2512.10430
- 论文链接 : https://arxiv.org/pdf/2512.10430
- 项目链接 : https://t-pro-2-0.streamlit.app
英文摘要
We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.
中文摘要
我们推出 T-pro 2.0,这是一个开放权重的俄语 LLM,面向混合推理与高效推断。该模型支持直接作答和推理轨迹生成,采用西里尔字母密集分词器和经过适配的 EAGLE 推测解码流水线来降低延迟。为了实现可复现、可扩展的研究,我们在 Hugging Face 上发布了模型权重、T-Wix 500k 指令语料库、T-Math 推理基准以及 EAGLE 权重。这些资源便于用户研究俄语推理,并扩展或改造模型与推断流水线。公开的网页演示展示了推理与非推理两种模式,并说明了我们的推断栈在各领域取得的加速。因此,T-pro 2.0 是一个易于获取的开放系统,可用于构建和评估高效、实用的俄语 LLM 应用。
Visionary:基于 WebGPU 驱动的 Gaussian Splatting 平台构建的世界模型载体
- 标题: Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform
- 作者: Yuning Gong, Yifei Liu, Yifan Zhan, Muyao Niu, Xueying Li, Yuanjun Liao, Jiaming Chen, Yuanyuan Gao, Jiaqi Chen, Minming Chen, Li Zhou, Yuning Zhang, Wei Wang, Xiaoqing Hou, Huaxi Huang, Shixiang Tang, Le Ma, Dingwen Zhang, Xue Yang, Junchi Yan, Yanchi Zhang, Yinqiang Zheng, Xiao Sun, Zhihang Zhong
- 日期: 2025-12-09
- ArXiv主页 : https://arxiv.org/abs/2512.08478
- 论文链接 : https://arxiv.org/pdf/2512.08478
- 项目链接 : https://visionary-laboratory.github.io/visionary/
- gitHub仓库 : https://github.com/Visionary-Laboratory/visionary
英文摘要
Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time various Gaussian Splatting and meshes rendering. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click-to-run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.
中文摘要
神经渲染,特别是 3D 高斯泼溅 (3DGS),发展迅速,已成为构建世界模型的关键组件。然而,现有的查看器方案仍然分散、笨重或受制于遗留管线,导致部署摩擦较大,并且对动态内容和生成模型的支持有限。在这项工作中,我们提出了 Visionary,一个开放的、Web 原生的平台,用于对多种高斯泼溅表示和网格进行实时渲染。Visionary 基于高效的 WebGPU 渲染器和逐帧 ONNX 推理构建,可实现动态神经处理,同时保持轻量级的"即点即用"浏览器体验。它引入了标准化的 Gaussian Generator 接口约定,不仅支持标准的 3DGS 渲染,还允许即插即用的算法逐帧生成或更新高斯。这种逐帧推理还使我们能够应用前馈式的生成后处理。该平台还提供了一个可插入 three.js 的库,具有简洁的 TypeScript API,可无缝集成到现有的 Web 应用程序中。实验表明,在相同的 3DGS 资源下,得益于基于 GPU 的图元排序,Visionary 比现有 Web 查看器实现了更高的渲染效率。它已经支持多种变体,包括基于 MLP 的 3DGS、4DGS、神经化身以及风格转换或增强网络。通过直接在浏览器中统一推理与渲染,Visionary 显著降低了 3DGS 系列方法的复现、比较和部署门槛,成为同时服务于重建和生成范式的统一世界模型载体。
Native Parallel Reasoner:通过自蒸馏强化学习进行并行推理
- 标题: Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
- 作者: Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng
- 日期: 2025-12-08
- ArXiv主页 : https://arxiv.org/abs/2512.07461
- 论文链接 : https://arxiv.org/pdf/2512.07461
- 项目链接 : https://bigai-nlco.github.io/Native-Parallel-Reasoner
- gitHub仓库 : https://github.com/bigai-nlco/Native-Parallel-Reasoner
英文摘要
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from ``cold-start'' format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
中文摘要
我们引入了 Native Parallel Reasoner (NPR),这是一个无需教师的框架,使大型语言模型 (LLM) 能够自我进化出真正的并行推理能力。NPR 通过三个关键创新将模型从顺序模拟转变为原生并行认知:1)自蒸馏的渐进式训练范式,在无外部监督的情况下从"冷启动"格式发现过渡到严格的拓扑约束;2)一种新颖的并行感知策略优化 (PAPO) 算法,直接在执行图中优化分支策略,使模型能够通过试错学习自适应分解;3)强大的 NPR 引擎,重构了 SGLang 的内存管理与流控制,以实现稳定的大规模并行 RL 训练。在八个推理基准测试中,基于 Qwen3-4B 训练的 NPR 实现了高达 24.5% 的性能提升和高达 4.6 倍的推理加速。与常常退回到自回归解码的先前基线不同,NPR 展示了 100% 的真正并行执行,为自我进化、高效且可扩展的智能体推理确立了新标准。
TwinFlow:利用自对抗流实现大型模型的一步生成
- 标题: TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows
- 作者: Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin
- 日期: 2025-12-03
- ArXiv主页 : https://arxiv.org/abs/2512.05150
- 论文链接 : https://arxiv.org/pdf/2512.05150
- 项目链接 : https://zhenglin-cheng.com/twinflow
- gitHub仓库 : https://github.com/inclusionAI/TwinFlow
英文摘要
Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 Number of Function Evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need of fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B and transform it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by 100times with minor quality degradation. Project page is available at https://zhenglin-cheng.com/twinflow.
中文摘要
大型多模态生成模型的最新进展在多模态生成(包括图像和视频生成)方面展现了令人印象深刻的能力。这些模型通常建立在扩散和流匹配等多步框架之上,这本质上限制了它们的推理效率(需要 40-100 次函数评估 (NFE))。虽然各种少步方法旨在加速推理,但现有方案存在明显局限。主流的基于蒸馏的方法,例如渐进式蒸馏和一致性蒸馏,要么需要迭代的蒸馏流程,要么在极少步数 (< 4-NFE) 下出现显著的质量下降。同时,将对抗训练引入蒸馏(例如 DMD/DMD2 和 SANA-Sprint)以提升性能,会因额外训练的辅助模型带来训练不稳定、复杂性增加和较高的 GPU 显存开销。为此,我们提出了 TwinFlow,这是一个简单而有效的单步生成模型训练框架,无需固定的预训练教师模型,也避免了训练期间的标准对抗网络,使其非常适合构建大规模、高效的模型。在文本到图像任务中,我们的方法在 1-NFE 下取得了 0.83 的 GenEval 分数,优于 SANA-Sprint(基于 GAN 损失的框架)和 RCGM(基于一致性的框架)等强基线。值得注意的是,我们通过在 Qwen-Image-20B 上进行全参数训练来展示 TwinFlow 的可扩展性,并将其转变为高效的少步生成器。仅使用 1-NFE,我们的方法就能在 GenEval 和 DPG-Bench 基准上匹配原始 100-NFE 模型的性能,以轻微的质量损失将计算成本降低 100 倍。项目页面位于 https://zhenglin-cheng.com/twinflow。
StereoWorld:几何感知单目到立体视频生成
- 标题: StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
- 作者: Ke Xing, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Xiaojie Jin, Yao Zhao, Yunchao Wei
- 日期: 2025-12-10
- ArXiv主页 : https://arxiv.org/abs/2512.09363
- 论文链接 : https://arxiv.org/pdf/2512.09363
- 项目链接 : https://ke-xing.github.io/StereoWorld/
英文摘要
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
中文摘要
XR 设备的日益普及带来了对高质量立体视频的强劲需求,但其制作成本仍然很高且容易出现伪影。为了应对这一挑战,我们提出了 StereoWorld,这是一个端到端框架,它重新利用预训练的视频生成器来实现高保真的单目到立体视频生成。我们的框架以单目视频输入作为联合条件,同时通过几何感知正则化对生成过程进行显式监督,以确保 3D 结构保真度。我们进一步集成了时空分块方案,以实现高效的高分辨率合成。为了支持大规模训练和评估,我们整理了一个高清立体视频数据集,包含超过 1100 万帧,并与自然的人类瞳距 (IPD) 对齐。大量实验表明,StereoWorld 大幅优于现有方法,可生成具有更高视觉保真度和几何一致性的立体视频。项目网页位于 https://ke-xing.github.io/StereoWorld/。
超越实部:面向长上下文 LLM 的旋转位置嵌入虚部扩展
- 标题: Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
- 作者: Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu
- 日期: 2025-12-08
- ArXiv主页 : https://arxiv.org/abs/2512.07525
- gitHub仓库 : https://github.com/OpenMOSS/rope_pp
英文摘要
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.
中文摘要
旋转位置嵌入 (RoPE) 通过在复平面中对查询 (query) 和键 (key) 向量施加旋转,已成为大型语言模型 (LLM) 中编码序列顺序的标准做法。然而,标准实现仅利用复值点积的实部来计算注意力分数。这种简化丢弃了虚部,而虚部包含有价值的相位信息,可能导致对建模长上下文依赖至关重要的关系细节丢失。在本文中,我们提出了一种扩展,重新引入这一被丢弃的虚部。我们的方法利用完整的复值表示来构造双分量注意力分数。我们从理论和实验上证明,这种方法通过保留更多位置信息来增强长上下文依赖的建模。此外,在一组长上下文语言建模基准上的评估表明,我们的方法相对标准 RoPE 带来一致的性能提升,且随着上下文长度增加,收益愈加显著。代码可在 https://github.com/OpenMOSS/rope_pp 获取。
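为说明"实部 + 虚部"双分量注意力分数的含义,下面给出一个把 RoPE 写成复数旋转、同时保留复数内积实部与虚部的极简示意;其中两部分的组合方式(例如这里的权重 lam)只是假设,论文的具体做法以原文为准:

```python
import torch

def rope_rotate(x, theta_base=10000.0):
    """把最后一维按相邻 (偶, 奇) 两两配对视作复数,并按位置做标准 RoPE 旋转。x: [B, T, D]"""
    B, T, D = x.shape
    half = D // 2
    freqs = theta_base ** (-torch.arange(half, dtype=torch.float32) / half)    # [half]
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]    # [T, half]
    rot = torch.polar(torch.ones_like(angles), angles)                         # e^{i m theta}
    xc = torch.view_as_complex(x.float().reshape(B, T, half, 2).contiguous())  # [B, T, half]
    return xc * rot

def dual_component_scores(q, k, lam=1.0):
    """标准 RoPE 只取复数内积的实部;这里同时保留虚部,lam 为假设的虚部权重。"""
    qc, kc = rope_rotate(q), rope_rotate(k)
    inner = torch.einsum("bqd,bkd->bqk", qc, kc.conj())    # 复数内积, [B, Tq, Tk]
    return inner.real + lam * inner.imag                   # 标准做法等价于 lam = 0
```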
OmniPSD:使用扩散Transformer生成分层 PSD
- 标题: OmniPSD: Layered PSD Generation with Diffusion Transformer
- 作者: Cheng Liu, Yiren Song, Haofan Wang, Mike Zheng Shou
- 日期: 2025-12-10
- ArXiv主页 : https://arxiv.org/abs/2512.09247
- 论文链接 : https://arxiv.org/pdf/2512.09247
- 项目链接 : https://showlab.github.io/OmniPSD/
英文摘要
Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.
中文摘要
扩散模型的最新进展极大地改进了图像生成和编辑,但生成或重建带有透明 Alpha 通道的分层 PSD 文件仍然极具挑战性。我们提出了 OmniPSD,这是一个基于 Flux 生态构建的统一扩散框架,可通过上下文内学习同时实现文本到 PSD 的生成和图像到 PSD 的分解。对于文本到 PSD 的生成,OmniPSD 将多个目标图层在空间上排布到同一画布中,并通过空间注意力学习它们的组合关系,从而生成语义连贯、层次分明的图层。对于图像到 PSD 的分解,它执行迭代式的上下文内编辑,逐步提取并擦除文字和前景成分,从单张展平图像中重建可编辑的 PSD 图层。我们采用 RGBA-VAE 作为辅助表示模块,在不影响结构学习的前提下保留透明度信息。在我们新构建的 RGBA 分层数据集上的大量实验表明,OmniPSD 实现了高保真生成、结构一致性和透明度感知,为基于扩散 Transformer 的分层设计生成与分解提供了新的范式。
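作为背景补充,下面给出把多个 RGBA 图层自下而上合成为一张展平图像的标准 alpha-over 公式示意(这是通用的图像合成知识,并非 OmniPSD 的生成流程本身):

```python
import torch

def alpha_over(fg_rgb, fg_a, bg_rgb, bg_a):
    """标准 alpha-over 合成:把前景 RGBA 图层叠加到背景上。输入取值范围 [0, 1]。
    fg_rgb/bg_rgb: [H, W, 3];fg_a/bg_a: [H, W, 1]"""
    out_a = fg_a + bg_a * (1.0 - fg_a)
    out_rgb = (fg_rgb * fg_a + bg_rgb * bg_a * (1.0 - fg_a)) / out_a.clamp(min=1e-6)
    return out_rgb, out_a

def flatten_layers(layers):
    """按从下到上的顺序依次合成一组 (rgb, alpha) 图层,得到展平结果。"""
    rgb, a = layers[0]
    for fg_rgb, fg_a in layers[1:]:
        rgb, a = alpha_over(fg_rgb, fg_a, rgb, a)
    return rgb, a
```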
我们准备好在文本转 3D 生成中使用强化学习了吗?渐进式研究
- 标题: Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
- 作者: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao
- 日期: 2025-12-11
- ArXiv主页 : https://arxiv.org/abs/2512.10949
- gitHub仓库 : https://github.com/Ivan-Tang-3D/3DGen-R1
英文摘要
Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
中文摘要
强化学习 (RL) 此前已被证明在大型语言模型和多模态模型中行之有效,最近也被成功扩展用于增强 2D 图像生成。然而,由于 3D 物体具有更高的空间复杂性,需要全局一致的几何和细粒度的局部纹理,将 RL 应用于 3D 生成在很大程度上仍未被探索。这使得 3D 生成对奖励设计和 RL 算法格外敏感。为了应对这些挑战,我们从多个维度对文本到 3D 自回归生成中的 RL 进行了首次系统研究。(1) 奖励设计:我们评估了奖励维度和模型选择,表明与人类偏好的对齐至关重要,并且通用多模态模型能为 3D 属性提供稳健的信号。(2) RL 算法:我们研究了 GRPO 的各种变体,强调了 token 级优化的有效性,并进一步研究了训练数据量与迭代次数的扩展规律。(3) 文本到 3D 基准:由于现有基准无法衡量 3D 生成模型的隐式推理能力,我们引入了 MME-3DR。(4) 进阶 RL 范式:受 3D 生成天然层级结构的启发,我们提出了 Hi-GRPO,通过专门的奖励集成来优化从全局到局部的层级化 3D 生成。基于这些发现,我们开发了 AR3D-R1,这是首个经 RL 增强的文本到 3D 模型,在从粗略形状到纹理细化的各阶段均表现出色。我们希望这项研究能为 RL 驱动的 3D 生成推理提供启示。代码发布于 https://github.com/Ivan-Tang-3D/3DGen-R1。
用于解决奥林匹克级数学问题的长程推理智能体
- 标题: Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
- 作者: Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen
- 日期: 2025-12-11
- ArXiv主页 : https://arxiv.org/abs/2512.10739
- 论文链接 : https://arxiv.org/pdf/2512.10739
英文摘要
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic dataset, closely align with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
中文摘要
大型语言模型 (LLM) 借助可验证奖励的强化学习 (RLVR),在解决复杂推理任务方面取得了重大进展。这一进步也离不开可靠验证器所提供的自动化监督。然而,当前基于结果的验证器 (OV) 无法检查长推理思维链 (CoT) 中不可靠的中间步骤。与此同时,当前基于过程的验证器 (PV) 难以可靠地检测复杂长 CoT 中的错误,受限于人工标注成本过高导致的高质量标注稀缺。因此,我们提出了基于结果的过程验证器 (OPV),它对长 CoT 所总结结论的推理过程进行验证,从而兼顾验证的准确性与效率,并支持大规模标注。为了增强该验证器,我们采用带专家标注的迭代式主动学习框架,以更低的标注成本逐步提升 OPV 的验证能力。具体而言,在每轮迭代中,对当前最佳 OPV 最不确定的样本进行标注,随后通过拒绝微调 (RFT) 和 RLVR 训练下一轮的新 OPV。大量实验证明了 OPV 的优越性能和广泛适用性。它在我们保留的 OPV-Bench 上取得了新的最先进结果,F1 分数达到 83.1,优于 Qwen3-Max-Preview 等规模更大的开源模型(其 F1 为 76.3)。此外,OPV 能有效检测合成数据集中的假阳性,与专家评估高度一致。在与策略模型协同时,OPV 能持续带来性能提升,例如,随着计算预算的扩大,将 DeepSeek-R1-Distill-Qwen-32B 在 AIME2025 上的准确率从 55.2% 提高到 73.3%。
使用 Temporal Reasoner 进行统一视频编辑
- 标题: Unified Video Editing with Temporal Reasoner
- 作者: Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu
- 日期: 2025-12-08
- ArXiv主页 : https://arxiv.org/abs/2512.07469
- 论文链接 : https://arxiv.org/pdf/2512.07469
- 项目链接 : https://videocof.github.io/
- gitHub仓库 : https://github.com/knightyxp/VideoCoF
英文摘要
Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a ``see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weight, data are available at https://github.com/knightyxp/VideoCoF.
中文摘要
现有的视频编辑方法面临一个关键权衡:专家模型精度高,但依赖掩码等任务特定先验,阻碍了统一;相反,统一的时间上下文内学习模型无需掩码,却缺乏显式的空间线索,导致指令到区域的映射较弱、定位不精确。为了解决这一冲突,我们提出了 VideoCoF,这是一种受思维链 (Chain-of-Thought) 推理启发的新颖帧链 (Chain-of-Frames) 方法。VideoCoF 通过让视频扩散模型在生成目标视频 token 之前先预测推理 token(编辑区域潜变量),来强制执行"先看、再推理、后编辑"的流程。这一显式推理步骤无需用户提供掩码,同时实现了精确的指令到区域对齐和细粒度视频编辑。此外,我们引入了一种 RoPE 对齐策略,利用这些推理 token 来保证运动对齐,并支持超出训练时长的长度外推。我们证明,仅需 5 万个视频对的极低数据成本,VideoCoF 就能在 VideoCoF-Bench 上取得最先进的性能,验证了我们方法的效率和有效性。我们的代码、权重和数据可在 https://github.com/knightyxp/VideoCoF 获取。
OneStory:具有自适应内存的连贯多镜头视频生成
- 标题: OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
- 作者: Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie
- 日期: 2025-12-08
- ArXiv主页 : https://arxiv.org/abs/2512.07802
- 论文链接 : https://arxiv.org/pdf/2512.07802
- 项目链接 : https://zhaochongan.github.io/projects/OneStory/
英文摘要
Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
中文摘要
现实世界视频中的故事讲述通常通过多个镜头展开,即不连续但语义相关的片段共同传达连贯的叙事。然而,现有的多镜头视频生成 (MSV) 方法难以有效建模长程的跨镜头上下文,因为它们依赖有限的时间窗口或单一关键帧条件,导致在复杂叙事下性能下降。在这项工作中,我们提出了 OneStory,以全局而紧凑的跨镜头上下文建模实现一致且可扩展的叙事生成。OneStory 将 MSV 重新表述为"下一镜头生成"任务,在利用预训练图像到视频 (I2V) 模型提供强视觉条件的同时实现自回归镜头合成。我们引入了两个关键模块:帧选择模块,基于先前镜头中的信息帧构建语义相关的全局记忆;以及自适应条件器,执行重要性引导的分块 (patchification) 以生成用于直接条件控制的紧凑上下文。我们进一步构建了一个带指代性文字描述的高质量多镜头数据集,以贴合真实世界的叙事模式,并在下一镜头生成范式下设计了有效的训练策略。OneStory 基于预训练 I2V 模型在我们构建的 6 万条数据上微调,在文本和图像条件设置下的多样复杂场景中实现了最先进的叙事连贯性,支持可控且沉浸式的长视频叙事。
Voxify3D:像素艺术与体积渲染的结合
- 标题: Voxify3D: Pixel Art Meets Volumetric Rendering
- 作者: Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu
- 日期: 2025-12-08
- ArXiv主页 : https://arxiv.org/abs/2512.07834
- 论文链接 : https://arxiv.org/pdf/2512.07834
- 项目链接 : https://yichuanh.github.io/Voxify-3D/
英文摘要
Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/
中文摘要
体素艺术是一种广泛用于游戏和数字媒体的独特风格,但由于几何抽象、语义保留和离散颜色一致性这些相互冲突的要求,从 3D 网格自动生成体素艺术仍然具有挑战性。现有方法要么过度简化几何,要么无法实现像素级精确、受调色板约束的体素艺术美学。我们提出了 Voxify3D,这是一个可微的两阶段框架,将 3D 网格优化与 2D 像素艺术监督衔接起来。我们的核心创新在于三个组件的协同集成:(1) 正交投影像素艺术监督,消除透视失真以实现精确的体素-像素对齐;(2) 基于图像块的 CLIP 对齐,在不同离散化级别上保留语义;(3) 调色板约束的 Gumbel-Softmax 量化,支持在离散颜色空间上进行可微优化,并可控制调色板策略。这一集成解决了若干基本挑战:极端离散化下的语义保留、通过体积渲染实现的像素艺术美学,以及端到端的离散优化。实验表明,该方法在多样的角色上取得了更优的性能(CLIP-IQA 37.12,用户偏好 77.90%),并支持可控的抽象程度(2-8 种颜色,20x-50x 分辨率)。项目页面:https://yichuanh.github.io/Voxify-3D/
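下面用一个极简示意说明"调色板约束的 Gumbel-Softmax 量化"如何在固定调色板上实现可微的颜色选择;张量形状与训练细节均为假设,并非论文官方实现:

```python
import torch
import torch.nn.functional as F

def palette_quantize(logits, palette, tau=1.0, hard=True):
    """对每个体素在 K 色调色板上做 Gumbel-Softmax 采样,返回可反向传播的颜色。
    logits:  [N, K]  每个体素对各调色板颜色的可学习得分
    palette: [K, 3]  固定调色板的 RGB 颜色
    返回:    [N, 3]  hard=True 时前向为 one-hot 选色,反向使用直通梯度"""
    w = F.gumbel_softmax(logits, tau=tau, hard=hard)   # [N, K]
    return w @ palette

# 用法示意:8 色调色板、1000 个体素
palette = torch.rand(8, 3)
logits = torch.zeros(1000, 8, requires_grad=True)
colors = palette_quantize(logits, palette, tau=0.7)
loss = colors.mean()       # 占位损失;实际中由可微渲染与 CLIP 等监督给出
loss.backward()
```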
EditThinker:解锁任何图像编辑器的迭代推理
- 标题: EditThinker: Unlocking Iterative Reasoning for Any Image Editor
- 作者: Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, Xunliang Cai, Linjiang Huang, Hongsheng Li, Si Liu
- 日期: 2025-12-05
- ArXiv主页 : https://arxiv.org/abs/2512.05965
- 论文链接 : https://arxiv.org/pdf/2512.05965
- 项目链接 : https://appletea233.github.io/think-while-edit/
- gitHub仓库 : https://github.com/appletea233/EditThinker
英文摘要
Instruction-based image editing has emerged as a prominent research area, which, benefiting from image generation foundation models, have achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework to 'think' while they edit, which simulates the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions , followed by Repeating the generation until satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produce the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align the EditThinker's thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach significantly improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.
中文摘要
基于指令的图像编辑已成为一个重要的研究方向,得益于图像生成基础模型,其美学质量已经很高,因此指令遵循能力成为主要挑战。现有方法通过监督学习或强化学习提高指令遵循度,但由于固有的随机性和缺乏深思熟虑,单轮成功率仍然有限。在这项工作中,我们提出了一个"边编辑边思考"的深思式编辑框架,通过迭代执行 Think-while-Edit 循环来模拟人类的认知回路:批评结果并细化指令,然后重复生成直至满意。具体来说,我们训练单个 MLLM(即 EditThinker)作为该框架的推理引擎,它同时产生批评分数、推理过程和细化后的指令。我们采用强化学习使 EditThinker 的思考与其编辑保持一致,从而生成更有针对性的指令改进。在四个基准上的大量实验表明,我们的方法能大幅提升任意图像编辑模型的指令遵循能力。我们将发布数据构建框架、数据集和模型以回馈社区。
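下面用一段极简的控制流示意"批评-细化-重试"的 Think-while-Edit 循环;其中 editor、think 等接口均为假设,仅用于说明流程,并非 EditThinker 的官方 API:

```python
def think_while_edit(image, instruction, editor, think, max_rounds=4, threshold=0.8):
    """editor(image, instr) -> 编辑后的图像;think(src, edited, instr) -> (得分, 批评, 细化后的指令)。"""
    current_instruction = instruction
    best_image, best_score = None, -1.0
    for _ in range(max_rounds):
        edited = editor(image, current_instruction)                    # 任意现成的图像编辑器
        score, critique, refined = think(image, edited, instruction)   # 打分、批评并细化指令
        if score > best_score:
            best_image, best_score = edited, score
        if score >= threshold:                                         # 结果满意,提前结束
            break
        current_instruction = refined                                  # 用细化后的指令重试
    return best_image, best_score
```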
BrainExplore:大规模发现人脑中可解释的视觉表征
- 标题: BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain
- 作者: Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani
- 日期: 2025-12-09
- ArXiv主页 : https://arxiv.org/abs/2512.08560
- 论文链接 : https://arxiv.org/pdf/2512.08560
- 项目链接 : https://navvewas.github.io/BrainExplore/
英文摘要
Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.
中文摘要
了解人脑如何表示视觉概念,以及这些表示在哪些大脑区域进行编码,仍然是一个长期存在的挑战。数十年的工作增进了我们对视觉表征的理解,但大脑信号仍然庞大且复杂,并且可能的视觉概念空间巨大。因此,大多数研究规模仍然较小,依赖于人工检查,专注于特定区域和属性,很少包括系统验证。我们提出了一个大规模的自动化框架,用于发现和解释人类皮层的视觉表征。我们的方法包括两个主要阶段。首先,我们通过无监督、数据驱动的分解方法发现功能磁共振成像活动中的候选可解释模式。接下来,我们通过识别最能引发该模式的自然图像集并生成其共享视觉含义的自然语言描述来解释每种模式。为了扩展这个过程,我们引入了一个自动化管道,可以测试多个候选解释,分配定量可靠性分数,并为每个体素模式选择最一致的描述。我们的框架揭示了数千种可解释的模式,涵盖许多不同的视觉概念,包括以前未报告的细粒度表示。
关于预训练、中期训练与强化学习在推理语言模型上的相互作用
- 标题: On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- 作者: Charlie Zhang, Graham Neubig, Xiang Yue
- 日期: 2025-12-08
- ArXiv主页 : https://arxiv.org/abs/2512.07783
- gitHub仓库 : https://github.com/Interplay-LM-Reasoning/Interplay-LM-Reasoning
英文摘要
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
中文摘要
最近的强化学习 (RL) 技术为语言模型带来了令人印象深刻的推理能力提升,但后训练是否真正将模型的推理能力扩展到预训练所获得的能力之外,仍不清楚。一个核心挑战是现代训练流程缺乏可控性:大规模预训练语料不透明,中期训练往往未被充分研究,RL 目标又以复杂的方式与未知的先验知识相互作用。为了消除这种模糊性,我们开发了一个完全受控的实验框架,将预训练、中期训练和基于 RL 的后训练的因果贡献相互隔离。我们的方法采用合成推理任务,包含显式的原子操作、可解析的逐步推理轨迹,并对训练分布进行系统性操控。我们沿两个维度评估模型:面向更复杂组合的外推泛化,以及跨表面上下文的上下文泛化。借助这一框架,我们调和了关于 RL 有效性的不同观点。我们表明:1) 只有当预训练留有足够的提升空间,且 RL 数据瞄准模型的能力边界(即困难但并非遥不可及的任务)时,RL 才会带来真正的能力增益 (pass@128)。2) 上下文泛化只需最少但足够的预训练接触,此后 RL 即可可靠迁移。3) 在固定计算量下,与仅使用 RL 相比,中期训练能显著提升性能,表明其在训练流程中处于核心地位却尚未得到充分探索。4) 过程级奖励能减少奖励作弊 (reward hacking) 并提升推理保真度。总之,这些结果阐明了预训练、中期训练与 RL 之间的相互作用,为理解和改进推理语言模型的训练策略奠定了基础。
BEAVER:高效的确定性 LLM 验证器
- 标题: BEAVER: An Efficient Deterministic LLM Verifier
- 作者: Tarun Suresh, Nalin Wadhwa, Debangshu Banerjee, Gagandeep Singh
- 日期: 2025-12-05
- ArXiv主页 : https://arxiv.org/abs/2512.05439
- 论文链接 : https://arxiv.org/pdf/2512.05439
英文摘要
As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification and secure code generation tasks across multiple state of the art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high risk instances compared to baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.
中文摘要
随着大型语言模型 (LLM) 从研究原型走向生产系统,从业者往往需要可靠的方法来验证模型输出是否满足所需约束。基于采样的估计虽然能提供对模型行为的直观感受,却无法给出可靠 (sound) 的保证。我们提出了 BEAVER,这是首个用于计算 LLM 约束满足概率的确定性、可靠界的实用框架。给定任意前缀封闭的语义约束,BEAVER 借助新颖的 token 字典树 (trie) 和前沿 (frontier) 数据结构系统地探索生成空间,并在每次迭代中保持可证明可靠的界。我们形式化了该验证问题,证明了方法的可靠性,并在多个最先进的 LLM 上评估了 BEAVER 在正确性验证、隐私验证和安全代码生成任务中的表现。在相同计算预算下,与基线方法相比,BEAVER 得到的概率界紧 6 到 8 倍,识别出的高风险实例多 3 到 4 倍,从而实现了松弛界或经验评估无法提供的精确刻画与风险评估。
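下面给出"基于 token 字典树与前沿集合,对约束满足概率计算确定性上下界"这一思路的极简示意;其中 next_token_probs、is_sat、is_violated 等接口均为假设,仅体现前缀封闭约束下"已满足的质量计入下界、未展开的质量计入上界"的逻辑,并非 BEAVER 的实现:

```python
import heapq
import itertools

def probability_bounds(next_token_probs, is_sat, is_violated, eos, budget=10000):
    """返回 (下界, 上界):输出序列满足某前缀封闭约束的概率的确定性界(示意)。
    next_token_probs(prefix) -> {token: prob}  给定前缀的下一个 token 分布
    is_violated(prefix): 前缀已违反约束(前缀封闭性保证其所有扩展也违反)
    is_sat(prefix):      以 eos 结束的完整序列满足约束"""
    lower = 0.0
    counter = itertools.count()                  # 平手打破器,避免比较前缀本身
    frontier = [(-1.0, next(counter), ())]       # 最大堆:优先展开概率质量最大的前缀
    expanded = 0
    while frontier and expanded < budget:
        neg_mass, _, prefix = heapq.heappop(frontier)
        expanded += 1
        for tok, p in next_token_probs(prefix).items():
            child, m = prefix + (tok,), -neg_mass * p
            if is_violated(child):
                continue                         # 整棵子树的质量都判为不满足
            if tok == eos:
                if is_sat(child):
                    lower += m                   # 完整且满足:计入下界
            else:
                heapq.heappush(frontier, (-m, next(counter), child))
    unresolved = sum(-nm for nm, _, _ in frontier)  # 尚未展开、结果未定的质量
    return lower, lower + unresolved
```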
OPV:基于结果的过程验证器,用于高效的长思维链验证
- 标题: OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
- 作者: Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen
- 日期: 2025-12-11
- ArXiv主页 : https://arxiv.org/abs/2512.10756
- 论文链接 : https://arxiv.org/pdf/2512.10756
英文摘要
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic dataset, closely align with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
中文摘要
大型语言模型 (LLM) 借助可验证奖励的强化学习 (RLVR),在解决复杂推理任务方面取得了重大进展。这一进步也离不开可靠验证器所提供的自动化监督。然而,当前基于结果的验证器 (OV) 无法检查长推理思维链 (CoT) 中不可靠的中间步骤。与此同时,当前基于过程的验证器 (PV) 难以可靠地检测复杂长 CoT 中的错误,受限于人工标注成本过高导致的高质量标注稀缺。因此,我们提出了基于结果的过程验证器 (OPV),它对长 CoT 所总结结论的推理过程进行验证,从而兼顾验证的准确性与效率,并支持大规模标注。为了增强该验证器,我们采用带专家标注的迭代式主动学习框架,以更低的标注成本逐步提升 OPV 的验证能力。具体而言,在每轮迭代中,对当前最佳 OPV 最不确定的样本进行标注,随后通过拒绝微调 (RFT) 和 RLVR 训练下一轮的新 OPV。大量实验证明了 OPV 的优越性能和广泛适用性。它在我们保留的 OPV-Bench 上取得了新的最先进结果,F1 分数达到 83.1,优于 Qwen3-Max-Preview 等规模更大的开源模型(其 F1 为 76.3)。此外,OPV 能有效检测合成数据集中的假阳性,与专家评估高度一致。在与策略模型协同时,OPV 能持续带来性能提升,例如,随着计算预算的扩大,将 DeepSeek-R1-Distill-Qwen-32B 在 AIME2025 上的准确率从 55.2% 提高到 73.3%。
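下面是摘要中"每轮挑选当前验证器最不确定的样本交给专家标注、再训练下一轮验证器"这一主动学习循环的极简示意;predict_proba、annotate、train 等接口均为假设,训练环节在实际流程中对应 RFT 与 RLVR:

```python
def active_learning_loop(verifier, unlabeled, annotate, train, rounds=3, budget=100):
    """verifier.predict_proba(x) 返回判定为正确的概率;annotate(x) 返回专家标签。"""
    labeled = []
    for _ in range(rounds):
        # 预测概率越接近 0.5,验证器越不确定,越优先送去标注
        ranked = sorted(unlabeled, key=lambda x: abs(verifier.predict_proba(x) - 0.5))
        batch, unlabeled = ranked[:budget], ranked[budget:]
        labeled += [(x, annotate(x)) for x in batch]
        verifier = train(verifier, labeled)      # 实际流程中对应 RFT + RLVR 训练新一轮 OPV
    return verifier
```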
通过复杂性提升强化学习实现奥林匹克级几何大型语言模型智能体
- 标题: Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
- 作者: Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen
- 日期: 2025-12-11
- ArXiv主页 : https://arxiv.org/abs/2512.10534
- 论文链接 : https://arxiv.org/pdf/2512.10534
英文摘要
Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.
中文摘要
大型语言模型 (LLM) 智能体表现出强大的数学问题求解能力,甚至可以借助形式化证明系统解决国际数学奥林匹克 (IMO) 级别的问题。然而,由于缺乏针对辅助作图的有效启发式,几何问题求解的 AI 仍由 AlphaGeometry 2 等专家模型主导,而这些模型在训练和评估上都严重依赖大规模数据合成与搜索。在这项工作中,我们首次尝试构建达到奖牌得主水平的几何 LLM 智能体,并提出了 InternGeometry。InternGeometry 通过迭代地提出命题和辅助作图、用符号引擎进行验证,并对引擎的反馈进行反思以指导后续提议,从而克服了几何中的启发式瓶颈。动态记忆机制使 InternGeometry 能够针对每道题与符号引擎进行两百次以上的交互。为了进一步加速学习,我们引入了复杂性提升强化学习 (CBRL),在各训练阶段逐步提高合成问题的复杂度。InternGeometry 基于 InternThinker-32B 构建,解决了 50 道 IMO 几何题(2000-2024 年)中的 44 道,超过了金牌得主的平均得分 (40.9),而训练样本仅 1.3 万个,只相当于 AlphaGeometry 2 所用数据的 0.004%,展示了 LLM 智能体在专家级几何任务上的潜力。InternGeometry 还能为 IMO 问题提出人类解法中未曾出现的新颖辅助作图。我们将发布模型、数据和符号引擎以支持未来研究。
DeepCode:开放代理编码
- 标题: DeepCode: Open Agentic Coding
- 作者: Zongwei Li, Zhonghang Li, Zirui Guo, Xubin Ren, Chao Huang
- 日期: 2025-12-08
- ArXiv主页 : https://arxiv.org/abs/2512.07921
- gitHub仓库 : https://github.com/HKUDS/DeepCode
英文摘要
Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. However, existing methods still face significant challenges in achieving high-fidelity document-to-codebase synthesis--such as scientific papers to code--primarily due to a fundamental conflict between information overload and the context bottlenecks of LLMs. In this work, we introduce DeepCode, a fully autonomous framework that fundamentally addresses this challenge through principled information-flow management. By treating repository synthesis as a channel optimization problem, DeepCode seamlessly orchestrates four information operations to maximize task-relevant signals under finite context budgets: source compression via blueprint distillation, structured indexing using stateful code memory, conditional knowledge injection via retrieval-augmented generation, and closed-loop error correction. Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, decisively outperforming leading commercial agents such as Cursor and Claude Code, and crucially, surpassing PhD-level human experts from top institutes on key reproduction metrics. By systematically transforming paper specifications into production-grade implementations comparable to human expert quality, this work establishes new foundations for autonomous scientific reproduction that can accelerate research evaluation and discovery.
中文摘要
大型语言模型 (LLM) 的最新进展催生了强大的编码智能体,使代码助手有可能进化为代码工程师。然而,现有方法在实现高保真的文档到代码库合成(例如把科学论文转化为代码)方面仍面临重大挑战,这主要源于信息过载与 LLM 上下文瓶颈之间的根本冲突。在这项工作中,我们提出 DeepCode,这是一个完全自主的框架,通过有原则的信息流管理从根本上应对这一挑战。DeepCode 将代码仓库合成视为信道优化问题,无缝编排四类信息操作,在有限上下文预算下最大化任务相关信号:通过蓝图蒸馏进行源压缩、借助有状态代码记忆进行结构化索引、通过检索增强生成进行条件知识注入,以及闭环纠错。在 PaperBench 基准上的大量评估表明,DeepCode 取得了最先进的性能,决定性地超越了 Cursor 和 Claude Code 等领先商业智能体,更关键的是,在核心复现指标上超过了来自顶尖机构的博士级人类专家。通过系统地将论文描述转化为可与人类专家质量相媲美的生产级实现,这项工作为能够加速研究评估与发现的自主科学复现奠定了新基础。
MoCapAnything:对单目视频中的任意骨骼进行统一 3D 动作捕捉
- 标题: MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
- 作者: Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang
- 日期: 2025-12-11
- ArXiv主页 : https://arxiv.org/abs/2512.10881
- 论文链接 : https://arxiv.org/pdf/2512.10881
- 项目链接 : https://animotionlab.github.io/MoCapAnything/
英文摘要
Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
中文摘要
动作捕捉如今支撑的内容创作已远超数字人的范畴,但大多数现有流程仍局限于特定物种或特定模板。我们将这一空白形式化为类别无关动作捕捉 (CAMoCap):给定一段单目视频和任意一个已绑定骨骼的 3D 资产作为提示,目标是重建基于旋转的动画(例如 BVH),从而直接驱动该资产。我们提出了 MoCapAnything,这是一个参考引导的分解式框架,先预测 3D 关节轨迹,再通过约束感知的逆向运动学恢复特定资产的旋转。该系统包含三个可学习模块和一个轻量级 IK 阶段:(1) 参考提示编码器,从资产的骨架、网格和渲染图像中提取逐关节查询;(2) 视频特征提取器,计算稠密视觉描述符并重建粗略的 4D 变形网格,以弥合视频与关节空间之间的差距;(3) 统一运动解码器,融合这些线索以产生时间上连贯的轨迹。我们还整理了包含 1038 个动作片段的 Truebones Zoo,每个片段都提供标准化的骨架-网格-渲染三元组。在域内基准和真实场景视频上的实验表明,MoCapAnything 能生成高质量的骨骼动画,并在异构骨骼绑定之间展现出有意义的跨物种重定向,从而为任意资产实现可扩展、由提示驱动的 3D 动作捕捉。项目页面:https://animotionlab.github.io/MoCapAnything/
分布匹配变分自动编码器
- 标题: Distribution Matching Variational AutoEncoder
- 作者: Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, Han Hu
- 日期: 2025-12-08
- ArXiv主页 : https://arxiv.org/abs/2512.07778
- gitHub仓库 : https://github.com/sen-ye/dmvae
英文摘要
Most visual generative models compress images into a latent space before applying diffusion or autoregressive modelling. Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching gFID equals 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is avaliable at https://github.com/sen-ye/dmvae.
中文摘要
大多数视觉生成模型在进行扩散或自回归建模之前,会先把图像压缩到潜在空间。然而,VAE 和与基础模型对齐的编码器等现有方法只是隐式地约束潜在空间,并未显式地塑造其分布,因此并不清楚哪类分布最利于建模。我们提出了分布匹配 VAE (DMVAE),它通过分布匹配约束将编码器的潜在分布与任意参考分布显式对齐。这超越了传统 VAE 的高斯先验,能够与源自自监督特征、扩散噪声或其他先验的分布对齐。借助 DMVAE,我们可以系统地研究哪些潜在分布更有利于建模,并发现源自自监督学习 (SSL) 的分布在重建保真度和建模效率之间取得了出色的平衡,仅用 64 个训练轮次就在 ImageNet 上达到 gFID 3.2。我们的结果表明,选择合适的潜在分布结构(通过分布级对齐实现),而不是依赖固定先验,是弥合"易于建模的潜变量"与高保真图像合成之间差距的关键。代码可在 https://github.com/sen-ye/dmvae 获取。
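下面以 MMD(最大均值差异)为例,给出"把编码器潜变量分布与参考分布对齐"这类分布匹配约束的一种常见实例的示意;DMVAE 实际采用的匹配目标可能不同,此处仅用于说明训练时如何加入分布级约束:

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """RBF 核下的(有偏)MMD^2,衡量两组样本 x, y([N, D])的分布差异。"""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

# 用法示意:让编码器输出的潜变量 z 去匹配参考分布的样本 z_ref(例如来自自监督特征)
# total_loss = recon_loss + lam * rbf_mmd2(z, z_ref)
```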
扩展零样本参考到视频生成
- 标题: Scaling Zero-Shot Reference-to-Video Generation
- 作者: Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
- 日期: 2025-12-07
- ArXiv主页 : https://arxiv.org/abs/2512.06905
- 论文链接 : https://arxiv.org/pdf/2512.06905
- 项目链接 : https://franciszzj.github.io/Saber/
- gitHub仓库 : https://github.com/franciszzj/Saber
英文摘要
Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
中文摘要
参考到视频 (R2V) 生成旨在合成与文本提示一致的视频,同时保留参考图像中的主体身份。然而,当前的 R2V 方法受制于对显式"参考图像-视频-文本"三元组的依赖,而此类数据的构建成本高昂且难以扩展。我们通过提出 Saber 绕过了这一瓶颈,这是一个无需显式 R2V 数据的可扩展零样本框架。Saber 仅在视频-文本对上训练,采用掩码训练策略和专门设计的基于注意力的模型结构,学习身份一致且能感知参考的表示。我们进一步引入掩码增强技术,以缓解参考到视频生成中常见的复制粘贴伪影。此外,Saber 在参考图像数量变化时展现出出色的泛化能力,并且在 OpenS2V-Eval 基准上取得了优于使用 R2V 数据训练的方法的性能。
EgoEdit:以自我为中心的视频编辑的数据集、实时流模型和基准
- 标题: EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
- 作者: Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip Torr, Willi Menapace
- 日期: 2025-12-05
- ArXiv主页 : https://arxiv.org/abs/2512.06065
- 论文链接 : https://arxiv.org/pdf/2512.06065
- 项目链接 : https://snap-research.github.io/EgoEdit/
- gitHub仓库 : https://github.com/snap-research/EgoEdit
英文摘要
We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges - including rapid egomotion and frequent hand-object interactions - that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a carefully designed and manually curated dataset specifically designed for egocentric editing scenarios, featuring rich hand-object interactions, while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks-where existing methods struggle-while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community. See our website at https://snap-research.github.io/EgoEdit
中文摘要
我们研究面向交互式 AR 应用的、以指令引导的以自我为中心视频编辑。虽然最近的 AI 视频编辑器在第三人称素材上表现良好,但以自我为中心的视角带来了独特的挑战,包括快速的自身运动 (egomotion) 和频繁的手-物交互,从而造成了显著的领域差距。此外,现有的离线编辑流程延迟较高,限制了实时交互。为了解决这些问题,我们提出了一个完整的以自我为中心视频编辑生态系统。首先,我们构建了 EgoEditData,这是一个精心设计、人工整理的数据集,专为以自我为中心的编辑场景打造,包含丰富的手-物交互,并显式保留手部。其次,我们开发了 EgoEdit,这是一个遵循指令的以自我为中心视频编辑器,支持在单张 GPU 上进行实时流式推理。最后,我们推出了 EgoEditBench,这是一个评测套件,关注指令忠实度、手部与交互保留以及自身运动下的时间稳定性。在以自我为中心和通用编辑任务上,EgoEdit 都能以交互级延迟产生时间稳定、忠于指令的结果。它在现有方法表现不佳的以自我为中心编辑基准上取得了明显提升,同时在通用编辑任务上保持与最强基线相当的性能。EgoEditData 和 EgoEditBench 将向研究社区公开。详见我们的网站 https://snap-research.github.io/EgoEdit
通过概念提示绑定从图像和视频组成概念
- 标题: Composing Concepts from Images and Videos via Concept-prompt Binding
- 作者: Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao
- 日期: 2025-12-10
- ArXiv主页 : https://arxiv.org/abs/2512.09824
- 论文链接 : https://arxiv.org/pdf/2512.09824
- 项目链接 : https://refkxh.github.io/BiCo_Webpage/
- gitHub仓库 : https://github.com/refkxh/bico
英文摘要
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
中文摘要
视觉概念组合旨在将来自图像和视频的不同元素整合为单一、连贯的视觉输出,但现有方法在从视觉输入中准确提取复杂概念,以及灵活组合图像与视频中的概念方面仍有不足。我们提出了 Bind & Compose,这是一种单样本 (one-shot) 方法,通过将视觉概念绑定到相应的提示 token,并用来自不同来源的已绑定 token 组装目标提示,从而实现灵活的视觉概念组合。它采用分层绑定器结构对扩散 Transformer 中的交叉注意力进行条件控制,把视觉概念编码进对应的提示 token,以准确分解复杂的视觉概念。为了提高概念与 token 的绑定精度,我们设计了"多样化-吸收"机制,利用一个额外的吸收 token,在使用多样化提示训练时消除与概念无关细节的影响。为了增强图像概念与视频概念之间的兼容性,我们提出了时间解耦策略,采用双分支绑定器结构进行时间建模,将视频概念的训练过程解耦为两个阶段。评估表明,我们的方法在概念一致性、提示保真度和运动质量上均优于现有方法,为视觉创作开辟了新的可能性。
从模仿到判别:迈向增强跨领域推理任务的通用课程优势机制
- 标题: From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
- 作者: Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang
- 日期: 2025-12-02
- ArXiv主页 : https://arxiv.org/abs/2512.02580
- 论文链接 : https://arxiv.org/pdf/2512.02580
英文摘要
Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
中文摘要
强化学习已成为大型语言模型后训练的一种范式,可提升其推理能力。这类方法为每个样本计算一个优势值,反映其表现优于或差于预期,从而为训练同时提供正向和负向信号。然而,现有方法不加区分地混合这两类信号,尤其在训练早期,可能导致指导信号含混、收益有限。为了解决这个问题,我们提出了 CAPO(Curriculum Advantage Policy Optimization),一种基于优势信号的自适应课程机制。该机制先仅使用优势为正的样本进行模仿学习以建立稳健基础,随后引入负向信号以培养判别能力,从而提升在复杂场景下的泛化性。我们的方法与 GRPO、PPO、RLOO 和 Reinforce++ 等多种优化方法兼容,在数学推理任务中持续取得稳定而显著的改进,并能有效推广到多模态图形用户界面 (GUI) 推理场景,成为一个通用且鲁棒的优化框架。
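下面给出"组内归一化优势 + 课程化掩码"的极简示意:早期只保留正优势(相当于用好样本做模仿学习),之后正负优势同时参与以培养判别能力;切换时机与归一化细节均为假设,并非 CAPO 的官方实现:

```python
import torch

def curriculum_advantages(rewards, step, switch_step):
    """rewards: [G],同一提示下 G 个回答的奖励;返回用于策略梯度加权的优势值。"""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # GRPO 式组内标准化
    if step < switch_step:
        adv = torch.clamp(adv, min=0.0)                         # 阶段一:屏蔽负优势,仅模仿
    return adv

# 用法示意
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(curriculum_advantages(rewards, step=10, switch_step=100))   # 早期:负优势被置零
```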