中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- 共享即关怀:基于集体RL经验共享的高效LM后训练
- VLA-Adapter:小规模视觉-语言-动作模型的有效范式
- 语言模型为什么会产生幻觉
- 面向大型推理模型的强化学习综述
- 面向开放式生成的逆向工程推理
- HuMo:通过协作式多模态条件控制实现以人为中心的视频生成
- Parallel-R1:通过强化学习实现并行思考
- 多模态大语言模型的视觉表示对齐
- WebExplorer:通过探索与进化训练长时程Web代理
- SimpleVLA-RL:通过强化学习扩展VLA训练
- RewardDance:视觉生成中的奖励缩放
- MachineLearningLM:在数百万合成表格预测任务上持续预训练语言模型以扩展上下文ML能力
- Mini-o3:扩展推理模式与交互轮次以进行视觉搜索
- EchoX:通过回声训练缓解语音到语音LLM的声学-语义差距
- AgentGym-RL:通过多轮强化学习训练LLM代理进行长时程决策
- 3D与4D世界建模:综述
- 革新扩散大型语言模型的强化学习框架
- 集合块解码:一种语言模型推理加速器
- Kling-Avatar:面向级联长时长数字人动画合成的多模态指令接地
- 基于大语言模型的符号图形编程
- 驾驭不确定性:面向长时程LLM代理的熵调制策略梯度
- FLUX-Reason-6M与PRISM-Bench:百万级文本到图像推理数据集与全面基准
- 重建对齐改进统一多模态模型
- DINOv3是否树立了新的医学视觉标准?
- 理解与生成能否真正相互受益,还是只是共存?
- F1:桥接理解、生成与动作的视觉-语言-动作模型
- 面向深度研究系统的强化学习基础:综述
共享即关怀:基于集体RL经验共享的高效LM后训练
- 标题: Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
- 作者: Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
- 日期: 2025-09-10
- ArXiv主页 : https://arxiv.org/abs/2509.08721
- 论文链接 : https://arxiv.org/pdf/2509.08721
- 项目链接 : https://blog.gensyn.ai/sapo-efficient-lm-post-training-with-collective-rl/
- gitHub仓库 : https://github.com/gensyn-ai/rl-swarm
英文摘要
Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale-up inference, which introduces non-trivial technical challenges (e.g. latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogenous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required and nodes can operate in silo if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.
中文摘要
利用强化学习(RL)对语言模型(LM)进行后训练,可以在无需监督微调的情况下增强其复杂推理能力,DeepSeek-R1-Zero 已证明了这一点。然而,要有效地将 RL 用于 LM,需要大规模并行化来扩展推理,这带来了不小的技术挑战(例如延迟、内存和可靠性),同时财务成本也不断攀升。我们提出了 Swarm sAmpling Policy Optimization(SAPO),一种完全去中心化、异步的 RL 后训练算法。SAPO 专为由异构计算节点组成的去中心化网络而设计:每个节点管理自己的策略模型,同时与网络中的其他节点"共享"rollout;算法对延迟、模型同构性或硬件不做任何显式假设,节点也可以按需独立运行。因此,该算法避免了扩展 RL 后训练时的常见瓶颈,同时还允许(甚至鼓励)新的可能性。通过采样网络中"共享"的 rollout,它让"顿悟时刻"得以传播,从而引导整个学习过程。本文表明,SAPO 在受控实验中带来了最高 94% 的累积奖励增益。我们还分享了在一个由 Gensyn 社区成员贡献的数千节点网络上的测试见解,这些成员在开源演示期间于各种硬件和模型上运行了该算法。
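为直观说明上文"各节点共享 rollout、再从共享池中采样"的核心机制,下面给出一个极简的 Python 示意(非论文官方实现,batch_size、local_ratio 等参数均为假设),展示单个节点如何把本地 rollout 与来自网络的共享 rollout 混合成训练批次:

```python
import random

def build_training_batch(local_rollouts, swarm_rollouts, batch_size=8, local_ratio=0.5):
    """按比例混合本地 rollout 与网络中其他节点共享的 rollout,构成一个训练批次(示意)。"""
    n_local = int(batch_size * local_ratio)
    n_swarm = batch_size - n_local
    batch = random.sample(local_rollouts, min(n_local, len(local_rollouts)))
    batch += random.sample(swarm_rollouts, min(n_swarm, len(swarm_rollouts)))
    random.shuffle(batch)
    return batch

# 用法示意:rollout 用 (prompt, completion, reward) 三元组表示
local = [("q1", "a1", 1.0), ("q2", "a2", 0.0), ("q3", "a3", 1.0)]
shared = [("q4", "a4", 1.0), ("q5", "a5", 0.0), ("q6", "a6", 1.0), ("q7", "a7", 0.0)]
print(build_training_batch(local, shared, batch_size=4))
```

实际系统中共享池来自去中心化网络里的其他异构节点,采样比例与过滤细节请以论文和官方仓库为准。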
VLA-Adapter:小规模视觉-语言-动作模型的有效范式
- 标题: VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
- 作者: Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang
- 日期: 2025-09-11
- ArXiv主页 : https://arxiv.org/abs/2509.09372
- 论文链接 : https://arxiv.org/pdf/2509.09372
- 项目链接 : https://vla-adapter.github.io/
英文摘要
Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fast inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.
中文摘要
视觉-语言-动作(VLA)模型通常通过在机器人数据上预训练大规模视觉语言模型(VLM)来弥合感知空间与动作空间之间的差距。这种做法虽然大幅提升了性能,但也带来了高昂的训练成本。在本文中,我们研究如何有效地把视觉语言(VL)表示桥接到动作(A)。我们提出了 VLA-Adapter,一种旨在降低 VLA 模型对大规模 VLM 和大量预训练依赖的新范式。为此,我们首先系统分析了各种 VL 条件的有效性,并给出了哪些条件对桥接感知与动作空间至关重要的关键发现。基于这些洞察,我们提出了带有 Bridge Attention 的轻量级 Policy 模块,它能自主地将最优条件注入动作空间。通过这种方式,我们的方法仅使用 0.5B 参数的骨干网络、且不做任何机器人数据预训练,就能取得高性能。在仿真和真实机器人基准上的大量实验表明,VLA-Adapter 不仅达到了最先进水平的性能,还提供了迄今报告的最快推理速度。此外,得益于所提出的先进桥接范式,VLA-Adapter 只需一块消费级 GPU、8 小时即可训练出强大的 VLA 模型,大大降低了部署 VLA 模型的门槛。项目页面:https://vla-adapter.github.io/。
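为说明摘要中"用轻量 Policy 模块配合 Bridge Attention 把 VL 条件注入动作空间"的思路,下面给出一个基于 PyTorch 的极简示意(非官方实现;维度、层数与动作维度均为假设值):

```python
import torch
import torch.nn as nn

class BridgeAttentionPolicy(nn.Module):
    """示意性的轻量策略头:动作查询通过交叉注意力从 VL 条件特征中提取信息,再回归连续动作。"""
    def __init__(self, vl_dim=896, d_model=256, n_actions=7, chunk=8):
        super().__init__()
        self.action_queries = nn.Parameter(torch.randn(chunk, d_model))  # 可学习的动作查询
        self.proj = nn.Linear(vl_dim, d_model)                            # 将 VLM 中间特征投影到策略维度
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, n_actions)                         # 输出动作块(如末端位姿+夹爪)

    def forward(self, vl_tokens):                                         # vl_tokens: (B, N, vl_dim)
        kv = self.proj(vl_tokens)
        q = self.action_queries.unsqueeze(0).expand(vl_tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)                              # 动作查询对 VL 条件做交叉注意力
        return self.head(fused)                                            # (B, chunk, n_actions)

policy = BridgeAttentionPolicy()
actions = policy(torch.randn(2, 64, 896))
print(actions.shape)  # torch.Size([2, 8, 7])
```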
语言模型为什么会产生幻觉
- 标题: Why Language Models Hallucinate
- 作者: Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang
- 日期: 2025-09-04
- ArXiv主页 : https://arxiv.org/abs/2509.04664
- 论文链接 : https://arxiv.org/pdf/2509.04664
英文摘要
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
中文摘要
就像面对难题的考生一样,大型语言模型在不确定时有时会去猜,给出看似合理却不正确的陈述,而不是承认不确定。这种"幻觉"即使在最先进的系统中也依然存在,并削弱了人们的信任。我们认为语言模型之所以产生幻觉,是因为训练和评估流程奖励猜测而不是奖励承认不确定性,并且我们分析了现代训练管线中幻觉的统计学成因。幻觉并不神秘:它们本质上源于二分类中的错误。如果错误陈述无法与事实区分开,那么预训练语言模型中的幻觉就会在自然的统计压力下出现。我们进一步指出,幻觉之所以持续存在,是由于大多数评测的打分方式:语言模型被优化成了"好的应试者",在不确定时去猜能提升考试成绩。这种惩罚不确定回答的"流行病"只能通过社会-技术层面的缓解来解决:修改那些与目标错位却主导排行榜的现有基准的计分方式,而不是再引入额外的幻觉评测。这一改变有望引导该领域走向更值得信赖的 AI 系统。
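摘要的核心论点可以用一个简单的期望得分计算来说明:在常见的 0/1 评分下,只要答对概率大于 0,猜测的期望得分就不低于弃答,因此被优化成"应试者"的模型会倾向于猜;若对答错进行扣分(此处的扣分值为假设),弃答才可能成为更优策略。下面是一个对应的小计算:

```python
def expected_score(p_correct, wrong_penalty=0.0):
    """猜测的期望得分:答对得 1 分,答错扣 wrong_penalty 分;弃答(回答"不知道")得 0 分。"""
    return 1.0 * p_correct - wrong_penalty * (1.0 - p_correct)

for p in (0.1, 0.3, 0.5):
    guess_binary = expected_score(p, wrong_penalty=0.0)    # 常见基准的 0/1 评分:猜测永远不吃亏
    guess_penalized = expected_score(p, wrong_penalty=1.0) # 假设性的答错扣分评分:低置信时弃答更优
    print(f"p={p:.1f}  0/1评分下猜测期望={guess_binary:+.2f}  扣分评分下猜测期望={guess_penalized:+.2f}  弃答=+0.00")
```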
面向大型推理模型的强化学习综述
- 标题: A Survey of Reinforcement Learning for Large Reasoning Models
- 作者: Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou
- 日期: 2025-09-10
- ArXiv主页 : https://arxiv.org/abs/2509.08827
- 论文链接 : https://arxiv.org/pdf/2509.08827
- 项目链接 : https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
- gitHub仓库 : https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
英文摘要
In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
中文摘要
在本文中,我们综述了利用强化学习(RL)提升大语言模型(LLM)推理能力的最新进展。RL 在推进 LLM 能力边界方面取得了显著成功,尤其是在数学和编程等复杂逻辑任务上。因此,RL 已成为把 LLM 转化为 LRM(大型推理模型)的一项基础方法。随着该领域的快速发展,面向 LRM 的 RL 进一步扩展如今不仅在计算资源上面临基础性挑战,在算法设计、训练数据和基础设施方面也是如此。为此,现在正是重新审视这一领域的发展、重新评估其轨迹,并探索提升 RL 向人工超级智能(ASI)扩展能力的策略的时机。特别地,我们考察了将 RL 应用于 LLM 与 LRM 以提升推理能力的研究,尤其是 DeepSeek-R1 发布以来的工作,内容涵盖基础组件、核心问题、训练资源以及下游应用,以确定这一快速演进领域的未来机会与方向。我们希望这篇综述能促进面向更广泛推理模型的 RL 研究。Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
面向开放式生成的逆向工程推理
- 标题: Reverse-Engineered Reasoning for Open-Ended Generation
- 作者: Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Wei Ye, Tong Yang, Wenhao Huang, Ge Zhang, Fangzhen Lin
- 日期: 2025-09-07
- ArXiv主页 : https://arxiv.org/abs/2509.06160
- 论文链接 : https://arxiv.org/pdf/2509.06160
- 项目链接 : https://m-a-p.ai/REER_DeepWriter/
- gitHub仓库 : https://github.com/multimodal-art-projection/REER_DeepWriter
英文摘要
While the "deep reasoning" paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning -- reinforcement learning (RL) and instruction distillation -- falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process "forwards" through trial-and-error or imitation, REER works "backwards" from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.
中文摘要
尽管"深度推理"范式在数学等可验证领域推动了重大进展,但将其应用于开放式的创造性生成仍是一个关键挑战。当前灌输推理能力的两种主流方法,强化学习(RL)与指令蒸馏,在这一领域都表现不佳:RL 受困于缺乏明确的奖励信号和高质量的奖励模型,而蒸馏成本高昂,且上限受制于教师模型的能力。为克服这些限制,我们提出了逆向工程推理(REER),一种从根本上转变思路的新范式。REER 不是通过试错或模仿"向前"构建推理过程,而是从已知的优质解出发"向后"工作,以计算方式发现可能产生这些解的潜在、逐步的深度推理过程。借助这种可扩展、无需梯度的方法,我们构建并开源了 DeepWriting-20K,一个包含 2 万条面向开放式任务的深度推理轨迹的大规模数据集。在这些数据上训练的模型 DeepWriter-8B 不仅超越了强大的开源基线,还取得了与 GPT-4o、Claude 3.5 等领先闭源模型相当、有时更优的性能。
HuMo:通过协作式多模态条件控制实现以人为中心的视频生成
- 标题: HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
- 作者: Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu
- 日期: 2025-09-10
- ArXiv主页 : https://arxiv.org/abs/2509.08519
- 论文链接 : https://arxiv.org/pdf/2509.08519
- 项目链接 : https://phantom-video.github.io/HuMo/
- gitHub仓库 : https://github.com/Phantom-video/HuMo
英文摘要
Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of collaborating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt the minimal-invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
中文摘要
以人为中心的视频生成(HCVG)方法旨在从文本、图像和音频等多模态输入中合成人物视频。由于两大挑战,现有方法难以有效协调这些异构模态:一是带有成对三元条件的训练数据稀缺,二是难以让"主体保持"与"音画同步"这两个子任务与多模态输入协同配合。在这项工作中,我们提出了 HuMo,一个面向协同多模态控制的统一 HCVG 框架。针对第一个挑战,我们构建了一个包含多样且配对的文本、参考图像和音频的高质量数据集。针对第二个挑战,我们提出了带有任务专属策略的两阶段渐进式多模态训练范式。对于主体保持任务,为了保留基础模型的提示跟随与视觉生成能力,我们采用了最小侵入式的图像注入策略。对于音画同步任务,除了常用的音频交叉注意力层之外,我们还提出了一种"以预测引导聚焦"(focus-by-predicting)策略,隐式地引导模型将音频与面部区域关联起来。为了在已有能力的基础上对多模态输入的可控性进行联合学习,我们逐步引入音画同步任务。在推理阶段,为了实现灵活而细粒度的多模态控制,我们设计了一种时间自适应的无分类器引导(Classifier-Free Guidance)策略,可在去噪步骤之间动态调整引导权重。大量实验结果表明,HuMo 在各子任务上超越了专门的最先进方法,为协同多模态条件化的 HCVG 建立了统一框架。项目页面:https://phantom-video.github.io/HuMo。
Parallel-R1:通过强化学习实现并行思考
- 标题: Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
- 作者: Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
- 日期: 2025-09-09
- ArXiv主页 : https://arxiv.org/abs/2509.07980
- gitHub仓库 : https://github.com/zhengkid/Parallel-R1
英文摘要
Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.
中文摘要
并行思考通过同时探索多条推理路径,已成为增强大语言模型(LLM)推理能力的一种新方法。然而,通过训练来激活这种能力仍然困难,因为现有方法主要依赖在合成数据上做监督微调(SFT),这鼓励的是教师强制式的模仿,而不是探索与泛化。与之不同,我们提出了 Parallel-R1,第一个能在复杂真实推理任务上激发并行思考行为的强化学习(RL)框架。我们的框架采用渐进式课程,明确解决用 RL 训练并行思考时的冷启动问题:先在由提示生成的较简单任务轨迹上做 SFT 以注入并行思考能力,再转入 RL,在更难的问题上探索并泛化这一技能。在 MATH、AMC23 和 AIME 等多个数学基准上的实验表明,Parallel-R1 成功注入了并行思考,相比直接用 RL 在困难任务上训练的顺序思考模型,准确率提升 8.4%。进一步分析揭示了模型思考行为的明显转变:早期它把并行思考当作探索策略,后期则用同样的能力做多视角验证。最重要的是,我们验证了并行思考可以作为训练中期的探索脚手架:这一临时的探索阶段在 RL 之后解锁了更高的性能上限,在 AIME25 上相对基线提升 42.9%。我们的模型、数据和代码将在 https://github.com/zhengkid/Parallel-R1 开源。
多模态大语言模型的视觉表示对齐
- 标题: Visual Representation Alignment for Multimodal Large Language Models
- 作者: Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim
- 日期: 2025-09-09
- ArXiv主页 : https://arxiv.org/abs/2509.07979
- 论文链接 : https://arxiv.org/pdf/2509.07979
- 项目链接 : https://praeclarumjj3.github.io/ola_vlm/
- gitHub仓库 : https://github.com/cvlab-kaist/VIRAL
英文摘要
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
中文摘要
经过视觉指令微调训练的多模态大语言模型(MLLM)在各类任务上取得了强劲性能,但在以视觉为中心的任务(如物体计数或空间推理)上仍然受限。我们把这一差距归因于当前主流的纯文本监督范式:它只为视觉通路提供间接引导,常常导致 MLLM 在训练中丢弃细粒度的视觉细节。在本文中,我们提出视觉表示对齐(VIRAL),一种简单而有效的正则化策略,将 MLLM 的内部视觉表示与预训练视觉基础模型(VFM)的表示对齐。通过显式地施加这种对齐,VIRAL 使模型不仅能保留来自输入视觉编码器的关键视觉细节,还能补充来自 VFM 的额外视觉知识,从而增强其对复杂视觉输入的推理能力。我们的实验表明,在广泛使用的多模态基准上,各项任务均获得了一致的提升。此外,我们进行了全面的消融研究,以验证框架背后的关键设计选择。我们相信,这一简单的发现为在 MLLM 训练中有效整合视觉信息打开了一个重要方向。
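摘要中的对齐正则可以理解为:在原有文本损失之外,把 MLLM 中间层的视觉 token 表示投影后,与冻结 VFM(如 DINO 系列)的特征做相似度约束。下面是一个极简的 PyTorch 示意(非官方实现;投影方式、损失形式与权重均为假设):

```python
import torch
import torch.nn.functional as F

def viral_style_loss(mllm_visual_feats, vfm_feats, lm_loss, proj, weight=0.5):
    """把投影后的 MLLM 视觉 token 表示与冻结 VFM 特征做余弦对齐,作为正则项加到文本损失上(示意)。"""
    aligned = proj(mllm_visual_feats)                      # (B, N, D_vfm)
    cos = F.cosine_similarity(aligned, vfm_feats, dim=-1)  # (B, N)
    align_loss = (1.0 - cos).mean()
    return lm_loss + weight * align_loss

# 用法示意:特征维度为假设值
B, N, D_mllm, D_vfm = 2, 196, 1024, 768
proj = torch.nn.Linear(D_mllm, D_vfm)
loss = viral_style_loss(torch.randn(B, N, D_mllm), torch.randn(B, N, D_vfm),
                        lm_loss=torch.tensor(2.3), proj=proj)
print(loss)
```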
WebExplorer:通过探索与进化训练长时程Web代理
- 标题: WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
- 作者: Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He
- 日期: 2025-09-08
- ArXiv主页 : https://arxiv.org/abs/2509.06501
- gitHub仓库 : https://github.com/hkust-nlp/WebExplorer
英文摘要
The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
中文摘要
大语言模型(LLM)的范式正日益转向智能体应用,其中网页浏览能力是从多样的在线来源检索信息的基础。然而,现有的开源网页代理要么在复杂任务上表现出有限的信息检索能力,要么缺乏透明的实现。在这项工作中,我们发现关键挑战在于高难度信息检索数据的稀缺。为了解决这一限制,我们提出了 WebExplorer:一种系统化的数据生成方法,采用基于模型的探索以及由长到短的迭代式查询演化,构造出需要多步推理与复杂网页导航的高难度问答对。利用我们精心构建的高质量数据集,我们通过监督微调加强化学习,成功训练出先进的网页代理 WebExplorer-8B。我们的模型支持 128K 上下文长度和最多 100 轮工具调用,能够解决长时程问题。在多个信息检索基准上,WebExplorer-8B 在同等规模下达到了最先进的性能。值得注意的是,作为一个 8B 规模的模型,WebExplorer-8B 在 RL 训练后能够平均进行 16 轮有效搜索,在 BrowseComp-en/zh 上取得了比 WebSailor-72B 更高的准确率,并在 WebWalkerQA 和 FRAMES 上取得了 100B 参数以内模型中的最佳性能。除这些信息检索任务之外,尽管只在知识密集型问答数据上训练,我们的模型在 HLE 基准上也表现出很强的泛化能力。这些结果凸显了我们的方法是通向长时程网页代理的一条实用路径。
SimpleVLA-RL:通过强化学习扩展VLA训练
- 标题: SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
- 作者: Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding
- 日期: 2025-09-11
- ArXiv主页 : https://arxiv.org/abs/2509.09674
- gitHub仓库 : https://github.com/PRIME-RL/SimpleVLA-RL
英文摘要
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms pi_0 on RoboTwin 1.0&2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon ``pushcut'' during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL
中文摘要
视觉-语言-动作(VLA)模型最近已成为机器人操作的强大范式。尽管大规模预训练和监督微调(SFT)带来了显著进展,这些模型仍面临两个根本挑战:(i)SFT 扩展所需的大规模人工操作机器人轨迹稀缺且成本高昂;(ii)对涉及分布偏移的任务泛化能力有限。大型推理模型(LRM)的最新突破表明,强化学习(RL)可以显著增强逐步推理能力,这引出一个自然的问题:RL 能否同样改进 VLA 的长时程逐步动作规划?在这项工作中,我们提出了 SimpleVLA-RL,一个专为 VLA 模型定制的高效 RL 框架。在 veRL 的基础上,我们引入了面向 VLA 的轨迹采样、可扩展的并行化、多环境渲染以及优化的损失计算。应用于 OpenVLA-OFT 时,SimpleVLA-RL 在 LIBERO 上取得了 SoTA 性能,并且借助我们引入的探索增强策略,在 RoboTwin 1.0 和 2.0 上甚至超过了 pi_0。SimpleVLA-RL 不仅降低了对大规模数据的依赖并实现了稳健的泛化,在真实世界任务中还显著超越了 SFT。此外,我们在 RL 训练中发现了一种新颖现象"pushcut":策略会发现先前训练过程中未曾出现过的新模式。Github: https://github.com/PRIME-RL/SimpleVLA-RL
RewardDance:视觉生成中的奖励缩放
- 标题: RewardDance: Reward Scaling in Visual Generation
- 作者: Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang
- 日期: 2025-09-10
- ArXiv主页 : https://arxiv.org/abs/2509.08826
- 论文链接 : https://arxiv.org/pdf/2509.08826
英文摘要
Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. It primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by Reward Hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of "reward hacking": Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. It greatly relieves the mode collapse problem that plagues smaller models.
中文摘要
奖励模型(RM)对于通过强化学习(RL)改进生成模型至关重要,但视觉生成中的 RM 扩展范式在很大程度上仍未被探索。这主要源于现有方法的根本局限:基于 CLIP 的 RM 受制于架构和输入模态的约束,而流行的 Bradley-Terry 损失与视觉语言模型(VLM)的下一 token 预测机制在本质上不匹配,阻碍了有效扩展。更关键的是,RLHF 优化过程饱受"奖励劫持"(Reward Hacking)问题困扰:模型利用奖励信号中的缺陷,却没有真正提升质量。为应对这些挑战,我们提出了 RewardDance,一个可扩展的奖励建模框架,通过一种新颖的生成式奖励范式克服上述障碍。通过把奖励分数重新表述为模型预测"yes" token 的概率(即依据特定标准判断生成图像优于参考图像),RewardDance 从本质上让奖励目标与 VLM 架构对齐。这种对齐解锁了两个维度的扩展:(1)模型扩展:RM 规模系统性地扩展到 260 亿参数;(2)上下文扩展:整合任务特定指令、参考示例以及思维链(CoT)推理。大量实验表明,RewardDance 在文生图、文生视频和图生视频任务上显著超越了最先进的方法。至关重要的是,我们解决了"奖励劫持"这一长期挑战:我们的大规模 RM 在 RL 微调过程中表现出并保持较高的奖励方差,证明了其抗劫持能力以及产生多样化、高质量输出的能力,并大大缓解了困扰较小模型的模式坍缩问题。
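摘要中"把奖励分数定义为模型预测 'yes' token 的概率"的生成式奖励,可以用下面的极简示意来说明(非官方实现;token id 为假设值,实际奖励模型还会接收任务指令、参考图像等上下文):

```python
import torch

def generative_reward(logits, yes_id, no_id):
    """奖励模型回答"生成图像是否优于参考图像",奖励取该判断位置上 "yes" token 的概率。
    此处仅在 yes/no 两个 token 上归一化,属于示意性简化。"""
    pair = torch.stack([logits[..., yes_id], logits[..., no_id]], dim=-1)
    return torch.softmax(pair, dim=-1)[..., 0]   # P("yes")

vocab = 32000
logits = torch.randn(4, vocab)                   # 4 个候选图像在判断位置的词表 logits(随机演示)
print(generative_reward(logits, yes_id=9891, no_id=2360))
```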
MachineLearningLM:在数百万合成表格预测任务上持续预训练语言模型以扩展上下文ML能力
- 标题: MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML
- 作者: Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
- 日期: 2025-09-08
- ArXiv主页 : https://arxiv.org/abs/2509.06806
- 论文链接 : https://arxiv.org/pdf/2509.06806
- 项目链接 : https://huggingface.co/MachineLearningLM/
- gitHub仓库 : https://github.com/HaoAreYuDong/MachineLearningLM
英文摘要
Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
中文摘要
大语言模型(LLM)拥有广博的世界知识和强大的通用推理能力,但在标准机器学习(ML)任务上,它们很难从大量上下文示例中学习,也就是说,难以在不使用梯度下降的情况下,仅凭上下文学习(ICL)利用多样本示范。我们提出了 MachineLearningLM,一个可移植的持续预训练框架,它让通用 LLM 获得稳健的上下文 ML 能力,同时保留其通用知识和面向更广泛聊天工作流的推理能力。我们的预训练过程从数百万个结构因果模型(SCM)中合成 ML 任务,示例数量最多可达 1,024。我们先使用随机森林教师,把基于树的决策策略蒸馏进 LLM,以增强数值建模的鲁棒性。所有任务都用节省 token 的提示进行序列化,使每个上下文窗口可容纳 3 到 6 倍的示例,并通过批量推理实现最高 50 倍的摊销吞吐量。尽管设置并不庞大(Qwen-2.5-7B-Instruct 加 LoRA rank 8),MachineLearningLM 在金融、物理、生物和医疗等领域的分布外表格分类上,平均超出强 LLM 基线(如 GPT-5-mini)约 15%。它展现出显著的多样本扩展规律:随着上下文示范从 8 条增加到 1,024 条,准确率单调上升。在不做任何任务特定训练的情况下,它在数百条示例的设置下达到了随机森林级别的准确率。通用聊天能力(包括知识与推理)得以保留:它在 MMLU 上达到 75.4%。
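摘要提到所有任务都被序列化为"节省 token 的提示",以便在上下文中放入上千条示例。下面是一个极简的序列化示意(并非论文所用模板,分隔符与格式均为假设):

```python
def serialize_tabular_task(feature_names, shots, query_row):
    """把一个表格分类任务压缩成紧凑的文本提示,便于在上下文中塞入大量示例(示意)。"""
    header = ",".join(feature_names)
    lines = [f"features: {header}"]
    for x, y in shots:                      # 每条示例一行:特征值 -> 标签
        lines.append(",".join(map(str, x)) + " -> " + str(y))
    lines.append(",".join(map(str, query_row)) + " -> ?")
    return "\n".join(lines)

shots = [((5.1, 3.5, 1.4, 0.2), "A"), ((6.7, 3.0, 5.2, 2.3), "B")]
print(serialize_tabular_task(["f1", "f2", "f3", "f4"], shots, (6.0, 2.9, 4.5, 1.5)))
```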
Mini-o3:扩展推理模式与交互轮次以进行视觉搜索
- 标题: Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
- 作者: Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao
- 日期: 2025-09-09
- ArXiv主页 : https://arxiv.org/abs/2509.07969
- 论文链接 : https://arxiv.org/pdf/2509.07969
- 项目链接 : https://mini-o3.github.io/
- gitHub仓库 : https://github.com/Mini-o3/Mini-o3
英文摘要
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
中文摘要
大型多模态模型的最新进展利用基于图像的工具结合强化学习来解决视觉问题。然而,现有开源方法往往表现出单调的推理模式,且只允许有限的交互轮数,难以胜任需要反复试错探索的困难任务。在这项工作中,我们通过扩大基于工具的交互规模来解决这一限制,并提出了 Mini-o3,一个能够执行跨越数十步的深度多轮推理、并在高难度视觉搜索任务上达到最先进性能的系统。我们复现 OpenAI o3 式行为的配方包含三个关键部分。第一,我们构建了 Visual Probe 数据集,汇集了数千个为探索式推理设计的高难度视觉搜索问题。第二,我们开发了迭代式数据收集管线,获取展现多样推理模式(包括深度优先搜索、试错和目标维持)的冷启动轨迹。第三,我们提出了"超轮掩码"(over-turn masking)策略,在强化学习中不惩罚达到最大轮数上限的回答,从而在训练时效率与测试时可扩展性之间取得平衡。尽管训练时交互轮数上限仅为六轮,我们的模型在推理时生成的轨迹可以自然扩展到数十轮,并且随着轮数增加准确率不断提升。大量实验表明,Mini-o3 能够产生丰富的推理模式和深入的思考路径,有效解决高难度的视觉搜索问题。
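摘要中的"超轮掩码"(over-turn masking)思路,可以用下面的极简示意来说明:对因达到交互轮数上限而被截断的轨迹不施加惩罚(权重置 0),其余轨迹正常计算优势(非官方实现,baseline 的取法为假设):

```python
def masked_advantages(rewards, turns, max_turns, baseline):
    """对"因达到轮数上限而被截断"的轨迹,把学习信号置 0(不惩罚);
    其余轨迹按 reward - baseline 计算优势。这是对 over-turn masking 思路的极简近似。"""
    advs = []
    for r, t in zip(rewards, turns):
        if t >= max_turns:          # 超轮截断:不计入策略梯度
            advs.append(0.0)
        else:
            advs.append(r - baseline)
    return advs

rewards = [1.0, 0.0, 0.0, 1.0]
turns   = [3,   6,   4,   5]
print(masked_advantages(rewards, turns, max_turns=6, baseline=0.5))
```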
EchoX:通过回声训练缓解语音到语音LLM的声学-语义差距
- 标题: EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
- 作者: Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
- 日期: 2025-09-11
- ArXiv主页 : https://arxiv.org/abs/2509.09174
- 论文链接 : https://arxiv.org/pdf/2509.09174
- 项目链接 : https://freedomintelligence.github.io/EchoX/
- gitHub仓库 : https://github.com/FreedomIntelligence/EchoX
英文摘要
Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
中文摘要
语音到语音的大语言模型(SLLM)正受到越来越多的关注。SLLM 由基于文本的大语言模型(LLM)派生而来,常常在知识和推理能力上出现退化。我们假设这一限制的成因在于:当前 SLLM 的训练范式未能弥合特征表示空间中的声学-语义差距。为了解决这个问题,我们提出了 EchoX,它利用语义表示并动态生成语音训练目标。这种方法同时整合了声学学习和语义学习,使 EchoX 作为语音 LLM 仍能保留强大的推理能力。实验结果表明,EchoX 仅使用约六千小时的训练数据,就在多个基于知识的问答基准上取得了领先的性能。该项目可在 https://github.com/FreedomIntelligence/EchoX 获取。
AgentGym-RL:通过多轮强化学习训练LLM代理进行长时程决策
- 标题: AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
- 作者: Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
- 日期: 2025-09-10
- ArXiv主页 : https://arxiv.org/abs/2509.08755
- 论文链接 : https://arxiv.org/pdf/2509.08755
- 项目链接 : https://agentgym-rl.github.io/
- gitHub仓库 : https://github.com/WooooDyy/AgentGym-RL
英文摘要
Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.
中文摘要
开发能够做出一系列智能决策以解决复杂真实任务的自主 LLM 代理,是一个快速发展的前沿方向。与人类的认知发展类似,代理需要通过探索和与环境交互来获取知识与技能。尽管已有进展,社区仍缺乏一个统一的、交互式的强化学习(RL)框架,能够在多样而真实的环境中、不依赖监督微调(SFT)地从零开始有效训练此类代理。为弥合这一差距,我们提出了 AgentGym-RL,一个通过 RL 训练 LLM 代理进行多轮交互决策的新框架。该框架采用模块化、解耦的架构,具备高度的灵活性与可扩展性,涵盖多种真实场景,并支持主流 RL 算法。此外,我们提出了 ScalingInter-RL,一种兼顾探索-利用平衡与稳定 RL 优化的训练方法:在早期阶段,它通过限制交互轮数来侧重利用,然后逐步放宽到更长的视野以鼓励探索,促成多样化的问题求解策略。这样,代理能发展出更丰富的行为,也更不容易在长视野下崩溃。我们进行了大量实验来验证 AgentGym-RL 框架和 ScalingInter-RL 方法的稳定性与有效性。我们的代理在多种环境的 27 个任务上达到或超越了商业模型。我们给出了关键洞见,并将开源完整的 AgentGym-RL 框架(包括代码和数据集),以助力研究社区开发下一代智能代理。
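摘要中 ScalingInter-RL"早期限制交互轮数、随后逐步放宽视野"的思想,可以用一个简单的轮数预算日程来示意(数值与线性日程均为假设,并非论文采用的具体设置):

```python
def interaction_budget(step, total_steps, min_turns=5, max_turns=30):
    """训练早期限制交互轮数(偏利用),随训练进度线性放宽到更长视野(偏探索)。"""
    frac = min(1.0, step / max(1, total_steps))
    return int(round(min_turns + frac * (max_turns - min_turns)))

for s in (0, 2500, 5000, 10000):
    print(s, interaction_budget(s, total_steps=10000))
```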
3D与4D世界建模:综述
- 标题: 3D and 4D World Modeling: A Survey
- 作者: Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu
- 日期: 2025-09-04
- ArXiv主页 : https://arxiv.org/abs/2509.07996
- 论文链接 : https://arxiv.org/pdf/2509.07996
- 项目链接 : https://worldbench.github.io/survey
- gitHub仓库 : https://github.com/worldbench/survey
英文摘要
World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, they overlook the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for ``world models'' has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey
中文摘要
世界建模已成为 AI 研究的基石,使智能体能够理解、表示并预测其所处的动态环境。以往工作大多强调 2D 图像和视频数据的生成式方法,却忽视了快速增长的、利用原生 3D 与 4D 表示(如 RGB-D 图像、占据栅格和 LiDAR 点云)进行大规模场景建模的研究。同时,"世界模型"缺乏标准化的定义和分类体系,导致文献中的表述零散,有时甚至相互矛盾。本综述通过给出第一篇专门面向 3D 与 4D 世界建模与生成的全面回顾来填补这些空白。我们建立了精确的定义,提出了涵盖基于视频(VideoGen)、基于占据(OccGen)和基于 LiDAR(LiDARGen)三类方法的结构化分类体系,并系统总结了面向 3D/4D 设定的数据集与评测指标。我们进一步讨论了实际应用,指出了开放挑战,并强调了有前景的研究方向,旨在为推进该领域提供连贯的基础性参考。现有文献的系统性汇总见 https://github.com/worldbench/survey
革新扩散大型语言模型的强化学习框架
- 标题: Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
- 作者: Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang
- 日期: 2025-09-08
- ArXiv主页 : https://arxiv.org/abs/2509.06949
- 论文链接 : https://arxiv.org/pdf/2509.06949
- 项目链接 : https://huggingface.co/collections/Gen-Verse/trado-series-68beb6cd6a26c27cde9fe3af
- gitHub仓库 : https://github.com/Gen-Verse/dLLM-RL
英文摘要
We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates preferred inference trajectory into post-training, and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. Besides, it can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL
中文摘要
我们提出了 TraceRL,一个面向扩散语言模型(DLM)的轨迹感知强化学习框架,它把偏好的推理轨迹纳入后训练,并适用于不同的架构。配合一个可增强训练稳定性的基于扩散的价值模型,我们在复杂数学与编程任务上展示了推理性能的提升。此外,它还可用于让特定块大小的模型适配更大的块,从而提高采样灵活性。利用 TraceRL,我们得到了一系列最先进的扩散语言模型,即 TraDo。尽管规模小于 7B 级的自回归(AR)模型,TraDo-4B-Instruct 在复杂数学推理任务上仍然稳定地优于它们。在数学推理基准上,TraDo-8B-Instruct 相对 Qwen2.5-7B-Instruct 提升 6.1%,相对 Llama3.1-8B-Instruct 提升 51.3%(相对准确率)。通过课程学习,我们还得到了第一个长思维链(long-CoT)DLM,在 MATH500 上以 18.1% 的相对准确率增益超越 Qwen2.5-7B-Instruct。为了促进可复现研究与实际应用,我们发布了一个完整的开源框架,用于在不同架构上构建、训练和部署扩散 LLM。该框架集成了加速的 KV-cache 技术以及面向推理和强化学习的推理引擎,并包含数学、编程和通用任务上多种监督微调与 RL 方法的实现。代码与模型:https://github.com/Gen-Verse/dLLM-RL
集合块解码:一种语言模型推理加速器
- 标题: Set Block Decoding is a Language Model Inference Accelerator
- 作者: Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman
- 日期: 2025-09-04
- ArXiv主页 : https://arxiv.org/abs/2509.04185
- 论文链接 : https://arxiv.org/pdf/2509.04185
英文摘要
Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving same performance as equivalent NTP training.
中文摘要
自回归的下一 token 预测语言模型功能强大,但由于推理(尤其是解码阶段)的高计算与内存开销,在实际部署中面临重大挑战。我们提出了集合块解码(Set Block Decoding,SBD),一种简单灵活的范式,通过在单一架构中整合标准的下一 token 预测(NTP)与掩码 token 预测(MATP)来加速生成。SBD 允许模型并行采样多个未来 token,且这些 token 不必相邻,这是与以往加速方法的关键区别。这种灵活性使我们可以借用离散扩散文献中的先进求解器,在不牺牲准确率的情况下获得显著加速。SBD 不需要架构改动或额外的训练超参数,保持与精确 KV 缓存的兼容性,并且可以通过微调现有的下一 token 预测模型来实现。通过微调 Llama-3.1 8B 和 Qwen-3 8B,我们证明 SBD 能把生成所需的前向传播次数减少 3 到 5 倍,同时达到与等价 NTP 训练相同的性能。
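摘要强调 SBD 可以并行采样多个、不必相邻的未来 token。下面用一个 MaskGIT 风格的简化填充循环来示意"非连续并行填充"这一思想(仅为示意,并非论文中基于离散扩散求解器的实际解码过程;predict_logits 为假设的模型接口):

```python
import torch

def fill_block_parallel(predict_logits, block_len, steps=4):
    """每一步对仍被掩码的位置并行预测,只接受置信度最高的一部分位置(不必相邻),直到整块填完。
    predict_logits(tokens, mask) 假设返回每个位置的词表 logits。"""
    tokens = torch.full((block_len,), -1, dtype=torch.long)   # -1 表示掩码位置
    mask = tokens.eq(-1)
    for s in range(steps):
        logits = predict_logits(tokens, mask)                  # (block_len, vocab)
        probs, cand = logits.softmax(-1).max(-1)               # 每个位置的最优 token 及其置信度
        probs = probs.masked_fill(~mask, -1.0)                 # 已填位置不再参与
        k = max(1, int(mask.sum().item() / (steps - s)))       # 本步接受的位置数
        accept = probs.topk(k).indices
        tokens[accept] = cand[accept]
        mask = tokens.eq(-1)
        if not mask.any():
            break
    return tokens

# 用一个随机"模型"演示接口(真实场景中由微调后的 LM 给出)
vocab = 100
demo_model = lambda toks, m: torch.randn(toks.numel(), vocab)
print(fill_block_parallel(demo_model, block_len=8))
```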
Kling-Avatar:面向级联长时长数字人动画合成的多模态指令接地
- 标题: Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
- 作者: Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan
- 日期: 2025-09-11
- ArXiv主页 : https://arxiv.org/abs/2509.09595
- 论文链接 : https://arxiv.org/pdf/2509.09595
- 项目链接 : https://klingavatar.github.io/
英文摘要
Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
中文摘要
音频驱动的数字人视频生成最近取得的进展显著提升了音画真实感。然而,现有方法仅把指令条件当作由声学或视觉线索驱动的低层跟踪,而没有对指令所传达的交流意图进行建模,这损害了叙事连贯性和角色表现力。为弥合这一差距,我们提出了 Kling-Avatar,一个将多模态指令理解与照片级真实感人像生成统一起来的新型级联框架。我们的方法采用两阶段管线:第一阶段,我们设计了一个多模态大语言模型(MLLM)导演,根据多样的指令信号生成一段蓝图视频,从而掌控角色动作与情绪等高层语义;第二阶段,在蓝图关键帧的引导下,我们采用首尾帧策略并行生成多个子片段。这种从全局到局部的框架在忠实编码多模态指令背后高层意图的同时保留了细粒度细节。并行架构还支持快速而稳定地生成长时长视频,适合数字人直播、视频博客等现实应用。为了全面评估我们的方法,我们构建了一个包含 375 个精心挑选样本的基准,覆盖多样指令和高难度场景。大量实验表明,Kling-Avatar 能以最高 1080p、48 fps 生成生动流畅的长时长视频,在口型同步精度、情绪与动态表现力、指令可控性、身份保持以及跨域泛化等方面均取得优越性能。这些结果确立了 Kling-Avatar 作为语义接地、高保真音频驱动数字人合成的新标杆。
基于大语言模型的符号图形编程
- 标题: Symbolic Graphics Programming with Large Language Models
- 作者: Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu
- 日期: 2025-09-05
- ArXiv主页 : https://arxiv.org/abs/2509.05208
- 论文链接 : https://arxiv.org/pdf/2509.05208
- 项目链接 : https://spherelab.ai/SGP-Gen/
- gitHub仓库 : https://github.com/simonw/pelican-bicycle
英文摘要
Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
中文摘要
大语言模型(LLM)擅长程序合成,但它们生成能渲染出精确视觉内容的符号图形程序(SGP)的能力仍未被充分探索。我们研究符号图形编程,其目标是从自然语言描述生成 SGP。这一任务同时也是观察 LLM 如何理解视觉世界的一个窗口,因为它促使模型生成由 SGP 渲染而成的图像。在各类 SGP 中,本文聚焦于可缩放矢量图形(SVG)。我们首先考察 LLM 生成 SGP 的能力。为此,我们提出了 SGP-GenBench,一个涵盖物体保真度、场景保真度和组合性(属性绑定、空间关系、数量)的全面基准。在 SGP-GenBench 上,我们发现前沿闭源模型明显优于开源模型,而且性能与通用编码能力高度相关。受这一差距的启发,我们希望提升 LLM 生成 SGP 的能力。我们提出了一种带可验证奖励的强化学习(RL)方法:格式有效性门控确保生成的 SVG 可渲染,跨模态奖励则借助强视觉编码器(如文本-图像用 SigLIP、图像-图像用 DINO)对齐文本与渲染图像。应用到 Qwen-2.5-7B 上,我们的方法显著提升了 SVG 的生成质量与语义,性能达到与前沿系统相当的水平。我们进一步分析了训练动态,发现 RL 会带来(i)把物体更细地分解为可控图元,以及(ii)改善场景一致性的上下文细节。我们的结果表明,符号图形编程为跨模态落地提供了一个精确且可解释的视角。
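摘要中的可验证奖励由"格式有效性门控 + 跨模态相似度"组成。下面的极简示意用"能否解析为合法 SVG"近似可渲染性门控,并用一个占位的打分函数代替真实的 SigLIP/DINO 相似度计算(均为假设,非官方实现):

```python
import xml.etree.ElementTree as ET

def sgp_reward(svg_code, prompt, clip_score):
    """可验证奖励 = 格式有效性门控 × 跨模态相似度(示意)。
    实际实现会先把 SVG 渲染成图像,再用视觉编码器计算图文(或图图)相似度。"""
    try:
        root = ET.fromstring(svg_code)
        if not root.tag.endswith("svg"):
            return 0.0                      # 不是 SVG:奖励为 0
    except ET.ParseError:
        return 0.0                          # 无法解析:奖励为 0
    return clip_score(prompt, svg_code)     # 通过门控后才计算语义对齐奖励

demo_score = lambda prompt, svg: 0.42       # 占位打分函数(假设)
good = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="4"/></svg>'
print(sgp_reward(good, "a red circle", demo_score), sgp_reward("<p>hi</p>", "a red circle", demo_score))
```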
驾驭不确定性:面向长时程LLM代理的熵调制策略梯度
- 标题: Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
- 作者: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
- 日期: 2025-09-11
- ArXiv主页 : https://arxiv.org/abs/2509.09265
- 论文链接 : https://arxiv.org/pdf/2509.09265
- 项目链接 : https://empgseed-seed.github.io/
英文摘要
In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge that sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficient small updates for confident correct actions and potentially destabilizes large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed-seed.github.io/
中文摘要
在长时程任务中,基于大语言模型(LLM)的最新代理面临一个重大挑战:稀疏的、基于最终结果的奖励使得很难把功劳分配给中间步骤。以往的方法主要着眼于构造稠密的奖励信号来引导学习,要么采用逆强化学习等传统强化学习技术,要么使用过程奖励模型提供逐步反馈。在本文中,我们指出了 LLM 学习动态中的一个基本问题:策略梯度的大小天然地与熵耦合,这会导致对自信且正确的动作只做低效的小幅更新,而对不确定的动作做可能破坏稳定的大幅更新。为此,我们提出熵调制策略梯度(EMPG),一个根据步级不确定性和最终任务结果重新校准学习信号的框架。EMPG 放大自信且正确动作的更新,惩罚自信的错误,并衰减来自不确定步骤的更新以稳定探索。我们还引入了一个鼓励"未来清晰度"的奖励项,促使代理寻找更可预测的求解路径。通过在 WebShop、ALFWorld 和 Deep Search 三个具有挑战性的代理任务上的全面实验,我们证明 EMPG 取得了可观的性能提升,并显著优于强策略梯度基线。项目页面:https://empgseed-seed.github.io/
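摘要中"熵调制"的直觉是:按步级不确定性重新加权结果级优势,使自信且正确的步骤获得更大更新、自信的错误受到更重惩罚、不确定的步骤被衰减。下面是一个体现该思想的极简示意(权重函数的具体形状为假设,并非论文公式):

```python
import math

def empg_style_weight(step_entropy, max_entropy, alpha=1.0):
    """步级熵越小(越自信),该步学习信号的权重越大;熵接近上限的步被衰减(示意)。"""
    confidence = 1.0 - step_entropy / max_entropy        # 归一化置信度,约在 [0, 1]
    return math.exp(alpha * (confidence - 0.5))          # 自信步放大、不确定步衰减

def modulated_advantages(outcome_reward, baseline, step_entropies, max_entropy):
    """结果级优势乘以每步的熵调制权重,得到步级学习信号。"""
    adv = outcome_reward - baseline
    return [adv * empg_style_weight(h, max_entropy) for h in step_entropies]

# 同一条成功轨迹(reward=1)中,自信步骤获得更大的梯度权重,不确定步骤被削弱
print(modulated_advantages(1.0, 0.4, step_entropies=[0.2, 1.5, 2.8], max_entropy=3.0))
```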
FLUX-Reason-6M与PRISM-Bench:百万级文本到图像推理数据集与全面基准
- 标题: FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
- 作者: Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
- 日期: 2025-09-11
- ArXiv主页 : https://arxiv.org/abs/2509.09680
- 论文链接 : https://arxiv.org/pdf/2509.09680
- 项目链接 : https://flux-reason-6m.github.io/
- gitHub仓库 : https://github.com/rongyaofang/prism-bench
英文摘要
The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, We introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The image are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and design explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/ .
中文摘要
开源文本生成图像(T2I)模型的发展受制于缺少大规模、以推理为核心的数据集和全面的评测基准,导致其性能与领先的闭源系统存在差距。为应对这一挑战,我们提出了 FLUX-Reason-6M 和 PRISM-Bench(精确且稳健的图像合成测量基准)。FLUX-Reason-6M 是一个庞大的数据集,包含 600 万张由 FLUX 生成的高质量图像和 2000 万条中英双语描述,专门用于教授复杂推理。图像按六个关键特征组织:想象力、实体、文本渲染、风格、情感与构图,并设计了显式的生成思维链(GCoT),对图像生成步骤给出详细拆解。整个数据构建耗费了 15,000 个 A100 GPU 天,为社区提供了此前只有大型工业实验室才能获得的资源。PRISM-Bench 提供了一个包含七条不同赛道的新评测标准,其中包括一项使用 GCoT 的高难度长文本挑战。借助精心设计的提示,它利用先进的视觉语言模型对提示-图像对齐程度和图像美学进行细致、贴近人类判断的评估。我们对 19 个领先模型在 PRISM-Bench 上的大规模评测揭示了关键的性能差距,并指出了需要改进的具体方面。我们的数据集、基准和评测代码均已发布,以推动下一波以推理为导向的 T2I 生成研究。项目页面:https://flux-reason-6m.github.io/。
重建对齐改进统一多模态模型
- 标题: Reconstruction Alignment Improves Unified Multimodal Models
- 作者: Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang
- 日期: 2025-09-08
- ArXiv主页 : https://arxiv.org/abs/2509.07295
- 论文链接 : https://arxiv.org/pdf/2509.07295
- 项目链接 : https://reconstruction-alignment.github.io/
- gitHub仓库 : https://github.com/HorizonWind2004/reconstruction-alignment
英文摘要
Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73→0.90) and DPGBench (80.93→88.15), while also boosting editing benchmarks (ImgEdit 3.38→3.75, GEdit 6.94→7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs
中文摘要
统一多模态模型(UMM)在单一架构中统一了视觉理解与生成。然而,传统训练依赖图文对(或序列),其文字描述通常稀疏,即便用上数百个词描述一张简单图像,也会遗漏细粒度的视觉细节。我们提出重建对齐(RecA),一种资源高效的后训练方法,它把视觉理解编码器的嵌入用作稠密的"文本提示",在没有文字描述的情况下提供丰富监督。具体来说,RecA 让 UMM 以其自身的视觉理解嵌入为条件,并以自监督重建损失优化其重建输入图像的能力,从而重新对齐理解与生成。尽管方法简单,RecA 适用面很广:在自回归、掩码自回归以及基于扩散的 UMM 上,它都能一致地提升生成与编辑保真度。仅用 27 GPU 小时的后训练,RecA 就大幅提升了 GenEval(0.73→0.90)和 DPGBench(80.93→88.15)上的图像生成性能,同时也改进了编辑基准(ImgEdit 3.38→3.75,GEdit 6.94→7.25)。值得注意的是,RecA 超越了规模大得多的开源模型,并能广泛应用于各类 UMM 架构,确立了它作为 UMM 的一种高效且通用的后训练对齐策略。
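摘要中 RecA 的核心是:让模型以自身理解编码器的嵌入为条件重建输入图像,并用自监督重建损失进行优化。下面用极小的占位网络给出一个端到端可运行的示意(非官方实现;网络结构与损失形式均为假设):

```python
import torch
import torch.nn as nn

class TinyRecADemo(nn.Module):
    """用"理解侧"编码器的嵌入作为稠密条件,让生成侧重建输入图像,并以重建损失对齐理解与生成(示意)。"""
    def __init__(self, dim=64):
        super().__init__()
        self.understand_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        self.generator = nn.Sequential(nn.Linear(dim, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))

    def forward(self, images):
        cond = self.understand_encoder(images)        # 视觉理解嵌入,充当"稠密文本提示"
        recon = self.generator(cond)                  # 以该嵌入为条件重建图像
        return nn.functional.mse_loss(recon, images)  # 自监督重建损失(论文中的损失形式可能不同)

model = TinyRecADemo()
loss = model(torch.rand(4, 3, 32, 32))
loss.backward()
print(float(loss))
```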
Dinov3是否设定了新的医疗视觉标准?
- 标题: Does DINOv3 Set a New Medical Vision Standard?
- 作者: Che Liu, Yinda Chen, Haoyuan Shi, Jinpeng Lu, Bailiang Jian, Jiazhen Pan, Linghan Cai, Jiayi Wang, Yundi Zhang, Jun Li, Cosmin I. Bercea, Cheng Ouyang, Chen Chen, Zhiwei Xiong, Benedikt Wiestler, Christian Wachinger, Daniel Rueckert, Wenjia Bai, Rossella Arcucci
- 日期: 2025-09-08
- ArXiv主页 : https://arxiv.org/abs/2509.06467
- 论文链接 : https://arxiv.org/pdf/2509.06467
英文摘要
The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how well the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: The model's features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling law in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.
中文摘要
在多样的自然图像上预训练的大规模视觉基础模型的出现,标志着计算机视觉的范式转变。然而,前沿视觉基础模型的能力能否迁移到医学影像等专业领域,仍是一个悬而未决的问题。本报告研究了 DINOv3(一种在密集预测任务上能力突出的最先进自监督视觉 Transformer,ViT)能否在不经过领域特定预训练的情况下,直接作为医学视觉任务的强大统一编码器。为回答这一问题,我们在常见的医学视觉任务上对 DINOv3 进行了基准测试,包括多种医学影像模态上的 2D/3D 分类与分割。我们通过改变模型规模和输入图像分辨率,系统地分析了其可扩展性。结果表明,DINOv3 表现出色,建立了一个强有力的新基线。值得注意的是,尽管它只在自然图像上训练,却能在若干任务上超过 BiomedCLIP 和 CT-Net 等医学专用基础模型。不过,我们也发现了明显的局限:在需要深度领域专业化的场景中,例如全切片病理图像(WSI)、电子显微镜(EM)和正电子发射断层扫描(PET),模型特征的质量会下降。此外,我们观察到 DINOv3 在医学领域并不总是遵循缩放定律;性能不会随着模型增大或特征分辨率变细而可靠地提升,在不同任务上呈现出多样的缩放行为。总的来说,我们的工作确立了 DINOv3 作为一个强基线,其强大的视觉特征可以作为多种复杂医学任务的可靠先验。这也打开了有前景的未来方向,例如利用其特征在 3D 重建中强化多视角一致性。
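报告中的 2D 分类评测思路(冻结骨干、仅训练线性分类头)可以用如下草图表示。骨干的加载方式与返回的特征形状均为假设,仅作说明,不代表报告的实际评测代码。

```python
import torch
import torch.nn as nn

def linear_probe_eval(backbone, loader, num_classes, device="cuda"):
    """冻结视觉骨干(如 DINOv3 ViT),只训练一个线性分类头的示意流程。
    backbone 作为已构造好的模块传入(加载方式因版本而异,此处为假设)。"""
    backbone.eval().to(device)
    for p in backbone.parameters():
        p.requires_grad_(False)          # 冻结骨干,仅评估其特征质量

    # 假设 backbone(x) 返回形状为 [B, D] 的全局特征
    feat_dim = getattr(backbone, "embed_dim", 768)
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():
            feats = backbone(images)
        loss = nn.functional.cross_entropy(head(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head
```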
理解和产生可以真正受益 - 还是只是共存?
- 标题: Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
- 作者: Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
- 日期: 2025-09-11
- ArXiv主页 : https://arxiv.org/abs/2509.09666
- 论文链接 : https://arxiv.org/pdf/2509.09666
英文摘要
In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce the coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of the UMMs. A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
中文摘要
在本文中,我们通过自编码器(Auto-Encoder)的视角提出了一个富有洞见的范式:理解作为编码器(I2T),把图像压缩为文本;生成作为解码器(T2I),再从该文本重建图像。以重建保真度作为统一的训练目标,我们在理解与生成过程之间建立连贯的双向信息流,带来相互增益。为实现这一点,我们提出了 UAE,一个统一多模态学习的新框架。我们首先用大规模长上下文图像描述对解码器进行预训练,以捕捉细粒度语义和复杂的空间关系。随后我们提出基于强化学习(RL)的 Unified-GRPO,它包含三个阶段:(1)冷启动阶段,用语义重建损失温和地初始化编码器和解码器;(2)面向理解的生成(Generation for Understanding):训练编码器生成信息丰富的描述,使解码器的重建质量最大化,从而增强其视觉理解;(3)面向生成的理解(Understanding for Generation):进一步细化解码器,使其能从这些描述中重建图像,迫使它利用每一处细节,并提升其长上下文指令遵循能力和生成保真度。在评估方面,我们提出了 Unified-Bench,这是第一个专门用于衡量 UMM 统一程度的基准。多模态学习中出现了令人惊讶的"顿悟时刻":随着 RL 的推进,编码器自发地生成更具描述性的字幕,而解码器同时展现出理解这些复杂描述的深厚能力,最终得到保真度惊人的重建结果。
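按照"理解为编码器、生成为解码器"的自编码器视角,重建保真度可以作为 RL 的奖励信号。下面是一个接口均为假设的奖励计算草图,用图像嵌入的余弦相似度充当保真度度量;论文中奖励的具体定义以原文为准。

```python
import torch
import torch.nn.functional as F

def reconstruction_reward(encoder_i2t, decoder_t2i, image_encoder, image):
    """UAE 式"自编码器"视角的极简示意(接口均为假设):
    I2T 编码器把图像压缩成文字描述,T2I 解码器再据此重建图像,
    用重建图像与原图的嵌入余弦相似度作为奖励信号。"""
    caption = encoder_i2t.describe(image)          # 理解:图像 -> 文本
    recon = decoder_t2i.generate(caption)          # 生成:文本 -> 图像

    with torch.no_grad():
        z_orig = image_encoder(image)              # 用于度量的图像嵌入 [B, D]
        z_recon = image_encoder(recon)
    reward = F.cosine_similarity(z_orig.flatten(1), z_recon.flatten(1)).mean()
    return reward, caption, recon
```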
F1:视觉语言行动模型将理解和生成与动作桥接
- 标题: F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
- 作者: Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, Jiangmiao Pang
- 日期: 2025-09-08
- ArXiv主页 : https://arxiv.org/abs/2509.06951
- 论文链接 : https://arxiv.org/pdf/2509.06951
- 项目链接 : https://aopolin-lv.github.io/F1-VLA/
- gitHub仓库 : https://github.com/InternRobotics/F1-VLA
英文摘要
Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework which integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.
中文摘要
在动态视觉环境中执行语言条件任务,仍然是具身智能的核心挑战。现有的视觉-语言-动作(VLA)模型大多采用反应式的"状态到动作"映射,常常导致短视行为,在动态场景中鲁棒性较差。在本文中,我们提出了 F1,一个将视觉前瞻生成整合进决策流程的预训练 VLA 框架。F1 采用 Mixture-of-Transformer 架构,并为感知、前瞻生成和控制设置了专门模块,从而把理解、生成与动作衔接起来。其核心在于,F1 采用逐尺度(next-scale)预测机制来合成目标条件的视觉前瞻,作为显式的规划目标。通过预测合理的未来视觉状态,F1 把动作生成重新表述为由前瞻引导的逆动力学问题,使动作能够隐式地达成视觉目标。为了赋予 F1 鲁棒且可泛化的能力,我们在一个涵盖 136 个不同任务、超过 33 万条轨迹的大规模数据集上提出了三阶段训练方案。该训练方案增强了模块化推理,并让模型具备可迁移的视觉前瞻能力,这对复杂而动态的环境至关重要。在真实世界任务和仿真基准上的大量评估表明,F1 始终优于现有方法,在任务成功率和泛化能力上都取得了显著提升。
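前瞻引导的逆动力学这一核心思路可以用如下草图表示:先合成目标条件的未来视觉状态,再由(当前状态, 未来状态)反推动作。其中各模块与接口均为假设,并非官方实现。

```python
import torch

def f1_style_act(perceiver, foresight_model, inverse_dynamics, obs, instruction):
    """前瞻引导逆动力学的示意(模块与接口均为假设):
    1) 感知当前观测;2) 生成目标条件的未来视觉状态;3) 由(当前, 未来)反推动作。"""
    with torch.no_grad():
        state = perceiver(obs, instruction)            # 当前视觉/语言状态
        goal_state = foresight_model(state)            # 合成的未来视觉前瞻
        action = inverse_dynamics(state, goal_state)   # 推出能到达该前瞻的动作
    return action
```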
深入研究系统的强化学习基础:一项调查
- 标题: Reinforcement Learning Foundations for Deep Research Systems: A Survey
- 作者: Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, Yong Liu
- 日期: 2025-09-08
- ArXiv主页 : https://arxiv.org/abs/2509.06733
- gitHub仓库 : https://github.com/wenjunli-0/deepresearch-survey
英文摘要
Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.
中文摘要
深度研究系统是一类通过协调推理、在开放网络和用户文件中搜索以及调用工具来解决复杂多步任务的智能体 AI,正朝着由规划器(Planner)、协调器(Coordinator)和执行器(Executors)组成的层级化部署演进。在实践中,端到端训练整个系统栈仍不现实,因此大多数工作只训练一个与搜索、浏览、代码等核心工具相连的规划器。SFT 虽能保证协议保真度,但存在模仿与暴露偏差,且对环境反馈利用不足。DPO 等偏好对齐方法依赖模式设计和代理指标、属于离策略方法,在长时程信用分配和多目标权衡上表现较弱。SFT 和 DPO 的另一个局限在于,它们需要通过模式设计和标注比较来依赖人工定义的决策点与子技能。强化学习通过优化轨迹级策略,实现探索、恢复行为和有原则的信用分配,与闭环的工具交互研究相契合,并降低了对这类人工先验和标注者偏差的依赖。据我们所知,这是第一篇专门针对深度研究系统的强化学习基础的综述。它沿三条轴线系统梳理了 DeepSeek-R1 之后的工作:(i)数据合成与筛选;(ii)面向智能体研究的 RL 方法,涵盖稳定性、样本效率、长上下文处理、奖励与信用设计、多目标优化和多模态集成;(iii)智能体 RL 训练系统与框架。我们还讨论了智能体架构与协同,以及评估与基准,包括近期的 QA、VQA、长文生成和立足具体领域的工具交互任务。我们提炼出反复出现的模式,指出基础设施瓶颈,并为用 RL 训练鲁棒、透明的深度研究智能体提供了实用指导。
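综述强调轨迹级策略优化与长时程信用分配。作为与具体论文无关的基本原理示意,下面的草图演示了如何把一条多步工具调用轨迹的终局奖励折扣回传到各步。

```python
from typing import List

def discounted_returns(step_rewards: List[float], gamma: float = 0.99) -> List[float]:
    """对一条多步工具调用轨迹计算折扣回报,作为轨迹级信用分配的最简示意。
    与综述中讨论的具体算法无关,仅演示基本原理。"""
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# 例:只有最后一步拿到终局奖励 1.0,前面的搜索/浏览步骤奖励为 0
print(discounted_returns([0.0, 0.0, 0.0, 1.0]))  # -> [0.9703, 0.9801, 0.99, 1.0](近似值)
```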