[Paper Roundup] 2025 Week 32 (Aug 03-09) (Robotics / Embodied AI / LLM)

The Chinese text was translated with googletrans; wherever the translation is off, the English original takes precedence.


Qwen-Image Technical Report

  • Title: Qwen-Image Technical Report
  • Authors: Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
  • Date: 2025-08-04
  • arXiv page: https://arxiv.org/abs/2508.02324
  • Paper (PDF): https://arxiv.org/pdf/2508.02324
  • Project page: https://qwenlm.github.io/blog/qwen-image/
  • GitHub repo: https://github.com/QwenLM/Qwen-Image

Abstract

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
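
To make the dual-encoding idea concrete, here is a minimal NumPy sketch: one stream produces coarse semantic features (standing in for Qwen2.5-VL) and the other fine-grained reconstructive latents (standing in for the VAE encoder), and both condition the editing module. All shapes, pooling choices, and projection sizes are illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_encoder(image):
    """Stand-in for Qwen2.5-VL: coarse, pooled semantic features."""
    h, w, _ = image.shape
    patches = image.reshape(h // 8, 8, w // 8, 8, 3).mean(axis=(1, 3))
    return patches.reshape(-1, 3) @ rng.normal(size=(3, 16))

def vae_encoder(image):
    """Stand-in for the VAE: fine-grained reconstructive latents."""
    h, w, _ = image.shape
    patches = image.reshape(h // 16, 16, w // 16, 16, 3)
    return patches.reshape(h // 16 * (w // 16), -1) @ rng.normal(size=(16 * 16 * 3, 64))

def edit_condition(image):
    """Both streams jointly condition the editing module."""
    sem = semantic_encoder(image)   # what the image *means*
    rec = vae_encoder(image)        # what the image *looks like*
    return sem, rec

image = rng.random((64, 64, 3))
sem, rec = edit_condition(image)
print(sem.shape, rec.shape)  # (64, 16) (16, 64)
```

The point of the split is visible in the two return values: the semantic stream is low-resolution and abstraction-heavy (consistency), the reconstructive stream is high-dimensional per region (fidelity).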



Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Abstract

Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
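
The paper's claim is easiest to see with a toy stand-in for its controlled setup (DataAlchemy itself is not reproduced here; the task, lengths, and "model" below are invented for illustration): a learner that memorizes reasoning chains over in-distribution transformation lengths answers perfectly in-distribution and collapses the moment the test chain length leaves the training distribution.

```python
def rot(s, k):
    """Apply a Caesar shift of k to a lowercase string."""
    return "".join(chr((ord(c) - 97 + k) % 26 + 97) for c in s)

def make_example(word, length):
    """A 'reasoning chain': apply ROT-1 `length` times, recording each step."""
    steps, cur = [], word
    for _ in range(length):
        cur = rot(cur, 1)
        steps.append(cur)
    return (word, length), steps

train_lengths = [2, 3]                      # in-distribution chain lengths
words = ["cat", "dog", "sun"]
memory = dict(make_example(w, L) for w in words for L in train_lengths)

def cot_model(word, length):
    """Retrieval of a reasoning path seen during training, nothing more."""
    return memory.get((word, length))

def accuracy(lengths):
    total = hits = 0
    for w in words:
        for L in lengths:
            total += 1
            pred = cot_model(w, L)
            hits += pred is not None and pred[-1] == rot(w, L)
    return hits / total

in_dist = accuracy([2, 3])   # 1.0
ood = accuracy([5])          # 0.0: no chain of this length was ever seen
print(in_dist, ood)
```

A real LLM interpolates rather than looks up exact keys, but the paper's experiments probe the same three axes (task, length, format) and report the same qualitative cliff.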



On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Abstract

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
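
The "single-line code change" is the per-token reweighting: multiply each token's cross-entropy by the model's own probability for that token (with a stop-gradient in the actual training code). A NumPy sketch of the forward loss computation, with illustrative shapes:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sft_and_dft_loss(logits, labels):
    """Per-sequence mean losses for standard SFT vs. the DFT reweighting.

    DFT scales each token's cross-entropy by p(token); in a real training
    loop that factor is detached so gradients flow only through log p.
    """
    probs = softmax(logits)                        # (T, V)
    p_label = probs[np.arange(len(labels)), labels]
    sft = -np.log(p_label)                         # standard cross-entropy
    dft = p_label * sft                            # the single-line change
    return sft.mean(), dft.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))
labels = rng.integers(0, 10, size=5)
sft, dft = sft_and_dft_loss(logits, labels)
print(sft > dft)  # True: p <= 1, so DFT down-weights every token's CE
```

Intuitively, low-probability target tokens (where the implicit SFT reward is most distorted, per the paper's analysis) get their gradient contribution damped the most.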



VeriGUI: Verifiable Long-Chain GUI Dataset

  • Title: VeriGUI: Verifiable Long-Chain GUI Dataset
  • Authors: Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Wendong Fan, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Ziqi Ren, Jialiang Gao, Jindi Lv, Junjie Wang, Aosong Feng, Heng Zhou, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Irene Li, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao
  • Date: 2025-08-06
  • arXiv page: https://arxiv.org/abs/2508.04026
  • Paper (PDF): https://arxiv.org/pdf/2508.04026
  • GitHub repo: https://github.com/VeriGUI-Team/VeriGUI

Abstract

Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.
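
The two dataset dimensions can be sketched as a data structure: a long-chain task is an ordered list of subtasks, each carrying its own goal verifier over environment state, so any subtask index is a valid starting point and any action strategy inside a subtask is acceptable. The schema and field names below are assumptions for illustration, not VeriGUI's actual format.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    instruction: str
    verify: Callable[[dict], bool]   # checks env state, not the action path

@dataclass
class LongChainTask:
    subtasks: list

    def evaluate(self, agent, state, start: int = 0):
        """Run from any subtask index; return per-subtask pass/fail."""
        results = []
        for sub in self.subtasks[start:]:
            state = agent(sub.instruction, state)  # any strategy is allowed
            results.append(sub.verify(state))      # only the goal is checked
        return results

task = LongChainTask([
    Subtask("open the report", lambda s: s.get("report_open", False)),
    Subtask("export as PDF", lambda s: s.get("exported") == "pdf"),
])

def scripted_agent(instruction, state):
    state = dict(state)
    if "open" in instruction:
        state["report_open"] = True
    if "export" in instruction:
        state["exported"] = "pdf"
    return state

print(task.evaluate(scripted_agent, {}))                              # [True, True]
print(task.evaluate(scripted_agent, {"report_open": True}, start=1))  # [True]
```

Subtask-level verification is what separates this from outcome-only benchmarks: partial progress on a hundreds-of-steps chain is measurable, not just the final outcome.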



Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Abstract

We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 token/s over H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing new state of the art on the speed-quality Pareto frontier for code models.
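
The speedup mechanism behind all discrete diffusion LMs is parallel, confidence-ordered unmasking: start from an all-mask sequence and commit many positions per step instead of one token per step. A minimal skeleton (Seed Diffusion's actual sampler is more sophisticated, and a real model would re-score after every step rather than use a fixed score table):

```python
import numpy as np

MASK = -1

def decode_parallel(scores, tokens_per_step):
    """scores[i, v]: model confidence for token v at position i."""
    n = scores.shape[0]
    seq = np.full(n, MASK)
    steps = 0
    while (seq == MASK).any():
        conf = scores.max(axis=1)
        conf[seq != MASK] = -np.inf          # already committed positions
        # commit the k most confident masked positions in one shot
        for i in np.argsort(conf)[::-1][:tokens_per_step]:
            if seq[i] == MASK:
                seq[i] = scores[i].argmax()
        steps += 1
    return seq, steps

rng = np.random.default_rng(0)
scores = rng.random((16, 100))
seq, steps = decode_parallel(scores, tokens_per_step=4)
print(steps)  # 4 model calls for 16 positions, vs. 16 autoregressive calls
```

The quality-speed trade-off lives in `tokens_per_step`: larger values mean fewer model calls but more positions decided on stale context, which is exactly the Pareto frontier the abstract refers to.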



R-Zero: Self-Evolving Reasoning LLM from Zero Data

Abstract

Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
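
The Challenger/Solver dynamic can be caricatured in a few lines. Here difficulty is a single scalar, the Challenger deterministically probes just above and just below the Solver's edge, and the Solver improves on each solved frontier task; all of the reward shaping and update rules are invented for illustration (the paper does this with RL on real LLMs).

```python
solver_skill = 1.0          # the Solver solves tasks with difficulty <= skill
history = []

for step in range(40):
    # Challenger targets the capability edge: slightly above skill on even
    # steps (unsolved, maps out the frontier), slightly below on odd steps
    # (solved, so the Solver collects reward and improves).
    delta = 0.1 if step % 2 == 0 else -0.05
    difficulty = solver_skill + delta
    if difficulty <= solver_skill:
        solver_skill += 0.05          # improve on solved frontier tasks
    history.append(difficulty)

print(round(solver_skill, 2))  # 2.0
```

Because the Challenger's reward is tied to the Solver's current edge, the proposed difficulties rise as the Solver improves: a curriculum emerges with no pre-existing tasks or labels, which is the core of the framework.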



Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

Abstract

General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro
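
Test-time reflection plus voting, stripped to its skeleton: sample several candidate answers, discard those that fail a self-check, and majority-vote over the survivors. The `reflect` function here is a hypothetical stand-in for a self-check call to the model; the paper's actual prompts and judging are not reproduced.

```python
from collections import Counter

def reflect(answer):
    """Stand-in self-check: reject obviously malformed candidates."""
    return answer is not None and answer != ""

def vote(candidates):
    kept = [a for a in candidates if reflect(a)]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]

print(vote(["42", "42", "41", "", None, "42"]))  # 42
```

Reflection prunes low-quality samples before they can dilute the vote, which is why the two mechanisms are used together rather than either alone.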



Efficient Agents: Building Effective Agents While Reducing Cost

Abstract

The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents , a novel agent framework that has an optimal complexity to task requirements. Efficient Agents retains 96.7% of the performance of OWL, one leading open-source agent framework, while reducing operational costs from 0.398 to 0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.
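
Cost-of-pass is commonly defined as the expected spend to obtain one correct solution, i.e. average cost per attempt divided by success rate; the numbers below are made up for illustration and are not the paper's measurements.

```python
def cost_of_pass(avg_cost_per_run: float, success_rate: float) -> float:
    """Expected cost of obtaining one successful run."""
    if success_rate <= 0:
        return float("inf")
    return avg_cost_per_run / success_rate

heavy = cost_of_pass(avg_cost_per_run=0.30, success_rate=0.60)  # 0.50
light = cost_of_pass(avg_cost_per_run=0.12, success_rate=0.50)  # 0.24
print(heavy, light)
```

This is why the metric rewards framework simplification: a cheaper agent can win on cost-of-pass even at noticeably lower accuracy, which is the trade-off the 96.7%-performance / 28.4%-cheaper result exploits.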



Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Abstract

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.



Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Abstract

We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with the agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed in diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, AutoGen, or building from scratch) with almost ZERO code modifications. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agents into training transitions. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework's potential for real-world agent training and deployment.
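
The unified-data-interface idea reduces to this: however the agent was built, flatten its run into (state, action, reward) transitions that any RL trainer can consume, with credit assignment deciding how reward spreads over steps. Field names below are assumptions for illustration, not Agent Lightning's API, and the terminal-reward scheme is the simplest possible credit assignment, not LightningRL's.

```python
def to_transitions(trajectory, final_reward):
    """Treat each LLM call as one action; put the episode reward on the
    last step and zero elsewhere (simplest credit-assignment scheme)."""
    transitions = []
    for i, step in enumerate(trajectory):
        reward = final_reward if i == len(trajectory) - 1 else 0.0
        transitions.append({
            "state": step["prompt"],
            "action": step["completion"],
            "reward": reward,
        })
    return transitions

# A two-call text-to-SQL run, framework-agnostic once flattened:
run = [
    {"prompt": "schema?", "completion": "SELECT name FROM sqlite_master;"},
    {"prompt": "answer?", "completion": "SELECT count(*) FROM users;"},
]
ts = to_transitions(run, final_reward=1.0)
print(len(ts), ts[0]["reward"], ts[1]["reward"])  # 2 0.0 1.0
```

Because the trainer only ever sees this flat interface, the agent's internal orchestration (LangChain graph, AutoGen conversation, hand-rolled loop) never has to change, which is where the "almost zero code modifications" claim comes from.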



DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Abstract

Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating understanding the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.



Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

Abstract

Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
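
The two phases can be caricatured with plain lists: phase 1 grows the canvas until an end-of-sequence signal looks plausible, phase 2 swaps low-confidence tokens for extra mask slots during denoising. The "model" here is a fixed confidence table and all thresholds are invented for illustration; DAEDAL's actual completion metric and insertion rule are the paper's.

```python
MASK = -1

def initial_length(eos_conf_at_len, start=4, threshold=0.9):
    """Phase 1: expand until the model is confident the sequence can end."""
    length = start
    while eos_conf_at_len(length) < threshold:
        length *= 2
    return length

def expand_low_confidence(seq, conf, floor=0.3, extra=2):
    """Phase 2: replace each low-confidence token with `extra` mask slots,
    giving under-developed regions more room to denoise."""
    out = []
    for tok, c in zip(seq, conf):
        out.extend([MASK] * extra if c < floor else [tok])
    return out

# Pretend this task needs ~16 tokens before an EOS becomes plausible.
length = initial_length(lambda n: 0.95 if n >= 16 else 0.1)
seq = [0, 1, 2, 3, 4, 5]
conf = [0.9, 0.1, 0.8, 0.9, 0.2, 0.9]
print(length, expand_low_confidence(seq, conf))
# 16 [0, -1, -1, 2, 3, -1, -1, 5]
```

Both moves read length signals out of the model itself, which is the paper's central observation: the architecture is length-rigid but the model's confidences are not.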



Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

Abstract

We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). Three design choices underpin these results: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, all feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.



SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

Abstract

Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
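
The "situated" idea, in a toy NumPy sketch: encode the chunk together with its surrounding window, but pool only over the chunk's own token positions, so the embedding stays short-chunk-sized while carrying context. The mean-mixing "encoder" below is a crude stand-in for a transformer, and SitEmb's actual conditioning mechanism is more involved.

```python
import numpy as np

def toy_encoder(token_vecs):
    """Stand-in contextualizer: each token crudely attends to the window."""
    ctx = token_vecs.mean(axis=0, keepdims=True)
    return 0.5 * token_vecs + 0.5 * ctx

def situated_embedding(window_vecs, chunk_slice):
    hidden = toy_encoder(window_vecs)        # context-aware token states
    return hidden[chunk_slice].mean(axis=0)  # pool over the chunk ONLY

rng = np.random.default_rng(0)
window = rng.normal(size=(32, 8))            # chunk + neighbors, 8-dim toy vecs
isolated = window[12:20].mean(axis=0)        # context-free chunk embedding
situated = situated_embedding(window, slice(12, 20))
print(situated.shape, np.allclose(isolated, situated))  # (8,) False
```

The contrast with long-chunk embedding is the point: retrieval still returns the localized chunk (human/model bandwidth stays bounded), yet the vector reflects where that chunk sits in the story.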



Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

  • Title: Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
  • Authors: Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel
  • Date: 2025-08-05
  • arXiv page: https://arxiv.org/abs/2508.03501
  • Paper (PDF): https://arxiv.org/pdf/2508.03501

Abstract

Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.
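
DAPO belongs to the family of group-based policy optimization methods: sample several rollouts per task, score each against the stateful environment (here, whether the patch passes the tests), and convert raw rewards into group-relative advantages. The sketch below shows only that generic normalization step, not the paper's full modified-DAPO objective.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize rewards within one task's rollout group."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0:                 # all rollouts tied: no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# 4 rollouts of one SWE task: reward 1 iff the produced patch passes tests.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # [ 1. -1. -1.  1.]
```

The zero-variance guard matters in practice: tasks where every rollout fails (or every one succeeds) contribute nothing, which is one reason such pipelines filter or resample degenerate groups.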



SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

Abstract

Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.



PixNerd: Pixel Neural Field Diffusion

Abstract

The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256x256 and 2.84 FID on ImageNet 512x512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.
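
A neural-field patch decoder, at its smallest: a little MLP maps (pixel coordinates, patch latent) to RGB, so each patch is decoded straight to pixels with no VAE. Weights here are random and all sizes are illustrative; this shows the function signature of the idea, not PixNerd's trained decoder.

```python
import numpy as np

def decode_patch(latent, patch=16, hidden=32, seed=0):
    """Toy neural field: RGB = MLP(normalized (y, x) coords, patch latent)."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:patch, 0:patch] / (patch - 1)
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)          # (P*P, 2)
    feats = np.concatenate(
        [coords, np.repeat(latent[None, :], patch * patch, axis=0)], axis=1)
    w1 = rng.normal(size=(feats.shape[1], hidden))
    w2 = rng.normal(size=(hidden, 3))
    return np.tanh(np.tanh(feats @ w1) @ w2)                     # values in (-1, 1)

rgb = decode_patch(latent=np.zeros(8)).reshape(16, 16, 3)
print(rgb.shape)  # (16, 16, 3)
```

Because pixels are read out by querying coordinates, the same latent can in principle be decoded at any resolution, which is one appeal of field-based decoding over a fixed deconvolution stack.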

中文摘要

扩散 Transformer 目前的成功在很大程度上依赖于由预训练变分自编码器(VAE)塑造的压缩潜空间。然而,这种两阶段训练范式不可避免地引入累积误差与解码伪影。为了解决上述问题,研究者以复杂的级联管线和更高的 token 复杂度为代价回到像素空间。与这些工作不同,我们提出用神经场对逐 patch 解码进行建模,给出一个单尺度、单阶段、高效的端到端方案,称为像素神经场扩散(PixNerd)。得益于 PixNerd 中高效的神经场表示,我们在不依赖任何复杂级联管线或 VAE 的情况下,在 ImageNet 256×256 上直接取得 2.15 的 FID,在 ImageNet 512×512 上取得 2.84 的 FID。我们还将 PixNerd 框架扩展到文本到图像生成:PixNerd-XXL/16 在 GenEval 基准上取得 0.73 的综合得分,在 DPG 基准上取得 80.9 的综合得分,颇具竞争力。
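
"用神经场做逐 patch 解码"的直观含义是:每个 patch 的潜 token 定义了一个从像素坐标到颜色的连续函数。下面用一个极简的纯 Python 线性"场"示意这一点(仅为概念草稿,非 PixNerd 官方实现;函数名与权重均为自拟假设):

```python
def decode_patch(token, patch_size, w_out):
    """神经场式逐 patch 解码示意:对 patch 内每个像素,
    把 (token 特征, 像素归一化坐标) 送入一个线性场函数,
    输出该像素的 RGB。w_out 形状为 (len(token)+2, 3)。"""
    pixels = []
    for i in range(patch_size):
        row = []
        for j in range(patch_size):
            # 像素在 patch 内的归一化坐标 (y, x)
            y = i / max(patch_size - 1, 1)
            x = j / max(patch_size - 1, 1)
            feat = list(token) + [y, x]
            # 每个输出通道是特征与对应权重列的内积
            rgb = [sum(f * w for f, w in zip(feat, col)) for col in zip(*w_out)]
            row.append(rgb)
        pixels.append(row)
    return pixels
```

真实模型中这个场当然是带非线性的小型 MLP,且权重由扩散主干逐 patch 预测,这里只保留"坐标条件解码"这一结构。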


LongVie:多模态引导的可控超长视频生成

英文摘要

Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.

中文摘要

可控的超长视频生成是一项基础但具有挑战性的任务。尽管现有方法在短片段上行之有效,但由于时间不一致和视觉退化等问题,它们难以扩展到长视频。本文首先调查并确定了三个关键因素:各片段独立的噪声初始化、相互独立的控制信号归一化,以及单模态引导的局限性。为了解决这些问题,我们提出 LongVie,一个面向可控长视频生成的端到端自回归框架。LongVie 引入两项核心设计以确保时间一致性:1)统一的噪声初始化策略,在各片段之间保持一致的生成;2)全局控制信号归一化,在整段视频中强制控制空间对齐。为缓解视觉退化,LongVie 还采用:3)多模态控制框架,融合稠密(如深度图)与稀疏(如关键点)控制信号;4)退化感知的训练策略,随时间自适应地平衡各模态的贡献以保持视觉质量。我们还提出 LongVGenBench,一个由 100 段高分辨率视频组成的综合基准,涵盖多样的真实与合成场景,每段时长超过一分钟。大量实验表明,LongVie 在长程可控性、一致性和质量方面均达到最先进水平。
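
摘要中的前两项设计可以用很短的代码说明:噪声在所有片段间共享而不是各自采样,控制信号用整段视频的统计量归一化而不是逐片段计算。下面是一个一维标量信号上的示意草稿(非论文实现,函数名为自拟):

```python
import random

def unified_noise(num_clips, length, seed=0):
    """统一噪声初始化示意:所有片段共享同一份初始噪声,
    而非各自独立采样,从而保持跨片段一致性。"""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(length)]
    return [list(base) for _ in range(num_clips)]

def global_normalize(control_clips):
    """全局控制信号归一化示意:均值/方差统计量来自整段视频,
    而不是逐片段计算,以保证控制空间在片段间对齐。"""
    flat = [v for clip in control_clips for v in clip]
    mu = sum(flat) / len(flat)
    sigma = (sum((v - mu) ** 2 for v in flat) / len(flat)) ** 0.5 or 1.0
    return [[(v - mu) / sigma for v in clip] for clip in control_clips]
```

如果改为逐片段归一化,相邻片段里同一深度值会被映射到不同的控制数值,这正是摘要指出的不一致来源之一。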


CellForge:虚拟细胞模型的智能体化设计

  • 标题: CellForge: Agentic Design of Virtual Cell Models
  • 作者: Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein
  • 日期: 2025-08-04
  • ArXiv主页 : https://arxiv.org/abs/2508.02276
  • 论文链接 : https://arxiv.org/pdf/2508.02276

英文摘要

Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations quantitatively. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework that transforms presented biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and inference. The framework integrates three core modules: Task Analysis for presented dataset characterization and relevant literature retrieval, Method Design, where specialized agents collaboratively develop optimized modeling strategies, and Experiment Execution for automated generation of code. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and have to collaboratively exchange solutions until they achieve a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.

中文摘要

虚拟细胞建模是人工智能与生物学交叉领域的新兴前沿,旨在定量预测细胞对各种扰动的响应等指标。然而,由于生物系统的复杂性、数据模态的异质性,以及对跨学科领域专业知识的需求,自主构建虚拟细胞的计算模型颇具挑战。在此,我们介绍 CellForge,一个基于多智能体框架的智能体系统,它能把给定的生物数据集和研究目标直接转化为面向虚拟细胞的优化计算模型。更具体地说,仅以原始单细胞多组学数据和任务描述为输入,CellForge 就能同时输出优化后的模型架构,以及用于训练虚拟细胞模型和推理的可执行代码。该框架集成三个核心模块:任务分析(对给定数据集进行刻画并检索相关文献)、方法设计(由专门的智能体协作制定优化的建模策略)和实验执行(自动生成代码)。方法设计模块中的智能体分为持有不同视角的专家和一位中心主持人,它们必须协作交换方案,直至达成合理共识。我们在单细胞扰动预测任务上展示了 CellForge 的能力,使用了涵盖基因敲除、药物处理和细胞因子刺激、跨多种模态的六个不同数据集。CellForge 始终优于针对具体任务的最先进方法。总体而言,CellForge 表明,持有不同视角的 LLM 智能体之间的迭代交互,比直接求解建模问题能给出更好的方案。我们的代码公开于 https://github.com/gersteinlab/CellForge。
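
摘要中"专家反复交换方案、主持人判定共识"的控制流,可以抽象成一个很小的循环框架。以下仅是该交互模式的示意草稿(非 CellForge 官方实现,接口为自拟假设):

```python
def reach_consensus(experts, moderator, max_rounds=5):
    """多智能体"讨论直至共识"的框架示意。
    experts: {名字: 提案函数} 字典,提案函数接收当前全部提案
    (首轮为 None)并返回新提案;
    moderator: 达成共识时返回最终方案,否则返回 None。"""
    proposals = {name: fn(None) for name, fn in experts.items()}
    for _ in range(max_rounds):
        consensus = moderator(proposals)
        if consensus is not None:
            return consensus
        # 每个专家看到上一轮所有提案后修订自己的方案
        proposals = {name: fn(proposals) for name, fn in experts.items()}
    return moderator(proposals)
```

在真实系统中,提案是建模策略的文本描述、主持人是一个 LLM;这里用数值提案即可演示"分歧收敛到共识"的过程。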


CompassVerifier:面向 LLM 评估与结果奖励的统一且鲁棒的验证器

英文摘要

Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.

中文摘要

答案验证不仅对评估大语言模型(LLM)至关重要(通过将其非结构化输出与标准答案匹配),同时也可作为指导 LLM 优化的奖励模型。大多数评估框架依赖正则匹配,或使用通用 LLM 做答案验证,这需要为正则规则或评估提示词做大量重复的定制。当前方法存在两个根本局限:1)缺乏系统评估不同 LLM 验证能力的全面基准;2)验证器的开发仍处于起步阶段,现有方法既缺乏处理复杂边界情况的鲁棒性,也缺乏跨领域的泛化能力。在这项工作中,我们开发了 CompassVerifier,一个准确且鲁棒的轻量级验证器模型,用于评估和结果奖励。它具备横跨数学、知识和多种推理任务的多领域能力,能处理多子问题、公式和序列等多种答案类型,同时能有效识别异常或无效的回答。我们提出了 VerifierBench 基准,其中包含从多个数据源收集的模型输出,并通过对元错误(metaerror)模式的人工分析加以扩充,以增强 CompassVerifier。我们期望 CompassVerifier 和 VerifierBench 能推动答案验证、评估协议和强化学习研究。代码和数据集见 https://github.com/open-compass/CompassVerifier。
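
摘要批评的"正则匹配"基线长什么样?下面是一个典型的脆弱实现草稿:从非结构化输出中抽取最后一个数字再做字符串比对。正是这类规则需要逐任务反复定制、且容易在边界情况上出错,才催生了模型化的验证器(示意代码,非 CompassVerifier 的实现):

```python
import re

def extract_final_number(text):
    """抽取文本中最后出现的数字作为"最终答案"。"""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def verify(prediction, gold):
    """抽取后完全匹配才判为正确——对格式变化非常敏感。"""
    pred = extract_final_number(prediction)
    return pred is not None and pred == extract_final_number(gold)
```

例如答案写成分数、带单位或出现在推理中间时,这种规则都会误判,这正是摘要所说"缺乏鲁棒性"的含义。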


超越权衡:面向推理模型指令跟随能力的自监督强化学习

英文摘要

Reasoning models excel in complex problem solving but exhibit a concerning trade off between reasoning capabilities and instruction following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.

中文摘要

推理模型在复杂问题求解上表现出色,但其推理能力与指令跟随能力之间存在令人担忧的权衡。现有改进指令跟随的方法依赖更强的外部模型,带来了方法论瓶颈以及成本上升、可获得性受限等实际问题。我们提出一个自监督强化学习框架,利用推理模型自身的内部信号来提升指令跟随能力,无需外部监督。大量实验表明,我们的框架在保持推理性能的同时显著提升了指令跟随能力,为增强推理模型的指令跟随提供了一种可扩展且低成本的方法。数据和代码公开于 https://github.com/Rainier-rq/verl-if。
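
"无需外部监督的指令跟随奖励"的一种常见做法,是把指令中可程序化验证的约束(包含关键词、长度上限等)直接转成奖励函数。以下为示意草稿,并非论文的具体奖励设计,约束类型为自拟假设:

```python
def instruction_reward(response, constraints):
    """可验证指令约束构成的内部奖励示意:
    奖励 = 被满足的约束占比,无需外部评审模型。
    constraints: [(约束类型, 参数)] 列表。"""
    checks = []
    for kind, arg in constraints:
        if kind == "must_include":          # 必须包含某关键词
            checks.append(arg in response)
        elif kind == "max_words":           # 词数不超过上限
            checks.append(len(response.split()) <= arg)
    return sum(checks) / len(checks) if checks else 0.0
```

这种奖励是确定性、可复现的,因此可以直接接入 RL 训练循环而不引入更强外部模型的成本。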


在合成世界中用强化学习训练视觉语言模型,以取得真实世界的成功

英文摘要

Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.

中文摘要

交互式多模态智能体必须把原始视觉观测转化为连贯的、以语言为条件的动作序列,而当前的视觉语言模型(VLM)仍欠缺这一能力。早期的强化学习(RL)工作原则上可以赋予 VLM 这类技能,但它们很少检验所学行为能否泛化到训练模拟器之外,并且要么依赖脆弱的超参数调优,要么依赖状态变化少的稠密奖励环境。我们提出视觉语言解耦演员-评论家(VL-DAC),一种轻量、无需调超参数的 RL 算法。VL-DAC 在动作 token 上应用 PPO 更新,而价值只在环境步的层面学习:据我们所知,这种安排此前从未在大型 VLM 或 LLM 上探索过。这一简单的解耦消除了不稳定的加权项,带来更快、更可靠的收敛。每次只在一个廉价模拟器(MiniWorld、Gym-Cards、ALFWorld 或 WebShop)中用 VL-DAC 训练单个 VLM,就能得到泛化广泛的策略:在 BALROG(以游戏为中心的智能体控制)上相对提升 +50%,在 VSI-Bench 最难的部分(空间规划)上相对提升 +5%,在 VisualWebBench(网页导航)上提升 +2%,且不损害通用图像理解精度。这些结果首次证明,一个简单的 RL 算法可以完全在廉价的合成世界中训练 VLM,同时在真实图像的智能体控制、空间推理和网页导航基准上取得可衡量的收益。
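
"PPO 更新作用在动作 token 上、优势只在环境步层面估计"这一解耦,可以用一个标量版损失函数示意:每个环境步一个优势值,广播到该动作的所有 token,再对每个 token 做标准 PPO 截断目标(示意草稿,非 VL-DAC 官方实现,函数名为自拟):

```python
import math

def vl_dac_ppo_loss(logp_new, logp_old, step_advantages, token_to_step, clip=0.2):
    """解耦的 PPO 损失示意。
    logp_new / logp_old: 各动作 token 在新旧策略下的对数概率;
    step_advantages: 每个环境步一个优势值;
    token_to_step: 每个 token 所属的环境步索引。"""
    losses = []
    for lp_new, lp_old, step in zip(logp_new, logp_old, token_to_step):
        adv = step_advantages[step]          # 步级优势广播到 token
        ratio = math.exp(lp_new - lp_old)    # 重要性采样比
        clipped = min(max(ratio, 1 - clip), 1 + clip)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

关键点在于:价值函数不必对每个 token 打分,只需对每个环境步打分,这正是摘要所说去掉"不稳定加权项"的来源。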


Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct 技术报告

  • 标题: Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report
  • 作者: Sajana Weerawardhena, Paul Kassianik, Blaine Nelson, Baturay Saglam, Anu Vellore, Aman Priyanshu, Supriti Vijay, Massimo Aufiero, Arthur Goldblatt, Fraser Burch, Ed Li, Jianliang He, Dhruv Kedia, Kojin Oshiba, Zhouran Yang, Yaron Singer, Amin Karbasi
  • 日期: 2025-08-01
  • ArXiv主页 : https://arxiv.org/abs/2508.01059
  • 论文链接 : https://arxiv.org/pdf/2508.01059

英文摘要

Large language models (LLMs) have shown remarkable success across many domains, yet their integration into cybersecurity applications remains limited due to a lack of general-purpose cybersecurity data, representational complexity, and safety and regulatory concerns. To address this gap, we previously introduced Foundation-Sec-8B, a cybersecurity-focused LLM suitable for fine-tuning on downstream tasks. That model, however, was not designed for chat-style interactions or instruction-following. In this report, we release Foundation-Sec-8B-Instruct: a model specifically trained for general-purpose cybersecurity dialogue. Built on Foundation-Sec-8B, it combines domain-specific knowledge with instruction-following, conversational capabilities, and alignment with human preferences to produce high-quality, relevant responses. Comprehensive evaluations show that Foundation-Sec-8B-Instruct outperforms Llama 3.1-8B-Instruct on a range of cybersecurity tasks while matching its instruction-following performance. It is also competitive with GPT-4o-mini on cyber threat intelligence and instruction-following tasks. We envision Foundation-Sec-8B-Instruct becoming an indispensable assistant in the daily workflows of cybersecurity professionals. We release the model publicly at https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Instruct.

中文摘要

大型语言模型(LLM)在许多领域取得了显著成功,但由于缺乏通用的网络安全数据、表示上的复杂性以及安全与监管方面的顾虑,它们在网络安全应用中的落地仍然有限。为弥补这一差距,我们此前发布了 Foundation-Sec-8B,一个专注网络安全、适合在下游任务上微调的 LLM。但该模型并非为聊天式交互或指令跟随而设计。在本报告中,我们发布 Foundation-Sec-8B-Instruct:一个专为通用网络安全对话训练的模型。它构建于 Foundation-Sec-8B 之上,将领域知识与指令跟随、对话能力以及与人类偏好的对齐相结合,以产生高质量、切题的回答。全面评估显示,Foundation-Sec-8B-Instruct 在一系列网络安全任务上优于 Llama 3.1-8B-Instruct,同时指令跟随性能与之相当;在网络威胁情报和指令跟随任务上也可与 GPT-4o-mini 竞争。我们期望 Foundation-Sec-8B-Instruct 成为网络安全从业者日常工作流中不可或缺的助手。模型公开于 https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Instruct。


Hi3DEval:面向 3D 生成内容的分层评估

英文摘要

Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.

中文摘要

尽管 3D 内容生成进展迅速,但对生成 3D 资产的质量评估仍然困难。现有方法主要依赖基于图像的指标,且仅在对象层面运作,难以刻画空间连贯性、材质真实感和高保真的局部细节。1)为应对这些挑战,我们提出 Hi3DEval,一个为 3D 生成内容量身定制的分层评估框架。它同时包含对象级和部件级评估,既能在多个维度上进行整体评估,又支持细粒度的质量分析。此外,我们通过显式评估材质真实感,将纹理评估拓展到美学外观之外,关注反照率、饱和度和金属度等属性。2)为支撑该框架,我们构建了 Hi3DBench,一个包含多样 3D 资产和高质量标注的大规模数据集,并配套可靠的多智能体标注管线。我们进一步提出基于混合 3D 表示的、具备 3D 感知的自动评分系统:利用基于视频的表示进行对象级与材质主题评估,以增强时空一致性建模;并采用预训练 3D 特征进行部件级感知。大量实验表明,我们的方法在建模 3D 特性方面优于现有的基于图像的指标,并与人类偏好更为一致,为人工评估提供了可扩展的替代方案。项目主页:https://zyh482.github.io/Hi3DEval/。
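
"对象级 + 部件级"的分层打分,最终仍要聚合成一个可比较的总分。下面是一个极简的聚合示意(权重 w_object 与维度名均为自拟假设,并非 Hi3DEval 的实际评分公式):

```python
def hierarchical_score(object_scores, part_scores, w_object=0.5):
    """分层评分聚合示意:
    object_scores: 对象级各维度得分,如 {"geometry": ..., "texture": ...};
    part_scores: 各部件的局部质量得分;
    两层分别取均值后按权重合成总分。"""
    obj = sum(object_scores.values()) / len(object_scores)
    part = sum(part_scores.values()) / len(part_scores)
    return w_object * obj + (1 - w_object) * part
```

分层保留中间得分的好处是:总分相同的两个资产,仍可区分是整体结构差还是局部细节差。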


今天的 LLM 准备好解释幸福感概念了吗?

英文摘要

Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

中文摘要

幸福感涵盖心理、身体与社会层面,对个人成长和做出明智的生活决策至关重要。随着人们越来越多地向大语言模型(LLM)咨询幸福感相关问题,一个关键挑战随之出现:LLM 能否生成既准确、又能面向不同受众量身定制的解释?高质量的解释既要求事实正确,也要求能满足不同专业水平用户的期望。在这项工作中,我们构建了一个大规模数据集,包含由十个不同 LLM 生成的、针对 2,194 个幸福感概念的 43,880 条解释。我们引入了一个由原则引导的 LLM-as-a-judge 评估框架,采用双评审来评估解释质量。此外,我们表明,使用监督微调(SFT)和直接偏好优化(DPO)对开源 LLM 进行微调,可以显著提升生成解释的质量。结果显示:(1)所提出的 LLM 评审与人工评估吻合良好;(2)解释质量在模型、受众和类别之间差异显著;(3)经 DPO 和 SFT 微调的模型优于更大规模的对应模型,证明了基于偏好的学习在专门解释任务上的有效性。
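
摘要中用到的直接偏好优化(DPO),其标准损失形式是:被偏好回答与被拒绝回答相对参考模型的对数概率比之差,乘以温度 beta,过 sigmoid 后取负对数。下面给出一个标量版示意(仅演示公式本身,不含序列建模细节):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """标准 DPO 损失的标量示意。
    pi_* : 当前策略对被偏好/被拒绝回答的对数概率;
    ref_*: 参考模型对应的对数概率。"""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)):margin 越大(越偏向 chosen),损失越小
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

当两个回答相对参考模型的提升相同时,margin 为 0,损失退化为 log 2;提高被偏好回答的相对概率则单调降低损失。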


我们评估文档检索增强生成的方法正确吗?

英文摘要

Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on specific part of document RAG system and use synthetic data with incomplete ground truth and evidence labels, therefore failing to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that is able to produce fine-grained assessment to each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types with streamlined dynamic update support for potential data contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs and 4 end-to-end document RAG frameworks demonstrate the gap between text and visual embedding models is narrowing, highlighting the need in building stronger document retrieval models. Our findings also reveal the over-confidence dilemma within current document RAG frameworks that tend to provide answer even without evidence support. We hope our fully open-source Double-Bench provide a rigorous foundation for future research in advanced document RAG systems. We plan to retrieve timely corpus and release new benchmarks on an annual basis.

中文摘要

使用多模态大语言模型(MLLM)的检索增强生成(RAG)系统在复杂文档理解上展现出巨大潜力,但其发展严重受制于评估手段的不足。现有基准往往只关注文档 RAG 系统的某一部分,并使用真值与证据标注不完整的合成数据,因此无法反映真实世界的瓶颈与挑战。为克服这些局限,我们提出 Double-Bench:一个新的大规模、多语言、多模态评估系统,能够对文档 RAG 系统中的每个组件进行细粒度评估。它包含 3,276 份文档(共 72,880 页)和 5,168 条单跳与多跳查询,覆盖 6 种语言和 4 类文档,并提供精简的动态更新机制以应对潜在的数据污染问题。查询均以逐页穷尽扫描的证据页为依据,并经人工专家核验,以确保最高的质量与完整性。我们对 9 个最先进的嵌入模型、4 个 MLLM 和 4 个端到端文档 RAG 框架开展的全面实验表明,文本嵌入模型与视觉嵌入模型之间的差距正在缩小,凸显出构建更强文档检索模型的必要性。我们的发现还揭示了当前文档 RAG 框架中的过度自信困境:即便缺乏证据支持也倾向于给出答案。我们希望完全开源的 Double-Bench 能为先进文档 RAG 系统的后续研究奠定严格的基础。我们计划持续收集最新语料,并每年发布新的基准。
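
对 RAG 系统的检索组件做"细粒度"评估,一个基础指标是证据页召回率:前 k 个检索结果覆盖了多少条人工标注的证据页。以下为示意实现(非 Double-Bench 官方代码,页面标识方式为自拟假设):

```python
def evidence_recall_at_k(retrieved_pages, evidence_pages, k=5):
    """证据页 recall@k 示意:
    retrieved_pages: 检索器按相关度排序返回的页面列表;
    evidence_pages: 人工标注的证据页列表;
    页面用 (文档ID, 页码) 元组标识。"""
    topk = set(retrieved_pages[:k])
    if not evidence_pages:
        return 0.0
    return sum(1 for p in evidence_pages if p in topk) / len(evidence_pages)
```

这类组件级指标能把"检索没找到证据"与"生成器无视证据"两种失败分开,这正是摘要强调端到端评估不足的原因。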

