[Paper Digest] 2025 Week 39 (Sep 21-27) (Robotics / Embodied AI / LLM)


Qwen3-Omni Technical Report

  • Title: Qwen3-Omni Technical Report

  • Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin

  • Date: 2025-09-22

  • arXiv page: https://arxiv.org/abs/2509.17765

  • Paper link: https://arxiv.org/pdf/2509.17765

  • GitHub repository: https://github.com/QwenLM/Qwen3-Omni

Abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
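
The streaming path is the easiest part to picture in code. Below is a minimal sketch of the idea, with entirely hypothetical module names and sizes and a plain linear layer standing in for the causal ConvNet described above (this is not the released architecture): the Talker emits one multi-codebook codec frame per autoregressive step, and each frame is decoded into audio samples immediately, so playback can start from the first frame rather than after a block-wise diffusion pass.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real Qwen3-Omni configuration is not reproduced here.
NUM_CODEBOOKS, CODEBOOK_SIZE, HIDDEN, SAMPLES_PER_FRAME = 4, 1024, 256, 480

class TinyTalker(nn.Module):
    """Autoregressively predicts one multi-codebook codec frame per step (greedy decoding)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK_SIZE, HIDDEN)
        self.rnn = nn.GRUCell(HIDDEN, HIDDEN)
        self.heads = nn.ModuleList([nn.Linear(HIDDEN, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)])

    def step(self, prev_codes, h):
        x = self.embed(prev_codes).mean(dim=1)                     # (B, HIDDEN)
        h = self.rnn(x, h)
        codes = torch.stack([head(h).argmax(-1) for head in self.heads], dim=1)
        return codes, h                                            # (B, NUM_CODEBOOKS)

class FrameDecoder(nn.Module):
    """Tiny per-frame decoder standing in for the paper's lightweight causal ConvNet."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Embedding(CODEBOOK_SIZE, HIDDEN)
        self.to_wave = nn.Linear(HIDDEN, SAMPLES_PER_FRAME)        # one waveform chunk per codec frame

    def forward(self, codes):                                      # codes: (B, NUM_CODEBOOKS)
        return self.to_wave(self.proj(codes).mean(dim=1))          # (B, SAMPLES_PER_FRAME)

talker, decoder = TinyTalker(), FrameDecoder()
codes, h = torch.zeros(1, NUM_CODEBOOKS, dtype=torch.long), torch.zeros(1, HIDDEN)
for t in range(5):                                                 # audio is emitted from the very first frame
    codes, h = talker.step(codes, h)
    chunk = decoder(codes)
    print(f"frame {t}: emit {chunk.shape[-1]} samples")
```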


RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

  • Title: RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
  • Authors: Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qi Chen, Scarlett Li, Mao Yang
  • Date: 2025-09-19
  • arXiv page: https://arxiv.org/abs/2509.16198
  • Paper link: https://arxiv.org/pdf/2509.16198

Abstract

Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces repositories averaging nearly 36K LOC, roughly 3.9× the strongest baseline (Claude Code) and about 64× other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.
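
The abstract's idea of encoding capabilities, file structures, data flows, and functions in one graph can be made concrete with a toy data structure. The sketch below is purely illustrative (node kinds, fields, and the traversal order are assumptions, not the paper's schema): proposal-level capability nodes are refined into file and function nodes, and generation then visits functions in a rough dependency order.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                        # "capability" | "file" | "function"
    name: str
    children: list = field(default_factory=list)     # containment: capability -> file -> function
    data_flow: list = field(default_factory=list)    # names of producers this node consumes

# Proposal-level planning: capabilities only.
rpg = Node("capability", "csv-analytics", children=[
    Node("capability", "loading"),
    Node("capability", "statistics"),
])

# Implementation-level refinement: attach files, functions, and data-flow edges.
rpg.children[0].children.append(
    Node("file", "loader.py", children=[Node("function", "read_csv")]))
rpg.children[1].children.append(
    Node("file", "stats.py", children=[Node("function", "column_mean", data_flow=["read_csv"])]))

def function_nodes(node):
    out = [node] if node.kind == "function" else []
    for child in node.children:
        out += function_nodes(child)
    return out

# Graph-guided generation: visit producers before their consumers (crude dependency order).
for fn in sorted(function_nodes(rpg), key=lambda n: len(n.data_flow)):
    print("generate", fn.name, "after", fn.data_flow or "nothing")
```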


Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

Abstract

Arabic document OCR remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine- tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25 and establishing a new state-of-the-art in the domain of Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.
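
For readers who want to sanity-check the headline number: WER is the word-level edit distance between hypothesis and reference divided by the reference length, so a WER of 0.25 corresponds to roughly one error per four reference words. A minimal reference implementation (not Baseer's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("السلام عليكم ورحمة الله", "السلام عليكم ورحمه الله"))  # one substitution over 4 words -> 0.25
```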


VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

Abstract

Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly consider LLMs' learning ability for samples of different difficulty levels, which is contrary to the human cognitive process of mathematical reasoning tasks from easy to difficult. Intuitively, we find that the variance of the rollout group's reward in RLVR partly reflects the difficulty of the current sample for LLMs. Samples that are too easy or too difficult have a lower variance, while samples with moderate difficulty have a higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models reveal the advantages of VCRL over the current LLM RL baselines.
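
The signal VCRL builds on is easy to state: within the group of rollouts for one prompt, compute the variance of the rewards; near-zero variance means the prompt is currently too easy or too hard, while moderate variance marks prompts at the edge of the model's ability. A toy version of that selection rule (the keep ratio below is an invented illustration, not the paper's schedule):

```python
import numpy as np

def group_reward_variance(rewards):
    """Variance of rewards across the rollouts sampled for one prompt."""
    return float(np.var(rewards))

def select_curriculum(batch, keep_ratio=0.5):
    """Keep the prompts whose rollout groups have the highest reward variance."""
    scored = sorted(batch, key=lambda item: group_reward_variance(item["rewards"]), reverse=True)
    return scored[: max(1, int(len(scored) * keep_ratio))]

batch = [
    {"prompt": "easy problem",   "rewards": [1, 1, 1, 1]},   # solved every time -> variance 0
    {"prompt": "hard problem",   "rewards": [0, 0, 0, 0]},   # never solved      -> variance 0
    {"prompt": "medium problem", "rewards": [1, 0, 1, 0]},   # mixed outcomes    -> variance 0.25
]
for item in select_curriculum(batch):
    print("train on:", item["prompt"])
```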


MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Abstract

Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
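
A toy rendering of Variance-Aware Sampling, with an invented weighting and a crude diversity proxy (the paper's actual Variance Promotion Score is defined more carefully): each prompt is scored by combining the variance of its rollout outcomes with the diversity of its sampled trajectories, and prompts are then drawn in proportion to that score so GRPO groups keep a non-zero reward variance.

```python
import numpy as np

def vps(rewards, trajectories, alpha=0.5):
    """Toy Variance Promotion Score: outcome variance plus a trajectory-diversity bonus."""
    outcome_var = np.var(rewards)
    unique_ratio = len(set(trajectories)) / len(trajectories)   # crude diversity proxy
    return alpha * outcome_var + (1 - alpha) * unique_ratio

def sample_prompts(pool, k, rng=np.random.default_rng(0)):
    """Draw k prompts with probability proportional to their VPS."""
    scores = np.array([vps(p["rewards"], p["trajectories"]) for p in pool])
    probs = scores / scores.sum()
    idx = rng.choice(len(pool), size=k, replace=False, p=probs)
    return [pool[i] for i in idx]

pool = [
    {"name": "p1", "rewards": [1, 1, 1, 1], "trajectories": ["a", "a", "a", "a"]},
    {"name": "p2", "rewards": [1, 0, 1, 0], "trajectories": ["a", "b", "c", "d"]},
    {"name": "p3", "rewards": [0, 0, 0, 1], "trajectories": ["a", "a", "b", "b"]},
]
print([p["name"] for p in sample_prompts(pool, k=2)])
```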


LIMI: Less Is More for Agency

Abstract

We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement with environments and tools. This fundamental capability marks the dawn of the Age of AI Agency, driven by a critical industry shift: the urgent need for AI systems that don't just think, but work. While current AI excels at reasoning and generating responses, industries demand autonomous agents that can execute tasks, operate tools, and drive real-world outcomes. As agentic intelligence becomes the defining characteristic separating cognitive systems from productive workers, efficiently cultivating machine autonomy becomes paramount. Current approaches assume that more data yields better agency, following traditional scaling laws from language modeling. We fundamentally challenge this paradigm. LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles. Through strategic focus on collaborative software development and scientific research workflows, we show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Using only 78 carefully designed training samples, LIMI achieves 73.5% on comprehensive agency benchmarks, dramatically outperforming state-of-the-art models: Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI demonstrates 53.7% improvement over models trained on 10,000 samples-achieving superior agentic intelligence with 128 times fewer samples. Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.


SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

  • Title: SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

  • Authors: Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai

  • Date: 2025-09-25

  • arXiv page: https://arxiv.org/abs/2509.21320

  • Paper link: https://arxiv.org/pdf/2509.21320

  • GitHub repository: https://github.com/open-sciencelab/SciReason

Abstract

We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.


Video Models Are Zero-Shot Learners and Reasoners

Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.


Tree Search for LLM Agent Reinforcement Learning

Abstract

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
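
Because sibling subtrees share a common prefix, their backed-up outcome rewards are directly comparable, which is what turns outcome-only rewards into a step-wise signal. A stripped-down sketch of the intra-tree part (the node fields, backed-up reward, and normalization are illustrative assumptions, not the paper's exact formulation):

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class Node:
    reward: float = 0.0            # outcome reward averaged over the leaves below this node
    children: list = field(default_factory=list)
    advantage: float = 0.0

def intra_tree_advantages(node):
    """Sibling groups share a prefix, so their rewards can be compared directly."""
    if not node.children:
        return
    rewards = [c.reward for c in node.children]
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    for child in node.children:
        child.advantage = (child.reward - mu) / sigma   # step-wise signal from outcome rewards only
        intra_tree_advantages(child)

root = Node(children=[
    Node(reward=1.0, children=[Node(reward=1.0), Node(reward=1.0)]),
    Node(reward=0.0, children=[Node(reward=0.0), Node(reward=0.0)]),
])
intra_tree_advantages(root)
print([round(c.advantage, 2) for c in root.children])   # the successful branch gets a positive advantage
```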


Seedream 4.0: Toward Next-generation Multimodal Image Generation

  • Title: Seedream 4.0: Toward Next-generation Multimodal Image Generation
  • Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
  • Date: 2025-09-24
  • arXiv page: https://arxiv.org/abs/2509.20427
  • Paper link: https://arxiv.org/pdf/2509.20427
  • Project page: https://seed.bytedance.com/en/seedream4_0

Abstract

We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to fast generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without a LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into an more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.


Reinforcement Learning on Pre-Training Data

  • Title: Reinforcement Learning on Pre-Training Data
  • Authors: Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang
  • Date: 2025-09-23
  • arXiv page: https://arxiv.org/abs/2509.19249
  • Paper link: https://arxiv.org/pdf/2509.19249

Abstract

The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1, 6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.
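
The next-segment reasoning objective needs no human annotation: the policy sees a prefix from the pre-training corpus, generates a continuation, and is rewarded by how well that continuation matches the held-out next segment. A toy stand-in for that reward (the paper's matching signal is more careful than this unigram-F1 proxy):

```python
def next_segment_reward(generated: str, reference_segment: str) -> float:
    """Toy reward: unigram F1 overlap between the generated continuation and the true next segment."""
    gen, ref = set(generated.split()), set(reference_segment.split())
    if not gen or not ref:
        return 0.0
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A pre-training document is split into a visible prefix and a held-out next segment.
prefix = "The derivative of x^2 is 2x."
next_segment = "Therefore the slope at x = 3 is 6."

rollout = "Therefore the slope at x = 3 equals 6."   # what the policy generated after reasoning over the prefix
print(round(next_segment_reward(rollout, next_segment), 3))   # ~0.889: close but not exact
```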


OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Abstract

Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline InsertPipe, constructing diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from subjects and source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology to optimize the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during reference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.


MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

  • Title: MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
  • Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
  • Date: 2025-09-19
  • arXiv page: https://arxiv.org/abs/2509.16197
  • Paper link: https://arxiv.org/pdf/2509.16197

Abstract

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
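
The hybrid tokenizer boils down to one shared vision encoder feeding two small adapters: one emits continuous embeddings for understanding, the other quantizes against a codebook to produce discrete tokens for generation. A minimal wiring sketch with made-up shapes (this only mirrors the interface the abstract describes, not the actual model):

```python
import torch
import torch.nn as nn

HIDDEN, CODEBOOK = 64, 512

class HybridVisionTokenizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, HIDDEN, kernel_size=16, stride=16)   # shared patch encoder
        self.continuous_adapter = nn.Linear(HIDDEN, HIDDEN)              # -> embeddings for understanding
        self.discrete_adapter = nn.Linear(HIDDEN, HIDDEN)                # -> logits against a codebook
        self.codebook = nn.Embedding(CODEBOOK, HIDDEN)

    def forward(self, image):
        feats = self.encoder(image).flatten(2).transpose(1, 2)           # (B, num_patches, HIDDEN)
        continuous = self.continuous_adapter(feats)
        # Nearest-codebook-entry quantization for the generation branch.
        logits = self.discrete_adapter(feats) @ self.codebook.weight.T   # (B, num_patches, CODEBOOK)
        discrete_tokens = logits.argmax(dim=-1)                          # (B, num_patches)
        return continuous, discrete_tokens

tok = HybridVisionTokenizer()
cont, disc = tok(torch.randn(1, 3, 64, 64))
print(cont.shape, disc.shape)    # torch.Size([1, 16, 64]) torch.Size([1, 16])
```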


Do You Need Proprioceptive States in Visuomotor Policies?

Abstract

Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where both visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. On the contrary, we propose the State-free Policy, removing the proprioceptive state input and predicting actions only conditioned on visual observations. The State-free Policy is built in the relative end-effector action space, and should ensure the full task-relevant visual observations, here provided by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0% to 85% in height generalization and from 6% to 64% in horizontal generalization. Furthermore, they also show advantages in data efficiency and cross-embodiment adaptation, enhancing their practicality for real-world deployment.
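
The proposed change is architectural rather than algorithmic: the policy consumes only the two wrist-camera images and predicts a relative end-effector action, with no proprioceptive state vector concatenated anywhere. A minimal sketch with hypothetical dimensions (not the authors' implementation):

```python
import torch
import torch.nn as nn

class StateFreePolicy(nn.Module):
    """Predicts relative end-effector actions from dual wrist-camera images only."""
    def __init__(self, action_dim=7):
        super().__init__()
        self.backbone = nn.Sequential(                     # tiny stand-in for a visual encoder
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16 * 2, action_dim)          # fuse the two wrist views

    def forward(self, left_wrist, right_wrist):
        feats = torch.cat([self.backbone(left_wrist), self.backbone(right_wrist)], dim=-1)
        return self.head(feats)                            # relative delta-pose + gripper; no state input

policy = StateFreePolicy()
action = policy(torch.randn(1, 3, 96, 96), torch.randn(1, 3, 96, 96))
print(action.shape)   # torch.Size([1, 7])
```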


MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

  • Title: MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

  • Authors: Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun

  • Date: 2025-09-16

  • arXiv page: https://arxiv.org/abs/2509.18154

  • Paper link: https://arxiv.org/pdf/2509.18154

  • GitHub repository: https://github.com/OpenBMB/MiniCPM-V

Abstract

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% GPU memory cost and 8.7% inference time of Qwen2.5-VL 7B.


Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Abstract

Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59-without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
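
The composition idea can be shown with placeholder encoders and decoders around one shared latent space; the functions below are trivial stand-ins (in LZN they are learned networks mapping each modality into disjoint latent zones), but they make the point that generation, embedding, and classification are just different compositions of the same maps.

```python
import numpy as np

LATENT_DIM = 8

# Placeholder encoders/decoders into a shared latent space (learned networks in the real model).
def label_encoder(label: int) -> np.ndarray:
    rng = np.random.default_rng(label)               # deterministic "zone" per label
    return rng.normal(size=LATENT_DIM)

def image_encoder(image: np.ndarray) -> np.ndarray:
    return image.reshape(-1)[:LATENT_DIM]

def image_decoder(z: np.ndarray) -> np.ndarray:
    return np.tile(z, 4).reshape(4, LATENT_DIM)      # fake "image" produced from a latent

def label_decoder(z: np.ndarray, num_classes: int = 10) -> int:
    # Classify by the nearest label zone in latent space.
    dists = [np.linalg.norm(z - label_encoder(c)) for c in range(num_classes)]
    return int(np.argmin(dists))

# Tasks expressed as compositions of the same building blocks:
generated = image_decoder(label_encoder(3))                 # label-conditional image generation
embedding = image_encoder(generated)                        # representation learning
predicted = label_decoder(image_encoder(generated))         # classification
print(generated.shape, embedding.shape, predicted)          # (4, 8) (8,) 3
```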


SIM-CoT: Supervised Implicit Chain-of-Thought

Abstract

Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.
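
The step-level supervision is a training-time addition only: an auxiliary decoder is asked to reconstruct the i-th explicit reasoning step from the i-th implicit token, which keeps the latent tokens semantically distinct, and the decoder is dropped at inference. A schematic sketch of that auxiliary loss (the decoder and dimensions here are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

HIDDEN, VOCAB, STEP_LEN = 64, 1000, 8

aux_decoder = nn.Linear(HIDDEN, VOCAB)               # stand-in for the auxiliary step decoder
criterion = nn.CrossEntropyLoss()

def sim_cot_aux_loss(latent_tokens, explicit_steps):
    """Align the i-th implicit (latent) token with the tokens of the i-th explicit reasoning step."""
    loss = 0.0
    for z_i, step_tokens in zip(latent_tokens, explicit_steps):      # one latent token per step
        logits = aux_decoder(z_i).expand(len(step_tokens), VOCAB)    # same logits for each step position
        loss = loss + criterion(logits, step_tokens)
    return loss / len(latent_tokens)

latents = torch.randn(3, HIDDEN)                      # 3 implicit reasoning tokens
steps = [torch.randint(0, VOCAB, (STEP_LEN,)) for _ in range(3)]     # tokenized explicit CoT steps
print(sim_cot_aux_loss(latents, steps))               # added to the main LM loss during training only
```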


EmbeddingGemma: Powerful and Lightweight Text Representations

  • Title: EmbeddingGemma: Powerful and Lightweight Text Representations
  • Authors: Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divya Sreepat, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini
  • Date: 2025-09-24
  • arXiv page: https://arxiv.org/abs/2509.20354
  • Paper link: https://arxiv.org/pdf/2509.20354

Abstract

We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.


Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

  • Title: Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
  • Authors: Team Hunyuan3D, Bowen Zhang, Chunchao Guo, Haolin Liu, Hongyu Yan, Huiwen Shi, Jingwei Huang, Junlin Yu, Kunhong Li, Linus, Penghao Wang, Qingxiang Lin, Sicong Liu, Xianghui Yang, Yixuan Tang, Yunfei Zhao, Zeqiang Lai, Zhihao Liang, Zibo Zhao
  • Date: 2025-09-25
  • arXiv page: https://arxiv.org/abs/2509.21245
  • Paper link: https://arxiv.org/pdf/2509.21245
  • Project page: https://3d.hunyuan.tencent.com/

Abstract

Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
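
The difficulty-aware sampling strategy is straightforward to write down: each training example is conditioned on exactly one control modality, drawn with probabilities tilted toward harder signals such as skeletal pose and away from easier ones such as point clouds. A small sketch with invented weights (the abstract does not publish the real schedule):

```python
import random

# Harder conditioning signals get higher sampling weight (illustrative values only).
CONTROL_WEIGHTS = {
    "skeletal_pose": 0.40,
    "bounding_box": 0.25,
    "voxels": 0.20,
    "point_cloud": 0.15,
}

def sample_control_modality(rng=random.Random(0)):
    modalities, weights = zip(*CONTROL_WEIGHTS.items())
    return rng.choices(modalities, weights=weights, k=1)[0]

def make_training_example(asset_id: str):
    """Each training example is conditioned on exactly one control signal."""
    return {"asset": asset_id, "condition": sample_control_modality()}

batch = [make_training_example(f"asset_{i}") for i in range(5)]
for ex in batch:
    print(ex)
```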


SWE-QA: Can Language Models Answer Repository-level Code Questions?

Abstract

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.


ARE: Scaling Up Agent Environments and Evaluations

Abstract

We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments, integration of synthetic or real applications, and execution of agentic orchestrations. ARE provides simple abstractions to build complex and diverse environments, each with their own rules, tools, content, and verifiers, helping to bridge the gap between model development and real-world deployment. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities. Beyond search and execution, Gaia2 requires agents to handle ambiguities and noise, adapt to dynamic environments, collaborate with other agents, and operate under temporal constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new failure modes that are invisible in static settings. Our experiments show that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling curves plateau, highlighting the need for new architectures and adaptive compute strategies. Perhaps more importantly, ARE abstractions enable continuous extension of Gaia2 to other environments, empowering the community to rapidly create new benchmarks tailored to their domains. In AI's second half, progress increasingly depends on defining meaningful tasks and robust evaluations to drive frontier capabilities forward.


OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System

  • Title: OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
  • Authors: Sunhao Dai, Jiakai Tang, Jiahua Wu, Kun Wang, Yuxuan Zhu, Bingjun Chen, Bangyang Hong, Yu Zhao, Cong Fu, Kangle Wu, Yabo Ni, Anxiang Zeng, Wenjie Wang, Xu Chen, Jun Xu, See-Kiong Ng
  • Date: 2025-09-22
  • arXiv page: https://arxiv.org/abs/2509.18091
  • Paper link: https://arxiv.org/pdf/2509.18091

Abstract

Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi-step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM-style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block-wise latent reasoning, which equips the model with multi-step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi-task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including over +2% GMV/UU and a +2.90% increase in advertising revenue.


AutoIntent: AutoML for Text Classification

Abstract

AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.


TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

  • Title: TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
  • Authors: Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
  • Date: 2025-09-25
  • arXiv page: https://arxiv.org/abs/2509.21117
  • Paper link: https://arxiv.org/pdf/2509.21117

Abstract

The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C≠A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.
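
Distribution-sensitive scoring can be illustrated in a few lines: instead of collapsing the judge's output to a single discrete rating, take the expectation over its probability distribution across rating values, which preserves information and turns apparent ties into an ordered comparison. A worked toy example with invented probabilities:

```python
def expected_score(rating_probs: dict[int, float]) -> float:
    """Continuous score = expectation of the judge's distribution over discrete ratings."""
    return sum(rating * p for rating, p in rating_probs.items())

# Two answers that a hard argmax would both rate "4", creating an apparent tie.
answer_a = {3: 0.10, 4: 0.60, 5: 0.30}
answer_b = {3: 0.35, 4: 0.60, 5: 0.05}

score_a, score_b = expected_score(answer_a), expected_score(answer_b)
print(score_a, score_b)                 # about 4.2 vs 3.7: the tie is resolved consistently
print("prefer A" if score_a > score_b else "prefer B")
```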


How Far Are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Abstract

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, ie, basic perception, spatial understanding, spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.

