【论文速递】2025年第38周(Sep-14-20)(Robotics/Embodied AI/LLM)

中文使用 googletrans 翻译,翻译不对的地方以英文为准

OmniWorld:用于4D世界建模的多领域多模态数据集

英文摘要

The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.

中文摘要

4D世界建模旨在同时刻画空间几何与时间动态,近年来在大规模生成模型和多模态学习的推动下取得了显著进展。然而,真正通用的4D世界模型的发展仍从根本上受限于高质量数据的可获得性。现有数据集和基准通常缺乏支撑4D几何重建、未来预测和相机可控视频生成等关键任务所需的动态复杂性、多领域多样性以及时空标注。为填补这一空白,我们提出OmniWorld,一个专为4D世界建模设计的大规模、多领域、多模态数据集。OmniWorld由新采集的OmniWorld-Game数据集和若干覆盖不同领域的精选公开数据集组成。与现有合成数据集相比,OmniWorld-Game提供了更丰富的模态覆盖、更大的规模以及更真实的动态交互。基于该数据集,我们建立了一个具有挑战性的基准,揭示了当前最先进(SOTA)方法在建模复杂4D环境时的局限性。此外,在OmniWorld上微调现有SOTA方法,可在4D重建和视频生成任务上带来显著的性能提升,有力验证了OmniWorld作为训练与评测资源的价值。我们期望OmniWorld成为加速通用4D世界模型发展的催化剂,最终推进机器对物理世界的整体理解。


ScaleCUA:利用跨平台数据扩展开源计算机使用智能体

  • 标题: ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

  • 作者: Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang

  • 日期: 2025-09-18

  • ArXiv主页 : https://arxiv.org/abs/2509.15221

  • 论文链接 : https://arxiv.org/pdf/2509.15221

  • GitHub仓库 : https://github.com/OpenGVLab/ScaleCUA

英文摘要

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.

中文摘要

视觉语言模型(VLM)催生了能够自主操作图形界面的计算机使用智能体(CUA),展现出巨大潜力,但其进展受限于缺乏大规模、开源的计算机使用数据和基础模型。在这项工作中,我们提出ScaleCUA,朝着扩展开源CUA迈出一步。它提供了一个覆盖6种操作系统和3个任务领域的大规模数据集,该数据集通过一个将自动化智能体与人类专家相结合的闭环管线构建。在这些规模化数据上训练后,ScaleCUA能够跨平台无缝运行。具体而言,它相对基线取得了显著提升(WebArena-Lite-v2上+26.6,ScreenSpot-Pro上+10.7),并创造了新的最先进成绩(MMBench-GUI L1-Hard上94.4%,OSWorld-G上60.6%,WebArena-Lite-v2上47.4%)。这些发现凸显了数据驱动的规模化对于通用计算机使用智能体的威力。我们将发布数据、模型和代码以推动未来研究:https://github.com/OpenGVLab/ScaleCUA。


WebWeaver:通过动态大纲构建网络规模证据以进行开放式深度研究

英文摘要

This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.

中文摘要

本文研究开放式深度研究(OEDR)这一复杂挑战:AI智能体需要把海量的网络规模信息综合成有洞见的报告。现有方法受到双重局限:一是将规划与证据获取解耦的静态研究管线,二是容易出现"中间信息丢失"和幻觉等长上下文失效问题的一次性生成范式。为应对这些挑战,我们提出WebWeaver,一个模拟人类研究过程的新型双智能体框架。规划器在一个动态循环中运行,迭代地交替进行证据获取与大纲优化,产出一份全面的、有出处支撑的大纲,并链接到一个证据记忆库。写作器随后执行分层检索与写作流程,逐节撰写报告。通过针对每一部分仅从记忆库中检索必要证据,它有效缓解了长上下文问题。我们的框架在DeepResearch Bench、DeepConsult和DeepResearchGym等主要OEDR基准上确立了新的最先进水平。这些结果验证了我们以人为中心的迭代方法,表明自适应规划与聚焦式综合对于产出高质量、可靠且结构良好的报告至关重要。


通过持续预训练扩展智能体

  • 标题: Scaling Agents via Continual Pre-training
  • 作者: Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
  • 日期: 2025-09-16
  • ArXiv主页 : https://arxiv.org/abs/2509.13310
  • 论文链接 : https://arxiv.org/pdf/2509.13310
  • 项目链接 : https://tongyi-agent.github.io/blog/

英文摘要

Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retains strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.

中文摘要

大型语言模型(LLM)已演化为能够自主使用工具并进行多步推理以解决复杂问题的智能体系统。然而,建立在通用基础模型之上的后训练方法在智能体任务中表现持续欠佳,在开源实现中尤为明显。我们找出了根本原因:由于缺乏强大的智能体基础模型,模型在后训练阶段不得不同时学习多样的智能体行为并向专家示范对齐,从而产生了根本性的优化冲突。为此,我们首次提出将智能体持续预训练(Agentic CPT)引入深度研究智能体的训练管线,以构建强大的智能体基础模型。基于这一方法,我们开发了名为AgentFounder的深度研究智能体模型。我们在10个基准上评估了AgentFounder-30B,在保持强大工具使用能力的同时取得了最先进的性能,包括BrowseComp-en上39.9%、BrowseComp-zh上43.3%,以及HLE上31.5%的Pass@1。


FlowRL:为LLM推理匹配奖励分布

  • 标题: FlowRL: Matching Reward Distributions for LLM Reasoning

  • 作者: Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin

  • 日期: 2025-09-18

  • ArXiv主页 : https://arxiv.org/abs/2509.15207

  • 论文链接 : https://arxiv.org/pdf/2509.15207

  • GitHub仓库 : https://github.com/Xuekai-Zhu/FlowRL

英文摘要

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.

中文摘要

我们提出FlowRL:在大语言模型(LLM)强化学习(RL)中,通过流平衡来匹配完整的奖励分布,而非单纯最大化奖励。近期的先进推理模型采用奖励最大化方法(如PPO和GRPO),它们倾向于过度优化占主导的奖励信号,而忽视出现频率较低但同样有效的推理路径,从而降低了多样性。与之相反,我们利用一个可学习的配分函数将标量奖励转换为归一化的目标分布,再最小化策略与该目标分布之间的反向KL散度。我们将这一思想实现为一种流平衡优化方法,以促进多样化探索和可泛化的推理轨迹。我们在数学与代码推理任务上进行了实验:在数学基准上,FlowRL相对GRPO平均提升10.0%,相对PPO平均提升5.1%,并在代码推理任务上持续表现更优。这些结果表明,奖励分布匹配是LLM强化学习中实现高效探索和多样化推理的关键一步。
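下面给出一个极简的概念性示意(并非论文官方实现;张量形状、beta 以及把残差平方当作损失的写法均为笔者假设),用来说明"用可学习的配分函数把标量奖励变成目标分布、再让策略去匹配它"这一核心思想:

```python
import torch

# 假设:rewards 为同一提示下 G 条采样轨迹的标量奖励,logp 为当前策略对这些轨迹的
# 对数概率(形状均为 [G]);log_Z 是可学习的配分函数参数。
def flow_matching_loss(logp: torch.Tensor, rewards: torch.Tensor, log_Z: torch.Tensor,
                       beta: float = 1.0) -> torch.Tensor:
    """让策略匹配 p_target ∝ exp(beta * r) 的示意损失。

    对策略采样的轨迹,log pi(x) - log p_target(x) = logp + log_Z - beta * r;
    这里把该残差的平方作为流平衡式的目标(类似轨迹平衡),残差为零时策略即匹配目标分布。
    仅用于演示思想,并非 FlowRL 论文中的精确损失形式。
    """
    residual = logp + log_Z - beta * rewards
    return (residual ** 2).mean()

# 玩具示例:4 条轨迹的奖励与对数概率
logp = torch.tensor([-3.2, -4.1, -2.8, -5.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
log_Z = torch.zeros(1, requires_grad=True)      # 可学习的 log 配分函数
loss = flow_matching_loss(logp, rewards, log_Z)
loss.backward()
print(float(loss), logp.grad, log_Z.grad)
```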


Hala技术报告:大规模构建以阿拉伯语为中心的指令与翻译模型

英文摘要

We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong ARleftrightarrowEN teacher to FP8 (yielding sim2times higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" (leq2B) and "small" (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.

中文摘要

我们提出Hala,一个以阿拉伯语为中心的指令与翻译模型家族,采用我们的"翻译并微调"(translate-and-tune)管线构建。我们首先将一个强大的阿拉伯语↔英语教师模型压缩为FP8(吞吐量约提高2倍且无质量损失),并用它生成高保真的双语监督数据。随后在这些数据上微调轻量级语言模型LFM2-1.2B,并用其将高质量英文指令集翻译为阿拉伯语,构建出一个面向指令遵循的百万级语料库。我们训练了350M、700M、1.2B和9B参数规模的Hala模型,并采用slerp合并来平衡阿拉伯语专长与基座模型的能力。在以阿拉伯语为中心的基准上,Hala在"nano"(≤2B)和"small"(7-9B)两个级别均取得最先进的结果,优于各自的基座模型。我们发布模型、数据、评测和配方,以加速阿拉伯语NLP研究。


WebSailor-V2:通过合成数据和可扩展的强化学习弥合与专有智能体的鸿沟

英文摘要

Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

中文摘要

超越人类认知局限是LLM训练的一个关键前沿。DeepResearch等专有智能体系统已经在BrowseComp等极其复杂的信息检索基准上展现出超人能力,这是此前无法企及的成就。我们认为,它们的成功依赖于开源模型所欠缺的一种复杂推理模式:在浏览浩瀚信息空间时系统性地降低极端不确定性的能力。基于这一洞察,我们提出WebSailor,一套旨在赋予模型这一关键能力的完整后训练方法。我们的方法包括通过结构化采样与信息混淆生成新颖的高不确定性任务、RFT冷启动,以及一种高效的智能体RL训练算法,即重复采样策略优化(DUPO)。借助这一集成管线,WebSailor在复杂信息检索任务上显著优于所有开源智能体,达到了专有智能体的水平,弥合了能力差距。


通过环境扩展迈向通用智能体智能

英文摘要

Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.

中文摘要

先进的智能体智能是将大型语言模型部署到实际现实应用中的前提。多样的现实世界API要求精确、鲁棒的函数调用能力,而智能体需要通过在多样化环境中的交互来发展这些能力。函数调用能力的广度与智能体所训练环境的多样性密切相关。在这项工作中,我们将环境规模化作为推进通用智能体智能的一步。这带来两个核心挑战:(i)如何以有原则的方式扩展环境;(ii)如何从与这些环境交互获得的经验中有效训练智能体能力。为此,我们设计了一个可扩展框架,自动构建完全可模拟的异构环境,系统性地拓宽函数调用场景的空间。我们进一步采用两阶段智能体微调策略:先赋予智能体基础的智能体能力,再针对特定领域情境进行专门化。在tau-bench、tau2-Bench和ACEBench等智能体基准上的大量实验表明,我们训练的模型AgentScaler显著增强了模型的函数调用能力。


ReSum:通过上下文摘要解锁长程搜索智能

英文摘要

Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.

中文摘要

基于大型语言模型(LLM)的网络智能体在知识密集型任务上表现出强大性能,但在ReAct等范式下受到上下文窗口限制的制约。涉及多个实体、交织关系和高度不确定性的复杂查询需要大量搜索轮次,往往在得到完整答案之前就迅速耗尽上下文预算。为克服这一挑战,我们提出ReSum,一种通过周期性上下文摘要实现无限期探索的新范式。ReSum将不断增长的交互历史压缩为紧凑的推理状态,在绕开上下文限制的同时保留对先前发现的记忆。针对范式适配,我们提出ReSum-GRPO,将GRPO与分段轨迹训练及优势广播相结合,使智能体熟悉基于摘要条件的推理。在三个基准上对不同规模网络智能体的大量实验表明,ReSum相对ReAct平均绝对提升4.5%,经过ReSum-GRPO训练后进一步提升最高可达8.2%。值得注意的是,仅用1K训练样本,我们的WebResummer-30B(经ReSum-GRPO训练的WebSailor-30B版本)在BrowseComp-zh上取得33.3%的Pass@1,在BrowseComp-en上取得18.3%,超越现有开源网络智能体。
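下面是一个按摘要思路写的最小示意(agent_step、summarize_fn 以及 FINAL: 前缀约定均为假设的占位接口,并非 ReSum 的官方实现),展示"上下文接近预算时周期性地把历史压缩成推理状态"的控制流程:

```python
from typing import Callable, List

# 假设:agent_step 在"压缩状态 + 近期消息"的条件下返回下一步动作/观察文本,
# summarize_fn 把历史压缩为紧凑的推理状态;两者均为示意用的占位函数。
def resum_loop(question: str,
               agent_step: Callable[[str, List[str]], str],
               summarize_fn: Callable[[str, List[str]], str],
               max_context_chars: int = 8000,
               max_turns: int = 50) -> str:
    state = f"问题:{question}"          # 紧凑推理状态(初始为问题本身)
    recent: List[str] = []               # 上次摘要之后的原始交互消息
    for _ in range(max_turns):
        msg = agent_step(state, recent)
        recent.append(msg)
        if msg.startswith("FINAL:"):     # 约定:以 FINAL: 开头表示给出最终答案
            return msg[len("FINAL:"):]
        # 上下文接近预算时,周期性地把历史压缩进推理状态,而不是直接截断
        if len(state) + sum(map(len, recent)) > max_context_chars:
            state = summarize_fn(state, recent)
            recent = []                   # 摘要后清空原始消息,探索可继续进行
    return state                          # 轮数耗尽时返回当前推理状态
```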


WebResearcher:释放长程智能体的无界推理能力

英文摘要

Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.

中文摘要

深度研究系统的最新进展展示了AI智能体自主发现并综合外部来源知识的潜力。在本文中,我们提出WebResearcher,一个通过两个关键组件构建此类智能体的新框架:(1)WebResearcher,一种迭代式深度研究范式,将深度研究重新表述为马尔可夫决策过程,智能体周期性地把发现整合进不断演进的报告,同时维持聚焦的工作空间,从而克服困扰现有单上下文方法的上下文窒息与噪声污染问题;(2)WebFrontier,一个可扩展的数据合成引擎,通过工具增强的复杂度升级生成高质量训练数据,使研究任务的系统化构建成为可能,弥合被动知识回忆与主动知识建构之间的差距。值得注意的是,我们发现来自该范式的训练数据即使对传统的单上下文方法也能显著增强工具使用能力。此外,我们的范式可通过并行思考自然扩展,支持多智能体并发探索以得出更全面的结论。在6个具有挑战性的基准上的大量实验表明,WebResearcher取得了最先进的性能,甚至超越了前沿的专有系统。


边界上的推理:通过测试时审议增强规范对齐

英文摘要

Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.

中文摘要

大型语言模型(LLM)正被越来越多地应用于多样的现实场景,每个场景都受用户或组织定制的行为与安全规范(spec)约束。这些规范分为安全规范与行为规范,随场景而异,并随偏好和需求的变化而演进。我们将这一挑战形式化为规范对齐,关注LLM从行为与安全两个角度遵循动态的、场景特定规范的能力。为应对这一挑战,我们提出Align3,一种轻量级方法,采用带有层次化反思与修订的测试时审议(TTD)来对规范边界进行推理。我们进一步提出SpecBench,一个衡量规范对齐的统一基准,涵盖5个场景、103条规范和1,500条提示。在15个推理模型和18个指令模型上使用Self-Refine、TPO、MoreThink等多种TTD方法的实验得出三个关键发现:(i)测试时审议可增强规范对齐;(ii)Align3以最小开销推进了安全性与有用性的权衡前沿;(iii)SpecBench能有效揭示对齐差距。这些结果凸显了测试时审议作为在现实世界规范边界上进行推理的有效策略的潜力。


UI-S1:通过半在线强化学习推进GUI自动化

  • 标题: UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
  • 作者: Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang
  • 日期: 2025-09-15
  • ArXiv主页 : https://arxiv.org/abs/2509.11543
  • 论文链接 : https://arxiv.org/pdf/2509.11543

英文摘要

Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution for lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address it, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that ours Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.

中文摘要

图形用户界面(GUI)智能体在通过强化学习自动化复杂界面交互方面取得了显著进展。然而,现有方法面临一个根本困境:离线RL可以在预先采集的轨迹上稳定训练,但由于缺乏轨迹级奖励信号,难以胜任多步任务执行;在线RL通过环境交互获取这些信号,却受困于稀疏奖励和高昂的部署成本。为此,我们提出半在线强化学习(Semi-online RL),一种在离线轨迹上模拟在线RL的新范式。在每次rollout过程中,我们在多轮对话内保留原始模型输出,并由一个Patch模块自适应地弥合rollout轨迹与专家轨迹之间的偏差。为捕获长期训练信号,半在线RL将折扣化的未来回报引入奖励计算,并用加权的步骤级与回合级优势来优化策略。我们进一步提出半在线性能(SOP)这一指标,它与真实在线性能更为一致,可作为现实评估的实用且有效的代理指标。实验表明,我们的半在线RL在四个动态基准上取得了7B模型中的SOTA性能,相对基座模型提升显著(如AndroidWorld上+12.0%,AITW上+23.8%),在弥合离线训练效率与在线多轮推理之间的差距上取得了重要进展。代码见 https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1。
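下面用一个简化示意说明"把折扣化的未来回报引入奖励计算,并用加权的步骤级与回合级优势优化策略"这句话的含义(基线取均值、权重取 0.5 等均为笔者假设,并非论文的具体做法):

```python
from typing import List

# 假设:step_rewards 为一条离线轨迹中每一步的即时奖励,episode_reward 为整条轨迹的
# 任务成败奖励;gamma、w_step、w_episode 为示意用的超参数。
def discounted_returns(step_rewards: List[float], gamma: float = 0.9) -> List[float]:
    """把折扣化的未来回报引入每一步的信号:G_t = r_t + gamma * G_{t+1}。"""
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def semi_online_advantages(step_rewards: List[float], episode_reward: float,
                           gamma: float = 0.9, w_step: float = 0.5,
                           w_episode: float = 0.5) -> List[float]:
    """加权组合步骤级与回合级优势(用均值作基线的朴素版本,仅作示意)。"""
    g = discounted_returns(step_rewards, gamma)
    step_baseline = sum(g) / len(g)
    step_adv = [x - step_baseline for x in g]
    return [w_step * a + w_episode * episode_reward for a in step_adv]

print(semi_online_advantages([0.0, 1.0, 0.0, 1.0], episode_reward=1.0))
```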


SAIL-VL2技术报告

  • 标题: SAIL-VL2 Technical Report
  • 作者: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
  • 日期: 2025-09-17
  • ArXiv主页 : https://arxiv.org/abs/2509.14033
  • 论文链接 : https://arxiv.org/pdf/2509.14033

英文摘要

We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.

中文摘要

我们介绍了Sail-VL2,这是一种开放式视觉语言基础模型(LVM),用于全面的多模式理解和推理。作为Sail-VL的继任者,Sail-VL2在不同图像和视频基准的2B和8B参数尺度上实现了最先进的性能,这表明从细粒度的感知到复杂的推理,都表明了强大的能力。三项核心创新推动了其有效性。首先,具有评分和过滤策略的大规模数据策划管道可增强字幕,OCR,QA和视频数据的质量和分布,从而提高培训效率。其次,渐进式训练框架始于强大的预训练的视觉编码器(SAIL-VIT),通过多模式预训练的进步,并在系统地增强模型能力的思维融合SFT-RL混合范式中达到顶点。第三,建筑的进步超出了密集的LLM,到有效的稀疏专家混合物(MOE)设计。有了这些贡献,SAIL-VL2在106个数据集中表现出竞争性能,并在挑战MMMU和Mathvista等挑战推理基准方面取得了最先进的结果。此外,在OpenCompass排行榜上,Sail-VL2-2B在4B参数量表下正式发布开源模型中排名第一,同时是开源多模式社区的有效且可扩展的基础。


Hunyuan3D Studio:面向游戏就绪3D资产生成的端到端AI管线

  • 标题: Hunyuan3D Studio: End-to-End AI Pipeline for Game-Ready 3D Asset Generation
  • 作者: Biwen Lei, Yang Li, Xinhai Liu, Shuhui Yang, Lixin Xu, Jingwei Huang, Ruining Tang, Haohan Weng, Jian Liu, Jing Xu, Zhen Zhou, Yiling Zhu, Jiankai Xing, Jiachen Xu, Changfeng Ma, Xinhao Yan, Yunhan Yang, Chunshi Wang, Duoteng Xu, Xueqi Ma, Yuguang Chen, Jing Li, Mingxin Yang, Sheng Zhang, Yifei Feng, Xin Huang, Di Luo, Zebin He, Puhua Jiang, Changrong Hu, Zihan Qin, Shiwei Miao, Haolin Liu, Yunfei Zhao, Zeqiang Lai, Qingxiang Lin, Zibo Zhao, Kunhong Li, Xianghui Yang, Huiwen Shi, Xin Yang, Yuxuan Wang, Zebin Yao, Yihang Lian, Sicong Liu, Xintong Han, Wangchen Qin, Caisheng Ouyang, Jianyin Liu, Tianwen Yuan, Shuai Jiang, Hong Duan, Yanqi Niu, Wencong Lin, Yifu Sun, Shirui Huang, Lin Niu, Gu Gong, Guojian Xiao, Bojian Zheng, Xiang Yuan, Qi Chen, Jie Xiao, Dongyang Zheng, Xiaofeng Yang, Kai Liu, Jianchen Zhu, Lifu Wang, Qinglin Lu, Jie Liu, Liang Dong, Fan Jiang, Ruibin Chen, Lei Wang, Chao Zhang, Jiaxin Lin, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Yinhe Wu, Jiayao Du, Jupeng Chen, Xinyue Mao, Dongyuan Guo, Yixuan Tang, Yulin Tsai, Yonghao Tan, Jiaao Yu, Junlin Yu, Keren Zhang, Yifan Li, Peng Chen, Tian Liu, Di Wang, Yuhong Liu, Linus, Jie Jiang, Zhuo Chen, Chunchao Guo
  • 日期: 2025-09-16
  • ArXiv主页 : https://arxiv.org/abs/2509.12815
  • 论文链接 : https://arxiv.org/pdf/2509.12815
  • 项目链接 : https://3d.hunyuan.tencent.com/

英文摘要

The creation of high-quality 3D assets, a cornerstone of modern game development, has long been characterized by labor-intensive and specialized workflows. This paper presents Hunyuan3D Studio, an end-to-end AI-powered content creation platform designed to revolutionize the game production pipeline by automating and streamlining the generation of game-ready 3D assets. At its core, Hunyuan3D Studio integrates a suite of advanced neural modules (such as Part-level 3D Generation, Polygon Generation, Semantic UV, etc.) into a cohesive and user-friendly system. This unified framework allows for the rapid transformation of a single concept image or textual description into a fully-realized, production-quality 3D model complete with optimized geometry and high-fidelity PBR textures. We demonstrate that assets generated by Hunyuan3D Studio are not only visually compelling but also adhere to the stringent technical requirements of contemporary game engines, significantly reducing iteration time and lowering the barrier to entry for 3D content creation. By providing a seamless bridge from creative intent to technical asset, Hunyuan3D Studio represents a significant leap forward for AI-assisted workflows in game development and interactive media.

中文摘要

高质量的3D资产的创建是现代游戏开发的基石,长期以来一直以劳动密集型和专业的工作流程为特征。本文介绍了Hunyuan3D Studio,这是一个端到端AI驱动的内容创建平台,旨在通过自动化和简化游戏准备就绪的3D资产来彻底改变游戏生产管道。Hunyuan3D Studio的核心将一组高级神经模块(例如零件级3D代,多边形生成,语义紫外线等)集成到一个凝聚力和用户友好的系统中。这个统一的框架允许将单个概念图像或文本描述快速转换为完全实现的,生产质量的3D模型,并具有优化的几何形状和高保真PBR纹理。我们证明,Hunyuan3D Studio生成的资产不仅在视觉上引人注目,而且还遵守当代游戏引擎的严格技术要求,从而大大缩短了迭代时间,并降低了创建3D内容的进入障碍。通过提供从创意意图到技术资产的无缝桥梁,Hunyuan3D Studio代表了游戏开发和互动媒体中AI辅助工作流程的重大飞跃。


回报递减的错觉:测量LLM中的长程任务执行

英文摘要

Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations -- curiously, we observe a self-conditioning effect -- models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.

中文摘要

继续扩展大型语言模型(LLM)是否会带来收益递减?现实世界的价值往往取决于智能体能够完成的任务长度。我们从一个简单却反直觉的事实出发:单步准确率上的边际提升,可以复合成模型能成功完成的任务长度上的指数级改进。随后我们指出,当简单任务被拉长时,LLM的失败源于执行中的错误,而非推理能力不足。我们提出通过显式提供解决长程任务所需的知识与计划来隔离执行能力。我们发现,即使小模型的单轮准确率达到100%,更大的模型仍能正确执行显著更多的轮次。我们观察到,随着步骤数增加,模型的每步准确率会下降。这不仅仅源于长上下文限制;有趣的是,我们观察到一种自我条件化效应:当上下文中包含模型先前轮次的错误时,它更容易继续犯错。仅靠扩大模型规模并不能消除这种自我条件化。相比之下,近期的思考型模型不会自我条件化,并且能够在单轮中执行长得多的任务。最后,我们按照前沿思考型模型在单轮中能执行的任务长度对其进行了基准测试。总体而言,通过聚焦执行能力,我们希望调和关于"LLM为何能解决复杂推理问题,却在被拉长的简单任务上失败"的争论,并强调扩大模型规模和顺序式测试时计算对长程任务的巨大收益。
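摘要中"单步准确率的边际提升会复合成任务长度的指数级改进"可以用一个简单的计算来体会。以下示意假设各步成功相互独立(真实模型并非如此,摘要本身也指出了自我条件化效应),仅用于建立数量级直觉:

```python
import math

# 假设每一步成功与否相互独立,单步准确率为 p,
# 则连续完成 n 步的概率为 p**n;在成功率不低于 s 的前提下,
# 可完成的最大步数约为 n = ln(s) / ln(p)。
def horizon(p: float, s: float = 0.5) -> float:
    return math.log(s) / math.log(p)

for p in (0.90, 0.99, 0.999):
    print(f"单步准确率 {p:.3f} -> 50% 成功率下约可执行 {horizon(p):.0f} 步")
# 0.90 -> 约 7 步,0.99 -> 约 69 步,0.999 -> 约 693 步:
# 单步准确率的微小提升带来可执行任务长度的数量级增长
```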


单流策略优化

英文摘要

We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3 8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.

中文摘要

我们从单流视角重新审视大型语言模型(LLM)的策略梯度优化。GRPO等主流的基于分组的方法利用即时计算的基线来降低方差,但存在关键缺陷:频繁出现的退化分组会抹去学习信号,同步屏障则妨碍可扩展性。我们提出单流策略优化(SPO),从设计上消除这些问题。SPO用一个持久的、KL自适应的价值跟踪器取代每组基线,并在整个批次上对优势进行全局归一化,为每个样本提供稳定、低方差的学习信号。由于不依赖分组,SPO在生成时长差异较大的长程或工具集成场景中能实现更高的吞吐并有效扩展。此外,持久价值跟踪器可通过优先采样自然地实现自适应课程。使用Qwen3-8B的实验表明,SPO比GRPO收敛更平稳、精度更高,同时消除了在退化分组上浪费的计算。消融研究证实,SPO的收益源于其在基线估计与优势归一化上的原则性做法,为LLM推理提供了更稳健、更高效的路径。在Qwen3 8B的五个困难数学基准上,SPO将平均maj@32相对GRPO提升了+3.4个百分点(pp),在高难度数据集上取得了可观的绝对提升,包括BRUMO 25上+7.3 pp、AIME 25上+4.4 pp、HMMT 25上+3.3 pp,并在所评估的各个k值上取得了一致的pass@k相对提升。SPO的成功挑战了向RL算法添加附带复杂性的流行趋势,指出了一条由基本原理而非架构上的权宜之计驱动LLM推理下一波进展的道路。
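下面是按摘要描述拼出的简化示意("KL 自适应"在此退化为一个外部 drift 系数、基线用指数滑动平均维护,均为笔者假设,并非 SPO 的精确算法),展示"持久价值跟踪器 + 批次级全局优势归一化"的基本形态:

```python
import torch

# 假设:跟踪器为每个提示(prompt)维护一个持久的价值估计,用指数滑动平均更新;
# drift 系数用来粗略模拟"策略漂移越大,更新越快"的 KL 自适应思想。
class PersistentValueTracker:
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.values = {}                           # prompt_id -> 基线估计

    def update(self, prompt_id: str, reward: float, drift: float = 1.0) -> float:
        v = self.values.get(prompt_id, reward)     # 首次见到该提示时用奖励初始化
        v = v + self.alpha * drift * (reward - v)  # 漂移越大,更新步长越大
        self.values[prompt_id] = v
        return v

def global_advantages(rewards: torch.Tensor, baselines: torch.Tensor) -> torch.Tensor:
    """优势在整个批次上做全局归一化,而不是在分组内归一化。"""
    adv = rewards - baselines
    return (adv - adv.mean()) / (adv.std() + 1e-6)

tracker = PersistentValueTracker()
prompts = ["p1", "p2", "p1", "p3"]
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
baselines = torch.tensor([tracker.update(p, float(r)) for p, r in zip(prompts, rewards)])
print(global_advantages(rewards, baselines))
```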


无标签演化语言模型:多数驱动选择,新颖性促进变异

英文摘要

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods, confidence minimization, self-consistency, or majority-vote objectives, stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.

中文摘要

大型语言模型(LLM)越来越多地通过可验证奖励的强化学习(RLVR)进行训练,但现实部署需要能够在没有标签或外部评判者的情况下自我改进的模型。现有的无标签方法(置信度最小化、自洽性或多数投票目标)虽然能稳定学习,却会持续压缩探索,导致熵坍缩:生成变得更短、更缺乏多样性且更脆弱。与测试时强化学习(TTRL)等主要让模型适应手头未标注数据集的已有方法不同,我们的目标更为宽泛:在不牺牲模型固有探索能力和泛化能力的前提下实现普遍改进,即"演化"。我们将这一问题形式化,并提出面向演化的无标签强化学习(EVOL-RL),一条在无标签设定下将稳定性与变异耦合起来的简单规则。EVOL-RL将多数投票得到的答案保留为稳定锚点(选择),同时加入一个新颖性感知奖励,偏好其推理与已有生成不同的回答(变异),新颖性在语义空间中度量。EVOL-RL基于GRPO实现,还使用非对称裁剪以保留强信号,并用熵正则项维持搜索。这种"多数用于选择 + 新颖性用于变异"的设计可防止坍缩,保持更长且信息更丰富的思维链,并同时改进pass@1和pass@n。EVOL-RL持续优于仅依赖多数投票的TTRL基线;例如,在无标签AIME24上训练,可将Qwen3-4B-Base在AIME25上的pass@1从TTRL的4.6%提升到16.4%,pass@16从18.5%提升到37.9%。EVOL-RL不仅防止了多样性坍缩,还解锁了更强的跨领域泛化能力(如GPQA)。此外,我们证明EVOL-RL在RLVR设定下同样能提升性能,显示了其广泛的适用性。
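下面用一个小示意说明"多数投票作稳定锚点 + 语义空间中的新颖性奖励"如何组合(embeddings 假定由任意句向量模型给出,奖励的加权方式为笔者假设,并非论文原始公式):

```python
import torch
from collections import Counter

# 假设:answers 为同一问题下多条回答的最终答案,embeddings 为对应推理文本的语义向量
# (形状 [G, d],这里直接作为输入)。
def evol_rl_rewards(answers, embeddings: torch.Tensor, novelty_weight: float = 0.5):
    majority, _ = Counter(answers).most_common(1)[0]        # 多数投票答案作为稳定锚点(选择)
    sim = torch.nn.functional.cosine_similarity(
        embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)   # [G, G] 两两相似度
    G = len(answers)
    novelty = 1.0 - (sim.sum(dim=1) - 1.0) / (G - 1)        # 与其他回答越不相似,新颖性越高(变异)
    anchor = torch.tensor([1.0 if a == majority else 0.0 for a in answers])
    return anchor + novelty_weight * novelty                 # 选择 + 变异的组合奖励

answers = ["42", "42", "17", "42"]
embeddings = torch.randn(4, 8)
print(evol_rl_rewards(answers, embeddings))
```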


InternScenes:具有真实布局的大规模可模拟室内场景数据集

英文摘要

The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce InternScenes, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources, real-world scans, procedurally generated scenes, and designer-created scenes, including 1.96M 3D objects and covering 15 common scene types and 288 object classes. We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.

中文摘要

具身AI的进步在很大程度上依赖于以场景多样性和真实布局为特征的大规模、可模拟的3D场景数据集。然而,现有数据集通常存在数据规模或多样性不足、缺少小物件的"净化"布局以及严重的物体碰撞等局限。为解决这些不足,我们提出InternScenes,一个新的大规模可模拟室内场景数据集,通过整合真实世界扫描、程序化生成场景和设计师创作场景三种不同来源构建,包含约40,000个多样化场景、196万个3D物体,覆盖15种常见场景类型和288个物体类别。我们特别保留了场景中的大量小物件,使布局真实且复杂,每个区域平均包含41.5个物体。我们完整的数据处理管线通过为真实扫描创建real-to-sim副本来保证可模拟性,通过引入可交互物体来增强交互性,并通过物理仿真解决物体碰撞。我们用两个基准应用展示了InternScenes的价值:场景布局生成与点目标导航。二者都显示出复杂真实布局带来的新挑战。更重要的是,InternScenes为这两项任务的模型训练规模化铺平了道路,使在如此复杂场景中的生成与导航成为可能。我们承诺开源数据、模型和基准,以惠及整个社区。


InfGen:面向可扩展图像合成的分辨率无关范式

英文摘要

Arbitrary resolution image generation provides a consistent visual experience across devices, having extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays over 100 seconds. To solve this, we explore the second generation upon the latent diffusion models, where the fixed latent generated by diffusion models is regarded as the content representation and we propose to decode arbitrary resolution images with a compact generated latent using a one-step generator. Thus, we present the InfGen, replacing the VAE decoder with the new generator, for generating images at any resolution from a fixed-size latent without retraining the diffusion models, which simplifies the process, reducing computational complexity and can be applied to any model using the same latent space. Experiments show InfGen is capable of improving many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.

中文摘要

任意分辨率图像生成可在不同设备上提供一致的视觉体验,对内容生产者和消费者都有广泛应用。当前扩散模型的计算需求随分辨率呈平方级增长,导致4K图像生成延迟超过100秒。为解决这一问题,我们在潜空间扩散模型之上探索"第二阶段生成":将扩散模型生成的固定尺寸潜变量视为内容表示,并提出用一个一步生成器从这一紧凑潜变量解码出任意分辨率的图像。由此我们提出InfGen,用新的生成器替换VAE解码器,无需重新训练扩散模型即可从固定尺寸潜变量生成任意分辨率的图像,从而简化流程、降低计算复杂度,并可应用于任何使用相同潜空间的模型。实验表明,InfGen能够将许多模型带入任意高分辨率时代,同时将4K图像生成时间降至10秒以内。


FinSearchComp:迈向真实、专家级的金融搜索与推理评估

  • 标题: FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
  • 作者: Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang
  • 日期: 2025-09-16
  • ArXiv主页 : https://arxiv.org/abs/2509.13160
  • 论文链接 : https://arxiv.org/pdf/2509.13160
  • 项目链接 : https://randomtutu.github.io/FinSearchComp/

英文摘要

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate data searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation -- closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and the country origin of models and tools impact performance significantly.By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

中文摘要

搜索已成为基于LLM的智能体的核心基础设施,并被广泛认为是通往更通用智能之路上的关键环节。金融是一个要求极高的试验场:分析师通常需要在时效性强、领域特定的数据上进行复杂的多步搜索,这使其非常适合同时考察搜索能力与基于知识的推理能力。然而,目前没有公开的金融数据集评估端到端智能体的数据搜索能力,这主要是因为构建真实且复杂的任务需要深厚的金融专业知识,而时效性数据又难以评估。我们提出FinSearchComp,首个面向真实、开放域金融搜索与推理的完全开源智能体基准。FinSearchComp包含三类任务(时效性数据获取、简单历史查询和复杂历史调查),紧密复现了真实金融分析师的工作流程。为确保难度与可靠性,我们邀请70位专业金融专家进行标注,并实施了严格的多阶段质量保证流程。该基准包含635个覆盖全球与大中华市场的问题,我们在其上评估了21个模型(产品)。Grok 4(web)在全球子集上居首,接近专家级准确率;DouBao(web)则在大中华子集上领先。实验分析表明,为智能体配备网页搜索和金融插件可显著改善其在FinSearchComp上的结果,而模型与工具的来源地也会显著影响性能。通过对齐真实分析师任务并提供端到端评估,FinSearchComp为复杂金融搜索与推理提供了一个专业的高难度测试平台。


先理解再生成:面向自回归图像生成的自引导训练

英文摘要

Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.

中文摘要

近期研究表明了高质量视觉表征在图像生成中的重要性,同时也揭示了生成模型在图像理解方面的局限。作为最初为自然语言设计的生成范式,自回归模型也面临类似挑战。在这项工作中,我们首次系统地研究了将"下一个token预测"范式应用于视觉领域的机制。我们识别出阻碍高层视觉语义学习的三个关键性质:局部且条件化的依赖、步骤间语义不一致以及空间不变性不足。我们证明,通过在训练中引入自监督目标可以有效解决这些问题,由此形成一个新的训练框架:自回归模型的自引导训练(ST-AR)。在不依赖预训练表征模型的情况下,ST-AR显著增强了自回归模型的图像理解能力,并带来生成质量的提升。具体而言,在保持相同采样策略的前提下,ST-AR为LlamaGen-L带来约42%的FID改进,为LlamaGen-XL带来约49%的FID改进。


PANORAMA:具身AI时代全向视觉的兴起

  • 标题: PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
  • 作者: Xu Zheng, Chenfei Liao, Ziqiao Weng, Kaiyu Lei, Zihao Dongfang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Lu Qi, Li Chen, Danda Pani Paudel, Kailun Yang, Linfeng Zhang, Luc Van Gool, Xuming Hu
  • 日期: 2025-09-16
  • ArXiv主页 : https://arxiv.org/abs/2509.12989
  • 论文链接 : https://arxiv.org/pdf/2509.12989

英文摘要

Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.

中文摘要

使用360度愿景来了解环境的全向视觉已变得越来越关键,例如机器人技术,工业检查和环境监测。与传统的针孔视觉相比,全向视觉提供了整体的环境意识,可显着提高场景感知的完整性和决策的可靠性。但是,这一领域的基础研究历来落后于传统的针孔视觉。本演讲呈现出体现的AI时代的新兴趋势:在工业需求和学术兴趣不断增长的推动下,全向视觉的快速发展。我们重点介绍了全向产生,全向感知,全向理解和相关数据集的最新突破。利用学术界和行业的见解,我们提出了体现的AI时代Panorama的理想全景体系结构,该系统由四个关键子系统组成。此外,我们提供了与新兴趋势和跨社区影响有关的深入意见,以及全景视野与体现AI的交集以及未来的路线图和开放挑战。这一概述综合了最新的进步,并概述了在体现的AI时代建立强大的通用全向AI系统方面未来研究的挑战和机会。


虚拟代理经济体

英文摘要

The rapid adoption of autonomous AI agents is giving rise to a new economic layer where agents transact and coordinate at scales and speeds beyond direct human oversight. We propose the "sandbox economy" as a framework for analyzing this emergent system, characterizing it along two key dimensions: its origins (emergent vs. intentional) and its degree of separateness from the established human economy (permeable vs. impermeable). Our current trajectory points toward a spontaneous emergence of a vast and highly permeable AI agent economy, presenting us with opportunities for an unprecedented degree of coordination as well as significant challenges, including systemic economic risk and exacerbated inequality. Here we discuss a number of possible design choices that may lead to safely steerable AI agent markets. In particular, we consider auction mechanisms for fair resource allocation and preference resolution, the design of AI "mission economies" to coordinate around achieving collective goals, and socio-technical infrastructure needed to ensure trust, safety, and accountability. By doing this, we argue for the proactive design of steerable agent markets to ensure the coming technological shift aligns with humanity's long-term collective flourishing.

中文摘要

自主AI智能体的迅速普及正在催生一个新的经济层,在这一层中,智能体以超出人类直接监督的规模和速度进行交易与协调。我们提出"沙盒经济"作为分析这一新兴系统的框架,并沿两个关键维度对其进行刻画:其起源(自发涌现还是有意设计),以及它与既有人类经济的隔离程度(可渗透还是不可渗透)。我们当前的轨迹指向一个庞大且高度可渗透的AI智能体经济的自发涌现,这既带来了前所未有的协调机遇,也带来了系统性经济风险和不平等加剧等重大挑战。我们在此讨论了若干可能带来可安全引导的AI智能体市场的设计选择,特别是用于公平资源分配与偏好消解的拍卖机制、围绕集体目标协调的AI"使命经济"设计,以及保障信任、安全与问责所需的社会技术基础设施。借此,我们主张主动设计可引导的智能体市场,以确保即将到来的技术变革与人类长期的集体繁荣相一致。


LongEmotion:在长上下文交互中测量大语言模型的情绪智力

英文摘要

Large language models (LLMs) make significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move towards such realistic settings, we present LongEmotion, a benchmark specifically designed for long-context EI tasks. It covers a diverse set of tasks, including Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, with long-form generation required for Emotion Expression. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method leverages both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages, integrating both retrieval augmentation and limited knowledge injection. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical and real-world EI applications. Furthermore, we conducted a comparative case study experiment on the GPT series to demonstrate the differences among various models in terms of EI. Code is available on GitHub at https://github.com/LongEmotion/LongEmotion, and the project page can be found at https://longemotion.github.io/.

中文摘要

大型语言模型(LLM)在情绪智力(EI)和长上下文理解方面取得了显著进展。然而,现有基准往往忽视了长上下文场景中EI的某些方面,尤其是在交互冗长、多样且常常含噪的真实实用场景下。为走向这类真实设定,我们提出LongEmotion,一个专为长上下文EI任务设计的基准。它涵盖多种任务,包括情绪分类、情绪检测、情绪问答、情绪对话、情绪摘要和情绪表达。这些任务的输入长度平均达到8,777个token,其中情绪表达任务需要长文本生成。为在现实约束下提升性能,我们引入检索增强生成(RAG)和协作情绪建模(CoEM),并将其与标准的基于提示的方法进行比较。与常规做法不同,我们的RAG方法同时利用对话上下文和大语言模型自身作为检索来源,避免依赖外部知识库。CoEM方法进一步将任务分解为五个阶段,结合检索增强与有限的知识注入来提升性能。实验结果表明,RAG和CoEM在大多数长上下文任务上都能持续增强与EI相关的性能,推动LLM走向更实用的现实EI应用。此外,我们在GPT系列上开展了对比案例研究,以展示不同模型在EI方面的差异。代码见GitHub:https://github.com/LongEmotion/LongEmotion,项目页面见 https://longemotion.github.io/。


迷失于嵌入:视觉语言模型中的信息损失

英文摘要

Vision--language models (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the language model's embedding space via a connector component. While crucial for modality fusion, the potential information loss induced by this projection step and its direct impact on model capabilities remain understudied. We introduce two complementary approaches to examine and quantify this loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest neighbor relationships between image representations, before and after projection. Second, we directly measure information loss by reconstructing visual embeddings from the projected representation, localizing loss at an image patch level. Experiments reveal that connectors substantially distort the local geometry of visual representations, with k-nearest neighbors diverging by 40--60% post-projection, correlating with degradation in retrieval performance. The patch-level embedding reconstruction provides interpretable insights for model behavior on visually grounded question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.

中文摘要

视觉语言模型(VLM)通常先用预训练的视觉编码器处理视觉输入,再通过一个连接器组件将其投影到语言模型的嵌入空间。尽管这一投影步骤对模态融合至关重要,但它可能引起的信息损失及其对模型能力的直接影响仍缺乏研究。我们提出两种互补方法,通过分析潜在表示空间来检查并量化这一损失。其一,通过分析投影前后图像表示之间k近邻关系的变化来评估语义信息的保留程度。其二,通过从投影后的表示重建视觉嵌入来直接度量信息损失,并将损失定位到图像patch级别。实验表明,连接器会显著扭曲视觉表示的局部几何结构:投影后k近邻的差异达40-60%,并与检索性能的下降相关。patch级的嵌入重建为模型在视觉落地问答任务上的行为提供了可解释的洞察,发现信息损失较高的区域能够可靠地预测模型出错的实例。
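摘要中的第一种度量(投影前后 k 近邻关系的变化)可以用如下小脚本示意(连接器这里用一个随机初始化的线性层模拟,k 近邻保留率的具体定义是笔者按摘要意思给出的假设):

```python
import torch

# 假设:pre 与 post 分别是投影前后的 N 个图像表示(形状 [N, d1] / [N, d2])。
def knn_indices(x: torch.Tensor, k: int) -> torch.Tensor:
    d = torch.cdist(x, x)                               # 两两欧氏距离 [N, N]
    d.fill_diagonal_(float("inf"))                       # 排除自身
    return d.topk(k, largest=False).indices              # 每个样本的 k 个最近邻下标

def knn_overlap(pre: torch.Tensor, post: torch.Tensor, k: int = 10) -> float:
    """投影前后 k 近邻集合的平均交集比例,数值越低说明局部几何被扭曲得越厉害。"""
    a, b = knn_indices(pre, k), knn_indices(post, k)
    keep = [len(set(a[i].tolist()) & set(b[i].tolist())) / k for i in range(pre.shape[0])]
    return sum(keep) / len(keep)

pre = torch.randn(200, 1024)                              # 视觉编码器输出(模拟)
connector = torch.nn.Linear(1024, 4096)                   # 连接器(模拟)
post = connector(pre).detach()
print(f"k近邻保留率: {knn_overlap(pre, post):.2f}")        # 摘要报告投影后近邻变化约 40-60%
```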


X-Part:高保真且结构连贯的形状分解

  • 标题: X-Part: high fidelity and structure coherent shape decomposition
  • 作者: Xinhao Yan, Jiachen Xu, Yang Li, Changfeng Ma, Yunhan Yang, Chunshi Wang, Zibo Zhao, Zeqiang Lai, Yunfei Zhao, Zhuo Chen, Chunchao Guo
  • 日期: 2025-09-10
  • ArXiv主页 : https://arxiv.org/abs/2509.08643
  • 论文链接 : https://arxiv.org/pdf/2509.08643

英文摘要

Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and suffer from poor semantically meaningful decomposition. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part exploits the bounding box as prompts for the part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Codes will be released for public research.

中文摘要

在部件级别生成3D形状对于网格重拓扑、UV展开和3D打印等下游应用至关重要。然而,现有的基于部件的生成方法通常缺乏足够的可控性,且语义上有意义的分解效果不佳。为此,我们提出X-Part,一个可控生成模型,旨在将整体3D物体分解为语义上有意义、结构上连贯且具有高几何保真度的部件。X-Part利用包围盒作为部件生成的提示,并注入逐点语义特征以实现有意义的分解。此外,我们设计了一条可编辑的管线用于交互式部件生成。大量实验结果表明,X-Part在部件级形状生成上取得了最先进的性能。这项工作为创建可直接投入生产、可编辑且结构合理的3D资产确立了新范式。代码将公开发布以供研究使用。


WorldForge:通过免训练引导解锁视频扩散模型中涌现的3D/4D生成

英文摘要

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.

中文摘要

近期的视频扩散模型因其丰富的潜在世界先验,在空间智能任务中展现出强大潜力。然而,这一潜力受制于其有限的可控性和几何不一致性,使其强大的先验与在3D/4D任务中的实际应用之间存在差距。因此,现有方法往往依赖重新训练或微调,这既有损害预训练知识的风险,也带来高昂的计算成本。为此,我们提出WorldForge,一个无需训练、在推理时生效的框架,由三个紧密耦合的模块组成。步内递归精炼在推理过程中引入递归优化机制,在每个去噪步骤内反复优化网络预测,以实现精确的轨迹注入。流门控潜变量融合利用光流相似度在潜空间中将运动与外观解耦,并选择性地将轨迹引导注入与运动相关的通道。双路径自校正引导比较有引导与无引导的去噪路径,自适应地纠正由噪声或错位结构信号引起的轨迹漂移。这些组件共同在无需训练的情况下注入细粒度、与轨迹对齐的引导,同时实现精确的运动控制和照片级真实感的内容生成。在多样基准上的大量实验验证了我们的方法在真实感、轨迹一致性和视觉保真度方面的优势。这项工作为可控视频合成引入了一种新颖的即插即用范式,为利用生成先验实现空间智能提供了新视角。
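下面是对"步内递归精炼"这一模块的示意性写法(denoiser、trajectory_loss、内层迭代次数等均为假设的占位,且对扩散采样做了大幅简化,并非 WorldForge 的实际实现):

```python
import torch

# 假设:denoiser(x, t) 返回该去噪步的网络预测,trajectory_loss 度量预测与期望
# 相机/物体轨迹的偏差;整个过程只修改中间预测,不更新任何网络权重。
@torch.no_grad()
def sample_with_intra_step_refinement(denoiser, trajectory_loss, x: torch.Tensor,
                                      timesteps, inner_iters: int = 3, lr: float = 0.1):
    for t in timesteps:
        pred = denoiser(x, t)
        # 步内递归精炼:在同一去噪步内反复修正预测,使其贴合目标轨迹
        for _ in range(inner_iters):
            with torch.enable_grad():
                p = pred.detach().requires_grad_(True)
                loss = trajectory_loss(p)
                grad, = torch.autograd.grad(loss, p)
            pred = pred - lr * grad                      # 轨迹注入
        x = pred                                          # 简化:把精炼后的预测作为下一步输入
    return x

# 玩具用法:把"轨迹损失"定义为预测均值偏离 0 的程度
x0 = torch.randn(1, 4, 8, 8)
out = sample_with_intra_step_refinement(
    denoiser=lambda x, t: x * 0.9,
    trajectory_loss=lambda p: (p.mean() ** 2),
    x=x0, timesteps=range(5))
print(out.shape)
```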


IntrEx:用于建模教育对话参与度的数据集

英文摘要

Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, still little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.

中文摘要

参与和动机对于第二语言的获取至关重要,但是保持学习者对教育对话的兴趣仍然是一个挑战。虽然先前的研究探讨了使教育文本变得有趣的是什么,但对推动对话参与的语言特征知之甚少。为了解决这一差距,我们介绍了Intrex,这是第一个注释的大型数据集,以引起教师互动中的兴趣和预期兴趣。Intrex建立在教师聊天室语料库(TSCC)的基础上,通过合并序列级别的注释,扩展了先前的工作,从而使互动的研究超越了孤立的转弯,以捕捉对扩展对话的兴趣如何发展。我们采用严格的注释过程,使用100多种第二语言学习者,采用一种基于比较的评分方法,该方法受到对人类反馈(RLHF)学习的启发,以提高共识。我们研究大型语言模型(LLMS)是否可以预测人类的兴趣判断。我们发现,llms(7b/8b参数)对趣味性评级进行了微调,其表现优于较大的专有模型,例如GPT-4O,这表明了专业数据集以建模教育环境中的参与度的潜力。最后,我们分析了语言和认知因素,例如具体性,可理解性(可读性)和采用,会影响参与教育对话。


HANRAG:面向多跳问答的启发式精准抗噪检索增强生成

  • 标题: HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering
  • 作者: Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu
  • 日期: 2025-09-08
  • ArXiv主页 : https://arxiv.org/abs/2509.09713
  • 论文链接 : https://arxiv.org/pdf/2509.09713

英文摘要

The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system's adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.

中文摘要

检索增强生成(RAG)方法通过将信息检索(IR)技术与大语言模型(LLM)相结合,增强了问答系统和对话生成任务。这一从外部知识库检索信息以增强生成模型响应能力的策略已取得一定成功。然而,当前的RAG方法在处理多跳查询时仍面临诸多挑战。例如,一些方法过度依赖迭代检索,在复合查询上浪费过多的检索步骤;此外,直接用原始复杂查询进行检索,可能无法捕获与特定子查询相关的内容,导致检索结果含噪。如果噪声得不到控制,还会引发噪声累积问题。为解决这些问题,我们提出HANRAG,一个新颖的基于启发式的框架,旨在高效应对不同复杂度的问题。在一个强大的"揭示器"(revelator)的驱动下,HANRAG对查询进行路由、将其分解为子查询,并从检索到的文档中过滤噪声,从而增强系统的适应性与抗噪能力,使其能够出色地处理多样化查询。我们在多个基准上将所提框架与其他领先的业界方法进行比较,结果表明我们的框架在单跳和多跳问答任务上均取得了更优的性能。
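按摘要描述的"路由、分解、抗噪过滤、生成"流程,可以写成如下示意(llm、retrieve 为假设的占位接口,提示词也是笔者自拟,并非 HANRAG 论文的原始提示或精确流程):

```python
from typing import Callable, List

# 假设:llm 是"提示进、文本出"的调用接口,retrieve 给定子查询返回若干文档。
def hanrag_style_answer(question: str,
                        llm: Callable[[str], str],
                        retrieve: Callable[[str], List[str]],
                        top_k: int = 5) -> str:
    # 1) 路由:判断是简单查询还是需要分解的复合/多跳查询
    if llm(f"该问题是否需要分解为多个子问题?只答 是/否:{question}").strip() == "否":
        sub_queries = [question]
    else:
        # 2) 分解:把复合查询拆成可独立检索的子查询
        sub_queries = llm(f"把下面的问题拆成若干子问题,每行一个:{question}").splitlines()
    evidence: List[str] = []
    for sq in filter(None, (s.strip() for s in sub_queries)):
        docs = retrieve(sq)[:top_k]
        # 3) 噪声过滤:只保留与当前子查询真正相关的文档,避免噪声累积
        for d in docs:
            if llm(f"下面的文档能否回答子问题「{sq}」?只答 是/否:{d}").strip() == "是":
                evidence.append(d)
    # 4) 基于过滤后的证据生成最终答案
    return llm("根据以下证据回答问题:" + question + "\n" + "\n".join(evidence))
```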


AToken:面向视觉的统一令牌器

英文摘要

We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images, videos, and 3D, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 32.6% MSRVTT retrieval for videos, and 28.19 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on the next-generation multimodal AI systems built upon unified visual tokenization.

中文摘要

我们提出AToken,首个在图像、视频和3D资产上同时实现高保真重建与语义理解的统一视觉令牌器。与仅针对单一模态、专注于重建或理解之一的现有令牌器不同,AToken将这些多样的视觉输入编码到一个共享的4D潜空间中,在单一框架内统一了任务与模态。具体而言,我们引入了带有4D旋转位置嵌入的纯Transformer架构,以处理任意分辨率和时长的视觉输入。为保证训练稳定,我们提出一个不依赖对抗训练的目标,结合感知损失与Gram矩阵损失,达到最先进的重建质量。借助渐进式训练课程,AToken从单张图像逐步扩展到视频和3D,并同时支持连续与离散潜在token。AToken在图像上取得0.21 rFID和82.2%的ImageNet准确率,在视频上取得3.01 rFVD和32.6%的MSRVTT检索,在3D上取得28.19 PSNR和90.9%的分类准确率。在下游应用中,AToken既支持视觉生成任务(如基于连续与离散token的图像生成、文本到视频生成、图像到3D合成),也支持理解任务(如多模态LLM),在所有基准上均取得有竞争力的性能。这些结果为建立在统一视觉令牌化之上的下一代多模态AI系统带来了启示。
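摘要提到的"感知损失 + Gram 矩阵损失"的无对抗重建目标,可按风格迁移中常见的 Gram 矩阵写法示意如下(特征来源、损失权重均为笔者假设,并非论文的精确实现):

```python
import torch

# 假设:feats_real / feats_fake 是某个冻结特征网络在原图与重建图上的特征图,
# 形状均为 [B, C, H, W]。
def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    b, c, h, w = feats.shape
    f = feats.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)       # [B, C, C]

def reconstruction_loss(feats_real: torch.Tensor, feats_fake: torch.Tensor,
                        lambda_gram: float = 1.0) -> torch.Tensor:
    perceptual = torch.nn.functional.mse_loss(feats_fake, feats_real)          # 感知损失
    gram = torch.nn.functional.mse_loss(gram_matrix(feats_fake),
                                        gram_matrix(feats_real))               # Gram 矩阵损失
    return perceptual + lambda_gram * gram

feats_real = torch.randn(2, 64, 16, 16)
feats_fake = feats_real + 0.1 * torch.randn_like(feats_real)
print(float(reconstruction_loss(feats_real, feats_fake)))
```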

