中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- GLM-4.5:代理、推理与编码(ARC)基础模型
- We-Math 2.0:一个用于激励视觉数学推理的多功能MathBook系统
- NextStep-1:迈向大规模连续词元自回归图像生成
- WebWatcher:开拓视觉-语言深度研究代理的新前沿
- ReasonRank:以强大的推理能力赋能段落排序
- WideSearch:代理式广域信息检索的基准测试
- 自我演进AI代理的全面综述:连接基础模型与终身代理系统的新范式
- Matrix-3D:全向可探索的3D世界生成
- Story2Board:一种无需训练的富表现力分镜生成方法
- PRELUDE:一个需要对长上下文进行全局理解与推理的基准
- Omni-Effects:统一且空间可控的视觉特效生成
- Voost:用于双向虚拟试穿与脱衣的统一可扩展扩散Transformer
- 看、听、记忆与推理:具有长期记忆的多模态代理
- ToonComposer:以生成式后关键帧技术简化卡通制作
- 超越十轮:通过大规模异步RL解锁长程代理搜索
- 第一部分:技巧还是陷阱?深入剖析用于LLM推理的RL
- SONAR-LLM:在句子嵌入中思考、以词元说话的自回归Transformer
- MolmoAct:能够进行空间推理的动作推理模型
- UI-Venus技术报告:利用RFT构建高性能UI代理
- Mol-R1:迈向分子发现中的显式长链思维推理
- Klear-Reasoner:通过保留梯度的裁剪策略优化提升推理能力
- 复杂逻辑指令生成
- Stand-In:面向视频生成的轻量级即插即用身份控制
- CharacterShot:可控且一致的4D角色动画
- BrowseComp-Plus:面向深度研究代理的更公平、更透明的评估基准
- 时间是一种特征:利用扩散语言模型中的时间动态
- VertexRegen:具有连续细节层次的网格生成
- 扩散语言模型综述
- Memp:探索代理的程序性记忆
GLM-4.5:代理、推理与编码(ARC)基础模型
- 标题: GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
- 作者: GLM-4.5 Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai, Pengfan Du, Qian Dong, Shangde Lei, Shangqing Tu, Shangtong Yang, Shaoyou Lu, Shijie Li, Shuang Li, Shuang-Li, Shuxun Yang, Sibo Yi, Tianshu Yu, Wei Tian, Weihan Wang, Wenbo Yu, Weng Lam Tam, Wenjie Liang, Wentao Liu, Xiao Wang, Xiaohan Jia, Xiaotao Gu, Xiaoying Ling, Xin Wang, Xing Fan, Xingru Pan, Xinyuan Zhang, Xinze Zhang, Xiuqing Fu, Xunkai Zhang, Yabo Xu, Yandong Wu, Yida Lu, Yidong Wang, Yilin Zhou, Yiming Pan, Ying Zhang, Yingli Wang, Yingru Li, Yinpei Su, Yipeng Geng, Yitong Zhu, Yongkun Yang, Yuhang Li, Yuhao Wu, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yuxuan Zhang, Zezhen Liu, Zhen Yang, Zhengda Zhou, Zhongpei Qiao, Zhuoer Feng, Zhuorui Liu, Zichen Zhang, Zihan Wang, Zijun Yao, Zikang Wang, Ziqiang Liu, Ziwei Chai, Zixuan Li, Zuodong Zhao, Wenguang Chen, Jidong Zhai, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang
- 日期: 2025-08-08
- ArXiv主页 : https://arxiv.org/abs/2508.06471
- gitHub仓库 : https://github.com/zai-org/GLM-4.5
英文摘要
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.
中文摘要
我们提出GLM-4.5,这是一个总参数355B、激活参数32B的开源专家混合(MoE)大型语言模型,采用同时支持思考模式与直接响应模式的混合推理方法。通过在23T词元上的多阶段训练,以及包含专家模型迭代与强化学习的全面后训练,GLM-4.5在代理、推理和编码(ARC)任务上均表现强劲:在TAU-Bench上得分70.1%,在AIME 24上得分91.0%,在SWE-bench Verified上得分64.2%。尽管参数量远少于多个竞争对手,GLM-4.5在所有被评估模型中综合排名第三,在代理类基准上排名第二。我们同时发布GLM-4.5(355B参数)与紧凑版本GLM-4.5-Air(106B参数),以推动推理与代理AI系统的研究。代码、模型及更多信息见 https://github.com/zai-org/GLM-4.5。
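摘要中"总参数355B、激活参数32B"源于专家混合(MoE)架构:每个词元只被路由到少数专家,因此前向计算只动用一小部分参数。下面用numpy给出一个与GLM-4.5实现无关的top-k路由极简示意(维度、专家数均为假设值,仅用于说明为何激活参数远小于总参数)。

```python
import numpy as np

def topk_moe_forward(x, gate_w, experts_w, k=2):
    """极简的 top-k MoE 前向示意:x 为单个词元的隐藏向量。

    gate_w:    (d, n_experts) 路由器权重
    experts_w: (n_experts, d, d) 每个专家一个线性层(示意用,真实模型中为FFN)
    仅被选中的 k 个专家参与计算,因此"激活参数"远小于总参数。
    """
    logits = x @ gate_w                          # (n_experts,) 路由打分
    topk = np.argsort(logits)[-k:]               # 选出得分最高的 k 个专家
    weights = np.exp(logits[topk])
    weights /= weights.sum()                     # 只对被选中的专家做 softmax 归一化
    out = np.zeros_like(x)
    for w, e in zip(weights, topk):
        out += w * (experts_w[e] @ x)            # 只计算 k 个专家的输出并加权
    return out, topk

rng = np.random.default_rng(0)
d, n_experts = 16, 8                             # 假设的超参数,并非 GLM-4.5 的真实配置
x = rng.normal(size=d)
gate_w = rng.normal(size=(d, n_experts))
experts_w = rng.normal(size=(n_experts, d, d)) * 0.1
y, used = topk_moe_forward(x, gate_w, experts_w, k=2)
print("使用的专家:", used, "输出维度:", y.shape)
```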
We-Math 2.0:一个用于激励视觉数学推理的多功能MathBook系统
- 标题: We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
- 作者: Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, Jie Wang, Chong Sun, Chen Li, Honggang Zhang
- 日期: 2025-08-14
- ArXiv主页 : https://arxiv.org/abs/2508.10433
- 论文链接 : https://arxiv.org/pdf/2508.10433
- 项目链接 : https://we-math2.github.io/
- gitHub仓库 : https://github.com/We-Math/We-Math2.0
英文摘要
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various tasks, but still struggle with complex mathematical reasoning. Existing research primarily focuses on dataset construction and method optimization, often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. In this paper, we introduce We-Math 2.0, a unified system that integrates a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to comprehensively enhance the mathematical reasoning abilities of MLLMs. The key contributions of We-Math 2.0 are fourfold: (1) MathBook Knowledge System: We construct a five-level hierarchical system encompassing 491 knowledge points and 1,819 fundamental principles. (2) MathBook-Standard & Pro: We develop MathBook-Standard, a dataset that ensures broad conceptual coverage and flexibility through dual expansion. Additionally, we define a three-dimensional difficulty space and generate 7 progressive variants per problem to build MathBook-Pro, a challenging dataset for robust training. (3) MathBook-RL: We propose a two-stage RL framework comprising: (i) Cold-Start Fine-tuning, which aligns the model with knowledge-oriented chain-of-thought reasoning; and (ii) Progressive Alignment RL, leveraging average-reward learning and dynamic data scheduling to achieve progressive alignment across difficulty levels. (4) MathBookEval: We introduce a comprehensive benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL performs competitively with existing baselines on four widely-used benchmarks and achieves strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.
中文摘要
多模态大语言模型(MLLM)在各类任务上表现出色,但在复杂数学推理上仍有困难。现有研究主要集中在数据集构建和方法优化上,往往忽视两个关键方面:全面的知识驱动设计和以模型为中心的数据空间建模。本文提出We-Math 2.0,这是一个统一系统,整合了结构化数学知识体系、以模型为中心的数据空间建模以及基于强化学习(RL)的训练范式,以全面增强MLLM的数学推理能力。We-Math 2.0的主要贡献有四点:(1)MathBook知识体系:我们构建了一个包含491个知识点和1,819条基本原理的五级层次体系。(2)MathBook-Standard与Pro:我们开发了MathBook-Standard数据集,通过双重扩展保证广泛的概念覆盖与灵活性;此外,我们定义了一个三维难度空间,并为每道题生成7个渐进式变体,构建出用于鲁棒训练的高难度数据集MathBook-Pro。(3)MathBook-RL:我们提出一个两阶段RL框架,包括:(i)冷启动微调,使模型对齐面向知识的思维链推理;(ii)渐进对齐RL,利用平均奖励学习和动态数据调度,实现跨难度级别的渐进对齐。(4)MathBookEval:我们引入一个覆盖全部491个知识点、推理步数分布多样的综合基准。实验结果表明,MathBook-RL在四个广泛使用的基准上与现有基线表现相当,并在MathBookEval上取得强劲结果,显示出在数学推理上良好的泛化前景。
NextStep-1:迈向大规模连续词元自回归图像生成
- 标题: NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
- 作者: NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu
- 日期: 2025-08-14
- ArXiv主页 : https://arxiv.org/abs/2508.10711
- 论文链接 : https://arxiv.org/pdf/2508.10711
- 项目链接 : https://stepfun.ai/research/en/nextstep1
- gitHub仓库 : https://github.com/stepfun-ai/NextStep-1
英文摘要
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
中文摘要
目前主流的文本到图像自回归(AR)模型,要么依赖计算开销巨大的扩散模型来处理连续图像词元,要么采用向量量化(VQ)获得离散词元并付出量化损失。本文中,我们以NextStep-1推进自回归范式:这是一个14B的自回归模型,搭配157M的流匹配头,以下一词元预测为目标,在离散文本词元和连续图像词元上进行训练。NextStep-1在文本到图像生成任务中取得了自回归模型的最先进性能,在高保真图像合成方面展现出强大能力。此外,我们的方法在图像编辑上也表现突出,凸显了这一统一方法的能力与通用性。为促进开放研究,我们将向社区发布代码与模型。
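摘要提到用157M的流匹配头对连续图像词元做下一词元预测。下面给出一个与官方实现无关的条件流匹配损失示意:取线性插值路径 x_t = (1-t)·x0 + t·x1,回归目标速度 x1 - x0;其中的维度与占位的预测函数均为假设。

```python
import numpy as np

def flow_matching_loss(x1, velocity_fn, rng):
    """条件流匹配损失的极简示意:
    x1:          (B, D) 目标连续词元(例如图像patch的潜变量)
    velocity_fn: 给定 (x_t, t) 预测速度场的函数(真实系统中由流匹配头承担)
    路径取线性插值 x_t = (1 - t) * x0 + t * x1,回归目标为速度 x1 - x0。
    """
    B, D = x1.shape
    x0 = rng.normal(size=(B, D))                 # 噪声起点
    t = rng.uniform(size=(B, 1))                 # 每个样本独立采样时间
    xt = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0
    pred_v = velocity_fn(xt, t)
    return np.mean((pred_v - target_v) ** 2)     # MSE 回归速度场

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))
dummy_head = lambda xt, t: np.zeros_like(xt)     # 占位的"流匹配头",仅用于跑通示例
print("flow matching loss:", flow_matching_loss(x1, dummy_head, rng))
```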
WebWatcher:开拓视觉-语言深度研究代理的新前沿
- 标题: WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
- 作者: Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
- 日期: 2025-08-07
- ArXiv主页 : https://arxiv.org/abs/2508.05748
- 论文链接 : https://arxiv.org/pdf/2508.05748
- 项目链接 : https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
- gitHub仓库 : https://github.com/Alibaba-NLP/WebAgent
英文摘要
Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge, and the use of more sophisticated tools compared to text-based agents. To address this limitation, we introduce WebWatcher, a multi-modal Agent for Deep Research equipped with enhanced visual-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a benchmark with BrowseComp-style that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baseline, RAG workflow and open-source agents in four challenging VQA benchmarks, which paves the way for solving complex multimodal information-seeking tasks.
中文摘要
诸如Deep Research之类的网络代理已展现出超人的认知能力,能够解决极具挑战性的信息检索问题。然而,大多数研究仍以文本为中心,忽略了现实世界中的视觉信息。这使得多模态深度研究极具挑战性,因为与纯文本代理相比,此类代理需要更强的感知、逻辑与知识能力,并使用更复杂的工具。为解决这一限制,我们提出WebWatcher,这是一个具备增强视觉-语言推理能力的多模态深度研究代理。它利用高质量的合成多模态轨迹进行高效的冷启动训练,借助多种工具进行深度推理,并通过强化学习进一步提升泛化能力。为了更好地评估多模态代理的能力,我们提出BrowseComp-VL,这是一个BrowseComp风格的基准,要求进行同时涉及视觉与文本信息的复杂信息检索。实验结果表明,WebWatcher在四个具有挑战性的VQA基准上显著优于专有基线、RAG工作流和开源代理,为解决复杂的多模态信息检索任务铺平了道路。
ReasonRank:以强大的推理能力赋能段落排序
- 标题: ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
- 作者: Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li, Dawei Yin, Zhicheng Dou
- 日期: 2025-08-09
- ArXiv主页 : https://arxiv.org/abs/2508.07050
- 论文链接 : https://arxiv.org/pdf/2508.07050
- 项目链接 : https://brightbenchmark.github.io/
- gitHub仓库 : https://github.com/8421BCD/ReasonRank
英文摘要
Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker ReasonRank outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance 40.6 on the BRIGHT leaderboard\footnote{https://brightbenchmark.github.io/.} Our codes are available at https://github.com/8421BCD/ReasonRank.
中文摘要
基于大语言模型(LLM)的列表式(listwise)排序在许多段落排序任务中表现出色。随着大型推理模型的发展,许多研究表明,测试时的逐步推理有助于提升列表式排序性能。然而,由于推理密集型训练数据的稀缺,现有重排器在许多复杂排序场景中表现不佳,推理密集型重排器的排序能力在很大程度上仍待开发。本文首先提出一个自动化的推理密集型训练数据合成框架,从多个领域采集训练查询与段落,并使用DeepSeek-R1生成高质量的训练标签;同时设计了自一致性数据过滤机制以保证数据质量。为使列表式重排器具备强推理能力,我们进一步提出两阶段后训练方法:冷启动监督微调(SFT)阶段用于学习推理模式,强化学习(RL)阶段用于进一步提升排序能力。在RL阶段,基于列表式排序的特点,我们设计了多视角排序奖励,其效果优于单纯基于排序指标的奖励。大量实验表明,我们训练的推理密集型重排器ReasonRank显著优于现有基线,且延迟远低于逐点式重排器Rank1。在进一步实验中,ReasonRank在BRIGHT排行榜(https://brightbenchmark.github.io/)上取得了40.6的最先进(SOTA)成绩。代码见 https://github.com/8421BCD/ReasonRank。
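摘要指出RL阶段采用"多视角排序奖励",效果优于单纯基于排序指标的奖励。论文的具体奖励构成此处不展开,下面只给出其中一类"基于指标的奖励"(NDCG@k)的计算示意,便于理解奖励信号如何从模型输出的段落排列得到;示例中的文档ID与相关性分数均为假设。

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """ranked_ids: 模型给出的文档排列;relevance: {doc_id: 相关性分数}。"""
    def dcg(ids):
        return sum(relevance.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ids[:k]))
    ideal = sorted(relevance, key=relevance.get, reverse=True)  # 理想排列
    idcg = dcg(ideal)
    return dcg(ranked_ids) / idcg if idcg > 0 else 0.0

relevance = {"p1": 3, "p2": 0, "p3": 1}            # 假设的相关性标注
print(ndcg_at_k(["p3", "p1", "p2"], relevance))    # 该排列得到的 NDCG,可作为奖励的一部分
```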
WideSearch:代理式广域信息检索的基准测试
- 标题: WideSearch: Benchmarking Agentic Broad Info-Seeking
- 作者: Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.07999
- 论文链接 : https://arxiv.org/pdf/2508.07999
- 项目链接 : https://widesearch-seed.github.io/
- gitHub仓库 : https://github.com/ByteDance-Seed/WideSearch
英文摘要
From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
中文摘要
从专业研究到日常规划,许多任务的瓶颈在于大范围的信息搜集,这类工作与其说认知复杂,不如说高度重复。随着大语言模型(LLM)的快速发展,由LLM驱动的自动搜索代理为把人从这类繁琐工作中解放出来提供了可能。然而,由于缺乏合适的基准,这些代理能否可靠、完整地完成此类"宽上下文"搜集,在很大程度上仍未被评估。为弥补这一空白,我们提出WideSearch,这是一个用于评估代理在大规模搜集任务上可靠性的新基准。该基准包含来自15个以上不同领域、基于真实用户查询手工整理的200个问题(英文100个、中文100个)。每个任务都要求代理搜集可逐条客观验证的大规模原子信息,并将其整理成组织良好的输出。严格的五阶段质量控制流程保证了数据集的难度、完整性和可验证性。我们对10余个最先进的代理搜索系统进行了评测,包括单代理、多代理框架以及端到端的商业系统。大多数系统的总体成功率接近0%,表现最好的也仅达到5%。然而,在给予充足时间的情况下,多名人工测试者交叉验证可以达到接近100%的成功率。这些结果表明,当前的搜索代理在大规模信息搜集方面存在严重缺陷,凸显了代理搜索未来研发的紧迫方向。我们的数据集、评估流程和基准结果已公开发布于 https://widesearch-seed.github.io/。
自我演进AI代理的全面综述:连接基础模型与终身代理系统的新范式
- 标题: A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
- 作者: Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng
- 日期: 2025-08-10
- ArXiv主页 : https://arxiv.org/abs/2508.07407
- 论文链接 : https://arxiv.org/pdf/2508.07407
- 项目链接 : https://huggingface.co/spaces/X-iZhang/Awesome-Self-Evolving-Agents
- gitHub仓库 : https://github.com/EvoAgentX/Awesome-Self-Evolving-Agents
英文摘要
Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.
中文摘要
大型语言模型的最新进展激发了人们对能够解决复杂现实任务的AI代理的浓厚兴趣。然而,大多数现有代理系统依赖于部署后保持静态的人工配置,限制了它们适应动态变化环境的能力。为此,近期研究探索了代理演进技术,旨在基于交互数据和环境反馈自动增强代理系统。这一新兴方向为自我演进的AI代理奠定了基础,它将基础模型的静态能力与终身代理系统所需的持续适应性衔接起来。在这项综述中,我们对自我演进代理系统的现有技术进行了全面回顾。具体而言,我们首先引入一个统一的概念框架,抽象出自我演进代理系统设计背后的反馈回路。该框架突出四个关键组成部分:系统输入、代理系统、环境和优化器,为理解和比较不同策略提供基础。基于该框架,我们系统回顾了针对代理系统不同组成部分的各类自我演进技术。我们还考察了为生物医学、编程和金融等专业领域开发的特定领域演进策略,这些领域的优化目标与领域约束紧密耦合。此外,我们专门讨论了自我演进代理系统的评估、安全与伦理问题,这对确保其有效性和可靠性至关重要。本综述旨在帮助研究人员和从业者系统地理解自我演进的AI代理,为开发更具适应性、自主性和终身性的代理系统奠定基础。
Matrix-3D:全向可探索的3D世界生成
- 标题: Matrix-3D: Omnidirectional Explorable 3D World Generation
- 作者: Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, Eric Li, Yang Liu, Yikai Wang, Hao-Xiang Guo, Yahui Zhou
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.08086
- 论文链接 : https://arxiv.org/pdf/2508.08086
- 项目链接 : https://matrix-3d.github.io/
英文摘要
Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video model to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilize panoramic representation for wide-coverage omnidirectional explorable 3D world generation that combines conditional video generation and panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as condition, to enable high-quality and geometrically consistent scene video generation. To lift the panorama scene video to 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more in https://matrix-3d.github.io.
中文摘要
从单张图像或文本提示生成可探索的3D世界,是空间智能的基石。近期工作利用视频模型实现大范围、可泛化的3D世界生成,但现有方法生成的场景范围往往有限。在这项工作中,我们提出Matrix-3D,一个利用全景表示进行大覆盖、全向可探索3D世界生成的框架,结合了条件视频生成与全景3D重建。我们首先训练一个以轨迹为引导的全景视频扩散模型,以场景网格渲染结果为条件,实现高质量且几何一致的场景视频生成。为了把全景场景视频提升到3D世界,我们提出两种方法:(1)一个前馈式的大型全景重建模型,用于快速3D场景重建;(2)一条基于优化的流程,用于精确、精细的3D场景重建。为促进有效训练,我们还提出Matrix-Pano数据集,这是首个大规模合成数据集,包含116K条带深度与轨迹标注的高质量静态全景视频序列。大量实验表明,我们提出的框架在全景视频生成和3D世界生成上均达到最先进性能。更多内容见 https://matrix-3d.github.io。
Story2Board:一种无需训练的富表现力分镜生成方法
- 标题: Story2Board: A Training-Free Approach for Expressive Storyboard Generation
- 作者: David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, Dani Lischinski
- 日期: 2025-08-13
- ArXiv主页 : https://arxiv.org/abs/2508.09983
- 论文链接 : https://arxiv.org/pdf/2508.09983
- 项目链接 : https://daviddinkevich.github.io/Story2Board/
- gitHub仓库 : https://github.com/daviddinkevich/Story2Board
英文摘要
We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.
中文摘要
我们提出Story2Board,一个无需训练、从自然语言生成富表现力分镜(storyboard)的框架。现有方法狭隘地聚焦于主体身份一致性,忽视了视觉叙事的关键方面,例如空间构图、背景演化和叙事节奏。为此,我们提出一个由两部分组成的轻量级一致性框架:潜变量画板锚定(Latent Panel Anchoring),在各画板之间保留共享的角色参考;以及相互注意力值混合(Reciprocal Attention Value Mixing),在相互注意力强的词元对之间柔和地混合视觉特征。这些机制无需修改网络结构或微调即可增强一致性,使最先进的扩散模型能够生成视觉上多样而又一致的分镜。在生成结构化方面,我们使用现成的语言模型把自由形式的故事转换为落实到画板级别的提示。在评估方面,我们提出Rich Storyboard基准,这是一组开放域叙事,除一致性外还用于评估布局多样性和背景驱动的叙事能力。我们还引入新的场景多样性(Scene Diversity)指标,量化分镜之间的空间与姿态变化。定性与定量结果以及用户研究均表明,Story2Board生成的分镜比现有基线更具动态感、更连贯、也更具叙事吸引力。
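摘要中的"相互注意力值混合"(Reciprocal Attention Value Mixing)大意是:当两块画板中的词元对彼此的注意力都较强时,柔和地混合它们的value特征。下面是一个与论文实现无关的numpy示意:相互强度用双向注意力的逐元素最小值近似,混合系数lam为假设参数。

```python
import numpy as np

def reciprocal_value_mixing(attn_ab, attn_ba, v_a, v_b, lam=0.3):
    """attn_ab: (Na, Nb) 画板A词元对画板B词元的注意力;attn_ba: (Nb, Na) 反向注意力。
    v_a: (Na, D), v_b: (Nb, D) 两块画板的value特征。
    对A中每个词元,找到相互注意力最强的B词元,并按强度软混合其value。
    """
    reciprocal = np.minimum(attn_ab, attn_ba.T)          # (Na, Nb) 双向都强才算"相互"
    j = reciprocal.argmax(axis=1)                        # A中每个词元对应的B中最强配对
    s = reciprocal.max(axis=1, keepdims=True)            # 配对强度,取值[0,1]
    return (1.0 - lam * s) * v_a + (lam * s) * v_b[j]    # 强度越高,混入越多对方的value

rng = np.random.default_rng(0)
Na, Nb, D = 5, 6, 4
attn_ab = rng.dirichlet(np.ones(Nb), size=Na)            # 每行和为1的注意力分布
attn_ba = rng.dirichlet(np.ones(Na), size=Nb)
mixed = reciprocal_value_mixing(attn_ab, attn_ba, rng.normal(size=(Na, D)), rng.normal(size=(Nb, D)))
print(mixed.shape)
```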
PRELUDE:一个需要对长上下文进行全局理解与推理的基准
- 标题: PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
- 作者: Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
- 日期: 2025-08-13
- ArXiv主页 : https://arxiv.org/abs/2508.09848
- 论文链接 : https://arxiv.org/pdf/2508.09848
- 项目链接 : https://gorov.github.io/prelude/leaderboard.html
英文摘要
We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
中文摘要
我们提出PRELUDE,一个通过判断角色前传故事是否与原著的正史叙事一致来评估长上下文理解能力的基准。与现有基准相比,我们的任务对全局理解和深度推理提出了更高要求:由于前传并非原著故事的一部分,评估其合理性通常需要检索并整合仅间接相关的信息。经验上,88%的实例需要来自叙事中多个部分的证据。实验结果凸显了该任务的挑战性:使用最先进LLM的上下文学习、RAG和域内训练,以及商用DeepResearch服务,都落后人类超过15%。进一步的人工研究表明,模型常常在推理有缺陷的情况下给出正确答案,推理准确率与人类相比存在超过30%的差距。这些发现凸显了长上下文理解与推理仍有巨大的改进空间。
Omni-Effects:统一且空间可控的视觉特效生成
- 标题: Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
- 作者: Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.07981
- 论文链接 : https://arxiv.org/pdf/2508.07981
- 项目链接 : https://amap-ml.github.io/Omni-Effects.github.io/
- gitHub仓库 : https://github.com/AMAP-ML/Omni-Effects
英文摘要
Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, a first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
中文摘要
视觉特效(VFX)是现代电影制作中不可或缺的视觉增强手段。尽管视频生成模型为VFX制作提供了高性价比的方案,当前方法仍受限于逐特效的LoRA训练,只能生成单一特效。这一根本限制阻碍了需要空间可控复合特效的应用,即在指定位置同时生成多种特效。然而,把多种特效整合进统一框架面临重大挑战:多特效联合训练时特效间的相互干扰以及空间上的不可控。为应对这些挑战,我们提出Omni-Effects,这是第一个能够生成提示引导特效与空间可控复合特效的统一框架。框架核心包含两项关键创新:(1)基于LoRA的专家混合(LoRA-MoE),采用一组专家LoRA,在统一模型中整合多种特效,同时有效缓解跨任务干扰。(2)空间感知提示(SAP),将空间掩码信息注入文本词元,实现精确的空间控制;此外,我们在SAP中集成独立信息流(IIF)模块,隔离各个特效对应的控制信号,防止不必要的混合。为支撑这项研究,我们通过结合图像编辑与首末帧到视频(FLF2V)合成的新数据采集流程,构建了完整的VFX数据集Omni-VFX,并引入专门的VFX评估框架来验证模型性能。大量实验表明,Omni-Effects实现了精确的空间控制与多样的特效生成,使用户能够同时指定所需特效的类别与位置。
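摘要中的LoRA-MoE是在冻结的基础权重上并联多个专家LoRA,并用路由权重加权组合,使单个统一模型容纳多种特效。下面给出一个假设维度的numpy示意(单个线性层、softmax路由),并非论文实现;真实系统中的路由条件可来自提示或空间掩码。

```python
import numpy as np

def lora_moe_linear(x, w_base, loras, gate_logits):
    """x: (D_in,);w_base: (D_out, D_in) 冻结的基础权重。
    loras: [(A, B), ...],A: (r, D_in), B: (D_out, r),每种特效对应一个LoRA专家。
    gate_logits: (n_experts,) 路由logits。
    """
    gate = np.exp(gate_logits - gate_logits.max())
    gate = gate / gate.sum()                              # softmax 路由权重
    out = w_base @ x                                      # 冻结主干的输出
    for g, (A, B) in zip(gate, loras):
        out += g * (B @ (A @ x))                          # 按路由权重叠加各专家LoRA的增量
    return out

rng = np.random.default_rng(0)
D_in, D_out, r, n = 8, 6, 2, 3                            # 假设的维度与专家数
x = rng.normal(size=D_in)
w_base = rng.normal(size=(D_out, D_in))
loras = [(rng.normal(size=(r, D_in)) * 0.1, rng.normal(size=(D_out, r)) * 0.1) for _ in range(n)]
print(lora_moe_linear(x, w_base, loras, rng.normal(size=n)).shape)
```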
Voost:用于双向虚拟试穿与脱衣的统一可扩展扩散Transformer
- 标题: Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
- 作者: Seungyong Lee, Jeong-gi Kwak
- 日期: 2025-08-06
- ArXiv主页 : https://arxiv.org/abs/2508.04825
- 论文链接 : https://arxiv.org/pdf/2508.04825
- 项目链接 : https://nxnai.github.io/Voost/
- gitHub仓库 : https://github.com/nxnai/Voost
英文摘要
Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
中文摘要
虚拟试穿旨在合成一个人穿着目标服装的逼真图像,但准确建模服装与人体的对应关系始终是一个难题,尤其是在姿态与外观变化较大时。本文提出Voost,一个统一且可扩展的框架,用单个扩散Transformer联合学习虚拟试穿(try-on)与脱衣(try-off)。通过联合建模这两个任务,Voost让每一对服装-人物样本同时监督两个方向,并支持在生成方向与服装类别上灵活施加条件,从而在不依赖任务专用网络、辅助损失或额外标注的情况下增强服装-人体关系推理。此外,我们提出两种推理期技术:注意力温度缩放,用于提升对分辨率或掩码变化的鲁棒性;以及利用任务间双向一致性的自校正采样。大量实验表明,Voost在试穿与脱衣两类基准上均取得最先进结果,在对齐精度、视觉保真度和泛化能力上持续优于强基线。
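摘要提出的推理期"注意力温度缩放",可以理解为在常规的 1/sqrt(d) 缩放之外,再对注意力logits除以一个温度tau,从而让注意力分布对分辨率或掩码变化更稳健。下面是一个通用的单头注意力numpy示意(tau为假设参数),并非Voost的实现。

```python
import numpy as np

def attention_with_temperature(q, k, v, tau=1.0):
    """q: (Nq, d), k/v: (Nk, d)。tau > 1 使注意力分布更平滑,tau < 1 更尖锐。"""
    logits = q @ k.T / (np.sqrt(q.shape[-1]) * tau)       # 在常规缩放之外再除以温度
    logits -= logits.max(axis=-1, keepdims=True)          # 数值稳定
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax
    return attn @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(4, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
print(attention_with_temperature(q, k, v, tau=1.5).shape)
```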
看、听、记忆与推理:具有长期记忆的多模态代理
- 标题: Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
- 作者: Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
- 日期: 2025-08-13
- ArXiv主页 : https://arxiv.org/abs/2508.09736
- 论文链接 : https://arxiv.org/pdf/2508.09736
- 项目链接 : https://m3-agent.github.io/
- gitHub仓库 : https://github.com/ByteDance-Seed/m3-agent
英文摘要
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances the multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent
中文摘要
我们提出M3-Agent,一个具备长期记忆的新型多模态代理框架。与人类类似,M3-Agent能够处理实时的视觉与听觉输入,以构建并更新其长期记忆。除情景记忆外,它还会发展语义记忆,从而随时间积累世界知识。其记忆以实体为中心、以多模态格式组织,使其能对环境形成更深入、更一致的理解。给定指令后,M3-Agent自主进行多轮迭代推理,并从记忆中检索相关信息以完成任务。为评估多模态代理的记忆有效性与基于记忆的推理能力,我们构建了M3-Bench,一个新的长视频问答基准。M3-Bench包含100段从机器人视角新录制的真实视频(M3-Bench-robot)以及929段覆盖多种场景的网络来源视频(M3-Bench-web)。我们标注了问答对,用以测试对代理应用至关重要的关键能力,例如人物理解、常识抽取和跨模态推理。实验结果显示,经强化学习训练的M3-Agent优于最强基线(使用Gemini-1.5-pro与GPT-4o的提示式代理),在M3-Bench-robot、M3-Bench-web和VideoMME-long上的准确率分别高出6.7%、7.7%和5.3%。我们的工作使多模态代理更接近类人的长期记忆,并为其实际设计提供了洞见。模型、代码与数据见 https://github.com/bytedance-seed/m3-agent。
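"以实体为中心的多模态长期记忆"可以用一个按实体索引、区分情景记忆与语义记忆的存储结构来直观理解。下面是一个与M3-Agent实现无关的极简数据结构示意,其中的字段名与关键词检索方式均为假设;真实系统通常使用向量检索。

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class EntityMemory:
    episodic: list = field(default_factory=list)   # 情景记忆:带时间戳的多模态观察
    semantic: dict = field(default_factory=dict)   # 语义记忆:随时间累积的属性/事实

class LongTermMemory:
    def __init__(self):
        self.entities = defaultdict(EntityMemory)  # 以实体为中心索引

    def add_observation(self, entity, timestamp, modality, content):
        self.entities[entity].episodic.append(
            {"t": timestamp, "modality": modality, "content": content})

    def update_fact(self, entity, key, value):
        self.entities[entity].semantic[key] = value

    def retrieve(self, entity, keyword=None):
        mem = self.entities.get(entity)
        if mem is None:
            return {"episodic": [], "semantic": {}}
        hits = [e for e in mem.episodic if keyword is None or keyword in str(e["content"])]
        return {"episodic": hits, "semantic": dict(mem.semantic)}

m = LongTermMemory()
m.add_observation("Alice", 12.5, "audio", "Alice 说她下周出差")
m.update_fact("Alice", "职业", "工程师")
print(m.retrieve("Alice", keyword="出差"))
```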
ToonComposer:以生成式后关键帧技术简化卡通制作
- 标题: ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
- 作者: Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan
- 日期: 2025-08-14
- ArXiv主页 : https://arxiv.org/abs/2508.10881
- 论文链接 : https://arxiv.org/pdf/2508.10881
- 项目链接 : https://lg-li.github.io/project/tooncomposer
- gitHub仓库 : https://github.com/TencentARC/ToonComposer
英文摘要
Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.
中文摘要
传统的卡通与动画制作包含关键帧绘制、中间帧补间和上色等阶段,需要大量人工投入。尽管AI近来取得进展,现有方法往往分别处理这些阶段,导致误差累积和伪影。例如,补间方法难以处理大幅运动,而上色方法需要逐帧的密集草图。为此,我们提出ToonComposer,一个把补间与上色统一到单一"后关键帧"(post-keyframing)阶段的生成模型。ToonComposer采用稀疏草图注入机制,利用关键帧草图提供精确控制。此外,它通过基于空间低秩适配器的卡通适配方法,把现代视频基础模型迁移到卡通领域,同时保持其时间先验不变。ToonComposer只需一张草图和一帧上色参考即可工作,在稀疏输入下表现出色,同时也支持在任意时间位置提供多张草图以实现更精细的运动控制。这种双重能力减少了人工负担并提升了灵活性,使艺术家在真实场景中更有掌控力。为评估模型,我们还构建了PKBench,一个包含人工手绘草图、模拟真实用例的基准。评估表明,ToonComposer在视觉质量、运动一致性和制作效率上优于现有方法,为AI辅助卡通制作提供了更优、更灵活的方案。
超越十轮:通过大规模异步RL解锁长程代理搜索
- 标题: Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
- 作者: Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.07976
- gitHub仓库 : https://github.com/inclusionAI/ASearcher
英文摘要
Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and codes in https://github.com/inclusionAI/ASearcher.
中文摘要
基于LLM的代理的最新进展表明,通过集成外部工具可以出色地处理复杂的知识密集型任务。在各类工具中,搜索工具对获取海量外部知识起着关键作用。然而,开源代理仍未达到专家级的搜索智能:即消解歧义查询、生成精确搜索、分析结果并进行充分探索的能力。现有方法在可扩展性、效率和数据质量上均有不足,例如现有在线RL方法中较小的轮数上限(如<=10)限制了复杂策略的学习。本文提出ASearcher,一个面向搜索代理大规模RL训练的开源项目。我们的主要贡献包括:(1)可扩展的完全异步RL训练,在保持高训练效率的同时支持长程搜索;(2)一个基于提示的LLM代理,自主合成高质量且具挑战性的问答,构建出大规模QA数据集。经RL训练后,我们基于提示的QwQ-32B代理取得显著提升,在xBench和GAIA上的Avg@4分别提高46.7%和20.8%。值得注意的是,我们的代理展现出极端的长程搜索能力:训练期间工具调用超过40轮、输出词元超过150k。在代理设计简单且不依赖外部LLM的情况下,ASearcher-Web-QwQ在xBench上取得42.1、在GAIA上取得52.8的Avg@4分数,超越现有的开源32B代理。我们在 https://github.com/inclusionAI/ASearcher 开源了模型、训练数据和代码。
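摘要中"完全异步RL训练"的关键在于把长程搜索轨迹的生成与参数更新解耦:采样协程持续产出轨迹放入队列,训练协程按批次取用,互不等待。下面用asyncio给出一个高度简化的示意(轨迹生成以sleep模拟,奖励为随机数),并非ASearcher的实现。

```python
import asyncio, random

async def rollout_worker(wid, queue):
    """模拟一个长程搜索代理:每条轨迹耗时不同(对应不同轮数的工具调用)。"""
    while True:
        turns = random.randint(1, 5)
        await asyncio.sleep(0.01 * turns)              # 模拟多轮搜索/阅读的耗时
        await queue.put({"worker": wid, "turns": turns, "reward": random.random()})

async def trainer(queue, batch_size=4, steps=3):
    for step in range(steps):
        batch = [await queue.get() for _ in range(batch_size)]   # 谁先完成就先用谁,不必等慢轨迹
        avg_r = sum(t["reward"] for t in batch) / batch_size
        print(f"step {step}: 用 {batch_size} 条轨迹更新,平均奖励 {avg_r:.3f}")

async def main():
    queue = asyncio.Queue(maxsize=16)
    workers = [asyncio.create_task(rollout_worker(i, queue)) for i in range(8)]
    await trainer(queue)
    for w in workers:
        w.cancel()

asyncio.run(main())
```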
第一部分:技巧还是陷阱?深入剖析用于LLM推理的RL
- 标题: Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
- 作者: Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.08221
- 论文链接 : https://arxiv.org/pdf/2508.08221
英文摘要
Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating the RL for the LLM domain. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
中文摘要
用于LLM推理的强化学习已迅速成为一个热门研究领域,相关的算法创新与实际应用研究激增。尽管取得了这些进展,仍存在若干关键挑战,包括缺乏使用RL技术的标准化指南,以及对其底层机制的理解碎片化。此外,不一致的实验设置、训练数据差异和模型初始化差异导致了相互矛盾的结论,模糊了这些技术的关键特性,也使从业者在选择合适技术时感到困惑。本文在统一的开源框架内,通过严格的复现和隔离评估,系统回顾了被广泛采用的RL技术。我们通过细粒度实验(包括不同难度的数据集、不同模型规模与架构)分析了每种技术的内部机制、适用场景和核心原理。基于这些洞见,我们给出了针对特定设置选择RL技术的明确指南,并为在LLM领域应用RL的从业者提供了可靠的路线图。最后,我们发现两种技术的极简组合即可用原始(vanilla)PPO损失解锁无critic策略的学习能力。结果表明,这一简单组合能持续提升性能,超越GRPO和DAPO等策略。
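摘要提到"两个技巧的极简组合 + 原始PPO损失"即可训练无critic的策略。作为背景,下面给出标准PPO裁剪目标的numpy示意(词元级、优势已给定、取负号作为损失),与论文中具体选择的技巧组合无关。

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """标准PPO裁剪目标(取负号作为损失):
    ratio = exp(logp_new - logp_old);目标为 min(ratio*A, clip(ratio, 1-eps, 1+eps)*A)。
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

logp_old = np.log(np.array([0.2, 0.5, 0.1]))   # 旧策略下各词元的对数概率(示例数值)
logp_new = np.log(np.array([0.3, 0.4, 0.2]))   # 新策略下的对数概率
adv = np.array([1.0, -0.5, 2.0])               # 每个词元的优势估计
print("PPO clip loss:", ppo_clip_loss(logp_new, logp_old, adv))
```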
SONAR-LLM:在句子嵌入中思考、以词元说话的自回归Transformer
- 标题: SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
- 作者: Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev
- 日期: 2025-08-07
- ArXiv主页 : https://arxiv.org/abs/2508.05305
- 论文链接 : https://arxiv.org/pdf/2508.05305
英文摘要
The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
中文摘要
最近提出的大概念模型(LCM)通过预测句子级嵌入序列来生成文本,并使用均方误差或扩散目标进行训练。我们提出SONAR-LLM,一个仅解码器的Transformer,它在同一个连续SONAR嵌入空间中"思考",但通过经由冻结的SONAR解码器反传的词元级交叉熵进行监督。这一混合目标保留了LCM的语义抽象,同时去除其扩散采样器,恢复了基于似然的训练信号。在39M到1.3B参数的各个模型规模上,SONAR-LLM均取得具有竞争力的生成质量。我们报告了缩放趋势、消融实验和基准结果,并发布了完整的训练代码和全部预训练检查点,以促进可复现性和后续研究。
MolmoAct:能够进行空间推理的动作推理模型
- 标题: MolmoAct: Action Reasoning Models that can Reason in Space
- 作者: Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.07917
- 论文链接 : https://arxiv.org/pdf/2508.07917
- 项目链接 : https://allenai.org/blog/molmoact
- gitHub仓库 : https://github.com/allenai/molmoact
英文摘要
Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of vision-language-action models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact
中文摘要
推理是有目的行动的核心,然而大多数机器人基础模型直接把感知和指令映射到控制,限制了适应性、泛化能力和语义落地。我们提出动作推理模型(ARM),这是一类通过结构化三阶段流程整合感知、规划与控制的视觉-语言-动作模型。我们的模型MolmoAct把观测和指令编码为具备深度感知的词元,以可编辑的轨迹形式生成中层空间规划,并预测精确的低层动作,从而实现可解释、可操控的行为。MolmoAct-7B-D在仿真和真实环境中均表现强劲:在SimplerEnv视觉匹配任务上取得70.5%的零样本准确率,超过闭源的Pi-0和GR00T N1;在LIBERO上平均成功率达86.6%,其中长程任务上比ThinkAct额外高出6.3%;在真实环境微调中,相比Pi-0-FAST,单臂任务进度额外提升10%,双臂任务额外提升22.7%。它在分布外泛化上还比基线额外高出23.3%,并在开放式指令遵循和轨迹操控上获得最高的人类偏好评分。此外,我们首次发布MolmoAct数据集,这是一个中期训练(mid-training)机器人数据集,包含覆盖多样场景与任务的超过10,000条高质量机器人轨迹。使用该数据集训练可使基础模型的总体性能平均提升5.5%。我们发布全部模型权重、训练代码、采集的数据集以及动作推理数据集,使MolmoAct既成为最先进的机器人基础模型,也成为构建ARM的开放蓝图:通过结构化推理把感知转化为有目的的行动。博客:https://allenai.org/blog/molmoact
UI-Venus技术报告:利用RFT构建高性能UI代理
- 标题: UI-Venus Technical Report: Building High-performance UI Agents with RFT
- 作者: Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang
- 日期: 2025-08-14
- ArXiv主页 : https://arxiv.org/abs/2508.10833
- 论文链接 : https://arxiv.org/pdf/2508.10833
- 项目链接 : https://github.com/inclusionAI/UI-Venus
- gitHub仓库 : https://github.com/inclusionAI/UI-Venus
英文摘要
We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement finetune (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5.To show UI-Venus's summary and planing ability, we also evaluate it on the AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rate, also beating existing models.To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies.To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement that refine historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the publish of SOTA open-source UI agents, comprehensive data cleaning protocols and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
中文摘要
我们提出UI-Venus,一个仅以屏幕截图为输入、基于多模态大语言模型的原生UI代理。UI-Venus基于Qwen2.5-VL,仅使用数十万条高质量训练样本,通过强化微调(RFT)在UI定位(grounding)和导航任务上均达到SOTA性能。具体而言,UI-Venus的7B和72B版本在标准定位基准Screenspot-V2 / Pro上分别取得94.1% / 50.8%和95.3% / 61.9%,超过此前的SOTA基线,包括开源的GTA1和闭源的UI-TARS-1.5。为展示UI-Venus的总结与规划能力,我们还在在线UI导航竞技场AndroidWorld上进行评估,7B和72B版本分别取得49.1%和65.9%的成功率,同样优于现有模型。为实现这一点,我们为UI定位和导航任务精心设计了奖励函数及相应的高效数据清洗策略;为进一步提升导航性能,我们提出自演进轨迹历史对齐与稀疏动作增强(Self-Evolving Trajectory History Alignment & Sparse Action Enhancement),对历史推理轨迹进行精炼并平衡稀疏但关键动作的分布,从而在复杂UI任务中实现更连贯的规划和更好的泛化。我们的贡献包括发布SOTA开源UI代理、完整的数据清洗协议,以及一个用于提升导航性能的新型自演进框架,以期推动社区的进一步研究与发展。代码见 https://github.com/antgroup/UI-Venus。
Mol-R1:迈向分子发现中的显式长链思维推理
- 标题: Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
- 作者: Jiatong Li, Weida Wang, Qinggang Zhang, Junxian Li, Di Zhang, Changmeng Zheng, Shufei Zhang, Xiaoyong Wei, Qing Li
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.08401
- 论文链接 : https://arxiv.org/pdf/2508.08401
英文摘要
Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.
中文摘要
大语言模型(LLM),尤其是DeepSeek-R1和QwQ等显式长链思维(CoT)推理模型,已展现出强大的推理能力,在常识推理和数学推断上表现出色。尽管如此,长链思维推理模型在分子发现等知识密集型领域常因能力有限、效率低下而受到批评。该领域的成功需要对领域知识的精确理解,包括分子结构和化学原理,而分子数据的固有复杂性和高质量专家标注的稀缺使这一点颇具挑战。为弥合这一差距,我们提出Mol-R1,一个旨在提升类R1显式长链思维推理LLM在基于文本的分子生成中可解释性与推理性能的新框架。我们的方法始于一个高质量推理数据集,它通过"先验调控的上下文蒸馏"(Prior Regulation via In-context Distillation,PRID)构建,这是一种专门的蒸馏策略,可在先验规则指导下高效生成成对的推理轨迹。在此基础上,我们提出分子迭代适应(MoIA),一种将监督微调(SFT)与强化策略优化(RPO)迭代结合的训练策略,专门用于提升类R1推理模型在分子发现上的推理性能。最后,我们在基于文本的分子推理生成任务上检验了Mol-R1的表现,其性能优于现有基线。
Klear-Reasoner:通过保留梯度的裁剪策略优化提升推理能力
- 标题: Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
- 作者: Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Guorui Zhou
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.07629
- 论文链接 : https://arxiv.org/pdf/2508.07629
- 项目链接 : https://github.com/suu990901/KlearReasoner
- gitHub仓库 : https://github.com/suu990901/KlearReasoner
英文摘要
We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
中文摘要
我们提出Klear-Reasoner,一个具备长推理能力的模型,在解题过程中表现出审慎的思考,并在多个基准上取得出色成绩。尽管当前社区已有许多与推理模型相关的优秀工作,但由于训练细节披露不完整,复现高性能推理模型仍存在诸多问题。本报告对该推理模型进行了深入分析,涵盖从数据准备、长思维链监督微调(long CoT SFT)到强化学习(RL)的完整后训练流程,并对每个实验组件给出详细的消融研究。在SFT数据方面,我们的实验表明,少量高质量数据源比大量多样化数据源更有效,且高难度样本无需准确率过滤即可取得更好结果。此外,我们研究了RL中现有裁剪机制的两个关键问题:裁剪会抑制关键的探索信号,并忽略次优轨迹。为应对这些挑战,我们提出保留梯度的裁剪策略优化(GPPO),温和地让被裁剪词元的梯度得以反向传播。GPPO不仅增强了模型的探索能力,也提升了其从负样本中学习的效率。Klear-Reasoner在数学和编程上展现出卓越的推理能力:在AIME 2024上得分90.5%,在AIME 2025上得分83.2%,在LiveCodeBench V5上得分66.0%,在LiveCodeBench V6上得分58.1%。
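摘要指出标准裁剪会把被裁剪词元的梯度完全截断,而GPPO会"温和地"保留这些梯度;其精确公式请以论文原文为准。下面用PyTorch给出一个示意性的直通(straight-through)写法,仅用于演示"前向仍取裁剪值、反向仍沿比率传梯度"这一思路,并非GPPO的官方实现;示例中的数值与eps均为假设。

```python
import torch

def clipped_term_standard(ratio, adv, eps=0.2):
    # 标准PPO:clamp在越界处梯度为0,被裁剪的词元不再提供学习信号
    return torch.clamp(ratio, 1 - eps, 1 + eps) * adv

def clipped_term_grad_preserving(ratio, adv, eps=0.2):
    # 示意性的直通写法:前向值等于裁剪值,反向仍沿 ratio 传梯度(非论文原始公式)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return (clipped.detach() + ratio - ratio.detach()) * adv

logp_new = torch.tensor([0.0, 1.0], requires_grad=True)   # 第二个词元的比率会被裁剪
logp_old = torch.tensor([0.0, 0.0])
adv = torch.tensor([1.0, 1.0])
ratio = torch.exp(logp_new - logp_old)

for fn in (clipped_term_standard, clipped_term_grad_preserving):
    if logp_new.grad is not None:
        logp_new.grad = None
    (-fn(ratio, adv).mean()).backward(retain_graph=True)
    print(fn.__name__, "grad:", logp_new.grad.tolist())    # 标准裁剪下第二项梯度为0
```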
复杂逻辑指令生成
- 标题: Complex Logical Instruction Generation
- 作者: Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song
- 日期: 2025-08-12
- ArXiv主页 : https://arxiv.org/abs/2508.09125
- gitHub仓库 : https://github.com/mianzhang/LogicIF
英文摘要
Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions becomes increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in the instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
中文摘要
指令遵循推动了大语言模型(LLM)的当前时代,也是推理和代理行为等更高级能力的基础技能。随着任务越来越具挑战性,自然语言指令中蕴含的逻辑结构日益复杂。然而,LLM在这类富含逻辑的指令上的表现仍缺乏充分研究。我们提出LogicIFGen和LogicIFEval。LogicIFGen是一个可扩展的自动化框架,用于从代码函数生成可验证的指令,代码函数可以自然地表达条件、嵌套、递归和函数调用等丰富逻辑。我们进一步整理了一组复杂代码函数,并使用LogicIFGen构建了LogicIFEval,一个包含426条可验证的逻辑复杂指令的基准。实验表明,当前最先进的LLM仍难以正确遵循LogicIFEval中的指令:大多数LLM只能遵循不到60%的指令,暴露出指令遵循能力的明显不足。代码与基准:https://github.com/mianzhang/LogicIF
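LogicIFGen的核心思路是:从一段带复杂控制流的代码函数出发,把它表述为自然语言指令,并用代码的真实执行结果来校验模型是否严格遵循了指令。下面是一个自拟的小例子(函数逻辑与指令措辞均为假设,并非基准中的真实条目),展示"代码执行结果作为可验证答案"的流程。

```python
def transform(nums):
    """参考函数:含条件、循环与提前终止,对应一条逻辑复杂的自然语言指令。"""
    total, kept = 0, []
    for x in nums:
        if x < 0:
            break                      # 指令:遇到负数立即停止
        if x % 2 == 0:
            kept.append(x * 3)         # 指令:偶数乘3后保留
        else:
            total += x                 # 指令:奇数累加
    return {"kept": kept, "odd_sum": total}

instruction = (
    "依次处理列表中的数字:遇到负数立即停止;偶数乘以3后放入kept;"
    "奇数累加到odd_sum;最后输出kept和odd_sum。输入:[2, 5, 4, -1, 8]"
)
expected = transform([2, 5, 4, -1, 8])          # 代码执行得到的可验证标准答案
model_output = {"kept": [6, 12], "odd_sum": 5}  # 假设这是被测LLM的回答
print("指令:", instruction)
print("标准答案:", expected, "| 模型是否遵循:", model_output == expected)
```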
Stand-In:面向视频生成的轻量级即插即用身份控制
- 标题: Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
- 作者: Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li
- 日期: 2025-08-11
- ArXiv主页 : https://arxiv.org/abs/2508.07901
- 论文链接 : https://arxiv.org/pdf/2508.07901
- 项目链接 : https://stand-in-video.github.io/
- gitHub仓库 : https://github.com/WeChatCV/Stand-In
英文摘要
Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just sim1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
中文摘要
在生成式AI领域,生成与用户指定身份相符的高保真人物视频既重要又具有挑战性。现有方法往往依赖大量训练参数,且与其他AIGC工具的兼容性不足。本文提出Stand-In,一个用于视频生成中身份保持的轻量级即插即用框架。具体而言,我们在预训练视频生成模型中引入一个条件图像分支,通过带条件位置映射的受限自注意力实现身份控制,仅用2000对数据即可快速学习。尽管仅引入并训练约1%的额外参数,我们的框架在视频质量和身份保持上都取得出色效果,优于其他全参数训练方法。此外,该框架可以无缝集成到其他任务中,例如主体驱动的视频生成、姿态参考的视频生成、风格化和换脸。
角色旋转:可控且一致的4D角色动画
- 标题: CharacterShot: Controllable and Consistent 4D Character Animation
- 作者: Junyao Gao, Jiaxing Li, Wenran Liu, Yanhong Zeng, Fei Shen, Kai Chen, Yanan Sun, Cairong Zhao
- 日期: 2025-08-10
- ArXiv主页 : https://arxiv.org/abs/2508.07409
- gitHub仓库 : https://github.com/Jeoyal/CharacterShot
英文摘要
In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which allows for any 2D pose sequence as controllable signal. We then lift the animation model from 2D to 3D through introducing dual-attention module together with camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a large-scale dataset Character4D, containing 13,115 unique characters with diverse appearances and motions, rendered from multiple viewpoints. Extensive experiments on our newly constructed benchmark, CharacterBench, demonstrate that our approach outperforms current state-of-the-art methods. Code, models, and datasets will be publicly available at https://github.com/Jeoyal/CharacterShot.
中文摘要
在本文中,我们提出了CharacterShot,一个可控且一致的4D角色动画框架,使任何个人设计师都能够从单张参考角色图像和一段2D姿态序列创建动态3D角色(即4D角色动画)。我们首先基于先进的DiT图像到视频模型,预训练一个强大的2D角色动画模型,它允许任意2D姿态序列作为可控信号。然后,我们通过引入双注意力模块并结合相机先验,将动画模型从2D提升到3D,以生成具有时空一致性和空间视角一致性的多视图视频。最后,我们在这些多视图视频上采用一种新颖的邻域约束4D高斯泼溅(Gaussian Splatting)优化,从而得到连续且稳定的4D角色表示。此外,为了提高以角色为中心的性能,我们构建了一个大规模数据集Character4D,包含13,115个外观和动作各异、从多个视角渲染的独特角色。在我们新构建的基准CharacterBench上进行的大量实验表明,我们的方法优于当前最先进的方法。代码、模型和数据集将在https://github.com/Jeoyal/CharacterShot上公开。
BrowseComp-Plus:深入研究代理的更公平,更透明的评估基准
- 标题: BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
- 作者: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
- 日期: 2025-08-08
- ArXiv主页 : https://arxiv.org/abs/2508.06600
- 论文链接 : https://arxiv.org/pdf/2508.06600
- 项目链接 : https://texttron.github.io/BrowseComp-Plus/
- gitHub仓库 : https://github.com/texttron/BrowseComp-Plus
英文摘要
Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas the GPT-5 achieves 55.9%. Integrating the GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
中文摘要
将大型语言模型(LLM)与搜索工具相结合的深度研究(Deep-Research)代理,在处理需要迭代式搜索规划并对搜索结果进行推理的复杂查询方面已被证明能有效提升效果。对BrowseComp等现有基准的评估依赖黑盒的实时网页搜索API,存在明显局限:(1)公平性:动态且不透明的Web API阻碍了深度研究方法的公平比较和可复现性;(2)透明性:无法控制文档语料库,难以单独衡量检索器的贡献。换言之,当前的评估或许能比较某一时刻的完整深度研究系统,但无法支撑受控良好的实验,以揭示底层深度研究LLM的能力。为应对这些挑战,我们提出了BrowseComp-Plus,一个源自BrowseComp、采用固定且精心整理的语料库的基准。BrowseComp-Plus中的每个查询都包含经人工验证的支持文档和挖掘出的高难度负例,从而支持受控实验。实验表明该基准能有效区分深度研究系统的性能。例如,开源模型Search-R1与BM25检索器配合时准确率为3.86%,而GPT-5达到55.9%。将GPT-5与Qwen3-Embedding-8B检索器结合后,准确率进一步提升至70.1%,且搜索调用更少。该基准支持对深度研究代理和检索方法进行全面评估与解耦分析,有助于深入理解深度研究系统中的检索有效性、引用准确性和上下文工程。
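下面是一个在固定本地语料上做可复现检索实验的最小示意,用来体现"固定语料 + 固定检索器"带来的可控性;示例使用第三方库 rank_bm25(`pip install rank-bm25`),语料与查询均为占位内容,并非 BrowseComp-Plus 的真实数据或官方评测代码。

```python
# 最小示意:在固定语料上用 BM25 做可复现的检索实验
from rank_bm25 import BM25Okapi

corpus = [
    "GLM-4.5 is a Mixture-of-Experts model with 355B total parameters.",
    "BM25 is a classic sparse retrieval function based on term frequency.",
    "Deep-research agents iteratively plan searches and reason over results.",
]
docs = [d.lower().split() for d in corpus]          # 简单分词;真实基准会用更规范的预处理
bm25 = BM25Okapi(docs)

def retrieve(query: str, k: int = 2):
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [(corpus[i], float(scores[i])) for i in ranked[:k]]

# 固定语料 + 固定检索器,使不同深度研究代理之间的比较可控、可复现
for doc, score in retrieve("how do deep research agents plan searches"):
    print(f"{score:.3f}  {doc}")
```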
时间是一个功能:在扩散语言模型中利用时间动态
- 标题: Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
- 作者: Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen
- 日期: 2025-08-12
- ArXiv主页 : https://arxiv.org/abs/2508.09138
- 论文链接 : https://arxiv.org/pdf/2508.09138
- 项目链接 : https://aim-uofa.github.io/dLLM-MidTruth/
- gitHub仓库 : https://github.com/aim-uofa/dLLM-MidTruth
英文摘要
Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
中文摘要
扩散大语言模型(dLLM)通过迭代去噪生成文本,但当前的解码策略丢弃了丰富的中间预测,只保留最终输出。我们的工作揭示了一个关键现象:时间振荡,即正确答案往往在中间过程中出现,却在之后的去噪步骤中被覆盖。为了解决这个问题,我们提出了两种利用时间一致性的互补方法:1)时间自一致性投票(Temporal Self-Consistency Voting),一种免训练的测试时解码策略,通过聚合各去噪步骤的预测来选择最一致的输出;2)一种称为时间一致性强化(Temporal Consistency Reinforcement)的后训练方法,它使用时间语义熵(TSE,衡量中间预测之间语义稳定性的指标)作为奖励信号,以鼓励稳定的生成。在多个基准上的实验结果证明了我们方法的有效性。仅使用负TSE奖励,我们就在Countdown数据集上相对现有dLLM取得了24.7%的显著平均提升。结合准确率奖励后,我们在GSM8K、MATH500、SVAMP和Countdown上分别取得了2.0%、4.3%、6.6%和25.3%的绝对提升。我们的发现凸显了dLLM中时间动态尚未被发掘的潜力,并提供了两个简单而有效的工具来加以利用。
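下面用几行 Python 给出"时间自一致性投票"思想的最小示意:对每个去噪步骤抽取出的候选答案做(可选加权的)投票;其中的权重方案只是假设,论文中的具体聚合方式以原文为准。

```python
from collections import defaultdict

def temporal_self_consistency_vote(step_answers, weights=None):
    """对各去噪步骤抽取出的答案做(加权)投票,选出最一致者;weights 缺省为均匀。"""
    weights = weights or [1.0] * len(step_answers)
    score = defaultdict(float)
    for ans, w in zip(step_answers, weights):
        score[ans] += w
    return max(score, key=score.get)

# 示意:正确答案 "17" 在中间步骤出现,却在最后一步被覆盖(时间振荡);
# 跨步骤投票仍能选回多数步骤一致的 "17"。
steps = ["23", "17", "17", "17", "21"]
late_biased = [0.5, 0.8, 1.0, 1.0, 1.2]   # 假设性的权重:越靠后的步骤权重越大
print(temporal_self_consistency_vote(steps))              # -> 17
print(temporal_self_consistency_vote(steps, late_biased)) # -> 17
```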
Vertexregen:网格生成具有连续的细节水平
- 标题: VertexRegen: Mesh Generation with Continuous Level of Detail
- 作者: Xiang Zhang, Yawar Siddiqui, Armen Avetisyan, Chris Xie, Jakob Engel, Henry Howard-Jenkins
- 日期: 2025-08-12
- ArXiv主页 : https://arxiv.org/abs/2508.09062
- 论文链接 : https://arxiv.org/pdf/2508.09062
- 项目链接 : https://vertexregen.github.io/
英文摘要
We introduce VertexRegen, a novel mesh generation framework that enables generation at a continuous level of detail. Existing autoregressive methods generate meshes in a partial-to-complete manner and thus intermediate steps of generation represent incomplete structures. VertexRegen takes inspiration from progressive meshes and reformulates the process as the reversal of edge collapse, i.e. vertex split, learned through a generative model. Experimental results demonstrate that VertexRegen produces meshes of comparable quality to state-of-the-art methods while uniquely offering anytime generation with the flexibility to halt at any step to yield valid meshes with varying levels of detail.
中文摘要
我们提出了VertexRegen,一种新颖的网格生成框架,能够在连续的细节层次上进行生成。现有的自回归方法以从部分到完整的方式生成网格,因此生成过程的中间步骤表示的是不完整的结构。VertexRegen从渐进网格(progressive meshes)中汲取灵感,将生成过程重新表述为边坍缩的逆过程,即顶点分裂,并通过生成模型加以学习。实验结果表明,VertexRegen生成的网格质量与最先进的方法相当,同时独特地支持随时生成(anytime generation),可以在任意步骤停止,得到具有不同细节层次的有效网格。
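为直观说明"回放顶点分裂即可在任意细节层次停下"的性质,下面给出一个高度简化的数据结构草图;其中顶点分裂记录的具体字段、索引约定均为编辑者的假设,与论文的表示和生成模型无关。

```python
from dataclasses import dataclass

@dataclass
class VertexSplit:
    """一次"顶点分裂"(边坍缩的逆操作)的简化记录:
    新增一个顶点,删除若干旧三角形,再添加引用新顶点的新三角形。
    faces_to_remove 中的下标约定为:指向应用该记录时的面列表。"""
    new_vertex: tuple     # (x, y, z)
    faces_to_remove: set  # 被细化掉的三角形下标集合
    faces_to_add: list    # 新三角形,每个为 (i, j, k) 顶点索引

def mesh_at_level(base_vertices, base_faces, splits, k):
    """回放前 k 条分裂记录,得到第 k 级细节的有效网格;在任意 k 处停止都可用。"""
    vertices = list(base_vertices)
    faces = list(base_faces)
    for split in splits[:k]:
        vertices.append(split.new_vertex)
        faces = [f for i, f in enumerate(faces) if i not in split.faces_to_remove]
        faces.extend(split.faces_to_add)
    return vertices, faces
```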
关于扩散语言模型的调查
- 标题: A Survey on Diffusion Language Models
- 作者: Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen
- 日期: 2025-08-14
- ArXiv主页 : https://arxiv.org/abs/2508.10875
- 论文链接 : https://arxiv.org/pdf/2508.10875
英文摘要
Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.
中文摘要
扩散语言模型(DLM)正迅速崛起,成为占主导地位的自回归(AR)范式之外强大而有前景的替代方案。通过迭代去噪过程并行生成令牌,DLM在降低推理延迟和捕获双向上下文方面具有固有优势,从而能够对生成过程进行细粒度控制。在实现数倍加速的同时,近期的进展使DLM能够展现出与自回归模型相当的性能,使其成为各种自然语言处理任务中令人信服的选择。在这项综述中,我们提供了当前DLM领域的整体概览。我们梳理了它的演变及其与自回归、掩码语言模型等其他范式的关系,并涵盖了基础原理和最先进的模型。我们的工作提供了最新、全面的分类法,并对当前技术进行了深入分析,从预训练策略到先进的后训练方法。这项综述的另一贡献是对DLM推理策略和优化的全面回顾,包括解码并行性、缓存机制和生成质量方面的改进。我们还重点介绍了DLM多模态扩展的最新方法,并梳理了它们在各种实际场景中的应用。此外,我们讨论了DLM的局限与挑战,包括效率、长序列处理和基础设施需求,同时勾勒了未来的研究方向,以推动这一快速发展领域的持续进步。项目GitHub地址:https://github.com/VILA-Lab/Awesome-DLMs。
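作为综述所讨论的"迭代去噪 + 并行解码"这一共同骨架的直观说明,下面给出一个与具体模型无关的掩码扩散式解码草图(假设 `model(x)` 返回形状为 (1, L, vocab) 的 logits,每步只"揭开"置信度最高的若干位置,其余留待后续步骤)。这只是示意性的通用流程,不代表任何具体 DLM 的官方实现。

```python
import torch

@torch.no_grad()
def mask_diffusion_decode(model, length, mask_id, steps=8, tokens_per_step=None):
    """掩码扩散式解码的通用示意:每一步并行预测所有被掩码位置,
    只揭开最自信的一批,其余位置留到后续步骤继续去噪。"""
    x = torch.full((1, length), mask_id, dtype=torch.long)
    tokens_per_step = tokens_per_step or max(1, length // steps)
    for _ in range(steps):
        masked = (x == mask_id).nonzero(as_tuple=True)[1]   # 仍被掩码的位置
        if masked.numel() == 0:
            break
        probs = torch.softmax(model(x), dim=-1)[0]          # (L, vocab)
        conf, pred = probs[masked].max(dim=-1)              # 各掩码位置的最优预测及置信度
        top = conf.topk(min(tokens_per_step, masked.numel())).indices
        x[0, masked[top]] = pred[top]                       # 并行揭开最自信的若干 token
    return x
```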
MEMP:探索代理程序记忆
- 标题: Memp: Exploring Agent Procedural Memory
- 作者: Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
- 日期: 2025-08-08
- ArXiv主页 : https://arxiv.org/abs/2508.06433
- 论文链接 : https://arxiv.org/pdf/2508.06433
英文摘要
Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.
中文摘要
基于大型语言模型(LLM)的代理在各种任务上表现出色,但其程序性记忆十分脆弱:要么是人工设计的,要么与静态参数纠缠在一起。在这项工作中,我们研究了为代理赋予可学习、可更新且终身有效的程序性记忆的策略。我们提出了Memp,它将过去的代理轨迹蒸馏成细粒度的分步指令和更高层的脚本式抽象,并探讨了程序性记忆的构建(Build)、检索(Retrieval)和更新(Update)等不同策略的影响。结合一个持续更新、纠正和淘汰其内容的动态机制,该记忆库随新经验同步演化。在TravelPlanner和ALFWorld上的实证评估表明,随着记忆库的不断完善,代理在类似任务上的成功率和效率稳步提升。此外,由更强模型构建的程序性记忆依然保有价值:将程序性记忆迁移到较弱的模型上也能带来可观的性能提升。
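下面用一个极简的 Python 类勾勒"构建—检索—更新"程序性记忆的接口形态;其中基于词重叠的检索和按失败计数淘汰的规则都是编辑者的假设,仅用于说明流程,并非 Memp 的实际策略。

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    task: str
    steps: list          # 细粒度的分步指令
    script: str          # 更高层的"脚本式"抽象
    successes: int = 0
    failures: int = 0

class ProceduralMemory:
    """Build / Retrieve / Update 的最小示意(检索用关键词重叠打分,仅为假设)。"""
    def __init__(self):
        self.items = []

    def build(self, task, trajectory, script):
        self.items.append(Procedure(task, trajectory, script))

    def retrieve(self, query, k=1):
        def overlap(p):  # 朴素的相似度:查询与任务描述的词重叠数
            return len(set(query.lower().split()) & set(p.task.lower().split()))
        return sorted(self.items, key=overlap, reverse=True)[:k]

    def update(self, proc, success):
        if success:
            proc.successes += 1
        else:
            proc.failures += 1
            if proc.failures > proc.successes + 3:   # 假设性的淘汰规则
                self.items.remove(proc)

memory = ProceduralMemory()
memory.build("book a 3-day trip to Tokyo",
             ["search flights", "filter by budget", "reserve hotel near station"],
             "plan(dates) -> search(flights+hotels) -> verify(budget) -> book()")
print(memory.retrieve("plan a trip to Tokyo")[0].script)
```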