【论文速递】2026年第03周(Jan-11-17)(Robotics/Embodied AI/LLM)

中文使用 googletrans 翻译，翻译不对的地方以英文为准

观看、推理和搜索：开放网络上代理视频推理的视频深度研究基准
- 英文摘要
- 中文摘要
BabyVision：超越语言的视觉推理
- 英文摘要
- 中文摘要
STEP3-VL-10B 技术报告
- 英文摘要
- 中文摘要
用地图思考：用于地理定位的强化并行地图增强代理
- 英文摘要
- 中文摘要
具有视觉语言推理的城市社会语义分割
- 英文摘要
- 中文摘要
奖励稀有：用于大语言模型创造性问题求解的独特性感知强化学习
- 英文摘要
- 中文摘要
DeepResearchEval：深度研究任务构建和代理评估的自动化框架
- 英文摘要
- 中文摘要
用于算法代码优化的受控自我进化
- 英文摘要
- 中文摘要
MMFormalizer：野外多模态自动形式化
- 英文摘要
- 中文摘要
MAXS：使用 LLM 代理进行元自适应探索
- 英文摘要
- 中文摘要
用于推理的协作多智能体测试时强化学习
- 英文摘要
- 中文摘要
PaCoRe：学习通过并行协调推理扩展测试时间计算
- 英文摘要
- 中文摘要
A^3-Bench：通过锚点和吸引子激活对记忆驱动的科学推理进行基准测试
- 英文摘要
- 中文摘要
MemGovern：通过学习受管理的人类经验来增强代码代理
- 英文摘要
- 中文摘要
FlowAct-R1：迈向交互式人形视频生成
- 英文摘要
- 中文摘要
视频生成的运动归因
- 英文摘要
- 中文摘要
Solar Open 技术报告
- 英文摘要
- 中文摘要
VIBE：基于可视化指令的编辑器
- 英文摘要
- 中文摘要
用于卓越长 CoT 推理的分布对齐序列蒸馏
- 英文摘要
- 中文摘要
Ministral 3
- 英文摘要
- 中文摘要
思维的分子结构：绘制长链思维推理的拓扑
- 英文摘要
- 中文摘要
KnowMe-Bench：对终生数字伴侣的人的理解进行基准测试
- 英文摘要
- 中文摘要
Qwen3-VL-Embedding 和 Qwen3-VL-Reranker：最先进的多模态检索和排名的统一框架
- 英文摘要
- 中文摘要
Fast-ThinkAct：通过可言语的潜在规划进行高效的视觉-语言-动作推理
- 英文摘要
- 中文摘要
ArenaRL：通过基于锦标赛的相对排名扩展开放式智能体的强化学习
- 英文摘要
- 中文摘要
CaricatureGS：用高斯曲率夸大 3D 高斯飞溅面
- 英文摘要
- 中文摘要
大规模使用工具的面向用户的多轮对话生成
- 英文摘要
- 中文摘要
MHLA：通过令牌级多头恢复线性注意力的表现力
- 英文摘要
- 中文摘要

观看、推理和搜索：开放网络上代理视频推理的视频深度研究基准

标题: Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
作者: Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Kunyi Wang, Rui Xu, Sen Hu, Jianheng Hou, Hao Peng, Chengwei Qin, Xiaobin Hu, Hong Peng, Ronghao Chen, Huacan Wang
日期: 2026-01-11
ArXiv主页 : https://arxiv.org/abs/2601.06943
论文链接 : https://arxiv.org/pdf/2601.06943
项目链接 : https://videodr-benchmark.github.io/#/home
gitHub仓库 : https://github.com/QuantaAlpha/VideoDR-Benchmark

英文摘要

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

中文摘要

在现实世界的视频问答场景中，视频通常仅提供局部的视觉线索，而可验证的答案则分布在开放网络上；因此，模型需要联合执行跨帧线索提取、迭代检索和基于多跳推理的验证。为了弥补这一差距，我们构建了第一个视频深度研究基准 VideoDR。VideoDR 以视频为条件的开放域视频问答为中心，需要跨帧视觉锚点提取、交互式网络检索以及对视频与网络联合证据的多跳推理；通过严格的人工标注和质量控制，我们获得了跨越六个语义领域的高质量视频深度研究样本。我们在 Workflow 和 Agentic 范式下评估了多个闭源和开源多模态大语言模型，结果表明 Agentic 并不总是优于 Workflow：它的增益取决于模型在长检索链上维持初始视频锚点的能力。进一步分析表明，目标漂移和长程一致性是核心瓶颈。总之，VideoDR 为在开放网络环境中研究视频代理提供了系统基准，并揭示了下一代视频深度研究代理面临的主要挑战。

BabyVision：超越语言的视觉推理

标题: BabyVision: Visual Reasoning Beyond Language
作者: Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li
日期: 2026-01-10
ArXiv主页 : https://arxiv.org/abs/2601.06521
论文链接 : https://arxiv.org/pdf/2601.06521
项目链接 : https://unipat.ai/blog/BabyVision
gitHub仓库 : https://github.com/UniPat-AI/BabyVision

英文摘要

While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge for MLLMs. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproduction.

中文摘要

虽然人类早在获得语言之前就已经发展了核心视觉技能，但当代多模态大语言模型（MLLM）仍然严重依赖语言先验来弥补其脆弱的视觉理解。我们发现了一个关键事实：最先进的 MLLM 在人类（甚至 3 岁儿童）可以轻松解决的基本视觉任务上始终失败。为了系统地研究这一差距，我们引入了 BabyVision，这是一个旨在评估 MLLM 独立于语言知识的核心视觉能力的基准。BabyVision 涵盖广泛的任务，共有 388 个项目，分为 4 个关键类别的 22 个子类。实验结果和人类评估表明，领先的 MLLM 的表现明显低于人类基线。Gemini3-Pro-Preview 得分为 49.7，落后于 6 岁儿童，也远远落后于成人平均得分 94.1。这些结果表明，尽管在知识密集型评估方面表现出色，但当前的 MLLM 仍然缺乏基本的视觉原语。在 BabyVision 上取得的进展代表着向人类水平的视觉感知和推理能力迈出了一步。我们还提出了 BabyVision-Gen 和自动评估工具包，以探索使用生成模型解决视觉推理问题。我们的代码和基准数据发布在 https://github.com/UniPat-AI/BabyVision 以供复现。

STEP3-VL-10B 技术报告

标题: STEP3-VL-10B Technical Report
作者: Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge
日期: 2026-01-14
ArXiv主页 : https://arxiv.org/abs/2601.09668
论文链接 : https://arxiv.org/pdf/2601.09668
项目链接 : https://stepfun-ai.github.io/Step3-VL-10B
gitHub仓库 : https://github.com/stepfun-ai/PaCoRe

英文摘要

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

中文摘要

我们推出了 STEP3-VL-10B，这是一种轻量级开源基础模型，旨在重新定义紧凑效率和前沿级多模态智能之间的权衡。STEP3-VL-10B 通过两个战略转变来实现：首先，在 1.2T 多模态 token 上采用统一的、完全解冻的预训练策略，将语言对齐的感知编码器与 Qwen3-8B 解码器集成在一起，以建立内在的视觉语言协同作用；其次，采用规模化的后训练流程，包含超过 1000 次强化学习迭代。至关重要的是，我们实现了并行协调推理（PaCoRe）来扩展测试时计算，将资源分配给可扩展的感知推理，以探索和综合不同的视觉假设。因此，尽管参数规模仅为 10B，STEP3-VL-10B 仍可以媲美或超越规模大 10 至 20 倍的模型（例如 GLM-4.6V-106B、Qwen3-VL-235B）以及 Gemini 2.5 Pro 和 Seed-1.5-VL 等顶级专有旗舰模型。它提供了同类最佳的性能，在 MMBench 上得分为 92.2%，在 MMMU 上得分为 80.11%，同时在复杂推理方面表现出色，在 AIME2025 上得分为 94.43%，在 MathVision 上得分为 75.95%。我们发布了完整的模型套件，为社区提供强大、高效且可复现的基线。

用地图思考：用于地理定位的强化并行地图增强代理

标题: Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
作者: Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu
日期: 2026-01-08
ArXiv主页 : https://arxiv.org/abs/2601.05432
论文链接 : https://arxiv.org/pdf/2601.05432
项目链接 : https://amap-ml.github.io/Thinking-with-Map/
gitHub仓库 : https://github.com/AMAP-ML/Thinking-with-Map

英文摘要

The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the Thinking with Map ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
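To make the Acc@500m figure concrete, here is a minimal sketch of a distance-thresholded accuracy metric. The abstract does not say how distance is computed, so the great-circle (haversine) distance and the function names `haversine_m` and `acc_at` are assumptions for illustration, not the paper's evaluation code.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def acc_at(preds, truths, threshold_m=500.0):
    """Fraction of predicted (lat, lon) pairs within threshold_m of the truth."""
    if len(preds) != len(truths) or not truths:
        raise ValueError("preds and truths must be non-empty and equally long")
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(preds, truths))
    return hits / len(truths)
```

Under this reading, a prediction counts as correct only if it lands within 500 m of the true capture location, which is why map-level grounding matters far more here than at city or country thresholds.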

中文摘要

图像地理定位任务旨在利用视觉线索预测图像在地球上任意位置的拍摄地点。现有的大型视觉语言模型（LVLM）方法利用了世界知识、思维链推理和代理能力，但忽略了人类常用的一种策略：使用地图。在这项工作中，我们首先为模型赋予"用地图思考"（Thinking with Map）的能力，并将其表述为地图中的代理循环（agent-in-the-map loop）。我们为其开发了一个两阶段优化方案，先进行代理强化学习（RL），再进行并行测试时扩展（TTS）。RL 增强了模型的代理能力以提高采样效率，并行 TTS 使模型能够在做出最终预测之前探索多个候选路径，这对于地理定位至关重要。为了在最新的真实场景图像上评估我们的方法，我们进一步提出了 MAPBench，这是一个完全由真实世界图像组成的综合地理定位训练和评估基准。实验结果表明，我们的方法在大多数指标上都优于现有的开源和闭源模型，特别是与启用 Google 搜索/地图 grounding 模式的 Gemini-3-Pro 相比，将 Acc@500m 从 8.0% 提高到 22.1%。

具有视觉语言推理的城市社会语义分割

标题: Urban Socio-Semantic Segmentation with Vision-Language Reasoning
作者: Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li
日期: 2026-01-15
ArXiv主页 : https://arxiv.org/abs/2601.10477
论文链接 : https://arxiv.org/pdf/2601.10477
项目链接 : https://www.wangfaye.cn/paper/socioreasoner
gitHub仓库 : https://github.com/AMAP-ML/SocioReasoner

英文摘要

As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available in https://github.com/AMAP-ML/SocioReasoner.

中文摘要

作为人类活动的中心，城市地表由丰富的语义实体组成。从卫星图像中分割这些不同的实体对于一系列下游应用至关重要。当前先进的分割模型可以可靠地分割由物理属性定义的实体（例如建筑物、水体），但仍然难以处理由社会属性定义的类别（例如学校、公园）。在这项工作中，我们通过视觉语言模型推理实现社会语义分割。为了实现这一目标，我们引入了名为 SocioSeg 的城市社会语义分割数据集，这是一种新资源，包括卫星图像、数字地图和以分层结构组织的社会语义实体的像素级标签。此外，我们提出了一种名为 SocioReasoner 的新型视觉语言推理框架，它通过跨模态识别和多阶段推理来模拟人类识别和标注社会语义实体的过程。我们采用强化学习来优化这个不可微分的过程，并激发视觉语言模型的推理能力。实验表明，我们的方法相比最先进的模型有所提升，并具有强大的零样本泛化能力。我们的数据集和代码可在 https://github.com/AMAP-ML/SocioReasoner 获取。

奖励稀有：用于大语言模型创造性问题求解的独特性感知强化学习

标题: Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
作者: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
日期: 2026-01-13
ArXiv主页 : https://arxiv.org/abs/2601.08763
论文链接 : https://arxiv.org/pdf/2601.08763
gitHub仓库 : https://github.com/zhiyuanhubj/Uniqueness-Aware-RL

英文摘要

Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
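The reweighting step described above can be sketched in a few lines. This is an illustration only: the mean-baseline advantage, the plain `1 / cluster_size` factor, and the name `uniqueness_weighted_advantages` are assumptions, and the cluster labels are taken as given (in the paper they come from an LLM-based judge).

```python
from collections import Counter


def uniqueness_weighted_advantages(rewards, cluster_ids):
    """Scale each rollout's advantage by the inverse size of its strategy cluster.

    rewards:     one scalar reward per rollout of the same problem
    cluster_ids: one strategy-cluster label per rollout (hypothetical judge output)
    """
    if len(rewards) != len(cluster_ids) or not rewards:
        raise ValueError("rewards and cluster_ids must be non-empty and equally long")
    baseline = sum(rewards) / len(rewards)      # assumed group-mean baseline
    sizes = Counter(cluster_ids)                # number of rollouts per strategy
    return [(r - baseline) / sizes[c] for r, c in zip(rewards, cluster_ids)]


# Three correct rollouts: two share strategy "A", one uses the rarer strategy "B".
adv = uniqueness_weighted_advantages([1.0, 1.0, 1.0, 0.0], ["A", "A", "B", "C"])
```

In this toy group, the correct rollout in the singleton cluster keeps its full advantage while the two redundant ones are halved, which is the qualitative effect the abstract describes.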

中文摘要

强化学习 (RL) 已成为大语言模型 (LLM) 后训练的核心范式，特别是对于复杂的推理任务，但它经常遭受探索崩溃的困扰：策略过早地集中于一小部分主导推理模式，提升了 pass@1，同时限制了 rollout 级别的多样性和 pass@k 的增益。我们认为这种失败源于只对局部 token 行为进行正则化，而不是对解集合的多样性进行正则化。为了解决这个问题，我们提出了独特性感知强化学习，这是一个 rollout 级目标，明确奖励展现罕见高层策略的正确解。我们的方法使用基于 LLM 的评判器，根据高层解题策略对同一问题的 rollout 进行聚类，忽略表面差异，并按与聚类大小成反比的方式重新加权策略优势。因此，正确但新颖的策略比冗余的策略获得更高的奖励。在数学、物理和医学推理基准上，我们的方法在较大的采样预算下持续提升 pass@k，并在不牺牲 pass@1 的情况下增加 pass@k 曲线下的面积 (AUC@K)，同时保持探索并大规模发现更多样化的解题策略。

DeepResearchEval：深度研究任务构建和代理评估的自动化框架

标题: DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
作者: Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing
日期: 2026-01-14
ArXiv主页 : https://arxiv.org/abs/2601.09688
论文链接 : https://arxiv.org/pdf/2601.09688
项目链接 : https://infinity-ailab.github.io/deep_research_eval/
gitHub仓库 : https://github.com/Infinity-AILab/DeepResearchEval

英文摘要

Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
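As a rough illustration of point-wise scoring with task-specific weights, the sketch below aggregates per-dimension scores into one number. The abstract does not specify the aggregation rule, so the weighted mean, the `Dimension` fields, and the example dimension names are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dimension:
    name: str      # task-specific evaluation dimension (hypothetical example names below)
    weight: float  # relative importance derived for this task
    score: float   # judge-assigned score on a shared scale, e.g. 0-10


def weighted_quality(dimensions: List[Dimension]) -> float:
    """Weighted mean of dimension scores; weights need not sum to 1."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("at least one dimension must have positive weight")
    return sum(d.weight * d.score for d in dimensions) / total_weight


report_score = weighted_quality([
    Dimension("coverage", weight=2.0, score=8.0),
    Dimension("analysis depth", weight=1.0, score=6.0),
    Dimension("clarity", weight=1.0, score=4.0),
])
```

Because the dimensions and weights are derived per task, two reports on different tasks are scored against different rubrics; only the aggregation step stays the same.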

中文摘要

深度研究系统广泛用于多步骤网络研究、分析和跨源综合，但对其评估仍然具有挑战性。现有的基准通常需要标注密集的任务构建，依赖于静态评估维度，或者在引用缺失时无法可靠地验证事实。为了弥补这些差距，我们引入了 DeepResearchEval，这是一个用于深度研究任务构建和代理评估的自动化框架。对于任务构建，我们提出了一个角色驱动的流程，生成基于不同用户画像的真实、复杂的研究任务，并应用两阶段过滤器（任务资格和搜索必要性），仅保留需要多源证据整合和外部检索的任务。对于评估，我们提出了一个包含两个组件的代理流程：自适应逐点质量评估，根据每个生成的任务动态导出特定于任务的评估维度、标准和权重；以及主动事实核查，即使在引用缺失的情况下，也可以通过网络搜索自动提取和验证报告中的陈述。

用于算法代码优化的受控自我进化

标题: Controlled Self-Evolution for Algorithmic Code Optimization
作者: Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Yi Xu, Huacan Wang
日期: 2026-01-12
ArXiv主页 : https://arxiv.org/abs/2601.07348
论文链接 : https://arxiv.org/pdf/2601.07348
gitHub仓库 : https://github.com/QuantaAlpha/EvoControl

英文摘要

Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.

中文摘要

自进化方法通过迭代的"生成-验证-细化"循环来增强代码生成，但现有方法的探索效率较低，无法在有限的预算内发现复杂度更优的解。这种低效率源于三个方面：初始化偏差使进化陷入较差的解区域，不受控制的随机操作缺乏反馈指导，以及跨任务的经验利用不足。为了解决这些瓶颈，我们提出了受控自我进化（CSE），它由三个关键组件组成。多样化规划初始化生成结构不同的算法策略，以实现广泛的解空间覆盖。遗传进化用反馈引导机制取代了随机操作，从而实现了有针对性的变异和组合式交叉。分层进化记忆在任务间和任务内两个层面记录成功和失败的经验。EffiBench-X 上的实验表明，CSE 在各种 LLM 主干上始终优于所有基线。此外，CSE 从早期几代开始就实现了更高的效率，并在整个进化过程中保持持续改进。我们的代码可在 https://github.com/QuantaAlpha/EvoControl 上公开获取。

MMFormalizer：野外多模态自动形式化

标题: MMFormalizer: Multimodal Autoformalization in the Wild
作者: Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong
日期: 2026-01-06
ArXiv主页 : https://arxiv.org/abs/2601.03017
论文链接 : https://arxiv.org/pdf/2601.03017
项目链接 : https://mmformalizer.github.io/

英文摘要

Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io

中文摘要

自动形式化将自然语言数学转化为形式化语句以实现机器推理，但由于物理世界的多模态性质，它在野外面临着根本性的挑战，其中物理学需要从视觉元素推断隐藏的约束（例如质量或能量）。为了解决这个问题，我们提出了 MMFormalizer，它通过将自适应基础与来自现实世界数学和物理领域的实体相集成，将自动形式化扩展到文本之外。MMFormalizer 通过递归基础和公理组合，从基于感知的基元递归地构造形式命题，并通过自适应递归终止确保每个抽象都得到视觉证据的支持，并锚定在维度或公理基础上。我们在新的基准 PhyX-AF 上评估 MMFormalizer，该基准包含来自 MathVerse、PhyX、合成几何和解析几何的 115 个精选样本，涵盖各种多模态自动形式化任务。结果表明，GPT-5 和 Gemini-3-Pro 等前沿模型实现了最高的编译和语义准确性，其中 GPT-5 在物理推理方面表现出色，而几何仍然是最具挑战性的领域。总体而言，MMFormalizer 为统一的多模态自动形式化、桥接感知和形式推理提供了一个可扩展的框架。据我们所知，这是第一个能够处理经典力学（源自哈密顿量）以及相对论、量子力学和热力学的多模态自动形式化方法。更多详细信息请访问我们的项目页面：MMFormalizer.github.io

MAXS：使用 LLM 代理进行元自适应探索

标题: MAXS: Meta-Adaptive Exploration with LLM Agents
作者: Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, li yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu
日期: 2026-01-14
ArXiv主页 : https://arxiv.org/abs/2601.09259
论文链接 : https://arxiv.org/pdf/2601.09259
gitHub仓库 : https://github.com/exoskeletonzj/MAXS

英文摘要

Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose Meta-Adaptive Exploration with LLM Agents (MAXS; https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework based on LLM Agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.
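One way to picture the step selection described above is to score each candidate step from its lookahead value estimates, rewarding a high mean and an upward trend while penalising variance. The linear combination, the weights, and the function names here are assumptions made for illustration; the abstract does not give the actual formula.

```python
from statistics import mean, pvariance


def trend_slope(values):
    """Least-squares slope of values against their step index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def score_step(lookahead_values, var_weight=1.0, slope_weight=1.0):
    """Assumed score: mean value, minus variance penalty, plus trend bonus."""
    return (mean(lookahead_values)
            - var_weight * pvariance(lookahead_values)
            + slope_weight * trend_slope(lookahead_values))


def select_step(candidates):
    """Pick the candidate (name -> lookahead value list) with the best score."""
    return max(candidates, key=lambda name: score_step(candidates[name]))


best = select_step({
    "stable": [0.5, 0.5, 0.5],    # consistent estimates
    "erratic": [0.9, 0.1, 0.5],   # same mean, but noisy and trending down
})
```

Both candidates have the same mean value, so the choice is driven entirely by the stability and trend terms, which is the behaviour the abstract attributes to combining variance and slope.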

中文摘要

大型语言模型（LLM）代理通过多种工具的协作展现出固有的推理能力。然而，在代理推理过程中，现有方法经常遭受（i）由于缺乏前瞻而导致的局部短视生成，以及（ii）轨迹不稳定，其中较小的早期错误可能会升级为发散的推理路径。这些问题使得平衡全局有效性和计算效率变得困难。为了解决这两个问题，我们提出了基于 LLM 代理的元自适应探索（MAXS，https://github.com/exoskeletonzj/MAXS），这是一个基于 LLM 代理的元自适应推理框架，灵活地集成了工具执行和推理规划。MAXS 采用前瞻（lookahead）策略将推理路径向前扩展几步，估计工具使用的优势值，并结合步骤一致性方差和步骤间趋势斜率来共同选择稳定、一致、高价值的推理步骤。此外，我们引入了一种轨迹收敛机制，一旦实现路径一致性，该机制就会停止进一步的 rollout 以控制计算成本，从而在多工具推理中实现资源效率和全局有效性之间的平衡。我们在三个基础模型（MiMo-VL-7B、Qwen2.5-VL-7B、Qwen2.5-VL-32B）和五个数据集上进行了广泛的实证研究，证明 MAXS 在性能和推理效率方面始终优于现有方法。进一步的分析证实了我们的前瞻策略和工具使用的有效性。

用于推理的协作多智能体测试时强化学习

标题: Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
作者: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
日期: 2026-01-14
ArXiv主页 : https://arxiv.org/abs/2601.09667
论文链接 : https://arxiv.org/pdf/2601.09667
gitHub仓库 : https://github.com/zhiyuanhubj/MATTRL

英文摘要

Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.

中文摘要

多智能体系统已发展成为许多应用中实用的 LLM 驱动的协作者，从多样性和交叉检查中获得稳健性。然而，多智能体强化学习（MARL）训练是资源密集型且不稳定的：共同适应的队友会导致非平稳性，并且奖励通常稀疏且方差高。因此，我们引入了多智能体测试时强化学习（MATTRL），这是一个在推理时将结构化文本经验注入多智能体讨论的框架。MATTRL 组建由多位专家组成的团队进行多轮讨论，检索并整合测试时的经验，并达成共识以做出最终决策。我们还研究了用于构建轮次级经验池的信用分配，然后将经验重新注入对话中。在医学、数学和教育领域具有挑战性的基准上，MATTRL 的准确率比多智能体基线平均提高了 3.67%，比同类单智能体基线提高了 8.67%。消融研究考察了不同的信用分配方案，并详细比较了它们如何影响训练结果。MATTRL 提供了一种稳定、有效且高效的途径，无需调优即可实现对分布偏移稳健的多智能体推理。

PaCoRe：学习通过并行协调推理扩展测试时间计算

标题: PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
作者: Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
日期: 2026-01-09
ArXiv主页 : https://arxiv.org/abs/2601.05593
论文链接 : https://arxiv.org/pdf/2601.05593
gitHub仓库 : https://github.com/stepfun-ai/PaCoRe

英文摘要

We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.
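The round structure described above can be sketched as a small control loop. This is a hedged outline, not the released pipeline: `reason` and `compact` are hypothetical callables standing in for the model's reasoning and message-compaction calls, the loop runs them sequentially rather than in parallel, and ending with a single synthesising call in the last round is an assumption.

```python
from typing import Callable, List


def parallel_coordinated_reasoning(
    problem: str,
    reason: Callable[[str, List[str]], str],   # hypothetical: one reasoning trajectory
    compact: Callable[[str], str],             # hypothetical: shrink a trajectory to a message
    rounds: int = 3,
    width: int = 4,
) -> str:
    """Run several rounds of wide exploration, passing only compact messages forward."""
    if rounds < 1 or width < 1:
        raise ValueError("rounds and width must be positive")
    messages: List[str] = []
    for round_idx in range(rounds):
        if round_idx == rounds - 1:
            # Final round: a single call synthesises the messages into the answer.
            return reason(problem, messages)
        # Exploration round: `width` trajectories, each conditioned on prior messages.
        trajectories = [reason(problem, messages) for _ in range(width)]
        # Only the compacted findings survive, so the context stays bounded.
        messages = [compact(t) for t in trajectories]
    raise AssertionError("unreachable")
```

The point of the structure is that total tokens generated grow with `rounds * width`, while each individual call only ever sees the problem plus `width` short messages.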

中文摘要

我们引入了并行协调推理（PaCoRe），这是一种训练和推理框架，旨在克服当代语言模型的核心限制：在固定上下文窗口下，它们无法将测试时计算（TTC）扩展到远超顺序推理的规模。PaCoRe 脱离了传统的顺序范式，通过多轮、由消息传递架构协调的大规模并行探索来驱动 TTC。每一轮都会启动许多并行的推理轨迹，将它们的发现压缩为受上下文长度限制的消息，并综合这些消息来指导下一轮并最终产生最终答案。该模型通过大规模、基于结果的强化学习进行端到端训练，掌握了 PaCoRe 所需的综合能力，并可在不超出上下文限制的情况下扩展到数百万 token 的有效 TTC。该方法在不同领域带来了显著改进，尤其是在数学上将推理能力推进到超越前沿系统的水平：8B 模型在 HMMT 2025 上达到 94.5%，通过将有效 TTC 扩展到大约 200 万个 token，超越了 GPT-5 的 93.2%。我们开源了模型检查点、训练数据和完整的推理流程，以加速后续工作。

A^3-Bench：通过锚点和吸引子激活对记忆驱动的科学推理进行基准测试

标题: A^3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation
作者: Jian Zhang, Yu He, Zhiyuan Wang, Zhangqi Wang, Kai He, Fangzhi Xu, Qika Lin, Jun Liu
日期: 2026-01-14
ArXiv主页 : https://arxiv.org/abs/2601.09274
论文链接 : https://arxiv.org/pdf/2601.09274
项目链接 : https://a3-bench.github.io/
gitHub仓库 : https://github.com/exoskeletonzj/A3-Bench

英文摘要

Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory can efficiently reuse knowledge and enhance reasoning consistency and stability. However, existing benchmarks mainly evaluate final answers or step-by-step coherence, overlooking the memory-driven mechanisms that underlie human reasoning, which involves activating anchors and attractors, then integrating them into multi-step inference. To address this gap, we propose A^3-Bench (https://a3-bench.github.io), a benchmark designed to evaluate scientific reasoning through dual-scale memory-driven activation, grounded in Anchor and Attractor Activation. First, we annotate 2,198 science reasoning problems across domains using the SAPM process (subject, anchor & attractor, problem, and memory developing). Second, we introduce a dual-scale memory evaluation framework utilizing anchors and attractors, along with the AAUI (Anchor-Attractor Utilization Index) metric to measure memory activation rates. Finally, through experiments with various base models and paradigms, we validate A^3-Bench and analyze how memory activation impacts reasoning performance, providing insights into memory-driven scientific reasoning.

中文摘要

科学推理不仅依赖于逻辑推理，还依赖于激活先验知识和经验结构。记忆可以有效地重用知识，增强推理的一致性和稳定性。然而，现有的基准主要评估最终答案或逐步的连贯性，忽略了人类推理背后的记忆驱动机制，其中涉及激活锚点和吸引子，然后将它们整合到多步推理中。为了解决这一差距，我们提出了 A^3-Bench（https://a3-bench.github.io），这是一个旨在通过双尺度记忆驱动激活来评估科学推理的基准，以锚点和吸引子激活为基础。首先，我们使用 SAPM 流程（学科、锚点和吸引子、问题和记忆开发）标注了 2,198 个跨领域的科学推理问题。其次，我们引入了利用锚点和吸引子的双尺度记忆评估框架，以及 AAUI（锚点-吸引子利用率指数）指标来测量记忆激活率。最后，通过各种基础模型和范式的实验，我们验证了 A^3-Bench 并分析了记忆激活如何影响推理性能，为记忆驱动的科学推理提供了见解。

MemGovern：通过学习受管理的人类经验来增强代码代理

标题: MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences
作者: Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, Yue Hu, Shaolei Zhang, Yanbing Liu, Ronghao Chen, Huacan Wang
日期: 2026-01-11
ArXiv主页 : https://arxiv.org/abs/2601.06789
论文链接 : https://arxiv.org/pdf/2601.06789
gitHub仓库 : https://github.com/QuantaAlpha/MemGovern

英文摘要

While autonomous software engineering (SWE) agents are reshaping programming paradigms, they currently suffer from a "closed-world" limitation: they attempt to fix bugs from scratch or solely using local context, ignoring the immense historical human experience available on platforms like GitHub. Accessing this open-world experience is hindered by the unstructured and fragmented nature of real-world issue-tracking data. In this paper, we introduce MemGovern, a framework designed to govern and transform raw GitHub data into actionable experiential memory for agents. MemGovern employs experience governance to convert human experience into agent-friendly experience cards and introduces an agentic experience search strategy that enables logic-driven retrieval of human expertise. By producing 135K governed experience cards, MemGovern achieves a significant performance boost, improving resolution rates on the SWE-bench Verified by 4.65%. As a plug-in approach, MemGovern provides a solution for agent-friendly memory infrastructure.

中文摘要

虽然自主软件工程 (SWE) 代理正在重塑编程范式，但它们目前受到"封闭世界"的限制：它们试图从头开始或仅使用本地上下文来修复错误，而忽略了 GitHub 等平台上可用的大量历史人类经验。现实世界问题跟踪数据的非结构化和碎片化性质阻碍了对这种开放世界经验的获取。在本文中，我们介绍了 MemGovern，这是一个旨在治理原始 GitHub 数据并将其转换为代理可操作的经验记忆的框架。MemGovern 采用经验治理将人类经验转换为代理友好的经验卡，并引入代理式经验搜索策略，支持逻辑驱动的人类专业知识检索。通过生成 135K 张经过治理的经验卡，MemGovern 实现了显著的性能提升，将 SWE-bench Verified 上的问题解决率提高了 4.65%。作为一种插件方法，MemGovern 为代理友好的记忆基础设施提供了解决方案。

FlowAct-R1：迈向交互式人形视频生成

标题: FlowAct-R1: Towards Interactive Humanoid Video Generation
作者: Lizhen Wang, Yongming Zhu, Zhipeng Ge, Youwei Zheng, Longhao Zhang, Tianshu Hu, Shiyang Qin, Mingshuang Luo, Jiaxu Zhang, Xin Chen, Yulong Wang, Zerong Zheng, Jianwen Jiang, Chao Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao
日期: 2026-01-15
ArXiv主页 : https://arxiv.org/abs/2601.10103
论文链接 : https://arxiv.org/pdf/2601.10103
项目链接 : https://grisoon.github.io/FlowAct-R1/

英文摘要

Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon a MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.

中文摘要

交互式人形视频生成旨在合成逼真的视觉代理，可以通过连续且响应式的视频与人类互动。尽管视频合成最近取得了进展，但现有方法常常难以在高保真合成和实时交互要求之间进行权衡。在本文中，我们提出了 FlowAct-R1，一个专门为实时交互式人形视频生成而设计的框架。FlowAct-R1 基于 MMDiT 架构构建，可实现任意持续时间的视频流合成，同时保持低延迟响应能力。我们引入了一种分块扩散强迫策略，并辅以一种新颖的自强迫变体，以减轻错误积累并确保连续交互过程中的长期时间一致性。通过利用高效的蒸馏和系统级优化，我们的框架在 480p 分辨率下实现了稳定的 25fps，首帧时间 (TTFF) 仅约 1.5 秒。所提出的方法提供了整体和细粒度的全身控制，使代理能够在交互场景中的不同行为状态之间自然地转换。实验结果表明，FlowAct-R1 实现了卓越的行为生动性和感知真实感，同时在不同的角色风格中保持了强大的泛化能力。

视频生成的运动归因

标题: Motion Attribution for Video Generation
作者: Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine
日期: 2026-01-13
ArXiv主页 : https://arxiv.org/abs/2601.08828
论文链接 : https://arxiv.org/pdf/2601.08828
项目链接 : https://research.nvidia.com/labs/sil/projects/MOTIVE/

英文摘要

Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.

中文摘要

尽管视频生成模型取得了快速进展，但人们对数据在影响运动中的作用却知之甚少。我们提出了 Motive（视频生成的 MOTIon 归因），这是一种以运动为中心、基于梯度的数据归因框架，可扩展到现代、大型、高质量的视频数据集和模型。我们用它来研究哪些微调剪辑可以改善或降低时间动态。Motive 通过运动加权损失掩模将时间动态与静态外观隔离，从而产生高效且可扩展的运动特定影响计算。在文本到视频模型上，Motive 可以识别强烈影响运动的剪辑，并指导数据管理，从而提高时间一致性和物理合理性。利用Motive选择的高影响力数据，我们的方法提高了VBench上的运动平滑度和动态程度，与预训练的基础模型相比，实现了74.1%的人类偏好获胜率。据我们所知，这是第一个在视频生成模型中归因于运动而不是视觉外观并使用它来管理微调数据的框架。

太阳能公开技术报告

标题: Solar Open Technical Report
作者: Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song, Seonghoon Yang, Seungyoun Yi, Sanghoon Yoon, Jeonghyun Ko, Seyoung Song, Keunwoo Choi, Hwalsuk Lee, Sunghun Kim, Du-Seong Chang, Kyunghyun Cho, Junsuk Choe, Hwaran Lee, Jae-Gil Lee, KyungTae Lim, Alice Oh
日期: 2026-01-11
ArXiv主页 : https://arxiv.org/abs/2601.07022
论文链接 : https://arxiv.org/pdf/2601.07022
项目链接 : https://upstage.ai

英文摘要

We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.

中文摘要

我们推出 Solar Open，这是一种面向服务不足语言的 102B 参数双语专家混合 (Mixture-of-Experts) 语言模型。Solar Open 展示了通过解决三个相互关联的挑战来构建有竞争力的大语言模型 (LLM) 的系统方法。首先，为了在服务不足的语言缺乏数据的情况下进行有效训练，我们合成了 4.5T token 的高质量、特定领域和面向 RL 的数据。其次，我们通过渐进式课程来协调这些数据，在 20 万亿 token 的范围内联合优化数据组成、质量阈值和领域覆盖范围。第三，为了通过可扩展的强化学习实现推理能力，我们应用我们提出的框架 SnapPO 进行高效优化。在英语和韩语基准测试中，Solar Open 取得了有竞争力的表现，证明了这种方法对于服务不足语言的人工智能开发的有效性。

VIBE：基于可视化指令的编辑器

标题: VIBE: Visual Instruction Based Editor
作者: Grigorii Alekseenko, Aleksandr Gordeev, Irina Tolstykh, Bulat Suleimanov, Vladimir Dokholyan, Georgii Fedorov, Sergey Yakubson, Aleksandra Tsybina, Mikhail Chernyshov, Maksim Kuprashevich
日期: 2026-01-05
ArXiv主页 : https://arxiv.org/abs/2601.02242
论文链接 : https://arxiv.org/pdf/2601.02242
项目链接 : https://riko0.github.io/VIBE/
gitHub仓库 : https://github.com/ai-forever/vibe

英文摘要

Instruction-based image editing is among the fastest developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as an attribute adjustment, object removal, background edits, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.

中文摘要

基于指令的图像编辑是生成人工智能中发展最快的领域之一。在过去的一年里，该领域已经达到了一个新的水平，数十个开源模型与高性能的商业系统一起发布。然而，目前只有有限数量的开源方法达到了现实世界的质量。此外，扩散主干网（这些管道的主要选择）对于许多部署和研究环境来说通常很大且计算成本昂贵，广泛使用的变体通常包含 6B 到 20B 参数。本文提出了一种紧凑、高吞吐量的基于指令的图像编辑管道，该管道使用现代 2B 参数 Qwen3-VL 模型来指导编辑过程，并使用 1.6B 参数扩散模型 Sana1.5 进行图像生成。我们在架构、数据处理、训练配置和评估方面的设计决策以低成本推理和严格的源一致性为目标，同时在这种规模下可行的主要编辑类别中保持高质量。在 ImgEdit 和 GEdit 基准上进行评估，所提出的方法匹配或超过了更重的基线的性能，包括具有数倍参数和更高推理成本的模型，并且在需要保留输入图像的编辑方面特别强大，例如属性调整、对象删除、背景编辑和有针对性的替换。该模型适合 24 GB GPU 内存，在 BF16 的 NVIDIA H100 上大约 4 秒内生成分辨率高达 2K 的编辑图像，无需额外的推理优化或蒸馏。

用于卓越长 CoT 推理的分布对齐序列蒸馏

标题: Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
作者: Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye
日期: 2026-01-14
ArXiv主页 : https://arxiv.org/abs/2601.09088
论文链接 : https://arxiv.org/pdf/2601.09088
项目链接 : https://github.com/D2I-ai/dasd-thinking
gitHub仓库 : https://github.com/D2I-ai/dasd-thinking

英文摘要

In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.

中文摘要

在本报告中，我们介绍了 DASD-4B-Thinking，这是一种轻量级但功能强大、完全开源的推理模型。它在数学、科学推理和代码生成方面具有挑战性的基准上，在规模相当的开源模型中实现了 SOTA 性能，甚至超越了几个更大的模型。我们首先批判性地重新审视社区中广泛采用的蒸馏范式：针对教师生成的响应的 SFT，也称为序列级蒸馏。尽管最近一系列遵循该方案的工作已经表现出显着的效率和强大的实证表现，但它们主要基于 SFT 的视角。因此，这些方法主要侧重于设计 SFT 数据过滤的启发式规则，而在很大程度上忽略了蒸馏本身的核心原则------使学生模型能够学习教师的完整输出分布，从而继承其泛化能力。具体来说，我们确定了当前实践中的三个关键局限性：i）教师序列级别分布的代表性不足；ii) 教师的产出分布与学生的学习能力之间的错位；iii) 教师强制训练与自回归推理产生的暴露偏差。综上所述，这些缺陷反映出整个蒸馏过程中系统性地缺乏明确的师生互动，使得蒸馏的本质没有得到充分利用。为了解决这些问题，我们提出了几种方法创新，这些创新共同形成了增强的序列级蒸馏训练管道。值得注意的是，DASD-4B-Thinking 仅使用 448K 训练样本就获得了有竞争力的结果，比大多数现有开源项目所使用的样本要少一个数量级。为了支持社区研究，我们公开发布我们的模型和训练数据集。

部长3

标题: Ministral 3
作者: Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andy Ehrenberg, Andy Lo, Anton Eliseev, Antonia Calvi, Avinash Sooriyarachchi, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Clémence Lanfranchi, Corentin Barreau, Cyprien Courtot, Daniele Grattarola, Darius Dabert, Diego de las Casas, Elliot Chane-Sane, Faruk Ahmed, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Georgii Novikov, Guillaume Kunsch, Guillaume Lample, Guillaume Martin, Gunshi Gupta, Jan Ludziejewski, Jason Rute, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Karmesh Yadav, Khyathi Chandu, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Margaret Jennings, Marie Pellat, Mark Prins, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mia Chiquier, Michel Schimpf, Nathan Grinsztajn, Neha Gupta, Nikhil Raghuraman, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Patrick von Platen, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Quentin Torroba, Romain Sauvestre, Roman Soletskyi, Rupert Menneer, Sagar Vaze, Samuel Barry, Sanchit Gandhi, Siddhant Waghjale, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thiziri Nait Saada, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Bewley, Tom Edwards, Umar Jamil, Umberto Tomasini, Valeriia Nemychnikova, Van Phung, Vincent Maladière, Virgile Richard, Wassim Bouaziz, Wen-Ding Li, William Marshall, Xinghui Li, Xinyu Yang, Yassine El Ouahidi, Yihan Wang, Yunhao Tang, Zaccharie Ramzi
日期: 2026-01-13
ArXiv主页 : https://arxiv.org/abs/2601.08584
论文链接 : https://arxiv.org/pdf/2601.08584
项目链接 : https://mistral.ai/news/mistral-3

英文摘要

We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute and memory constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe to derive the Ministral 3 models through Cascade Distillation, an iterative pruning and continued training with distillation technique. Each model comes with image understanding capabilities, all under the Apache 2.0 license.

中文摘要

我们推出了 Ministral 3 系列，这是一系列参数高效的密集语言模型，专为计算和内存受限的应用程序而设计，提供三种模型大小：3B、8B 和 14B 参数。对于每种模型大小，我们发布了三种变体：用于通用用途的预训练基础模型、指令微调模型以及用于解决复杂问题的推理模型。此外，我们还介绍了通过级联蒸馏（Cascade Distillation）得到 Ministral 3 模型的方法，这是一种迭代剪枝并结合蒸馏进行持续训练的技术。每个模型都具有图像理解功能，均以 Apache 2.0 许可发布。

思维的分子结构：绘制长链思维推理的拓扑

标题: The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
作者: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang
日期: 2026-01-09
ArXiv主页 : https://arxiv.org/abs/2601.06002
论文链接 : https://arxiv.org/pdf/2601.06002

英文摘要

Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from human or non-Long-CoT LLMs imitation. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.

中文摘要

大型语言模型 (LLM) 通常无法从人类或非长 CoT LLM 的模仿中学习有效的长思维链 (Long CoT) 推理。为了理解这一点，我们提出有效且可学习的长 CoT 轨迹在统一视图中具有稳定的类分子结构，这些结构由三种相互作用类型形成：深度推理（类共价键）、自我反思（类氢键）和自我探索（类范德华力）。对蒸馏轨迹的分析表明，这些结构来自 Long CoT 微调，而不是关键词模仿。我们引入了有效语义异构体，并表明只有促进快速熵收敛的键才能支持稳定的长 CoT 学习，而结构竞争会损害训练。根据这些发现，我们提出了 Mole-Syn，这是一种分布转移图方法，可指导合成有效的长 CoT 结构，从而提高跨基准的性能和 RL 稳定性。

KnowMe-Bench：对终生数字伴侣的人的理解进行基准测试

标题: KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
作者: Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, Ronghao Chen
日期: 2026-01-08
ArXiv主页 : https://arxiv.org/abs/2601.04745
论文链接 : https://arxiv.org/pdf/2601.04745
gitHub仓库 : https://github.com/QuantaAlpha/KnowMeBench

英文摘要

Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at https://github.com/QuantaAlpha/KnowMeBench.

中文摘要

现有的长时程记忆基准大多使用多轮对话或合成的用户历史，这使得检索性能成为衡量对人的理解的不完美代理指标。我们提出了 KnowMe-Bench，这是一个可公开发布的基准，基于长篇自传体叙事构建，其中的行为、情境和内心想法为推断稳定的动机和决策原则提供了密集的证据。KnowMe-Bench 将每个叙事重构为感知闪回、以时间为锚的事件流，并使用与证据关联的问题来评估模型，这些问题涵盖事实回忆、主观状态归因和原则层面的推理。在不同的叙述来源中，检索增强系统主要提高事实准确性，而基于时间的解释和更高层次的推理仍然存在错误，这凸显了对检索之外的记忆机制的需求。我们的数据位于 https://github.com/QuantaAlpha/KnowMeBench。

Qwen3-VL-Embedding 和 Qwen3-VL-Reranker：最先进的多模态检索和排名的统一框架

标题: Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
作者: Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin
日期: 2026-01-08
ArXiv主页 : https://arxiv.org/abs/2601.04720
论文链接 : https://arxiv.org/pdf/2601.04720
项目链接 : https://qwen.ai/blog?id=qwen3-vl-embedding
gitHub仓库 : https://github.com/QwenLM/Qwen3-VL-Embedding

英文摘要

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
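The abstract mentions Matryoshka Representation Learning, which is what makes flexible embedding dimensions possible: the leading coordinates of a vector are trained to remain useful on their own. A minimal sketch of how such embeddings are typically consumed follows. It assumes plain Python lists with made-up toy values; the function names are hypothetical and nothing here reflects the actual Qwen3-VL-Embedding API or its real dimensions.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale them to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query: Sequence[float],
                   docs: Sequence[Sequence[float]],
                   dim: int) -> List[int]:
    """Return document indices sorted by similarity, using only `dim` coordinates."""
    q = truncate_and_normalize(query, dim)
    scores = [cosine(q, truncate_and_normalize(d, dim)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy vectors, invented for illustration only.
    query = [0.9, 0.1, 0.3, 0.2]
    docs = [[0.8, 0.2, 0.1, 0.4], [0.1, 0.9, 0.2, 0.1], [0.7, 0.0, 0.6, 0.1]]
    print(rank_documents(query, docs, dim=4))  # full-size embeddings
    print(rank_documents(query, docs, dim=2))  # cheaper, truncated embeddings
```

Re-normalizing after truncation matters: the truncated prefix is no longer unit length, so skipping that step would turn the dot product into something other than cosine similarity.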

中文摘要

在本报告中，我们介绍了 Qwen3-VL-Embedding 和 Qwen3-VL-Reranker 模型系列，它们是基于 Qwen3-VL 基础模型构建的 Qwen 系列的最新扩展。它们共同通过将不同模式（包括文本、图像、文档图像和视频）映射到统一的表示空间中，为高精度多模式搜索提供端到端管道。Qwen3-VL-Embedding模型采用多阶段训练范例，从大规模对比预训练到重新排序模型蒸馏，以生成语义丰富的高维向量。它支持 Matryoshka 表示学习，支持灵活的嵌入维度，并可处理多达 32k 个令牌的输入。作为补充，Qwen3-VL-Reranker 使用具有交叉注意机制的交叉编码器架构对查询文档对执行细粒度的相关性估计。两个模型系列都继承了Qwen3-VL的多语言能力，支持30多种语言，并以2B和8B参数大小发布，以适应多样化的部署需求。实证评估表明，Qwen3-VL-Embedding 系列在不同的多模态嵌入评估基准上取得了最先进的结果。具体来说，Qwen3-VL-Embedding-8B在MMEB-V2上的总分达到77.8，在所有模型中排名第一（截至2025年1月8日）。本报告介绍了该系列的架构、训练方法和实践能力，展示了其在图像文本检索、视觉问答和视频文本匹配等各种多模态检索任务上的有效性。

Fast-ThinkAct：通过可言语的潜在规划进行高效的视觉-语言-动作推理

标题: Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
作者: Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
日期: 2026-01-14
ArXiv主页 : https://arxiv.org/abs/2601.09708
论文链接 : https://arxiv.org/pdf/2601.09708
项目链接 : https://jasper0314-huang.github.io/fast-thinkact/

英文摘要

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.

中文摘要

视觉-语言-动作（VLA）任务需要对复杂的视觉场景进行推理并在动态环境中执行自适应动作。虽然最近关于推理 VLA 的研究表明，显式思维链 (CoT) 可以提高泛化能力，但由于推理轨迹过长，推理延迟较高。我们提出了 Fast-ThinkAct，这是一种高效的推理框架，可以通过可言语化的潜在推理实现紧凑而高性能的规划。Fast-ThinkAct 通过从教师模型蒸馏来学习利用潜在 CoT 进行高效推理，其驱动力是一个偏好引导的目标，用于对齐操作轨迹，从而为具身控制迁移语言和视觉规划能力。这使得推理增强型策略学习能够有效地将紧凑推理与动作执行联系起来。跨各种具身操作和推理基准的大量实验表明，Fast-ThinkAct 实现了强大的性能，与最先进的推理 VLA 相比，推理延迟最多降低 89.3%，同时保持有效的长时程规划、小样本适应和故障恢复。

ArenaRL：通过基于锦标赛的相对排名扩展开放式智能体的强化学习

标题: ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking
作者: Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha
日期: 2026-01-10
ArXiv主页 : https://arxiv.org/abs/2601.06487
论文链接 : https://arxiv.org/pdf/2601.06487
项目链接 : https://tongyi-agent.github.io/blog/arenarl/
gitHub仓库 : https://github.com/Alibaba-NLP/qqr

英文摘要

Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the built seeded single-elimination scheme achieves nearly equivalent advantage estimation accuracy to full pairwise comparisons with O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
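The complexity claim above can be made concrete with a small sketch: a single-elimination bracket over N trajectories needs N - 1 pairwise comparisons, whereas comparing every pair needs N(N - 1)/2. Everything below is an assumption made for illustration, not the paper's implementation: the seeding order, the `beats` judge, the top-versus-bottom pairing, and the use of "rounds won" standardized within the group as an advantage signal are all hypothetical choices.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple


def seeded_single_elimination(
    seeded: Sequence[int],
    beats: Callable[[int, int], bool],
) -> Tuple[Dict[int, int], int]:
    """Run a bracket over trajectory ids listed best seed first.

    Returns (rounds won per id, number of pairwise comparisons used).
    """
    alive: List[int] = list(seeded)
    rounds_won: Dict[int, int] = {i: 0 for i in seeded}
    comparisons = 0
    while len(alive) > 1:
        next_round: List[int] = []
        field = alive
        if len(alive) % 2 == 1:
            # Odd field: the current top seed advances without playing.
            next_round.append(alive[0])
            field = alive[1:]
        for k in range(len(field) // 2):
            a, b = field[k], field[-1 - k]  # strongest remaining vs weakest remaining
            comparisons += 1
            winner = a if beats(a, b) else b
            rounds_won[winner] += 1
            next_round.append(winner)
        alive = next_round
    return rounds_won, comparisons


def group_advantages(rounds_won: Dict[int, int]) -> Dict[int, float]:
    """Standardize rounds won within the group (zero mean, unit variance)."""
    values = list(rounds_won.values())
    if not values:
        return {}
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return {i: 0.0 for i in rounds_won}
    return {i: (r - mean) / std for i, r in rounds_won.items()}


if __name__ == "__main__":
    # Hidden "true quality" stands in for a pairwise judge; values are invented.
    quality = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8, 0.2, 0.6]
    wins, used = seeded_single_elimination(
        list(range(len(quality))), lambda a, b: quality[a] > quality[b]
    )
    n = len(quality)
    print("comparisons used:", used, "vs full pairwise:", n * (n - 1) // 2)
    print(group_advantages(wins))
```

Each comparison eliminates exactly one trajectory, which is why the count is always N - 1 regardless of seeding. The trade-off is that trajectories knocked out in the same round are tied, so the ranking is coarser than what full pairwise comparison would give.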

中文摘要

强化学习极大地提高了 LLM 智能体在具有可验证结果的任务上的表现，但在具有巨大解决方案空间（例如复杂的旅行计划）的开放式智能体任务上仍然举步维艰。由于这些任务缺乏客观的真值 (ground truth)，当前的强化学习算法在很大程度上依赖于奖励模型，该模型为单个响应分配标量分数。我们认为，这种逐点评分存在固有的区分度崩溃：奖励模型难以区分不同轨迹之间的微妙优势，导致组内的分数被压缩到一个狭窄的范围内。因此，有效的奖励信号变得由奖励模型的噪声主导，导致优化停滞。为了解决这个问题，我们提出了 ArenaRL，这是一种从逐点标量评分转变为组内相对排名的强化学习范式。ArenaRL 引入了一种过程感知的成对评估机制，采用多级评分标准为轨迹分配细粒度的相对分数。此外，我们构建了一个组内对抗竞技场并设计了基于锦标赛的排名方案以获得稳定的优势信号。实验结果证实，所构建的种子单淘汰方案实现了与 O(N^2) 复杂度的完全成对比较几乎相同的优势估计精度，同时仅以 O(N) 复杂度运行，在效率和精度之间取得了最佳平衡。此外，为了解决开放式智能体缺乏全周期基准的问题，我们构建了 Open-Travel 和 Open-DeepResearch 这两个高质量的基准，具有涵盖 SFT、RL 训练和多维评估的全面管道。大量实验表明，ArenaRL 的性能大大优于标准 RL 基线，使 LLM 智能体能够为复杂的现实世界任务生成更稳健的解决方案。

CaricatureGS：用高斯曲率夸大 3D 高斯飞溅面

标题: CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature
作者: Eldad Matmon, Amit Bracha, Noam Rotstein, Ron Kimmel
日期: 2026-01-06
ArXiv主页 : https://arxiv.org/abs/2601.03319
论文链接 : https://arxiv.org/pdf/2601.03319
项目链接 : https://c4ricaturegs.github.io/

英文摘要

A photorealistic and controllable 3D caricaturization framework for faces is introduced. We start with an intrinsic Gaussian curvature-based surface exaggeration technique, which, when coupled with texture, tends to produce over-smoothed renders. To address this, we resort to 3D Gaussian Splatting (3DGS), which has recently been shown to produce realistic free-viewpoint avatars. Given a multiview sequence, we extract a FLAME mesh, solve a curvature-weighted Poisson equation, and obtain its exaggerated form. However, directly deforming the Gaussians yields poor results, necessitating the synthesis of pseudo-ground-truth caricature images by warping each frame to its exaggerated 2D representation using local affine transformations. We then devise a training scheme that alternates real and synthesized supervision, enabling a single Gaussian collection to represent both natural and exaggerated avatars. This scheme improves fidelity, supports local edits, and allows continuous control over the intensity of the caricature. In order to achieve real-time deformations, an efficient interpolation between the original and exaggerated surfaces is introduced. We further analyze and show that it has a bounded deviation from closed-form solutions. In both quantitative and qualitative evaluations, our results outperform prior work, delivering photorealistic, geometry-controlled caricature avatars.

中文摘要

介绍了一种逼真且可控的 3D 人脸漫画化框架。我们从基于内蕴高斯曲率的表面夸张技术开始，该技术与纹理结合时，往往会产生过度平滑的渲染。为了解决这个问题，我们采用了 3D 高斯泼溅 (3DGS)，它最近被证明可以产生逼真的自由视点头像。给定多视图序列，我们提取 FLAME 网格，求解曲率加权泊松方程，并获得其夸张形式。然而，直接使高斯变形会产生较差的结果，因此需要通过局部仿射变换将每个帧扭曲为其夸张的 2D 表示，来合成伪真值漫画图像。然后，我们设计了一种交替使用真实监督和合成监督的训练方案，使单个高斯集合能够同时表示自然和夸张的头像。该方案提高了保真度，支持局部编辑，并允许连续控制漫画的强度。为了实现实时变形，我们在原始表面和夸张表面之间引入了高效的插值，并进一步分析表明它与闭式解的偏差是有界的。在定量和定性评估中，我们的结果都优于之前的工作，提供了逼真的、几何可控的漫画头像。

大规模使用工具的面向用户的多轮对话生成

标题: User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale
作者: Jungho Cho, Minbyul Jeong, Sungrae Park
日期: 2026-01-13
ArXiv主页 : https://arxiv.org/abs/2601.08225
论文链接 : https://arxiv.org/pdf/2601.08225

英文摘要

The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in "solely task-solving" trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules - such as incremental request-making and turn-by-turn feedback - we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.

中文摘要

最近向大型推理模型（LRM）作为自主代理的范式转变加剧了对复杂的多轮工具使用能力的需求。然而，现有的数据集和数据生成方法受到静态、预定义工具集的限制，这些工具集无法扩展到开放式人类代理协作的复杂性。为了解决这个问题，我们最初开发了一个大规模自动生成面向任务的多轮对话的框架，利用基于 LRM 的模拟器动态生成高价值、特定领域的工具来解决指定的任务。然而，我们观察到，纯粹面向任务的设计通常会导致"仅解决任务"的轨迹，其中代理以最少的交互完成目标，无法生成现实场景中看到的高轮数对话。为了弥合这一差距，我们转向面向用户的模拟范式。通过将任务生成与模仿人类行为规则的专用用户模拟器（例如增量请求和逐轮反馈）解耦，我们可以促进更真实、更扩展的多回合对话，从而反映现实世界问题解决的迭代本质。我们的生成管道作为多功能、即插即用模块运行，能够从任何状态启动生成，确保生成扩展工具使用数据的高可扩展性。此外，通过促进单个轨迹内的多个任务完成，它产生了一个高密度数据集，反映了现实世界中人类与智能体交互的多方面需求。

MHLA：通过令牌级多头恢复线性注意力的表现力

标题: MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
作者: Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou
日期: 2026-01-12
ArXiv主页 : https://arxiv.org/abs/2601.07832
论文链接 : https://arxiv.org/pdf/2601.07832
项目链接 : https://dagroup-pku.github.io/MHLA/
gitHub仓库 : https://github.com/DAGroup-PKU/MHLA

英文摘要

While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.
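To illustrate the background of this abstract, the sketch below implements generic kernelized linear attention in pure Python and checks it against the equivalent quadratic form. It shows why every query ends up reading from one shared global summary, which is the setting in which the abstract's "global context collapse" arises. The feature map (ELU + 1) is a common choice in the linear-attention literature and is an assumption here. The `blockwise_linear_attention` function is a hypothetical illustration of keeping separate summaries per block of tokens; it is not the MHLA algorithm, whose exact formulation the abstract does not give.

```python
import math
from typing import List

Matrix = List[List[float]]


def phi(x: List[float]) -> List[float]:
    """Positive feature map elu(x) + 1 (assumed here, not taken from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """O(n) attention: build one global summary, then let every query read it."""
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # S = sum_j phi(k_j) v_j^T
    z = [0.0] * d                       # z = sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out: Matrix = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom for b in range(dv)])
    return out


def quadratic_reference(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Same result computed the O(n^2) way, with explicit pairwise weights."""
    out: Matrix = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out


def blockwise_linear_attention(Q: Matrix, K: Matrix, V: Matrix, num_blocks: int) -> Matrix:
    """Hypothetical variant: each contiguous token block keeps its own summary."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    n = len(Q)
    size = math.ceil(n / num_blocks)
    out: Matrix = []
    for start in range(0, n, size):
        sl = slice(start, start + size)
        out.extend(linear_attention(Q[sl], K[sl], V[sl]))
    return out


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    Q = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.4], [0.0, 0.0]]
    K = [[0.1, 0.7], [-0.5, 0.3], [0.9, -0.2], [0.4, 0.4]]
    V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
    print(linear_attention(Q, K, V))
    print(blockwise_linear_attention(Q, K, V, num_blocks=2))
```

The linear form costs time proportional to sequence length because `S` and `z` are built once and reused by every query. The price is that all queries see the same summary, so differences between outputs come only from how each query weights that single matrix.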

中文摘要

虽然 Transformer 架构在许多领域占据主导地位，但其二次自注意力复杂性阻碍了其在大规模应用中的使用。线性注意力提供了一种有效的替代方案，但其直接应用通常会降低性能，现有的修复通常会通过额外的模块（例如深度可分离卷积）重新引入计算开销，从而违背了最初的目的。在这项工作中，我们确定了这些方法中的一个关键失败模式：全局上下文崩溃，其中模型失去了表征多样性。为了解决这个问题，我们提出了多头线性注意力（MHLA），它通过沿着令牌维度计算分割头内的注意力来保留这种多样性。我们证明 MHLA 保持了线性复杂度，同时恢复了 Softmax Attention 的大部分表达能力，并验证了其在多个领域的有效性，在相同时间复杂度下，在 ImageNet 分类上实现了 3.6% 的改进，在 NLP 上实现了 6.3% 的增益，在图像生成上实现了 12.6% 的改进，在视频生成上实现了 41% 的增强。
