中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- 组序列策略优化
- GUI-G^2:用于GUI定位的高斯奖励建模
- MiroMind-M1:通过上下文感知的多阶段策略优化推进数学推理的开源进展
- NABLA:邻域自适应块级注意力
- 超越上下文限制:面向长程推理的潜意识线程
- Yume:交互式世界生成模型
- 无形的缰绳:为什么RLVR可能无法摆脱其起源
- Step-Audio 2技术报告
- 像素、模式,但没有诗歌:像人类一样看世界
- MegaScience:推动面向科学推理的后训练数据集的前沿
- 面具背后的魔鬼:扩散LLM中涌现的安全漏洞
- NoHumansRequired:自主式高质量图像编辑三元组挖掘
- WebShaper:通过信息寻求形式化进行智能体式数据合成
- 一个以数据为中心的框架,用于解决俄语语音生成模型中的发音和韵律挑战
- DesignLab:通过迭代检测与修正来设计幻灯片
- GR-3技术报告
- MUR:面向大型语言模型的动量不确定性引导推理
- Captain Cinema:迈向短电影生成
- ThinkAct:通过强化视觉潜在规划进行视觉-语言-动作推理
- SeC:通过渐进式概念构建推进复杂视频对象分割
- 在3D高斯泼溅中基于正则化分数蒸馏采样的鲁棒3D掩码部件级编辑
- 一个领域能帮助其他领域吗?基于强化学习的以数据为中心的多领域推理研究
- 只上采样重要区域:用于加速扩散Transformer的区域自适应潜在采样
- Zebra-CoT:一个用于交错视觉-语言推理的数据集
- LAPO:通过长度自适应策略优化内化推理效率
- Franca:嵌套的Matryoshka聚类,用于可扩展的视觉表示学习
- Ultra3D:基于部件注意力的高效高保真3D生成
- head-h0:大规模人类视频预测视觉性能
组序列策略优化
- 标题: Group Sequence Policy Optimization
- 作者: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
- 日期: 2025-07-24
- ArXiv主页 : https://arxiv.org/abs/2507.18071
- 论文链接 : https://arxiv.org/pdf/2507.18071
英文摘要
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
中文摘要
本文介绍了组序列策略优化(GSPO),这是我们用于训练大语言模型的稳定、高效且性能优异的强化学习算法。与以往采用词元级重要性比率的算法不同,GSPO基于序列似然定义重要性比率,并在序列级别执行裁剪、奖励和优化。我们证明,与GRPO算法相比,GSPO取得了更高的训练效率和性能,显著稳定了专家混合(MoE)模型的强化学习训练,并有望简化强化学习基础设施的设计。GSPO的这些优点为最新Qwen3模型的显著提升做出了贡献。
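下面给出一个依据摘要描述的最小示意代码(非官方实现),用以说明“基于序列似然的重要性比率 + 序列级裁剪”这一核心思想;其中的函数名、张量形状以及按长度取几何平均的数值稳定做法均为示例假设。

```python
import torch

def gspo_surrogate_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """GSPO 风格的序列级代理损失(示意实现,非官方代码)。

    logp_new, logp_old: [B, T] 每个词元在新/旧策略下的对数概率
    advantages:         [B]    每条序列(组内归一化后)的优势
    mask:               [B, T] 有效词元掩码(0/1)
    """
    lengths = mask.sum(dim=-1).clamp(min=1)
    # 序列级重要性比率:基于整条序列的似然(这里按长度取几何平均以稳定数值)
    log_ratio_seq = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio_seq = log_ratio_seq.exp()
    # 序列级裁剪:同一序列内的所有词元共享同一个比率
    unclipped = ratio_seq * advantages
    clipped = torch.clamp(ratio_seq, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

与GRPO式的词元级比率相比,上述写法中裁剪与加权都发生在整条序列上,这正是摘要所强调的区别。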
GUI-G^2:用于GUI定位的高斯奖励建模
- 标题: GUI-G^2: Gaussian Reward Modeling for GUI Grounding
- 作者: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
- 日期: 2025-07-21
- ArXiv主页 : https://arxiv.org/abs/2507.15846
- 论文链接 : https://arxiv.org/pdf/2507.15846
- 项目链接 : https://zju-real.github.io/GUI-G2
- gitHub仓库 : https://github.com/zju-real/GUI-G2
英文摘要
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G^2), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G^2 incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G^2, substantially outperforms state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
中文摘要
图形用户界面(GUI)定位(grounding)把自然语言指令映射到界面上的精确位置,以实现自主交互。当前的强化学习方法使用二元奖励,把界面元素当作“命中或未命中”的目标,产生稀疏信号,忽略了空间交互的连续性。受人类点击行为天然形成以目标元素为中心的高斯分布这一现象的启发,我们提出GUI高斯定位奖励(GUI-G^2),这是一个有原则的奖励框架,将GUI元素建模为界面平面上的连续高斯分布。GUI-G^2包含两种协同机制:高斯点奖励通过以元素质心为中心、按指数衰减的分布来建模精确定位;覆盖奖励则通过度量预测高斯分布与目标区域的重叠来评估空间对齐。为处理不同尺度的元素,我们设计了自适应方差机制,根据元素尺寸校准奖励分布。该框架把GUI定位从稀疏的二元分类转变为稠密的连续优化,高斯分布产生丰富的梯度信号,引导模型走向最优交互位置。在ScreenSpot、ScreenSpot-v2和ScreenSpot-Pro基准上的大量实验表明,GUI-G^2显著优于最先进方法UI-TARS-72B,在ScreenSpot-Pro上最高提升24.7%。我们的分析表明,连续建模对界面变化具有更强的鲁棒性,并对未见过的布局具有更好的泛化能力,为GUI交互任务中的空间推理建立了新范式。
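下面按摘要思路给出两种奖励(高斯点奖励与覆盖奖励)的一个最小数值示意(并非论文中的精确公式):方差按元素宽高自适应设置,覆盖奖励用数值积分近似预测高斯分布与目标框的重叠;函数名与参数均为示例假设。

```python
import numpy as np

def gaussian_point_reward(pred, center, size, alpha=0.5):
    """点奖励:预测点越接近元素中心,奖励越接近 1(指数衰减);方差随元素尺寸自适应。"""
    sigma = (alpha * size[0], alpha * size[1])
    dx, dy = pred[0] - center[0], pred[1] - center[1]
    return float(np.exp(-0.5 * ((dx / sigma[0]) ** 2 + (dy / sigma[1]) ** 2)))

def coverage_reward(pred, center, size, alpha=0.5, grid=64):
    """覆盖奖励:用数值积分近似“预测高斯分布落在目标框内的概率质量”。"""
    sigma = (alpha * size[0], alpha * size[1])
    xs = np.linspace(center[0] - size[0] / 2, center[0] + size[0] / 2, grid)
    ys = np.linspace(center[1] - size[1] / 2, center[1] + size[1] / 2, grid)
    X, Y = np.meshgrid(xs, ys)
    density = (np.exp(-0.5 * (((X - pred[0]) / sigma[0]) ** 2 +
                              ((Y - pred[1]) / sigma[1]) ** 2))
               / (2 * np.pi * sigma[0] * sigma[1]))
    cell = (xs[1] - xs[0]) * (ys[1] - ys[0])
    return float(density.sum() * cell)

# 用法示例:预测点离元素中心越近,两个奖励都越大
print(gaussian_point_reward((105, 52), center=(100, 50), size=(80, 30)))
print(coverage_reward((105, 52), center=(100, 50), size=(80, 30)))
```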
MiroMind-M1:通过上下文感知的多阶段策略优化推进数学推理的开源进展
- 标题: MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
- 作者: Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing
- 日期: 2025-07-19
- ArXiv主页 : https://arxiv.org/abs/2507.14683
- gitHub仓库 : https://github.com/MiroMindAsia/MiroMind-M1
英文摘要
Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.
中文摘要
大型语言模型最近已从流畅的文本生成发展为跨多个领域的高级推理,催生了推理语言模型(RLM)。在这些领域中,数学推理是一个代表性基准,因为它需要精确的多步逻辑与抽象推理,并且可以推广到其他任务。虽然GPT-o3等闭源RLM展示了令人印象深刻的推理能力,但其专有性质限制了透明度和可复现性。尽管许多开源项目试图缩小这一差距,但大多数项目因省略数据集、详细训练配置等关键资源而缺乏足够的开放性,阻碍了可复现性。为了提升RLM开发的透明度,我们推出MiroMind-M1系列,这是一组基于Qwen-2.5骨干、完全开源的RLM,其性能达到或超过现有开源RLM。具体而言,我们的模型分两个阶段训练:先在精心整理的、带有经过验证的CoT轨迹的71.9万道数学推理问题语料上进行SFT,再在6.2万道具有挑战性且可验证的问题上进行RLVR。为了提升RLVR过程的鲁棒性与效率,我们提出了上下文感知的多阶段策略优化算法,将长度渐进式训练与自适应重复惩罚相结合,以鼓励上下文感知的强化学习训练。我们的模型在AIME24、AIME25和MATH基准上,在基于Qwen-2.5的开源7B和32B模型中取得了最先进或具有竞争力的性能以及更高的词元效率。为了便于复现,我们发布了完整的技术栈:模型(MiroMind-M1-SFT-7B、MiroMind-M1-RL-7B、MiroMind-M1-RL-32B)、数据集(MiroMind-M1-SFT-719K、MiroMind-M1-RL-62K)以及全部训练和评估配置。我们希望这些资源能够支持进一步的研究并促进社区发展。
NABLA:邻域自适应块级注意力
- 标题: nablaNABLA: Neighborhood Adaptive Block-Level Attention
- 作者: Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov
- 日期: 2025-07-17
- ArXiv主页 : https://arxiv.org/abs/2507.13546
- gitHub仓库 : https://github.com/gen-ai-team/Wan2.1-NABLA
英文摘要
Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to baseline almost without compromising quantitative metrics (CLIP score, VBench score, human evaluation score) and visual quality drop. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
中文摘要
基于Transformer的架构最近在视频生成任务中取得了显著成功。然而,全注意力机制的二次复杂度仍是关键瓶颈,对高分辨率和长时视频序列尤其如此。本文提出NABLA,一种新颖的邻域自适应块级注意力机制,能够动态适应视频扩散Transformer(DiT)中的稀疏模式。通过将块级注意力与自适应的稀疏性驱动阈值相结合,NABLA在保持生成质量的同时降低了计算开销。我们的方法不需要定制的底层算子设计,并且可以与PyTorch的Flex Attention算子无缝集成。实验表明,与基线相比,NABLA的训练和推理速度最高可提升2.7倍,同时几乎不损失定量指标(CLIP分数、VBench分数、人工评估分数)和视觉质量。代码和模型权重见:https://github.com/gen-ai-team/Wan2.1-NABLA
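下面是按摘要思路写的一个简化示意(非官方实现):先把查询/键做块内池化得到块级注意力图,按累计概率质量的自适应阈值挑选重要块,再只在被选中的块上计算注意力;块大小、阈值策略与掩码式稠密实现均为演示用的示例假设,实际可对接 PyTorch 的 Flex Attention。

```python
import torch

def nabla_like_attention(q, k, v, block=64, keep_mass=0.9):
    """块级稀疏注意力的简化示意(非官方实现)。q, k, v: [B, H, T, D],T 需为 block 的整数倍。"""
    B, H, T, D = q.shape
    nb = T // block
    q_b = q.view(B, H, nb, block, D).mean(dim=3)   # 块内平均池化 -> [B, H, nb, D]
    k_b = k.view(B, H, nb, block, D).mean(dim=3)
    block_scores = torch.softmax(q_b @ k_b.transpose(-1, -2) / D ** 0.5, dim=-1)  # [B,H,nb,nb]

    # 自适应阈值:按分数从大到小累加,保留覆盖 keep_mass 概率质量的块(至少保留 top-1)
    sorted_scores, order = block_scores.sort(dim=-1, descending=True)
    cum = sorted_scores.cumsum(dim=-1)
    keep_sorted = cum - sorted_scores < keep_mass
    keep = torch.zeros_like(keep_sorted).scatter(-1, order, keep_sorted)

    # 把块级掩码展开为词元级掩码,再做掩码注意力(演示用的稠密实现)
    mask = keep.repeat_interleave(block, dim=-2).repeat_interleave(block, dim=-1)
    attn = (q @ k.transpose(-1, -2)) / D ** 0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    return attn @ v

# 用法:out = nabla_like_attention(torch.randn(1, 2, 256, 32), torch.randn(1, 2, 256, 32), torch.randn(1, 2, 256, 32))
```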
超越上下文限制:面向长程推理的潜意识线程
- 标题: Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
- 作者: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
- 日期: 2025-07-22
- ArXiv主页 : https://arxiv.org/abs/2507.16784
- 论文链接 : https://arxiv.org/pdf/2507.16784
- 项目链接 : https://www.subconscious.dev/
- gitHub仓库 : https://github.com/subconscious-systems/TIMRUN
英文摘要
To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al, 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
中文摘要
为了突破制约推理准确率与效率的大型语言模型(LLM)上下文长度限制,我们提出线程推理模型(TIM),这是一族为递归式、分解式问题求解而训练的LLM,以及TIMRUN,一个支持超出上下文限制的长程结构化推理的推理运行时。运行在TIMRUN上的TIM可以在单次语言模型推理中支持几乎无限的工作记忆和多跳工具调用,克服输出长度限制、位置编码约束和GPU显存瓶颈。其性能来自于把自然语言建模为以长度和深度来度量的推理树,而非线性序列。推理树由带有思考、递归子任务和结论的任务组成,基于我们在Schroeder等人(2025)中提出的概念。在生成过程中,我们维护一个工作记忆,仅保留最相关上下文词元的键值状态,这些词元由基于规则的子任务剪枝机制选出,从而在整个推理过程中复用位置编码和GPU内存页。实验结果表明,即使在GPU显存中操纵多达90%的KV缓存,我们的系统仍能保持较高的推理吞吐量。它还能在数学任务上给出准确的推理,并处理需要长程推理和多跳工具使用的信息检索挑战。
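下面用一个极简示意(非官方实现)说明“基于规则的子任务剪枝”的思路:子任务一旦得出结论,工作记忆中只保留其结论、丢弃其内部推理内容;数据结构与剪枝规则均为示例假设。

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    thought: str
    children: list = field(default_factory=list)
    conclusion: str | None = None

def prune_working_memory(task: Subtask) -> list[str]:
    """返回仍需保留在工作记忆中的文本片段:
    已完成(有结论)的子任务只保留结论,其内部思考与子树被剪掉。"""
    if task.conclusion is not None:
        return [task.conclusion]
    kept = [task.thought]
    for child in task.children:
        kept.extend(prune_working_memory(child))
    return kept

# 用法示例:根任务未完成,第一个子任务已完成,只留其结论
root = Subtask("求解原问题", [
    Subtask("先化简表达式", conclusion="化简结果为 x = 3"),
    Subtask("再验证边界条件"),
])
print(prune_working_memory(root))  # ['求解原问题', '化简结果为 x = 3', '再验证边界条件']
```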
Yume:交互式世界生成模型
- 标题: Yume: An Interactive World Generation Model
- 作者: Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang
- 日期: 2025-07-23
- ArXiv主页 : https://arxiv.org/abs/2507.17744
- 论文链接 : https://arxiv.org/pdf/2507.17744
- 项目链接 : https://stdstu12.github.io/YUME-Project/
- gitHub仓库 : https://github.com/Lixsp11/sekai-codebase
英文摘要
Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of \method, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer~(MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset \sekai to train \method, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
中文摘要
Yume旨在利用图像、文本或视频创建一个可交互、逼真且动态的世界,并允许通过外围设备或神经信号进行探索和控制。在本报告中,我们给出Yume的一个预览版本,它能从一张输入图像创建动态世界,并允许用键盘操作来探索这个世界。为了实现这种高保真、可交互的视频世界生成,我们引入了一个精心设计的框架,由四个主要部分组成:相机运动量化、视频生成架构、高级采样器和模型加速。首先,我们对相机运动进行量化,以便用键盘输入实现稳定训练和用户友好的交互。然后,我们引入带记忆模块的掩码视频扩散Transformer(MVDT),以自回归方式进行无限长视频生成。之后,我们为采样器引入免训练的抗伪影机制(AAM)和基于随机微分方程的时间旅行采样(TTS-SDE),以获得更好的视觉质量和更精确的控制。此外,我们通过对抗蒸馏与缓存机制的协同优化来研究模型加速。我们使用高质量的世界探索数据集Sekai来训练Yume,它在多种场景和应用中取得了显著效果。所有数据、代码库和模型权重均可在https://github.com/stdstu12/YUME获取。Yume将每月更新以实现其最初的目标。项目页面:https://stdstu12.github.io/YUME-Project/。
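下面用一个极简示意(非官方实现)说明“相机运动量化”的思路:把连续的相机平移与转角映射为离散的键盘动作,便于稳定训练和键盘交互;动作集合与阈值均为示例假设。

```python
def quantize_camera_motion(dx, dz, dyaw, thr=0.05):
    """将连续相机运动量化为离散键盘动作(示意)。
    dx: 左右平移, dz: 前后平移, dyaw: 水平转角(弧度)。"""
    actions = []
    if dz > thr:
        actions.append("W")      # 前进
    elif dz < -thr:
        actions.append("S")      # 后退
    if dx > thr:
        actions.append("D")      # 右移
    elif dx < -thr:
        actions.append("A")      # 左移
    if dyaw > thr:
        actions.append("RIGHT")  # 右转视角
    elif dyaw < -thr:
        actions.append("LEFT")   # 左转视角
    return actions or ["STAY"]

print(quantize_camera_motion(0.02, 0.3, -0.1))  # ['W', 'LEFT']
```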
无形的缰绳:为什么RLVR可能无法摆脱其起源
- 标题: The Invisible Leash: Why RLVR May Not Escape Its Origin
- 作者: Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi
- 日期: 2025-07-20
- ArXiv主页 : https://arxiv.org/abs/2507.14843
- 论文链接 : https://arxiv.org/pdf/2507.14843
英文摘要
Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model's support-unable to sample solutions with zero initial probability-and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
中文摘要
大型推理模型的最新进展使可验证奖励强化学习(RLVR)成为增强AI能力(尤其是求解复杂逻辑任务)的一种有前景的方法。然而,RLVR究竟是真正扩展了模型的推理边界,还是只是放大了基础模型原本就已掌握的高奖励输出以提高精度,目前尚不清楚。本研究通过理论和实证分析,为RLVR的潜在局限提供了新的见解。首先,我们提出一个新的理论视角:RLVR受限于基础模型的支撑集——无法采样到初始概率为零的解——其行为相当于一种保守的重加权机制,可能限制对全新解法的发现。我们还识别出一种熵-奖励的权衡:RLVR虽然可靠地提升了精度,但可能逐步收窄探索范围,并忽略正确却代表性不足的解。大量实证实验证明,尽管RLVR持续提升pass@1,但在更大的采样预算下,经验支撑集的收缩通常超过其扩张,模型无法再找回基础模型原本可以得到的正确答案。有趣的是,我们还观察到,虽然RLVR有时会增加词元级熵,使每个生成步骤的不确定性更大,但答案级熵却在下降,表明这些看似更不确定的路径最终收敛到更少的不同答案上。综上所述,这些发现揭示了RLVR在扩展推理边界方面的潜在局限。要挣脱这条“无形的缰绳”,未来可能需要新的算法创新,例如显式的探索机制,或将概率质量注入代表性不足的解区域的混合策略。
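下面用一个小的数值例子(非论文代码)演示“支撑集约束”这一论点:若把RLVR粗略近似为按奖励对基础分布做指数重加权,那么初始概率为零的正确解无论如何重加权,其概率仍然是零。

```python
import numpy as np

# 基础模型在 4 个候选答案上的分布;答案 D 正确但初始概率为 0
base_probs = np.array([0.5, 0.3, 0.2, 0.0])   # A, B, C, D
rewards    = np.array([0.0, 0.0, 1.0, 1.0])   # C、D 可被验证为正确

def rlvr_like_reweight(p, r, beta=5.0):
    """把 RLVR 粗略近似为对基础分布的指数重加权(仅作示意)。"""
    w = p * np.exp(beta * r)
    return w / w.sum()

new_probs = rlvr_like_reweight(base_probs, rewards)
print(new_probs.round(3))   # C 的概率被放大,但 D 仍为 0:重加权走不出原有支撑集
```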
Step-Audio 2技术报告
- 标题: Step-Audio 2 Technical Report
- 作者: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
- 日期: 2025-07-22
- ArXiv主页 : https://arxiv.org/abs/2507.16632
- 论文链接 : https://arxiv.org/pdf/2507.16632
- 项目链接 : https://www.stepfun.com/docs/en/step-audio2
- gitHub仓库 : https://github.com/stepfun-ai/Step-Audio2
英文摘要
This paper presents Step-Audio~2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
中文摘要
本文介绍Step-Audio 2,这是一个端到端的多模态大语言模型,面向工业级的音频理解和语音对话。通过整合潜在音频编码器和以推理为中心的强化学习(RL),Step-Audio 2在自动语音识别(ASR)和音频理解方面取得了出色的表现。为了实现真正的端到端语音对话,Step-Audio 2将离散音频词元的生成纳入语言建模,大幅提升了其对说话风格、情感等副语言信息的响应能力。为了有效利用真实数据中丰富的文本和声学知识,Step-Audio 2集成了检索增强生成(RAG),并能够调用网页搜索等外部工具来缓解幻觉,以及调用音频搜索来切换音色。Step-Audio 2在数百万小时的语音和音频数据上训练,在多种对话场景中兼具智能与表现力。评估结果表明,与其他开源和商业方案相比,Step-Audio 2在多种音频理解和对话基准上达到了最先进的性能。更多信息请访问https://github.com/stepfun-ai/Step-Audio2。
像素、模式,但没有诗歌:像人类一样看世界
- 标题: Pixels, Patterns, but No Poetry: To See The World like Humans
- 作者: Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang
- 日期: 2025-07-21
- ArXiv主页 : https://arxiv.org/abs/2507.16863
- 论文链接 : https://arxiv.org/pdf/2507.16863
- 项目链接 : https://turingeyetest.github.io/
- gitHub仓库 : https://github.com/TuringEyeTest/TuringEyeTest
英文摘要
Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks trivial for humans. Both in-context learning and training on language backbone-effective for previous benchmarks-fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone-a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.
中文摘要
在多模态大语言模型(MLLM)中实现类人的感知与推理仍是人工智能的核心挑战。尽管近期研究主要集中于增强MLLM的推理能力,但一个根本问题仍然存在:多模态大语言模型能否像人类一样真正地感知世界?本文把关注点从推理转向感知。我们没有专门为推理构建基准,而是提出图灵视觉测试(TET),这是一个具有挑战性的、面向感知的基准,包含四项诊断任务,用于评估MLLM在人类可以凭直觉处理的合成图像上的表现。我们的发现表明,最先进的MLLM在这些对人类而言轻而易举的感知任务上出现了灾难性的失败。对以往基准有效的上下文学习和语言骨干训练,都无法提升模型在我们任务上的表现;而微调视觉塔则能实现快速适应。这表明我们的基准挑战的是视觉塔的泛化能力,而非语言骨干的知识与推理能力——这是当前MLLM与人类感知之间的关键差距。我们在此版本中发布了TET任务的一个代表性子集,并将在后续工作中引入更多样化的任务和方法,以增强视觉泛化能力。
MegaScience:推动面向科学推理的后训练数据集的前沿
- 标题: MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
- 作者: Run-Ze Fan, Zengzhi Wang, Pengfei Liu
- 日期: 2025-07-22
- ArXiv主页 : https://arxiv.org/abs/2507.16812
- 论文链接 : https://arxiv.org/pdf/2507.16812
- 项目链接 : https://huggingface.co/MegaScience
- gitHub仓库 : https://github.com/GAIR-NLP/lm-open-science-evaluation
英文摘要
Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
中文摘要
科学推理对于培养AI科学家以及支持人类研究人员推进自然科学发现的前沿至关重要。然而,开源社区主要集中于数学和编码,忽视了科学领域,这在很大程度上是因为缺乏开放、大规模、高质量、可验证的科学推理数据集。为弥合这一差距,我们首先推出TextbookReasoning,这是一个开放数据集,包含从1.2万本大学水平的科学教科书中提取的真实参考答案,涵盖7个科学学科的65万道推理问题。我们进一步推出MegaScience,这是一个由高质量开源数据集组成的大规模混合数据集,共计125万个实例,通过系统性的消融研究开发而成,这些研究评估了各种数据选择方法,为每个公开可用的科学数据集确定最优子集。同时,我们构建了一个覆盖15个基准、涉及多种学科和题型的综合评估体系,并结合完善的答案抽取策略,以确保评估指标的准确性。我们的实验表明,与现有的开源科学数据集相比,我们的数据集以更简洁的回答长度实现了更优的性能和训练效率。此外,我们在MegaScience上训练了Llama3.1、Qwen2.5和Qwen3系列基座模型,它们在平均性能上显著优于相应的官方指令模型。另外,MegaScience对更大、更强的模型效果更明显,表明科学领域的微调具有规模收益。我们向社区发布了数据整理流水线、评估体系、数据集以及七个训练好的模型,以推进科学推理研究。
面具背后的魔鬼:扩散LLM中涌现的安全漏洞
- 标题: The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
- 作者: Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang
- 日期: 2025-07-15
- ArXiv主页 : https://arxiv.org/abs/2507.11097
- gitHub仓库 : https://github.com/ZichenWen1/DIJA
英文摘要
Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.
中文摘要
基于扩散的大语言模型(dLLM)最近已成为自回归LLM的有力替代方案,通过并行解码和双向建模提供更快的推理和更强的交互性。然而,尽管在代码生成和文本填充方面表现出色,我们发现了一个根本性的安全问题:现有的对齐机制无法保护dLLM抵御上下文感知、带掩码输入的对抗性提示,从而暴露出新的漏洞。为此,我们提出DIJA,这是第一个利用dLLM独特安全弱点的系统性研究与越狱攻击框架。具体而言,我们提出的DIJA构造交错的掩码-文本对抗提示,利用dLLM的文本生成机制,即双向建模与并行解码。双向建模驱使模型为掩码位置生成与上下文一致的输出,即便这些内容是有害的;而并行解码则限制了模型对不安全内容进行动态过滤和拒绝采样的能力。这使标准对齐机制失效,即使有害行为或不安全指令直接暴露在提示中,经过对齐调优的dLLM仍会生成有害的补全。通过全面的实验,我们证明DIJA显著优于现有的越狱方法,揭示了dLLM架构中此前被忽视的威胁面。值得注意的是,我们的方法在Dream-Instruct上取得高达100%的基于关键词的ASR,在JailbreakBench上的基于评估器的ASR超过此前最强基线ReNeLLM多达78.5%,在StrongREJECT评分上高出37.7分,且无需在越狱提示中改写或隐藏有害内容。我们的发现凸显了在这类新兴语言模型中重新思考安全对齐的迫切需要。代码见https://github.com/ZichenWen1/DIJA。
NoHumansRequired:自主式高质量图像编辑三元组挖掘
- 标题: NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
- 作者: Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev
- 日期: 2025-07-18
- ArXiv主页 : https://arxiv.org/abs/2507.14119
- 论文链接 : https://arxiv.org/pdf/2507.14119
- 项目链接 : https://riko0.github.io/No-Humans-Required/
英文摘要
Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.
中文摘要
生成式建模的最新进展使图像编辑助手能够遵循自然语言指令而无需额外的用户输入。它们的监督训练需要数百万个三元组:原始图像、指令、编辑后图像。然而,挖掘像素级精确的样例并不容易。每次编辑必须只影响提示指定的区域,保持风格一致性,符合物理合理性,并保留视觉吸引力。缺乏稳健的自动化编辑质量指标阻碍了大规模的可靠自动化。我们提出一个自动化、模块化的流水线,可跨领域、分辨率、指令复杂度和风格挖掘高保真三元组。该系统构建于公开的生成模型之上、无需人工干预即可运行,使用针对任务调优的Gemini验证器直接对指令遵循度和美观度打分,无需任何分割或定位模型。反演与组合式自举将挖掘到的数据集扩大约2.2倍,从而得到大规模高保真训练数据。通过自动化最重复的标注步骤,该方法在无需人工标注的情况下实现了新的训练规模。为了让这一资源密集领域的研究更加普惠,我们发布了NHR-Edit:一个包含35.8万个高质量三元组的开放数据集。在规模最大的跨数据集评估中,它超越了所有公开替代方案。我们还发布了Bagel-NHR-Edit,这是一个开源微调的Bagel模型,在我们的实验中取得了最先进的指标。
WebShaper:通过信息寻求形式化进行智能体式数据合成
- 标题: WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization
- 作者: Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
- 日期: 2025-07-20
- ArXiv主页 : https://arxiv.org/abs/2507.15061
- gitHub仓库 : https://github.com/Alibaba-NLP/WebAgent
英文摘要
The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistency between information structure and reasoning structure, question and answer. To mitigate, we propose a formalization-driven IS data synthesis framework WebShaper to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure by KP operation compositions. During synthesis, we begin by creating seed tasks, then use a multi-step expansion process. At each step, an agentic Expander expands the current formal question more complex with retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experiment results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on GAIA and WebWalkerQA benchmarks.
中文摘要
大型语言模型(LLM)驱动的智能体的出现,借助基于网页的信息寻求(information-seeking,IS)能力求解复杂的开放式任务,深刻改变了人工智能。高质量训练数据的稀缺限制了IS智能体的发展。现有方法通常采用信息驱动的范式:先收集网页数据,再基于检索结果生成问题。然而,这可能导致信息结构与推理结构、问题与答案之间的不一致。为缓解这一问题,我们提出了由形式化驱动的IS数据合成框架WebShaper来构建数据集。WebShaper通过集合论对IS任务进行系统的形式化。形式化的核心是知识投影(Knowledge Projections,KP)的概念,它通过KP运算的组合实现对推理结构的精确控制。在合成过程中,我们先创建种子任务,再使用多步扩展流程。在每一步中,一个智能体式的Expander基于我们的形式化,借助检索与验证工具把当前的形式化问题扩展得更复杂。我们在合成的数据集上训练模型。实验结果表明,WebShaper在GAIA和WebWalkerQA基准上取得了开源IS智能体中的最先进性能。
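下面用一个极简示意(非官方实现)说明“知识投影(KP)及其组合”的直觉:把一个KP看作“与某对象存在给定关系的实体集合”,再用交集等集合运算组合出推理结构明确的问题;其中的知识库、关系与运算均为示例假设。

```python
# 知识库:三元组 (主体, 关系, 客体),仅作示例
FACTS = {
    ("图灵", "出生于", "英国"),
    ("图灵", "研究", "可计算性"),
    ("香农", "出生于", "美国"),
    ("香农", "研究", "信息论"),
}

def kp(relation, obj):
    """知识投影:返回与 obj 存在给定关系的实体集合。"""
    return {s for (s, r, o) in FACTS if r == relation and o == obj}

# 通过集合运算组合 KP,得到一个推理结构明确的问题:
# “出生于英国且研究可计算性的人是谁?”
answer = kp("出生于", "英国") & kp("研究", "可计算性")
print(answer)  # {'图灵'}
```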
一个以数据为中心的框架,用于解决俄语语音生成模型中的发音和韵律挑战
- 标题: A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
- 作者: Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Oleg Rogov, Grach Mkrtchian
- 日期: 2025-07-17
- ArXiv主页 : https://arxiv.org/abs/2507.13563
- gitHub仓库 : https://github.com/mtuciru/balalaika
英文摘要
Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.
中文摘要
俄语语音合成面临一些独特的挑战,包括元音弱化、辅音清化、重音位置多变、同形异义词歧义以及不自然的语调。本文介绍Balalaika,这是一个新的数据集,包含2000多小时录音棚品质的俄语语音,并带有完整的文本标注,包括标点和重音标记。实验结果表明,在Balalaika上训练的模型在语音合成和语音增强任务中都显著优于在现有数据集上训练的模型。我们详细介绍了数据集构建流水线、标注方法以及对比评估的结果。
DesignLab:通过迭代检测与修正来设计幻灯片
- 标题: DesignLab: Designing Slides Through Iterative Detection and Correction
- 作者: Jooyeol Yun, Heng Wang, Yotaro Shimose, Jaegul Choo, Shingo Takamatsu
- 日期: 2025-07-23
- ArXiv主页 : https://arxiv.org/abs/2507.17202
- 论文链接 : https://arxiv.org/pdf/2507.17202
- 项目链接 : https://yeolj00.github.io/personal-projects/designlab/
英文摘要
Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet often lack the ability to refine their own output, which is a key aspect in real-world workflows. We propose DesignLab, which separates the design process into two roles, the design reviewer, who identifies design-related issues, and the design contributor who corrects them. This decomposition enables an iterative loop where the reviewer continuously detects issues and the contributor corrects them, allowing a draft to be further polished with each iteration, reaching qualities that were unattainable. We fine-tune large language models for these roles and simulate intermediate drafts by introducing controlled perturbations, enabling the design reviewer learn design errors and the contributor learn how to fix them. Our experiments show that DesignLab outperforms existing design-generation methods, including a commercial tool, by embracing the iterative nature of designing which can result in polished, professional slides.
中文摘要
由于需要在各种设计选择之间权衡,设计高质量的演示幻灯片对非专业人士来说可能颇具挑战。许多自动化工具可以建议布局和配色方案,但往往缺乏完善自身输出的能力,而这正是现实工作流程中的关键环节。我们提出DesignLab,它把设计过程拆分为两个角色:负责发现设计相关问题的设计审阅者,以及负责修正这些问题的设计贡献者。这种分解形成了一个迭代循环:审阅者不断发现问题,贡献者不断修正,使草稿在每次迭代中得到进一步打磨,达到原本难以企及的质量。我们针对这两个角色微调大型语言模型,并通过引入受控扰动来模拟中间草稿,让设计审阅者学会识别设计错误,让贡献者学会如何修复它们。我们的实验表明,DesignLab通过拥抱设计的迭代本质,超越了包括某商业工具在内的现有设计生成方法,能够产出精致、专业的幻灯片。
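下面是摘要所述“审阅者-贡献者迭代循环”的一个最小流程示意(非官方实现),其中审阅与修正逻辑用占位规则代替真实模型调用,属于示例假设。

```python
def review(slide: dict) -> list[str]:
    """设计审阅者(占位实现):返回发现的设计问题列表。"""
    issues = []
    if slide.get("font_size", 24) < 18:
        issues.append("正文字号过小")
    if slide.get("colors", 1) > 4:
        issues.append("配色过多")
    return issues

def correct(slide: dict, issues: list[str]) -> dict:
    """设计贡献者(占位实现):针对问题修正幻灯片。"""
    fixed = dict(slide)
    if "正文字号过小" in issues:
        fixed["font_size"] = 20
    if "配色过多" in issues:
        fixed["colors"] = 3
    return fixed

def design_loop(slide: dict, max_iters: int = 5) -> dict:
    """迭代检测-修正:直到审阅者不再发现问题或达到迭代上限。"""
    for _ in range(max_iters):
        issues = review(slide)
        if not issues:
            break
        slide = correct(slide, issues)
    return slide

print(design_loop({"font_size": 14, "colors": 6}))  # {'font_size': 20, 'colors': 3}
```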
GR-3技术报告
- 标题: GR-3 Technical Report
- 作者: Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang
- 日期: 2025-07-21
- ArXiv主页 : https://arxiv.org/abs/2507.15493
- 论文链接 : https://arxiv.org/pdf/2507.15493
- 项目链接 : https://seed.bytedance.com/GR3
英文摘要
We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, pi_0, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.
中文摘要
我们报告了近期在构建通用机器人策略方面的进展,即GR-3的研发。GR-3是一个大规模视觉-语言-动作(VLA)模型,在泛化到新物体、新环境以及涉及抽象概念的指令方面展现了出色的能力。此外,它可以用极少量的人类轨迹数据高效微调,实现对新场景的快速、低成本适配。GR-3还擅长处理长程和灵巧任务,包括需要双手操作和移动的任务,表现稳健可靠。这些能力通过多方面的训练方案实现,包括与网络规模的视觉-语言数据联合训练、利用通过VR设备采集的人类轨迹数据进行高效微调,以及基于机器人轨迹数据的有效模仿学习。此外,我们推出了ByteMini,这是一款兼具出色灵活性和可靠性的多功能双臂移动机器人,与GR-3结合后能够完成各种各样的任务。通过大量真实环境实验,我们展示了GR-3在多种具有挑战性的任务上超越了最先进的基线方法π_0。我们希望GR-3能成为迈向能够在日常生活中协助人类的通用机器人的一步。
MUR:面向大型语言模型的动量不确定性引导推理
- 标题: MUR: Momentum Uncertainty guided Reasoning for Large Language Models
- 作者: Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu
- 日期: 2025-07-20
- ArXiv主页 : https://arxiv.org/abs/2507.14958
- 论文链接 : https://arxiv.org/pdf/2507.14958
- 项目链接 : https://github.com/yayayacc/MUR
- gitHub仓库 : https://github.com/yayayacc/MUR
英文摘要
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
中文摘要
大型语言模型(LLM)在推理密集型任务上取得了令人印象深刻的表现,但优化其推理效率仍是一个开放难题。测试时扩展(TTS)虽然能提升推理质量,却常常导致过度思考,把词元浪费在冗余计算上。本工作研究如何在不进行额外训练的情况下,高效且自适应地引导LLM的测试时扩展。受物理学中动量概念的启发,我们提出动量不确定性引导推理(MUR),它通过随时间跟踪并聚合逐步的不确定性,把思考预算动态分配给关键的推理步骤。为了支持灵活的推理时控制,我们引入gamma-control,这是一种只需单个超参数即可调节推理预算的简单机制。我们提供了深入的理论证明,从稳定性和偏差两个方面支撑MUR的优越性。我们使用不同规模的最新Qwen3模型(1.7B、4B和8B),在四个具有挑战性的基准(MATH-500、AIME24、AIME25和GPQA-diamond)上,将MUR与多种TTS方法进行了全面对比。结果表明,MUR平均减少了50%以上的计算量,同时将准确率提升了0.62%-3.37%。
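下面是“动量不确定性”这一核心思路的一个极简示意(非官方实现):用指数滑动平均聚合逐步不确定性,只有当前步的不确定性明显高于动量值时才为该步分配更多思考预算;不确定性的计算方式与阈值均为示例假设。

```python
import math

def token_uncertainty(token_probs):
    """用当前步输出分布的熵作为逐步不确定性的一个简单代理(示例假设)。"""
    return -sum(p * math.log(p + 1e-12) for p in token_probs)

def momentum_uncertainty_gate(step_uncertainties, alpha=0.9, gamma=1.5):
    """MUR 式门控示意:维护不确定性的动量(指数滑动平均),
    当某一步的不确定性超过 gamma * 动量值时,标记该步需要更多测试时计算。"""
    momentum, decisions = None, []
    for u in step_uncertainties:
        if momentum is None:
            decisions.append(False)                 # 第一步没有历史,先不扩展
        else:
            decisions.append(u > gamma * momentum)  # 显著高于动量值才扩展推理
        momentum = u if momentum is None else alpha * momentum + (1 - alpha) * u
    return decisions

steps = [0.4, 0.5, 2.0, 0.6, 0.5]
print(momentum_uncertainty_gate(steps))  # [False, False, True, False, False]:只有第三步被扩展
```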
Captain Cinema:迈向短电影生成
- 标题: Captain Cinema: Towards Short Movie Generation
- 作者: Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, Lu Jiang
- 日期: 2025-07-24
- ArXiv主页 : https://arxiv.org/abs/2507.18634
- 论文链接 : https://arxiv.org/pdf/2507.18634
- 项目链接 : https://thecinema.ai/
英文摘要
We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach firstly generates a sequence of keyframes that outline the entire narrative, which ensures long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model, which supports long context learning, to produce the spatio-temporal dynamics between them. This step is referred to as bottom-up video synthesis. To support stable and efficient generation of multi-scene long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narrative consistent short movies in high quality and efficiency. Project page: https://thecinema.ai
中文摘要
我们提出Captain Cinema,一个用于短电影生成的框架。给定电影故事情节的详细文本描述,我们的方法首先生成一系列勾勒整个叙事的关键帧,从而确保故事情节和视觉外观(例如场景与角色)的长程一致性。我们把这一步称为自上而下的关键帧规划。这些关键帧随后作为条件信号输入到支持长上下文学习的视频合成模型中,用来生成关键帧之间的时空动态。这一步称为自下而上的视频合成。为了支持稳定、高效地生成多场景的长篇叙事电影作品,我们为多模态扩散Transformer(MM-DiT)引入了一种交错训练策略,专门适配长上下文视频数据。我们的模型在一个由交错数据对组成的精心整理的电影数据集上训练。实验表明,Captain Cinema能够自动创作出视觉连贯、叙事一致的高质量短电影,并兼具效率。项目页面:https://thecinema.ai
ThinkAct:通过强化视觉潜在规划进行视觉-语言-动作推理
- 标题: ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
- 作者: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
- 日期: 2025-07-22
- ArXiv主页 : https://arxiv.org/abs/2507.16815
- 论文链接 : https://arxiv.org/pdf/2507.16815
- 项目链接 : https://jasper0314-huang.github.io/thinkact-vla/
英文摘要
Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
中文摘要
视觉-语言-动作(VLA)推理任务要求智能体理解多模态指令、进行长程规划,并在动态环境中自适应地行动。现有方法通常以端到端方式训练VLA模型,直接把输入映射到动作而没有显式推理,这限制了它们进行多步规划或适应复杂任务变化的能力。本文提出ThinkAct,一个双系统框架,通过强化的视觉潜在规划把高层推理与低层动作执行连接起来。ThinkAct训练一个多模态LLM,在基于目标完成度和轨迹一致性的动作对齐视觉奖励的强化下,生成具身推理计划。这些推理计划被压缩成一个视觉计划潜变量,作为下游动作模型的条件,以便在目标环境中稳健地执行动作。在具身推理和机器人操作基准上的大量实验表明,ThinkAct能够在复杂的具身AI任务中实现少样本适应、长程规划和自我纠正行为。
SeC:通过渐进式概念构建推进复杂视频对象分割
- 标题: SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
- 作者: Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
- 日期: 2025-07-21
- ArXiv主页 : https://arxiv.org/abs/2507.15852
- 论文链接 : https://arxiv.org/pdf/2507.15852
- 项目链接 : https://rookiexiong7.github.io/projects/SeC/
- gitHub仓库 : https://github.com/OpenIXCLab/SeC
英文摘要
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.
中文摘要
视频对象分割(VOS)是计算机视觉中的核心任务,要求模型在视频帧之间跟踪并分割目标对象。尽管近期工作取得了显著进展,当前技术在处理剧烈的视觉变化、遮挡和复杂场景变换方面仍落后于人类能力。这一局限源于它们依赖外观匹配,而忽视了人类对物体的概念化理解,正是这种理解使人能够在时间动态变化中进行稳健的识别。受此差距的启发,我们提出分割概念(SeC),这是一个概念驱动的分割框架,把重心从传统的特征匹配转向高层的、以对象为中心的表征的渐进式构建与利用。SeC利用大型视觉-语言模型(LVLM)整合来自不同帧的视觉线索,构建稳健的概念先验。在推理阶段,SeC基于已处理的帧形成目标的完整语义表征,从而实现对后续帧的稳健分割。此外,SeC自适应地在基于LVLM的语义推理与增强的特征匹配之间取得平衡,根据场景复杂度动态调整计算开销。为了在需要高层概念推理和稳健语义理解的场景下严格评估VOS方法,我们提出了语义复杂场景视频对象分割基准(SeCVOS)。SeCVOS包含160个人工标注的多场景视频,旨在用显著的外观变化和动态场景变换来挑战模型。特别地,SeC在SeCVOS上比SAM 2.1提升了11.8个点,确立了概念感知视频对象分割的新最优水平。
在3D高斯泼溅中基于正则化分数蒸馏采样的鲁棒3D掩码部件级编辑
- 标题: Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
- 作者: Hayeon Kim, Ji Ha Jang, Se Young Chun
- 日期: 2025-07-15
- ArXiv主页 : https://arxiv.org/abs/2507.11061
- 论文链接 : https://arxiv.org/pdf/2507.11061
- 项目链接 : https://janeyeon.github.io/romap
- gitHub仓库 : https://github.com/janeyeon/romap-code
英文摘要
Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and inherently ambiguous nature of Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label property, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, making it possible for more robust and flexible part-level 3D Gaussian editing. Code is available at https://janeyeon.github.io/romap.
中文摘要
3D神经表示与实例级编辑模型的最新进展,使高质量3D内容的高效创建成为可能。然而,实现精确的局部3D编辑仍然具有挑战性,对3D高斯泼溅(Gaussian Splatting)尤其如此,原因在于多视角2D部件分割不一致,以及分数蒸馏采样(SDS)损失本身的模糊性。为了解决这些局限,我们提出RoMaP,一个新颖的局部3D高斯编辑框架,能够实现精确且幅度较大的部件级修改。首先,我们引入一个稳健的3D掩码生成模块,即3D几何感知标签预测(3D-GALP),它利用球谐(SH)系数来建模随视角变化的标签及软标签属性,从而得到跨视角准确且一致的部件分割。其次,我们提出一种正则化的SDS损失,将标准SDS损失与额外的正则项相结合。特别地,我们通过计划性潜变量混合与部件编辑(SLaMP)方法引入L1锚定损失,该方法生成高质量的部件编辑2D图像,并把修改限制在目标区域内,同时保持上下文一致性。诸如移除高斯先验之类的其他正则项通过允许超出现有上下文的变化进一步提升了灵活性,而稳健的3D掩码则防止了意外的编辑。实验结果表明,我们的RoMaP在重建和生成的高斯场景与物体上,在定性和定量上都达到了最先进的局部3D编辑效果,使更稳健、更灵活的部件级3D高斯编辑成为可能。代码见https://janeyeon.github.io/romap。
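下面用一个极简示意(非官方实现)表示“正则化SDS损失”的组合方式:总损失由SDS项与目标掩码内的L1锚定项相加,其中SDS梯度以占位函数表示;接口、权重与张量形状均为示例假设。

```python
import torch

def sds_grad(rendered, prompt):
    """SDS 梯度的占位实现:真实实现需调用扩散模型对渲染图加噪并预测噪声。"""
    return torch.zeros_like(rendered)  # 仅作接口示意

def romap_like_loss(rendered, anchor_image, mask, prompt, lambda_anchor=1.0):
    """正则化 SDS 损失的示意:SDS 项 + 目标区域内对部件编辑图像的 L1 锚定项。
    rendered:      当前 3DGS 渲染结果 [3, H, W]
    anchor_image:  部件编辑后的参考 2D 图像(例如 SLaMP 产出)[3, H, W]
    mask:          目标部件的 2D 掩码 [1, H, W],仅在该区域施加锚定
    """
    # SDS 项:把占位梯度作为“伪目标”接到渲染图上(SDS 的一种常见实现写法)
    grad = sds_grad(rendered, prompt)
    sds_term = (rendered * grad.detach()).sum()
    # L1 锚定项:只约束掩码内的像素,保留掩码外的上下文
    anchor_term = ((rendered - anchor_image).abs() * mask).mean()
    return sds_term + lambda_anchor * anchor_term
```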
一个领域能帮助其他领域吗?基于强化学习的以数据为中心的多领域推理研究
- 标题: Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
- 作者: Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu
- 日期: 2025-07-23
- ArXiv主页 : https://arxiv.org/abs/2507.17512
- 论文链接 : https://arxiv.org/pdf/2507.17512
英文摘要
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models' in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions including mutual enhancements and conflicts that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.
中文摘要
可验证奖励强化学习(RLVR)已成为增强LLM推理能力的强大范式。现有研究主要集中在孤立的推理领域,如数学解题、编码任务或逻辑推理。然而,现实世界的推理场景本质上需要多种认知技能的综合运用。尽管如此,这些推理技能在强化学习下如何相互作用仍缺乏深入理解。为了弥合这一差距,我们在RLVR框架内对多领域推理进行了系统研究,重点关注三个主要领域:数学推理、代码生成和逻辑谜题求解。我们开展了包含四个关键部分的全面研究:(1)借助GRPO算法和Qwen-2.5-7B模型家族,在单领域数据集上训练,全面评估模型的领域内提升和跨领域泛化能力;(2)此外,我们考察了组合跨领域训练中出现的复杂相互作用,包括相互促进与相互冲突;(3)为进一步理解SFT对RL的影响,我们还在相同的RL配置下分析并比较了基座模型与指令模型之间的性能差异;(4)此外,我们深入研究了关键的RL训练细节,系统探究了课程学习策略、奖励设计变化以及语言相关因素的影响。通过大量实验,我们的结果为领域间相互作用的动态提供了重要洞见,揭示了影响专门化与可泛化推理性能的关键因素。这些发现为优化RL方法、培养LLM全面的多领域推理能力提供了宝贵的指导。
只上采样重要区域:用于加速扩散Transformer的区域自适应潜在采样
- 标题: Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
- 作者: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
- 日期: 2025-07-11
- ArXiv主页 : https://arxiv.org/abs/2507.08422
- gitHub仓库 : https://github.com/ignoww/RALU
英文摘要
Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling on specific regions prone to artifacts at full-resolution, and 3) all latent upsampling at full-resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality by achieving up to 7.0times speed-up on FLUX and 3.0times on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, thus can be seamlessly integrated to further reduce inference latency without compromising generation quality.
中文摘要
扩散Transformer已成为基于U-Net的扩散模型在高保真图像和视频生成方面的替代方案,具有更好的可扩展性。然而,其庞大的计算量仍是现实部署的主要障碍。现有的加速方法主要利用时间维度,例如在不同扩散时间步之间复用缓存的特征。本文提出区域自适应潜在上采样(RALU),这是一个免训练的框架,可沿空间维度加速推理。RALU分三个阶段进行混合分辨率采样:1)在低分辨率下进行潜在扩散去噪,以高效捕获全局语义结构;2)对容易在全分辨率下出现伪影的特定区域进行区域自适应上采样;3)在全分辨率下对全部潜变量上采样,以细化细节。为了稳定跨分辨率转换时的生成,我们利用噪声-时间步重调度来使噪声水平适配不同分辨率。我们的方法在保持图像质量的同时大幅降低计算量,在FLUX上最高加速7.0倍,在Stable Diffusion 3上最高加速3.0倍,且质量损失极小。此外,RALU与缓存方法等现有的时间维度加速手段互补,因此可以无缝集成,在不影响生成质量的情况下进一步降低推理延迟。
Zebra-CoT:一个用于交错视觉-语言推理的数据集
- 标题: Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
- 作者: Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum
- 日期: 2025-07-22
- ArXiv主页 : https://arxiv.org/abs/2507.16746
- gitHub仓库 : https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT
英文摘要
Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.
中文摘要
人类在解决复杂问题时经常使用视觉辅助,例如图表或草图。训练多模态模型做到同样的事情,即视觉思维链(Visual CoT),面临两方面挑战:(1)现成模型的视觉CoT能力较差,阻碍了强化学习;(2)缺乏高质量的视觉CoT训练数据。我们推出Zebra-CoT,这是一个多样化的大规模数据集,包含182,384个样本,其中含有逻辑连贯的图文交错推理轨迹。我们聚焦于四类特别适合用草图或视觉推理求解的任务:涵盖几何、物理和算法的科学问题;视觉搜索和拼图等2D视觉推理任务;包括3D多跳推断、具身与机器人规划在内的3D推理任务;以及视觉逻辑问题和国际象棋等策略游戏。在Zebra-CoT训练语料上微调Anole-7B模型,使我们的测试集准确率提升12%,并在标准VLM基准评估上带来最高13%的性能增益。微调Bagel-7B得到的模型能够生成高质量的交错视觉推理链,凸显了Zebra-CoT在培养多模态推理能力方面的有效性。我们开源了数据集和模型,以支持视觉CoT的开发与评估。
LAPO:通过长度自适应策略优化内化推理效率
- 标题: LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
- 作者: Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang
- 日期: 2025-07-21
- ArXiv主页 : https://arxiv.org/abs/2507.15758
- gitHub仓库 : https://github.com/ZJU-REAL/LAPO
英文摘要
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
中文摘要
大型推理模型通过加长的思维链序列取得了卓越性能,但这种计算上的自由度即使面对简单问题也会导致过多的 token 生成。我们提出长度自适应策略优化(LAPO),一个将推理长度控制从外部约束转化为模型内在能力的新框架。与施加硬性限制或依赖事后干预的现有方法不同,LAPO 通过两阶段强化学习过程让模型内化对合适推理深度的理解。第一阶段,模型通过发现成功解答长度的统计分布来学习自然的推理模式;第二阶段将这些模式作为元认知引导,直接嵌入模型的推理上下文,以保证推理时的灵活性。数学推理基准上的实验表明,LAPO 在将准确率提高 2.3% 的同时,把 token 用量最多降低了 40.9%。我们的分析显示,经 LAPO 训练的模型涌现出按问题复杂度分配计算资源的能力,在不牺牲质量的前提下实现高效推理。
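下面的 Python 草图示意 LAPO 两阶段思路:先从成功解答中统计长度分布,再把目标长度区间写入推理上下文,并配合一个长度自适应的奖励项。奖励的具体形式与提示语均为示意性假设,并非论文原始实现。

```python
# LAPO 两阶段思路的极简示意(非官方实现)
import numpy as np

def fit_length_stats(successful_lengths):
    """阶段一:从成功 rollout 的 token 长度中估计合理推理长度的分布。"""
    lengths = np.asarray(successful_lengths, dtype=float)
    return {"p25": np.percentile(lengths, 25), "p75": np.percentile(lengths, 75)}

def build_prompt(question, stats):
    """阶段二:把目标长度区间作为元认知提示嵌入推理上下文(提示语为示意)。"""
    return (f"{question}\n"
            f"请在大约 {int(stats['p25'])}–{int(stats['p75'])} 个 token 内完成推理。")

def length_adaptive_reward(correct: bool, length: int, stats, alpha: float = 0.2):
    """正确性奖励 + 长度项:落在目标区间内不惩罚,超出区间按相对偏差线性惩罚。"""
    if not correct:
        return 0.0
    lo, hi = stats["p25"], stats["p75"]
    deviation = max(0.0, (length - hi) / hi) + max(0.0, (lo - length) / lo)
    return 1.0 - alpha * min(deviation, 1.0)

stats = fit_length_stats([310, 420, 500, 380, 450, 600])
print(build_prompt("证明 n^2 + n 总是偶数。", stats))
print(length_adaptive_reward(True, 900, stats))
```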
Franca:嵌套的Matryoshka聚类,用于可扩展的视觉表示学习
- 标题: Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
- 作者: Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano
- 日期: 2025-07-18
- ArXiv主页 : https://arxiv.org/abs/2507.14137
- gitHub仓库 : https://github.com/valeoai/Franca
英文摘要
We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
中文摘要
我们提出 Franca(读作 Fran-ka,取 "free one" 之意):首个完全开源(数据、代码、权重)的视觉基础模型,其性能可匹敌并在许多情况下超越 DINOv2、CLIP、SigLIPv2 等最先进的专有模型。我们的方法基于受 Web-SSL 启发的透明训练流程,只使用公开数据:ImageNet-21K 与 ReLAION-2B 的一个子集。除了发布模型之外,我们还着手解决自监督学习(SSL)聚类方法的关键局限。现有模型依赖 Sinkhorn-Knopp 等聚类算法把图像特征分配到大型码本,却没有考虑聚类语义中固有的歧义性。为此,我们提出基于嵌套 Matryoshka 表示的参数高效多头聚类投影头。该设计在不增加模型规模的情况下,把特征逐级细化为越来越细粒度的簇,兼顾性能与内存效率。此外,我们提出一种新的位置解耦策略,显式地从稠密表示中去除位置偏差,从而改进语义内容的编码。这在多个下游基准上带来一致的增益,证明了更干净特征空间的价值。我们的贡献为透明、高性能的视觉模型确立了新标准,并为更广泛的 AI 社区走向更可复现、更可泛化的基础模型开辟了道路。代码与模型检查点见 https://github.com/valeoai/Franca。
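下面给出嵌套 Matryoshka 多头聚类投影的一个极简示意:每个头只读取特征向量的一个嵌套前缀,并分配到数量逐级增多的原型簇,从而由粗到细地细化语义分组。原型数量、前缀长度等均为示意性假设,并非官方实现。

```python
# 嵌套 Matryoshka 多头聚类投影的最小示意(非官方实现)
import numpy as np

def _l2norm(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

class MatryoshkaClusteringHeads:
    """每个聚类头只读取特征向量的一个嵌套前缀,并分配到数量逐级增多的原型簇。"""

    def __init__(self, dim=768, prefix_dims=(96, 192, 384, 768),
                 num_prototypes=(64, 256, 1024, 4096), seed=0):
        assert len(prefix_dims) == len(num_prototypes) and prefix_dims[-1] == dim
        rng = np.random.default_rng(seed)
        self.prefix_dims = prefix_dims
        # 每个头一组 L2 归一化的原型,维度等于该头所用的前缀长度(原型数为示意值)
        self.prototypes = [_l2norm(rng.standard_normal((k, d)))
                           for d, k in zip(prefix_dims, num_prototypes)]

    def assign(self, feats):
        """feats: (N, dim);返回由粗到细的每头簇分配,形状各为 (N,)。"""
        out = []
        for d, protos in zip(self.prefix_dims, self.prototypes):
            sub = _l2norm(feats[:, :d])                    # 取嵌套前缀并归一化
            out.append((sub @ protos.T).argmax(axis=1))    # 余弦相似度最近的原型
        return out

heads = MatryoshkaClusteringHeads()
feats = np.random.default_rng(1).standard_normal((8, 768))
for level, assign in enumerate(heads.assign(feats)):
    print(f"level {level}: {assign.tolist()}")
```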
Ultra3D:有效且高保真3D生成,部分关注
- 标题: Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention
- 作者: Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, Guosheng Lin
- 日期: 2025-07-23
- ArXiv主页 : https://arxiv.org/abs/2507.17745
- 论文链接 : https://arxiv.org/pdf/2507.17745
- 项目链接 : https://buaacyw.github.io/ultra3d/
英文摘要
Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
中文摘要
稀疏体素表示的最新进展显著提升了 3D 内容生成的质量,使具有精细几何的高分辨率建模成为可能。然而,由于两阶段扩散流水线中注意力机制的二次复杂度,现有框架的计算效率严重不足。在这项工作中,我们提出 Ultra3D,一个高效的 3D 生成框架,在不损失质量的前提下显著加速稀疏体素建模。我们的方法利用紧凑的 VecSet 表示在第一阶段高效生成粗略的物体布局,减少 token 数量并加速体素坐标预测。为了在第二阶段细化每个体素的潜在特征,我们引入部件注意力(Part Attention),一种几何感知的局部注意力机制,把注意力计算限制在语义一致的部件区域内。该设计在保持结构连续性的同时避免了不必要的全局注意力,使潜变量生成最高加速 6.7 倍。为支持这一机制,我们构建了可扩展的部件标注流水线,把原始网格转换为带部件标签的稀疏体素。大量实验表明,Ultra3D 支持 1024 分辨率的高分辨率 3D 生成,并在视觉保真度和用户偏好两方面均达到最先进水平。
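下面用一段简短的 NumPy 代码示意"部件注意力"的核心语义:依据部件标签构造块状掩码,把注意力限制在同一部件内部。真实实现会按部件对 token 分组计算以真正节省算力,这里的稠密掩码写法仅用于说明语义,属于示意性假设。

```python
# 部件注意力(Part Attention)语义示意:只允许同一部件内的 token 相互注意
import numpy as np

def part_attention(q, k, v, part_ids):
    """q, k, v: (N, d);part_ids: (N,) 每个体素 token 的部件标签。"""
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    same_part = part_ids[:, None] == part_ids[None, :]   # (N, N) 块状掩码
    scores = np.where(same_part, scores, -np.inf)        # 屏蔽跨部件的注意力
    scores -= scores.max(axis=-1, keepdims=True)         # 数值稳定的 softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
N, d = 6, 8
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
part_ids = np.array([0, 0, 1, 1, 1, 2])   # 三个语义部件
out = part_attention(q, k, v, part_ids)
print(out.shape)   # (6, 8)
```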
Being-H0:基于大规模人类视频的视觉-语言-动作预训练
- 标题: Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
- 作者: Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu
- 日期: 2025-07-21
- ArXiv主页 : https://arxiv.org/abs/2507.15597
- 论文链接 : https://arxiv.org/pdf/2507.15597
- 项目链接 : https://beingbeyond.github.io/Being-H0
- gitHub仓库 : https://github.com/BeingBeyond/Being-H0
英文摘要
We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.
中文摘要
我们提出 Being-H0,一个在大规模人类视频上训练的灵巧视觉-语言-动作模型(VLA)。现有 VLA 难以胜任需要高度灵巧性的复杂操纵任务,对新场景和新任务的泛化也很差,主要原因在于它们依赖存在明显虚实差距(sim-to-real gap)的合成数据,或缺乏规模与多样性的遥操作演示。为了解决这一数据瓶颈,我们提出把人手作为基础操纵器,充分利用网络数据中丰富的灵巧性与可扩展性。我们的方法以物理指令微调(physical instruction tuning)为核心,这是一种新的训练范式,结合了基于人类视频的大规模 VLA 预训练、用于 3D 推理的物理空间对齐,以及面向机器人任务的训练后适配。此外,我们提出一种部件级运动 token 化方法,实现毫米级的重建精度,为动作学习建模精确的手部轨迹。为支撑所提出的范式,我们进一步开发了一条完整的数据整理流水线,把动作捕捉、VR 和纯 RGB 视频等异构来源整合为包含数百万条基于运动的指令实例的大规模数据集。实验表明,Being-H0 在手部运动生成与指令跟随方面表现出色,并且随模型与数据规模良好扩展。重要的是,应用物理指令微调后,我们在真实机器人操纵中观察到 Being-H0 带来的预期收益。更多细节见 https://beingbeyond.github.io/Being-H0。
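摘要提到的"部件级运动 token 化"可以用一个极简的均匀量化例子来直观理解:把各手部部件的 3D 关键点轨迹离散成整数 token,再反量化重建并检查误差量级。实际方法的码本与部件划分远比这里复杂,以下仅为示意性假设。

```python
# 部件级运动 token 化思想的极简示意(均匀量化,非官方实现)
import numpy as np

def tokenize(coords_m, lo=-0.5, hi=0.5, bins=4096):
    """把以米为单位的坐标均匀量化为 [0, bins) 的整数 token。"""
    x = np.clip((coords_m - lo) / (hi - lo), 0.0, 1.0)
    return np.round(x * (bins - 1)).astype(np.int64)

def detokenize(tokens, lo=-0.5, hi=0.5, bins=4096):
    return tokens / (bins - 1) * (hi - lo) + lo

rng = np.random.default_rng(0)
# 假设:16 帧、5 个手部部件、每部件 4 个关键点的 3D 坐标(单位:米)
traj = rng.uniform(-0.3, 0.3, size=(16, 5, 4, 3))
tokens = tokenize(traj)
recon = detokenize(tokens)
err_mm = np.abs(recon - traj).max() * 1000
print(f"max reconstruction error: {err_mm:.2f} mm")   # 均匀 4096 桶约 0.12 mm 量级
```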