Contents
- Scaling RL to Long Videos
- MemOS: A Memory OS for AI System
- T-LoRA: Single Image Diffusion Model Customization Without Overfitting
- SingLoRA: Low Rank Adaptation Using a Single Matrix
- 4KAgent: Agentic Any Image to 4K Super-Resolution
- A Survey on Latent Reasoning
- Should We Still Pretrain Encoders with Masked Language Modeling?
- MIRIX: Multi-Agent Memory System for LLM-Based Agents
- Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
- Skywork-R1V3 Technical Report
- OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
- Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
- How to Train Your LLM Web Agent: A Statistical Diagnosis
- Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
- Perception-Aware Policy Optimization for Multimodal Reasoning
- StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
- Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
- Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
- CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
- OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
- 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
- Pre-Trained Policy Discriminators are General Reward Models
- LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS
- How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
- Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- PyVision: Agentic Vision with Dynamic Tooling
- RLVER: Reinforcement Learning with Verifiable Emotion Rewards
- RoboBrain 2.0 Technical Report
Scaling RL to Long Videos
- Title: Scaling RL to Long Videos
- Authors: Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
- Date: 2025-07-10
- ArXiv page: https://arxiv.org/abs/2507.07966
- Paper link: https://arxiv.org/pdf/2507.07966
- Project page: https://github.com/NVlabs/Long-RL
- GitHub repo: https://github.com/NVlabs/Long-RL
Abstract
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
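A minimal sketch of the cached-embedding idea behind MR-SP's rollout stage, assuming stand-in `vision_encoder` and `policy_lm` objects (these names and the caching interface are illustrative, not the Long-RL API): each long video is encoded once, and the cached embeddings are reused across repeated RL rollouts and prefilling.

```python
import torch

class CachedRollout:
    """Illustrative only: reuse precomputed video embeddings across RL rollouts."""
    def __init__(self, vision_encoder, policy_lm):
        self.vision_encoder = vision_encoder
        self.policy_lm = policy_lm
        self._cache = {}  # video_id -> precomputed frame embeddings

    @torch.no_grad()
    def video_embeddings(self, video_id, frames):
        # Encode each video once; later rollouts and prefilling hit the cache.
        if video_id not in self._cache:
            self._cache[video_id] = self.vision_encoder(frames)
        return self._cache[video_id]

    def rollout(self, video_id, frames, question, num_samples=8):
        emb = self.video_embeddings(video_id, frames)
        # Sample several candidate reasoning traces for the RL update.
        return [self.policy_lm.generate(emb, question) for _ in range(num_samples)]
```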
MemOS: A Memory OS for AI System
- Title: MemOS: A Memory OS for AI System
- Authors: Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong
- Date: 2025-07-04
- ArXiv page: https://arxiv.org/abs/2507.03724
- Paper link: https://arxiv.org/pdf/2507.03724
- Project page: https://memos.openmem.net/
- GitHub repo: https://github.com/MemTensor/MemOS
Abstract
Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency.Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods.While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations.Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.
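To make the MemCube abstraction concrete, a hedged sketch of such a record is shown below; the field names (`content`, `kind`, `source`, `version`) are assumptions for illustration, not the MemOS API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemCube:
    """Sketch of a MemCube-like unit: memory content plus provenance/versioning metadata."""
    content: str          # plaintext payload, activation reference, or adapter identifier
    kind: str             # "plaintext", "activation", or "parameter"
    source: str           # provenance: which agent/session produced it
    version: int = 1
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def migrate(self, new_kind: str) -> "MemCube":
        # e.g. distill a plaintext memory into a parameter-level one; bump the version.
        return MemCube(self.content, new_kind, self.source, self.version + 1)
```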
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
- Title: T-LoRA: Single Image Diffusion Model Customization Without Overfitting
- Authors: Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev
- Date: 2025-07-08
- ArXiv page: https://arxiv.org/abs/2507.05964
- GitHub repo: https://github.com/ControlGenAI/T-LoRA
Abstract
While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.
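A hedged sketch of the timestep-dependent idea (not the official T-LoRA code): fewer LoRA ranks stay active at high, noisy diffusion timesteps, and the adapter factor is orthogonally initialized. The linear masking rule and hyperparameters below are assumptions.

```python
import torch
import torch.nn as nn

class TimestepLoRA(nn.Module):
    """Toy LoRA adapter whose effective rank shrinks as the diffusion timestep grows."""
    def __init__(self, base: nn.Linear, rank: int = 16, max_t: int = 1000, alpha: float = 1.0):
        super().__init__()
        self.base, self.rank, self.max_t, self.alpha = base, rank, max_t, alpha
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.orthogonal_(self.A)  # orthogonal init keeps adapter components independent

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        # Use fewer ranks at higher (noisier) timesteps; at least one rank stays active.
        active = max(1, int(self.rank * (1 - t / self.max_t)))
        delta = (self.B[:, :active] @ self.A[:active]) * self.alpha  # (out, in)
        return self.base(x) + x @ delta.T
```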
SingLoRA: Low Rank Adaptation Using a Single Matrix
- Title: SingLoRA: Low Rank Adaptation Using a Single Matrix
- Authors: David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel
- Date: 2025-07-08
- ArXiv page: https://arxiv.org/abs/2507.05566
- GitHub repo: https://github.com/kyegomez/SingLoRA
Abstract
Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In this paper, we propose SingLoRA, which reformulates low-rank adaptation by learning the weights update as a decomposition of a single low-rank matrix multiplied by its transpose. This simple design inherently removes inter-matrix scale conflicts, ensuring stable optimization, and roughly halves the parameter count. We analyze SingLoRA within the infinite-width neural network framework, showing that it guarantees stable feature learning by construction. Extensive experiments on multiple tasks validate these benefits. In common sense reasoning, fine-tuning LLama 7B on MNLI with SingLoRA achieves 91.3% accuracy - surpassing LoRA (89.1%) and LoRA+ (90.2%) - while using only 60% of their parameter budget. In image generation, fine-tuning Stable Diffusion with SingLoRA significantly improves image fidelity on DreamBooth, achieving a DINO similarity score of 0.151, compared to scores of 0.148 and 0.143 for DoRA and LoRA, respectively.
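A minimal sketch of the single-matrix update described above, assuming a square base weight for simplicity (the paper handles the general case): instead of learning B @ A, the adapter learns one matrix A and adds alpha * A @ A.T, roughly halving the adapter parameters.

```python
import torch
import torch.nn as nn

class SingLoRALinear(nn.Module):
    """Toy single-matrix low-rank adapter: W_eff = W0 + alpha * A @ A.T."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        assert base.in_features == base.out_features, "sketch assumes a square weight"
        self.base, self.alpha = base, alpha
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.A @ self.A.T  # symmetric low-rank update; no B/A scale mismatch
        return self.base(x) + self.alpha * (x @ delta)
```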
4KAgent: Agentic Any Image to 4K Super-Resolution
- Title: 4KAgent: Agentic Any Image to 4K Super-Resolution
- Authors: Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu
- Date: 2025-07-09
- ArXiv page: https://arxiv.org/abs/2507.07105
- Paper link: https://arxiv.org/pdf/2507.07105
- Project page: https://4kagent.github.io/
- GitHub repo: https://github.com/taco-group/4KAgent
Abstract
We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.
A Survey on Latent Reasoning
- Title: A Survey on Latent Reasoning
- Authors: Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian
- Date: 2025-07-08
- ArXiv page: https://arxiv.org/abs/2507.06203
- GitHub repo: https://github.com/multimodal-art-projection/LatentCoT-Horizon
Abstract
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.
Should We Still Pretrain Encoders with Masked Language Modeling?
- Title: Should We Still Pretrain Encoders with Masked Language Modeling?
- Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
- Date: 2025-07-01
- ArXiv page: https://arxiv.org/abs/2507.00994
- Paper link: https://arxiv.org/pdf/2507.00994
- Project page: https://huggingface.co/MLMvsCLM
- GitHub repo: https://github.com/Nicolas-BZRD/EuroBERT
Abstract
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
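A sketch of the biphasic recipe the paper recommends, written as a generic training loop; the 50/50 budget split and the loss-function arguments are placeholders rather than the paper's exact schedule.

```python
def biphasic_pretrain(model, batches, optimizer, clm_loss_fn, mlm_loss_fn,
                      total_steps, clm_fraction=0.5):
    """Phase 1: causal LM on the first part of the budget, then switch to masked LM."""
    switch_at = int(total_steps * clm_fraction)
    for step, batch in zip(range(total_steps), batches):
        # Same data stream, different objective before/after the switch point.
        loss = clm_loss_fn(model, batch) if step < switch_at else mlm_loss_fn(model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

As the abstract notes, phase 1 can also be skipped entirely by initializing from an already pretrained CLM checkpoint and running only the MLM phase.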
MIRIX: Multi-Agent Memory System for LLM-Based Agents
- Title: MIRIX: Multi-Agent Memory System for LLM-Based Agents
- Authors: Yu Wang, Xi Chen
- Date: 2025-07-10
- ArXiv page: https://arxiv.org/abs/2507.07957
- Paper link: https://arxiv.org/pdf/2507.07957
- Project page: https://mirix.io/
- GitHub repo: https://github.com/Mirix-AI/MIRIX
Abstract
Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
- Title: Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
- Authors: Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
- Date: 2025-07-08
- ArXiv page: https://arxiv.org/abs/2507.06229
- GitHub repo: https://github.com/OPPO-PersonalAI/Agent-KB
Abstract
As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other's experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.
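A hedged sketch of a Reason-Retrieve-Refine loop over a shared experience store, in the spirit of the abstract; the `agent` and `knowledge_base` interfaces are assumptions, not the Agent KB code.

```python
def solve_with_agent_kb(task, agent, knowledge_base, top_k=3):
    # Reason: draft a plan from the task alone.
    draft_plan = agent.plan(task)
    # Retrieve: pull strategies and execution logs from other agents' past tasks.
    experiences = knowledge_base.search(task, k=top_k)
    # Refine: revise the plan against the retrieved cross-domain experience.
    plan = agent.refine(draft_plan, experiences)
    result = agent.execute(plan)
    # Write both the high-level strategy and the detailed log back for other agents.
    knowledge_base.add(task=task, strategy=plan, log=result.log)
    return result
```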
Skywork-R1V3 Technical Report
- Title: Skywork-R1V3 Technical Report
- Authors: Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Yahui Zhou
- Date: 2025-07-08
- ArXiv page: https://arxiv.org/abs/2507.06167
- GitHub repo: https://github.com/SkyworkAI/Skywork-R1V
Abstract
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model's reasoning ability, without the need for additional continue pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
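The checkpoint-selection signal mentioned above can be pictured as the average entropy of the policy's distribution at critical reasoning tokens. The sketch below is illustrative only and assumes the mask of critical tokens is given.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, critical_mask: torch.Tensor) -> float:
    """logits: (seq_len, vocab); critical_mask: (seq_len,) bool marking reasoning tokens."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # per-token entropy, (seq_len,)
    return entropy[critical_mask].mean().item()

# During RL training, this statistic (computed on a validation set) can be tracked per
# checkpoint and used alongside accuracy to pick which checkpoint to keep.
```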
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
- Title: OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
- Authors: Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, Xihui Liu
- Date: 2025-07-08
- ArXiv page: https://arxiv.org/abs/2507.06165
- Paper link: https://arxiv.org/pdf/2507.06165
- Project page: https://omnipart.github.io/
Abstract
The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
- Title: Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
- Authors: Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang
- Date: 2025-07-09
- ArXiv page: https://arxiv.org/abs/2507.07095
Abstract
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve the generalization ability of zero-shot. To this end, firstly, we develop an efficient annotation pipeline and introduce MotionMillion-the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
How to Train Your LLM Web Agent: A Statistical Diagnosis
- Title: How to Train Your LLM Web Agent: A Statistical Diagnosis
- Authors: Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
- Date: 2025-07-05
- ArXiv page: https://arxiv.org/abs/2507.04103
- Paper link: https://arxiv.org/pdf/2507.04103
Abstract
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
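A minimal illustration of the bootstrapping step described above: resample the observed (configuration, success-rate) runs with replacement and count how often each configuration would have looked best. Function and variable names are illustrative, not the authors' code.

```python
import random

def bootstrap_win_rates(runs, n_boot=1000, seed=0):
    """runs: list of (config_name, success_rate) pairs from the sampled training runs."""
    rng = random.Random(seed)
    wins = {name: 0 for name, _ in runs}
    for _ in range(n_boot):
        resample = [rng.choice(runs) for _ in runs]     # resample runs with replacement
        best_name, _ = max(resample, key=lambda r: r[1])
        wins[best_name] += 1
    # A robust hyperparameter choice wins across most resamples, not just in one sweep.
    return {name: w / n_boot for name, w in wins.items()}
```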
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
- Title: Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
- Authors: Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang
- Date: 2025-07-10
- ArXiv page: https://arxiv.org/abs/2507.07999
- Paper link: https://arxiv.org/pdf/2507.07999
- Project page: https://github.com/Haochen-Wang409/TreeVGR
- GitHub repo: https://github.com/Haochen-Wang409/TreeVGR
Abstract
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs, even the most advanced models struggle with this benchmark, where none of them reach 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
Perception-Aware Policy Optimization for Multimodal Reasoning
- Title: Perception-Aware Policy Optimization for Multimodal Reasoning
- Authors: Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
- Date: 2025-07-08
- ArXiv page: https://arxiv.org/abs/2507.06448
- Paper link: https://arxiv.org/pdf/2507.06448
- Project page: https://mikewangwzhl.github.io/PAPO/
- GitHub repo: https://github.com/MikeWangWZHL/PAPO
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
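One plausible reading of the implicit perception term, sketched below (not the paper's exact objective): measure the KL divergence between the policy's token distribution given the intact image and the same distribution given a corrupted image, and favor rollouts whose predictions actually depend on the visual input. The sign and weighting in the commented total loss are assumptions.

```python
import torch
import torch.nn.functional as F

def implicit_perception_kl(logits_full: torch.Tensor, logits_corrupted: torch.Tensor) -> torch.Tensor:
    """Both tensors: (seq_len, vocab) logits for the same rollout tokens,
    computed with the intact image vs. a masked/corrupted image."""
    logp_full = F.log_softmax(logits_full, dim=-1)
    p_corrupt = F.softmax(logits_corrupted, dim=-1)
    # KL(corrupted-image distribution || full-image distribution)
    return F.kl_div(logp_full, p_corrupt, reduction="batchmean")

# One plausible use:  loss = grpo_loss - lam * implicit_perception_kl(...)
# i.e. encourage divergence between the two conditions so the policy must attend to the image.
```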
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
- Title: StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
- Authors: Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
- Date: 2025-07-07
- ArXiv page: https://arxiv.org/abs/2507.05240
- Paper link: https://arxiv.org/pdf/2507.05240
- Project page: https://streamvln.github.io/
- GitHub repo: https://github.com/OpenRobotLab/StreamVLN
Abstract
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/{https://streamvln.github.io/}.
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
- Title: Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
- Authors: Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
- Date: 2025-07-10
- ArXiv page: https://arxiv.org/abs/2507.07990
- Paper link: https://arxiv.org/pdf/2507.07990
- Project page: https://www.jshyun.me/projects/sttm
- GitHub repo: https://github.com/HYUNJS/STTM
Abstract
Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2times speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3times speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
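A toy sketch of temporal token reduction in the same spirit (not the quadtree-based STTM code): drop a frame's tokens that are nearly identical to their counterparts in the previous frame; the cosine-similarity threshold is an assumption.

```python
import torch
import torch.nn.functional as F

def merge_temporal_tokens(frames: torch.Tensor, sim_thresh: float = 0.9):
    """frames: (num_frames, num_tokens, dim). Returns a list of kept-token tensors per frame."""
    kept = [frames[0]]  # keep the first frame in full
    for t in range(1, frames.shape[0]):
        # Per-token similarity to the same spatial position in the previous frame.
        sim = F.cosine_similarity(frames[t], frames[t - 1], dim=-1)  # (num_tokens,)
        kept.append(frames[t][sim < sim_thresh])  # drop tokens redundant with the previous frame
    return kept
```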
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
- Title: Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
- Authors: Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, Richong Zhang
- Date: 2025-07-05
- ArXiv page: https://arxiv.org/abs/2507.04009
- GitHub repo: https://github.com/ConardLi/easy-dataset
Abstract
Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using public-available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.
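An illustrative sketch of the chunk-then-generate flow described above; `llm_call`, the persona list, and the prompt wording are placeholders rather than the Easy Dataset API.

```python
def synthesize_qa_pairs(document_text, llm_call, personas, chunk_size=1500):
    """Split a document into chunks and generate persona-conditioned QA pairs per chunk."""
    chunks = [document_text[i:i + chunk_size]
              for i in range(0, len(document_text), chunk_size)]
    pairs = []
    for chunk in chunks:
        for persona in personas:  # persona-driven prompting for diverse questions
            prompt = (f"You are {persona}. Based only on the passage below, "
                      f"write one question and its answer.\n\n{chunk}")
            pairs.append(llm_call(prompt))  # expected to return a (question, answer) record
    return pairs  # a human-in-the-loop review step would then filter/refine these pairs
```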
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
- Title: CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
- Authors: Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang
- Date: 2025-07-08
- ArXiv page: https://arxiv.org/abs/2507.06181
- Paper link: https://arxiv.org/pdf/2507.06181
- Project page: https://github.com/multimodal-art-projection/CriticLean
- GitHub repo: https://github.com/multimodal-art-projection/CriticLean
Abstract
Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase-the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models' ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
- Title: DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
- Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
- Date: 2025-07-06
- ArXiv page: https://arxiv.org/abs/2507.04447
- Paper link: https://arxiv.org/pdf/2507.04447
- Project page: https://zhangwenyao1.github.io/DreamVLA/
- GitHub repo: https://github.com/Zhangwenyao1/DreamVLA
Abstract
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
- Title: OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
- Authors: JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang
- Date: 2025-07-10
- ArXiv page: https://arxiv.org/abs/2507.07984
- Paper link: https://arxiv.org/pdf/2507.07984
- Project page: https://rbler1234.github.io/OSTBench.github.io/
- GitHub repo: https://github.com/OpenRobotLab/OST-Bench
Abstract
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
- Title: 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
- Authors: Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue
- Date: 2025-07-07
- ArXiv page: https://arxiv.org/abs/2507.05163
- Paper link: https://arxiv.org/pdf/2507.05163
- Project page: https://openimaginglab.github.io/4DSloMo/
Abstract
Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system only using low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
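A back-of-the-envelope sketch of the asynchronous capture schedule: staggering the start times of four 25 FPS cameras by a quarter of the frame period yields a merged timeline at an effective 100 FPS. The function below is purely illustrative.

```python
def staggered_timestamps(num_cameras=4, base_fps=25, num_frames=5):
    """Return (camera_id, time_in_seconds) pairs for a staggered multi-camera schedule."""
    period = 1.0 / base_fps              # 40 ms between frames of one camera
    offset = period / num_cameras        # 10 ms stagger -> 4 x 25 = 100 FPS effective
    stamps = [(cam, cam * offset + k * period)
              for cam in range(num_cameras) for k in range(num_frames)]
    return sorted(stamps, key=lambda s: s[1])  # interleaved timeline at 100 FPS
```

Because each timestamp is observed by only a subset of cameras, the paper pairs this schedule with a video-diffusion model that repairs sparse-view reconstruction artifacts.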
Pre-Trained Policy Discriminators are General Reward Models
- Title: Pre-Trained Policy Discriminators are General Reward Models
- Authors: Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen
- Date: 2025-07-07
- ArXiv page: https://arxiv.org/abs/2507.05197
- GitHub repo: https://github.com/InternLM/POLAR
Abstract
We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
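A hedged sketch of how a POLAR-style reward could be used at fine-tuning time, as described in the abstract: the reward model scores a candidate trajectory by how closely it matches a reference (target-policy) trajectory, rather than by an absolute preference. The `reward_model.score` interface below is an assumption, not the released POLAR API.

```python
def polar_style_reward(reward_model, prompt, candidate, reference):
    """Relative reward: higher when the candidate behaves more like the reference trajectory."""
    return reward_model.score(prompt=prompt, trajectory=candidate, reference=reference)

def rank_candidates(reward_model, prompt, candidates, reference):
    # Rank sampled rollouts for an RL update by their similarity to the target policy's behavior.
    return sorted(candidates,
                  key=lambda c: polar_style_reward(reward_model, prompt, c, reference),
                  reverse=True)
```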
LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS
- Title: LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS
- Authors: Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, Hanspeter Pfister
- Date: 2025-07-09
- ArXiv page: https://arxiv.org/abs/2507.07136
- Paper link: https://arxiv.org/pdf/2507.07136
- Project page: https://langsplat-v2.github.io/
- GitHub repo: https://github.com/ZhaoYujie2002/LangSplatV2
Abstract
In this paper, we introduce LangSplatV2, which achieves high-dimensional feature splatting at 476.2 FPS and 3D open-vocabulary text querying at 384.6 FPS for high-resolution images, providing a 42 times speedup and a 47 times boost over LangSplat respectively, along with improved query accuracy. LangSplat employs Gaussian Splatting to embed 2D CLIP language features into 3D, significantly enhancing speed and learning a precise 3D language field with SAM semantics. Such advancements in 3D language fields are crucial for applications that require language interaction within complex scenes. However, LangSplat does not yet achieve real-time inference performance (8.2 FPS), even with advanced A100 GPUs, severely limiting its broader application. In this paper, we first conduct a detailed time analysis of LangSplat, identifying the heavyweight decoder as the primary speed bottleneck. Our solution, LangSplatV2 assumes that each Gaussian acts as a sparse code within a global dictionary, leading to the learning of a 3D sparse coefficient field that entirely eliminates the need for a heavyweight decoder. By leveraging this sparsity, we further propose an efficient sparse coefficient splatting method with CUDA optimization, rendering high-dimensional feature maps at high quality while incurring only the time cost of splatting an ultra-low-dimensional feature. Our experimental results demonstrate that LangSplatV2 not only achieves better or competitive query accuracy but is also significantly faster. Codes and demos are available at our project page: https://langsplat-v2.github.io.
中文摘要
在本文中,我们介绍了LangSplatV2,它在高分辨率图像上实现了476.2 FPS的高维特征泼溅(splatting)和384.6 FPS的3D开放词汇文本查询,相比LangSplat分别带来42倍和47倍的加速,同时查询准确率也有所提升。LangSplat采用高斯泼溅将2D CLIP语言特征嵌入到3D中,显著提升了速度,并借助SAM语义学习精确的3D语言场。3D语言场的此类进步对于需要在复杂场景中进行语言交互的应用至关重要。然而,即使使用先进的A100 GPU,LangSplat仍未达到实时推理性能(8.2 FPS),严重限制了其更广泛的应用。在本文中,我们首先对LangSplat进行了详细的耗时分析,确定重量级解码器是主要的速度瓶颈。我们的解决方案LangSplatV2假设每个高斯都是全局字典中的一个稀疏编码,从而学习一个3D稀疏系数场,完全消除了对重量级解码器的需求。利用这种稀疏性,我们进一步提出了一种带CUDA优化的高效稀疏系数泼溅方法,能够以高质量渲染高维特征图,而仅付出泼溅超低维特征的时间成本。实验结果表明,LangSplatV2不仅达到了更好或相当的查询准确率,而且速度显著更快。代码和演示可在我们的项目页面上找到:https://langsplat-v2.github.io。
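下面用一个极简的NumPy草图说明"低维稀疏系数泼溅 + 全局字典"如何避开重量级解码器:光栅化阶段只渲染K维系数图,高维CLIP特征图通过一次矩阵乘法解码得到。其中的维度数值与变量名均为示意性假设,并非LangSplatV2的官方实现。

```python
# 示意性草图:用"低维稀疏系数泼溅 + 全局字典"复原高维特征(假设的简化实现)
import numpy as np

H, W = 4, 4          # 渲染分辨率(示意)
K, D = 8, 512        # K: 字典原子数(低维系数维度),D: CLIP 特征维度(高维)

dictionary = np.random.randn(K, D).astype(np.float32)     # 全局码本,所有高斯共享

# 假设光栅化阶段只对每个高斯的 K 维稀疏系数做 alpha 混合,得到每像素系数图
rendered_coeffs = np.random.rand(H, W, K).astype(np.float32)

# 高维特征图只需一次矩阵乘法即可解码,无需逐像素运行重量级解码器
feature_map = rendered_coeffs @ dictionary                 # [H, W, D]
print(feature_map.shape)                                    # (4, 4, 512)
```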
GPT-4O对愿景的了解如何?评估标准计算机视觉任务的多模式基础模型
- 标题: How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
- 作者: Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
- 日期: 2025-07-02
- ArXiv主页 : https://arxiv.org/abs/2507.01955
- 论文链接 : https://arxiv.org/pdf/2507.01955
- 项目链接 : https://fm-vision-evals.epfl.ch/
- gitHub仓库 : https://github.com/EPFL-VILAB/fm-vision-evals
英文摘要
Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
中文摘要
多模态基础模型(例如GPT-4o)最近取得了显著进展,但尚不清楚这些模型在理解视觉方面究竟处于什么水平。在本文中,我们使用既有数据集(例如COCO、ImageNet及其变体等),在标准计算机视觉任务(语义分割、目标检测、图像分类、深度与表面法线预测)上对流行的多模态基础模型(GPT-4o、o4-mini、Gemini 1.5 Pro和Gemini 2.0 Flash、Claude 3.5 Sonnet、Qwen2-VL、Llama 3.2)进行了基准测试。这样做的主要挑战是:1)大多数模型被训练成输出文本,无法原生表达分割区域或3D几何等多样化的输出形式;2)许多领先的模型是专有的,只能通过API访问,即无法获取权重来对其进行适配。我们通过提示链(prompt chaining)将标准视觉任务转换为等价的、可用文本提示且与API兼容的任务,从而构建一个标准化的基准测试框架来应对这些挑战。我们观察到:1)这些模型在任何任务上都没有接近最先进的专用模型;然而,2)它们是可敬的通才,考虑到它们大概率主要在图像-文本任务上训练,这一点相当了不起;3)它们执行语义任务明显好于几何任务;4)虽然提示链技术会影响性能,但更好的模型对提示变化的敏感性更小;5)GPT-4o在非推理模型中表现最好,在6项任务中的4项中位列第一;6)推理模型(例如o3)在几何任务上有所改进;7)对具有原生图像生成能力的模型(如最新的GPT-4o)的初步分析表明,它们会表现出幻觉和空间错位等怪癖。
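下面的草图演示论文所述"提示链"思路的一种可能形式:把上千类的图像分类拆成"先选粗粒度组、再在组内细选"的多轮选择题,从而与只接受文本提示的API兼容。其中的 call_model 为假设的占位函数,分组与提示词写法也只是示意,并非论文实际使用的提示链。

```python
# 示意性草图:通过提示链把图像分类改写成 API 可用的多轮选择题(call_model 为假设的占位函数)
from typing import Callable, Sequence

def classify_by_prompt_chaining(
    image: bytes,
    label_groups: dict[str, Sequence[str]],
    call_model: Callable[[bytes, str], str],
) -> str:
    # 第一步:让模型从粗粒度组名中选择,避免一次性列出上千个类别
    group_prompt = "这张图片属于下列哪一组?只回答组名:" + ", ".join(label_groups)
    group = call_model(image, group_prompt).strip()

    # 第二步:在选中的组内做细粒度选择;若组名不匹配则回退到全部标签
    labels = label_groups.get(group) or [l for ls in label_groups.values() for l in ls]
    label_prompt = "在下列标签中选择最符合图片的一个,只回答标签:" + ", ".join(labels)
    return call_model(image, label_prompt).strip()
```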
跳过一层还是循环?测试时间深度适应验证的LLM
- 标题: Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs
- 作者: Ziyue Li, Yang Li, Tianyi Zhou
- 日期: 2025-07-10
- ArXiv主页 : https://arxiv.org/abs/2507.07996
- 论文链接 : https://arxiv.org/pdf/2507.07996
英文摘要
Can a pretrained neural network adapt its architecture to different inputs without any finetuning? Do we need all layers for simple tasks, and are they adequate for challenging tasks? We found that the layers of a pretrained large language model (LLM) can be manipulated as separate modules to build a better and even shallower model customized for each test sample. In particular, each layer from the pretrained model can be skipped/pruned or repeated multiple times as recurrent neural networks (RNN), and stacked with others in arbitrary orders, yielding a chain-of-layers (CoLa) per sample. This compositional space greatly expands the scope of existing works on looped/recurrent pretrained modules, layer pruning, or early-exit networks. We develop a Monte Carlo Tree Search (MCTS) protocol to explore and identify the optimal CoLa for each sample from math and commonsense reasoning benchmarks. Compared to a static model of a fixed depth, CoLa allows shortcut paths (fast thinking), recurrence of the same layer(s) (slow thinking), and combining both, offering more flexible, dynamic architectures for different inputs. We conduct an extensive analysis of the MCTS-optimized CoLa, which leads to two key findings: (1) For >75% of samples with correct predictions by the original LLM, we can find shorter CoLa, suggesting a large space for improving inference efficiency; (2) For >60% of samples with originally incorrect predictions, we can identify CoLa achieving correct predictions, suggesting a large space of performance enhancement. Our results highlight the shortcomings of using a fixed architecture of pre-trained LLMs for inference on different samples and pave the way to unlock the generalization power of test-time depth adaptation.
中文摘要
预训练的神经网络能否在不进行任何微调的情况下,针对不同输入调整其架构?对于简单任务我们是否需要所有层,而对于有挑战性的任务这些层是否足够?我们发现,预训练大语言模型(LLM)的各层可以被当作独立模块进行操控,从而为每个测试样本定制一个更好甚至更浅的模型。具体而言,预训练模型中的每一层都可以被跳过/剪除,或者像循环神经网络(RNN)那样重复多次,并以任意顺序与其他层堆叠,为每个样本生成一条层链(CoLa)。这一组合空间大大扩展了现有关于循环/复用预训练模块、层剪枝或早退(early-exit)网络的工作范围。我们开发了一种蒙特卡洛树搜索(MCTS)协议,用于在数学和常识推理基准的每个样本上探索并确定最优的CoLa。与固定深度的静态模型相比,CoLa允许捷径路径(快思考)、同一层的循环复用(慢思考)以及两者的结合,为不同输入提供更灵活、动态的架构。我们对MCTS优化得到的CoLa进行了广泛分析,得出两个关键发现:(1)在原始LLM预测正确的样本中,超过75%可以找到更短的CoLa,表明提高推理效率的空间很大;(2)在原本预测错误的样本中,超过60%可以找到给出正确预测的CoLa,表明性能提升的空间很大。我们的结果凸显了使用固定架构的预训练LLM对不同样本进行推理的不足,并为释放测试时深度自适应的泛化能力铺平了道路。
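下面用一个玩具模型演示"层链(CoLa)"的执行方式:给定一个层索引序列,未出现的索引即被跳过,重复出现的索引即同一层被循环复用。模型结构与索引序列均为示意,MCTS搜索过程未包含在内。

```python
# 示意性草图:按给定"层链"(CoLa)执行预训练模型的层(玩具模型,非论文官方代码)
import torch
import torch.nn as nn

class ToyLLM(nn.Module):
    def __init__(self, dim: int = 16, num_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward_with_cola(self, h: torch.Tensor, cola: list[int]) -> torch.Tensor:
        # cola 是层索引序列:缺失的索引即"跳过",重复的索引即"循环使用同一层"
        for idx in cola:
            h = torch.relu(self.layers[idx](h))
        return h

model = ToyLLM()
h = torch.randn(2, 16)
print(model.forward_with_cola(h, cola=[0, 2, 2, 5]).shape)   # 跳过第1/3/4层,重复第2层
```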
几何强迫:结合视频扩散和3D表示,以保持一致的世界建模
- 标题: Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- 作者: Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian
- 日期: 2025-07-10
- ArXiv主页 : https://arxiv.org/abs/2507.07982
- 论文链接 : https://arxiv.org/pdf/2507.07982
- 项目链接 : https://geometryforcing.github.io/
英文摘要
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
中文摘要
视频本质上是动态3D世界的2D投影。然而,我们的分析表明,仅在原始视频数据上训练的视频扩散模型往往无法在其学习到的表示中捕获有意义的几何感知结构。为了弥合视频扩散模型与物理世界固有3D属性之间的差距,我们提出了几何强制(Geometry Forcing),这是一种简单而有效的方法,鼓励视频扩散模型内化潜在的3D表示。我们的关键见解是,通过将模型的中间表示与预训练几何基础模型的特征对齐,引导其朝向几何感知的结构。为此,我们引入了两个互补的对齐目标:角度对齐,通过余弦相似度强制方向一致性;以及尺度对齐,通过从归一化的扩散表示回归未归一化的几何特征来保留尺度相关信息。我们在相机视角条件和动作条件的视频生成任务上评估了几何强制。实验结果表明,我们的方法相比基线方法显著提升了视觉质量和3D一致性。项目页面:https://GeometryForcing.github.io。
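依据摘要对两个对齐目标的描述,下面给出一个假设性的损失函数草图:角度对齐用余弦相似度约束方向一致,尺度对齐用回归损失从归一化扩散表示预测未归一化的几何特征。特征维度、损失权重与回归头的具体形式均为示意性假设。

```python
# 示意性草图:几何强制中的两个对齐损失(按摘要描述的假设性实现,细节为示意)
import torch
import torch.nn.functional as F

def angular_alignment(diffusion_feat, geo_feat):
    """角度对齐:用余弦相似度约束中间表示与几何基础模型特征的方向一致。"""
    return 1.0 - F.cosine_similarity(diffusion_feat, geo_feat, dim=-1).mean()

def scale_alignment(pred_geo_feat, geo_feat):
    """尺度对齐:从(归一化的)扩散表示回归未归一化的几何特征,以保留尺度信息。"""
    return F.mse_loss(pred_geo_feat, geo_feat)

diff_feat = torch.randn(4, 256)     # 视频扩散模型的中间表示(示意)
geo_feat = torch.randn(4, 256)      # 预训练几何基础模型的特征(示意)
pred = torch.randn(4, 256)          # 由归一化扩散表示经回归头得到的预测(示意)
loss = angular_alignment(diff_feat, geo_feat) + scale_alignment(pred, geo_feat)
print(loss)
```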
Pyvision:具有动态工具的代理视觉
- 标题: PyVision: Agentic Vision with Dynamic Tooling
- 作者: Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei
- 日期: 2025-07-10
- ArXiv主页 : https://arxiv.org/abs/2507.07998
- 论文链接 : https://arxiv.org/pdf/2507.07998
- 项目链接 : https://agent-x.space/pyvision/
- gitHub仓库 : https://github.com/agents-x-project/PyVision
英文摘要
LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
中文摘要
LLM越来越多地被部署为代理,即能够规划、推理并动态调用外部工具的系统。然而在视觉推理中,现有方法在很大程度上仍受限于预定义的工作流和静态工具集。在本报告中,我们提出了PyVision,这是一个交互式、多轮的框架,使MLLM能够自主生成、执行并改进为当前任务量身定制的基于Python的工具,从而解锁灵活且可解释的问题求解。我们为PyVision所创建的工具建立了分类体系,并在一组多样化的基准测试上分析了它们的使用情况。在定量结果上,PyVision取得了持续的性能提升:在V*上将GPT-4.1提升了+7.8%,在VLMsAreBlind-mini上将Claude-4.0-Sonnet提升了+31.1%。这些结果指向一个更广泛的转变:动态工具使模型不仅能使用工具,还能发明工具,朝着更具代理性的视觉推理前进。
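下面是PyVision式"生成代码-执行-把结果反馈给模型"交互循环的一个极简草图。其中 call_mllm 为假设的占位函数,代码执行也未做真正的沙箱隔离,仅用于说明多轮动态工具的控制流。

```python
# 示意性草图:PyVision 式"生成-执行-反馈"多轮循环(call_mllm 为假设的占位函数)
from typing import Callable

def agentic_vision_loop(task: str, call_mllm: Callable[[str], str], max_turns: int = 3) -> str:
    context = f"任务:{task}\n请生成可执行的 Python 代码来分析图像;若已得到答案,请以 ANSWER: 开头作答。"
    for _ in range(max_turns):
        reply = call_mllm(context)
        if reply.startswith("ANSWER:"):
            return reply
        namespace: dict = {}
        try:
            exec(reply, namespace)                      # 执行模型即时生成的工具代码(示意,未做隔离)
            result = namespace.get("result", "(无 result 变量)")
        except Exception as e:                          # 把报错也反馈给模型,供其修正工具
            result = f"执行出错: {e}"
        context += f"\n代码输出:{result}\n请据此继续。"
    return "未在限定轮数内得到答案"
```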
RLVER:通过可智力的情感奖励的加强学习
- 标题: RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
- 作者: Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
- 日期: 2025-07-03
- ArXiv主页 : https://arxiv.org/abs/2507.03112
- 论文链接 : https://arxiv.org/pdf/2507.03112
英文摘要
Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue-especially for emotional intelligence-remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends--thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better-moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.
中文摘要
大型语言模型(LLM)在逻辑和算法推理方面表现出色,但它们的情商(EQ)仍远远落后于其认知能力。虽然基于可验证奖励的强化学习(RLVR)在其他领域已取得进展,但其在对话、尤其是情商方面的应用仍未得到充分探索。在这项工作中,我们介绍了RLVER,这是第一个端到端的强化学习框架,它利用来自模拟用户的可验证情感奖励来培养LLM的高阶共情能力。在该框架内,自洽的情感模拟用户参与对话推演,并在对话过程中产生确定性的情感分数,作为指导LLM学习的奖励信号。使用PPO对公开可用的Qwen2.5-7B-Instruct模型进行微调,可将其Sentient-Benchmark分数从13.3提高到79.2,同时在很大程度上保留了数学和编码能力。大量实验表明:(i)RLVER持续提升多项对话能力;(ii)思考型和非思考型模型表现出不同的趋势——思考型模型在共情和洞察方面表现突出,而非思考型模型更倾向于行动;(iii)GRPO通常带来稳定的增益,而PPO可以将某些能力推向更高的上限;(iv)更具挑战性的环境并不总是更好,难度适中的环境可能产生更强的结果。我们的结果表明,RLVER是通往情感智能且能力全面的语言代理的一条实用路径。
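下面给出一个示意性的对话推演草图,说明"模拟用户的确定性情感分数作为可验证奖励"可以如何接入RL训练:每一轮以情感分数的增量作为奖励。其中的奖励定义与占位函数均为假设,并非RLVER的官方设计。

```python
# 示意性草图:用模拟用户的确定性情感分数作为对话回合的奖励信号(占位函数,非官方实现)
from typing import Callable

def rollout_with_emotion_reward(
    policy_reply: Callable[[list[str]], str],                        # 待训练策略:根据对话历史生成回复
    user_reply_and_score: Callable[[list[str]], tuple[str, float]],  # 模拟用户:返回回复与情感分数
    num_turns: int = 4,
) -> tuple[list[str], list[float]]:
    history, rewards, prev_score = [], [], 0.0
    for _ in range(num_turns):
        history.append(policy_reply(history))
        user_msg, score = user_reply_and_score(history)
        history.append(user_msg)
        rewards.append(score - prev_score)    # 以情感分数的提升作为可验证奖励(示意设计)
        prev_score = score
    return history, rewards
```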
Robobrain 2.0技术报告
- 标题: RoboBrain 2.0 Technical Report
- 作者: BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Shanyu Rong, Zhengliang Cai, Bolun Zhang, Shuyi Zhang, Huaihai Lyu, Mengfei Du, Lingfeng Zhang, Xi Feng, Xiaodan Liu, Yance Jiao, Chenrui He, Mengsi Lyu, Zhuo Chen, Yulong Ao, Xue Sun, Zheqi He, Jingshu Zheng, Xi Yang, Donghai Shi, Kunchang Xie, Bochao Zhang, Shaokai Nie, Chunlei Men, Yonghua Lin, Zhongyuan Wang, Tiejun Huang, Shanghang Zhang
- 日期: 2025-07-02
- ArXiv主页 : https://arxiv.org/abs/2507.02029
- 论文链接 : https://arxiv.org/pdf/2507.02029
英文摘要
We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain 2.0 achieves strong performance across a wide spectrum of embodied reasoning tasks. On both spatial and temporal benchmarks, the 32B variant achieves leading results, surpassing prior open-source and proprietary models. In particular, it supports key real-world embodied AI capabilities, including spatial understanding (e.g., affordance prediction, spatial referring, trajectory forecasting) and temporal decision-making (e.g., closed-loop interaction, multi-agent long-horizon planning, and scene graph updating). This report details the model architecture, data construction, multi-stage training strategies, infrastructure and practical applications. We hope RoboBrain 2.0 advances embodied AI research and serves as a practical step toward building generalist embodied agents. The code, checkpoint and benchmark are available at https://superrobobrain.github.io.
中文摘要
我们介绍了RoboBrain 2.0,这是我们最新一代的具身视觉-语言基础模型,旨在为物理环境中的复杂具身任务统一感知、推理与规划。它有两个版本:轻量级的7B模型和全尺寸的32B模型,采用由视觉编码器和语言模型组成的异构架构。尽管规模紧凑,RoboBrain 2.0在广泛的具身推理任务上取得了强劲的性能。在空间和时间基准上,32B版本均取得领先结果,超越了此前的开源和专有模型。特别地,它支持现实世界中关键的具身AI能力,包括空间理解(例如可供性预测、空间指代、轨迹预测)和时间决策(例如闭环交互、多智能体长时程规划以及场景图更新)。本报告详细介绍了模型架构、数据构建、多阶段训练策略、基础设施和实际应用。我们希望RoboBrain 2.0能够推进具身AI研究,并成为迈向构建通用具身智能体的一个实际步骤。代码、检查点和基准可在https://superrobobrain.github.io上获取。