中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- QeRL:超越效率——面向 LLM 的量化增强强化学习
- 使用表示自编码器的扩散 Transformer
- 空间强迫:视觉-语言-动作模型的隐式空间表示对齐
- [D2E:扩展桌面数据上的视觉-动作预训练以迁移到具身 AI](#D2E:扩展桌面数据上的视觉-动作预训练以迁移到具身 AI)
- 用相机思考:用于以相机为中心的理解和生成的统一多模态模型
- 机器人学习:教程
- [当模型说谎时,我们学习:使用 PsiloQA 进行多语言跨度级幻觉检测](#当模型说谎时,我们学习:使用 PsiloQA 进行多语言跨度级幻觉检测)
- 通过自监督预训练推进端到端像素空间生成建模
- 代理熵平衡策略优化
- [PaddleOCR-VL:通过 0.9B 超紧凑视觉语言模型促进多语言文档解析](#PaddleOCR-VL:通过 0.9B 超紧凑视觉语言模型促进多语言文档解析)
- 扩展以语言为中心的全模态表征学习
- DITING:用于网络小说翻译基准测试的多智能体评估框架
- [WithAnyone:实现可控且 ID 一致的图像生成](#WithAnyone:实现可控且 ID 一致的图像生成)
- KORMo:适合所有人的韩国开放推理模型
- 人工智能服务:借助人工智能眼镜主动提供帮助
- [FlashWorld:几秒内生成高质量 3D 场景](#FlashWorld:几秒内生成高质量 3D 场景)
- 从像素到文字——大规模迈向原生视觉语言原语
- [UniMoE-Audio:利用动态容量 MoE 生成统一语音和音乐](#UniMoE-Audio:利用动态容量 MoE 生成统一语音和音乐)
- 注意力揭示了 LLM 的推理:预规划-锚定节奏实现细粒度策略优化
- [Bee:一个高质量的语料库和全栈套件,可解锁高级的完全开放的 MLLM](#Bee:一个高质量的语料库和全栈套件,可解锁高级的完全开放的 MLLM)
- ImagerySearch:超越语义依赖约束的视频生成自适应测试时搜索
- BitNet蒸馏
- 潜在细化解码:通过细化信念状态来增强基于扩散的语言模型
- AutoPR:让您的学术推广自动化!
- StreamingVLM:实时理解无限视频流
- [RAG-Anything:一体化 RAG 框架](#RAG-Anything:一体化 RAG 框架)
- [多模式提示优化:为什么不利用 MLLM 的多种模式](#多模式提示优化:为什么不利用 MLLM 的多种模式)
- [使用大型语言模型进行 Vibe 编码的调查](#使用大型语言模型进行 Vibe 编码的调查)
- 标签:抗幻觉扩散采样的切向放大指导
QeRL:超越效率——面向 LLM 的量化增强强化学习
- 标题: QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
- 作者: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
- 日期: 2025-10-13
- ArXiv主页 : https://arxiv.org/abs/2510.11696
- 论文链接 : https://arxiv.org/pdf/2510.11696
- 项目链接 : https://github.com/NVlabs/QeRL
- gitHub仓库 : https://github.com/NVlabs/QeRL
英文摘要
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration, and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
中文摘要
我们提出了 QeRL,一种面向大型语言模型(LLM)的量化增强强化学习框架。虽然强化学习对 LLM 的推理能力至关重要,但它是资源密集型的,需要大量 GPU 内存和较长的 rollout(采样推演)时间。QeRL 通过将 NVFP4 量化与低秩适应(LoRA)相结合来解决这些问题,在加速 RL 的 rollout 阶段的同时降低内存开销。除了效率之外,我们的研究结果表明,量化噪声会增加策略熵,增强探索,并能够在强化学习期间发现更好的策略。为了进一步优化探索,QeRL 引入了自适应量化噪声(AQN)机制,可在训练期间动态调整噪声。实验表明,QeRL 在 rollout 阶段可提供超过 1.5 倍的加速。此外,这是第一个能够在单个 H100 80GB GPU 上对 32B LLM 进行 RL 训练的框架,同时为整体 RL 训练带来加速。它还实现了比 16 位 LoRA 和 QLoRA 更快的奖励增长和更高的最终精度,同时在 7B 模型上达到了与全参数微调相当的数学基准性能,例如 GSM8K(90.8%)和 MATH 500(77.4%)。这些结果将 QeRL 确立为一个高效且有效的 LLM 强化学习训练框架。
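下面给出自适应量化噪声(AQN)思路的一个极简示意(非官方实现):在伪量化后的权重上注入随训练步数衰减的高斯扰动,以提高策略熵、促进探索。其中 fake_quantize_4bit、噪声日程与各超参数均为示意性假设,并非 NVFP4 的真实格式。

```python
import torch

def fake_quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    """对权重做对称 4-bit 伪量化(示意,并非 NVFP4 的真实格式)。"""
    scale = w.abs().max() / 7.0 + 1e-12          # int4 对称量化范围 [-7, 7]
    return torch.clamp(torch.round(w / scale), -7, 7) * scale

def adaptive_noise_scale(step: int, total_steps: int,
                         sigma_start: float = 1e-2, sigma_end: float = 1e-3) -> float:
    """假设的噪声日程:随训练进行从 sigma_start 线性衰减到 sigma_end。"""
    t = min(step / max(total_steps, 1), 1.0)
    return sigma_start + (sigma_end - sigma_start) * t

def quantized_forward_with_aqn(x: torch.Tensor, w: torch.Tensor,
                               step: int, total_steps: int) -> torch.Tensor:
    """前向传播:量化权重 + 自适应高斯噪声,用扰动增大策略熵、鼓励探索。"""
    w_q = fake_quantize_4bit(w)
    sigma = adaptive_noise_scale(step, total_steps)
    noise = torch.randn_like(w_q) * sigma * w_q.abs().mean()
    return x @ (w_q + noise).T

if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(2, 16)           # batch=2, hidden=16
    w = torch.randn(32, 16)          # 输出维度 32
    y = quantized_forward_with_aqn(x, w, step=100, total_steps=1000)
    print(y.shape)                   # torch.Size([2, 32])
```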
使用表示自编码器的扩散 Transformer
- 标题: Diffusion Transformers with Representation Autoencoders
- 作者: Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
- 日期: 2025-10-13
- ArXiv主页 : https://arxiv.org/abs/2510.11690
- 论文链接 : https://arxiv.org/pdf/2510.11690
- 项目链接 : https://rae-dit.github.io/
- gitHub仓库 : https://github.com/bytetriper/RAE
英文摘要
Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.
中文摘要
潜在生成建模(即由预训练自编码器将像素映射到潜在空间再进行扩散过程)已成为扩散 Transformer(DiT)的标准策略;然而,其中的自编码器组件几乎没有演进。大多数 DiT 仍然依赖原始的 VAE 编码器,这带来了若干限制:过时的主干网络损害了架构的简洁性,低维潜在空间限制了信息容量,纯粹基于重建的训练产生的弱表示最终限制了生成质量。在这项工作中,我们探索用预训练的表示编码器(例如 DINO、SigLIP、MAE)搭配训练好的解码器来取代 VAE,形成我们所说的表示自编码器(RAE)。这些模型既提供高质量的重建,又提供语义丰富的潜在空间,同时允许可扩展的基于 Transformer 的架构。由于这些潜在空间通常是高维的,一个关键挑战是让扩散 Transformer 能够在其中有效运行。我们分析了这一困难的来源,提出了有理论依据的解决方案,并通过实验加以验证。我们的方法无需辅助的表示对齐损失即可实现更快的收敛。使用配备轻量、宽 DDT 头的 DiT 变体,我们在 ImageNet 上取得了强大的图像生成结果:256x256(无引导)下 FID 为 1.51,256x256 和 512x512(有引导)下 FID 均为 1.13。RAE 具有明显的优势,应该成为扩散 Transformer 训练的新默认选择。
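下面是表示自编码器(RAE)思路的一个最小草图:冻结的"预训练表示编码器"(此处用随机初始化的小网络占位,实际应换成 DINO/SigLIP/MAE 等)给出高维语义潜变量,只训练解码器做像素重建;网络结构与维度均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEncoder(nn.Module):
    """占位的"预训练表示编码器":实际应替换为 DINO/SigLIP/MAE 等并冻结参数。"""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        for p in self.parameters():
            p.requires_grad_(False)              # 冻结编码器

    def forward(self, x):                        # x: (B, 3, 32, 32)
        return self.backbone(x)                  # (B, dim) 高维语义潜变量

class Decoder(nn.Module):
    """可训练的解码器:把语义潜变量映射回像素。"""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))

    def forward(self, z):
        return self.net(z)

encoder, decoder = FrozenEncoder(), Decoder()
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

x = torch.rand(4, 3, 32, 32)                     # 伪造的一批图像
z = encoder(x)                                   # 语义丰富的高维潜空间
loss = F.mse_loss(decoder(z), x)                 # 仅用重建损失训练解码器
loss.backward()
opt.step()
print(z.shape, loss.item())
```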
空间强迫:视觉-语言-动作模型的隐式空间表示对齐
- 标题: Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model
- 作者: Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, Haoang Li
- 日期: 2025-10-14
- ArXiv主页 : https://arxiv.org/abs/2510.12276
- 论文链接 : https://arxiv.org/pdf/2510.12276
- 项目链接 : https://spatial-forcing.github.io/
- gitHub仓库 : https://github.com/OpenHelix-Team/Spatial-Forcing
英文摘要
Vision-language-action (VLA) models have recently shown strong potential in enabling robots to follow language instructions and execute precise actions. However, most VLAs are built upon vision-language models pretrained solely on 2D data, which lack accurate spatial awareness and hinder their ability to operate in the 3D physical world. Existing solutions attempt to incorporate explicit 3D sensor inputs such as depth maps or point clouds, but these approaches face challenges due to sensor noise, hardware heterogeneity, and incomplete depth coverage in existing datasets. Alternative methods that estimate 3D cues from 2D images also suffer from the limited performance of depth estimators.We propose Spatial Forcing (SF), a simple yet effective alignment strategy that implicitly forces VLA models to develop spatial comprehension capabilities without relying on explicit 3D inputs or depth estimators. SF aligns intermediate visual embeddings of VLAs with geometric representations produced by pretrained 3D foundation models. By enforcing alignment at intermediate layers, SF guides VLAs to encode richer spatial representations that enhance action precision.Extensive experiments in simulation and real-world environments demonstrate that SF achieves state-of-the-art results, surpassing both 2D- and 3D-based VLAs. SF further accelerates training by up to 3.8x and improves data efficiency across diverse robotic tasks. Project page is at https://spatial-forcing.github.io/
中文摘要
视觉-语言-动作(VLA)模型最近在使机器人遵循语言指令并执行精确动作方面显示出强大的潜力。然而,大多数 VLA 都是建立在仅基于 2D 数据预训练的视觉语言模型之上,缺乏准确的空间感知并阻碍了它们在 3D 物理世界中运行的能力。现有解决方案尝试合并显式 3D 传感器输入,例如深度图或点云,但由于传感器噪声、硬件异构性和现有数据集中不完整的深度覆盖,这些方法面临挑战。从 2D 图像估计 3D 线索的替代方法也受到深度估计器性能有限的影响。我们提出了空间强迫(SF),这是一种简单而有效的对齐策略,隐式迫使 VLA 模型发展空间理解能力,而不依赖于显式 3D 输入或深度估计器。SF 将 VLA 的中间视觉嵌入与预训练的 3D 基础模型生成的几何表示对齐。通过在中间层强制对齐,SF 引导 VLA 编码更丰富的空间表示,从而提高动作精度。模拟和现实环境中的大量实验表明,SF 取得了最先进的结果,超越了基于 2D 和 3D 的 VLA。SF 将训练速度进一步提高了 3.8 倍,并提高了各种机器人任务的数据效率。项目页面位于 https://spatial-forcing.github.io/
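下面用一个示意性的损失函数说明空间强迫(SF)的核心思想(非官方实现):把 VLA 中间层的视觉嵌入经过一个线性投影头后,与冻结的 3D 基础模型给出的几何表示做余弦相似度对齐;投影头、特征维度和 token 数均为假设。

```python
import torch
import torch.nn.functional as F

def spatial_alignment_loss(vla_feats: torch.Tensor,
                           geo_feats: torch.Tensor,
                           proj: torch.nn.Module) -> torch.Tensor:
    """vla_feats: (B, N, D_vla) VLA 中间层的视觉 token 嵌入
       geo_feats: (B, N, D_geo) 3D 基础模型输出的几何表示(假设已逐 token 对应)
       proj:      把 VLA 特征投影到几何特征维度的线性头(假设的组件)
    """
    pred = proj(vla_feats)                                  # (B, N, D_geo)
    return 1.0 - F.cosine_similarity(pred, geo_feats, dim=-1).mean()

B, N, D_vla, D_geo = 2, 196, 1024, 384
proj = torch.nn.Linear(D_vla, D_geo)
vla_feats = torch.randn(B, N, D_vla, requires_grad=True)
geo_feats = torch.randn(B, N, D_geo)                        # 实际来自冻结的 3D 基础模型
loss = spatial_alignment_loss(vla_feats, geo_feats, proj)
# 总损失 = 动作损失 + lambda * 对齐损失(lambda 为假设的超参数)
loss.backward()
print(loss.item())
```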
D2E:扩展桌面数据上的视觉-动作预训练以迁移到具身 AI
- 标题: D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
- 作者: Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee
- 日期: 2025-10-07
- ArXiv主页 : https://arxiv.org/abs/2510.05684
- 论文链接 : https://arxiv.org/pdf/2510.05684
- 项目链接 : https://worv-ai.github.io/d2e/
- gitHub仓库 : https://github.com/worv-ai/D2E
英文摘要
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA toolkit, datasets of human-collected and pseudo-labeled, and VAPT-trained models available at https://worv-ai.github.io/d2e/
中文摘要
大型语言模型利用互联网规模的文本数据,但具身 AI 仍然受到物理轨迹采集成本高昂的限制。桌面环境(尤其是游戏)提供了一个引人注目的替代方案:它们提供大规模的丰富感觉运动交互,同时保持具身学习所必需的结构化"观察-动作"耦合。我们提出了 D2E(Desktop to Embodied AI),该框架证明桌面交互可以作为机器人具身 AI 任务的有效预训练基础。与之前局限于特定领域(例如面向 Minecraft 的 VPT)或数据不公开(例如 SIMA)的工作不同,D2E 建立了从可扩展的桌面数据收集到在具身领域得到验证的迁移的完整流程。我们的框架由三个组件组成:(1) OWA Toolkit,将不同的桌面交互统一为具有 152 倍压缩率的标准化格式;(2) Generalist-IDM,通过基于时间戳的事件预测,在未见过的游戏中实现强大的零样本泛化,从而支持互联网规模的伪标注;(3) VAPT,将桌面预训练的表示迁移到物理操作和导航。使用超过 1.3K 小时的数据(259 小时的人类演示,以及超过 1K 小时的伪标注游戏数据),我们在 LIBERO 操作基准上取得了 96.6% 的总成功率,在 CANVAS 导航基准上取得了 83.3% 的成功率。这验证了数字交互中的感觉运动原语具有足够的不变性,可以有意义地迁移到物理具身任务,从而使桌面预训练成为机器人领域的一种实用范式。我们将公开所有工作,包括 OWA 工具包、人工采集与伪标注的数据集以及 VAPT 训练的模型,详见 https://worv-ai.github.io/d2e/
用相机思考:用于以相机为中心的理解和生成的统一多模态模型
- 标题: Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
- 作者: Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
- 日期: 2025-10-09
- ArXiv主页 : https://arxiv.org/abs/2510.08673
- 论文链接 : https://arxiv.org/pdf/2510.08673
- 项目链接 : https://kangliao929.github.io/projects/puffin/
- gitHub仓库 : https://github.com/KangLiao929/Puffin
英文摘要
Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
中文摘要
以相机为中心的理解和生成是空间智能的两个基石,但它们通常被孤立地研究。我们提出了 Puffin,这是一种以相机为中心的统一多模态模型,可沿相机维度扩展空间感知。Puffin 集成了语言回归和基于扩散的生成,能够从任意视角理解和创建场景。为了弥合相机与视觉-语言之间的模态差距,我们引入了一种将相机视为语言的新范式,从而实现"用相机思考"。这引导模型在跨几何上下文推理的同时,将具有空间依据的视觉线索与摄影术语对齐。Puffin 在 Puffin-4M 上训练,这是一个包含 400 万个视觉-语言-相机三元组的大规模数据集。我们同时结合全局相机参数和逐像素相机图,实现灵活可靠的空间生成。实验证明,在以相机为中心的生成和理解方面,Puffin 的性能优于专用模型。通过指令微调,Puffin 可以泛化到各种跨视角任务,例如空间想象、世界探索和摄影指导。我们将发布代码、模型、数据集管道和基准,以推进多模态空间智能研究。
机器人学习:教程
- 标题: Robot Learning: A Tutorial
- 作者: Francesco Capuano, Caroline Pascal, Adil Zouitine, Thomas Wolf, Michel Aractingi
- 日期: 2025-10-14
- ArXiv主页 : https://arxiv.org/abs/2510.12403
- 论文链接 : https://arxiv.org/pdf/2510.12403
- 项目链接 : https://huggingface.co/spaces/lerobot/robot-learning-tutorial
- gitHub仓库 : https://github.com/fracapuano/robot-learning-tutorial
英文摘要
Robot learning is at an inflection point, driven by rapid advancements in machine learning and the growing availability of large-scale robotics data. This shift from classical, model-based methods to data-driven, learning-based paradigms is unlocking unprecedented capabilities in autonomous systems. This tutorial navigates the landscape of modern robot learning, charting a course from the foundational principles of Reinforcement Learning and Behavioral Cloning to generalist, language-conditioned models capable of operating across diverse tasks and even robot embodiments. This work is intended as a guide for researchers and practitioners, and our goal is to equip the reader with the conceptual understanding and practical tools necessary to contribute to developments in robot learning, with ready-to-use examples implemented in lerobot.
中文摘要
在机器学习的快速进步和大规模机器人数据日益丰富的推动下,机器人学习正处于一个拐点。从经典的基于模型的方法到数据驱动、基于学习的范式的转变,正在为自主系统释放前所未有的能力。本教程梳理了现代机器人学习的版图,勾勒出一条从强化学习和行为克隆的基本原理,到能够跨不同任务甚至不同机器人具身形式工作的通用语言条件模型的路线。这项工作旨在为研究人员和从业者提供指南,我们的目标是借助在 lerobot 中实现的即用示例,为读者提供参与机器人学习发展所需的概念理解和实用工具。
当模型说谎时,我们学习:使用 PsiloQA 进行多语言跨度级幻觉检测
- 标题: When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
- 作者: Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova
- 日期: 2025-10-06
- ArXiv主页 : https://arxiv.org/abs/2510.04849
- gitHub仓库 : https://github.com/s-nlp/PsiloQA
英文摘要
Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
中文摘要
幻觉检测仍然是安全可靠地部署大型语言模型(LLM)的一项基本挑战,特别是在要求事实准确性的应用中。现有的幻觉基准通常只在序列级别上运行,并且仅限于英语,缺乏全面评估所需的细粒度、多语言监督。在这项工作中,我们介绍了 PsiloQA,这是一个大规模多语言数据集,带有覆盖 14 种语言的跨度级幻觉标注。PsiloQA 通过自动化的三阶段流程构建:使用 GPT-4o 从维基百科生成问答对,在无上下文设置中从多种 LLM 中诱导潜在的幻觉答案,并通过与标准答案和检索到的上下文进行比较,使用 GPT-4o 自动标注幻觉跨度。我们评估了各种幻觉检测方法(包括不确定性量化、基于 LLM 的标注和微调的编码器模型),结果表明基于编码器的模型在各语言上取得了最强的性能。此外,PsiloQA 展示了有效的跨语言泛化能力,并支持向其他基准的稳健知识迁移,同时比人工标注的数据集显著更具成本效益。我们的数据集和结果推动了多语言环境中可扩展、细粒度幻觉检测的发展。
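论文的跨度标注由 GPT-4o 结合标准答案与检索上下文完成;作为对"跨度级幻觉标注"数据形式的一个高度简化的替代示意,下面用 difflib 把模型答案与标准答案做对齐,把不一致的片段标为疑似幻觉跨度(函数名与示例文本均为假设)。

```python
import difflib
from typing import List, Tuple

def naive_hallucination_spans(answer: str, golden: str) -> List[Tuple[int, int, str]]:
    """返回 answer 中与标准答案不一致的字符级跨度 (start, end, text)。
    这是一个基于字符串对齐的简化替代,并非论文中基于 GPT-4o 的标注流程。"""
    spans = []
    matcher = difflib.SequenceMatcher(a=golden, b=answer)
    for tag, _, _, b1, b2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):      # answer 中多出或被改写的内容视为疑似幻觉
            spans.append((b1, b2, answer[b1:b2]))
    return spans

golden = "PsiloQA covers 14 languages."
answer = "PsiloQA covers 27 languages and 3 dialects."
for start, end, text in naive_hallucination_spans(answer, golden):
    print(f"[{start}:{end}] {text!r}")
```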
通过自监督预训练推进端到端像素空间生成建模
- 标题: Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
- 作者: Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu
- 日期: 2025-10-14
- ArXiv主页 : https://arxiv.org/abs/2510.12586
- gitHub仓库 : https://github.com/AMAP-ML/EPG
英文摘要
Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 number of function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.
中文摘要
与潜在空间模型相比,像素空间生成模型通常更难训练,而且整体表现较差,从而留下持续的性能和效率差距。在本文中,我们提出了一种新颖的两阶段训练框架,为像素空间的扩散模型和一致性模型弥补了这一差距。在第一阶段,我们预训练编码器,使其从干净图像中捕获有意义的语义,同时将这些语义与同一条确定性采样轨迹上的点对齐,该轨迹将点从先验分布演化到数据分布。在第二阶段,我们将编码器与随机初始化的解码器组合,并针对扩散模型和一致性模型对完整模型进行端到端微调。我们的训练框架在 ImageNet 数据集上展现了强大的经验性能。具体来说,我们的扩散模型在 75 次函数评估(NFE)下,在 ImageNet-256 上达到 2.04 的 FID,在 ImageNet-512 上达到 2.35 的 FID,在生成质量和效率方面均大幅超越之前的像素空间方法,同时以相当的训练成本与领先的基于 VAE 的模型相媲美。此外,在 ImageNet-256 上,我们的一致性模型在单个采样步骤中取得了 8.82 的出色 FID,显著超过其潜在空间对应模型。据我们所知,这是首次不依赖预训练 VAE 或扩散模型、直接在高分辨率图像上成功训练一致性模型。
代理熵平衡策略优化
- 标题: Agentic Entropy-Balanced Policy Optimization
- 作者: Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
- 日期: 2025-10-16
- ArXiv主页 : https://arxiv.org/abs/2510.14545
- gitHub仓库 : https://github.com/dongguanting/ARPO
英文摘要
Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to the training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocate global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
中文摘要
最近,代理强化学习(Agentic RL)在激发网络智能体的多轮、长程工具调用能力方面取得了重大进展。虽然主流代理强化学习算法在熵的指导下自主探索高不确定性的工具调用步骤,但过度依赖熵信号可能带来额外约束,导致训练崩溃。在本文中,我们深入研究熵带来的挑战,并提出代理熵平衡策略优化(AEPO),这是一种旨在平衡 rollout 和策略更新两个阶段中熵的代理强化学习算法。AEPO 包含两个核心组件:(1)动态熵平衡 rollout 机制,通过熵预监控自适应分配全局和分支采样预算,同时对连续的高熵工具调用步骤施加分支惩罚,以防止过度分支;(2)熵平衡策略优化,在高熵裁剪项中插入停止梯度操作,以保留并适当重新缩放高熵 token 上的梯度,同时结合熵感知的优势估计,优先学习高不确定性 token。在 14 个具有挑战性的数据集上的结果表明,AEPO 始终优于 7 种主流 RL 算法。仅用 1K 条 RL 样本,搭配 AEPO 的 Qwen3-14B 就取得了令人印象深刻的结果:Pass@1 在 GAIA 上为 47.6%,在 Humanity's Last Exam 上为 11.2%,在 WebWalker 上为 43.0%;Pass@5 在 GAIA 上为 65.0%,在 Humanity's Last Exam 上为 26.0%,在 WebWalker 上为 70.0%。进一步的分析表明,AEPO 在保持策略熵稳定的同时提高了 rollout 采样多样性,有助于可扩展的网络智能体训练。
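下面是"熵预监控 + 分支惩罚"这一 rollout 预算分配思路的简化示意(非官方实现):先计算每个工具调用步骤的策略熵,优先在高熵步骤分支采样,并对连续高熵步骤施加惩罚以避免过度分支;阈值、预算和惩罚系数均为假设值。

```python
import math
from typing import List

def token_entropy(probs: List[float]) -> float:
    """离散分布的香农熵。"""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def allocate_branches(step_entropies: List[float],
                      budget: int = 4,
                      threshold: float = 0.6,
                      consecutive_penalty: float = 0.5) -> List[int]:
    """返回被选中进行分支采样的步骤下标(动态预算分配的简化版)。
    连续出现的高熵步骤会被乘以惩罚系数,降低再次分支的优先级。"""
    scores = []
    prev_high = False
    for i, h in enumerate(step_entropies):
        score = h if h >= threshold else 0.0
        if prev_high and score > 0.0:
            score *= consecutive_penalty        # 分支惩罚:抑制连续分支
        scores.append((score, i))
        prev_high = h >= threshold
    chosen = [i for s, i in sorted(scores, reverse=True)[:budget] if s > 0.0]
    return sorted(chosen)

entropies = [token_entropy(p) for p in
             [[0.9, 0.1], [0.5, 0.5], [0.45, 0.55], [0.98, 0.02], [0.34, 0.33, 0.33]]]
print(allocate_branches(entropies, budget=2))   # 只在熵最高且未被惩罚压低的步骤分支
```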
PaddleOCR-VL:通过 0.9B 超紧凑视觉语言模型促进多语言文档解析
- 标题: PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
- 作者: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
- 日期: 2025-10-16
- ArXiv主页 : https://arxiv.org/abs/2510.14528
- gitHub仓库 : https://github.com/PaddlePaddle/PaddleOCR
英文摘要
In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. Code is available at https://github.com/PaddlePaddle/PaddleOCR .
中文摘要
在本报告中,我们提出了 PaddleOCR-VL,这是一种专为文档解析量身定制的 SOTA 且资源高效的模型。其核心组件是 PaddleOCR-VL-0.9B,这是一种紧凑而强大的视觉语言模型(VLM),它将 NaViT 式动态分辨率视觉编码器与 ERNIE-4.5-0.3B 语言模型集成在一起,以实现精确的元素识别。这一创新模型高效支持 109 种语言,擅长识别复杂元素(例如文本、表格、公式和图表),同时保持极低的资源消耗。通过在广泛使用的公共基准和内部基准上的综合评估,PaddleOCR-VL 在页面级文档解析和元素级识别方面均达到 SOTA 性能。它显著优于现有解决方案,与顶级 VLM 相比具有强大的竞争力,并提供快速的推理速度。这些优势使其非常适合在真实场景中落地部署。代码可在 https://github.com/PaddlePaddle/PaddleOCR 获取。
扩展以语言为中心的全模态表征学习
- 标题: Scaling Language-Centric Omnimodal Representation Learning
- 作者: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong
- 日期: 2025-10-13
- ArXiv主页 : https://arxiv.org/abs/2510.11693
- 论文链接 : https://arxiv.org/pdf/2510.11693
- 项目链接 : https://huggingface.co/LCO-Embedding
- gitHub仓库 : https://github.com/LCO-Embedding/LCO-Embedding
英文摘要
Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scales positively with the MLLM's generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
中文摘要
最近,利用经过对比学习(CL)微调的多模态大语言模型(MLLM)构建的多模态嵌入方法已经显示出可喜的结果,但其优越性背后的根本原因仍未得到充分探索。这项工作认为,基于 MLLM 的方法的一个关键优势源于生成式预训练期间实现的隐式跨模态对齐:语言解码器学会利用共享表示空间内的多模态信号来生成单模态输出。通过对各向异性和核相似性结构的分析,我们通过实验证实 MLLM 表示中会涌现潜在对齐,从而使 CL 可以作为一个轻量级的细化阶段。利用这一洞察,我们提出了一种以语言为中心的全模态嵌入框架,称为 LCO-Emb。在不同主干模型和基准上的大量实验证明了其有效性,在各模态上均达到最先进的性能。此外,我们发现了生成-表示扩展定律(GRSL),表明通过对比细化获得的表示能力会随 MLLM 生成能力的增强而正向扩展。这表明提升生成能力是提高表示质量的一种有效范式。我们给出了 GRSL 的理论解释,将 MLLM 的生成质量与其表示性能的上界正式联系起来,并在一个具有挑战性的低资源视觉文档检索任务上加以验证,表明在 CL 之前进行持续的生成式预训练可以进一步提升模型嵌入能力的潜力。代码、模型和资源可在 https://github.com/LCO-Embedding/LCO-Embedding 获取。
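摘要中把对比学习(CL)视为 MLLM 表示之上的轻量级细化阶段;下面给出一个通用的批内 InfoNCE 损失示意,假设已从模型中取出查询与文档两路嵌入,温度等超参数为假设值,并非 LCO-Emb 的确切训练配置。

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """批内对比损失:第 i 个查询与第 i 个文档为正样本对,其余为负样本。"""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                # (B, B) 相似度矩阵
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

B, dim = 8, 256
query_emb = torch.randn(B, dim, requires_grad=True)   # 假设来自 MLLM 的语言侧表示
doc_emb = torch.randn(B, dim)                          # 假设来自图像/音频等模态的表示
loss = info_nce(query_emb, doc_emb)
loss.backward()
print(loss.item())
```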
DITING:用于网络小说翻译基准测试的多智能体评估框架
- 标题: DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
- 作者: Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Eric Dong, Sophia Ananiadou, Min Peng, Qianqian Xie
- 日期: 2025-10-10
- ArXiv主页 : https://arxiv.org/abs/2510.09116
- gitHub仓库 : https://github.com/WHUNextGen/DITING
英文摘要
Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.
中文摘要
大型语言模型(LLM)极大地推动了机器翻译(MT)的发展,但它们在翻译网络小说方面的有效性仍不清楚。现有的基准依赖于表层指标,无法捕捉这一文体的独特特征。为了弥补这些差距,我们引入了第一个网络小说翻译综合评估框架 DITING,从成语翻译、词汇歧义、术语本地化、时态一致性、零代词消解和文化安全六个维度评估叙事和文化保真度,并由超过 1.8 万对专家标注的汉英句子对支持。我们进一步提出了 AgentEval,这是一种推理驱动的多智能体评估框架,它模拟专家审议来评估超越词汇重叠的翻译质量,在七个被测自动指标中与人类判断的相关性最高。为了实现指标比较,我们构建了 MetricAlign,这是一个由 300 个句子对组成、带有错误标签和标量质量分数标注的元评估数据集。对十四个开源、闭源和商业模型的综合评估表明,以中文数据训练的 LLM 超越了规模更大的国外模型,并且 DeepSeek-V3 提供了最忠实且风格连贯的翻译。我们的工作为探索基于 LLM 的网络小说翻译建立了新的范式,并提供公共资源来推动未来的研究。
WithAnyone:实现可控且 ID 一致的图像生成
- 标题: WithAnyone: Towards Controllable and ID Consistent Image Generation
- 作者: Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
- 日期: 2025-10-16
- ArXiv主页 : https://arxiv.org/abs/2510.14975
- 论文链接 : https://arxiv.org/pdf/2510.14975
- 项目链接 : https://doby-xu.github.io/WithAnyone/
- gitHub仓库 : https://github.com/Doby-Xu/WithAnyone
英文摘要
Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
中文摘要
身份一致的生成已成为文本到图像研究的一个重要焦点,最近的模型在生成与参考身份一致的图像方面取得了显著成功。然而,包含同一人物多张图像的大规模配对数据集的稀缺,迫使大多数方法采用基于重建的训练。这种依赖通常会导致我们称之为"复制粘贴"的失败模式:模型直接复制参考人脸,而不是在姿势、表情或光照的自然变化中保持身份。这种过度相似性破坏了可控性,并限制了生成的表达能力。为了解决这些限制,我们(1)构建了一个大规模配对数据集 MultiID-2M,针对多人场景量身定制,为每个身份提供多样化的参考;(2)引入一个基准,量化复制粘贴伪影以及身份保真度与变化之间的权衡;(3)提出一种带有对比身份损失的新颖训练范式,利用配对数据在保真度与多样性之间取得平衡。这些贡献最终汇聚成 WithAnyone,这是一种基于扩散的模型,可以在保持高身份相似性的同时有效缓解复制粘贴问题。广泛的定性和定量实验表明,WithAnyone 显著减少了复制粘贴伪影,提高了对姿势和表情的可控性,并保持了强大的感知质量。用户研究进一步验证了我们的方法在实现高身份保真度的同时,支持富有表现力的可控生成。
KORMo:适合所有人的韩国开放推理模型
- 标题: KORMo: Korean Open Reasoning Model for Everyone
- 作者: Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim
- 日期: 2025-10-10
- ArXiv主页 : https://arxiv.org/abs/2510.09426
- gitHub仓库 : https://github.com/MLP-Lab/KORMo-tutorial
英文摘要
This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.
中文摘要
这项工作首次对构建面向非英语语言(具体为韩语)、主要在合成数据上训练的完全开放双语大语言模型(LLM)进行了大规模研究。我们推出了 KORMo-10B,这是一个在韩英语料库上从头训练的 10.8B 参数模型,其中韩语部分有 68.74% 为合成数据。通过系统的实验,我们证明,只要经过精心筛选、具备均衡的语言覆盖和多样的指令风格,合成数据不会在大规模预训练期间导致不稳定或性能退化。此外,该模型在广泛的推理、知识和指令遵循基准上取得了与当代开放权重多语言基线相当的性能。我们的实验揭示了两个关键发现:(1)合成数据可以可靠地支撑长程预训练而不发生模型崩溃,(2)双语指令微调能够使韩语达到接近母语水平的推理和语篇连贯性。通过完全公开数据、代码、训练配方和日志等所有组件,这项工作为在低资源环境中开发合成数据驱动的完全开放模型(FOM)建立了透明的框架,并为未来的多语言 LLM 研究树立了可复现的先例。
人工智能服务:借助人工智能眼镜主动提供帮助
- 标题: AI for Service: Proactive Assistance with AI Glasses
- 作者: Zichen Wen, Yiyu Wang, Chenfei Liao, Boxue Yang, Junxian Li, Weifeng Liu, Haocong He, Bolong Feng, Xuyang Liu, Yuanhuiyi Lyu, Xu Zheng, Xuming Hu, Linfeng Zhang
- 日期: 2025-10-16
- ArXiv主页 : https://arxiv.org/abs/2510.14359
- 论文链接 : https://arxiv.org/pdf/2510.14359
英文摘要
In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.
中文摘要
在人工智能从被动工具演变为主动和自适应伴侣的时代,我们推出了人工智能服务(AI4Service),这是一种在日常生活中实现主动和实时帮助的新范式。现有的人工智能服务在很大程度上仍然是反应性的,仅响应明确的用户命令。我们认为,真正智能且乐于助人的助手应该能够预测用户需求并在适当的时候主动采取行动。为了实现这一愿景,我们提出了 Alpha-Service,这是一个解决两个基本挑战的统一框架:通过检测以自我为中心的视频流中的服务机会来了解何时进行干预,以及了解如何提供通用和个性化服务。受冯·诺依曼计算机架构的启发,基于人工智能眼镜,Alpha-Service由五个关键组件组成:用于感知的输入单元、用于任务调度的中央处理单元、用于工具使用的算术逻辑单元、用于长期个性化的存储单元以及用于自然人机交互的输出单元。作为初步探索,我们通过部署在AI眼镜上的多代理系统来实现Alpha-Service。案例研究(包括实时二十一点顾问、博物馆导游和购物助理)展示了其无缝感知环境、推断用户意图以及在没有明确提示的情况下提供及时有用帮助的能力。
FlashWorld:几秒内生成高质量 3D 场景
- 标题: FlashWorld: High-quality 3D Scene Generation within Seconds
- 作者: Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, Liujuan Cao
- 日期: 2025-10-15
- ArXiv主页 : https://arxiv.org/abs/2510.13678
- 论文链接 : https://arxiv.org/pdf/2510.13678
- 项目链接 : https://imlixinyang.github.io/FlashWorld-Project-Page/
- gitHub仓库 : https://github.com/imlixinyang/FlashWorld
英文摘要
We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100times faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented method typically suffers poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation by matching distribution from consistent 3D-oriented mode to high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. Also, we propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.
中文摘要
我们提出了 FlashWorld,这是一种生成模型,可以在几秒钟内从单张图像或文本提示生成 3D 场景,比以往方法快 10~100 倍,同时拥有更优的渲染质量。我们的方法从传统的多视图导向(MV-oriented)范式(先生成多视图图像再进行后续 3D 重建)转向 3D 导向的方法,即模型在多视图生成过程中直接输出 3D 高斯表示。3D 导向方法虽然保证了 3D 一致性,但通常视觉质量较差。FlashWorld 包括双模式预训练阶段和跨模式后训练阶段,有效整合了两种范式的优势。具体来说,我们利用视频扩散模型的先验,首先预训练一个双模式多视图扩散模型,同时支持 MV 导向和 3D 导向的生成模式。为了弥补 3D 导向生成的质量差距,我们进一步提出了一种跨模式后训练蒸馏,将保持一致性的 3D 导向模式的分布向高质量的 MV 导向模式匹配。这不仅在保持 3D 一致性的同时提升了视觉质量,还减少了推理所需的去噪步骤。此外,我们提出了一种策略,在此过程中利用大量单视图图像和文本提示,以增强模型对分布外输入的泛化能力。大量实验证明了我们方法的优越性和效率。
从像素到文字——大规模迈向原生视觉语言原语
- 标题: From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
- 作者: Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu
- 日期: 2025-10-16
- ArXiv主页 : https://arxiv.org/abs/2510.14979
- gitHub仓库 : https://github.com/EvolvingLMMs-Lab/NEO
英文摘要
The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
中文摘要
原生视觉语言模型 (VLM) 的大厦已成为典型模块化 VLM 的有力竞争者,由不断发展的模型架构和训练范例塑造而成。然而,有两个挥之不去的阴云给其广泛的探索和推广蒙上了阴影:(-)哪些基本限制使原生VLM与模块化VLM区别开来,这些障碍可以在多大程度上克服?(-) 如何使原生 VLM 的研究更容易获得和民主化,从而加速该领域的进展。在本文中,我们阐明了这些挑战并概述了构建本机 VLM 的指导原则。具体来说,一个原生 VLM 原语应该:(i)在共享语义空间内有效地对齐像素和单词表示;(ii) 无缝整合以前独立的视觉和语言模块的优势;(iii) 本质上体现了支持统一视觉语言编码、对齐和推理的各种跨模式属性。因此,我们推出了 NEO,这是一个根据第一原理构建的新颖的原生 VLM 系列,能够在不同的现实场景中与顶级模块化同类产品相媲美。只需 3.9 亿个图像文本示例,NEO 就可以从头开始有效地开发视觉感知,同时减轻由我们精心设计的基元制作的密集且整体模型内的视觉语言冲突。我们将 NEO 定位为可扩展且功能强大的原生 VLM 的基石,并与一组丰富的可重用组件配合使用,以形成经济高效且可扩展的生态系统。我们的代码和模型可在以下网址公开获取:https://github.com/EvolvingLMMs-Lab/NEO。
UniMoE-Audio:利用动态容量 MoE 生成统一语音和音乐
- 标题: UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE
- 作者: Zhenyu Liu, Yunxin Li, Xuanyu Zhang, Qixun Teng, Shenyuan Jiang, Xinyu Chen, Haoyuan Shi, Jinchao Li, Qi Wang, Haolan Chen, Fanbo Meng, Mingjun Zhao, Yu Xu, Yancheng He, Baotian Hu, Min Zhang
- 日期: 2025-10-15
- ArXiv主页 : https://arxiv.org/abs/2510.13344
- 论文链接 : https://arxiv.org/pdf/2510.13344
- 项目链接 : https://mukioxun.github.io/Uni-MoE-site/home.html
英文摘要
Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architecture and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
中文摘要
统一多模态模型的最新进展表明了迈向全面内容生成的明显趋势。然而,听觉领域仍然是一个重大挑战:音乐和语音通常被孤立地开发,阻碍了通用音频合成的进展。这种分离源于固有的任务冲突和严重的数据不平衡,妨碍了真正统一的音频生成模型的开发。为了应对这一挑战,我们提出了 UniMoE-Audio,这是一个建立在新颖的动态容量专家混合(MoE)框架内的统一语音与音乐生成模型。在架构上,UniMoE-Audio 引入了用于动态分配专家数量的 Top-P 路由策略,以及一种混合专家设计,包括用于领域特定知识的路由专家、用于领域无关特征的共享专家,以及用于自适应跳过计算的空专家。为了解决数据不平衡问题,我们引入了三阶段训练课程:1)独立专家训练利用原始数据集,在不受干扰的情况下向每个"原型专家"灌输领域特定知识;2)MoE 集成与预热将这些专家并入 UniMoE-Audio 架构,使用均衡数据集的一个子集对门控模块和共享专家进行预热;3)协同联合训练在完全均衡的数据集上端到端训练整个模型,促进更强的跨领域协同。大量实验表明,UniMoE-Audio 不仅在主要语音和音乐生成基准上达到最先进的性能,还展现出卓越的协同学习能力,缓解了朴素联合训练中常见的性能下降。我们的研究结果凸显了专门的 MoE 架构和精心设计的训练策略在推进通用音频生成领域的巨大潜力。主页:https://mukioxun.github.io/Uni-MoE-site/home.html
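下面是"Top-P 路由 + 空专家"思想的一个简化示意(非官方实现):对每个 token 按路由概率降序累加,选出累计概率刚好覆盖 p 的最小专家集合以实现动态专家数量;若选中空专家则跳过计算。专家数量、p 值以及"最后一个专家为空专家"的约定都是假设。

```python
import torch

def top_p_route(router_logits: torch.Tensor, p: float = 0.7):
    """router_logits: (num_tokens, num_experts),约定最后一个专家为"空专家"。
    返回每个 token 选中的专家下标列表(数量随 token 动态变化)。"""
    probs = torch.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    keep = cum - sorted_probs < p            # 保留累计概率刚好覆盖 p 的前缀
    routes = []
    for t in range(router_logits.size(0)):
        routes.append(sorted_idx[t][keep[t]].tolist())
    return routes

num_tokens, num_experts = 4, 5               # 第 4 号专家视为空专家(自适应跳过计算)
router_logits = torch.randn(num_tokens, num_experts)
for t, experts in enumerate(top_p_route(router_logits)):
    active = [e for e in experts if e != num_experts - 1]
    print(f"token {t}: 选中专家={experts}, 实际参与计算的专家={active or '跳过'}")
```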
注意力揭示了 LLM 的推理:预规划-锚定节奏实现细粒度策略优化
- 标题: Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
- 作者: Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan
- 日期: 2025-10-15
- ArXiv主页 : https://arxiv.org/abs/2510.13554
- 论文链接 : https://arxiv.org/pdf/2510.13554
英文摘要
The reasoning pattern of Large language models (LLMs) remains opaque, and Reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
中文摘要
大型语言模型(LLM)的推理模式仍然不透明,而强化学习(RL)通常对整段生成施加均一的信用分配,模糊了关键步骤与常规步骤之间的区别。这项工作将注意力定位为一种使 LLM 内部逻辑变得可读的特殊载体:它不仅是计算的副产品,更是推理本身的机制蓝图。我们首先区分局部聚焦与全局聚焦信息处理的注意力头,并发现局部聚焦头在对角线附近产生指示短语块的锯齿状模式,而全局聚焦头则暴露出对未来 token 施加广泛下游影响的 token。我们用两个指标将这些信号形式化:1)窗口平均注意力距离,衡量截断窗口内向后注意的程度;2)未来注意力影响力,将一个 token 的全局重要性量化为它从后续 token 获得的平均注意力。综合来看,这些信号揭示了一种反复出现的"预规划-锚定"机制:模型首先进行长程上下文引用以生成一个引导性 token,紧随其后(或与之重合)的是一个组织后续推理的语义锚定 token。利用这些洞察,我们引入了三种新颖的 RL 策略,对关键节点(预规划 token、锚定 token 及其时间耦合)动态执行有针对性的信用分配,并在各种推理任务中展现出一致的性能提升。通过使优化与模型固有的推理节奏对齐,我们旨在将不透明的优化转变为可操作的、结构感知的过程,希望为更透明、更有效的 LLM 推理优化迈出潜在的一步。
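按摘要中的文字描述,下面给出两个指标的一个示意实现(具体归一化与公式细节为假设):窗口平均注意力距离(WAAD)统计每个查询 token 在截断窗口内向后注意的加权距离;未来注意力影响力(FAI)把某 token 从其后续 token 获得的注意力取平均,作为其全局重要性。

```python
import torch

def windowed_average_attention_distance(attn: torch.Tensor, window: int = 32) -> torch.Tensor:
    """attn: (T, T) 因果注意力矩阵,行为查询、列为键。
    对每个查询 i,只统计距离不超过 window 的历史位置上注意力加权的回看距离。"""
    T = attn.size(0)
    dist = torch.arange(T).unsqueeze(0) - torch.arange(T).unsqueeze(1)   # dist[i, j] = j - i
    back = (-dist).clamp(min=0).float()                                  # 回看距离 i - j
    mask = (dist <= 0) & (back <= window)                                # 因果 + 截断窗口
    w = attn * mask
    w = w / (w.sum(dim=-1, keepdim=True) + 1e-12)
    return (w * back).sum(dim=-1)                                        # (T,) 每个查询的 WAAD

def future_attention_influence(attn: torch.Tensor) -> torch.Tensor:
    """token j 的 FAI = 所有后续查询 i > j 对 j 的注意力的平均值。"""
    T = attn.size(0)
    future = torch.tril(torch.ones(T, T), diagonal=-1)                   # i > j 的位置
    counts = future.sum(dim=0).clamp(min=1)                              # 每列的后续 token 数
    return (attn * future).sum(dim=0) / counts                           # (T,)

T = 8
scores = torch.randn(T, T).masked_fill(torch.triu(torch.ones(T, T), 1).bool(), float("-inf"))
attn = torch.softmax(scores, dim=-1)                                     # 构造一个因果注意力矩阵
print(windowed_average_attention_distance(attn, window=4))
print(future_attention_influence(attn))
```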
Bee:一个高质量的语料库和全栈套件,可解锁高级的完全开放的 MLLM
- 标题: Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
- 作者: Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu
- 日期: 2025-10-15
- ArXiv主页 : https://arxiv.org/abs/2510.13795
- 论文链接 : https://arxiv.org/pdf/2510.13795
- 项目链接 : https://open-bee.github.io/
英文摘要
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
中文摘要
完全开放的多模态大语言模型(MLLM)目前落后于专有模型,这主要是由于监督微调(SFT)的数据质量存在显著差距。现有的开源数据集经常受到普遍噪声的困扰,并且严重缺乏思维链(CoT)等复杂推理数据,这阻碍了高级模型能力的发展。为应对这些挑战,我们的工作做出三项主要贡献。首先,我们推出 Honey-Data-15M,这是一个新的 SFT 数据集,包含约 1500 万个问答对,经过多种清洗技术处理,并通过新颖的双层级(短和长)CoT 富集策略加以增强。其次,我们推出数据整编管道 HoneyPipe 及其底层框架 DataStudio,为社区提供透明且可调整的数据整编方法,超越静态的数据集发布。最后,为了验证我们的数据集和管道,我们在 Honey-Data-15M 上训练了 8B 模型 Bee-8B。实验表明,Bee-8B 为完全开放的 MLLM 建立了新的最先进水平(SOTA),其性能可与 InternVL3.5-8B 等最新的半开放模型竞争,甚至在某些情况下超越后者。我们的工作为社区提供了一套基础资源,包括:Honey-Data-15M 语料库;由 HoneyPipe 和 DataStudio 组成的全栈套件;训练配方;评估工具;以及模型权重。这一工作表明,对数据质量的原则性关注是开发出能与半开放模型高度竞争的完全开放 MLLM 的关键路径。
ImagerySearch:超越语义依赖约束的视频生成自适应测试时搜索
- 标题: ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
- 作者: Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang
- 日期: 2025-10-16
- ArXiv主页 : https://arxiv.org/abs/2510.14847
- gitHub仓库 : https://github.com/AMAP-ML/ImagerySearch
英文摘要
Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.
中文摘要
视频生成模型已取得显著进展,在写实场景中表现尤其出色;然而,在想象类场景中,它们的性能明显下降。这类提示通常涉及很少共同出现、语义关系跨度很大的概念,落在训练分布之外。现有方法通常采用测试时扩展(test-time scaling)来提高视频质量,但其固定的搜索空间和静态的奖励设计限制了对想象类场景的适应性。为了填补这一空白,我们提出了 ImagerySearch,这是一种由提示引导的自适应测试时搜索策略,可根据提示中的语义关系动态调整推理搜索空间和奖励函数。这使得在具有挑战性的想象场景中生成的视频更加连贯、在视觉上更可信。为了评估这一方向的进展,我们引入了 LDT-Bench,这是首个面向长距离语义提示的专用基准,由 2,839 个多样化的概念对和一个用于评估创意生成能力的自动化协议组成。大量实验表明,ImagerySearch 在 LDT-Bench 上始终优于强大的视频生成基线和现有的测试时扩展方法,并在 VBench 上取得有竞争力的提升,证明了其在不同提示类型上的有效性。我们将发布 LDT-Bench 和代码,以促进未来关于想象类视频生成的研究。
BitNet蒸馏
- 标题: BitNet Distillation
- 作者: Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei
- 日期: 2025-10-15
- ArXiv主页 : https://arxiv.org/abs/2510.13998
- gitHub仓库 : https://github.com/microsoft/BitNet
英文摘要
In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between finetuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model size, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.
中文摘要
在本文中,我们提出了 BitNet Distillation (BitDistill),这是一种轻量级管道,可针对特定下游任务将现成的全精度 LLM(例如 Qwen)微调为 1.58 位精度(即三元权重 {-1, 0, 1}),以最小的计算成本实现强大的特定于任务的性能。具体来说,BitDistill 融合了三项关键技术:BitNet 中引入的 SubLN 模块;多头注意力蒸馏,基于MiniLM;持续的预训练,作为关键的预热步骤,可以缓解特定任务上经过微调的全精度 LLM 和 1.58 位 LLM 之间的性能差距的可扩展性问题。实验结果表明,BitDistill 在不同模型大小上实现了与全精度对应模型相当的性能,同时节省了高达 10 倍的内存,并且 CPU 推理速度提高了 2.65 倍。代码可从 https://github.com/microsoft/BitNet 获取。
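作为示意,下面实现 BitNet 系列常用的 absmean 三元量化(权重取值 {-1, 0, 1}),大致对应摘要中"1.58 位精度"的含义;缩放与取整方式为常见做法,具体细节以官方仓库为准。

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """absmean 三元量化:用权重绝对值均值作为尺度,再四舍五入并截断到 {-1, 0, 1}。
    返回 (三元权重, 尺度),推理时用 w_q * scale 近似原权重。"""
    scale = w.abs().mean() + eps
    w_q = torch.clamp(torch.round(w / scale), -1, 1)
    return w_q, scale

torch.manual_seed(0)
w = torch.randn(4, 8) * 0.1
w_q, scale = ternary_quantize(w)
print(w_q)                                               # 仅含 -1/0/1
print("重建误差:", (w - w_q * scale).abs().mean().item())
```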
潜在细化解码:通过细化信念状态来增强基于扩散的语言模型
- 标题: Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
- 作者: Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran, Chen Jin, Philip Alexander Teare, Bin Liang, Yulan He, Lin Gui
- 日期: 2025-10-13
- ArXiv主页 : https://arxiv.org/abs/2510.11052
- 论文链接 : https://arxiv.org/pdf/2510.11052
英文摘要
Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LlaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.
中文摘要
自回归(AR)模型仍然是自然语言生成的标准,但由于严格的顺序解码,延迟仍然很高。最近受扩散启发的方法(例如 LlaDA 和 Dream)通过并行生成来缓解这一问题,但它们存在两个核心局限:一是信息丢失,因为尚未敲定的 token 的预测分布在每一步都会被丢弃;二是过早承诺,即在缺乏充分全局协调的情况下做出局部决策。我们提出了潜在细化解码(LRD),这是一个包含"潜在细化"和"预测反馈循环"的两阶段框架。第一阶段将掩码位置保持为预测 token 与掩码嵌入的分布混合,使模型能够建立更加全局一致的信念。第二阶段逐步敲定高置信度的 token,同时保留不确定的 token 以进行迭代反馈。KL 散度的动态变化为收敛和提前停止提供了有原则且可靠的准则。在代码(HumanEval +6.3、MBPP +2.6)和推理(GSM8K +2.9、MATH500 +3.8)上的实验表明,LRD 在提高准确率的同时带来高达 10.6 倍的加速,使其成为并行序列生成的一种强大而通用的替代方案。
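下面示意"用 KL 散度监控信念状态收敛并提前停止"的思路(非官方实现):每轮迭代保留未敲定位置上的 token 分布,当相邻两轮分布的平均 KL 散度低于阈值即停止细化。其中的"细化步骤"用逐步缩小的随机扰动占位,实际应替换为扩散语言模型的一次前向。

```python
import torch
import torch.nn.functional as F

def mean_kl(p: torch.Tensor, q: torch.Tensor) -> float:
    """两组 token 分布的平均 KL(p || q)。p, q: (num_positions, vocab)。"""
    return F.kl_div(q.log(), p, reduction="batchmean").item()

def refine_until_converged(init_logits: torch.Tensor, tol: float = 1e-3, max_iters: int = 50):
    """以 KL 散度作为收敛/早停准则的迭代细化循环。"""
    beliefs = torch.softmax(init_logits, dim=-1)
    for it in range(1, max_iters + 1):
        new_logits = beliefs.log() + torch.randn_like(beliefs) * (0.5 / it)   # 占位的细化步骤
        new_beliefs = torch.softmax(new_logits, dim=-1)
        if mean_kl(new_beliefs, beliefs) < tol:
            return new_beliefs, it
        beliefs = new_beliefs
    return beliefs, max_iters

torch.manual_seed(0)
beliefs, steps = refine_until_converged(torch.randn(16, 100))   # 16 个掩码位置,词表大小 100
print("在第", steps, "轮满足 KL 收敛准则")
```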
AutoPR:让您的学术推广自动化!
- 标题: AutoPR: Let's Automate Your Academic Promotion!
- 作者: Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che
- 日期: 2025-10-10
- ArXiv主页 : https://arxiv.org/abs/2510.09558
- 论文链接 : https://arxiv.org/pdf/2510.09558
- 项目链接 : https://yzweak.github.io/autopr.github.io/
- gitHub仓库 : https://github.com/LightChen233/AutoPR
英文摘要
As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.
中文摘要
随着同行评审研究数量的激增,学者们越来越依赖社交平台来发现研究成果,而作者则投入大量精力推广自己的工作,以确保可见度和引用。为了简化这一过程并减少对人力的依赖,我们提出了自动推广(AutoPR)这一新任务,它将研究论文转化为准确、引人入胜且及时的公众内容。为了进行严格的评估,我们发布了 PRBench,这是一个多模态基准,它将 512 篇同行评审的文章与高质量的推广帖子关联起来,并沿三个维度评估系统:保真度(准确性和语气)、参与度(受众定位和吸引力)和对齐度(时机和渠道优化)。我们还提出了 PRAgent,这是一个多智能体框架,分三个阶段实现 AutoPR 的自动化:带多模态准备的内容提取、协作合成以产出精炼的文案,以及针对特定平台调整规范、语气和标签以获得最大触达。与 PRBench 上的直接 LLM 管道相比,PRAgent 取得了显著改进,包括总观看时长增加 604%、点赞数增加 438%、整体参与度至少提升 2.9 倍。消融研究表明,平台建模和有针对性的推广对这些收益贡献最大。我们的结果将 AutoPR 确立为一个可处理、可度量的研究问题,并为可扩展、有影响力的自动化学术传播提供了路线图。
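下面是对摘要中 PRAgent 三阶段流水线(内容提取、协作合成、平台定制)的一个示意性 Python 骨架,仅用于说明整体结构;`call_llm`、`Paper` 等名称、提示词和接口均为假设,并非论文的实际实现。

```python
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    abstract: str
    figures: list  # 图表的文字描述或路径(假设已预处理为文本)

def call_llm(prompt: str) -> str:
    """假设的 LLM 调用接口,实际使用时可替换为任意聊天模型 API。"""
    raise NotImplementedError

def extract_content(paper: Paper) -> str:
    # 阶段一:内容提取与多模态准备
    return call_llm(f"提取论文要点(含图表信息):\n标题:{paper.title}\n摘要:{paper.abstract}\n图表:{paper.figures}")

def synthesize_post(key_points: str) -> str:
    # 阶段二:协作合成,先写初稿再润色
    draft = call_llm(f"根据以下要点撰写推广文案初稿:\n{key_points}")
    return call_llm(f"润色以下文案,使其准确且有吸引力:\n{draft}")

def adapt_to_platform(post: str, platform: str) -> str:
    # 阶段三:平台定制,调整语气、规范与话题标签
    return call_llm(f"将文案改写为适合 {platform} 的风格,并添加合适的话题标签:\n{post}")

def autopr(paper: Paper, platform: str) -> str:
    return adapt_to_platform(synthesize_post(extract_content(paper)), platform)
```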
StreamingVLM:实时理解无限视频流
- 标题: StreamingVLM: Real-Time Understanding for Infinite Video Streams
- 作者: Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
- 日期: 2025-10-10
- ArXiv主页 : https://arxiv.org/abs/2510.09608
- gitHub仓库 : https://github.com/mit-han-lab/streaming-vlm
英文摘要
Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
中文摘要
视觉语言模型(VLM)有望驱动实时助手和自主智能体,但它们面临一个严峻挑战:在不增加延迟和内存占用的情况下理解近乎无限的视频流。以全注意力处理整个视频会带来二次方增长的计算成本,并且在长视频上表现不佳;而简单的滑动窗口方法同样存在缺陷,要么破坏连贯性,要么因冗余的重新计算而产生高延迟。在本文中,我们提出了 StreamingVLM,一种为实时、稳定地理解无限视觉输入而设计的模型。我们的方法是一个将训练与流式推理对齐的统一框架。在推理时,我们通过重用注意力汇聚点(attention sink)的状态、近期视觉 token 的短窗口和近期文本 token 的长窗口来维护紧凑的 KV 缓存。这种流式处理能力通过一种简单的监督微调(SFT)策略注入:在短且相互重叠的视频块上应用全注意力,从而在无需以过长上下文训练的情况下有效模拟推理时的注意力模式。为了进行评估,我们构建了 Inf-Streams-Eval,这是一个平均时长超过两小时的新基准,要求帧与文本之间逐秒的密集对齐。在 Inf-Streams-Eval 上,StreamingVLM 对 GPT-4O mini 的胜率达到 66.18%,并在单块 NVIDIA H100 上保持高达 8 FPS 的稳定实时性能。值得注意的是,我们的 SFT 策略在不进行任何针对 VQA 的微调的情况下也增强了通用 VQA 能力,将 LongVideoBench 提升 +4.30,将 OVOBench Realtime 提升 +5.96。代码可在 https://github.com/mit-han-lab/streaming-vlm 获取。
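摘要所述的紧凑 KV 缓存策略(保留 attention sink、近期视觉短窗口与近期文本长窗口)可以用下面这个示意性的 Python 草图说明;类名、窗口大小等均为假设值,仅展示裁剪逻辑,并非官方代码。

```python
from collections import deque

class StreamingKVCache:
    """示意性的 KV 缓存裁剪策略:保留序列开头的注意力汇聚(attention sink)token、
    近期视觉 token 的短窗口和近期文本 token 的长窗口。参数均为假设值。"""

    def __init__(self, num_sink=4, vision_window=512, text_window=2048):
        self.num_sink = num_sink
        self.sink = []                        # 开头若干 token 的 KV 状态,永久重用
        self.vision = deque(maxlen=vision_window)  # 视觉 token:较短的近期窗口
        self.text = deque(maxlen=text_window)      # 文本 token:较长的近期窗口

    def append(self, kv_state, is_vision: bool):
        if len(self.sink) < self.num_sink:
            self.sink.append(kv_state)
        elif is_vision:
            self.vision.append(kv_state)      # 超出窗口的旧视觉状态自动被丢弃
        else:
            self.text.append(kv_state)

    def current(self):
        # 每一步注意力只作用在三部分拼接后的紧凑缓存上,长度有界
        return list(self.sink) + list(self.vision) + list(self.text)
```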
RAG-Anything:一体化 RAG 框架
- 标题: RAG-Anything: All-in-One RAG Framework
- 作者: Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang
- 日期: 2025-10-14
- ArXiv主页 : https://arxiv.org/abs/2510.12323
- gitHub仓库 : https://github.com/HKUDS/RAG-Anything
英文摘要
Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists between current RAG capabilities and real-world information environments. Modern knowledge repositories are inherently multimodal, containing rich combinations of textual content, visual elements, structured tables, and mathematical expressions. Yet existing RAG frameworks are limited to textual content, creating fundamental gaps when processing multimodal documents. We present RAG-Anything, a unified framework that enables comprehensive knowledge retrieval across all modalities. Our approach reconceptualizes multimodal content as interconnected knowledge entities rather than isolated data types. The framework introduces dual-graph construction to capture both cross-modal relationships and textual semantics within a unified representation. We develop cross-modal hybrid retrieval that combines structural knowledge navigation with semantic matching. This enables effective reasoning over heterogeneous content where relevant evidence spans multiple modalities. RAG-Anything demonstrates superior performance on challenging multimodal benchmarks, achieving significant improvements over state-of-the-art methods. Performance gains become particularly pronounced on long documents where traditional approaches fail. Our framework establishes a new paradigm for multimodal knowledge access, eliminating the architectural fragmentation that constrains current systems. Our framework is open-sourced at: https://github.com/HKUDS/RAG-Anything.
中文摘要
检索增强生成(RAG)已成为让大型语言模型突破其静态训练限制的基本范式。然而,当前的 RAG 能力与现实世界的信息环境之间存在严重错位。现代知识库本质上是多模态的,包含文本内容、视觉元素、结构化表格和数学表达式的丰富组合;而现有的 RAG 框架仅限于文本内容,在处理多模态文档时留下了根本性缺口。我们提出 RAG-Anything,这是一个能够跨所有模态进行全面知识检索的统一框架。我们的方法将多模态内容重新概念化为相互关联的知识实体,而非孤立的数据类型。该框架引入双图构建,在统一表示中同时捕获跨模态关系和文本语义。我们开发了跨模态混合检索,将结构化知识导航与语义匹配相结合,从而能够对相关证据跨越多种模态的异构内容进行有效推理。RAG-Anything 在具有挑战性的多模态基准上展示了卓越性能,相较最先进的方法取得显著改进;在传统方法失效的长文档上,性能提升尤为明显。我们的框架为多模态知识访问建立了新范式,消除了制约现有系统的架构碎片化。我们的框架已开源:https://github.com/HKUDS/RAG-Anything。
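为说明"结构化知识导航 + 语义匹配"的混合检索思路,下面给出一个示意性的 Python 草图:先按向量相似度选出种子节点,再沿知识图谱扩展邻居以召回跨模态关联的证据。图结构、嵌入来源与函数名均为假设,并非 RAG-Anything 仓库中的实际 API。

```python
import networkx as nx
import numpy as np

def hybrid_retrieve(query_vec, graph: nx.Graph, embeddings: dict, top_k=5, hops=1):
    """示意性的跨模态混合检索:graph 的节点可以是文本块、图片、表格或公式实体,
    embeddings 为各节点的向量表示(两者均为假设的输入)。"""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # 语义匹配:选出与查询最相似的种子节点
    scored = sorted(embeddings, key=lambda n: cos(query_vec, embeddings[n]), reverse=True)
    seeds = scored[:top_k]

    # 结构化知识导航:沿图扩展 hops 跳邻居,把跨模态关联的证据一并召回
    retrieved, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {nbr for node in frontier for nbr in graph.neighbors(node)}
        retrieved |= frontier

    # 最终按与查询的相似度排序返回
    zero = np.zeros_like(query_vec)
    return sorted(retrieved, key=lambda n: cos(query_vec, embeddings.get(n, zero)), reverse=True)
```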
多模式提示优化:为什么不利用 MLLM 的多种模式
- 标题: Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
- 作者: Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang
- 日期: 2025-10-10
- ArXiv主页 : https://arxiv.org/abs/2510.09201
- gitHub仓库 : https://github.com/Dozi01/MPO
英文摘要
Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
中文摘要
大型语言模型(LLM)已取得显著成功,其多模态扩展(MLLM)进一步解锁了图像、视频以及其他超越文本的模态能力。然而,尽管发生了这一转变,旨在减轻手工设计提示负担并最大化性能的提示优化方法仍然局限于文本,最终限制了 MLLM 的全部潜力。受这一差距的启发,我们提出了多模态提示优化这一新问题,将提示优化的原有定义扩展到由文本与非文本提示对所定义的多模态空间。为了解决这个问题,我们提出了多模态提示优化器(MPO),这是一个统一框架:它不仅通过保持对齐的更新对多模态提示进行联合优化,还在基于贝叶斯的选择策略中以早期评估作为先验来指导候选提示的选择过程。通过在图像、视频乃至分子等超越文本的多种模态上的大量实验,我们证明 MPO 优于领先的纯文本优化方法,确立了多模态提示优化是释放 MLLM 潜力的关键一步。
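下面用一个简化的 Python 草图示意"以早期评估作为先验的贝叶斯候选选择 + 文本与非文本提示的联合更新"这一思路:每个候选维护一个 Beta 后验并用 Thompson 采样挑选父代。`evaluate`、`mutate` 等接口、字段名和选择细节均为假设,并非 MPO 仓库的实际实现。

```python
import random

class Candidate:
    """一个多模态提示候选:文本提示 + 非文本提示(这里以图像编辑指令示意,均为假设)。"""
    def __init__(self, text_prompt, image_prompt):
        self.text_prompt = text_prompt
        self.image_prompt = image_prompt
        self.wins, self.losses = 1.0, 1.0     # Beta 先验,累计吸收早期评估结果

def thompson_select(cands, k=2):
    # 基于贝叶斯的选择:对每个候选从 Beta 后验采样,取得分最高的 k 个作为父代
    return sorted(cands, key=lambda c: random.betavariate(c.wins, c.losses), reverse=True)[:k]

def mpo_step(cands, evaluate, mutate):
    """evaluate(c) 返回 [0, 1] 的得分;mutate(c) 需对文本与非文本提示做保持对齐的联合修改(假设接口)。"""
    for parent in thompson_select(cands):
        child = mutate(parent)                # 联合更新:文本与图像提示一起修改,保持语义对齐
        score = evaluate(child)
        child.wins += score                   # 评估结果写回后验,作为后续轮次的先验
        child.losses += 1 - score
        cands.append(child)
    return max(cands, key=lambda c: c.wins / (c.wins + c.losses))
```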
使用大型语言模型进行 Vibe 编码的调查
- 标题: A Survey of Vibe Coding with Large Language Models
- 作者: Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, Baolong Bi, Fangda Guo, Jiafeng Guo, Shenghua Liu, Xueqi Cheng
- 日期: 2025-10-14
- ArXiv主页 : https://arxiv.org/abs/2510.12399
英文摘要
The advancement of large language models (LLMs) has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed "Vibe Coding" where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration. To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with large language models, establishing both theoretical foundations and practical frameworks for this transformative development approach. Drawing from systematic analysis of over 1000 research papers, we survey the entire vibe coding ecosystem, examining critical infrastructure components including LLMs for coding, LLM-based coding agent, development environment of coding agent, and feedback mechanisms. We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents. Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain. Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.
中文摘要
大语言模型(LLM)的进步催化了从代码生成辅助到自主编码智能体的范式转变,催生出一种称为"Vibe Coding"的新颖开发方法:开发者通过观察结果而非逐行理解代码来验证 AI 生成的实现。尽管具有变革潜力,这种新兴范式的有效性仍未得到充分探索,经验证据揭示了意想不到的生产力损失以及人机协作中的根本性挑战。为弥补这一空白,本综述首次对基于大语言模型的 Vibe Coding 进行了全面而系统的回顾,为这一变革性开发方法建立了理论基础和实践框架。基于对 1000 多篇研究论文的系统分析,我们考察了整个 vibe coding 生态系统,包括用于编码的 LLM、基于 LLM 的编码智能体、编码智能体的开发环境以及反馈机制等关键基础设施组件。我们首先将 Vibe Coding 确立为一门正式学科,通过一个约束马尔可夫决策过程对其形式化,以刻画人类开发者、软件项目和编码智能体之间的动态三元关系。在此理论基础上,我们将现有实践归纳为五种不同的开发模型:无约束自动化、迭代对话协作、规划驱动、测试驱动和上下文增强模型,从而给出了该领域的第一个全面分类法。至关重要的是,我们的分析表明,成功的 Vibe Coding 不仅取决于智能体能力,还取决于系统性的上下文工程、完善的开发环境以及人与智能体协作的开发模型。
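摘要提到用约束马尔可夫决策过程刻画"开发者—项目—智能体"的三元关系;下面是一个仅作示意的 Python 数据结构草图,字段划分和命名均为假设,并非该综述给出的正式定义。

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class VibeCodingCMDP:
    """把 Vibe Coding 示意性地表示为约束马尔可夫决策过程的一组要素(字段均为假设)。"""
    states: Any            # 状态:软件项目当前的代码库与需求上下文
    actions: Any           # 动作:编码智能体可执行的编辑、运行、提问等操作
    transition: Callable   # 状态转移:动作作用于项目后得到的新状态
    reward: Callable       # 奖励:开发者基于结果观察给出的验证信号
    constraints: list      # 约束:测试、规范、预算等必须满足的硬性条件
```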
标签:抗幻觉扩散采样的切向放大指导
- 标题: TAG: Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling
- 作者: Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, Kyong Hwan Jin
- 日期: 2025-10-06
- ArXiv主页 : https://arxiv.org/abs/2510.04533
- 论文链接 : https://arxiv.org/pdf/2510.04533
- 项目链接 : https://hyeon-cho.github.io/TAG/
- gitHub仓库 : https://github.com/hyeon-cho/Tangential-Amplifying-Guidance
英文摘要
Recent diffusion models achieve the state-of-the-art performance in image generation, but often suffer from semantic inconsistencies or hallucinations. While various inference-time guidance methods can enhance generation, they often operate indirectly by relying on external signals or architectural modifications, which introduces additional computational overhead. In this paper, we propose Tangential Amplifying Guidance (TAG), a more efficient and direct guidance method that operates solely on trajectory signals without modifying the underlying diffusion model. TAG leverages an intermediate sample as a projection basis and amplifies the tangential components of the estimated scores with respect to this basis to correct the sampling trajectory. We formalize this guidance process by leveraging a first-order Taylor expansion, which demonstrates that amplifying the tangential component steers the state toward higher-probability regions, thereby reducing inconsistencies and enhancing sample quality. TAG is a plug-and-play, architecture-agnostic module that improves diffusion sampling fidelity with minimal computational addition, offering a new perspective on diffusion guidance.
中文摘要
最近的扩散模型在图像生成方面达到了最先进的性能,但常常受语义不一致或幻觉的困扰。虽然各种推理时引导方法可以增强生成效果,但它们通常依赖外部信号或架构修改来间接发挥作用,从而引入额外的计算开销。在本文中,我们提出了切向放大引导(TAG),这是一种更高效、更直接的引导方法,仅对轨迹信号进行操作,而无需修改底层扩散模型。TAG 以中间样本作为投影基,并放大估计得分相对于该基的切向分量,以校正采样轨迹。我们利用一阶泰勒展开将这一引导过程形式化,结果表明放大切向分量会将状态引导向更高概率的区域,从而减少不一致并提高样本质量。TAG 是一个即插即用、与架构无关的模块,能以极小的额外计算开销提升扩散采样保真度,为扩散引导提供了新的视角。
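摘要中的核心操作(以中间样本为投影基,分解得分并放大其切向分量)可以用下面的示意性 Python 草图表达;`gamma` 等参数与归一化细节均为假设,并非官方实现。用法上,可以在每个采样步中用 `tag_guidance` 的输出替换原始得分估计,再交给现有的扩散采样器(如 DDIM 更新)继续使用。

```python
import torch

def tag_guidance(score: torch.Tensor, x_t: torch.Tensor, gamma: float = 1.2) -> torch.Tensor:
    """示意性的切向放大:以中间样本 x_t 为投影基,把估计得分分解为沿基方向的分量
    与切向分量,并放大切向分量。gamma 为假设的放大系数。"""
    b = score.shape[0]
    s_flat = score.reshape(b, -1)
    x_flat = x_t.reshape(b, -1)
    basis = x_flat / (x_flat.norm(dim=-1, keepdim=True) + 1e-8)      # 单位化的投影基
    radial = (s_flat * basis).sum(dim=-1, keepdim=True) * basis       # 沿基方向的分量
    tangential = s_flat - radial                                      # 切向分量
    guided = radial + gamma * tangential                              # 放大切向分量
    return guided.reshape_as(score)
```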