【论文速递】2026年第03周(Jan-11-17)(Robotics/Embodied AI/LLM)中文使用 googletrans 翻译,翻译不对的地方以英文为准In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extr