Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate

text-to-image (TTI) generation systems, providing automated judgments based on

visual and textual context. However, these "judge" models often suffer from

biases, overconfidence, and inconsistent performance across diverse image

domains. While prompt ensembling has shown promise for mitigating these issues

in unimodal, text-only settings, our experiments reveal that standard

ensembling methods fail to generalize effectively for TTI tasks. To address

these limitations, we propose a new multimodal-aware method called Multimodal

Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt

ensemble approach augmented by image clustering, allowing the judge to

dynamically assign prompt weights based on the visual characteristics of each

sample. We show that MMB improves accuracy in pairwise preference judgments and

greatly enhances calibration, making it easier to gauge the judge's true

uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB

outperforms existing baselines in alignment with human annotations and

calibration across varied image content. Our findings highlight the importance

of multimodal-specific strategies for judge calibration and suggest a promising

path forward for reliable large-scale TTI evaluation.

PDF Link: 2509.08777v1

部分平台可能图片显示异常,请以我的博客内容为准

相关推荐
胡玉洋11 小时前
跨时空便民服务站
ai·ai作画·llm·aigc·ai编程·ai写作
科普瑞传感仪器12 小时前
告别“盲打磨”:六维力传感器如何通过选型实现真正的机器人恒力控制?
人工智能·科技·ai·机器人·无人机
programer_3312 小时前
本地手动创建一个MCP(windows环境)
windows·python·ai·mcp·cherry studio
paopao_wu12 小时前
ComfyUI遇上Z-Image(1):环境部署与AI图像生成快速体验
人工智能·ai·大模型·comfyui·z-image
programer_3313 小时前
MCP 服务调用
ai·大模型·mcp·cherry studio
韩曙亮15 小时前
【AI 大模型】LangChain 框架 ① ( LangChain 简介 | LangChain 模块 | LangChain 文档 )
人工智能·ai·langchain·llm·大语言模型·prompts·agents
我很哇塞耶16 小时前
OpenAI公开新的模型训练方法:或许能解决模型撒谎问题,已在GPT-5 thiking验证
人工智能·ai·大模型·训练
珑墨17 小时前
【AI产品】当下AI产品的变现模式深度分析
人工智能·ai·数据分析·产品运营·aigc·ai编程·ai写作
胡耀超17 小时前
AI的记忆革命:从Titans架构到长时运行智能体,谷歌Google,Anthropic,NeurIPS 2025
人工智能·python·ai·架构·智能体·上下文·titans
GMICLOUD19 小时前
GMI Cloud@AI 周报 | DeepSeek V3.2 系列震撼开源;Claude Opus 4.5 发布
人工智能·ai·ai资讯