Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
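
Illustrative sketch (not the authors' implementation): the abstract describes prompt weights that depend on the visual cluster an image falls into. The Python below shows one simple way such cluster-conditioned weighting could look. Everything in it is an assumption made for illustration: the function names, the nearest-centroid cluster assignment, the softmax-of-mean-log-likelihood weighting rule, and the `temperature` parameter are hypothetical and are not taken from the paper, which may learn its weights quite differently.

```python
import numpy as np


def assign_clusters(embeddings, centroids):
    """Assign each image embedding to its nearest centroid (Euclidean distance)."""
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def fit_cluster_weights(probs, labels, clusters, n_clusters, temperature=1.0):
    """Estimate one weight vector over prompts for each image cluster.

    probs:    (n_samples, n_prompts) judge probability that the first image
              is preferred, one column per prompt.
    labels:   (n_samples,) 1 if human annotators preferred the first image, else 0.
    clusters: (n_samples,) cluster index of each sample.
    Returns a (n_clusters, n_prompts) array whose rows sum to 1.

    Assumption: weights are a softmax over each prompt's mean log-likelihood
    on labeled samples from that cluster. This is a stand-in heuristic.
    """
    eps = 1e-6
    p = np.clip(probs, eps, 1.0 - eps)
    y = labels[:, None]
    loglik = y * np.log(p) + (1 - y) * np.log(1.0 - p)

    n_prompts = probs.shape[1]
    # Clusters with no labeled samples fall back to uniform weights.
    weights = np.full((n_clusters, n_prompts), 1.0 / n_prompts)
    for c in range(n_clusters):
        mask = clusters == c
        if mask.any():
            score = loglik[mask].mean(axis=0) / temperature
            score = score - score.max()  # numerical stability
            w = np.exp(score)
            weights[c] = w / w.sum()
    return weights


def ensemble_predict(probs, clusters, weights):
    """Combine per-prompt probabilities using the weights of each sample's cluster."""
    return (probs * weights[clusters]).sum(axis=1)


# Toy data: prompt 0 is reliable on cluster 0, prompt 1 on cluster 1.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
embeddings = np.array([[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]])
labels = np.array([1, 0, 1, 0])
probs = np.array([[0.9, 0.4], [0.1, 0.6], [0.4, 0.9], [0.6, 0.1]])

clusters = assign_clusters(embeddings, centroids)
weights = fit_cluster_weights(probs, labels, clusters, n_clusters=2)
preds = ensemble_predict(probs, clusters, weights)
print(weights)
print(preds)
```

In this toy setup each cluster ends up favoring the prompt that agreed with the human labels there, so the combined probability for a sample leans on whichever prompt was more trustworthy for visually similar images. A single global weight vector would not be able to express that per-cluster difference.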

PDF Link: 2509.08777v1
