Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
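
To make the idea of cluster-dependent prompt weights concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify how the weights are learned, so this example assumes a simple stand-in in which each image cluster gets its own prompt weights, proportional to each prompt's likelihood on a small human-labeled validation set (a uniform prior over prompts). All function names, the toy centroids, and the toy probabilities are hypothetical.

```python
import math


def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to an image embedding."""
    dists = [
        sum((e - c) ** 2 for e, c in zip(embedding, centroid))
        for centroid in centroids
    ]
    return dists.index(min(dists))


def fit_cluster_weights(val_probs, val_labels, val_clusters, n_clusters, n_prompts):
    """Assumed weighting rule: per-cluster softmax over validation log-likelihoods.

    val_probs[i][k] is prompt k's probability that image A is preferred for
    validation sample i; val_labels[i] is 1 if annotators preferred A, else 0;
    val_clusters[i] is the image cluster assigned to sample i.
    """
    loglik = [[0.0] * n_prompts for _ in range(n_clusters)]
    for probs, label, cluster in zip(val_probs, val_labels, val_clusters):
        for k, p in enumerate(probs):
            p = min(max(p, 1e-6), 1 - 1e-6)  # clip to avoid log(0)
            loglik[cluster][k] += math.log(p if label == 1 else 1 - p)

    weights = []
    for row in loglik:
        top = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def ensemble_prob(probs, embedding, centroids, weights):
    """Mix per-prompt probabilities using the weights of the sample's cluster."""
    cluster = nearest_cluster(embedding, centroids)
    return sum(w * p for w, p in zip(weights[cluster], probs))


# Toy data (hypothetical): two image clusters, two judge prompts.
centroids = [[0.0, 0.0], [1.0, 1.0]]
val_probs = [[0.9, 0.5], [0.8, 0.4], [0.5, 0.1], [0.6, 0.9]]
val_labels = [1, 1, 0, 1]
val_clusters = [0, 0, 1, 1]

weights = fit_cluster_weights(val_probs, val_labels, val_clusters, 2, 2)

# A new sample whose image embedding falls near cluster 0.
p_a_preferred = ensemble_prob([0.9, 0.2], [0.1, 0.0], centroids, weights)
print(weights, p_a_preferred)
```

In this toy setup, prompt 0 receives more weight for images in cluster 0 and prompt 1 for images in cluster 1, so the same pair of prompt outputs can yield a different ensemble probability depending on the image. In practice the centroids would come from clustering image embeddings (for example with k-means), which is also an assumption here rather than a detail stated in the abstract.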

PDF Link: 2509.08777v1
