Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate

text-to-image (TTI) generation systems, providing automated judgments based on

visual and textual context. However, these "judge" models often suffer from

biases, overconfidence, and inconsistent performance across diverse image

domains. While prompt ensembling has shown promise for mitigating these issues

in unimodal, text-only settings, our experiments reveal that standard

ensembling methods fail to generalize effectively for TTI tasks. To address

these limitations, we propose a new multimodal-aware method called Multimodal

Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt

ensemble approach augmented by image clustering, allowing the judge to

dynamically assign prompt weights based on the visual characteristics of each

sample. We show that MMB improves accuracy in pairwise preference judgments and

greatly enhances calibration, making it easier to gauge the judge's true

uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB

outperforms existing baselines in alignment with human annotations and

calibration across varied image content. Our findings highlight the importance

of multimodal-specific strategies for judge calibration and suggest a promising

path forward for reliable large-scale TTI evaluation.

PDF Link: 2509.08777v1

部分平台可能图片显示异常,请以我的博客内容为准

相关推荐
张3蜂15 分钟前
Label Studio 详解:一站式数据标注平台全面介绍
ai
北龙云海3 小时前
从宕机到智变:2025数据中心进化启示录,数智运维如何定义未来
运维·ai·数据中心·智算·数智运维·数据中心规划
琅琊榜首20204 小时前
AI赋能短剧创作:从Prompt设计到API落地的全技术指南
人工智能·prompt
测试者家园4 小时前
Prompt、Agent、测试智能体:测试的新机会,还是新焦虑?
人工智能·prompt·智能体·职业和发展·质量效能·智能化测试·软件开发和测试
xcLeigh5 小时前
AI的提示词专栏:Prompt 与 Python Pandas 的结合使用指南
人工智能·python·ai·prompt·提示词
砚边数影5 小时前
AI开发依赖引入:DL4J / Java-ML 框架 Maven 坐标配置
java·数据库·人工智能·深度学习·机器学习·ai·maven
砚边数影5 小时前
AI环境搭建(一):JDK17 + Maven 配置,Java开发环境标准化流程
数据库·人工智能·ai·ai编程
江拥羡橙5 小时前
vscode使用windsurf获取token
vscode·ai·windsurf
稳稳C96 小时前
04|Langgraph | 从入门到实战 | 进阶篇 | 流式传输
python·ai·langchain·agent·langgraph
无双@6 小时前
保姆级 安装+使用上 Claude Code
ai·大模型·agent·claude·配置·claude code·skills