Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
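
To make the idea of cluster-dependent prompt weights concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the nearest-centroid assignment, and all numeric values are assumptions made for illustration, and the abstract does not say how the weights are learned, so they are simply given here. Each prompt is assumed to yield a probability that image A is preferred over image B, and the ensemble output is the weighted average of those probabilities using the weights of the sample's image cluster.

```python
from typing import Sequence


def nearest_cluster(embedding: Sequence[float],
                    centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to the embedding
    (squared Euclidean distance). Hypothetical helper, not from the paper."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)),
               key=lambda i: sq_dist(embedding, centroids[i]))


def ensemble_preference(prompt_probs: Sequence[float],
                        embedding: Sequence[float],
                        centroids: Sequence[Sequence[float]],
                        cluster_weights: Sequence[Sequence[float]]) -> float:
    """Combine per-prompt probabilities that image A is preferred,
    using the prompt weights of the sample's image cluster."""
    cluster = nearest_cluster(embedding, centroids)
    weights = cluster_weights[cluster]
    if len(weights) != len(prompt_probs):
        raise ValueError("need exactly one weight per prompt")
    total = sum(weights)
    # Normalise so the result stays a probability even if the
    # weights do not sum exactly to 1.
    return sum(w * p for w, p in zip(weights, prompt_probs)) / total


# Illustrative values only (assumptions, not results from the paper).
centroids = [[0.0, 0.0], [1.0, 1.0]]      # two image clusters
cluster_weights = [[0.5, 0.25, 0.25],     # prompt weights for cluster 0
                   [0.25, 0.25, 0.5]]     # prompt weights for cluster 1
prompt_probs = [0.8, 0.4, 0.2]            # P(A preferred) under each of 3 prompts

p_cluster0 = ensemble_preference(prompt_probs, [0.1, 0.2], centroids, cluster_weights)
p_cluster1 = ensemble_preference(prompt_probs, [0.9, 0.8], centroids, cluster_weights)
print(p_cluster0, p_cluster1)
```

With the same three per-prompt probabilities, the two samples receive different ensemble scores because they fall into different image clusters and therefore use different prompt weights. That is the behaviour the abstract describes as assigning prompt weights based on each sample's visual characteristics.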

PDF Link: 2509.08777v1
