Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
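
To make the idea of cluster-dependent prompt weights concrete, here is a minimal sketch in Python. It is not the authors' algorithm: the abstract does not say how MMB infers its weights, so this example assumes hard cluster assignments (for instance from k-means over image embeddings, computed elsewhere) and sets each cluster's prompt weights to a softmax over the prompts' mean validation log-likelihood. The function names `cluster_prompt_weights` and `ensemble_prob` and the toy numbers are all hypothetical.

```python
import math


def cluster_prompt_weights(val_probs, val_labels, val_clusters, n_clusters):
    """Estimate one weight vector over prompts for each image cluster.

    val_probs[i][k]  : probability prompt k gives to "image A is preferred" on sample i
    val_labels[i]    : 1 if human annotators preferred image A, else 0
    val_clusters[i]  : cluster id of sample i (assumed to be precomputed)
    """
    n_prompts = len(val_probs[0])
    eps = 1e-9  # keeps log() finite for probabilities of exactly 0 or 1
    weights = []
    for c in range(n_clusters):
        members = [i for i, ci in enumerate(val_clusters) if ci == c]
        if not members:
            # No validation data for this cluster: fall back to uniform weights.
            weights.append([1.0 / n_prompts] * n_prompts)
            continue
        scores = []
        for k in range(n_prompts):
            total = 0.0
            for i in members:
                p = min(max(val_probs[i][k], eps), 1.0 - eps)
                total += math.log(p if val_labels[i] == 1 else 1.0 - p)
            scores.append(total / len(members))  # mean log-likelihood of prompt k
        # Softmax over prompts (assumed weighting rule, not taken from the paper).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        weights.append([e / norm for e in exps])
    return weights


def ensemble_prob(sample_probs, cluster_id, weights):
    """Combine per-prompt probabilities using the weights of the sample's cluster."""
    return sum(w * p for w, p in zip(weights[cluster_id], sample_probs))


# Toy data: prompt 0 is reliable on cluster 0, prompt 1 on cluster 1.
val_probs = [[0.9, 0.4], [0.1, 0.6], [0.5, 0.8], [0.5, 0.2]]
val_labels = [1, 0, 1, 0]
val_clusters = [0, 0, 1, 1]

weights = cluster_prompt_weights(val_probs, val_labels, val_clusters, n_clusters=2)
print(weights)
print(ensemble_prob([0.7, 0.3], cluster_id=0, weights=weights))
```

With a single global weight vector, a prompt that works well only on one kind of image would be averaged against the others everywhere. Conditioning the weights on the cluster lets the ensemble lean on different prompts for different visual content, which is the intuition the abstract describes.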

PDF Link: 2509.08777v1
