Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate

text-to-image (TTI) generation systems, providing automated judgments based on

visual and textual context. However, these "judge" models often suffer from

biases, overconfidence, and inconsistent performance across diverse image

domains. While prompt ensembling has shown promise for mitigating these issues

in unimodal, text-only settings, our experiments reveal that standard

ensembling methods fail to generalize effectively for TTI tasks. To address

these limitations, we propose a new multimodal-aware method called Multimodal

Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt

ensemble approach augmented by image clustering, allowing the judge to

dynamically assign prompt weights based on the visual characteristics of each

sample. We show that MMB improves accuracy in pairwise preference judgments and

greatly enhances calibration, making it easier to gauge the judge's true

uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB

outperforms existing baselines in alignment with human annotations and

calibration across varied image content. Our findings highlight the importance

of multimodal-specific strategies for judge calibration and suggest a promising

path forward for reliable large-scale TTI evaluation.

PDF Link: 2509.08777v1

部分平台可能图片显示异常,请以我的博客内容为准

相关推荐
AI智能架构工坊2 小时前
提升AI虚拟健康系统开发效率:架构师推荐10款低代码开发平台
android·人工智能·低代码·ai
崎岖Qiu2 小时前
【SpringAI篇01】:5分钟教会你使用SpringAI (1.0.0稳定版)
java·spring boot·后端·spring·ai
Tiandaren8 小时前
自用提示词01 || Prompt Engineering || 学习路线大纲 || 作用:通过启发式的问题来带动学习
人工智能·pytorch·深度学习·nlp·prompt·1024程序员节
CoderJia程序员甲10 小时前
GitHub 热榜项目 - 日榜(2025-10-25)
ai·开源·github·ai编程·github热榜
郑清1 天前
Spring AI Alibaba 10分钟快速入门
java·人工智能·后端·ai·1024程序员节·springaialibaba
beckyye1 天前
给web增加简单的ai对话功能
前端·ai·通义千问·qwen
梵得儿SHI1 天前
大型语言模型基础之 Prompt Engineering:打造稳定输出 JSON 格式的天气预报 Prompt
人工智能·语言模型·prompt·提示词工程·结构化输出·engineering·ai交互
赋创小助手1 天前
“短小精悍”的边缘AI算力利器:超微SYS-E403-14B-FRN2T服务器评测
服务器·人工智能·科技·ai·架构·边缘计算·1024程序员节
梵得儿SHI1 天前
Prompt Engineering 关键技能:精准掌控 LLM 输出的格式、内容与风格
大模型·llm·prompt·格式控制·内容到风格·内容控制·风格控制
GJGCY1 天前
金融智能体技术解读:十大应用场景与AI Agent架构设计思路
人工智能·经验分享·ai·金融·自动化