Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate

text-to-image (TTI) generation systems, providing automated judgments based on

visual and textual context. However, these "judge" models often suffer from

biases, overconfidence, and inconsistent performance across diverse image

domains. While prompt ensembling has shown promise for mitigating these issues

in unimodal, text-only settings, our experiments reveal that standard

ensembling methods fail to generalize effectively for TTI tasks. To address

these limitations, we propose a new multimodal-aware method called Multimodal

Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt

ensemble approach augmented by image clustering, allowing the judge to

dynamically assign prompt weights based on the visual characteristics of each

sample. We show that MMB improves accuracy in pairwise preference judgments and

greatly enhances calibration, making it easier to gauge the judge's true

uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB

outperforms existing baselines in alignment with human annotations and

calibration across varied image content. Our findings highlight the importance

of multimodal-specific strategies for judge calibration and suggest a promising

path forward for reliable large-scale TTI evaluation.

PDF Link: 2509.08777v1

部分平台可能图片显示异常,请以我的博客内容为准

相关推荐
Lululaurel7 小时前
机器学习系统框架:核心分类、算法与应用全景解析
人工智能·算法·机器学习·ai·分类
Rysxt_10 小时前
Spring Boot 集成 Spring AI OpenAI Starter 教程
java·spring boot·后端·ai
前端农民工ws15 小时前
Vue 框架的 markdown 渲染组件,针对 AI 的 markdown 流式传输场景
前端·javascript·vue.js·ai
zzywxc78719 小时前
自动化测试框架是软件测试的核心基础设施,通过预设规则和脚本自动执行测试用例,显著提高测试效率和覆盖率。
运维·人工智能·自动化·prompt·测试用例·流程图
m0_603888711 天前
Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
ai·原型模式·论文速览
Elastic 中国社区官方博客1 天前
使用 LangExtract 和 Elasticsearch
大数据·人工智能·elasticsearch·搜索引擎·ai·信息可视化·全文检索
lifallen1 天前
淘宝RecGPT:通过LLM增强推荐
人工智能·深度学习·ai·推荐算法