Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
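
To make the idea of cluster-dependent prompt weights more concrete, here is a minimal sketch of one way it could look. It is not the authors' implementation: the abstract does not specify how images are embedded or clustered, nor how the weights are fitted. The sketch assumes that image embeddings and cluster centroids are already available, that each prompt yields a probability that image A is preferred over image B, and that weights are set per cluster as a softmax over each prompt's mean log-likelihood on labeled validation pairs (a simple stand-in for a Bayesian weighting scheme). All function names and the toy data are hypothetical.

```python
import math


def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to the embedding (squared Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sqdist(embedding, centroids[k]))


def fit_cluster_weights(val_embeddings, val_prompt_probs, val_labels, centroids):
    """Fit one weight vector over prompts for each image cluster.

    val_prompt_probs[i][j] is prompt j's probability that image A wins on pair i;
    val_labels[i] is 1 if humans preferred A, else 0. (Assumed data layout.)
    """
    n_clusters, n_prompts = len(centroids), len(val_prompt_probs[0])
    loglik = [[0.0] * n_prompts for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for emb, probs, label in zip(val_embeddings, val_prompt_probs, val_labels):
        k = nearest_cluster(emb, centroids)
        counts[k] += 1
        for j, p in enumerate(probs):
            p = min(max(p, 1e-6), 1 - 1e-6)  # clip to avoid log(0)
            loglik[k][j] += math.log(p if label == 1 else 1 - p)
    weights = []
    for k in range(n_clusters):
        if counts[k] == 0:
            # No validation data in this cluster: fall back to uniform weights.
            weights.append([1.0 / n_prompts] * n_prompts)
            continue
        scores = [s / counts[k] for s in loglik[k]]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def ensemble_probability(embedding, prompt_probs, centroids, weights):
    """Combine per-prompt probabilities using the weights of the sample's cluster."""
    k = nearest_cluster(embedding, centroids)
    return sum(w * p for w, p in zip(weights[k], prompt_probs))


# Toy example: two clusters, two prompts, four labeled validation pairs.
centroids = [[0.0, 0.0], [1.0, 1.0]]
val_embeddings = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.0, 0.8]]
val_prompt_probs = [[0.9, 0.4], [0.2, 0.6], [0.3, 0.8], [0.6, 0.1]]
val_labels = [1, 0, 1, 0]

weights = fit_cluster_weights(val_embeddings, val_prompt_probs, val_labels, centroids)
p_near_first = ensemble_probability([0.05, 0.05], [0.9, 0.1], centroids, weights)
p_near_second = ensemble_probability([0.95, 0.95], [0.9, 0.1], centroids, weights)
```

In this toy setup the first prompt fits the validation labels better in the first cluster and the second prompt fits better in the second, so the same pair of prompt outputs is combined differently depending on which cluster the image falls into. A uniform prompt ensemble would return the same probability in both cases.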

PDF Link: 2509.08777v1
