Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Deep-Dive Summary:

Original Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate

text-to-image (TTI) generation systems, providing automated judgments based on

visual and textual context. However, these "judge" models often suffer from

biases, overconfidence, and inconsistent performance across diverse image

domains. While prompt ensembling has shown promise for mitigating these issues

in unimodal, text-only settings, our experiments reveal that standard

ensembling methods fail to generalize effectively for TTI tasks. To address

these limitations, we propose a new multimodal-aware method called Multimodal

Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt

ensemble approach augmented by image clustering, allowing the judge to

dynamically assign prompt weights based on the visual characteristics of each

sample. We show that MMB improves accuracy in pairwise preference judgments and

greatly enhances calibration, making it easier to gauge the judge's true

uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB

outperforms existing baselines in alignment with human annotations and

calibration across varied image content. Our findings highlight the importance

of multimodal-specific strategies for judge calibration and suggest a promising

path forward for reliable large-scale TTI evaluation.

PDF Link: 2509.08777v1

部分平台可能图片显示异常,请以我的博客内容为准

相关推荐
Cha0DD21 分钟前
【由浅入深探究langchain】第二十集-SQL Agent+Human-in-the-loop
人工智能·python·ai·langchain
Cha0DD22 分钟前
【由浅入深探究langchain】第十九集-官方的SQL Agent示例
人工智能·python·ai·langchain
CoderJia程序员甲4 小时前
GitHub 热榜项目 - 日榜(2026-03-29)
人工智能·ai·大模型·github·ai教程
带刺的坐椅7 小时前
SolonCode v2026.4.1 发布(比 ClaudeCode 简约的编程智能体)
java·ai·llm·agent·solon-ai·claudecode·soloncode
Elastic 中国社区官方博客7 小时前
Elasticsearch:如何在 Elastic AI Builder 里使用 DSL 来查询 Elasticsearch
大数据·人工智能·elasticsearch·搜索引擎·ai·全文检索
映辉7 小时前
ultralytics yolo入门实践
yolo·ai
ty_sj7 小时前
AI验证prompt
prompt
problc7 小时前
Pretext —— 无 DOM 文本测量与布局引擎
前端·ai
gao_tjie8 小时前
Seedream MCP 集成指南
ai
落樱弥城8 小时前
Vulkan Compute 详解
算法·ai·图形学