MMMG is a comprehensive benchmark for evaluating multimodal generative AI models across text, images, audio, and their interleaved combinations. It contains 49 carefully designed tasks with 937 instructions spanning four modality combinations. MMMG focuses on two types of tasks: (1) "verifiable" tasks, whose outputs can be objectively checked by programs, for example, whether a generated speech recording begins with a specific keyword; and (2) tasks with "generation-evaluation gaps," where generating a correct output is hard because of complex constraints, but verifying whether the output satisfies those constraints remains simple and objective. For example, generating an image of a snowman without a carrot nose is challenging because of spurious correlations, yet the absence of the carrot nose is easy to verify by prompting a VLM.
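To make the "verifiable" setting concrete, here is a minimal sketch of what such programmatic checks might look like. It assumes openai-whisper for transcription and an OpenAI-hosted VLM as a yes/no judge; the model choices, prompts, and helper names (`speech_starts_with_keyword`, `image_lacks_object`) are illustrative assumptions, not MMMG's actual verifiers.

```python
# Illustrative sketch of MMMG-style programmatic checks; NOT the benchmark's
# official verifiers. Model choices and prompts below are assumptions.
import base64
import string

import whisper             # openai-whisper, used here for speech transcription
from openai import OpenAI  # an OpenAI-hosted VLM used here as a stand-in judge


def speech_starts_with_keyword(audio_path: str, keyword: str) -> bool:
    """Transcribe a generated speech clip and check that it begins with `keyword`."""
    model = whisper.load_model("base")  # model size is an arbitrary choice
    text = model.transcribe(audio_path)["text"]
    words = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return bool(words) and words[0] == keyword.lower()


def image_lacks_object(image_path: str, object_name: str) -> bool:
    """Ask a VLM a yes/no question to verify an object (e.g. a carrot nose) is absent."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does this image contain a {object_name}? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("no")
```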
MMMG's significance lies in its strong alignment with human judgment and its ability to provide fine-grained capability analysis. The benchmark agrees with human evaluators 94.3% of the time on average across modalities, with particularly high agreement on image (94.8%) and interleaved image-text (95.6%) tasks, a substantial improvement over previous benchmarks. By categorizing tasks according to the specific capability being assessed, MMMG lets researchers pinpoint a model's weaknesses rather than only report overall scores. This granular analysis makes MMMG a powerful tool for guiding future model development and identifying where the field should focus its efforts.
Comprehensiveness of MMMG compared with other multimodal generation benchmarks. "Score" stands for embedding-based or rule-based similarity scoring, "code" for programmatic verification, and "reason" for multi-step reasoning; "?" indicates low human alignment or no human-agreement experiments. MMMG improves on previous benchmarks in two key respects. On comprehensiveness: whereas existing benchmarks such as GenEval and DrawBench focus on a single modality, MMMG covers all four modality combinations, including the critical interleaved scenarios, and three major tested capabilities, addressing the gap in evaluating real-world multimodal applications. On reliability: MMMG achieves human agreement above 90% across all modalities by relying on verifiable tasks with objective, programmatic evaluation rather than subjective assessments or unvalidated MLLM-as-judge approaches.
Detailed statistics of the MMMG benchmark per task.
We evaluate a range of models, both closed- and open-source, sampling 4 generations for every instruction. For evaluation, we employ the most human-aligned metric reported in the paper. To submit your model's performance to the leaderboard, please refer to the submission guidelines.
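As a rough illustration of the protocol above (4 samples per instruction, each scored by the task's verifier, then aggregated per task), here is a hedged sketch; `generate`, `verify`, and the instruction format are hypothetical placeholders rather than the benchmark's actual interfaces.

```python
# Hedged sketch of the evaluation protocol: 4 generations per instruction,
# each scored by the task's verifier, then averaged per task. `generate`,
# `verify`, and the instruction format are hypothetical placeholders.
from collections import defaultdict
from statistics import mean

NUM_SAMPLES = 4  # 4 generations are sampled for every instruction


def evaluate(instructions, generate, verify):
    """instructions: iterable of dicts with 'task' and 'prompt' keys.
    generate(prompt) -> model output; verify(output, instruction) -> score in [0, 1]."""
    per_task_scores = defaultdict(list)
    for instruction in instructions:
        sample_scores = [
            verify(generate(instruction["prompt"]), instruction)
            for _ in range(NUM_SAMPLES)
        ]
        per_task_scores[instruction["task"]].append(mean(sample_scores))
    # Task score = mean over the task's instructions (each itself a mean over samples).
    return {task: mean(scores) for task, scores in per_task_scores.items()}
```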
The leaderboard reports, for each model (Name, Size, Date), detailed results per category with the following columns:

Category | Columns
---|---
Image | Overall, Object, Relation, Format, Text Rendering
Image-Text | Overall, Image Consistency, Image-Text Coherence, Image Editing, Reasoning
Sound-Music | Overall, Sound, Music
Speech | Overall, Voice, Transcript, Speech-Text Coherence
The best-performing model in each category is shown in bold, and the second best is underlined. *: results provided by the authors. Most TTS models are excluded because they do not support voice customization.
@misc{yao2025mmmgcomprehensivereliableevaluation,
title={MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation},
author={Jihan Yao and Yushi Hu and Yujie Yi and Bin Han and Shangbin Feng and Guang Yang and Bingbing Wen and Ranjay Krishna and Lucy Lu Wang and Yulia Tsvetkov and Noah A. Smith and Banghua Zhu},
year={2025},
eprint={2505.17613},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.17613},
}