MMMG

A Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

*Core Contributors
†Corresponding to: jihany2@cs.washington.edu, yushihu@uw.edu
MMMG Overview

Examples of tasks and their evaluation metrics in MMMG. For each task, we develop an evaluation metric using programs, models, or their combination. The tasks are either verifiable purely by programs or exhibit large generation-evaluation gaps: generation is challenging for models, while automatic evaluation correlates highly with human judgments. We show evaluation pseudo-code to illustrate the evaluation process.

MMMG Benchmark

Overview

MMMG is a comprehensive benchmark designed to evaluate multimodal generative AI models across text, images, audio, and their interleaved combinations. The benchmark contains 49 carefully designed tasks with 937 instructions spanning four modality combinations. MMMG focuses on two types of tasks: (1) "Verifiable" tasks, whose outputs can be objectively checked by programs; for example, checking whether a generated speech recording begins with a specific keyword. (2) Tasks with "generation-evaluation gaps", where generating a correct output is challenging due to complex constraints, but verifying whether the output meets those constraints remains simple and objective; for example, generating an image of a snowman without a carrot nose is challenging due to spurious correlations, but verifying the absence of the carrot nose is easy by prompting a VLM.
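To make these two task types concrete, below is a minimal Python sketch (not MMMG's actual evaluation code) of what such checks could look like; transcribe and ask_vlm are hypothetical placeholders for an ASR model and a VLM prompt call.

def transcribe(audio_path: str) -> str:
    """Hypothetical ASR call; replace with a real speech recognizer."""
    raise NotImplementedError

def ask_vlm(image_path: str, question: str) -> str:
    """Hypothetical VLM call; replace with a real vision-language model API."""
    raise NotImplementedError

def speech_starts_with_keyword(audio_path: str, keyword: str) -> bool:
    # Program-verifiable task: transcribe the audio, then check the transcript prefix.
    transcript = transcribe(audio_path).strip().lower()
    return transcript.startswith(keyword.lower())

def snowman_has_no_carrot_nose(image_path: str) -> bool:
    # Generation-evaluation-gap task: generating the image is hard,
    # but a yes/no VLM query makes verification simple and objective.
    answer = ask_vlm(image_path, "Does the snowman in this image have a carrot nose? Answer yes or no.")
    return answer.strip().lower().startswith("no")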

MMMG's significance lies in its strong alignment with human judgment and its ability to provide fine-grained capability analysis. The benchmark achieves an average of 94.3% agreement with human evaluators across modalities, with particularly strong performance on image (94.8%) and interleaved image-text (95.6%) tasks, a substantial improvement over previous benchmarks. By categorizing tasks according to the specific capability being assessed, MMMG enables researchers to identify precise weaknesses in models rather than relying on overall scores alone. This granular analysis makes MMMG a powerful tool for guiding future model development and identifying where the field needs to focus its efforts.
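As a rough illustration of what a human agreement number could mean, the sketch below assumes agreement is simply the fraction of samples on which the automatic pass/fail verdict matches the human verdict; the paper's exact protocol may differ.

def agreement_rate(auto_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    # Fraction of samples on which the automatic metric and the human judge agree.
    assert len(auto_verdicts) == len(human_verdicts) and auto_verdicts
    return sum(a == h for a, h in zip(auto_verdicts, human_verdicts)) / len(auto_verdicts)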

Comparisons with Existing Benchmarks

Comparison of MMMG with other benchmarks

Comprehensiveness of MMMG compared with other multimodal generation benchmarks. "Score" stands for embedding-based / rule-based similarity score, "code" for programmatic verification, and "reason" for multi-step reasoning. "?" indicates low human alignment or no human experiments. MMMG significantly improves upon previous benchmarks in two key aspects. In comprehensiveness, while existing benchmarks like GenEval and DrawBench focus on single modalities, MMMG uniquely covers all four modality combinations, including critical interleaved scenarios, and three major tested capabilities, addressing the gap in evaluating real-world multimodal applications. In reliability, MMMG achieves human agreement exceeding 90% across all modalities by using verifiable tasks with objective, programmatic evaluation instead of subjective assessments or unvalidated MLLM-as-judge approaches.

Detailed Tasks

Detailed task statistics

Detailed statistics of the MMMG benchmark per task.

Leaderboard

We evaluate a range of models, both closed- and open-source. We sample 4 generations for every instruction. For evaluation, we employ the most human-aligned metric reported in the paper. If you would like to submit your model's performance to the leaderboard, please refer to the submission guidelines.
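For reference, one plausible aggregation, assuming each of the 4 sampled generations receives a binary pass/fail verdict from its task metric, is sketched below; the leaderboard's exact computation follows the paper.

from statistics import mean

def instruction_score(verdicts: list[bool]) -> float:
    # Pass rate over the sampled generations (e.g., 4) for a single instruction.
    return mean(1.0 if v else 0.0 for v in verdicts)

def category_score(per_instruction_verdicts: list[list[bool]]) -> float:
    # Mean of per-instruction pass rates, scaled to a percentage-style leaderboard entry.
    return 100.0 * mean(instruction_score(v) for v in per_instruction_verdicts)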


Click on Image, Image-Text, Sound-Music, or Speech-Text to expand detailed results.

[Leaderboard table: models grouped into Open-Source, Proprietary, and Agent, with columns Name, Size, Date, Overall, and per-capability scores such as Object, Relation, Format, and Text Rendering.]

The best-performing model in each category is in bold, and the second best is underlined. *: results provided by the authors. Most TTS models are excluded because they do not support voice customization.

BibTeX

@misc{yao2025mmmgcomprehensivereliableevaluation,
  title={MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation},
  author={Jihan Yao and Yushi Hu and Yujie Yi and Bin Han and Shangbin Feng and Guang Yang and Bingbing Wen and Ranjay Krishna and Lucy Lu Wang and Yulia Tsvetkov and Noah A. Smith and Banghua Zhu},
  year={2025},
  eprint={2505.17613},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.17613},
}