MMMG is a comprehensive benchmark for evaluating multimodal generative AI models across text, images, audio, and their interleaved combinations. It contains 49 carefully designed tasks with 937 instructions spanning four modality combinations. MMMG focuses on two types of tasks: (1) "verifiable" tasks, whose outputs can be objectively checked by programs, for example, whether a generated speech recording begins with a specific keyword; and (2) tasks with "generation-evaluation gaps," where generating a correct output is hard because of complex constraints, but verifying whether the output satisfies those constraints remains simple and objective. For example, generating an image of a snowman without a carrot nose is challenging because of spurious correlations, yet the absence of the carrot nose is easy to verify by prompting a VLM.
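To make the "verifiable" setting concrete, here is a minimal sketch of what such programmatic checks might look like. It assumes openai-whisper for transcription and an OpenAI-hosted VLM as a yes/no judge; the model choices, prompts, and helper names (`speech_starts_with_keyword`, `image_lacks_object`) are illustrative assumptions, not MMMG's actual verifiers.

```python
# Illustrative sketch of MMMG-style programmatic checks; NOT the benchmark's
# official verifiers. Model choices and prompts below are assumptions.
import base64
import string

import whisper             # openai-whisper, used here for speech transcription
from openai import OpenAI  # an OpenAI-hosted VLM used here as a stand-in judge


def speech_starts_with_keyword(audio_path: str, keyword: str) -> bool:
    """Transcribe a generated speech clip and check that it begins with `keyword`."""
    model = whisper.load_model("base")  # model size is an arbitrary choice
    text = model.transcribe(audio_path)["text"]
    words = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return bool(words) and words[0] == keyword.lower()


def image_lacks_object(image_path: str, object_name: str) -> bool:
    """Ask a VLM a yes/no question to verify an object (e.g. a carrot nose) is absent."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does this image contain a {object_name}? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("no")
```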
MMMG's significance lies in its strong alignment with human judgment and its ability to provide fine-grained capability analysis. The benchmark agrees with human evaluators 94.3% of the time on average across modalities, with particularly high agreement on image (94.8%) and interleaved image-text (95.6%) tasks, a substantial improvement over previous benchmarks. By categorizing tasks according to the specific capability being assessed, MMMG lets researchers pinpoint a model's weaknesses rather than only report overall scores. This granular analysis makes MMMG a powerful tool for guiding future model development and identifying where the field should focus its efforts.
Comprehensiveness of MMMG compared with other multimodal generation benchmarks. "Score" stands for embedding-based or rule-based similarity scoring, "code" for programmatic verification, and "reason" for multi-step reasoning; "?" indicates low human alignment or no human-agreement experiments. MMMG improves on previous benchmarks in two key respects. On comprehensiveness: whereas existing benchmarks such as GenEval and DrawBench focus on a single modality, MMMG covers all four modality combinations, including the critical interleaved scenarios, and three major tested capabilities, addressing the gap in evaluating real-world multimodal applications. On reliability: MMMG achieves human agreement above 90% across all modalities by relying on verifiable tasks with objective, programmatic evaluation rather than subjective assessments or unvalidated MLLM-as-judge approaches.
Detailed statistics of the MMMG benchmark per task.
We evaluate a range of models, both closed- and open-source, sampling 4 generations for every instruction. For evaluation, we employ the most human-aligned metric reported in the paper. To submit your model's performance to the leaderboard, please refer to the submission guidelines.
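As a rough illustration of the protocol above (4 samples per instruction, each scored by the task's verifier, then aggregated per task), here is a hedged sketch; `generate`, `verify`, and the instruction format are hypothetical placeholders rather than the benchmark's actual interfaces.

```python
# Hedged sketch of the evaluation protocol: 4 generations per instruction,
# each scored by the task's verifier, then averaged per task. `generate`,
# `verify`, and the instruction format are hypothetical placeholders.
from collections import defaultdict
from statistics import mean

NUM_SAMPLES = 4  # 4 generations are sampled for every instruction


def evaluate(instructions, generate, verify):
    """instructions: iterable of dicts with 'task' and 'prompt' keys.
    generate(prompt) -> model output; verify(output, instruction) -> score in [0, 1]."""
    per_task_scores = defaultdict(list)
    for instruction in instructions:
        sample_scores = [
            verify(generate(instruction["prompt"]), instruction)
            for _ in range(NUM_SAMPLES)
        ]
        per_task_scores[instruction["task"]].append(mean(sample_scores))
    # Task score = mean over the task's instructions (each itself a mean over samples).
    return {task: mean(scores) for task, scores in per_task_scores.items()}
```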
The leaderboard reports, for each model (Name, Size, Date), detailed results per category with the following columns:

Category | Columns
---|---
Image | Overall, Object, Relation, Format, Text Rendering
Image-Text | Overall, Image Consistency, Image-Text Coherence, Image Editing, Reasoning
Sound-Music | Overall, Sound, Music
Speech | Overall, Voice, Transcript, Speech-Text Coherence
The best-performing model in each category is shown in bold, and the second best is underlined. *: results provided by the authors. Most TTS models are excluded because they do not support voice customization.
@misc{yao2025mmmgcomprehensivereliableevaluation,
title={MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation},
author={Jihan Yao and Yushi Hu and Yujie Yi and Bin Han and Shangbin Feng and Guang Yang and Bingbing Wen and Ranjay Krishna and Lucy Lu Wang and Yulia Tsvetkov and Noah A. Smith and Banghua Zhu},
year={2025},
eprint={2505.17613},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.17613},
}