Formulated and managed by MLCommons, MLPerf benchmarks measure key operational parameters of artificial intelligence (AI) accelerators across multiple industries. These standardized metrics help semiconductor companies optimize performance and support the development of efficient AI chip designs.
This article discusses MLPerf’s crucial role in facilitating comprehensive benchmark testing for tiny and edge computing systems while scaling ML inference performance evaluation from mobile devices to data center systems. It reviews MLPerf’s standardized approach to evaluating ML training performance, details its methodologies for assessing storage performance in ML workloads, and outlines how AlgoPerf advances training speed.
Optimizing benchmark testing for tiny and edge computing systems
Designed for neural networks under 100 kB, MLPerf Inference: Tiny targets ultra-low-power machine learning (ML) systems and embedded devices such as microcontrollers (MCUs), digital signal processors (DSPs), and small neural network accelerators operating at 10–250 MHz and consuming less than 50 mW of power. Key metrics focus on inference speed, optional energy efficiency, and model accuracy, with benchmarks measuring single-inference latency and model quality, defined as accuracy or area under the curve (AUC).
To ensure consistency, optional power measurements must use SPEC-certified power meters to measure AC wall power, and the power data must be captured during the same run as the performance metrics. MLPerf Inference: Tiny evaluates tasks such as:
- Keyword spotting with DS-CNN for smart earbuds and assistants
- Visual wake words with MobileNet for binary image classification
- Image classification with ResNet on the CIFAR-10 dataset
- Anomaly detection with a deep autoencoder on the ToyADMOS dataset
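The single-inference latency metric above can be approximated on a host machine with TensorFlow Lite's Python interpreter. The sketch below is not the official MLPerf Tiny harness, which runs on the target device itself; the model file name ds_cnn_kws_int8.tflite and the random input tensor are placeholders.

```python
import time

import numpy as np
import tensorflow as tf

# Load a quantized keyword-spotting model (file name is a placeholder).
interpreter = tf.lite.Interpreter(model_path="ds_cnn_kws_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# A random tensor shaped like one preprocessed audio feature map.
sample = np.random.randint(-128, 128, size=inp["shape"], dtype=np.int8)

# Warm-up run, then time repeated single inferences.
interpreter.set_tensor(inp["index"], sample)
interpreter.invoke()

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    interpreter.set_tensor(inp["index"], sample)
    interpreter.invoke()
    latencies_ms.append((time.perf_counter() - start) * 1e3)

print(f"median latency: {np.median(latencies_ms):.2f} ms")
print(f"p90 latency:    {np.percentile(latencies_ms, 90):.2f} ms")
```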
Building on Tiny, MLPerf Inference: Edge evaluates ML inference performance and optional power consumption across a diverse range of edge computing systems. Standardized preprocessing guidelines ensure consistency, and metrics capture inference latency and throughput. Tasks span image classification with ResNet50-v1.5, as shown in Figure 1, object detection, question answering with BERT-Large, and other vision and language benchmarks. MLPerf Inference: Edge excludes DLRMv2, Llama 2 70B, Mixtral-8x7B, and R-GAT.

MLPerf Inference: Edge follows the same SPEC-certified power measurement standards as Tiny, with any optional power evaluation conducted during the same run as the performance metrics. In addition, MLPerf Inference: Edge benchmarks must meet stringent accuracy requirements, achieving 99% of reference model accuracy, or 99.9% for high-accuracy variants, while adhering to latency constraints such as 15 ms for ResNet50. Deployment scenarios include single-stream, multi-stream, and offline modes.
Scaling ML inference performance from mobile to data center systems
MLPerf Inference: Mobile supports Android and iOS platforms, with a headless version available for laptops and non-mobile operating systems. It uses the LoadGen tool to generate inference requests and measure performance metrics for tasks such as image classification, object detection, semantic segmentation, and question-answering. The benchmark evaluates inference latency and throughput across deployment scenarios such as single-stream and offline modes. MLPerf Inference: Mobile is compatible with various ML frameworks and accelerators, spanning CPUs, GPUs, NPUs, and DSPs. Importantly, MLPerf Inference: Mobile doesn’t include power measurements.
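To illustrate how LoadGen drives a system under test, here is a minimal sketch using the mlperf_loadgen Python bindings in the offline scenario. The do-nothing inference callback and the sample counts are placeholders, and function signatures can differ slightly between LoadGen releases, so treat this as a sketch rather than a ready-made harness.

```python
import mlperf_loadgen as lg

QSL_TOTAL, QSL_PERF = 1024, 128  # placeholder sample counts

def issue_queries(query_samples):
    # A real system under test would run inference on each sample here.
    responses = [lg.QuerySampleResponse(q.id, 0, 0) for q in query_samples]
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

def load_samples(indices):
    pass  # load the listed samples into host memory

def unload_samples(indices):
    pass  # release them again

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(QSL_TOTAL, QSL_PERF, load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```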
MLPerf Inference: Datacenter evaluates ML inference performance using core models such as:
- ResNet50-v1.5 for image classification
- BERT-Large for question-answering
- DLRMv2 for recommendation systems
- Llama 2 70B for large language models, as shown in Figure 2
- Mixtral-8x7B for mixture-of-experts models
- Stable Diffusion XL for text-to-image generation

Deployment scenarios include server mode, which measures query processing under service-level agreement (SLA) latency constraints and reports queries per second (QPS), and offline mode, which evaluates bulk processing throughput, measured in samples processed per second. MLPerf Inference: Datacenter benchmarks must achieve 99% of reference model accuracy for standard variants and 99.9% for high-accuracy variants while meeting specific latency constraints, such as 15 ms for ResNet50.
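As a rough illustration of the two metrics, the sketch below checks a set of per-query latencies against a tail-latency bound (the 15 ms ResNet50 constraint mentioned above) and computes offline throughput. The latencies are synthetic; in the official benchmarks, LoadGen generates the queries and enforces these rules.

```python
import numpy as np

def meets_server_constraint(latencies_s, percentile=99, bound_ms=15.0):
    """Check a tail-latency percentile against a per-model latency bound."""
    tail_ms = float(np.percentile(latencies_s, percentile)) * 1e3
    return tail_ms, tail_ms <= bound_ms

def offline_throughput(num_samples, wall_clock_s):
    """Offline metric: samples processed per second over the whole run."""
    return num_samples / wall_clock_s

# Synthetic per-query latencies averaging about 10 ms.
rng = np.random.default_rng(0)
latencies = rng.gamma(shape=4.0, scale=0.0025, size=10_000)

tail_ms, ok = meets_server_constraint(latencies)
print(f"p99 latency: {tail_ms:.1f} ms, within 15 ms bound: {ok}")
print(f"offline throughput: {offline_throughput(50_000, 62.5):.0f} samples/s")
```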
From training models to HPC simulations: a standardized benchmarking approach
MLPerf Training evaluates performance across diverse systems and platforms. It spans a range of tasks, including:
- Image classification with ResNet-50
- Object detection and instance segmentation
- Large language models, including GPT-3 and Llama 2 70B
- Text-to-image generation, such as Stable Diffusion
- Graph neural networks (GNNs)
- BERT for natural language processing (NLP)
- Recommendation systems
Key features include complete system tests that evaluate models, software, and hardware under real-world conditions. Optional power measurements offer insight into the energy efficiency of different system configurations. The benchmark supports distributed training scenarios, assessing how configurations affect training speed and efficiency. Each model must be trained until it reaches the quality target defined for that benchmark, and the primary metric is the time required to get there.
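The underlying metric is easiest to see in miniature: train until a quality target is reached and report the elapsed wall-clock time. The toy logistic-regression "workload," its 90% accuracy target, and the SGD settings below are illustrative stand-ins for the full reference models and per-benchmark targets.

```python
import time

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a real workload: logistic regression on synthetic data.
X = rng.normal(size=(5000, 20))
y = (X @ rng.normal(size=20) + 0.5 * rng.normal(size=5000) > 0).astype(float)

w = np.zeros(20)
target_accuracy = 0.90          # stand-in for a benchmark quality target
lr, batch, max_steps = 0.1, 100, 20_000

start, step = time.perf_counter(), 0
while step < max_steps:
    i = rng.integers(0, len(X) - batch)
    xb, yb = X[i:i + batch], y[i:i + batch]
    p = 1.0 / (1.0 + np.exp(-(xb @ w)))        # sigmoid predictions
    w -= lr * xb.T @ (p - yb) / batch          # one SGD step on logistic loss
    step += 1
    if step % 50 == 0:                         # periodic evaluation
        if np.mean((X @ w > 0) == y) >= target_accuracy:
            break

print(f"time-to-target: {time.perf_counter() - start:.3f} s after {step} steps")
```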
MLPerf Training: HPC extends these benchmarks to high-performance computing (HPC), supercomputers, and large-scale scientific computing systems. Optimized for coupled workloads that integrate training with simulations, it focuses on scientific applications such as CosmoFlow for cosmological parameter prediction, DeepCAM for climate and weather analysis, and Open Catalyst for atomic force prediction. OpenFold, a recent addition shown in Figure 3, addresses protein structure prediction alongside tasks like quantum molecular dynamics and large-scale scientific data analysis.

MLPerf Training: HPC addresses key characteristics specific to large-scale scientific computing, including:
- On-node vs. off-node communication
- Big dataset handling and I/O bottlenecks
- System reliability at scale
- Message passing interface (MPI) and alternative communication backend performance (see the timing sketch below)
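Communication costs at scale can be probed directly. The sketch below times a data-parallel-style allreduce with mpi4py; the library choice and the 64 MB buffer size are assumptions for illustration, not part of the MLPerf Training: HPC harness.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# A gradient-sized buffer (64 MB of float32), as in data-parallel training.
buf = np.ones(16 * 1024 * 1024, dtype=np.float32)

comm.Barrier()
start = MPI.Wtime()
for _ in range(10):
    comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
comm.Barrier()
elapsed = (MPI.Wtime() - start) / 10

if rank == 0:
    print(f"{size} ranks: {elapsed * 1e3:.2f} ms per 64 MB allreduce")
```

Running it once with all ranks on one node (for example, mpirun -np 8 python allreduce_probe.py) and again with ranks spread across nodes makes the on-node versus off-node communication gap directly visible.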
Evaluating storage performance in ML workloads
MLPerf Storage evaluates how efficiently storage systems supply training data during model training. By using simulated GPUs, it stresses data pipelines to their limits without requiring physical hardware accelerators: accelerator compute time is emulated with sleep() calls, providing an open and transparent framework for assessing storage performance in ML training scenarios. Supported models include:
- ResNet50 for image classification (approximately 114 KB per sample)
- 3D UNet for medical imaging (approximately 146 MB per sample)
- CosmoFlow for scientific computing (approximately 2.8 MB per sample)
The benchmark supports distributed training scenarios, requiring all clients to share a single data namespace. It leverages the Deep Learning I/O (DLIO) benchmark for synthetic data generation and loading, with scaling units defined to evaluate performance. Storage scaling units represent the smallest increment by which storage system throughput can be increased, while host nodes simulate additional load on the storage system. Each host node runs an identical number of simulated accelerators, ensuring consistent performance evaluation across a wide range of ML applications.
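The sleep()-based emulation described above can be sketched in a few lines: a loop reads batches from storage, "computes" by sleeping, and reports what fraction of time the simulated accelerator was busy. The sample size, batch size, and 20 ms compute time below are assumptions for illustration, not DLIO's actual parameters.

```python
import os
import tempfile
import time

SAMPLE_BYTES = 114 * 1024   # roughly the ~114 KB ResNet50 sample size above
BATCH, STEPS = 32, 20
COMPUTE_TIME_S = 0.020      # assumed per-batch "GPU" time emulated by sleep()

# A throwaway file stands in for the synthetic dataset DLIO would generate.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(SAMPLE_BYTES * BATCH * STEPS))
    path = f.name

compute_s = wait_s = 0.0
with open(path, "rb", buffering=0) as data:
    for _ in range(STEPS):
        t0 = time.perf_counter()
        data.read(SAMPLE_BYTES * BATCH)    # storage supplies one batch
        wait_s += time.perf_counter() - t0

        t0 = time.perf_counter()
        time.sleep(COMPUTE_TIME_S)         # sleep() stands in for accelerator compute
        compute_s += time.perf_counter() - t0

os.unlink(path)
print(f"accelerator utilization: {compute_s / (compute_s + wait_s):.1%}")
```

Broadly, MLPerf Storage scores a configuration by how many simulated accelerators the storage system can keep above a per-workload utilization threshold.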
AlgoPerf: accelerating training speed through innovation
The AlgoPerf: Training Algorithms benchmark evaluates training speed improvements through a single track governed by two distinct rulesets: external tuning and self-tuning. As shown in Figure 4, the external tuning ruleset enables workload-agnostic hyperparameter search spaces, while the self-tuning ruleset requires algorithms to adapt autonomously within a single optimization run.

Key benchmark functions include parameter updates, optimizer state initialization, data selection, and batch size definition. Evaluations are conducted on a fixed system to ensure fair comparisons across frameworks such as JAX and PyTorch. Time-to-result metrics are measured for multiple workloads.
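These four roles map naturally onto a submission module. The sketch below uses simplified, illustrative names and signatures to show what each function does; the actual AlgoPerf API passes considerably more state (workload handles, model state, RNG keys) to each call, so treat this as a conceptual outline rather than a drop-in submission.

```python
from typing import Any, Dict, Iterator, Tuple

def get_batch_size(workload_name: str) -> int:
    """Batch size definition: choose a batch size per workload."""
    return {"imagenet_resnet": 1024, "wmt_transformer": 256}.get(workload_name, 128)

def init_optimizer_state(params: Dict[str, float]) -> Dict[str, Any]:
    """Optimizer state initialization: zeroed momentum buffers and a step counter."""
    return {"momentum": {k: 0.0 for k in params}, "step": 0}

def data_selection(input_queue: Iterator[Any]) -> Any:
    """Data selection: pick the next training batch (here, simply the next one)."""
    return next(input_queue)

def update_params(
    params: Dict[str, float],
    grads: Dict[str, float],
    state: Dict[str, Any],
    hparams: Dict[str, float],
) -> Tuple[Dict[str, float], Dict[str, Any]]:
    """Parameter update: one heavy-ball momentum step."""
    lr, beta = hparams["learning_rate"], hparams["momentum"]
    for k, g in grads.items():
        state["momentum"][k] = beta * state["momentum"][k] + g
        params[k] -= lr * state["momentum"][k]
    state["step"] += 1
    return params, state

# Tiny demonstration with scalar "parameters."
hparams = {"learning_rate": 0.1, "momentum": 0.9}
params = {"w": 1.0}
state = init_optimizer_state(params)
params, state = update_params(params, {"w": 0.5}, state, hparams)
print(params)  # {'w': 0.95}
```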
Summary
MLPerf was introduced in 2018 to standardize industry metrics for measuring ML performance and is now developed and managed by MLCommons, an open engineering consortium with over 125 members. Today, MLPerf benchmarks cover data centers, the intelligent edge, and mobile devices, providing comprehensive training, inference, storage, and algorithmic performance metrics. These benchmarks help semiconductor companies optimize performance and power, enabling more cost-effective and efficient AI chip designs.
Related EE World content
What Are the Different Types of AI Accelerators?
Benchmarking AI From the Edge to the Cloud
What is TinyML?
What’s the Difference Between GPUs and TPUs for AI Processing?
How Are High-Speed Board-to-Board Connectors Used in ML and AI Systems?