What are the different MLPerf benchmarks from MLCommons?

January 29, 2025 By Aharon Etengoff

Formulated and managed by MLCommons, MLPerf benchmarks measure key operational parameters of artificial intelligence (AI) accelerators across multiple industries. These standardized metrics help semiconductor companies optimize performance and support the development of efficient AI chip designs.

This article discusses MLPerf’s crucial role in facilitating comprehensive benchmark testing for tiny and edge computing systems while scaling ML inference performance evaluation from mobile devices to data center systems. It reviews MLPerf’s standardized approach to evaluating ML training performance, details its methodologies for assessing storage performance in ML workloads, and outlines how AlgoPerf advances training speed.

Optimizing benchmark testing for tiny and edge computing systems

Designed for neural networks under 100 kB, MLPerf Inference: Tiny targets ultra-low-power machine learning (ML) systems and embedded devices such as microcontrollers (MCUs), digital signal processors (DSPs), and small neural network accelerators operating at 10–250 MHz and consuming less than 50 mW of power. Key metrics focus on inference speed, optional energy efficiency, and model accuracy, with benchmarks measuring single-inference latency and model quality, defined as accuracy or area under the curve (AUC).

To ensure consistency, optional power measurements must use SPEC-certified power meters to measure AC wall power and undergo evaluation during the same run as performance metrics. MLPerf Inference: Tiny evaluates tasks such as:

  • Keyword spotting with DS-CNN for smart earbuds and assistants
  • Visual wake words with MobileNet for binary image classification
  • Image classification with ResNet on the CIFAR-10 dataset
  • Anomaly detection with a deep autoencoder on the ToyADMOS dataset
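
To make the latency metric concrete, the short sketch below times single inferences of a quantized keyword-spotting model with the TensorFlow Lite Python interpreter. The model filename and the random input are placeholders, and official MLPerf Tiny submissions run on the device under test through the benchmark's own harness rather than a host-side script like this.

```python
# Minimal sketch: single-inference latency of a quantized TFLite model.
# "kws_ds_cnn_int8.tflite" is a placeholder path, not an official artifact.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="kws_ds_cnn_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Random int8 input standing in for a preprocessed audio feature window.
sample = np.random.randint(-128, 128, size=inp["shape"], dtype=np.int8)

latencies = []
for _ in range(100):
    interpreter.set_tensor(inp["index"], sample)
    start = time.perf_counter()
    interpreter.invoke()
    latencies.append((time.perf_counter() - start) * 1e3)  # milliseconds

scores = interpreter.get_tensor(out["index"])
print(f"median latency: {np.median(latencies):.2f} ms, output shape: {scores.shape}")
```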

Building on Tiny, MLPerf Inference: Edge evaluates ML inference performance and optional power consumption across a diverse range of edge computing systems. Standardized preprocessing guidelines ensure consistency, and its metrics capture inference latency and throughput. Tasks span image classification with ResNet50-v1.5, as shown in Figure 1, object detection, question answering with BERT-Large, and other vision and language benchmarks. MLPerf Inference: Edge excludes the DLRMv2, Llama 2 70B, Mixtral-8x7B, and R-GAT benchmarks.

Figure 1. A detailed illustration of the ResNet-50 architecture showcasing the residual learning block and hierarchical convolutional layers for image classification. (Image: ResearchGate)

MLPerf Inference: Edge follows the same SPEC-certified power measurement standards as Tiny, requiring optional evaluations to be conducted during the same run as performance metrics. In addition, MLPerf Inference: Edge benchmarks must meet stringent requirements, achieving 99% of reference model accuracy or 99.9% for high-accuracy variants while adhering to latency constraints such as 15 ms for ResNet50. Deployment scenarios include single-stream, multi-stream, and offline modes.
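
As an illustration of how those thresholds combine, the hypothetical helper below checks a measured result against the 99% (or 99.9%) accuracy rule and a latency bound. It is not part of any MLPerf tool; the example numbers simply echo the ResNet50 figures mentioned above (a reference top-1 accuracy of roughly 76.46% and a 15 ms constraint).

```python
# Hypothetical helper mirroring MLPerf's accuracy-plus-latency validity rules.
def meets_mlperf_targets(measured_accuracy, reference_accuracy,
                         p99_latency_ms, latency_bound_ms, high_accuracy=False):
    """True if accuracy is >= 99% (or 99.9%) of the reference and latency fits."""
    target = reference_accuracy * (0.999 if high_accuracy else 0.99)
    return measured_accuracy >= target and p99_latency_ms <= latency_bound_ms

# Example: a ResNet50 run at 76.1% top-1 with a 12.3 ms 99th-percentile latency.
print(meets_mlperf_targets(76.1, 76.46, p99_latency_ms=12.3, latency_bound_ms=15.0))
```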

Scaling ML inference performance from mobile to data center systems

MLPerf Inference: Mobile supports Android and iOS platforms, with a headless version available for laptops and non-mobile operating systems. It uses the LoadGen tool to generate inference requests and measure performance metrics for tasks such as image classification, object detection, semantic segmentation, and question-answering. The benchmark evaluates inference latency and throughput across deployment scenarios such as single-stream and offline modes. MLPerf Inference: Mobile is compatible with various ML frameworks and accelerators, spanning CPUs, GPUs, NPUs, and DSPs. Importantly, MLPerf Inference: Mobile doesn’t include power measurements.
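
The LoadGen flow looks roughly like the outline below, written against the Python bindings (import name mlperf_loadgen) from the mlcommons/inference repository. The callback bodies are placeholders, and the exact binding signatures can differ between LoadGen versions, so treat this as a sketch of the structure rather than a drop-in harness.

```python
# Rough outline of a LoadGen single-stream run; binding details may vary by version.
import array
import mlperf_loadgen as lg

def issue_queries(query_samples):
    # Run the model on each sample and report completion back to LoadGen.
    for qs in query_samples:
        result = array.array("B", [0])          # placeholder inference result
        addr, _ = result.buffer_info()
        lg.QuerySamplesComplete([lg.QuerySampleResponse(qs.id, addr, len(result))])

def flush_queries():
    pass

def load_samples(sample_indices):
    pass                                        # would stage samples into memory

def unload_samples(sample_indices):
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.SingleStream
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(1024, 1024, load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```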

MLPerf Inference: Datacenter evaluates ML inference performance using core models such as:

  • ResNet50-v1.5 for image classification
  • BERT-Large for question-answering
  • DLRMv2 for recommendation systems
  • Llama 2 70B for large language models, as shown in Figure 2
  • Mixtral-8x7B for mixture-of-experts models
  • Stable Diffusion XL for text-to-image generation

Figure 2. NVIDIA H200 and TensorRT-LLM set new MLPerf records for Llama 2 70B benchmarks, demonstrating up to 45% faster inference than previous models. (Image: HPC Wire)

Deployment scenarios include server mode, which measures query processing under service-level agreement (SLA) latency constraints and reports queries per second (QPS), and offline mode, which evaluates bulk processing throughput in samples processed per second. MLPerf Inference: Datacenter benchmarks must achieve 99% of reference model accuracy for standard variants and 99.9% for high-accuracy variants while meeting specific latency constraints, such as 15 ms for ResNet50.
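
The distinction between the two reported numbers is easy to express in code. In the sketch below, which is illustrative rather than taken from the official results scripts, server-mode QPS only counts if a tail-latency percentile stays under the SLA, while offline throughput is simply samples divided by wall-clock time; the 99th-percentile and 15 ms values are example settings.

```python
# Illustrative post-processing of per-query latencies (in seconds) from a run.
def server_qps(latencies_s, wall_clock_s, sla_ms=15.0, percentile=99):
    """Report QPS only if the chosen tail-latency percentile meets the SLA."""
    ordered = sorted(latencies_s)
    idx = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    tail_ms = ordered[idx] * 1e3
    qps = len(latencies_s) / wall_clock_s
    return qps if tail_ms <= sla_ms else None   # None means the run is invalid

def offline_throughput(sample_count, wall_clock_s):
    return sample_count / wall_clock_s          # samples processed per second

latencies = [0.010, 0.012, 0.011, 0.014, 0.013] * 200
print(server_qps(latencies, wall_clock_s=12.0), offline_throughput(50_000, 120.0))
```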

From training models to HPC simulations: a standardized benchmarking approach

MLPerf Training evaluates performance across diverse systems and platforms. It spans a range of tasks, including:

  • Image classification with ResNet-50
  • Object detection and instance segmentation
  • Large language models, including GPT-3 and Llama 2 70B
  • Text-to-image generation, such as Stable Diffusion
  • Graph neural networks (GNNs)
  • BERT for natural language processing (NLP)
  • Recommendation systems

Key features include complete system tests that evaluate models, software, and hardware under real-world conditions. Optional power measurements offer insight into the energy efficiency of different system configurations. The benchmark supports distributed training scenarios, assessing how configurations affect training speed and efficiency. Models must achieve 99% of reference accuracy for standard benchmarks and 99.9% for high-accuracy variants.
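
The headline metric behind these features is time-to-train: the wall-clock time until a model first reaches its quality target. A stripped-down version of that measurement loop is sketched below; train_one_epoch() and evaluate() are dummy stand-ins for the reference implementation's actual training and evaluation code.

```python
# Sketch of the time-to-target-quality measurement at the heart of MLPerf Training.
import time

def train_one_epoch(model, data):
    model["quality"] += 0.05          # pretend each epoch improves the model

def evaluate(model, data):
    return model["quality"]           # e.g., top-1 accuracy or BLEU score

def time_to_train(model, data, quality_target, max_epochs=100):
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch(model, data)
        if evaluate(model, data) >= quality_target:
            return time.perf_counter() - start   # seconds to first hit the target
    return None                                  # target not reached: no valid score

print(time_to_train({"quality": 0.0}, data=None, quality_target=0.759))
```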

MLPerf Training: HPC extends these benchmarks to high-performance computing (HPC), supercomputers, and large-scale scientific computing systems. Optimized for coupled workloads that integrate training with simulations, it focuses on scientific applications such as CosmoFlow for cosmological parameter prediction, DeepCAM for climate and weather analysis, and Open Catalyst for atomic force prediction. OpenFold, a recent addition shown in Figure 3, addresses protein structure prediction alongside tasks like quantum molecular dynamics and large-scale scientific data analysis.

Figure 3. MLPerf HPC training performance scaling relative to an 8-GPU NVIDIA H100 OpenFold training baseline. (Image: NVIDIA)

MLPerf Training: HPC addresses key characteristics specific to large-scale scientific computing, including:

  • On-node vs. off-node communication
  • Big dataset handling and I/O bottlenecks
  • System reliability at scale
  • Message passing interface (MPI) and alternative communication backend performance
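
One way to see why these factors dominate at scale is to time a collective operation directly. The sketch below uses mpi4py, which is not part of MLPerf itself, to time a single allreduce; the payload size is arbitrary, and the measured time changes markedly depending on whether the participating ranks share a node or communicate over the interconnect.

```python
# Sketch: timing an MPI allreduce to compare communication backends or
# on-node vs. off-node placement (run with, e.g., mpirun -np 4 python timing.py).
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

payload = np.ones(1 << 20, dtype=np.float32)    # ~4 MB per rank
result = np.empty_like(payload)

comm.Barrier()                                  # align ranks before timing
start = time.perf_counter()
comm.Allreduce(payload, result, op=MPI.SUM)
elapsed = time.perf_counter() - start

if rank == 0:
    print(f"allreduce of {payload.nbytes / 1e6:.1f} MB took {elapsed * 1e3:.2f} ms")
```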

Evaluating storage performance in ML workloads

MLPerf Storage evaluates how efficiently storage systems supply training data during model training. By simulating GPUs, it stresses data pipelines to their limits without relying on physical hardware accelerators: accelerator compute time is emulated with sleep() calls, providing an open and transparent framework for assessing storage performance in ML training scenarios. Supported models include:

  • ResNet50 for image classification (approximately 114 KB per sample)
  • 3D UNet for medical imaging (approximately 146 MB per sample)
  • CosmoFlow for scientific computing (approximately 2.8 MB per sample)

The benchmark supports distributed training scenarios, requiring all clients to share a single data namespace. It leverages the Deep Learning I/O (DLIO) benchmark for synthetic data generation and loading, with scaling units defined to evaluate performance. Storage scaling units represent the smallest increment by which system throughput can be improved, while host nodes simulate additional load on the storage system. Each host node runs an identical number of simulated accelerators, ensuring consistent performance evaluation across a wide range of ML applications.
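
The simulated-accelerator approach is easy to picture: each "GPU" is replaced by a sleep() lasting roughly the compute time per batch, so any additional time per step is attributable to storage and data loading. The snippet below is a much-simplified stand-in for what DLIO does, with the per-batch compute time and the load_batch() routine as assumed placeholders.

```python
# Simplified stand-in for DLIO-style simulated accelerators: compute is a sleep(),
# so any time beyond COMPUTE_S per step is the data pipeline's (storage) share.
import time

COMPUTE_S = 0.050                 # assumed per-batch accelerator compute time

def load_batch(step):
    time.sleep(0.012)             # placeholder for reading samples from storage
    return step

def run(steps=100):
    start = time.perf_counter()
    for step in range(steps):
        load_batch(step)          # storage / preprocessing path under test
        time.sleep(COMPUTE_S)     # emulated accelerator "training" step
    total = time.perf_counter() - start
    ideal = steps * COMPUTE_S
    print(f"accelerator utilization: {100 * ideal / total:.1f}%")

run()
```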

AlgoPerf: accelerating training speed through innovation

The AlgoPerf: Training Algorithms benchmark evaluates training speed improvements through a single track governed by two distinct rulesets: external tuning and self-tuning. As shown in Figure 4, the external tuning ruleset enables workload-agnostic hyperparameter search spaces, while the self-tuning ruleset requires algorithms to adapt autonomously within a single optimization run.

Figure 4. The performance profile of the 11 benchmark submissions in the AlgoPerf external tuning track highlights advancements in neural network training optimization. (Image: Frank Schneider/AlgoPerf)

Key benchmark functions include parameter updates, optimizer state initialization, data selection, and batch size definition. Evaluations are conducted on a fixed system to ensure fair comparisons across frameworks such as JAX and PyTorch. Time-to-result metrics are measured for multiple workloads.
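
In practice, an AlgoPerf submission supplies a small set of functions along the following lines. The signatures below are simplified from the mlcommons/algorithmic-efficiency documentation and abbreviated for readability; they convey the structure rather than the exact API.

```python
# Simplified outline of the functions an AlgoPerf submission provides
# (argument lists abbreviated; see the mlcommons/algorithmic-efficiency repo).

def get_batch_size(workload_name):
    """Choose a per-workload batch size."""
    return 256

def init_optimizer_state(workload, model_params, hyperparameters, rng):
    """Build optimizer state, e.g., momentum buffers or Adam moments."""
    return {"step": 0}

def data_selection(workload, input_queue, optimizer_state, model_params, rng):
    """Pick the next batch (supports curricula, filtering, and similar ideas)."""
    return next(input_queue)

def update_params(workload, model_params, model_state, hyperparameters,
                  batch, optimizer_state, rng):
    """One optimization step; returns updated optimizer state and parameters."""
    optimizer_state["step"] += 1
    return optimizer_state, model_params, model_state
```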

Summary

MLCommons, an open engineering consortium with over 125 members, developed MLPerf in 2018 to standardize industry metrics for measuring ML performance. Today, MLPerf benchmarks cover data centers, the intelligent edge, and mobile devices, providing comprehensive training, inference, storage, and algorithmic performance metrics. These benchmarks help semiconductor companies optimize performance and power, enabling more cost-effective and efficient AI chip designs.

Related EE World content

What Are the Different Types of AI Accelerators?
Benchmarking AI From the Edge to the Cloud
What is TinyML?
What’s the Difference Between GPUs and TPUs for AI Processing?
How Are High-Speed Board-to-Board Connectors Used in ML and AI Systems?

