Microcontroller Tips

Microcontroller engineering resources, new microcontroller products and electronics engineering news


What determines the size of the dataset needed to train an AI?

March 26, 2025 By Jeff Shepard Leave a Comment

Training artificial intelligence (AI) algorithms requires large datasets, and large datasets can be expensive to acquire. So, how much data is enough? That's primarily determined by the complexity of the problem, the complexity of the model, the quality of the data, and the required level of accuracy.

Data augmentation techniques can be used to increase the size of a dataset, and learning curve analysis can be used to determine when training results have been optimized.

Problem complexity is a major factor in the size of the required dataset. Image recognition, for example, is more complex than simple image classification and requires a larger training dataset. In addition, problems with more features need more training examples to learn all the possible relationships between those features.

Model complexity is also important, and a deep learning model with more parameters can require a very large dataset for effective learning. A common rule of thumb is the “rule of 10,” which states that effective training requires 10x more data points than the number of parameters in the model.
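The rule of 10 is simple enough to express directly. The sketch below is illustrative only; the function name and the example parameter count are assumptions, not from the article:

```python
def rule_of_ten(num_parameters: int, factor: int = 10) -> int:
    """Estimate a starting dataset size from the 'rule of 10':
    roughly 10x more training examples than model parameters.
    The factor is a rule of thumb, not a guarantee."""
    return num_parameters * factor

# A hypothetical small model with 5,000 trainable parameters would
# suggest roughly 50,000 training examples as a starting point.
print(rule_of_ten(5_000))  # 50000
```

The multiplier is adjustable because, as noted above, problem complexity, data quality, and the required accuracy all shift the real requirement.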

Data quality and augmentation

Data with minimal noise or inconsistencies is “high quality” training data. It can be difficult to obtain large quantities of high-quality data, but smaller datasets can be augmented to increase the size of the dataset artificially.

Augmentation can be used with all types of data, and even seemingly small changes can be effective. For example, effective forms of augmentation for a dataset of images can include cropping, reflection, rotation, scaling, translation, or adding Gaussian noise (Figure 1).

Figure 1. An example of an original image (left) and four additional images derived using data augmentation techniques. (Image: Nexocode)
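A few of the image transforms listed above can be sketched with plain NumPy. This is a minimal illustration, not the article's pipeline; the function name, image size, and noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Derive new samples from one image using some of the simple
    transforms mentioned above: reflection, rotation, Gaussian noise."""
    reflected = np.fliplr(image)                         # horizontal reflection
    rotated = np.rot90(image)                            # 90-degree rotation
    noisy = image + rng.normal(0.0, 0.05, image.shape)   # additive Gaussian noise
    return [reflected, rotated, noisy]

# One 32x32 grayscale image becomes four training samples in total.
original = rng.random((32, 32))
samples = [original] + augment(original)
print(len(samples))  # 4
```

In practice, frameworks apply such transforms randomly at training time so each epoch sees slightly different variants of every image.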

Underfitting and overfitting

Bias and variance metrics can be used to determine the quality of an AI/ML model. Bias is the prediction error associated with a model that's too simple (also called underfitting), while high variance indicates that the model is too complex (overfitting) and fits the “noise” in the dataset in addition to the underlying pattern.

The ideal model has low bias and low variance. The two metrics can be considered independently, as shown in Figure 2. In practice, however, AI/ML models tend to trade one against the other: reducing bias tends to increase variance, and vice versa. That's called the “bias-variance tradeoff” and is an important consideration in learning curve analysis when determining the success of model training.

Figure 2. AI/ML models aim to produce the ideal combination of bias and variance (upper left target). (Image: Analytics Vidhya)
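The tradeoff described above can be estimated numerically by refitting a model on many noisy resamples of a known function and decomposing its error. The sketch below is illustrative; the polynomial models, noise level, and function names are assumptions, not from the article:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
true_y = np.sin(2 * np.pi * x)  # the underlying "signal" we want to learn

def bias_variance(degree: int, trials: int = 200) -> tuple[float, float]:
    """Estimate squared bias and variance of a polynomial model by
    refitting it on many noisy resamples of the same function."""
    preds = np.empty((trials, x.size))
    for t in range(trials):
        noisy_y = true_y + rng.normal(0.0, 0.3, x.size)
        coeffs = np.polyfit(x, noisy_y, degree)
        preds[t] = np.polyval(coeffs, x)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_y) ** 2)  # underfitting error
    variance = np.mean(preds.var(axis=0))         # sensitivity to noise
    return bias_sq, variance

# A degree-1 (too simple) fit shows high bias; a degree-9 fit
# shows low bias but higher variance.
for d in (1, 4, 9):
    b, v = bias_variance(d)
    print(f"degree={d}  bias^2={b:.4f}  variance={v:.4f}")
```

Raising the polynomial degree plays the role of model complexity here: bias falls while variance climbs, which is the tradeoff Figure 2 depicts.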

Epochs and learning curve analysis

An epoch represents a complete cycle of training an AI/ML model using a given dataset. Epochs are also used in learning curve analysis to determine the optimal number of training cycles.

Learning curve analysis is important because the required number of epochs can reach the thousands. More epochs do not always “refine” the results, however, since training for too many epochs leads to overfitting.

The learning curve plots training progress (usually measured in epochs) on the x-axis and the model's accuracy (or another performance metric) on the y-axis. Learning curve analysis compares results from the training with a set of validation data. The validation data can be an independent dataset or a subset of the training dataset not used for training.
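The training/validation comparison described above is the basis of early stopping: track both errors per epoch and stop once the validation error stops improving. The sketch below uses a simple synthetic linear-regression task; the data, hyperparameters, and function names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression task; the validation set is a held-out subset
# of the same dataset, as described above.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(0.0, 0.5, 200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def train_with_learning_curve(epochs=500, lr=0.01, patience=20):
    """Gradient-descent linear regression that records the learning curve
    (epoch, training MSE, validation MSE) and stops early once validation
    error has not improved for `patience` epochs."""
    w = np.zeros(5)
    history, best_val, best_epoch = [], np.inf, 0
    for epoch in range(epochs):
        grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w -= lr * grad
        train_err = np.mean((X_tr @ w - y_tr) ** 2)
        val_err = np.mean((X_val @ w - y_val) ** 2)
        history.append((epoch, train_err, val_err))
        if val_err < best_val:
            best_val, best_epoch = val_err, epoch
        elif epoch - best_epoch >= patience:
            break  # validation error stopped improving: likely overfitting
    return w, history

w, history = train_with_learning_curve()
print(f"trained for {len(history)} epochs; final val MSE = {history[-1][2]:.3f}")
```

Plotting `history` gives exactly the learning curve described above; the gap between the training and validation curves is what reveals overfitting.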

Analysis limitations

Not all models have the same relationship between bias and variance. That can make it challenging to identify an optimal model.

Generally, an optimal model can be identified when the combined bias and variance reach a global minimum, as in Figure 3a. For some models, variance may increase more slowly than bias decreases (Figure 3b), making the optimal model harder to identify. In those cases, a new or refined model may provide improved results.

Figure 3. The relationship between bias and variance can’t always be relied on to identify the optimal model. (Image: Analytica Chimica Acta)

Summary

The “rule of 10” can provide a starting point for determining the amount of data needed for AI/ML training. Data availability can be expanded at a low cost using augmentation techniques. Training results can be analyzed using learning curves, but finding the optimal model is not always simple and can require adjustment or replacement of the model.

References

A new strategy to prevent over-fitting in partial least squares models based on model population analysis, Analytica Chimica Acta
Bias and Variance in Machine Learning, Analytics Vidhya
Evaluating data: How much training data do you need for machine learning?, Kili Technology
Finding the Best Training Data for Your AI Model, Keylabs
How does the size of the training data affect the accuracy?, Deepchecks
How to Choose the Right AI Training Data, Ailient
How Much Data Does AI Need? What to Do When You Have Limited Datasets?, Nexocode
How Much Data Is Enough? A Deep Dive into Machine Learning Needs, Shaip
How Much Data Is Required To Train ML Models?, Akkio
Train and use your own models, Google Cloud
