Three modalities that will shape multimodal AI

September 3, 2025 By Jennifer Skinner-Gray, Senior Director, Supplier Technology Enablement, Avnet

Embedded systems are inherently multimodal. Edge AI is transforming the way sensor data is processed and utilized to inform decisions and actions. The rapid shift toward multimodal AI in embedded systems will deliver exponential gains in functionality.

The embedded industry has always relied on real-world data when determining real-world actions. Motor control, the heart of industrial automation, has become more complex as we strive for efficiency and cost optimization. Brushless DC motors are more robust than brushed DC motors. Hall effect sensors and optical encoders provide rotor position, but sensorless motor control removes the need for those position sensors in exchange for a more complex algorithm. Field-oriented control takes this further and is another example of how complexity has shifted deeper into the system to deliver higher efficiency and a more resilient solution. Using artificial intelligence at the edge represents the next step in this evolution, presenting the opportunity to change the way we connect stimuli and actions.
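
Field-oriented control is a good illustration of where that complexity now lives. As a minimal sketch (not any vendor's implementation), the Clarke and Park transforms below convert three-phase stator currents into the rotating d-q frame that the control loops actually regulate; a sensorless scheme must additionally estimate the rotor angle used here instead of reading it from an encoder.

```c
#include <math.h>

/* Clarke transform: three-phase currents (ia, ib, ic) -> stationary alpha/beta frame. */
static void clarke(float ia, float ib, float *i_alpha, float *i_beta)
{
    *i_alpha = ia;
    *i_beta  = (ia + 2.0f * ib) / sqrtf(3.0f);   /* assumes ia + ib + ic = 0 */
}

/* Park transform: rotate alpha/beta into the d-q frame using the rotor angle theta.
 * In a sensorless FOC scheme, theta comes from an observer rather than an encoder. */
static void park(float i_alpha, float i_beta, float theta, float *id, float *iq)
{
    float s = sinf(theta), c = cosf(theta);
    *id =  i_alpha * c + i_beta * s;
    *iq = -i_alpha * s + i_beta * c;
}
```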

Future systems will make use of three fundamental modalities: vision, sound, and motion. They will use AI models to infer information about their environment using data from these sensors. By processing multiple data types in several models or, eventually, a single AI model, these systems will take actions that better reflect conditions and intent.

Yes, the control mechanisms will be more complex, but as with previous evolutionary steps, the gains will outweigh the rise in complexity. And the industry is responding to the increased complexity with tools that provide abstraction.

AI models deployed in embedded systems at the edge will use real-world data to infer optimal actions. With linear algorithms, the action must, by definition, comply with a predetermined condition, even before the data is analyzed. With AI, the action needed will be inferred using not just the immediate data, but all the expertise and knowledge distilled into the model.
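
The difference is easy to see in code. The fragment below is only a schematic contrast, with a hypothetical run_inference() standing in for whatever model a device actually deploys: the linear path can only compare a reading against a condition fixed at build time, while the AI path maps the same raw data to an action through the trained model.

```c
typedef enum { ACTION_NONE, ACTION_SLOW_DOWN, ACTION_STOP } action_t;

/* Linear approach: the condition (threshold) is fixed before any data is seen. */
action_t classic_control(float vibration_rms)
{
    return (vibration_rms > 2.5f) ? ACTION_STOP : ACTION_NONE;
}

/* Edge-AI approach: the action is inferred from raw samples by a trained model.
 * run_inference() is a placeholder for the deployed model's entry point. */
action_t ai_control(const float *samples, int n)
{
    extern action_t run_inference(const float *samples, int n);  /* hypothetical */
    return run_inference(samples, n);
}
```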

Multimodal AI brings yet another dimension, using multiple types of data to arrive at one action. The enabling factor is the combination of software and hardware. Deploying multiple AI models in embedded devices requires edge processors designed to run AI. We are now seeing these processors arrive on the market, supported by software from manufacturers and their partners. This emerging ecosystem is helping engineering teams exploit multimodal AI.

AI is disrupting embedded system design

Edge processing, from a deeply embedded viewpoint, is largely about closed systems used for monitoring and control. Embedded systems evolved to understand and interact with the world around us. The path from mechanization to automation, autonomy, and now agentic operation charts the evolution of the complex electromechanical systems we rely on every day.

These systems are inherently closed because they are designed to process a finite amount of data, presented in a predictable format. Everything that falls outside these presets is ignored. There are good reasons for this approach, but they generally relate to the limited resources available to capture, pre-process, analyze, and react to data.

By contrast, embedded systems that use AI can be considered open, because they are able to process data that doesn’t always conform to presets, at least not in the same way closed systems do. Sensor fusion represents a step in this direction by combining data from multiple sensors, albeit still handled through sequential processing. Edge AI will elevate sensor fusion by closing the loop around known effects and inferred causes.
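
A complementary filter is a familiar, purely sequential example of the sensor fusion described above (a generic sketch, not tied to any particular part): gyro and accelerometer data are blended with a fixed weighting, which is exactly the kind of hand-tuned preset that an inference model can refine or replace.

```c
/* Complementary filter: fuse gyro rate and accelerometer tilt into one pitch estimate.
 * ALPHA is a hand-tuned preset; an edge AI model could instead learn how to weight
 * (or outright replace) these inputs from data. */
#define ALPHA 0.98f

float fuse_pitch(float pitch_prev, float gyro_rate_dps, float accel_pitch_deg, float dt_s)
{
    float gyro_pitch = pitch_prev + gyro_rate_dps * dt_s;          /* integrate gyro */
    return ALPHA * gyro_pitch + (1.0f - ALPHA) * accel_pitch_deg;  /* blend with accel */
}
```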

Consider a vision system as another example. Machine vision utilizes cameras to inspect objects, including fresh fruit and manufactured components. Those systems work best when the objects are presented in specific ways, including orientation, lighting, speed, and direction of travel. Objects presented to the image sensor that fall outside those parameters will likely need a human operator to inspect them. AI-based vision systems are more adaptable to the natural variations inherent in object inspection. This, in turn, means that objects can be identified and inspected more quickly with greater flexibility.

Multimodal defines embedded systems

Many functions in embedded systems are single-purpose by nature, using only one type of sensor data as input and providing one action as output. However, embedded systems are rarely built from a single function. At a high level, each function may be compartmentalized, even if all of them run on the same microcontroller. Systemically, though, those functions are inherently interdependent.

This illustrates that, functionally, many embedded systems use multiple types of data and are implicitly multimodal. The change that AI brings is how those modalities are processed. Initially, OEMs may introduce a single instance of AI or machine learning to handle one type of data. This may be time-series data from a basic sensor, or object detection using an image sensor.

More AI models may be added to handle other types of data. Systems that run multiple but separate inference models, one for each data type, could be considered weak multimodal. This is because the link between those modalities, from an AI perspective, is limited.

With the introduction of strong multimodal AI, a single model will process multiple types of data. Information will flow freely at the processing stage and be shared between functions. We are already seeing this transformation at the enterprise level.

While strong multimodal AI at the edge doesn’t fundamentally change the logic that provides control, those control algorithms will use inputs generated by AI, inferred from multiple sources of data.
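
The structural difference between the two approaches can be sketched as follows, using hypothetical model entry points (infer_vision(), infer_audio(), infer_multimodal()) purely for illustration: in the weak form each modality is inferred separately and linked by hand-written glue, while in the strong form one model consumes every modality and the control logic receives a single, jointly inferred result.

```c
typedef struct { float image[96*96]; float audio[16000]; float imu[3*128]; } frame_t;
typedef struct { int event_id; float confidence; } result_t;

/* Weak multimodal: one model per data type, linked only by hand-written glue logic. */
result_t weak_multimodal(const frame_t *f)
{
    extern result_t infer_vision(const float *img);   /* hypothetical entry points */
    extern result_t infer_audio(const float *pcm);
    result_t v = infer_vision(f->image);
    result_t a = infer_audio(f->audio);
    /* Glue logic: the cross-modal link lives in application code, not in the models. */
    return (v.confidence > a.confidence) ? v : a;
}

/* Strong multimodal: a single model sees all modalities and infers one result. */
result_t strong_multimodal(const frame_t *f)
{
    extern result_t infer_multimodal(const frame_t *frame);  /* hypothetical */
    return infer_multimodal(f);
}
```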

https://www.microcontrollertips.com/wp-content/uploads/2025/09/Multimodal_AI.mp4

What can multimodal Edge AI deliver?

The adoption of Edge AI has been disruptive, but the rapid shift we’re seeing toward using multimodal AI is based on the same reasons we already use multiple modalities in simple embedded systems. Actuation based on stimuli defines the functionality of a system. Using AI extends the scope of stimuli, increasing functionality without changing the purpose.

Using AI models to process sensor data is displacing sequential data processing. We must also acknowledge the impact that extending the scope of the stimuli has on the overall functionality of embedded systems. With AI inferencing now in the data flow, these systems can make decisions based on conditions that would be difficult or even impossible to parse using conventional data processing techniques.

For example, image sensors generate a large amount of data. A high-resolution camera used in machine vision can easily produce tens of gigabytes per second. This creates processing challenges, even in controlled environments such as anomaly detection during inspection. If the goal is to use object detection as a conditional trigger in uncontrolled settings, the system must recognize objects with a wide range of variations across various conditions. This is where trained AI models can achieve significantly better results than sequential programming.
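
As a rough back-of-the-envelope check (the sensor parameters here are illustrative, not taken from the article), the raw data rate of an image sensor is simply resolution × bit depth × frame rate; even a mid-range machine-vision configuration lands in the gigabytes-per-second range before any processing happens.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative machine-vision sensor: ~12 MP, 12-bit raw, 120 frames/s. */
    const double width = 4096, height = 3000, bits_per_px = 12, fps = 120;

    double bytes_per_frame = width * height * bits_per_px / 8.0;
    double bytes_per_sec   = bytes_per_frame * fps;

    printf("Raw data rate: %.2f GB/s\n", bytes_per_sec / 1e9);   /* ~2.2 GB/s */
    return 0;
}
```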

Examples of Edge AI models that Avnet has demonstrated on embedded vision systems, and which are available on GitHub for evaluation, include the following (a post-processing sketch follows the list):

  • EfficientNet – for image classification and general scene classification
  • YOLO – for object and people detection (e.g. bounding boxes)
  • YOLO – for instance segmentation (e.g. identifying specific items)
  • DeepLab – for semantic segmentation (e.g. crowd density and location in a scene)
  • MoveNet – for pose estimation
  • Palm Detection / Hand Landmark – for hand gesture recognition
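
As a concrete illustration of how a detector’s output becomes a conditional trigger (the structure, class index, and threshold below are generic placeholders, not the output layout of any specific YOLO export), the application simply scans the returned boxes and fires an action when a class of interest exceeds a confidence threshold.

```c
typedef struct { int class_id; float score; float x, y, w, h; } detection_t;

#define CLASS_PERSON      0      /* illustrative class index */
#define TRIGGER_THRESHOLD 0.6f   /* illustrative confidence threshold */

/* Turn raw detections into a conditional trigger: act only when a person
 * is detected with sufficient confidence anywhere in the frame. */
int person_detected(const detection_t *dets, int count)
{
    for (int i = 0; i < count; i++) {
        if (dets[i].class_id == CLASS_PERSON && dets[i].score > TRIGGER_THRESHOLD)
            return 1;
    }
    return 0;
}
```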

Beyond objects and people, AI-enabled image sensors are perfect for gesture detection. Gesturing is uniquely personal. No two people will perform the same gesture in exactly the same way, and even an individual may not be totally consistent in the way they perform the same gesture. AI can absorb these variations and focus on the intent.

https://www.microcontrollertips.com/wp-content/uploads/2025/09/RPReplay_Final1756836139.mp4

Generative AI at the Edge

The use of large language models has already moved to the edge. These models, sometimes pruned and rebranded as small language models (SLMs), enable generative AI to run on hardware platforms with limited compute resources. Using a voice-to-text model in combination with an LLM or SLM allows people to interact with technology in a natural way.

The modality used here is audio, and it extends beyond voice. Event detection based on sound is an active area of development. Using AI to infer those events has demonstrable benefits at the edge. Avnet is working with its supplier partners to demonstrate event detection using AI on deeply embedded platforms. Demonstrations using models available now can detect a baby’s cry, sirens, or keywords, to name a few.
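
A typical deployment frames the microphone stream into overlapping windows and hands each window to the classifier; the sketch below assumes a hypothetical classify_window() standing in for whichever sound-event model is deployed.

```c
#define SAMPLE_RATE    16000
#define WINDOW_SAMPLES (SAMPLE_RATE)        /* 1 s analysis window */
#define HOP_SAMPLES    (SAMPLE_RATE / 2)    /* 50% overlap */

/* Slide a window over buffered PCM and report the first detected sound event.
 * classify_window() is a placeholder for the deployed event-detection model. */
int detect_event(const short *pcm, int n_samples)
{
    extern int classify_window(const short *window, int len);  /* hypothetical */

    for (int start = 0; start + WINDOW_SAMPLES <= n_samples; start += HOP_SAMPLES) {
        int event = classify_window(pcm + start, WINDOW_SAMPLES);
        if (event >= 0)              /* e.g. 0 = baby cry, 1 = siren, 2 = keyword */
            return event;
    }
    return -1;                       /* no event detected */
}
```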

There are also AI demos that use inertial measurement units (IMUs) to detect and measure motion, such as vibration and fall detection, and radar-based gesture recognition demos that enable touchless HMI scenarios.
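
Even without a model, the shape of the IMU problem is easy to see. The classical heuristic below (a generic sketch with illustrative thresholds) flags a fall as a near-free-fall interval followed shortly by an impact spike; it is exactly these hand-picked thresholds that a trained model generalizes beyond.

```c
#include <math.h>

#define FREE_FALL_G 0.4f   /* illustrative: |a| well below 1 g suggests free fall */
#define IMPACT_G    2.5f   /* illustrative: |a| spike suggests impact */

/* Classical fall heuristic: free-fall interval followed shortly by an impact.
 * ax/ay/az are acceleration samples in g; returns 1 if a fall-like pattern is seen. */
int detect_fall(const float *ax, const float *ay, const float *az, int n)
{
    int free_fall_at = -1;
    for (int i = 0; i < n; i++) {
        float mag = sqrtf(ax[i]*ax[i] + ay[i]*ay[i] + az[i]*az[i]);
        if (mag < FREE_FALL_G)
            free_fall_at = i;
        else if (mag > IMPACT_G && free_fall_at >= 0 && (i - free_fall_at) < 50)
            return 1;   /* impact within ~50 samples of free fall */
    }
    return 0;
}
```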

These models, developed by suppliers and ready for deployment by OEMs, can be combined to realize multimodal systems on embedded platforms, or as the foundation for training on a manufacturer’s own data. Once deployed, the platforms can be maintained and updated over the air using a cloud-based platform, such as /IOTCONNECT.

Embedded systems have always been multimodal. The data provided by sensors has been used systemically, but the parameters around that data have always been necessarily restrictive. Edge AI will expand those parameters. Inferencing at the edge will work across modalities, with image, audio, and motion sensor data working together as a single system.
