Generative AI has come a long way in just a few short years, progressing from basic text responses to complex prose. The boundaries of this technology are being pushed even further with the development of multimodal AI — a form of artificial intelligence that works with more than just text, ingesting, processing and generating multiple kinds of data at once.
Multimodal AI Definition
Multimodal AI refers to an artificial intelligence system that uses multiple types of data (including text, images, video and audio) to generate content, form insights and make predictions.
Multimodal AI is finding its way into several industries, ranging from healthcare to robotics. Since Google released Gemini and signaled the era of multimodal AI, other tech giants like OpenAI, Anthropic and Meta have come out with their own multimodal models.
What Is Multimodal AI?
Multimodal AI refers to an artificial intelligence system that leverages various types (or modalities) of data simultaneously to form insights, make predictions and generate content. Besides Google Gemini, other well-known examples of multimodal AI include OpenAI’s DALL-E and GPT-4o, Meta’s ImageBind and Anthropic’s Claude 3 model family.
Multimodal models handle information like text, images, video, speech and more to complete a range of tasks, from generating a recipe based on a photo of food to transcribing an audio clip into multiple languages. This is different from most AI models, which can only handle a single mode of data. Large language models (LLMs) work with text data, for example, while convolutional neural networks (CNNs) work with images.
Multimodality mimics an innately human approach to understanding the world, where we combine sensory inputs like sight, sound and touch to form a more nuanced perception of our reality. By integrating multiple data types in a single model, multimodal AI systems achieve a more comprehensive understanding of their environment.
“It’s really an attempt to replicate how humans perceive,” said Aaron Myers, chief technology officer at AI-powered recruiting platform Suited. “We have five different senses, all of it giving us different data that we can use to make decisions or take actions. Multimodal models are attempting to do the same thing.”
Multimodal vs. Unimodal
Unimodal AI models can only take in one type of data (such as text) and deliver output in that same modality (again, just text).
Multimodal AI models, by contrast, can handle multiple types of data (such as text, images, video and audio), processing several modalities at once rather than just one.
How Does Multimodal AI Work?
The process of building a multimodal AI model involves three main components — the input module, fusion module and output module — laid out in the following steps.
1. Training Transformers and Neural Networks on Large Data Sets
Multimodal models are often built on transformer architectures, a type of neural network that calculates the relationship between data points to understand and generate sequences of data. They process “tons and tons” of text data, remove some of the words, and then predict what the missing words are based on the context of the surrounding words, said Ryan Gross, head of data and applications at cloud services company Caylent. They do the same thing with images, audio and whatever other kinds of data the model is designed to understand.
This layer of neural networks is known as the input module. Each neural network is actually unimodal and dedicated to a specific type of data. However, the input module contains many of these neural networks, processing written text, videos, images, audio and other inputs. This stage is what powers cross-modal interactions and gives an AI model the capacity to understand and work with various types of data.
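To make the idea concrete, here is a minimal sketch of such an input module in PyTorch, where each modality has its own small encoder projecting its data into a shared embedding space. The layer types and dimensions are made up for illustration and do not describe any particular production model.

```python
# Minimal sketch of an "input module": one encoder per modality,
# each projecting its data into a shared embedding space.
# Hypothetical layers and dimensions, chosen for illustration only.
import torch
import torch.nn as nn

class InputModule(nn.Module):
    def __init__(self, vocab_size=1000, image_features=2048, shared_dim=256):
        super().__init__()
        # Unimodal text encoder: token IDs -> embeddings
        self.text_encoder = nn.Embedding(vocab_size, shared_dim)
        # Unimodal image encoder: precomputed image features -> embeddings
        self.image_encoder = nn.Linear(image_features, shared_dim)

    def forward(self, token_ids, image_feats):
        text_emb = self.text_encoder(token_ids)      # (batch, seq_len, shared_dim)
        image_emb = self.image_encoder(image_feats)  # (batch, shared_dim)
        return text_emb, image_emb

module = InputModule()
token_ids = torch.randint(0, 1000, (2, 8))  # a toy batch of tokenized text
image_feats = torch.randn(2, 2048)          # toy image feature vectors
text_emb, image_emb = module(token_ids, image_feats)
print(text_emb.shape, image_emb.shape)      # torch.Size([2, 8, 256]) torch.Size([2, 256])
```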
2. Converting Raw Data Into Numerical Values
Processing data is accomplished through embedding, where raw data is encoded into numerical formats (vectors) that the system can more easily understand and work with. For example, text data is broken down into individual tokens (words, letters, etc.), which are turned into numbers. Audio data is segmented and broken down into features like pitch and amplitude, which are also turned into numbers. All of these numbers are then fed into the transformer, which captures the relationships and context both within and across the different modalities.
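As a toy illustration of that conversion (not a real tokenizer or audio pipeline), the snippet below maps words to integer IDs and summarizes an audio waveform as one amplitude value per frame.

```python
# Toy illustration of turning raw data into numbers: words map to integer
# IDs, and an audio waveform is split into frames summarized by a simple
# feature (mean amplitude). Real systems use learned tokenizers and richer
# audio features; this only shows the principle.
import numpy as np

text = "a duck swims in the pond"
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
token_ids = [vocab[word] for word in text.split()]
print(token_ids)  # [0, 1, 4, 2, 5, 3]

# Fake one-second waveform sampled at 16 kHz
waveform = np.random.randn(16000)
frame_size = 400
frames = waveform[: len(waveform) // frame_size * frame_size].reshape(-1, frame_size)
amplitude = np.abs(frames).mean(axis=1)  # one number per frame
print(amplitude.shape)                   # (40,)
```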
3. Embedding Data Through Early Fusion or Late Fusion
The module responsible for processing the data is the fusion module. This module combines all the different data types and processes them as a single data set. To achieve this, a fusion module can use either early fusion or late fusion.
In rare cases where the model is “natively multimodal” — built specifically to handle multiple data types — embedding happens all at once through a process called early fusion, which combines, aligns and processes the raw data from each modality so that they all have the same (or similar) mathematical representation. The model not only learns the word “duck,” for example, but also what a duck looks like and sounds like. In theory, this enables the model to recognize not just a photo of a duck, the quack of a duck or the letters “D-U-C-K,” but the broader “concept” of what a duck is as well, said Juan Jose Lopez Murphy, head of data science and AI at IT services company Globant.
This approach isn’t easy, so many multimodal systems that exist today merge information from multiple modalities at a later stage through a process called late fusion — after each type of data has been analyzed and encoded separately. Late fusion offers a way to combine and compare different types of data, which vary in appearance, size and meaning in their respective forms, Myers said.
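The sketch below contrasts the two approaches in simplified PyTorch. In early fusion, the modality embeddings are joined into one sequence that a single transformer attends over; in late fusion, each modality is summarized separately and the summaries are combined afterward for a downstream prediction. The layer sizes and the simple concatenation and averaging choices are illustrative assumptions, not a description of any specific model.

```python
# Rough sketch contrasting early and late fusion on already-embedded data.
import torch
import torch.nn as nn

shared_dim = 256
text_emb = torch.randn(2, 8, shared_dim)   # toy batch of embedded text tokens
image_emb = torch.randn(2, 4, shared_dim)  # toy batch of embedded image patches

# Early fusion: align both modalities in one sequence and let a single
# transformer attend across them jointly.
joint_sequence = torch.cat([text_emb, image_emb], dim=1)  # (2, 12, 256)
encoder_layer = nn.TransformerEncoderLayer(d_model=shared_dim, nhead=8, batch_first=True)
joint_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
fused_early = joint_encoder(joint_sequence)

# Late fusion: summarize each modality separately, then combine the
# summaries for a downstream prediction.
text_summary = text_emb.mean(dim=1)                            # (2, 256)
image_summary = image_emb.mean(dim=1)                          # (2, 256)
fused_late = torch.cat([text_summary, image_summary], dim=-1)  # (2, 512)
classifier = nn.Linear(2 * shared_dim, 10)
prediction = classifier(fused_late)
print(fused_early.shape, prediction.shape)
```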
4. Fine-Tuning Models to Improve Their Results
The final module of the AI model is the output module. This module delivers the results, which include decisions, predictions and other outputs. The model is then fine-tuned using techniques like reinforcement learning from human feedback (RLHF) and red teaming in an effort to reduce hallucinations, bias, security risks and other harmful responses. Once that is done, the model should behave similarly to an LLM, but with the capacity to handle other types of data beyond just text.
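The snippet below is a highly simplified sketch of the idea behind RLHF-style refinement: a learned reward model scores candidate outputs so that training can favor higher-scoring ones. Real pipelines rely on human preference data and optimization algorithms such as PPO; the toy reward model and random embeddings here are assumed purely for illustration.

```python
# Highly simplified sketch of the idea behind RLHF-style fine-tuning:
# a learned reward model scores candidate outputs, and training nudges
# the generator toward higher-scoring ones. The reward model and the
# candidate embeddings below are toy stand-ins.
import torch
import torch.nn as nn

reward_model = nn.Linear(256, 1)          # toy reward model over output embeddings
candidate_outputs = torch.randn(3, 256)   # embeddings of 3 candidate responses
scores = reward_model(candidate_outputs).squeeze(-1)
best = torch.argmax(scores)
print(f"Preferred candidate: {best.item()}, scores: {scores.tolist()}")
```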
How Is Multimodal AI Used?
These are some areas where multimodal AI is being applied today.
Chatbots
AI chatbots equipped with multimodality can respond to users more effectively than their text-only counterparts, offering richer and more helpful answers. For example, a user can put in a picture of their dying houseplant and get advice on how to bring it back to life, or get a detailed explanation of a video they linked to.
AI Assistants
AI assistants like Amazon’s Alexa and Google Assistant are powered by multimodal AI. These assistants can be controlled with simple voice commands, allowing users to pull up specific images and videos, get updates on current events, instructions and general information (in both audio and text formats) and even adjust the lighting and temperature in their homes.
Healthcare
The medical field requires the interpretation of several forms of data, including medical images, clinical notes, electronic health records and lab tests. Unimodal AI models perform specific healthcare tasks within specific modalities, such as analyzing X-rays or identifying genetic variations. And LLMs are often used to help answer health-related questions in simple terms. Now, researchers are starting to bring multimodal AI into the fold, developing new tools that combine data from all these disparate sources to help make medical diagnoses.
Self-Driving Cars
Self-driving cars process and interpret data from multiple sources, thanks to multimodal AI. Cameras provide visual information about the vehicle’s environment, radar detects objects and their speed, LiDAR measures precise distances to them, and GPS provides location and navigation data. By putting all of this data together and analyzing it, AI models can understand the car’s surroundings in real time and react accordingly — they can spot obstacles, predict where other vehicles or pedestrians are going to be and decide when to steer, brake or accelerate.
Robotics
Robots equipped with multimodal AI integrate data from cameras, microphones and depth sensors, enabling them to perceive their environment more accurately and respond accordingly. For example, they can use cameras to see and recognize objects, or microphones to understand spoken commands. They can even be fitted with sensors that give them a semblance of touch, smell and taste, giving them the full five senses that humans have, said Brendan Englot, an associate professor in the mechanical engineering department of the Stevens Institute of Technology. Whether it’s a humanoid robot or a cobot on an assembly line, multimodal AI allows robots of all kinds to navigate effectively in diverse environments.
Benefits of Multimodal AI
Better Context Understanding
As they learn, multimodal models integrate and analyze a broad range of data types simultaneously, which gives them a more well-rounded contextual understanding of a given subject than each individual data type might be able to convey on its own.
For example, if a multimodal model is prompted to generate a video of a lion, it wouldn’t just see the word “lion” as a sequence of letters — it would know what a lion looks like, how a lion moves and what a lion’s roar sounds like.
More Accurate Results
Because multimodal models are designed to recognize patterns and connections between different types of data, they tend to understand and interpret information more accurately.
“I can be more accurate in my predictions by not only analyzing text, but also analyzing images to sort of fortify results. Or maybe answer questions I couldn’t answer before that are better answered by images rather than text,” Myers explained.
Even so, multimodal AI is still capable of getting things wrong, and may produce biased or otherwise harmful results.
Capable of a Wider Range of Tasks
Multimodal AI systems can handle a wider range of tasks than unimodal ones. Depending on the specific model, they can convert text prompts into AI-generated images, explain what’s going on in a video in plain language, generate an audio clip based on a photo and much more. Meanwhile, unimodal systems are only capable of working with one of these data types.
Better Understanding of User Intent
Multimodality allows users to choose how they want to interact with an AI system, instead of being stuck in one mode of communication.
“It doesn’t matter if you’re expressing [yourself] in motions, in words, if you’re typing something, writing something, making gestures, pointing at things,” Murphy said. Multimodal AI systems give users “much more control of what they want to express, which means that you’re capturing their true intent.”
More Intuitive User Experience
Because multimodal systems allow users to express themselves in several different ways, depending on what feels natural to them, their user experience “feels much more intuitive,” Myers said. For example, instead of having to describe what their car engine sounds like to get advice on what’s wrong with it, a user can just upload an audio clip. Or rather than listing out all the food in their kitchen for recipe suggestions, they can upload photos of their fridge and pantry.
Challenges of Multimodal AI
Requires More Data
Since they work with multiple modalities, multimodal models require a lot of data to function properly. For example, if a model aims to convert text to images and vice versa, then it needs to have a robust set of both text and image data.
The amount of data required also scales with the number of parameters (variables) in the model, Myers said. “As the number of parameters increases — which it does as you add modalities — the more data you need.”
Limited Data Availability
Not all data types are easily available, especially less conventional data types, such as temperature or hand movements. The internet — an important source of training data for many AI models — is largely made up of text, image and video data. So if you want to make a system that can process any other kind of data, you’ll have to either purchase it from private repositories or make it yourself.
Data Can Be Difficult to Align
Properly aligning multiple different data types is often difficult. Data comes in varying sizes, scales and structures, requiring careful processing and integration to ensure they work together effectively in a single AI system.
Computationally Intensive and Expensive
Multimodality is, in large part, only possible thanks to the unprecedented computing resources available today. These models need to be able to process petabytes of diverse data types simultaneously, demanding substantial computational power that often leads to significant carbon and water usage. Plus, deploying multimodal AI in applications requires a robust hardware infrastructure, further adding to its computational demands and environmental footprint.
It’s expensive too. Unimodal models are expensive on their own — GPT-3 is rumored to have cost OpenAI more than $4 million, and Meta is estimated to have spent $20 million on Llama 2. Multimodal models are “several orders of magnitude” more expensive than those, Gross said.
May Worsen Existing Generative AI Issues
Many of the issues with regular generative AI models — namely bias, privacy concerns and hallucinations — are also prevalent in multimodal models. Multimodal AI may actually exacerbate these issues.
Bias is almost inevitable in data sets, so combining data from various sources could lead to more pronounced and widespread biased outcomes. And processing diverse data types can involve sensitive information, raising the stakes for data privacy and security. Plus, the complexity of integrating multiple kinds of data may increase the risk of generating inaccurate or misleading information.
“When you expand to multimodal models, you now expand the number of tasks that can be done,” Myers said. “And there’s going to be new problems that could be specific to those cases.”
These issues pose even greater risks in robotics applications, as their actions have direct consequences in the physical world.
“Your robot — whether that’s a drone or a car or humanoid — will take some kind of action in the physical world that will have physical consequences,” Englot said. “If you don’t have any guardrails on a model that’s controlling a robot, it’s possible hallucinations or incorrect interpretations of the data could lead to the robot taking actions that could be dangerous or harmful.”
The Future of Multimodal AI
Multimodal AI has already impacted the AI landscape and will continue to expand the boundaries of artificial intelligence in several ways.
Multimodal AI and Generative AI
Many generative AI tools can only process one type of data and deliver outputs in that same modality. With the introduction of multimodal AI, generative AI tools can now process various data types and deliver a range of outputs that don’t have to match the input.
Like multimodal AI models, multimodal generative AI models use a system of neural networks and transformer architectures to process and understand text, audio, videos, images and other inputs. By training on diverse data types, these generative AI systems develop the ability to process and produce different kinds of data.
While major tech players like Meta, Google and OpenAI are still experimenting with this technology, it’s only a matter of time before it becomes mainstream as the technology improves.
Multimodal AI and Unified Models
Unified models have emerged as a promising option for making multimodal AI more seamless. Instead of relying on many neural networks to process different data types, unified models consist of a single neural network architecture. This architecture processes data as abstractions, allowing it to adapt to different kinds of data and handle multimodal tasks. Although unified models require extensive training on massive volumes of data, they don’t need as much fine-tuning as other multimodal AI models.
Google’s Gemini is an example of a unified model running on a single architecture, and this list is likely to grow as more companies build their own unified models.
Multimodal AI and Artificial General Intelligence
Eventually, many experts believe, multimodality could be the key to achieving artificial general intelligence (AGI) — a theoretical form of AI that understands, learns and performs any intellectual task as well as a human can. By combining various kinds of data, multimodal models could develop a more holistic and comprehensive understanding of the world around them, which could, in turn, enable them to apply knowledge across a wide range of tasks as well as (or even better than) a human being.
“In the quest for an artificial intelligence that looks a little bit more like human intelligence, it has to be multimodal,” Englot said. “It has to process as many input modalities as a human could — vision, language, touch, physical action — and be able to respond to all those things with the same intelligence that a human can.”
Frequently Asked Questions
Is ChatGPT multimodal AI?
GPT-4o and GPT-4, two models that power ChatGPT, are multimodal — so yes, ChatGPT is capable of being multimodal.
What is unimodal vs. multimodal AI?
Unimodal AI can only process and generate a single type of data, such as just text or just images. Meanwhile, multimodal AI can work with multiple types of data at the same time.
What is the difference between generative AI and multimodal AI?
Generative AI refers to AI systems that produce text, images, video, audio and other outputs. Multimodal AI refers to any AI system that can process and produce different types of data. Generative AI systems may use multimodal training data to develop the ability to input one type of data and output another type of data, but generative AI isn’t always multimodal.
What is an example of multimodal AI?
One common example of multimodal AI is image generators, which produce pictures based on text prompts. In some cases, these systems work in reverse too, generating text-based content from visual inputs like photos or charts.