What Is GPT-4o?

Here’s what you need to know about OpenAI’s latest multimodal model.

Written by Ellen Glover
Published on Aug. 29, 2024

GPT-4o (short for GPT-4 omni) is an artificial intelligence model made by OpenAI, the company behind ChatGPT. It serves as the default model for the chatbot and can be integrated into other generative AI tools through the company’s API.

What Is GPT-4o?

GPT-4o is a large language model developed by OpenAI. It is multimodal, meaning it can reason across text, visuals and audio in real time.

Similar to OpenAI’s other large language models, GPT-4o can be used to generate written content and carry on text-based conversations with users. It is also multimodal, meaning it can understand and produce images, video and audio, in addition to text. All of this is accomplished in a single system, which helps to enable more natural human-computer interactions, an OpenAI spokesperson told Built In. 

“We’re looking at the future of interaction between ourselves and the machines,” Mira Murati, OpenAI’s chief technology officer, said during a live-streamed demo of GPT-4o. “We think that GPT-4o is really shifting that paradigm into the future of collaboration, where this interaction becomes much more natural and far, far easier.”


What Is GPT-4o?

Launched in May 2024, GPT-4o is a multilingual, multimodal AI model developed by the company OpenAI. It is the most capable of all the company’s models in terms of functionality and performance, offering language processing capabilities similar to those of its predecessor, GPT-4, but at faster speeds and lower costs. GPT-4o also excels at complex reasoning, language translation, math and coding. 

GPT-4o is designed to process and integrate text, visuals and audio all on one neural network, providing the model with a more comprehensive understanding of subjects across all those modalities. For example, if a user gives GPT-4o a photo of a birthday cake and asks for a recipe to make that cake, the model can analyze the image — identify that it’s a birthday cake, note its dimensions and other details — and generate an accurate recipe.
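To make that example concrete, here is a minimal sketch of what such a request could look like through OpenAI’s official Python SDK. The image URL and prompt are placeholders, and the snippet assumes an API key is set in the OPENAI_API_KEY environment variable.

```python
# Minimal sketch: send an image plus a text question to GPT-4o.
# The image URL and prompt are placeholders for illustration only.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Here is a photo of a birthday cake. Suggest a recipe that would produce a similar cake.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/birthday-cake.jpg"},  # placeholder URL
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```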

This native multimodality “dramatically” increases GPT-4o’s speed and efficiency, said Brady Lund, an assistant professor of information science at the University of North Texas who has studied GPT-4o’s capabilities. It also enables ChatGPT to function more like a human, processing information from multiple sources at once to better assist its users.

“Apart from the brain, [humans] also have an eye and we also have an ear. We are able to listen, we are able to see things,” generative AI expert Ritesh Vajariya told Built In. “[GPT-4o] is able to combine all of those capabilities into a single shot.”


What Can GPT-4o Do?

According to OpenAI, some of the model’s most prominent abilities include: 

  • Text summarization and generation: GPT-4o can perform common LLM tasks such as text summarization, content generation and text-based chats with users. Plus, with a context window of up to 128,000 tokens and an output limit of 4,096 tokens, the model can handle bigger document inputs and maintain longer conversations with users than GPT-4 (see the sketch below).
  • Multimodal reasoning and generation: GPT-4o integrates text, audio and visuals into a single model, meaning it can process and generate a combination of these data types faster than if each were handled by a separate model. 
  • Image generation: GPT-4o can generate images from text prompts, similar to other AI art generators like Stable Diffusion and Midjourney.
  • Visual processing and analysis: GPT-4o can analyze image and video inputs and then explain their content in text form.
  • Voice generation: GPT-4o can generate spoken language, offering a range of distinct voices created in collaboration with human actors.
  • Audio conversations: GPT-4o can engage in real-time verbal conversations by receiving voice inputs from users and replying with AI-generated audio. The model’s average response time is 320 milliseconds, similar to typical human response times.
  • Language translation: GPT-4o supports real-time translation in more than 50 languages. It has better text-processing capabilities for non-English languages compared to GPT-4, particularly for languages that don’t use the Latin alphabet, like Korean, Arabic and Russian.

Not all of the above capabilities are widely available. Some are exclusive to ChatGPT Plus subscribers, or select API users, and others aren’t publicly available at all. 
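For the text side of that list, a basic summarization call through OpenAI’s Python SDK might look like the following sketch. The document text and prompt are placeholders, and the max_tokens value is just one illustrative setting under the model’s 4,096-token output ceiling.

```python
# Minimal sketch: text summarization with GPT-4o via the OpenAI Python SDK.
# The document text is a placeholder; max_tokens caps the reply length.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

long_document = "..."  # placeholder for a large text input

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise technical summarizer."},
        {
            "role": "user",
            "content": f"Summarize the following document in five bullet points:\n\n{long_document}",
        },
    ],
    max_tokens=500,  # well under the model's 4,096-token output limit
)

print(response.choices[0].message.content)
```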


GPT-4o Limitations

Although GPT-4o surpassed several benchmarks in ability, speed and cost-efficiency, it remains a work in progress. The model’s multimodal capabilities introduce all sorts of new ways for ChatGPT to hallucinate and otherwise get things wrong. And its training data only extends to October 2023, so it may generate false or outdated information.

In their research, Lund and his colleagues also found that GPT-4o has a tendency to misunderstand “complex and ambiguous inputs,” particularly when those inputs are audio or visual. But he believes this is mostly because these capabilities are so new. “I think it will be refined over time,” he told Built In.


GPT-4o vs. GPT-4

GPT-4o is now the default AI model for ChatGPT, replacing GPT-4. While they share some similarities, the two models differ significantly in terms of ability, performance and efficiency.

GPT-4o Handles Multimodality Differently

GPT-4 was primarily designed for text processing, meaning it doesn’t have built-in support for handling audio or visual inputs. Instead, those modalities are siloed into separate models. In the ChatGPT interface, for example, GPT-4 has to call on other OpenAI models to process any non-text data, such as DALL-E for images and Whisper for audio. This can lead to longer response times and higher computing costs.

In contrast, GPT-4o was designed for multimodality from the beginning. Trained on a large corpus of text, image, video and audio data, the model can merge all of these capabilities on a single neural network — that means faster response times and smoother transitions between tasks, according to OpenAI.

GPT-4o Is Faster and Cheaper 

GPT-4o is designed to be faster and more cost-efficient than GPT-4 across the board, not just for multimodal tasks. Overall, GPT-4o is twice as fast and costs half as much to run as GPT-4 Turbo, the most recent version of GPT-4, according to OpenAI.

GPT-4o Knows More Languages 

OpenAI says GPT-4o performs “significantly” better at non-English languages than GPT-4 thanks to a new tokenizer, which converts text into smaller chunks that the model can understand mathematically. This is especially useful when translating languages that aren’t based on the Latin alphabet, such as Hindi, Japanese and Korean. The change tackles a longstanding issue in machine translation, where models have historically been optimized primarily for Western languages at the expense of those in other regions.
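As a rough illustration, a translation request to GPT-4o through OpenAI’s Python SDK could look like the sketch below; the sentence and target language are placeholders chosen for this example.

```python
# Minimal sketch: asking GPT-4o to translate into a non-Latin-script language.
# The sentence and target language are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Translate into Hindi: 'The meeting has been moved to Thursday at 3 p.m.'",
        }
    ],
)

print(response.choices[0].message.content)
```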

GPT-4o Is Better at Reasoning Tasks 

Lund found that GPT-4o is better than GPT-4 and other previous GPT models at performing inductive reasoning tasks — reasoning “how to get from point A to point B,” he said. For example, if the user asks GPT-4o how to build a shed, the model can figure out the steps needed to do it.

GPT-4o is also marginally better at deductive reasoning and inference, meaning it can both “derive valid conclusions” from information and “generate credible hypotheses with limited knowledge,” Lund and his colleagues wrote.

OpenAI Emphasized Safety With GPT-4o

Safety was built into GPT-4o from the beginning and has been reinforced at every step of the development process, according to the OpenAI spokesperson. This was done through techniques like filtering training data and refining the model’s behavior through post-training. The model also underwent “extensive” external red teaming to help identify risks that are either introduced or amplified by the newly added modalities.

In addition to evaluating the safety of the model’s text and vision capabilities, the company spokesperson said OpenAI focused additional efforts on its audio capabilities, noting several novel risks like unauthorized voice generation and the potential for generating copyrighted content. Based on those evaluations, OpenAI said it implemented new safety guardrails to mitigate the risks of voice outputs specifically.

In the end, OpenAI scored GPT-4o as a “medium risk,” the highest risk level a model can have and still be deployed under the company’s Preparedness Framework.

“In the previous models, they were not open about how they were giving out the scorecards,” Vajariya said. “With GPT-4o, they are more vocal about the scorecard system, as well as their preparedness framework, which revolves around how they perceive the risk.” 

GPT-4o Has a ‘Mini’ Version

Shortly after announcing GPT-4o, OpenAI released a more compact version of the model, called GPT-4o mini. It’s faster and cheaper than GPT-4o, according to the company, and performs better on industry benchmarks than several other similarly sized models, including Gemini Flash and Claude Haiku.

“GPT-4o is a bazooka, you don’t need a bazooka every day,” Vajariya said, adding that GPT-4o mini is more like a revolver, designed for everyday use. “It is more energy efficient, it is cheaper to operate on their end as well. It may not have all the bells and whistles, or be as accurate as GPT-4o, but you don’t need that kind of accuracy all the time.”

GPT-4o mini is now available as a text and vision model to developers using OpenAI’s API. It also powers ChatGPT for Free, Plus and Team users.
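In practice, switching between the two models through the API is mostly a matter of changing the model name. A minimal sketch using OpenAI’s Python SDK, with a placeholder prompt:

```python
# Minimal sketch: the main change needed to use GPT-4o mini instead of GPT-4o
# is the model name passed to the API. The prompt is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",  # cheaper, faster sibling of "gpt-4o"
    messages=[{"role": "user", "content": "Draft a two-sentence product update email."}],
)

print(response.choices[0].message.content)
```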


How to Access GPT-4o

Users can access GPT-4o in several ways:

  • ChatGPT: GPT-4o is the default model powering ChatGPT. Free users do not have access to some of the model’s more advanced features, including vision, file uploads and data analysis, and are limited to a certain number of inputs — at which time the chatbot reverts to GPT-4o mini. Users who pay $20/month for ChatGPT Plus get full access to GPT-4o, with no feature restrictions or input limits.
  • API: Developers can access GPT-4o through OpenAI’s API and Microsoft’s Azure AI platform, meaning they can fine-tune and integrate all of the model’s publicly available capabilities into their own applications.
  • Desktop: OpenAI has integrated GPT-4o into a new ChatGPT desktop application, which is available for Apple’s macOS.

Frequently Asked Questions

Is GPT-4o available to the public?

Yes, GPT-4o is available to anyone with an OpenAI API account, as well as through Microsoft’s Azure AI platform. It is also available to Free and Plus users in ChatGPT, but Free users don’t have access to the model’s more advanced features, and are limited to a certain number of inputs.

How can you access GPT-4o?

GPT-4o can be accessed via OpenAI’s API and Microsoft’s Azure AI platform. It can also be accessed through ChatGPT, but users with a free account are limited in the number of exchanges they can have with GPT-4o before the chatbot reverts to GPT-4o mini.

How is GPT-4o different from GPT-4?

Unlike GPT-4, GPT-4o is natively multimodal, meaning it can process and generate different types of data — text, images, video and audio — within a single system. GPT-4o is also faster and cheaper than GPT-4. And while GPT-4o achieves similar performance to GPT-4 in text, reasoning and coding tasks, it excels in its multilingual audio and vision capabilities.
