ChatGPT’s New Upgrade Teases AI’s Multimodal Future

OpenAI’s chatbot learns to carry a conversation—and expect competition

ChatGPT isn’t just a chatbot anymore.

OpenAI’s latest upgrade grants ChatGPT powerful new abilities that go beyond text. It can tell bedtime stories in its own AI voice, identify objects in photos, and respond to audio recordings. These capabilities represent the next big thing in AI: multimodal models.

“Multimodal is the next generation of these large models, where it can process not just text, but also images, audio, video, and even other modalities,” says Linxi “Jim” Fan, senior AI research scientist at Nvidia.

ChatGPT gets an eyes-and-ears power-up

ChatGPT’s upgrade is a noteworthy example of a multimodal AI system. Instead of using a single AI model designed to work with a single form of input, like a large language model (LLM) or speech-to-voice model, multiple models work together to create a more cohesive AI tool.

OpenAI provides three specific multimodal features. Users can prompt the chatbot with images or voice, as well as receive responses in one of five AI-generated voices. Image input is available on all platforms, while voice is limited to the ChatGPT app for Android and iOS.

A demo from OpenAI shows ChatGPT being used to adjust a bike seat. A befuddled cyclist first snaps a photo of their bike and asks for help lowering the seat, then follows up with photos of the bike’s user manual and a tool set. ChatGPT responds with text describing the best tool for the job and how to use it.

These multimodal features aren’t entirely new. GPT-4 launched with an understanding of image prompts in March 2023, and some OpenAI partners, including Microsoft with Bing Chat, put that capability into practice. But tapping these features required API access, so they were generally reserved for partners and developers.

GPT-4’s multimodal features appeared in Bing Chat in the summer of 2023. Credit: Microsoft

They’re now available to everyone willing to pay US $20 a month for a ChatGPT Plus subscription. And their synthesis with ChatGPT’s friendly interface is another perk. Image input is as simple as opening the app and tapping an icon to snap a photo.

Simplicity is multimodal AI’s killer feature. Current AI models for images, videos, and voice are impressive, but finding the right model for each task can be time-consuming, and moving data between models is a chore. Multimodal AI eliminates these problems. A user can prompt the AI agent with various media, then seamlessly switch between images, text, and voice prompts within the same conversation.

“This points to the future of these tools, where they can provide us almost anything we want in the moment,” says Kyle Shannon, founder and CEO of the AI video platform Storyvine. “The future of generative AI is hyper-personalization. This will happen for knowledge workers, creatives, and end users.”

Is multimodal the future?

ChatGPT’s support for image and voice is just a taste of what’s to come.

“While there aren’t any good models for it right now, in principle you can give it 3D data, or even something like digital smell data, and it can output images, videos, and even actions,” says Fan. “I do research at Nvidia on game AI and robotics, and multimodal models are critical for these efforts.”

Image and voice input are the natural starting point for ChatGPT’s multimodal capabilities. It’s a user-facing app, and these are two of the most common forms of data a user might want to use. But there’s no reason an AI model can’t be trained to handle other forms of data, whether it’s an Excel spreadsheet, a 3D model, or a photograph with depth data.

That’s not to say it’s easy. Organizations looking to build multimodal AI face many challenges. The biggest, perhaps, is wrangling the vast amounts of data required to train a roster of AI models.

“I think multimodal models will have roughly the same landscape as the current large language models,” says Fan. “It’s very capital-intense. And it’s probably even worse for multimodal, because consider how much data is in the images, and in the videos.”

That would seem to give the edge to OpenAI and other well-heeled AI startups, such as Anthropic, creator of Claude, which recently entered an agreement worth “up to $4 billion” with Amazon.

But it’s too soon to count out smaller organizations. Fan says research into multimodal AI is less mature than research into LLMs, leaving room for researchers to find new techniques. Shannon agrees and expects innovation from all sides, citing the rapid iteration and improvement of open-source large language models like Meta’s Llama 2.

“I think there will always be a pendulum between general [AI] tools and specialty tools,” says Shannon. “What changes is that now we have the possibility of truly general tools. The specialization can be a choice rather than a requirement.”

The Conversation (1)
Ezio Savva, 03 Oct 2023

The introduction of multimodal capabilities like voice in ChatGPT could potentially marginalise users who are Deaf or hard-of-hearing. While the technology aims to create a more inclusive and interactive experience for a broader audience, it runs the risk of inadvertently excluding those who cannot utilise all its features. Especially if future updates focus more on voice and audio-based functionalities, the technology could become less accessible for people who rely solely on text and visual cues. AI developers should consider Deaf and hard-of-hearing needs when adding new multimodal features for full accessibility.