ChatGPT’s New Upgrade Teases AI’s Multimodal Future

IEEE Spectrum: For the Technology Insider
© Copyright 2024 IEEE — All rights reserved. A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity.

ChatGPT isn’t just a chatbot anymore.

OpenAI’s latest upgrade grants ChatGPT powerful new abilities that go beyond text. It can tell bedtime stories in its own AI voice, identify objects in photos, and respond to audio recordings. These capabilities represent the next big thing in AI: multimodal models.

“Multimodal is the next generation of these large models, where it can process not just text, but also images, audio, video, and even other modalities,” says Linxi “Jim” Fan, senior AI research scientist at Nvidia.

ChatGPT gets an eyes-and-ears power-up

ChatGPT’s upgrade is a noteworthy example of a multimodal AI system. Instead of relying on a single AI model designed for a single form of input, such as a large language model (LLM) or a speech-to-text model, it combines multiple models that work together as a more cohesive AI tool.
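
As a rough illustration of several models working together, here is a minimal Python sketch. It assumes a design in which each modality has its own front-end model that converts the input to text before a shared language model responds. OpenAI has not said ChatGPT works this way, and every name here (Prompt, describe_image, transcribe_audio, language_model, respond) is a hypothetical stand-in, not a real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Prompt:
    modality: str  # "text", "image", or "audio"
    payload: str   # stand-in for the raw data, such as a file name


# Hypothetical stand-ins for separate models. A real system would call
# a vision model and a speech-recognition model here.
def describe_image(payload: str) -> str:
    return f"[image description of {payload}]"


def transcribe_audio(payload: str) -> str:
    return f"[transcript of {payload}]"


def passthrough_text(payload: str) -> str:
    return payload


# One front end per modality; each one turns its input into text.
FRONT_ENDS: Dict[str, Callable[[str], str]] = {
    "text": passthrough_text,
    "image": describe_image,
    "audio": transcribe_audio,
}


def language_model(text: str) -> str:
    # Placeholder for the language model that writes the reply.
    return f"response to: {text}"


def respond(prompt: Prompt) -> str:
    # Route the prompt to the front end for its modality, then hand
    # the resulting text to the shared language model.
    front_end = FRONT_ENDS.get(prompt.modality)
    if front_end is None:
        raise ValueError(f"unsupported modality: {prompt.modality}")
    return language_model(front_end(prompt.payload))


print(respond(Prompt("image", "bike.jpg")))
```

The user sees a single entry point, respond, while the choice of model happens behind it. Supporting another kind of input in this sketch means adding one entry to FRONT_ENDS.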

OpenAI provides three specific multimodal features. Users can prompt the chatbot with images or voice, as well as receive responses in one of five AI-generated voices. Image input is available on all platforms, while voice is limited to the ChatGPT app for Android and iOS.

A demo from OpenAI shows ChatGPT being used to adjust a bike seat. A befuddled cyclist first snaps a photo of their bike and asks for help lowering the seat, then follows up with photos of the bike’s user manual and a tool set. ChatGPT responds with text describing the best tool for the job and how to use it.

These multimodal features aren’t entirely new. GPT-4 launched in March 2023 with the ability to understand image prompts, and some OpenAI partners, including Microsoft with Bing Chat, put that ability into practice. But tapping these features required API access, so they were generally reserved for partners and developers.

GPT-4’s multimodal features appeared in Bing Chat in the summer of 2023. (Credit: Microsoft)

They’re now available to everyone willing to pay US $20 a month for a ChatGPT Plus subscription. And their synthesis with ChatGPT’s friendly interface is another perk. Image input is as simple as opening the app and tapping an icon to snap a photo.

Simplicity is multimodal AI’s killer feature. Current AI models for images, videos, and voice are impressive, but finding the right model for each task can be time-consuming, and moving data between models is a chore. Multimodal AI eliminates these problems. A user can prompt the AI agent with various media, then seamlessly switch between images, text, and voice prompts within the same conversation.
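
To make the idea of switching media within one conversation concrete, here is a small Python sketch of a chat history whose turns can each carry a different modality. It is loosely modeled on the bike-seat demo described above. The Turn and Conversation classes and the placeholder strings are assumptions made for illustration, not OpenAI’s actual data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    modality: str  # "text", "image", or "voice"
    content: str   # placeholder for the actual text, image, or audio


@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)

    def add(self, role: str, modality: str, content: str) -> None:
        self.turns.append(Turn(role, modality, content))

    def context(self) -> List[str]:
        # Everything said so far, in order, whatever the modality.
        return [f"{t.role}/{t.modality}: {t.content}" for t in self.turns]


# Placeholder turns that follow the sequence of the bike-seat demo.
chat = Conversation()
chat.add("user", "image", "photo of the bike")
chat.add("user", "text", "request for help lowering the seat")
chat.add("user", "image", "photo of the user manual")
chat.add("user", "image", "photo of the tool set")
chat.add("assistant", "text", "which tool to use and how to use it")

for line in chat.context():
    print(line)
```

Because every turn lands in the same ordered history, a later reply can draw on an earlier photo without the user moving data from one tool to another.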

“This points to the future of these tools, where they can provide us almost anything we want in the moment,” says Kyle Shannon, founder and CEO of the AI video platform Storyvine. “The future of generative AI is hyper-personalization. This will happen for knowledge workers, creatives, and end users.”

Is multimodal the future?

ChatGPT’s support for image and voice is just a taste of what’s to come.

“While there aren’t any good models for it right now, in principle you can give it 3D data, or even something like digital smell data, and it can output images, videos, and even actions,” says Fan. “I do research at Nvidia on game AI and robotics, and multimodal models are critical for these efforts.”

Image and voice input are the natural starting point for ChatGPT’s multimodal capabilities. It’s a user-facing app, and these are two of the most common forms of data a user might want to use. But there’s no reason an AI model can’t be trained to handle other forms of data, whether it’s an Excel spreadsheet, a 3D model, or a photograph with depth data.

That’s not to say it’s easy. Organizations looking to build multimodal AI face many challenges. The biggest, perhaps, is wrangling the vast amounts of data required to train a roster of AI models.

“I think multimodal models will have roughly the same landscape as the current large language models,” says Fan. “It’s very capital-intense. And it’s probably even worse for multimodal, because consider how much data is in the images, and in the videos.”

That would seem to give the edge to OpenAI and other well-heeled AI startups, such as Anthropic, creator of Claude.ai, which recently entered an agreement worth “up to $4 billion” with Amazon.

But it’s too soon to count out smaller organizations. Fan says research into multimodal AI is less mature than research into LLMs, leaving room for researchers to find new techniques. Shannon agrees and expects innovation from all sides, citing the rapid iteration and improvement of open-source large language models like Meta’s Llama 2.

“I think there will always be a pendulum between general [AI] tools and specialty tools,” says Shannon. “What changes is that now we have the possibility of truly general tools. The specialization can be a choice rather than a requirement.”
