Gemini Is Google’s Best AI Model Yet, But Who Cares?

Google’s reveal of Gemini, an AI model built to close the gap between the search giant and OpenAI, made a great first impression. Strong benchmarks, a flashy video demo, and immediate availability (albeit for a cut-back version) signaled confidence.

But the positivity soured as AI engineers and enthusiasts picked through the details and found flaws. Gemini is an impressive entry that may eventually erode GPT-4’s dominance, but Google’s slippery messaging has left it playing defense.

“There’s more question than there are answers,” says Emma Matthies, lead AI engineer at a major North American retailer, who was speaking for herself and not for her employer. “I did find there to be a discontinuity between the way [Google’s Gemini video demo] was shown and details that are actually in Google’s tech blog.”

Google’s troubled demo

Google’s Gemini demo drew criticism as AI developers noticed inconsistencies. Credit: Google

The demo in question is titled “Hands-on with Gemini,” and launched on YouTube alongside Gemini’s reveal. It’s fast-paced, friendly, fun, and packed with easy-to-understand visual examples. It also exaggerates how Gemini works.

A Google representative says the demo “shows real prompts and outputs from Gemini.” But the video’s editing leaves out some details. The exchange with Gemini occurred over text, not voice, and the visual problems the AI solved were input as images, not a live video feed. Google’s blog also describes prompts not shown in the demo. When Gemini was asked to identify a game of rock, paper, scissors based on hand gestures, it was given the hint “it’s a game.” The demo omits that hint.

And that’s just the start of Google’s problems. AI developers quickly realized Gemini’s capabilities were less revolutionary than they initially appeared.

“If you look at the capabilities of GPT-4 Vision, and you build the right interface for it, it’s similar to Gemini,” says Matthies. “I’ve done things like this as side projects, and there’s experiments on social media like this as well, such as the ‘David Attenborough is narrating my life’ video, which was extremely funny.”

GPT-4 Vision can interpret images in ways similar to Google’s Gemini demo. Credit: Replicate

On 11 December, just five days after Gemini’s reveal, an AI developer named Greg Sadetsky produced a rough recreation of the Gemini demo with GPT-4 Vision. He followed up with a head-to-head comparison between Gemini and GPT-4 Vision, which didn’t go Google’s way.

Google is taking flak for its benchmark data, too. Google claims that Gemini Ultra, the largest of three models in the family, beats GPT-4 on a variety of benchmarks. This is broadly true, but the quoted figures are selected to paint Gemini in the best light.

Google measured performance with different methodologies than others used. The way a user prompts an AI model can influence its performance, and results are only comparable when the same prompt strategy is used.

GPT-4’s performance on a benchmark called massive multitask language understanding (MMLU) was measured using what’s called few-shot prompting. Asking a question without context is called a “zero-shot” prompt, while providing a few examples is a “few-shot” prompt.

Another method walks the AI model through the reasoning necessary to find an answer. Google’s Gemini was measured using such a chain-of-reasoning method, notes Richard Davies, Lead Artificial Intelligence Engineer at Guildhawk. “It’s not a fair comparison.”
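To make the difference between these prompt strategies concrete, here is a minimal Python sketch. It is not the prompt format used by Google, OpenAI, or the MMLU benchmark; the function name `build_prompt`, the `Q:`/`A:`/`Reasoning:` layout, and the arithmetic examples are all hypothetical and chosen only for illustration.

```python
def build_prompt(question, examples=(), show_reasoning=False):
    """Assemble a plain-text prompt (hypothetical layout, for illustration).

    No examples          -> zero-shot prompt
    A few examples       -> few-shot prompt
    Examples + reasoning -> a chain-of-reasoning style prompt
    """
    parts = []
    for example in examples:
        parts.append(f"Q: {example['question']}")
        if show_reasoning:
            # Walk the model through the steps before giving the answer.
            parts.append(f"Reasoning: {example['reasoning']}")
        parts.append(f"A: {example['answer']}")
        parts.append("")  # blank line between worked examples
    parts.append(f"Q: {question}")
    # Ask for reasoning first, or for the answer directly.
    parts.append("Reasoning:" if show_reasoning else "A:")
    return "\n".join(parts)


# Made-up worked examples, not taken from any real benchmark.
EXAMPLES = [
    {"question": "What is 2 + 3?",
     "reasoning": "Start at 2 and count up 3 to reach 5.",
     "answer": "5"},
    {"question": "What is 10 - 4?",
     "reasoning": "Start at 10 and count down 4 to reach 6.",
     "answer": "6"},
]

zero_shot = build_prompt("What is 7 + 8?")
few_shot = build_prompt("What is 7 + 8?", EXAMPLES)
chain = build_prompt("What is 7 + 8?", EXAMPLES, show_reasoning=True)

print(chain)
```

The same question produces three different prompts, which is why a score earned with one strategy can’t be compared directly with a score earned with another.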

Google’s paper on Gemini offers a range of comparisons, but its marketing compares different strategies to make its results look better. It also focuses entirely on Gemini Ultra, which isn’t yet available to the public. Gemini Pro, the only version currently available, delivers less impressive results.

Gemini impresses despite fumbled messaging

The problems with Gemini’s presentation cast a shadow over its announcement. Peer past the insincere marketing, though, and Gemini remains an impressive feat.

Gemini is multimodal, which means it can reason across text, images, audio, code, and other forms of media. This isn’t unique to Gemini, but most multimodal models are unavailable to the public, difficult to use, or focused on a narrow task. That’s left OpenAI’s GPT-4 to dominate the space.

“At the very least, I am looking forward to there being a strong alternative and close competitor to GPT-4 and the new GPT-4 vision model. Because currently, there just isn’t anything in the same class,” says Matthies.

Gemini Ultra’s impressive benchmark results show a small yet significant edge over GPT-4. Credit: Google

Davies, meanwhile, is intrigued by Gemini’s benchmark performance which, despite cherry-picking, shows a significant improvement in several like-for-like scenarios.

“It’s about a four percent improvement [in MMLU] from GPT-4’s 86.4 percent to Gemini’s 90 percent. But in terms of how much actual error is reduced, it’s reduced by more than 20 percent… that’s quite a lot,” says Davies. Even small reductions in error have a big impact when models receive millions of requests per day.
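Davies’s arithmetic can be checked with a few lines of Python. This sketch uses only the two MMLU scores he quotes; the function name `relative_error_reduction` is made up for illustration.

```python
def relative_error_reduction(old_accuracy, new_accuracy):
    """Return the percentage by which the error rate shrinks.

    Both arguments are accuracies in percent (0 to 100).
    """
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    return (old_error - new_error) / old_error * 100.0


gpt4_mmlu = 86.4    # percent, as quoted by Davies
gemini_mmlu = 90.0  # percent, as quoted by Davies

# Gain in accuracy, in percentage points.
absolute_gain = gemini_mmlu - gpt4_mmlu
# Share of GPT-4's errors that disappear at Gemini's score.
reduction = relative_error_reduction(gpt4_mmlu, gemini_mmlu)

print(f"Accuracy gain: {absolute_gain:.1f} percentage points")
print(f"Error reduction: {reduction:.1f} percent")
```

The error rate falls from 13.6 percent to 10 percent, so roughly a quarter of the mistakes go away, which is consistent with Davies’s “more than 20 percent.”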

Gemini’s fate remains to be determined, and it hinges on two unknowns: Gemini Ultra’s release date and OpenAI’s GPT-5. While users can try Gemini Pro right now, its bigger sibling won’t be released until sometime in 2024. The rapid pace of AI development makes it hard to say how Ultra will fare once it arrives, and gives OpenAI ample time to respond with a new model or a moderately improved version of GPT-4.
