Want to Keep AI From Sharing Secrets? Train It Yourself

On 11 March 2023, Samsung’s Device Solutions division permitted employee use of ChatGPT. Problems ensued. A report in The Economist Korea, published less than three weeks later, identified three cases of “data leakage.” Two engineers used ChatGPT to troubleshoot confidential code, and an executive used it for a transcript of a meeting. Samsung changed course, banning employee use not just of ChatGPT but of all external generative AI.

Samsung’s situation illustrates a problem facing anyone who uses third-party generative AI tools based on a large language model (LLM). The most powerful AI tools can ingest large chunks of text and quickly produce useful results, but this feature can easily lead to data leaks.

“That might be fine for personal use, but what about corporate use? […] You can’t just send all of your data to OpenAI, to their servers,” says Taleb Alashkar, chief technology officer of the computer vision company AlgoFace and an MIT research affiliate.

Naïve AI users hand over private data

Generative AI’s data privacy issues boil down to two key concerns.

AI is bound by the same privacy regulations as other technology. Italy’s temporary ban of ChatGPT occurred after a security incident in March 2023 that let users see the chat histories of other users. This problem could affect any technology that stores user data. Italy lifted its ban after OpenAI added features to give users more control over how their data is stored and used.

But AI also faces unique challenges. Generative AI models aren’t designed to reproduce training data and generally don’t do so in any specific instance, but reproduction is not impossible. A paper titled “Extracting Training Data from Diffusion Models,” published in January 2023, describes how Stable Diffusion can generate images similar to images in the training data. The Doe v. GitHub lawsuit includes examples of code generated by GitHub Copilot, a tool powered by an LLM from OpenAI, that match code found in training data.

Researchers discovered that Stable Diffusion can sometimes produce images similar to its training data. Here, a photograph of Ann Graham Lotz that was included in the model’s training data appears next to a closely similar image generated by Stable Diffusion. Source: Extracting Training Data from Diffusion Models

This leads to fears that generative AI controlled by a third party could unintentionally leak sensitive data, either in part or in whole. Some generative AI tools, including ChatGPT, worsen this fear by including user data in their training set. Organizations concerned about data privacy are left with little choice but to bar their use.

“Think about an insurance company, or big banks, or [Department of Defense], or Mayo Clinic,” says Alashkar, adding that “every CIO, CTO, security principal, or manager in a company is busy looking over those policies and best practices. I think most responsible companies are very busy now trying to find the right thing.”

Efficiency holds the answer to private AI

AI’s data privacy woes have an obvious solution. An organization could train a model using its own data (or data it has sourced through means that meet data-privacy regulations) and deploy the model on hardware it owns and controls. But the obvious solution comes with an obvious problem: It’s inefficient. The process of training and deploying a generative AI model is expensive and difficult to manage for all but the most experienced and well-funded organizations.

“When you start training on 500 GPUs, things go wrong. You really have to know what you’re doing, and that’s what we’ve done, and we’ve packaged it together in an interface,” says Naveen Rao, cofounder and CEO of MosaicML. Rao’s company offers a third option: a hosted AI model that runs within MosaicML’s secure environment. The model can be controlled through a Web client, a command line interface, or Python.

“Here’s the platform, here’s the model, and you keep your data. Train your model and keep your model weights. The data stays in your network,” explains Julie Choi, MosaicML’s chief marketing and community officer. Choi says the company works with clients in the financial industry and others that are “really invested in their own IP.”

The hosted approach is a growing trend. Intel is collaborating on a private AI model for Boston Consulting Group, IBM plans to step into the arena with Watsonx AI, and existing services like Amazon’s SageMaker and Microsoft’s Azure ML are evolving in response to demand.

A graph shows the training of an AI model hosted on MosaicML, marking several points at which hardware failures occurred. Training resumed automatically after each failure. MosaicML can train a hosted LLM in under 10 days and will automatically compensate for hardware failures that occur in training. Source: MosaicML

Training a hosted AI model remains expensive, difficult, and time-consuming, but drastically less so than going it alone. On 5 May 2023, MosaicML announced it had trained an LLM called MPT-7B for less than US $200,000 in nine and a half days and without human intervention. OpenAI doesn’t reveal the cost of training its models, but estimates peg the cost of training GPT-3 at a minimum of $4.6 million.

Deploying a hosted AI model also gives organizations control over issues that border on privacy, such as trust and safety. Choi says that a nutrition chat app turned to MosaicML after discovering its AI suggestions produced “fat shaming” responses. The app, which at the time used a competing LLM, couldn’t prevent undesirable responses because it didn’t control the training data or the weights used to fine-tune its output.

“We really believe that security and data privacy are paramount when you’re building AI systems. Because at the end of the day, AI is an accelerant, and it’s going to be trained on your data to help you make your decisions,” says Choi.
