On 11 March 2023, Samsung’s Device Solutions division permitted employee use of ChatGPT. Problems ensued. A report in The Economist Korea, published less than three weeks later, identified three cases of “data leakage.” Two engineers used ChatGPT to troubleshoot confidential code, and an executive used it for a transcript of a meeting. Samsung changed course, banning employee use not just of ChatGPT but of all external generative AI.
Samsung’s situation illustrates a problem facing anyone who uses third-party generative AI tools based on a large language model (LLM). The most powerful AI tools can ingest large chunks of text and quickly produce useful results, but this feature can easily lead to data leaks.
“That might be fine for personal use, but what about corporate use? […] You can’t just send all of your data to OpenAI, to their servers,” says Taleb Alashkar, chief technology officer of the computer vision company AlgoFace and an MIT research affiliate.
Naïve AI users hand over private data
Generative AI’s data privacy issues boil down to two key concerns.
AI is bound by the same privacy regulations as other technology. Italy’s temporary ban of ChatGPT occurred after a security incident in March 2023 that let users see the chat histories of other users. This problem could affect any technology that stores user data. Italy lifted its ban after OpenAI added features to give users more control over how their data is stored and used.
But AI faces other unique challenges. Generative AI models aren’t designed to reproduce training data and generally can’t do so in any specific instance, but reproduction is not impossible. A paper titled “Extracting Training Data from Diffusion Models,” published in January 2023, describes how Stable Diffusion can generate images similar to images in the training data. The Doe vs. GitHub lawsuit includes examples of code generated by GitHub Copilot, a tool powered by an LLM from OpenAI, that match code found in training data.
Researchers discovered that Stable Diffusion can sometimes produce images similar to its training data. Source: “Extracting Training Data from Diffusion Models”
This leads to fears that generative AI controlled by a third party could unintentionally leak sensitive data, either in part or in whole. Some generative AI tools, including ChatGPT, worsen this fear by including user data in their training set. Organizations concerned about data privacy are left with little choice but to bar the use of such tools.
“Think about an insurance company, or big banks, or [Department of Defense], or Mayo Clinic,” says Alashkar, adding that “every CIO, CTO, security principal, or manager in a company is busy looking over those policies and best practices. I think most responsible companies are very busy now trying to find the right thing.”
Efficiency holds the answer to private AI
AI’s data privacy woes have an obvious solution. An organization could train a model using its own data (or data it has sourced through means that meet data-privacy regulations) and deploy the model on hardware it owns and controls. But the obvious solution comes with an obvious problem: It’s inefficient. The process of training and deploying a generative AI model is expensive and difficult to manage for all but the most experienced and well-funded organizations.
“When you start training on 500 GPUs, things go wrong. You really have to know what you’re doing, and that’s what we’ve done, and we’ve packaged it together in an interface,” says Naveen Rao, cofounder and CEO of MosaicML. Rao’s company offers a third option: a hosted AI model that runs within MosaicML’s secure environment. The model can be controlled through a Web client, a command line interface, or Python.
“Here’s the platform, here’s the model, and you keep your data. Train your model and keep your model weights. The data stays in your network,” explains Julie Choi, MosaicML’s chief marketing and community officer. Choi says the company works with clients in the financial industry and others that are “really invested in their own IP.”
The hosted approach is a growing trend. Intel is collaborating on a private AI model for Boston Consulting Group, IBM plans to step into the arena with Watsonx AI, and existing services like Amazon’s SageMaker and Microsoft’s Azure ML are evolving in response to demand.
MosaicML can train a hosted LLM in under 10 days and will automatically compensate for hardware failures that occur in training. Source: MosaicML
Training a hosted AI model remains expensive, difficult, and time consuming, but drastically less so than going it alone. On 5 May 2023, MosaicML announced it had trained an LLM called MPT-7B for less than US $200,000 in nine and a half days and without human intervention. OpenAI doesn’t reveal the cost of training its models, but estimates peg the cost of training GPT-3 at a minimum of $4.6 million.
Deploying a hosted AI model also gives organizations control over issues that border on privacy, such as trust and safety. Choi says that a nutrition chat app turned to MosaicML after discovering its AI suggestions produced “fat shaming” responses. The app, which at the time used a competing LLM, couldn’t prevent undesirable responses because it didn’t control the training data or the weights used to fine-tune its output.
“We really believe that security and data privacy are paramount when you’re building AI systems. Because at the end of the day, AI is an accelerant, and it’s going to be trained on your data to help you make your decisions,” says Choi.