Machine-Learning Tool Easily Spots ChatGPT’s Writing

ChatGPT’s academic papers were caught 99 percent of the time

Since OpenAI launched its ChatGPT chatbot in November 2022, people have used it to help write everything from poems to work emails to research papers. Yet while ChatGPT may masquerade as a human, inaccuracies in its writing can introduce errors that could be devastating in serious work such as academic writing.

A team of researchers from the University of Kansas has developed a tool to weed out AI-generated academic writing from the stuff penned by people, with over 99 percent accuracy. This work was published on 7 June in the journal Cell Reports Physical Science.

Heather Desaire, a professor of chemistry at the University of Kansas and lead author of the new paper, says that while she’s been “really impressed” with many of ChatGPT’s results, the limits of its accuracy are what led her to develop a new identification tool. “AI text generators like ChatGPT are not accurate all the time, and I don’t think it’s going to be very easy to make them produce only accurate information,” she says.

“In science—where we are building on the communal knowledge of the planet—I wonder what the impact will be if AI text generation is heavily leveraged in this domain,” Desaire says. “Once inaccurate information is in an AI training set, it will be even harder to distinguish fact from fiction.”

In order to convincingly mimic human-generated writing, chatbots like ChatGPT are trained on reams of real text examples. While the results are often convincing at first glance, existing machine-learning tools can reliably identify telltale signs of AI intervention, such as using less emotional language.

However, existing tools like the widely used deep-learning detector RoBERTa have limited application in academic writing, the researchers write, because academic writing is already more likely to omit emotional language. In previous studies of AI-generated academic abstracts, RoBERTa was roughly 80 percent accurate.

To bridge this gap, Desaire and her colleagues developed a machine-learning tool that required limited training data. To create the training data, the team collected 64 Perspectives articles—where scientists provide commentary on new research—from the journal Science, and used those articles to generate 128 ChatGPT samples. These ChatGPT samples included 1,276 paragraphs of text for the researchers’ tool to examine.

After optimizing the model, the researchers tested it on two datasets that each contained 30 original, human-written articles and 60 ChatGPT-generated articles. In these tests, the new model was 100 percent accurate when judging full articles, and 97 and 99 percent accurate on the test sets when evaluating only the first paragraph of each article. In comparison, RoBERTa had an accuracy of only 85 and 88 percent on the test sets.
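To make these percentages concrete, here is a minimal sketch of how accuracy on a test set of this size could be computed. Only the 30/60 split comes from the article; the `accuracy` helper and the predictions are hypothetical and are not the researchers' code or results.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Test-set composition reported in the article: 30 human, 60 ChatGPT.
true_labels = ["human"] * 30 + ["chatgpt"] * 60

# Hypothetical predictions (an assumption for illustration only):
# every item correct except one ChatGPT sample mislabeled as human.
predicted = list(true_labels)
predicted[-1] = "human"

print(f"{accuracy(true_labels, predicted):.1%}")  # prints 98.9%
```

With 90 items per test set, each misclassification costs about 1.1 percentage points.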

From this analysis, the team identified sentence length and complexity as among the revealing signs that distinguish AI writing from human writing. They also found that human writers were more likely to name colleagues in their writing, while ChatGPT was more likely to use general terms like “researchers” or “others.”
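As a rough illustration of the kinds of signals described above, the sketch below extracts a few simple features from a paragraph. It is not the researchers' feature set: the function name, the naive sentence splitter, and the specific features are assumptions. Only the generic terms "researchers" and "others" come from the article.

```python
import re
import statistics

# Generic references the article says ChatGPT tends to favor.
GENERIC_TERMS = ("researchers", "others")


def paragraph_features(paragraph: str) -> dict:
    """Compute simple, hypothetical stylistic features for one paragraph."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    words = re.findall(r"[a-z']+", paragraph.lower())
    return {
        "n_sentences": len(sentences),
        "mean_sentence_len": statistics.mean(lengths) if lengths else 0.0,
        # Spread of sentence lengths, a crude proxy for variety in rhythm.
        "sentence_len_spread": statistics.pstdev(lengths) if lengths else 0.0,
        "generic_term_count": sum(words.count(t) for t in GENERIC_TERMS),
    }


sample = "Researchers found a result. Others disagreed strongly with it."
print(paragraph_features(sample))
```

Features like these could then be passed to any off-the-shelf classifier. Which features and which classifier work best would need to be established on labeled data.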

Overall, Desaire says this made for more boring writing. “In general, I would say that the human-written papers were more engaging,” she says. “The AI-written papers seemed to break down complexity, for better or for worse. But after a while, they had a really monotonous feel to them.”

The researchers hope this work can serve as a proof of concept that even off-the-shelf tools can identify AI-generated samples without requiring extensive machine-learning knowledge.

However, these results may be promising only in the short term. Desaire and colleagues note that this scenario covers only a sliver of the kinds of academic writing that ChatGPT could do. For example, if ChatGPT were asked to write a perspective article in the style of a particular human sample, it might be more difficult to spot the difference.

Desaire says she can see a future where AI like ChatGPT is used ethically, but identification tools will need to keep growing alongside the technology to make this possible.

“I think it could be leveraged safely and effectively in the same way we use spell-check now. A basically complete draft could be edited by AI as a last-step revision for clarity,” she says. “If people do this, they need to be absolutely certain that no factual inaccuracies were introduced in this step, and I worry that this fact-check step may not always be done with rigor.”
