If you’re a typical person who has plenty of medical questions and not enough time with a doctor to ask them, you may have already turned to ChatGPT for help. Have you asked ChatGPT to interpret the results of that lab test your doctor ordered? The one that came back with inscrutable numbers? Or maybe you described some symptoms you’ve been having and asked for a diagnosis. In which case the chatbot probably responded with something that began like, “I’m an AI and not a doctor,” followed by some at least reasonable-seeming advice. ChatGPT, the remarkably proficient chatbot from OpenAI, always has time for you, and always has answers. Whether or not they’re the right answers...well, that’s another question.
Meanwhile, doctors are reportedly using it to deal with paperwork like letters to insurance companies, and also to find the right words to say to patients in hard situations. To understand how this new mode of AI will affect medicine, IEEE Spectrum spoke with Isaac Kohane, chair of the Department of Biomedical Informatics at Harvard Medical School. Kohane, a practicing physician with a computer science Ph.D., got early access to GPT-4, the latest version of the large language model that powers ChatGPT. He ended up writing a book about it with Peter Lee, Microsoft’s corporate vice president of research and incubations, and Carey Goldberg, a science and medicine journalist.
In the new book, The AI Revolution in Medicine: GPT-4 and Beyond, Kohane describes his attempts to stump GPT-4 with hard cases and also thinks through how it could change his profession. He writes that one question became foremost in his mind: “How do we test this so we can start using it as safely as possible?”
Isaac Kohane on:
- GPT-4’s performance
- How it can be integrated into health care
- Why it’s different this time
- How to keep patients safe from hallucinating AI
- Why it’s not a replacement for human doctors...yet
IEEE Spectrum: How did you get involved in testing GPT-4 before its public launch?
Isaac Kohane: I got a call in October from Peter Lee who said he could not even tell me what he was going to tell me about. And he gave me several reasons why this would have to be a very secret discussion. He also shared with me that in addition to his enthusiasm about it, he was extremely puzzled, losing sleep over the fact that he did not understand why it was performing as well as it did. And he wanted to have a conversation with me about it, because health care was a domain that he’s long been interested in. And he knew that it was a long-standing interest of mine because I did my Ph.D. thesis in expert systems back in the 1980s. And he also knew that I was starting a new journal, NEJM AI.
“What I didn’t share in the book is that it argued with me. There was one point in the workup where I thought it had made a wrong call, but then it argued with me successfully. And it really didn’t back down.”
—Isaac Kohane, Harvard Medical School
He thought that medicine was a good domain to discuss, because there were both clear dangers but also clear benefits to the public. Benefits: If it improved health care, improved patient autonomy, improved doctor productivity. And dangers: If things that were already apparent at that time such as inaccuracies and hallucinations would affect clinical judgment.
You described in the book your first impressions. Can you talk about the wonder and concern that you felt?
Kohane: Yeah. I decided to take Peter at his word about this really impressive performance. So I went right for the jugular, and gave it a really hard case, and a controversial case that I remember well from my training. I got called down to the newborn nursery because they had a baby with a small phallus and a scrotum that did not have testicles in it. And that’s a very tense situation for parents and for doctors. And it’s also a domain where the knowledge about how to work it out covers pediatrics, but also understanding hormone action, understanding which genes are associated with those hormone actions, which are likely to go awry. And so I threw that all into the mix. I treated GPT-4 as if it were just a colleague and said, “Okay, here’s a case, what would you do next?” And what was shocking to me was it was responding like someone who had gone through not only medical training, and pediatric training, but through a very specific kind of pediatric endocrine training, and all the molecular biology. I’m not saying it understood it, but it was behaving like someone who did.
And that was particularly mind-blowing because as a researcher in AI and as someone who understood how a transformer model works, where the hell was it getting this? And this is definitely not a case that anybody knows about. I never published this case.
And this, frankly, was before OpenAI had done some major aligning on the model. So it was actually much more independent and opinionated. What I didn’t share in the book is that it argued with me. There was one point in the workup where I thought it had made a wrong call, but then it argued with me successfully. And it really didn’t back down. But OpenAI has now aligned it, so it’s a much more go-with-the-flow, user-must-be-right personality. But this was full-strength science fiction, a doctor-in-the-box.
“At unexpected moments, it will make stuff up. How are you going to incorporate this into practice?”
—Isaac Kohane, Harvard Medical School
Did you see any of the downsides that Peter Lee had mentioned?
Kohane: When I would ask for references, it made them up. And I was saying, okay, this is going to be incredibly challenging, because here’s something that’s really showing genuine expertise in a hard problem and would be great for a second opinion for a doctor and for a patient. Yet, at unexpected moments, it will make stuff up. How are you going to incorporate this into practice? And we’re having a tough enough time with narrow AI in getting regulatory oversight. I don’t know how we’re going to do this.
You said GPT-4 may not have understood at all, but it was behaving like someone who did. That gets to the crux of it, doesn’t it?
Kohane: Yes. And although it’s fun to talk about whether this is AGI [artificial general intelligence] or not, I think that’s almost a philosophical question. In terms of putting my engineer hat on, is this substituting for a great second opinion? And the answer is often: yes. Does it act as if it knows more about medicine than an average general practitioner? Yes. So that’s the challenge. How do we deal with that? Whether or not it’s a “true sentient” AGI is perhaps an important question, but not the one I’m focusing on.
You mentioned there are already difficulties with getting regulations for narrow AI. Which organizations or hospitals will have the chutzpah to go forward and try to get this thing into practice? It feels like with questions of liability, it’s going to be a really tough challenge.
Kohane: Yes, it does, but what’s amazing about it (and I don’t know if this was the intent of OpenAI and Microsoft) is that by releasing it into the wild for millions of doctors and patients to try, it has already triggered a debate that is going to make it happen regardless. And what do I mean by that? On the one hand, look on the patient side. Except for a few lucky people who are particularly well connected, you don’t know who’s giving you the best advice. You have questions after a visit, but you don’t have someone to answer them. You don’t have enough time talking to your doctor. And that’s why, before these generative models, people were using simple search all the time for medical questions. The popular phrase was “Dr. Google.” And the fact is there were lots of problematic websites that would be dug up by that search engine. In that context, in the absence of sufficient access to authoritative opinions of professionals, patients are going to use this all the time.
“We know that doctors are using this. Now, the hospitals are not endorsing this, but doctors are tweeting about things that are probably illegal.”
—Isaac Kohane, Harvard Medical School
So that’s the patient side. What about the doctor side?
Kohane: And you can say, “Well, what about liability?” We know that doctors are using this. Now, the hospitals are not endorsing this, but doctors are tweeting about things that are probably illegal. For example, they’re slapping a patient history into the Web form of ChatGPT and asking to generate a letter for prior authorization for the insurance company. Now, why is that illegal? Because there are two different products that ultimately come from the same model. One is through OpenAI and then the other is through Microsoft, which makes it available through its HIPAA-controlled cloud. And even though OpenAI uses Azure, it’s not through this HIPAA-controlled process. So doctors technically are violating HIPAA by putting private patient information into the Web browser. But nonetheless, they’re doing it because the need is so great.
The administrative pressures on doctors are so great that being able to increase your efficiency by 10 percent, 20 percent is apparently good enough. And it’s clear to me that because of that, hospitals will have to deal with it. They’ll have their own policies to make sure that it’s safer, more secure. So they’re going to have to deal with this. And electronic record companies, they’re going to have to deal with it. So by making this available to the broad public, all of a sudden AI is going to be injected into health care.
You know a lot about the history of AI in medicine. What do you make of some of the prior failures or fizzles that have happened, like IBM Watson, which was touted as such a great revolution in medicine and then never really went anywhere?
Kohane: Right. Well, you have to watch out when your senior management believes your hype. They took a really impressive performance of Watson on Jeopardy!, which was a genuinely groundbreaking performance, and somehow convinced themselves that this was now going to work for medicine, and created unreasonably high goals. At the same time, it was really poor implementation. They didn’t really hook it well into the live data of health records and did not expose it to the right kind of knowledge sources. So it both was an overpromise, and it was underengineered into the workflow of doctors.
Speaking of fizzles, this is not the first heyday of artificial intelligence, this is perhaps the second heyday. When I did my Ph.D., there were many computer scientists like myself who thought the revolution was coming. And it wasn’t, for at least three reasons: The clinical data was not available, knowledge was not encoded in a good way, and our machine-learning models were inadequate. And all of a sudden there was that Google paper in 2017 about transformers, and in that blink of an eye of five years, we developed this technology that miraculously can use human text to perform inferencing capabilities that we’d only imagined.
“When you’re driving, it’s obvious when you’re heading into a traffic accident. It might be harder to notice when an LLM recommends an inappropriate drug after a long stretch of good recommendations.”
—Isaac Kohane, Harvard Medical School
Can we talk a little bit about GPT-4’s mistakes, hallucinations, whatever we want to call them? It seems they’re somewhat rare, but I wonder if that’s worse because if something’s wrong only every now and then, you probably get out of the habit of checking and you’re just like, “Oh, it’s probably fine.”
Kohane: You’re absolutely right. If it was happening all the time, we’d be superalert. If it confidently says mostly good things but also confidently states the incorrect things, we’ll be asleep at the wheel. That’s actually a really good metaphor because Tesla has the same problem: I would say 99 percent of the time it does really great autonomous driving. And 1 percent doesn’t sound bad, but 1 percent of a 2-hour drive is several minutes where it could get you killed. Tesla knows that’s a problem, so they’ve done things that I don’t see happening yet in medicine. They require that your hands are on the wheel. Tesla also has cameras that are looking at your eyes. And if you’re looking at your phone and not the road, it actually says, “I’m switching off the autopilot.”
When you’re driving, it’s obvious when you’re heading into a traffic accident. It might be harder to notice when an LLM recommends an inappropriate drug after a long stretch of good recommendations. So we’re going to have to figure out how to keep the alertness of doctors.
I guess the options are either to keep doctors alert or fix the problem. Do you think it’s possible to fix the hallucinations and mistakes problem?
Kohane: We’ve been able to fix the hallucinations around citations by [having GPT-4 do] a search and see if they’re there. And there’s also work on having another GPT look at the first GPT’s output and assess it. These are helping, but will they bring hallucinations down to zero? No, that’s impossible. And so in addition to making it better, we may have to inject fake crises or fake data and let the doctors know that they’re going to be tested to see if they’re awake. If it were the case that it can fully replace doctors, that would be one thing. But it cannot. Because at the very least, there are some commonsense things it doesn’t get and some particulars about individual patients that it might not get.
“I don’t think it’s the right time yet to trust that these things have the same sort of common sense as humans.”
—Isaac Kohane, Harvard Medical School
You say it cannot fully replace human doctors. What makes you think that? You said common sense, is it also bedside manner, that kind of thing?
Kohane: Ironically, bedside manner it does better than human doctors. Annoyingly from my perspective. So Peter Lee is very impressed with how thoughtful and humane it is. But for me, I read it a completely different way because I’ve known doctors who are the best, the sweetest; people love them. But they’re not necessarily the most acute, most insightful. And some of the most acute and insightful are actually terrible personalities. So the bedside manner is not what I worry about. Instead, let’s say, God forbid, I have this terrible lethal disease, and I really want to make it to my daughter’s wedding. Unless it’s aligned extensively, it may not know to ask me about, “Well, there’s this therapy which gives you better long-term outcome.” And for every such case, I could adjust the large language model accordingly, but there are thousands if not millions of such contingencies, which as human beings, we all reasonably understand.
It may be that in five years, we’ll say, “Wow, this thing has as much common sense as a human doctor, and it seems to understand all the questions about life experiences that warrant clinical decision-making.” But right now, that’s not the case. So it’s not so much the bedside manner; it’s the common sense insight about what informs our decisions. To give the folks at OpenAI credit, I did ask it: What if someone has an infection in their hands and they’re a pianist, how about amputating? And [GPT-4] understood well enough to know that, because it’s their whole livelihood, you should look harder at the alternatives. But in general, I don’t think it’s the right time yet to trust that these things have the same sort of common sense as humans.
One last question about a big topic: global health. In the book you say that this could be one of the places where there’s a huge benefit to be gained. But I can also imagine people worrying: “We’re rolling out this relatively untested technology on these vulnerable populations; is that morally right?” How do we thread that needle?
Kohane: Yeah. So I think we thread the needle by seeing the big picture. We don’t want to abuse these populations, but we also don’t want to do the other form of abuse, which is to say, “We’re only going to make this technology available to rich white people in the developed world, and not make it available to individuals in the developing world.” But in order to do that, everything, including in the developed world, has to be framed in the form of evaluations. And I put my money where my mouth is by starting this journal, NEJM AI. I think we have to evaluate these things. In the developing world, we can perhaps even leap over where we are in the developed world because there’s a lot of medical practice that’s not necessarily efficient, in the same way that the cellular phone has leapfrogged a lot of the technical infrastructure that’s present in the developed world and gone straight to a fully distributed wireless infrastructure.
I think we should not be afraid to deploy this in places where it could have a lot of impact because there’s just not that much human expertise. But at the same time, we have to understand that these are all fundamentally experiments, and they have to be evaluated.