The patient was a 39-year-old woman who came to the emergency department at Beth Israel Deaconess Medical Center in Boston. Her left knee had been hurting for a few days. The day before, she had a fever of 102 degrees. It was gone now, but she still had chills. And her knee was red and swollen.
What is the diagnosis?
On a recent steamy Friday, Dr. Megan Landon, a medical resident, presented this real case to a room full of medical students and residents. They had gathered to learn a skill that can be tricky to teach — how to think like a doctor.
“Doctors are terrible at teaching other doctors how we think,” said Dr. Adam Rodman, an internist, a medical historian and an organizer of the event at Beth Israel Deaconess.
But this time, they could call on an expert for help reaching a diagnosis — GPT-4, the latest version of a chatbot released by the company OpenAI.
Artificial intelligence is changing many aspects of the practice of medicine, and some medical professionals are using these tools to aid them in diagnosis. Doctors at Beth Israel Deaconess, a teaching hospital affiliated with Harvard Medical School, decided to explore how chatbots could be used – and misused – in training future doctors.
Instructors like Dr. Rodman hope that medical students can turn to GPT-4 and other chatbots for something similar to what doctors call a curbside consult — when they pull a colleague aside and ask for an opinion about a difficult case. The idea is to use a chatbot in the same way that doctors turn to each other for suggestions and insights.
For over a century, doctors have been portrayed as detectives who pick up clues and use them to find the culprit. But experienced doctors actually use a different method – pattern recognition – to figure out what’s wrong. In medicine, this is called an illness script: signs, symptoms, and test results that doctors put together to tell a coherent story based on similar cases they know or have seen themselves.
If the illness script doesn’t help, Dr. Rodman said, doctors turn to other strategies, such as assigning probabilities to different diagnoses that might fit.
Researchers have tried for more than half a century to design computer programs to make medical diagnoses, but nothing has really succeeded.
Doctors say GPT-4 is different. “This will create something remarkably similar to an illness script,” said Dr. Rodman. In that way, he added, “it’s fundamentally different from a search engine.”
Dr. Rodman and other doctors at Beth Israel Deaconess asked GPT-4 for possible diagnoses in difficult cases. In a study published last month in the medical journal JAMA, they found that it did better than most doctors on the weekly diagnostic challenges published in The New England Journal of Medicine.
But, they learned, there is an art to using the program, and there are pitfalls.
Dr. Christopher Smith, the director of the internal medicine residency program at the medical center, said medical students and residents “definitely use it.” But, he added, “whether they learn anything is an open question.”
The concern is that they may rely on AI to make diagnoses in the same way they rely on a calculator on their phones to do a math problem. That, said Dr. Smith, is dangerous.
Learning, he said, involves trying to figure things out: “That’s how we retain things. Part of learning is struggle. If you outsource learning to GPT, that struggle is gone.”
During the meeting, the students and residents broke into groups and tried to figure out what was wrong with the patient with the swollen knee. Then they turned to GPT-4.
The groups tried different methods.
One used GPT-4 to do an internet search, much as one would use Google. The chatbot produced a list of possible diagnoses, including trauma. But when group members asked it to explain its reasoning, its answer was disappointing: it justified its choice by saying only, “Trauma is a common cause of knee injury.”
Another group came up with possible hypotheses and asked GPT-4 to evaluate them. The chatbot’s list aligned with the group’s: infections, including Lyme disease; arthritis, including gout, a type of arthritis that involves crystals in the joints; and trauma.
GPT-4 added rheumatoid arthritis to the top possibilities, though it wasn’t high on the group’s list. Gout, the instructors later told the group, was unlikely for this patient because she was young and female. And rheumatoid arthritis could probably be ruled out because only one joint was inflamed, and for just a few days.
As a curbside consult, GPT-4 seemed to pass the test, or at least to agree with the students and residents. But in this exercise, it offered no insights and no illness script.
One reason may be that students and residents used the bot more like a search engine than a curbside consult.
To use the bot correctly, the instructors said, they would need to start by telling GPT-4 something like, “You’re a doctor seeing a 39-year-old woman with knee pain.” They would then need to list her symptoms before asking for a diagnosis and follow up with questions about the bot’s reasoning, just as they would with a medical colleague.
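In code, that pattern amounts to priming a chat model with a role, supplying the patient’s symptoms, and then questioning its reasoning in follow-up turns rather than accepting the first answer. A minimal sketch, assuming the OpenAI Python SDK (v1.x) and an API key in the environment; the model name and the wording of the prompts are illustrative, not the hospital’s actual protocol:

```python
# Sketch of the "curbside consult" prompting pattern described above:
# prime the model with a role, list the symptoms, ask for a differential,
# then follow up on the reasoning. Prompts and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The full conversation history is sent on each call, so follow-up
# questions can refer back to earlier answers, as with a colleague.
messages = [
    {"role": "system",
     "content": "You're a doctor seeing a 39-year-old woman with knee pain."},
    {"role": "user",
     "content": ("Her left knee has been red, swollen, and painful for a few days. "
                 "She had a fever of 102 the day before and still has chills. "
                 "What diagnoses would you consider, and why?")},
]

first = client.chat.completions.create(model="gpt-4", messages=messages)
print(first.choices[0].message.content)

# Press on the reasoning instead of taking the list at face value.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user",
                 "content": ("Walk me through your reasoning for the top diagnosis, "
                             "and what findings would argue against it.")})

second = client.chat.completions.create(model="gpt-4", messages=messages)
print(second.choices[0].message.content)
```

Because the history is resent with each request, the follow-up question about reasoning lands in context, which is what distinguishes this use from a one-shot search query.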
That, the instructors say, is one way to take advantage of the power of GPT-4. But it’s also important to recognize that chatbots can make mistakes and “hallucinate” – give answers that have no basis in fact. Using them requires knowing when they are wrong.
“It’s not bad to use these tools,” said Dr. Byron Crowe, an internal medicine physician at the hospital. “You just have to use them the right way.”
He offered the group an analogy.
“Pilots use GPS,” said Dr. Crowe. But, he added, airlines “have very high standards for reliability.” In medicine, he said, the use of chatbots is “very tempting,” but the same high standards should apply.
“It’s a great thought partner, but it doesn’t replace deep mental expertise,” he said.
When the session was over, the instructors revealed the true cause of the patient’s knee swelling.
It turned out to be a possibility that every group had considered, and that GPT-4 had proposed.
She had Lyme disease.
Olivia Allison contributed reporting.