
Research Shows Human-AI Shortcomings–and How to Fix Them

Q&A: Researcher Alex Moehring gave AI tools to human fact-checkers and radiologists, only to find disappointing results. Can we reduce barriers to delivering AI's expected gains?

November 02, 2025

Technology optimists say artificial intelligence won't replace human workers but will instead work by our side, helping and guiding us toward greater productivity, accuracy, and even creativity.

But what if experiments show that collaboration between humans and AI systems often produces no measurable gains?

That’s one of the big questions explored in two recent working papers coauthored by Alex Moehring, an Assistant Professor at Purdue University’s Daniels School of Business and a Digital Fellow at the MIT Initiative on the Digital Economy (IDE). Alex received a doctorate in management science from MIT Sloan in 2024.

To discuss the two working papers and their implications for implementing AI, Alex spoke recently with IDE contributing writer Peter Krass.

Q: Both of your working papers find the results of human-AI collaboration to be disappointing. What’s holding back progress?

We find a pretty consistent pattern: Humans use AI tools sub-optimally. To be clear, we're thinking about classification algorithms, a form of supervised machine learning, not Generative AI. Within the classification category of AI, we consistently find that humans deviate from Bayesian updating, which is how someone optimally using the algorithms would behave. [Bayesian updating combines a person's prior beliefs with new evidence, weighting each by how informative it is, to arrive at a revised probability.] This leads to humans failing to capitalize on the potential of human-AI collaboration.
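For intuition only, here is a minimal sketch (in Python, with made-up numbers, not the papers' model) of what Bayesian updating looks like in a binary classification setting: the human's own read and the AI's reported probability are combined in log-odds space, under the assumption that both are calibrated and conditionally independent given the truth.

```python
import math

def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1 - p))

def bayesian_update(own_p, ai_p, base_rate=0.5):
    """
    Posterior probability that a claim is true, for a Bayesian who treats the
    human's own read (own_p) and the AI's score (ai_p) as calibrated,
    conditionally independent signals. Illustrative sketch only.
    """
    posterior_log_odds = logit(own_p) + logit(ai_p) - logit(base_rate)
    return 1 / (1 + math.exp(-posterior_log_odds))

# A fact-checker who is only 60% sure a statement is true, shown an AI score
# of 95%, should end up around 97% -- more confident than either signal alone.
print(round(bayesian_update(own_p=0.60, ai_p=0.95), 3))
```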

Q: In one of your experiments, you and your colleagues gave AI tools to human fact-checkers. How did that work out?

The AI reported a continuous probability assessment: its estimate of the likelihood that a statement is true. That one probability score captures both whether the AI thinks a statement is true or false and its confidence in that assessment.

We found that when the AI was very confident, the fact-checkers actually underperformed what the AI would have done on its own. The fact-checkers had access to the AI score, so if they were using it optimally, they should have performed at least as well as full automation. That they performed worse than full automation suggests they were under-responding to the AI in one way or another.

Q: Even when the AI confidence rating was high?

Yes, we found that even when the AI said there’s a 95% chance a statement is true, humans sometimes overrode the AI. We saw this as a puzzle. If the AI is super-confident and is so good on these cases, then why aren’t the humans just following the AI and exerting very little effort?

To solve the puzzle, we characterized the differences between how humans combine their own information with that of the AI and how a Bayesian would do so. Then we decomposed those differences into two biases.

One bias is that you could be under-responding to the AI because you’re acting as if your own private information is much more accurate or informative than it actually is. Another bias is that you’re ignoring the AI. This is what we call AI neglect. You’re acting as if the AI is less accurate than it actually is.

We found that in this setting, overconfidence — humans acting as if their own private information is much more accurate than it is — explains most of the deviations from Bayesian updating. So in the fact-checker paper, overconfidence seems to be the main driver of sub-optimal usage of the AI tool.
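The decomposition can be pictured as weights on each signal in log-odds space. The papers estimate the two biases with their own empirical strategy; the stylized weights below (hypothetical w_own and w_ai parameters, illustrative numbers) are just one way to see what "overconfidence" and "AI neglect" mean relative to the Bayesian benchmark.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def combine(own_p, ai_p, w_own=1.0, w_ai=1.0):
    """
    Stylized weighting of two probability signals about a binary claim
    (uniform 50/50 prior). A Bayesian with calibrated, conditionally
    independent signals uses w_own = w_ai = 1. Loosely:
      overconfidence ~ behaving as if w_own > 1 (own read over-weighted)
      AI neglect     ~ behaving as if w_ai  < 1 (AI score under-weighted)
    Weights and numbers are illustrative, not estimates from the papers.
    """
    log_odds = w_own * logit(own_p) + w_ai * logit(ai_p)
    return 1 / (1 + math.exp(-log_odds))

# A fact-checker leaning false (40%) sees an AI score of 95% true:
print(round(combine(0.40, 0.95), 2))                # Bayesian benchmark ~0.93
print(round(combine(0.40, 0.95, w_own=3.0), 2))     # overconfidence     ~0.85
print(round(combine(0.40, 0.95, w_ai=0.2), 2))      # AI neglect         ~0.55
```

In this toy setup, both biases pull the final answer below the Bayesian benchmark, which is what "under-responding to the AI" means here.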

Q: In your other experiment, over 225 radiologists used an AI tool trained on roughly a quarter-million X-rays. What were the results?

We found that here, too, humans under-responded to the AI. But in this setting, it was almost entirely due to AI neglect. The radiologists were not appropriately sensitive to the AI signal. Radiologists spend a lot of time looking at X-rays and making these sorts of judgments. Over time, they become appropriately sensitive to their own signals. But they’re neglecting the AI.

Q: Is that a big issue?

Yes. We find the AI alone outperforms about three-quarters of all radiologists. You'd think that providing AI would make at least three-quarters of the radiologists more accurate, if they just followed it. But we found that on average, there was no impact on the accuracy of radiologists' assessments.

Q: Were the radiologists simply ignoring the AI’s assessments?

No. Because when we provided the AI assessment, the radiologists’ assessments moved closer to the AI signal. This was another puzzle. The AI is more accurate than most radiologists, and the radiologists follow the AI in some cases. Yet on average, the AI does not improve the accuracy of human assessments.

We found that when the AI is confident — particularly, when it was confident that a case was negative — having access to the AI made the human radiologist more accurate. But when the AI was uncertain, having access to the AI actually made the human less accurate. In fact, it made them perform worse than if you gave them no information at all.

Q: What was the problem?

We found evidence of two biases. The first, which I already spoke about, is AI neglect: humans under-respond to the AI. They're responding, but not the same way a Bayesian would.

The second, and perhaps more subtle, bias is that humans fail to account for how much of the AI's information duplicates what they themselves observe. Both the human and the AI are looking at the same chest X-ray, so there's a lot of correlated information, even conditional on whether, say, pneumonia is present. But humans acted as if the two signals were conditionally independent, as if there were no duplicate information. Theoretically, this can lead to humans becoming less accurate, particularly when the AI assessment they see is uncertain.

That’s exactly what we observed.

Overall, when the AI was confident, humans became more accurate. But when the AI was uncertain, humans became less accurate.
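To see why ignoring the correlation can hurt precisely when the AI is uncertain, here is a small toy model (all distributions and numbers are hypothetical, not the paper's): the radiologist and the AI both read a noisy version of the same shared image evidence, so their signals are correlated even conditional on the true diagnosis.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Toy signal model (all numbers hypothetical). y = 1 means the finding is
# present, prior 50/50. Radiologist and AI both read the same underlying
# image evidence s, plus their own private noise:
#   s | y ~ Normal(+1 if y == 1 else -1, sig_s)
#   human read:  h = s + noise_h      AI read:  a = s + noise_a
MU = {1: 1.0, 0: -1.0}
sig_s, sig_h, sig_a = 1.0, 0.6, 0.6

def one_signal_prob(x, sig_own):
    """Calibrated P(y=1 | one read), using that read on its own."""
    sd = np.sqrt(sig_s**2 + sig_own**2)
    l1, l0 = norm.logpdf(x, MU[1], sd), norm.logpdf(x, MU[0], sd)
    return 1 / (1 + np.exp(l0 - l1))

def correct_prob(h, a):
    """P(y=1 | h, a), accounting for the shared component s (correlated reads)."""
    cov = np.array([[sig_s**2 + sig_h**2, sig_s**2],
                    [sig_s**2,            sig_s**2 + sig_a**2]])
    l1 = multivariate_normal.logpdf([h, a], mean=[MU[1], MU[1]], cov=cov)
    l0 = multivariate_normal.logpdf([h, a], mean=[MU[0], MU[0]], cov=cov)
    return 1 / (1 + np.exp(l0 - l1))

def naive_prob(h, a):
    """Combination that wrongly treats the two reads as conditionally independent."""
    p_h, p_a = one_signal_prob(h, sig_h), one_signal_prob(a, sig_a)
    log_odds = np.log(p_h / (1 - p_h)) + np.log(p_a / (1 - p_a))
    return 1 / (1 + np.exp(-log_odds))

# Both reads lean positive: the naive rule double-counts the shared evidence.
# human alone ~0.81, correct combination ~0.84, naive combination ~0.95
print(one_signal_prob(1.0, sig_h), correct_prob(1.0, 1.0), naive_prob(1.0, 1.0))

# Human leans positive but the AI is uncertain (a = 0, i.e. a 50/50 score).
# human alone ~0.81, correct combination ~0.70, naive combination ~0.81
print(one_signal_prob(1.0, sig_h), correct_prob(1.0, 0.0), naive_prob(1.0, 0.0))
```

In these toy numbers, when both reads agree, the correct posterior moves only a little while the naive independent combination becomes far too confident; when the AI is uncertain, the correct posterior pulls the human back toward 50/50, because an uncertain AI hints that the shared image evidence is ambiguous, while the naive rule barely moves.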

Q: In the time since you conducted this research, what’s changed?

Generative AI is probably the biggest thing. ChatGPT has obviously impacted how a lot of people do their jobs. One question I’m actively interested in is: How do humans collaborate with GenAI tools?

GenAI introduces a lot of opportunities for research, but it also introduces some challenges. For one, we don’t have a natural benchmark of Bayesian updating we can use to compare with human decisions. So it’s harder to think about whether humans are using GenAI tools optimally.

Q: Based on your research, what can we do to improve human-AI collaboration?

It's an important question, and I see two paths forward. One is training humans to reason better about these probabilistic assessments and to incorporate multiple pieces of information. It's a hard task. To do this, you might build new tools to help humans focus on the right pieces of information.

The second thing is building AI tools with human use in mind. One bias we find is that humans fail to account for the correlation between their own information and that of the AI. So you might train AI systems to focus on information the human is not already focusing on. Then it would be conditionally independent by design. To do this, we could rethink how we train AI tools. Not necessarily to train the best possible AI tool, but instead to train the best possible AI tool for humans.

· Meet Alex Moehring.

· Read the first working paper: Designing human-AI collaboration: A sufficient-statistic approach.

· Read the second working paper: Combining human expertise with artificial intelligence: Experimental evidence from radiology.