What is the correct role for AI? A human helper? A human replacement? Something in between?
These and other related questions are very much on the mind of Michael Schrage, a Research Fellow at the MIT Initiative on the Digital Economy (IDE).
Schrage’s research, writing and advisory work focus on the ‘behavioral economics’ of models, prototypes and metrics as strategic resources for managing ‘innovation risk’ and opportunity. He has run design workshops and executive education programs at MIT Sloan on innovation, experimentation and ‘strategic measurement’ for organizations all over the world.
Recently, Schrage sat down with Peter Krass, a contributing editor and writer for the IDE, to discuss two papers on this topic coming soon in the MIT Sloan Management Review. One deals with what he calls “reflective escrow.” The other, co-written with David Kiron, explores AI prompt evaluation. The following is an edited version of their conversation.
Measuring AI Output vs. Outcomes
Q: A new working paper from the National Bureau of Economic Research—based on its survey of nearly 6,000 CEOs, CFOs and other executives—finds that eight in 10 organizations have seen no AI impact at all on either employment or productivity. What’s going on?
There’s a genuine split on this, even among the MIT community. The most famous line came from MIT economist and Nobel Prize winner Bob Solow some 30 years ago: “You can see the computer age everywhere but in the productivity statistics.”
Quantitative humor aside, the more time I spend with AI, and the more I see organizations try to meaningfully incorporate AI into their workflows, the more I believe legacy definitions of productivity have become anachronisms. AI has got to be—and is already becoming—much, much more than greater efficiencies for less cost. I see smart leaders—the kind you find in MIT’s Executive Ed programs—shifting their AI emphasis from better outputs to better outcomes.
I’ll go further: AI is about both learning to get better and learning to get better at learning. It’s just as important to measure learning as it is to measure outputs. And as with human capital, we can no longer divorce machine learning from machine outputs. We need to measure machine-learning curves in the context of productivity.
Q: In one of your forthcoming papers, you say AI can actually make people worse at their jobs. How does that happen?
It’s not that people necessarily get worse at their jobs; it’s that AI’s “good enough” effectiveness can make people a little lazier, more complacent or less demanding than they should be. If you had a bright intern or grad student you trusted a bit too much, you might have similar concerns. So, are you becoming over-reliant on AI? How do you know? Are you accepting a good-enough, one-shot answer without iterating or reflecting on the response?
With my classes, clients and research, I see too many smart humans using AI to get a decent answer faster, rather than thinking twice and seeing if they can find or create new value. That bugs me.
The big split I see is: Where and when does AI make you more mindful? And where and when does AI make you more mindless? With AI, there’s a bit of a drift, and then your judgment begins to erode. You’re still hitting your output numbers, but your ability to judge things well has declined. That’s a trap.
Q: Is this related to your idea of “delegation drift”?
Yes. The AI is giving you good advice, it’s learning, it’s giving you answers…so you give it more to do. Now, where does your contribution end, and where does the AI’s begin? This is the nature of the drift. Are you letting the AI do too much of the work?
Perhaps if you had done a better job with the original prompt, you’d be evaluating a better outcome or output. So when does the AI’s good-enough answer become slop? When does a good-enough answer demand that you do another iteration? And when is a good-enough answer actually good enough?
These questions aren’t hypothetical anymore. Today, for the likes of OpenAI, Anthropic, Google and Meta—and their enterprise users—they’re multibillion-dollar questions.
Improving AI Prompting—and Outcomes
Q: So the quality of AI results comes down to the quality of our prompts?
That’s not quite the issue. The issue is how we meaningfully evaluate and assess the accuracy, relevance, quality and utility of the outputs. This is why the distinction between verification and evaluation is not a subtle one. You give the AI a prompt, and the AI gives you a response. You’re afraid it might be a hallucination, so you verify that it’s true. But what if the response is technically accurate but contextually not good enough? How do you improve the quality of your own evaluation?
This is a virtuous cycle that both organizations and individuals will need to cultivate: learning to prompt and prompting to learn. That is, I’m learning to prompt, but at the same time, the prompt itself should generate learning that gives me insights. We need both, but my research and classes suggest that’s not yet the norm.
Q: What’s the right mix for AI and people working together?
I’ll answer with a question that could launch a thousand doctoral dissertations: When is it more economical and efficient to have the human learn faster and better than the AI, and when is it more economical and efficient to have the AI learn faster and better than the human?
For example, if you’re doing fraud detection for Mastercard or Visa, there aren’t enough people in the world to monitor all those transactions in real time. So what do you want to learn? And what’s the confusion matrix—how often you get false positives and false negatives, and what each one costs? Full circle! Your question brings us back to rethinking and refining the nature and purpose of productivity. The whole notion of ‘What do we want to learn?’ has fundamentally changed.
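To make the confusion-matrix point concrete, here is a minimal, purely illustrative sketch in Python. The transaction counts and dollar costs are invented for the example, not actual card-network figures; the point is that attaching a cost to each kind of error turns accuracy into an economic question about what you want the human, and the AI, to learn.

```python
# Illustrative only: hypothetical counts and costs, not Mastercard/Visa figures.
# A confusion matrix tallies how a classifier's predictions line up with reality;
# weighting each cell by a cost makes the false-positive/false-negative trade-off explicit.

confusion = {
    "true_positive":  900,      # fraud correctly flagged
    "false_positive": 4_000,    # legitimate transactions wrongly flagged
    "false_negative": 100,      # fraud that slipped through
    "true_negative":  995_000,  # legitimate transactions correctly passed
}

# Assumed per-event costs: an annoyed customer vs. an actual fraud loss.
cost_per_false_positive = 5     # support call, declined purchase, lost goodwill
cost_per_false_negative = 500   # average fraud write-off

expected_cost = (
    confusion["false_positive"] * cost_per_false_positive
    + confusion["false_negative"] * cost_per_false_negative
)

print(f"Expected cost of errors: ${expected_cost:,}")  # $70,000 in this toy example
```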
Q: How difficult is it to improve AI prompts?
The most dangerous thing I see in both my clients and my classes is people prompting AI to get the right answer. I think answers are the wrong unit of analysis. Instead, I see the greatest value—economic, organizational and personal—from using AI to generate insights. That is, insights that are actionable.
Yet many organizations, because of budget, regulations and silos, can’t do anything with those insights. So the most useful thing to get from our promptathons is asking: What makes insights actionable in our organization? And how can AI improve our ability to take action?
Q: Any tips for getting it right?
The metric that I’m increasingly using, mostly informally, is ROIP: return on iterative prompting. AI is very good at affective analysis (sentiment analysis), not just effective analysis. So take your last 50 prompts, drop them into Claude, ChatGPT, etc., and say, “What do these prompts say about the person using them? What’s the persona? If you were revising these prompts to make them more efficient, how would you classify them?”
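One minimal way to run that kind of prompt audit programmatically is sketched below, using the OpenAI Python SDK. The model name, the file of saved prompts and the exact audit questions are placeholders for illustration, not a prescribed method; any chat-capable model would do.

```python
# Sketch of a "return on iterative prompting" audit: feed your own recent prompts
# back to a model and ask what they reveal about how you work.
# Assumes the openai package is installed and OPENAI_API_KEY is set;
# the model name and file path are placeholders.

from openai import OpenAI

client = OpenAI()

# Your last ~50 prompts, one per line, exported however your tools allow.
with open("my_recent_prompts.txt", encoding="utf-8") as f:
    prompts = f.read()

audit_request = (
    "Here are my recent AI prompts:\n\n"
    f"{prompts}\n\n"
    "What do these prompts say about the person using them? "
    "What persona do they suggest? "
    "If you were revising these prompts to make them more efficient, "
    "how would you classify and improve them?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whatever model you use
    messages=[{"role": "user", "content": audit_request}],
)

print(response.choices[0].message.content)
```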
Again, counterintuitively, we’re not looking to solve a problem. We’re looking for actionable insights. I joke that in the future, AI will also stand for Augmented Introspection, Augmented Insights and even Artificial Imagination.