When AI should show its work: 5 findings from a 450-person user study

Most AI product features hand users a recommendation. The AI picks the most likely answer, adds a brief explanation, and the user accepts or ignores it. This pattern is fast to build and intuitive to explain to stakeholders. It is also, according to a 2026 controlled study, the wrong design for a specific and common class of situation.

The study tested AI assistance for content reporting under the EU’s Digital Services Act, but the finding plausibly extends to any flow where users make a classification or categorization decision and the AI can be wrong. That covers a lot of product surface: onboarding forms that ask users to self-categorize, support flows that route tickets, content moderation queues, and approval workflows.

Here are five findings from this body of research that product teams should understand before they ship the next AI-assisted decision flow.

1. When AI can be wrong, showing arguments beats giving a recommendation

The 2026 DSA study, which ran 450 participants through controlled AI assistance conditions, compared two designs: one that recommended a single legal category with an explanation (conventional XAI), and one that presented structured arguments for and against multiple candidate categories without preselecting any (evaluative AI).

When the AI’s recommendation was correct, the two conditions performed similarly. When the AI’s recommendation was wrong, the gap was stark. Evaluative AI achieved 74% accuracy on error trials. Conventional XAI achieved 46%. That is a 28 percentage point gap. The reported odds ratio was approximately 10.96, meaning the odds of landing on the correct category were roughly eleven times higher for users given the evaluative format, even when the AI was steering them toward the wrong one. An odds ratio compares odds, not probabilities, so this does not mean those users were eleven times more likely to be correct.
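The percentage-point gap, the relative rate, and the odds ratio are three different measures, and it helps to compute them side by side. The sketch below uses only the two headline accuracies quoted above; the function names are illustrative. The unadjusted odds ratio implied by 74% vs 46% is much smaller than the reported 10.96, so the reported figure presumably comes from a fitted statistical model rather than from these raw proportions (an assumption, not something the figures above confirm).

```python
def odds(p: float) -> float:
    """Convert a probability into odds (p against 1 - p)."""
    return p / (1.0 - p)


def odds_ratio(p_a: float, p_b: float) -> float:
    """Unadjusted odds ratio of condition A relative to condition B."""
    return odds(p_a) / odds(p_b)


# Headline accuracies on trials where the AI's suggestion was wrong.
evaluative_acc = 0.74
conventional_acc = 0.46

gap_pp = (evaluative_acc - conventional_acc) * 100      # percentage points
relative_rate = evaluative_acc / conventional_acc       # "times as likely"
raw_or = odds_ratio(evaluative_acc, conventional_acc)   # ratio of odds

print(f"gap: {gap_pp:.0f} pp")
print(f"relative rate: {relative_rate:.2f}x")
print(f"unadjusted odds ratio: {raw_or:.2f}")
```

On these inputs the evaluative condition is about 1.6 times as likely to be correct, and its unadjusted odds are about 3.3 times higher. When quoting an odds ratio to stakeholders, say which of these quantities you mean.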

A 2024 study by Le et al. confirmed a similar pattern in a different domain. Their hypothesis-driven approach, which showed users evidence for and against multiple options rather than recommending one, reduced over-reliance by 20.56 percentage points compared to a recommendation-only interface.

The design implication is specific: if your AI makes category or classification suggestions and you cannot guarantee it will be correct, presenting multiple options with supporting arguments protects you from the user blindly following a bad recommendation. The confidence-inspiring UI that says “We recommend X” is also the one most likely to get the user to pick X when X is wrong.
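One way to picture the difference between the two formats is as two payload shapes. This is a minimal sketch, not the study's implementation: the class names, the field names, and the choice to render candidates alphabetically (so that position does not hint at a preferred answer) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    """Conventional format: one preselected category plus an explanation."""
    category: str
    explanation: str


@dataclass
class Candidate:
    """One option in the evaluative format, with evidence on both sides."""
    category: str
    arguments_for: list[str] = field(default_factory=list)
    arguments_against: list[str] = field(default_factory=list)


def render_evaluative(candidates: list[Candidate]) -> str:
    """Render candidates alphabetically so ordering signals no preference."""
    lines = []
    for c in sorted(candidates, key=lambda c: c.category):
        lines.append(c.category)
        lines += [f"  + {a}" for a in c.arguments_for]
        lines += [f"  - {a}" for a in c.arguments_against]
    return "\n".join(lines)


# Hypothetical support-ticket example.
options = [
    Candidate("subscription management",
              ["mentions plan renewal"], ["no plan change requested"]),
    Candidate("billing",
              ["mentions a charge"], ["charge amount is not disputed"]),
]
print(render_evaluative(options))
```

The point of the second shape is what it leaves out: there is no field for a winner, so the interface has nothing to preselect.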

2. Speed and accuracy trade off, and the stakes determine which one to optimize

The evaluative AI format in the DSA study was slower. Users in the conventional XAI condition made their selections in around 204 seconds at the median. Users in the evaluative AI condition took around 236 seconds. That 32-second difference is not nothing when you multiply it across a high-volume flow.

This is not an unsolved engineering problem. It is a design decision. Speed and accuracy trade off in AI-assisted categorization, and the data supports that clearly. A 2024 study by Ren, Pardos, and Li examined AI assistance in an educational content-tagging task and found that AI recommendations cut completion time by about 50% but came with a 35% accuracy drop compared to the non-AI baseline.

The variable that should govern your choice is the cost of a wrong answer. If you are building an onboarding flow where an incorrect self-categorization delays the user by one extra step, optimize for speed. If you are building a compliance reporting workflow, a content classification system that feeds a moderation queue, or any flow where an incorrect answer has downstream consequences for the user or your platform, the 32-second cost of deliberation is cheap compared to the accuracy gain.
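That trade-off can be turned into a rough break-even calculation. The sketch below takes the median times and error-trial accuracies quoted from the DSA study and combines them with values you would have to supply yourself: how often your AI is wrong, what a second of user time costs, and what a wrong answer costs. Those three inputs, the 0.90 accuracy when the AI is right, and the use of median time as a stand-in for expected time are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class AssistFormat:
    name: str
    median_seconds: float       # median time to a selection
    acc_when_ai_wrong: float    # user accuracy on AI-error trials


# Figures quoted from the DSA study above.
CONVENTIONAL = AssistFormat("conventional XAI", 204, 0.46)
EVALUATIVE = AssistFormat("evaluative AI", 236, 0.74)


def expected_cost(fmt, ai_error_rate, acc_when_ai_right,
                  cost_per_second, cost_per_wrong):
    """Expected cost of one decision: time spent plus weighted error cost."""
    p_wrong = (ai_error_rate * (1 - fmt.acc_when_ai_wrong)
               + (1 - ai_error_rate) * (1 - acc_when_ai_right))
    return fmt.median_seconds * cost_per_second + p_wrong * cost_per_wrong


def break_even_error_cost(ai_error_rate, cost_per_second):
    """Cost of a wrong answer above which the slower format pays off.

    Assumes both formats are equally accurate when the AI is right.
    """
    extra_time = (EVALUATIVE.median_seconds
                  - CONVENTIONAL.median_seconds) * cost_per_second
    accuracy_gain = ai_error_rate * (EVALUATIVE.acc_when_ai_wrong
                                     - CONVENTIONAL.acc_when_ai_wrong)
    return extra_time / accuracy_gain


# Hypothetical inputs: AI wrong 20% of the time, one cent per second.
threshold = break_even_error_cost(ai_error_rate=0.2, cost_per_second=0.01)
print(f"evaluative format pays off above {threshold:.2f} per wrong answer")
```

The threshold rises as the AI's error rate falls: the rarer the AI's mistakes, the more a wrong answer has to cost before the extra 32 seconds are worth paying.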

The framing “we can always improve accuracy later” does not survive contact with this data. Speed and accuracy are in tension by design. Faster AI assistance tends to mean users follow the AI more readily, including when it is wrong.

3. AI improves selection quality but not explanation quality

If you have a “tell us why” step downstream of an AI recommendation, this finding should change how you scope it.

In the DSA study, neither AI condition improved the quality of users’ written explanations. Users selected the correct legal category more often when given AI assistance, but they could not articulate the reasoning any better than users who had no AI help at all. The AI improved the accuracy of the category chosen, but that capability did not transfer to the adjacent task of writing a coherent justification.

This matters in two scenarios. First, if your flow requires a free-text substantiation after the AI-assisted selection, you cannot assume the AI assistance will help with that part. Users are likely to copy the AI’s one-line explanation verbatim, or produce a thin justification that mirrors the recommendation language without capturing the actual reasoning. Second, if your platform has an appeals process, you should expect more appeals that say “I selected X because the AI told me to” and fewer that demonstrate the user understood the decision.

Design the substantiation step separately from the selection step. Don’t assume the same AI assistance covers both. A different prompt, a different interaction pattern, or a different information surface is needed for the explanation task.

4. Error type determines whether AI assistance can help at all

Not all wrong answers are the same, and the DSA study tested this deliberately. The researchers distinguished between near-miss errors (picking a category that is semantically close to the right one), overbreadth errors (picking a category too broad for the specific content), and out-of-scope errors (picking a category that belongs to a completely different domain).

Evaluative AI worked well for near-miss errors. It worked much less well for out-of-scope errors, where misrouting was near-universal in both the evaluative AI and conventional XAI conditions. Participants who received arguments for and against multiple categories still almost all picked the wrong category when the correct answer required them to recognize they were in the wrong domain entirely.

This is the kind of finding that changes how you think about where to deploy AI assistance. If you profile your actual error distribution and find that most of your errors are semantic close-calls (the user picked “billing” when they meant “subscription management”), AI-assisted disambiguation can help significantly. If your distribution is dominated by out-of-scope errors (the user picked “billing” when they should have filed a separate account access ticket), no current AI assistance format tested in this research will reliably fix that. The root cause is probably information architecture or workflow design, not the AI interaction model.
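Profiling an error distribution along these lines can be sketched in a few lines, given a log of what users chose and what the correct category turned out to be. The taxonomy below is hypothetical, as are the function names; the rules (different domain means out-of-scope, choosing the direct parent means overbreadth, anything else in the same domain counts as a near miss) are a simplification of the three error types described above, not the study's coding scheme.

```python
from collections import Counter

# Hypothetical taxonomy: category -> (domain, parent category or None).
TAXONOMY = {
    "payments (general)": ("payments", None),
    "billing": ("payments", "payments (general)"),
    "subscription management": ("payments", "payments (general)"),
    "account access": ("identity", None),
}


def classify_error(chosen: str, correct: str) -> str:
    """Label one selection as correct or as one of three error types."""
    if chosen == correct:
        return "correct"
    chosen_domain, _ = TAXONOMY[chosen]
    correct_domain, correct_parent = TAXONOMY[correct]
    if chosen_domain != correct_domain:
        return "out_of_scope"
    if chosen == correct_parent:      # only the direct parent is checked
        return "overbreadth"
    return "near_miss"


def error_profile(log):
    """Share of each error type among the wrong selections in the log."""
    counts = Counter(classify_error(chosen, correct) for chosen, correct in log)
    counts.pop("correct", None)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


# Hypothetical log of (chosen, correct) pairs.
log = [
    ("billing", "subscription management"),
    ("subscription management", "billing"),
    ("payments (general)", "billing"),
    ("billing", "account access"),
    ("billing", "billing"),
]
print(error_profile(log))
```

If the out-of-scope share dominates a profile like this, the research above suggests looking at information architecture before investing in a different assistance format.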

5. Human-AI teaming is not automatically better, and the format is the lever

A 2024 meta-analysis by Vaccaro, Almaatouq, and Malone, published in Nature Human Behaviour, synthesized 106 experimental studies on human-AI collaboration across 370 measured effects. The aggregate result is uncomfortable reading for anyone building AI-assisted features: human-AI combinations performed worse than the best of human-only or AI-only approaches, with an average effect size of Hedges’ g = -0.23. This was especially pronounced for decision-making tasks.

The meta-analysis also found that the direction of the gap mattered. When humans outperformed AI, combining the two produced gains. When AI outperformed humans, combining the two produced losses. The common pattern of deploying AI recommendations to “assist” humans in domains where the AI is actually more accurate may be actively degrading outcomes.

The DSA study, read against this backdrop, suggests the evaluative AI format is one of the levers that changes the sign of that effect. By presenting arguments rather than conclusions, it preserves the human’s ability to apply contextual judgment rather than anchoring to the AI’s confident recommendation. The format is the variable.

Before shipping an AI-assisted decision flow, the honest product question is not “does this look useful?” It is “under what error conditions does this make outcomes worse?” The Vaccaro meta-analysis suggests those conditions are common enough that the null hypothesis should be “human-AI combination degrades performance, prove otherwise.”

The common thread

These five findings point toward the same design principle: the benefit of AI assistance in decision flows is not automatic, and it depends heavily on how the assistance is presented. Confident recommendations help when the AI is right and the decision is low-stakes. They hurt when the AI is wrong and the stakes are higher. Deliberation scaffolds, showing users competing arguments without preselecting a winner, are slower but far more robust to AI error.

Run a small study on your own flow with deliberately introduced errors. If users follow the wrong recommendation even when you give them enough information to catch it, you have a confident-recommendation problem. The fix is usually a design change, not a model change.
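A minimal sketch of how the results of such a study might be scored, assuming each trial records the AI's suggestion, the correct answer, and what the user picked. The `Trial` fields and the metric names are hypothetical. For an evaluative format with no single suggestion, `ai_suggestion` would need to stand for the option the underlying model ranked highest, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    ai_suggestion: str
    correct: str
    user_choice: str


def seeded_error_metrics(trials):
    """Score only the trials where the AI's suggestion was deliberately wrong."""
    error_trials = [t for t in trials if t.ai_suggestion != t.correct]
    if not error_trials:
        raise ValueError("no trials with a seeded AI error")
    n = len(error_trials)
    followed = sum(t.user_choice == t.ai_suggestion for t in error_trials)
    recovered = sum(t.user_choice == t.correct for t in error_trials)
    return {
        "n_error_trials": n,
        "followed_wrong_ai": followed / n,        # over-reliance rate
        "accuracy_on_error_trials": recovered / n,
    }


# Hypothetical results: four seeded errors and one correct suggestion.
trials = [
    Trial("billing", "account access", "billing"),
    Trial("billing", "account access", "billing"),
    Trial("billing", "account access", "account access"),
    Trial("billing", "account access", "subscription management"),
    Trial("billing", "billing", "billing"),
]
print(seeded_error_metrics(trials))
```

Comparing `followed_wrong_ai` between a recommendation-style variant and an arguments-style variant of the same flow gives a direct read on whether the format, rather than the model, is driving the errors.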

References

Source	Author / Org	Year	Supports
AI-Assisted Illegal Content Reporting under the Digital Services Act	Research team (DSA study)	2026	74% vs 46% accuracy gap; speed tradeoff; explanation quality finding
From Evidence to Decision: Exploring Evaluative AI	Le et al.	2024	Evaluative AI reduces over-reliance by 20.56pp; Brier score improvement
When combinations of humans and AI are useful: A systematic review and meta-analysis	Vaccaro, Almaatouq, Malone	2024	Human-AI teams perform worse on average (g=-0.23); 106 studies, 370 effects
Human-AI Collaboration Increases Skill Tagging Speed but Degrades Accuracy	Ren, Pardos, Li	2024	50% speed gain with 35% accuracy loss in AI-assisted categorization