5 ways to tell if your AI writing tool is actually working

A product manager on a twelve-person team told me she uses Notion AI every day. She asked it to draft a product spec, cleaned it up, and shipped it. Then she ran a retrospective and realized she’d spent more time editing the draft than she would have spent writing from scratch. She keeps using it anyway because it “feels faster.”

That gap between feeling faster and being faster shows up everywhere small companies use AI writing tools. A 2025 randomized controlled trial from METR put real numbers on it for software developers: participants believed they were 20% faster with AI coding assistance and were actually 19% slower. The tools weren’t bad. The developers just had no way to measure what was actually happening to their throughput.

Most blog posts about AI writing tools give you a list of tool names. This post gives you five signals to track so you can tell, without guessing, whether the tool you’re already using is worth keeping.

1. What percentage of suggestions do you actually accept

The single most useful number you can track is your acceptance rate. How often does the AI suggest something and you keep it, more or less as-is?

Research on GitHub Copilot puts the average acceptance rate for code suggestions at around 30%. Roughly one in three suggestions gets used; the other two are discarded. That benchmark comes from code rather than prose, but it is worth knowing because it sets a realistic expectation. If your writing-tool acceptance rate sits near 30%, you’re in normal territory. If you’re at 5%, something is off, and it’s usually one of two things: the tool’s suggestions are consistently mismatched to your voice, or your prompts are too vague to produce anything you’d keep.

A 2026 arXiv paper on “MindCopilot” makes this measurement problem explicit. The authors argue that existing AI writing evaluations only measure output quality, which misses what actually matters when you’re using the tool every day. They propose two metrics: hierarchical acceptance rate and knowledge-aware editing distance. That’s academic language for two questions you can ask yourself this week: how often do you use what the tool suggests, and how much do you rewrite what you kept? If you’re not tracking either, you’re flying blind.

You don’t need a dashboard for this. Spend one week keeping a rough tally. If the number is low, your prompting is usually the lever, and you can fix it. If you stay below 10% for two weeks of honest effort, the tool doesn’t fit your workflow.
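If a paper tally feels too loose, the same count fits in a few lines of Python. This is a minimal sketch, not a prescribed method: `Tally` and `verdict` are hypothetical names, the 10% and 30% cutoffs are borrowed from the figures above, and the wording of each verdict is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Running count of AI suggestions kept versus discarded."""
    accepted: int = 0
    rejected: int = 0

    def record(self, kept: bool) -> None:
        # Count a suggestion as accepted only if it was kept more or less as-is.
        if kept:
            self.accepted += 1
        else:
            self.rejected += 1

    @property
    def rate(self) -> float:
        total = self.accepted + self.rejected
        return self.accepted / total if total else 0.0


def verdict(rate: float, floor: float = 0.10, benchmark: float = 0.30) -> str:
    # Cutoffs are assumptions taken from the figures in this post, not a standard.
    if rate < floor:
        return "below 10%: fix prompts first, reconsider the tool after two weeks"
    if rate < benchmark:
        return "below the ~30% Copilot benchmark: room to improve prompts"
    return "in normal territory"


# One hypothetical week: True means the suggestion was kept.
week = Tally()
for kept in [True, False, False, True, False, False, False, True, False, False]:
    week.record(kept)

print(f"{week.rate:.0%} accepted, {verdict(week.rate)}")
```

Logging a single True or False per suggestion keeps the habit cheap enough to sustain for the full week.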

2. Where are the time savings actually showing up

People assume AI writing tools save time on writing. The research splits that claim apart in ways that change how you should use the tool.

A 2025 study in IJAIED, a Springer journal, looked at graduate students writing professional memos with generative AI assistance and found that total writing time dropped 56.7%, while the average paper grade rose from an A- to an A. A different study, by Harvard Business School researchers working with the trading firm IG Group, broke the savings into two phases: the conceptualization phase (outlining, organizing material) took 63% less time with AI, and the actual writing phase took 75% less time. Both phases sped up. The 75% writing-phase number depends heavily on how close the writer’s existing knowledge is to the topic, which is the subject of item #4.

Translated to a small-team PM or designer: if you’re using AI to think about structure (outlines, brainstorming angles, figuring out what sections a PRD needs), the savings show up fast. If you’re expecting it to write production-ready copy from scratch on a topic you don’t know well, the savings shrink and the editing burden grows. The HBS team called the second case “the GenAI wall effect.”

To check whether your own work follows the same pattern, time two things separately the next time you write something substantial. First, how long it takes to go from a blank page to a structure: an outline, a set of bullet points, a rough order of sections. Then how long it takes to turn that structure into final prose. If the AI is helping the first phase and hurting the second, use it for the first phase and stop expecting it to do the second.
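The two-phase comparison above comes down to one subtraction per phase. Here is a rough sketch, assuming you have stopwatch readings for the same kind of document written with and without the tool; `share_saved` is a hypothetical helper and every number in `timings` is made up for illustration.

```python
def share_saved(minutes_without_ai: float, minutes_with_ai: float) -> float:
    """Fraction of time saved; a negative value means the AI run was slower."""
    if minutes_without_ai <= 0:
        raise ValueError("baseline must be positive")
    return (minutes_without_ai - minutes_with_ai) / minutes_without_ai


# Hypothetical stopwatch readings, in minutes, for one PRD written both ways.
timings = {
    "structure": {"without_ai": 30, "with_ai": 12},
    "prose": {"without_ai": 60, "with_ai": 75},
}

savings = {
    phase: share_saved(t["without_ai"], t["with_ai"])
    for phase, t in timings.items()
}

for phase, saved in savings.items():
    # A leading minus sign marks a phase where the tool cost time.
    print(f"{phase}: {saved:+.0%} time saved with AI")
```

In this made-up case the tool helps the structure phase and hurts the prose phase, which is the pattern that should push you to use it for outlining only.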

3. How much do you rewrite the things you keep

Acceptance rate tells you how often you use suggestions. Editing cost tells you how much of what you kept you still had to fix.

A suggestion that you accept but then spend three minutes rewriting is not the same as a suggestion you accept and leave untouched. The MindCopilot research team called this “knowledge-aware editing distance,” and it’s the metric that predicts whether the human-AI co-writing interaction is genuinely collaborative or just creates a worse first draft you have to rehabilitate.

A simple test: after you accept an AI suggestion, count the words you change before publishing. If you’re changing most of them, the suggestion wasn’t saving you work. It was giving you something to react to, which is a different thing with a different value.
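Counting changed words by hand gets tedious, so a word-level diff can stand in for it. The sketch below uses Python’s standard `difflib`; `share_rewritten` is a hypothetical name, and this is a crude approximation of editing cost, not the “knowledge-aware editing distance” the MindCopilot authors define.

```python
from difflib import SequenceMatcher


def share_rewritten(suggestion: str, published: str) -> float:
    """Fraction of the suggestion's words that did not survive to the published text."""
    before = suggestion.split()
    after = published.split()
    if not before:
        return 0.0
    # Compare word lists, not characters, so a one-letter typo fix counts as one word.
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return (len(before) - kept) / len(before)


# Hypothetical before-and-after pair.
suggestion = "Our new onboarding flow reduces drop-off for first-time users"
published = "Our redesigned onboarding flow cuts drop-off for new users"

print(f"{share_rewritten(suggestion, published):.0%} of the suggestion was rewritten")
```

Words that were moved rather than replaced also count as rewritten here, which is a deliberate simplification: reordering costs editing time too.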

Reacting to a draft can be genuinely faster than writing from a blank page. The catch is that the editing cost compounds across hundreds of writing tasks, and it isn’t always smaller than the cost of writing directly. Track your editing time for two weeks. If it’s trending down as you get better at prompting, the tool is doing what it should. If it’s flat or rising, you’re paying for the suggestion habit without the speed return.
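“Trending down” is easier to judge with a number than by eye. One way to get that number, sketched here under the assumption that you log editing minutes once per writing session, is a least-squares slope. `trend_per_session` is a hypothetical helper and the log values are invented.

```python
def trend_per_session(minutes: list[float]) -> float:
    """Least-squares slope of editing minutes against session number."""
    n = len(minutes)
    if n < 2:
        raise ValueError("need at least two sessions to see a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(minutes) / n
    numerator = sum((i - mean_x) * (m - mean_y) for i, m in enumerate(minutes))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical editing minutes for ten sessions over two weeks.
editing_log = [22, 25, 19, 18, 20, 15, 16, 14, 15, 12]

slope = trend_per_session(editing_log)
direction = "falling" if slope < 0 else "flat or rising"
print(f"editing time is {direction} ({slope:+.1f} min per session)")
```

With only ten data points the slope is noisy, so treat it as a tiebreaker for your own impression, not as proof.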

4. Are you using the tool for writing that’s close to your domain

The same HBS study found something most teams haven’t internalized: AI closes the quality gap between domain experts and non-experts at the ideation stage, but not at the execution stage. Across web analysts, marketing specialists, and software developers, everyone scored 4.05 to 4.18 out of 5 when using AI to brainstorm and outline. On the actual writing, web analysts scored 3.96, marketing specialists scored 3.92, and software developers scored 3.42. That leaves a gap of roughly 13% between the top and bottom groups on the writing task, using the same AI tool, with the same instructions.

The researchers called this “knowledge distance.” The closer your existing knowledge is to what you’re writing about, the better the AI output. A product manager writing about onboarding flow will produce better copy about onboarding than a designer writing about pricing, even if both are using the same model with the same prompt. The model amplifies your existing knowledge. It doesn’t substitute for knowledge you don’t have.

When you’re deciding who should write a piece, route it to whoever is domain-closest, even if they’re less fluent with AI. A founder who has spent two years thinking about customer retention will produce better retention copy with AI than a freelance copywriter who picked up the topic last week. The tool is a multiplier. At the output stage, it doesn’t equalize.

The same logic argues for skepticism when your AI tool produces confident-sounding copy on a topic nobody at your company knows deeply. Fluency without expertise is exactly what the model is trained to produce, and it’s the hardest kind of output to catch errors in.

5. How much does your acceptance rate improve over time

Microsoft’s research on Copilot adoption found that developers take roughly 11 weeks to fully realize productivity gains from AI coding tools. Teams often see an initial dip during that ramp-up while people learn to prompt and validate AI-generated suggestions. The tool didn’t change in those 11 weeks. The users got better at using it. It is reasonable to expect a similar ramp-up with writing tools.

If your acceptance rate isn’t growing past month three, the usual cause is that you’re spreading the tool across too many different writing types before getting good at one. Pick a single high-volume task (weekly product updates, customer support replies, internal memos), get your acceptance rate up for that task, then expand. Generalists who use AI for everything plateau lower than specialists who use it for one thing well.

A useful framing from the Copilot data: developers keep 88% of suggestions they initially accept. The suggestions that survive to production are the ones the developer chose to keep on first look. That number is high because acceptance itself is selective. Junior users tend to accept more aggressively, while seniors tend to accept less but more deliberately. The interesting metric is what survives to ship, and there the gap closes.

For writing, the same logic applies. Accepting fewer suggestions and keeping them clean is more productive than accepting more suggestions and rewriting them. As you get better at the tool, you want your acceptance rate to rise and your editing cost to fall at the same time. If only one of those is moving, figure out which one and why before you trust the productivity story you’re telling yourself.
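To make “figure out which one” concrete, here is a small sketch that names the stuck signal. It assumes you have an acceptance rate and an average editing time for two periods, such as month one and month three. `diagnose` is a hypothetical function and the advice strings are illustrative, not findings from any of the studies cited.

```python
def diagnose(acceptance_change: float, editing_change: float) -> str:
    """Classify a period by the direction of both signals.

    acceptance_change: later acceptance rate minus earlier (positive is good).
    editing_change: later editing minutes minus earlier (negative is good).
    """
    accepting_more = acceptance_change > 0
    editing_less = editing_change < 0
    if accepting_more and editing_less:
        return "both improving"
    if accepting_more:
        return "accepting more without editing less: check what you are keeping"
    if editing_less:
        return "editing less without accepting more: check your prompts"
    return "neither improving: the productivity story is unproven"


# Hypothetical change: acceptance went from 18% to 27%, editing from 14 to 15 minutes.
print(diagnose(0.27 - 0.18, 15 - 14))
```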

What this adds up to

You don’t need to track all five of these at once. Start with two: your acceptance rate, and where the time savings are actually showing up (ideation or execution). Those two will tell you most of what you need to know about the tool inside two weeks of honest measurement.

Most teams won’t instrument any of this and will keep using whatever tool they started with based on how it feels. That’s fine if the tool is genuinely cheap and the alternative is worse. It’s not fine if you’re paying $40 per seat per month and have no idea whether the tool is doing anything useful.

Pick one writing task you do every week. Time yourself doing it three times with the AI tool and three times without. The difference between those two averages is the only number that actually tells you whether the tool is earning its keep. Most teams have never run this comparison, and most teams who do are surprised by the result, in one direction or the other.
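The comparison itself is two averages and a subtraction. A minimal sketch, with invented timings and a hypothetical `minutes_saved` helper:

```python
from statistics import mean


def minutes_saved(with_tool: list[float], without_tool: list[float]) -> float:
    """Average minutes saved per task; negative means the tool cost time."""
    return mean(without_tool) - mean(with_tool)


# Hypothetical timings, in minutes, for the same weekly task done three times each way.
with_ai = [42, 38, 40]
without_ai = [50, 55, 45]

saved = minutes_saved(with_ai, without_ai)
share = saved / mean(without_ai)
print(f"{saved:+.0f} minutes per task ({share:+.0%})")
```

Three runs each way is a small sample, so a gap of a minute or two is probably noise. A gap that large relative to the task is worth acting on.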

References

Source	Author/Org	Year	What it supports
MindCopilot: Towards Formalizing and Evaluating Granular Human-LLM Co-Writing	arXiv preprint	2026	Hierarchical acceptance rate and editing distance as the right metrics for AI writing tool evaluation
Generative AI’s Impact on Graduate Student Professional Writing Productivity and Quality	Connell Pensky et al., IJAIED	2025	56.7% writing-time reduction, A- to A grade improvement with proper genAI instruction
Gen AI Boosts Productivity, But Can’t Turn Novices Into Experts	Bojinov et al., HBS Working Knowledge	2026	Knowledge distance concept, 13% expert-novice quality gap at execution, 63% conceptualization / 75% writing time savings
GitHub Copilot Statistics And User Trends 2026	SecondTalent / Copilot telemetry	2025	30% acceptance rate baseline, 88% retention rate, 11-week Microsoft ramp-up finding
The AI Productivity Paradox	METR	2025	Developers 19% slower despite perceiving 20% speed gain; subjective productivity perception is unreliable

