How accurate is AI action-item extraction in 2026?

On a typical 30-minute business meeting, qwen2.5:7b via Ollama hits about 94% on owner attribution and ~85% on task phrasing quality. gpt-4o-mini is marginally better. The dominant remaining failure mode is implicit commitments where the antecedent has to be inferred.

Why do action items need timestamps?

Two reasons. First, audit: when the LLM is wrong, you click the timestamp and hear the original sentence. Second, playback: clicking the timestamp seeks the audio so you can verify or share the source.

Can I trust the LLM to populate Linear tickets directly?

We don't recommend auto-export. The 95% accuracy turns 1-in-20 wrong items into Linear noise. A 30-second human review before export is the right tradeoff.

Does extraction work for non-English meetings?

Yes. Both Ollama (Qwen 2.5 is multilingual) and OpenAI handle 20+ languages cleanly. Action-item phrasing in the output matches the input language by default.

What if the action item is wrong?

Click the timestamp, hear the source quote, edit or delete the item in the panel. Manual edits persist; the LLM does not re-extract on save.

Field notes·2026-05-29·9 min read

How AI action-item extraction from meeting transcripts actually works

Every meeting tool ships with 'AI action items' on the marketing page. Most of them produce a bullet list that looks plausible and bears partial relationship to the actual meeting. Here is what is happening under the hood, what we learned shipping it well, and the failure modes that no LLM has solved yet.

action itemsllmtranscriptextractiontechnical

Action-item extraction sounds like a small feature next to recording, transcription, and diarization. In practice, it is the feature people use most after a meeting ends. A 60-minute call produces 8,000 words of transcript; nobody re-reads that. They open the action-items panel, scan five lines, and move on.

Getting those five lines right is harder than it looks. We have iterated on the extraction prompt roughly 40 times since launch. This post walks through what we settled on, why timestamps matter more than wording quality, and the cases where the LLM is going to be wrong no matter how you prompt it.

What an action item actually is

Operating definition we use: an action item is a sentence where someone in the meeting agreed to do a specific thing. Three properties:

Owner: a named person (not 'the team', not 'we'). The owner has to be on the meeting.
Task: a concrete verb + object. 'Redo the OG image' is a task. 'Think about pricing' is not.
Timestamp: when in the meeting the commitment happened. Critical for audit.

If any of the three is missing, we leave it out. The bar is 'would this make sense in a Linear ticket' - if no, it's noise.

Input shape

The LLM doesn't see raw audio or even raw transcript. It sees a diarized, timestamp-annotated turn list. Each turn looks like:

{
  "speaker": "Lina",
  "start": "00:29:33",
  "end": "00:29:41",
  "text": "I'll redo the OG image - new palette. Done by Monday."
}

Three properties make this input shape work where raw text does not:

Speaker names are resolved before the LLM sees the turn. The model never has to guess who is talking - that has been settled by diarization + voice fingerprint matching. Our diarization guide goes deep on this layer.
Timestamps are explicit. The LLM is asked to return the start timestamp of the turn that produced each action item. We snap to that turn boundary deterministically; no LLM math is involved in playback seeking.
Turn boundaries match speech segments. No turn is half a sentence; the LLM gets coherent units.

Output schema

The extraction prompt asks the LLM for a strict JSON schema:

{
  "actionItems": [
    {
      "owner": "Lina",
      "task": "Redo OG image with new palette",
      "due": "Mon May 30",
      "sourceTimestamp": "00:29:33",
      "sourceQuote": "I'll redo the OG image - new palette. Done by Monday."
    }
  ]
}

Two design choices that took iteration:

We require the sourceQuote field even though the timestamp already lets us recover it. The quote is the cheap human-readable check: if it doesn't match the task, the extraction was wrong and we don't trust the timestamp either.
Due dates are inferred best-effort or left empty. We do not let the LLM guess - 'by Monday' yields a date, 'soon' yields nothing. Filled-in noise is worse than honest blanks.

The prompt

The system prompt (abridged) gives the model three rules:

01Only extract items where the speaker is committing to do something themselves. 'You should do X' is not an action item; 'I'll do X' is.
02Owner must be a named participant. If the transcript doesn't make the owner clear, skip the item.
03If a task is repeated multiple times in the meeting, return only the last occurrence (that is the binding commitment).

Three more rules we added after early failures:

Don't extract conditional commitments. 'If we ship Tuesday, I'll handle the social posts' is conditional; skip it.
Don't infer tasks from passive descriptions. 'The OG image will be redone' has no owner; skip it.
Don't include the action items in the meeting summary as separate bullets. They live in the action-items panel, not the summary, to avoid duplication.

Provider-agnostic by design

The same prompt runs against local Ollama (Qwen 2.5, Llama 3.2) and against OpenAI (gpt-4o-mini, gpt-4o) without modification. The output schema is identical. The user-facing toggle is one dropdown in Settings -> AI Assistant. We covered the comparison in the Ollama vs OpenAI guide.

Quality delta on action-item extraction specifically:

Provider	Avg items per 30-min meeting	Owner correct	Task phrasing
Ollama qwen2.5:7b	3.8	94%	Slightly verbose
Ollama llama3.2:3b	3.2	89%	Short, sometimes too short
OpenAI gpt-4o-mini	4.1	97%	Tightest
OpenAI gpt-4o	4.0	98%	Tightest, marginal gain

Numbers from a test corpus of 240 real meetings, scored by us. For most users on most meetings, qwen2.5:7b is indistinguishable from gpt-4o-mini at zero per-meeting cost.

Failure modes the LLM cannot fix

Honest list:

Implicit commitments. 'Yeah, I'll handle that' with no antecedent - the LLM cannot recover what 'that' refers to from text alone. We sometimes guess from the previous turn; we get it wrong ~15% of the time.
Distributed ownership. 'Marko and I will figure out the rollback plan' produces a joint task; we extract it under both names, which creates duplicate Linear tickets if exported. We are still tuning this.
Sarcasm and joking commitments. 'Sure, I'll redesign the whole site by tomorrow' gets extracted as a real task because the LLM cannot detect tone reliably.
Commitments later retracted. If someone agrees at 14:08 and then walks it back at 19:33, we sometimes extract both states. We are working on a 'final state' pass.

Why we don't auto-export to Linear

An obvious feature request: auto-create a Linear ticket for every extracted action item. We have specifically not built this. Two reasons:

Action-item extraction is 95% right, not 100% right. Auto-export converts 1-in-20 wrong extractions into Linear noise that needs cleanup. That's worse than the manual review.
The post-meeting review is itself useful. Two minutes of scanning the action items panel before clicking 'Export to Linear' is when most people remember the thing the LLM missed.

Our action items page describes the export targets - Linear, Notion, Markdown, JSON - all behind one click after review.

What we are working on next

Joint-ownership handling so 'Marko and I' produces one task with two assignees, not two tasks.
Retraction detection - skipping action items that the speaker walked back later in the call.
Calendar-aware due-date inference. 'By the sprint review' resolves to the next event on your calendar that matches.
Per-meeting tone tag - skip extraction on meetings flagged as informal / social, where commitments are rare and false positives are high.

Bottom line

Action-item extraction is the small feature that holds the whole post-meeting workflow together. Done well, it converts a 60-minute call into five lines you can act on. Done badly, it produces plausible-looking noise that has to be hand-checked anyway. The trick is not better models - it is structured input (named speakers, snapped timestamps), strict output schema, and the discipline to leave fields blank rather than guess.

Frequently asked

How accurate is AI action-item extraction in 2026?
On a typical 30-minute business meeting, qwen2.5:7b via Ollama hits about 94% on owner attribution and ~85% on task phrasing quality. gpt-4o-mini is marginally better. The dominant remaining failure mode is implicit commitments where the antecedent has to be inferred.
Why do action items need timestamps?
Two reasons. First, audit: when the LLM is wrong, you click the timestamp and hear the original sentence. Second, playback: clicking the timestamp seeks the audio so you can verify or share the source.
Can I trust the LLM to populate Linear tickets directly?
We don't recommend auto-export. The 95% accuracy turns 1-in-20 wrong items into Linear noise. A 30-second human review before export is the right tradeoff.
Does extraction work for non-English meetings?
Yes. Both Ollama (Qwen 2.5 is multilingual) and OpenAI handle 20+ languages cleanly. Action-item phrasing in the output matches the input language by default.
What if the action item is wrong?
Click the timestamp, hear the source quote, edit or delete the item in the panel. Manual edits persist; the LLM does not re-extract on save.

Try Mac Note Taker

Lifetime $149 - $79 for the first 100 with code FOUNDER.

See pricing