Field notes·2026-05-29·12 min read

Speaker Diarization on Mac (2026): Free, Local, No Cloud

Speaker diarization - the 'who spoke when' problem - used to require a Python research stack and a willingness to ship audio to a Hugging Face inference endpoint. In 2026 it runs in real time on a MacBook Air. Here is what is actually happening, and why it matters for meeting transcripts.

diarization · pyannote · cam++ · voice fingerprint · technical

If you have used Mac Note Taker, you have seen the result: a transcript with named speakers and color bubbles, where the same person across multiple meetings keeps the same name without you doing anything. The architecture underneath is two models doing different jobs - a segmentation model that splits audio into per-speaker turns, and an embedding model that builds a fingerprint per voice so the same person gets recognized again next week.

This post explains how the pipeline works, why both models had to live on the Neural Engine for the workflow to be usable, and what the cosine similarity threshold actually controls.

Two problems, two models

People conflate diarization with speaker identification. They are separate:

Diarization (within one recording): 'who spoke when'. Output is a sequence of (speaker_id, start, end) tuples where speaker_id is a temporary label - Speaker A, Speaker B - valid only inside that recording.
Speaker identification (across recordings): 'this Speaker A is the same person as the Marko we labeled in Tuesday's recording'. Requires a voice fingerprint that survives audio differences.

Cloud notetakers usually do the first and skip the second. They re-introduce the same coworker as Speaker 1 every meeting because there is no persistent identity layer. The local-first stack on a Mac in 2026 does both, because both can run on the Neural Engine in real time.
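To make the two layers concrete, here is a minimal sketch of the data each one produces. The Turn class, the apply_identities helper, and the timings are hypothetical names and values for illustration, not Mac Note Taker's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One diarized turn: a label plus start/end times in seconds."""
    speaker_id: str  # e.g. "Speaker A", valid only inside one recording
    start: float
    end: float


def apply_identities(turns, identities):
    """Swap temporary labels for persistent names where one is known.

    `identities` maps a recording-local label to a profile name; labels
    with no entry are left unchanged.
    """
    return [
        Turn(identities.get(t.speaker_id, t.speaker_id), t.start, t.end)
        for t in turns
    ]


# Diarization output for one recording (made-up timings).
turns = [
    Turn("Speaker A", 0.0, 4.2),
    Turn("Speaker B", 4.2, 9.8),
    Turn("Speaker A", 9.8, 12.5),
]

# Identification output: only Speaker A matched a stored profile.
named = apply_identities(turns, {"Speaker A": "Marko"})
for t in named:
    print(f"{t.start:5.1f}-{t.end:5.1f}  {t.speaker_id}")
```

Without the identity map, the first function's input is all a diarization-only tool can show you. The map is the persistent layer.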

Segmentation: pyannote-segmentation-3.0

pyannote-segmentation-3.0 is the open-source segmentation model that has quietly become the standard for production diarization. It is a sliding-window classifier that, for each frame, outputs activity probabilities for up to three speakers within the window, overlapping speech included. The output is then post-processed into clean speaker turns with overlap handling.
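The post-processing idea can be sketched in a few lines: threshold one speaker's per-frame probabilities and bridge very short gaps. This is a simplified illustration, not pyannote's actual post-processing; the 20 ms frame hop, the 0.5 threshold, and the function name are assumptions.

```python
FRAME_SECONDS = 0.02  # assumed frame hop, for illustration only


def frames_to_turns(probs, threshold=0.5, min_gap_frames=2):
    """Convert per-frame activity probabilities for ONE speaker into turns.

    Returns a list of (start_seconds, end_seconds). Gaps shorter than
    `min_gap_frames` are bridged so a brief dip does not split a turn.
    """
    turns = []
    start = None  # index of the first frame of the open turn, if any
    gap = 0       # consecutive inactive frames seen inside the open turn
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:
                end = i - gap + 1  # one past the last active frame
                turns.append((start * FRAME_SECONDS, end * FRAME_SECONDS))
                start, gap = None, 0
    if start is not None:
        end = len(probs) - gap  # drop any trailing inactive frames
        turns.append((start * FRAME_SECONDS, end * FRAME_SECONDS))
    return turns


probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.7]
print(frames_to_turns(probs))
```

Running the same function once per speaker track gives overlap handling for free: two speakers' turns are allowed to cover the same time span.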

The model is ~6M parameters - small. It was originally designed for the pyannote-audio Python framework, which depends on PyTorch and runs on CPU at roughly 1-2x real-time. The breakthrough for Mac in 2025 was converting the same weights to CoreML. CoreML on the Apple Neural Engine gives roughly 8-15x real-time on M-series Macs, with negligible battery impact.
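For a sense of scale, here is what those real-time factors mean for a one-hour meeting. This is plain arithmetic on the figures quoted above; the function name is illustrative.

```python
def processing_seconds(audio_seconds, realtime_factor):
    """Wall-clock time to process audio at a given speed-up over real time."""
    return audio_seconds / realtime_factor


meeting = 60 * 60  # a one-hour meeting, in seconds

for label, factor in [
    ("PyTorch CPU, low", 1),
    ("PyTorch CPU, high", 2),
    ("CoreML ANE, low", 8),
    ("CoreML ANE, high", 15),
]:
    minutes = processing_seconds(meeting, factor) / 60
    print(f"{label:18s} {factor:>2}x  ->  {minutes:4.1f} min")
```

At 1-2x, segmentation of an hour of audio takes 30 to 60 minutes; at 8-15x it takes 4 to 7.5 minutes.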

Embeddings: CAM++

Once segmentation cuts the audio into speaker turns, the second model converts each turn into a fixed-length vector that represents 'who that voice sounds like'. The standard choice in 2026 is CAM++, a 192-dimensional speaker embedding model. The output is 192 floating-point numbers per turn.

Two properties make CAM++ work for cross-meeting recognition:

Speaker discriminative: two clips of the same person produce vectors close in cosine distance; two clips of different people produce vectors far apart.
Content invariant: the embedding doesn't carry the words the person said, only how they sound. You can match a 4-second clip against a 90-minute meeting and the cosine score still says 'same person'.

192 floats per turn is about 770 bytes. A meeting with 400 turns produces 300KB of embedding data. The library across a year of meetings stays under 100MB even for prolific users. That is what gets stored, not audio.
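The storage numbers can be checked with a few lines of arithmetic. The sketch assumes 32-bit floats and that every turn's embedding is kept; the post does not state the storage dtype or retention policy, so treat both as assumptions.

```python
DIMENSIONS = 192
BYTES_PER_FLOAT = 4  # assumes 32-bit floats

bytes_per_turn = DIMENSIONS * BYTES_PER_FLOAT      # 768 bytes, "about 770"
bytes_per_meeting = 400 * bytes_per_turn           # 307,200 bytes, about 300 KB

budget = 100 * 1000 * 1000                         # 100 MB, decimal
turns_in_budget = budget // bytes_per_turn         # whole turns that fit
meetings_in_budget = turns_in_budget // 400        # 400-turn meetings that fit

print(bytes_per_turn, bytes_per_meeting, turns_in_budget, meetings_in_budget)
```

Under those assumptions, 100 MB holds a little over 130,000 turn embeddings, or about 325 meetings of 400 turns each.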

Matching: cosine similarity + threshold

When a new meeting finishes, every turn has a CAM++ embedding attached. We compare each embedding against the centroid of every known voice profile in your library. The comparison metric is cosine similarity - a number between -1 and 1, where 1 means 'identical direction in vector space', 0 means 'unrelated', and -1 means 'opposite direction'.
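A minimal sketch of that comparison, using 3-dimensional toy vectors in place of real 192-dimensional embeddings. The function names and the profile dictionary are illustrative, not the app's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def centroid(embeddings):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def best_match(embedding, profiles):
    """Score against every profile centroid; return the top (name, score)."""
    scored = ((name, cosine_similarity(embedding, c)) for name, c in profiles.items())
    return max(scored, key=lambda pair: pair[1])


# Toy library: each profile is the centroid of that person's past turns.
profiles = {
    "Marko": centroid([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "Ana": centroid([[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]),
}

name, score = best_match([0.9, 0.2, 0.0], profiles)
print(name, round(score, 3))
```

best_match assumes the library has at least one profile; an empty library would need a separate "everyone is new" path.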

The default match threshold in Mac Note Taker is 0.65. Above 0.65, we label the speaker with the existing profile name. Below 0.65, we treat the speaker as new and prompt you to name them. The threshold can be tuned but 0.65 is a sweet spot we landed on after testing across roughly 2,000 real meetings during 2025-2026.
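The decision rule itself is tiny. The sketch below applies it to made-up best-match scores at three thresholds to show how the setting shifts the match/new split. Treating a score exactly at the threshold as a match is an assumption, since the text above only covers 'above' and 'below'.

```python
DEFAULT_THRESHOLD = 0.65


def decide(best_name, best_score, threshold=DEFAULT_THRESHOLD):
    """Return the profile name on a match, or None for 'new speaker'.

    A score exactly at the threshold counts as a match here; that tie
    rule is an assumption, not documented behavior.
    """
    return best_name if best_score >= threshold else None


# Made-up best-match results for four turns: (closest profile, score).
results = [("Marko", 0.81), ("Marko", 0.62), ("Ana", 0.58), ("Ana", 0.71)]

for threshold in (0.55, 0.65, 0.75):
    labels = [decide(name, score, threshold) for name, score in results]
    matched = sum(label is not None for label in labels)
    print(f"threshold {threshold:.2f}: {matched} matched, {len(labels) - matched} new")
```

Lowering the threshold turns borderline scores into matches, which risks merging similar voices. Raising it turns them into new-speaker prompts, which risks splitting one person into several profiles.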

A common question from privacy reviewers: does storing a voice embedding count as biometric data under GDPR Article 9? The honest answer is 'it depends on intent.' GDPR Article 4(14) defines biometric data as personal data 'resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person.' Article 9 then restricts processing of biometric data 'for the purpose of uniquely identifying a natural person', which is where intent comes in.

Two pragmatic factors mitigate the risk in the local-first pattern:

Storage is local to the user's device, not transmitted or pooled. The Article 9 concern is large-scale identification systems, not personal note-taking tools.
The library is created by and for the person who owns the device. That owner's consent is implicit in installing the app; consent is not collected from third parties whose voices may be embedded.

We address this in more depth in our privacy-first meeting recording guide. The short version: a Mac-local voice profile library is closer in posture to FaceID's enclave-stored biometric than to a cloud database of voiceprints.

How it performs on real hardware

Stage         Model                        Cost on M3 Pro               Where it runs
VAD           FluidAudio VAD               ~0.5% CPU, continuous        CPU
Segmentation  pyannote-segmentation-3.0    ~2% Neural Engine            Neural Engine
Embedding     CAM++                        ~1% Neural Engine per turn   Neural Engine
Matching      Cosine vs profile centroids  ~0.1 ms per comparison       CPU

The matching step uses brute-force cosine comparison rather than an ANN index because the library is small enough that exact comparison is faster than the overhead of maintaining an index. We will revisit this if user libraries grow past ~10,000 unique speakers, which has not happened yet.
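A rough cost estimate shows why the ~10,000 figure is a sensible point to revisit. The sketch uses the ~0.1 ms per comparison from the table and a 400-turn meeting; the 50-profile library is a made-up example size.

```python
US_PER_COMPARISON = 100  # ~0.1 ms per comparison, expressed in microseconds


def scan_seconds(profile_count, turn_count):
    """Total brute-force matching time: every turn against every centroid."""
    return profile_count * turn_count * US_PER_COMPARISON / 1_000_000


for profile_count in (50, 1_000, 10_000):
    seconds = scan_seconds(profile_count, 400)
    print(f"{profile_count:>6} profiles x 400 turns -> {seconds:6.1f} s")
```

With 50 profiles the whole meeting matches in about 2 seconds. At 10,000 profiles the same scan takes about 400 seconds, which is where an index starts to pay for itself.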

Things that still go wrong

Honest accounting of the failure modes we still see:

Two people speaking at the exact same time get assigned to one speaker. Overlap handling is improving but not perfect.
A person with a heavy cold or laryngitis produces an embedding that looks like a different speaker. Once the user manually merges the two profiles, Mac Note Taker now blends the new embeddings back into the original profile.
Phone-quality audio (8 kHz) loses detail in the upper formants that CAM++ relies on. Cross-meeting matching from a phone call back to a studio recording is unreliable.
A very short turn (<2 seconds) produces a noisy embedding. We mark these as 'low confidence' and avoid creating a new profile from them.
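Two of these mitigations are easy to sketch: gating short turns and merging two profiles. The 2-second cutoff comes from the list above. The count-weighted centroid merge is one plausible way to blend profiles and is an assumption, not a description of how Mac Note Taker does it; both function names are hypothetical.

```python
MIN_PROFILE_SECONDS = 2.0  # turns shorter than this are "low confidence"


def is_low_confidence(start, end, min_seconds=MIN_PROFILE_SECONDS):
    """True when a turn is too short to seed a new profile."""
    return (end - start) < min_seconds


def merge_centroids(centroid_a, count_a, centroid_b, count_b):
    """Blend two profile centroids, weighting each by its embedding count.

    Returns (merged_centroid, total_count). Hypothetical merge rule.
    """
    total = count_a + count_b
    merged = [
        (a * count_a + b * count_b) / total
        for a, b in zip(centroid_a, centroid_b)
    ]
    return merged, total


print(is_low_confidence(10.0, 11.5))  # a 1.5-second turn
print(is_low_confidence(3.0, 5.0))    # a 2.0-second turn

# Usual-voice profile (3 turns) merged with a "has a cold" profile (1 turn).
print(merge_centroids([1.0, 0.0], 3, [0.0, 1.0], 1))
```

Weighting by count keeps a single hoarse meeting from dragging a well-established profile far from its usual position.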

What this enables in your workflow

When diarization runs locally and persists across meetings, three workflows become tractable that aren't with cloud notetakers:

Cross-meeting search by speaker: every utterance from your CEO across the past year, in one query.
Action items attributed to the right person automatically. Our action items page goes deep on this.
Persistent customer voice profiles: a CSM with 30 accounts gets the right name on every call without manual labeling.
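Once every utterance carries a persistent speaker name, the first workflow reduces to a filter. A minimal sketch with a hypothetical Utterance record and made-up data; a real store would run this as a database query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    meeting: str   # meeting identifier, here an ISO date
    speaker: str   # persistent profile name, not "Speaker 1"
    start: float   # seconds from the start of the meeting
    text: str


def by_speaker(utterances, name):
    """All utterances by one person, ordered by meeting and then by time."""
    hits = (u for u in utterances if u.speaker == name)
    return sorted(hits, key=lambda u: (u.meeting, u.start))


utterances = [
    Utterance("2026-03-02", "Ana", 12.0, "Ship it Friday."),
    Utterance("2026-01-15", "Marko", 40.5, "Budget is approved."),
    Utterance("2026-01-15", "Ana", 8.0, "Let's start with hiring."),
]

for u in by_speaker(utterances, "Ana"):
    print(u.meeting, u.text)
```

The query only works because the name is stable across meetings; with per-meeting labels, "Speaker 1" in January and "Speaker 1" in March are unrelated.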

The reference stack, one line per layer

Capture: AVAudioEngine + ScreenCaptureKit
VAD:     FluidAudio voice activity detector
ASR:     Parakeet TDT v3 (default) or Whisper Large v3 (CoreML)
Diar:    pyannote-segmentation-3.0 (CoreML, ANE)
Embed:   CAM++ 192-dim (CoreML, ANE)
Match:   Cosine similarity vs local profile centroids
Store:   SwiftData store, FileVault at rest

Bottom line

Speaker diarization stopped being a research problem in 2025 and started being a feature. On a 2026 Mac, the pyannote + CAM++ pipeline runs comfortably on the Neural Engine, the cross-meeting re-id is reliable above 0.65 cosine similarity, and no audio has to leave the device. If you have ever rage-renamed Speaker 1 / Speaker 2 / Speaker 3 across ten meetings in a row, this is the architecture that ends that workflow.

Frequently asked

What is the difference between diarization and speaker identification?
Diarization labels who speaks when inside a single recording (Speaker A, B, C). Speaker identification matches those labels to specific people, optionally across multiple recordings. Cross-meeting recognition requires both layers.
Does pyannote run on Apple Silicon?
The PyTorch reference does, slowly, on CPU. The production path on Mac is the CoreML port of pyannote-segmentation-3.0, which runs on the Neural Engine at 8-15x real-time with negligible battery cost.
Is a voice embedding biometric data under GDPR?
Article 9 covers biometric data processed 'for the purpose of uniquely identifying a natural person'. A local-only embedding library on the data subject's own device is closer in posture to FaceID than to a centralized voiceprint database. Consult counsel for your specific deployment.
What cosine similarity threshold should I use?
Mac Note Taker defaults to 0.65. Lower thresholds (0.55-0.6) over-merge similar voices; higher thresholds (0.7+) split the same person across mic types. 0.65 is the empirical sweet spot from ~2,000 test meetings.
How big is a voice profile library on disk?
Roughly 770 bytes per turn for embedding storage. A year of dense meeting usage stays under 100MB. No audio is stored in the profile - only the numeric fingerprint.
Can I export my voice profile library?
Yes. The library is a single binary file under the app's Application Support directory. You can copy it between your three activated Macs to share recognition without re-training.
