Measurement Framework: Curator pilot

Executive Summary

The pilot must prove two things: that the analysis is trustworthy and reproducible, and that coaches actually adopt the after-session loop. The single most important metric is the rate at which a coach sends the AI-drafted follow-up to a client with only light edits — it is the truest signal that the output is trusted and time is saved. Everything else is leading or diagnostic.

Metric Hierarchy

North Star: Trusted follow-ups sent per active coach per week (a sent, lightly-edited, client-facing summary = trust + value realized).
L1 — Trust: % of generated analyses the coach sends with ≤ X% edit distance relative to the AI draft (X to be set once pilot baselines exist); analysis reproducibility score (agreement across re-runs).
L1 — Adoption: weekly active coaches; % of sessions run through Curator.
L2 — Effort saved: coach minutes/session before vs. with Curator.
L2 — Continuity: % sessions where dossier opened pre-session; commitment-tracking usage.
Input: transcription quality (word error rate (WER), diarization accuracy); consent capture rate.
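
The L1 Trust metric above can be sketched as follows. This is a minimal illustration, not the pilot's actual instrumentation: the function names, the (draft, sent) pair format, and the example threshold are all assumptions, and difflib's similarity ratio stands in for a true edit distance. The threshold X is left as a parameter because the framework does not fix it yet.

```python
from difflib import SequenceMatcher
from typing import Optional


def edit_fraction(draft: str, sent: str) -> float:
    """Approximate share of text changed between AI draft and sent version.

    0.0 means identical, 1.0 means nothing in common. This uses difflib's
    similarity ratio as a proxy; it is not a Levenshtein distance.
    """
    matcher = SequenceMatcher(None, draft, sent, autojunk=False)
    return 1.0 - matcher.ratio()


def trusted_send_rate(
    pairs: list[tuple[str, Optional[str]]], max_edit_fraction: float
) -> float:
    """Share of generated analyses that were sent with light edits.

    pairs: (draft, sent_text) per generated analysis; sent_text is None
    when the coach never sent the follow-up (hypothetical data shape).
    max_edit_fraction: the "X" threshold, expressed as a fraction.
    """
    if not pairs:
        return 0.0
    trusted = sum(
        1
        for draft, sent in pairs
        if sent is not None and edit_fraction(draft, sent) <= max_edit_fraction
    )
    return trusted / len(pairs)


# Example with made-up data and a placeholder threshold of 0.2.
example_pairs = [
    ("abc", "abc"),          # sent unchanged
    ("abc", None),           # never sent
    ("hello world", "xyz"),  # rewritten from scratch
]
rate = trusted_send_rate(example_pairs, max_edit_fraction=0.2)
```

Unsent drafts stay in the denominator on purpose, so the rate falls when coaches ignore the output rather than only when they rewrite it.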

Counter-Metrics (Goodhart-resistant)

Primary	Gaming risk	Counter-metric
Follow-ups sent	Coach sends low-quality summaries to hit a number	Client-side usefulness rating / reply rate
Edit distance "low"	Output bland/generic so nothing to edit	Coach-rated usefulness ≥ 8/10; specificity check
Reproducibility "high"	Pipeline outputs vague text that's trivially stable	Human-vs-AI agreement (F1) on concrete extractions
Weekly active coaches	Coaches log in but don't run sessions	% of real sessions processed

Leading / Lagging Pairs

Leading	Lag	Lagging outcome
Reproducibility score, usefulness rating	days–weeks	Trusted follow-ups sent (NSM)
Minutes saved/session	weeks	Coach retention / renewal intent
Dossier open rate	weeks–months	Continuity-driven lock-in

Experiment Plan (core reproducibility test)

Hypothesis: the analysis pipeline produces stable, human-comparable extractions.
Design: run the same consented, depersonalised session N=3 times; compare extraction sets; also compare AI vs. coach-labeled "gold" on a sample.
Metric: F1 ≥ 0.70 on attribution/extraction (target borrowed from plan page); inter-run agreement within a defined noise band; inter-annotator kappa on the gold sample.
Decision rule: below target after 3 iteration cycles → pipeline redesign or pivot (stage 25). At/above → proceed to v1.
Statistical validity caveat: pilot N is tiny and statistical power is low; treat results as a directional signal and do not over-claim significance.
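
A minimal sketch of the three measurements named in the experiment plan, assuming each extraction can be reduced to a hashable item (for example a hypothetical (speaker, commitment) tuple) and compared by exact match. Real extractions would likely need fuzzy matching, and the "noise band" is left undefined here as it is in the plan. Mean pairwise Jaccard overlap is one possible choice for inter-run agreement, not necessarily the one the pilot will use.

```python
from collections import Counter
from itertools import combinations


def f1(predicted: set, gold: set) -> float:
    """F1 of a predicted extraction set against a coach-labeled gold set."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def jaccard(a: set, b: set) -> float:
    """Overlap of two extraction sets (1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def inter_run_agreement(runs: list[set]) -> float:
    """Mean pairwise Jaccard overlap across re-runs of the same session."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and equal length")
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Example with made-up extraction IDs: N=3 runs of one session.
gold = {"a", "b", "c", "d"}
runs = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c", "e"}]
per_run_f1 = [f1(run, gold) for run in runs]
agreement = inter_run_agreement(runs)
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

With only three runs and a small gold sample, these numbers are point estimates; they support the directional reading the caveat asks for, not a significance claim.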

Retention Design

Cohort coaches by pilot start; track weekly NSM per cohort; watch for product-market fit (PMF) erosion, i.e. NSM decay once the novelty wears off. At concept stage this is a design, not yet runnable; it activates once pilot usage data exists (feeds stage 24/25).
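
Once usage data exists, the cohort view could be computed along these lines. This is a sketch under assumptions: the input shapes (a coach-to-start-date mapping and a list of trusted-send events) are hypothetical, cohorts are grouped by ISO week of pilot start, and the denominator is the whole cohort rather than "active" coaches, since the framework does not yet define activity.

```python
from collections import defaultdict
from datetime import date


def cohort_label(start: date) -> str:
    """Label a cohort by the ISO year and week of the pilot start date."""
    iso = start.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def weekly_nsm_by_cohort(
    coach_start: dict[str, date],
    trusted_sends: list[tuple[str, date]],
) -> dict[str, dict[int, float]]:
    """Trusted follow-ups per coach, by cohort and weeks since pilot start.

    coach_start: coach_id -> pilot start date (hypothetical shape).
    trusted_sends: (coach_id, send_date) per trusted follow-up sent.
    """
    cohort_size: dict[str, int] = defaultdict(int)
    for start in coach_start.values():
        cohort_size[cohort_label(start)] += 1

    counts: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for coach_id, sent_on in trusted_sends:
        start = coach_start[coach_id]
        week = (sent_on - start).days // 7
        if week < 0:
            continue  # ignore events dated before the coach's pilot start
        counts[cohort_label(start)][week] += 1

    return {
        cohort: {week: n / cohort_size[cohort] for week, n in sorted(weeks.items())}
        for cohort, weeks in counts.items()
    }


# Example with made-up coaches and dates.
starts = {"coach_a": date(2024, 1, 1), "coach_b": date(2024, 1, 2)}
sends = [
    ("coach_a", date(2024, 1, 3)),
    ("coach_a", date(2024, 1, 4)),
    ("coach_b", date(2024, 1, 3)),
    ("coach_a", date(2024, 1, 9)),
]
nsm = weekly_nsm_by_cohort(starts, sends)
```

Dividing by full cohort size means coaches who go quiet pull the weekly figure down, which is the decay pattern this section is meant to surface.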

Intervention Triggers (O→I→R→C→W)

Objective: keep trust + adoption rising.
Indicator: reproducibility < target.
Respond: pause scaling, run pipeline debugging.
Constraint: no shipping untrusted output to clients.
What's next: redesign or pivot.
If follow-ups-sent flat but logins high → adoption-friction investigation (UX), not more features.
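
The two triggers above could be encoded as a simple weekly check. This is an illustrative sketch only: the snapshot fields, the definition of "flat" (relative change within a tolerance), and the "high logins" cut-off are all assumed placeholders that the pilot would need to calibrate. Only the 0.70 reproducibility target comes from the framework itself.

```python
from dataclasses import dataclass


@dataclass
class WeeklySnapshot:
    """One week of pilot metrics (hypothetical field names)."""

    reproducibility_f1: float
    trusted_sends: int
    prev_trusted_sends: int
    logins: int


def interventions(
    snapshot: WeeklySnapshot,
    f1_target: float = 0.70,       # from the experiment plan
    flat_tolerance: float = 0.05,  # assumed: +/-5% counts as "flat"
    high_logins: int = 10,         # assumed placeholder cut-off
) -> list[str]:
    """Return the interventions triggered by this week's snapshot."""
    actions = []

    # Trigger 1: reproducibility below target.
    if snapshot.reproducibility_f1 < f1_target:
        actions.append("pause_scaling_and_debug_pipeline")

    # Trigger 2: follow-ups sent are flat while logins are high.
    baseline = max(snapshot.prev_trusted_sends, 1)
    change = (snapshot.trusted_sends - snapshot.prev_trusted_sends) / baseline
    if abs(change) <= flat_tolerance and snapshot.logins >= high_logins:
        actions.append("investigate_adoption_friction")

    return actions


# Example with made-up numbers.
struggling = interventions(WeeklySnapshot(0.65, 20, 20, 40))
healthy = interventions(WeeklySnapshot(0.80, 30, 20, 40))
```

Keeping the thresholds as parameters makes it easy to tighten them once baselines exist, without changing the trigger logic.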

Assumption Registry

Assumption	Confidence	If wrong
"Sent with light edits" = trust	Med	Need a direct trust survey instead
F1 ≥ 0.70 is "client-safe" threshold	Low	Recalibrate threshold with coaches
Tiny pilot N yields usable signal	Low	Extend pilot / more coaches

Self-Critique

NSM ("follow-ups sent") can be satisfied by a coach who sends sloppily — counter-metrics help but add measurement burden a small pilot may not sustain.
The reproducibility threshold (F1 ≥ 0.70) is inherited from the plan page, not independently justified — it may be too low for "send to a client."
Almost all targets are unbaselined; the first pilot job is to establish baselines, so early "results" will be noisy and easily over-interpreted.