measurement-frameworkdraft
Measurement Framework: Curator pilot
Executive Summary
The pilot must prove two things: that the analysis is trustworthy and reproducible, and that coaches actually adopt the after-session loop. The single most important metric is the rate at which a coach sends the AI-drafted follow-up to a client with only light edits — it is the truest signal that the output is trusted and time is saved. Everything else is leading or diagnostic.
Metric Hierarchy
- North Star: Trusted follow-ups sent per active coach per week (a sent, lightly-edited, client-facing summary = trust + value realized).
- L1 — Trust: % of generated analyses the coach sends with ≤ X% edit distance; analysis reproducibility score (agreement across re-runs).
- L1 — Adoption: weekly active coaches; % of sessions run through Curator.
- L2 — Effort saved: coach minutes/session before vs. with Curator.
- L2 — Continuity: % sessions where dossier opened pre-session; commitment-tracking usage.
- Input: transcription quality (WER, diarization accuracy); consent capture rate.
Counter-Metrics (Goodhart-resistant)
| Primary |
Gaming risk |
Counter-metric |
| Follow-ups sent |
Coach sends low-quality summaries to hit a number |
Client-side usefulness rating / reply rate |
| Edit distance "low" |
Output bland/generic so nothing to edit |
Coach-rated usefulness ≥ 8/10; specificity check |
| Reproducibility "high" |
Pipeline outputs vague text that's trivially stable |
Human-vs-AI agreement (F1) on concrete extractions |
| Weekly active coaches |
Coaches log in but don't run sessions |
% of real sessions processed |
Leading / Lagging Pairs
| Leading |
Lag |
Lagging outcome |
| Reproducibility score, usefulness rating |
days–weeks |
Trusted follow-ups sent (NSM) |
| Minutes saved/session |
weeks |
Coach retention / renewal intent |
| Dossier open rate |
weeks–months |
Continuity-driven lock-in |
Experiment Plan (core reproducibility test)
- Hypothesis: the analysis pipeline produces stable, human-comparable extractions.
- Design: run the same consented, depersonalised session N=3 times; compare extraction sets; also compare AI vs. coach-labeled "gold" on a sample.
- Metric: F1 ≥ 0.70 on attribution/extraction (target borrowed from plan page); inter-run agreement within a defined noise band; inter-annotator kappa on the gold sample.
- Decision rule: below target after 3 iteration cycles → pipeline redesign or pivot (stage 25). At/above → proceed to v1.
- Stat validity caveat: pilot N is tiny; treat as directional signal, power is low — do not over-claim significance.
Retention Design
Cohort coaches by pilot start; track weekly NSM per cohort; watch for PMF erosion (NSM decay after novelty). At concept stage this is a design, not yet runnable — it activates once pilot usage data exists (feeds stage 24/25).
Intervention Triggers (O→I→R→C→W)
- Objective: keep trust + adoption rising.
- Indicator: if reproducibility < target → Respond: pause scaling, run pipeline debugging; Constraint: no shipping untrusted output to clients; What's next: redesign or pivot.
- If follow-ups-sent flat but logins high → adoption-friction investigation (UX), not more features.
Assumption Registry
| Assumption |
Confidence |
If wrong |
| "Sent with light edits" = trust |
Med |
Need a direct trust survey instead |
| F1 ≥ 0.70 is "client-safe" threshold |
Low |
Recalibrate threshold with coaches |
| Tiny pilot N yields usable signal |
Low |
Extend pilot / more coaches |
Self-Critique (≥3)
- NSM ("follow-ups sent") can be satisfied by a coach who sends sloppily — counter-metrics help but add measurement burden a small pilot may not sustain.
- The reproducibility threshold (F1 ≥ 0.70) is inherited from the plan page, not independently justified — it may be too low for "send to a client."
- Almost all targets are unbaselined; the first pilot job is to establish baselines, so early "results" will be noisy and easily over-interpreted.