measurement-frameworkdraft

Measurement Framework: Curator pilot

Executive Summary

The pilot must prove two things: that the analysis is trustworthy and reproducible, and that coaches actually adopt the after-session loop. The single most important metric is the rate at which a coach sends the AI-drafted follow-up to a client with only light edits — it is the truest signal that the output is trusted and time is saved. Everything else is leading or diagnostic.

Metric Hierarchy

  • North Star: Trusted follow-ups sent per active coach per week (a sent, lightly-edited, client-facing summary = trust + value realized).
  • L1 — Trust: % of generated analyses the coach sends with ≤ X% edit distance; analysis reproducibility score (agreement across re-runs).
  • L1 — Adoption: weekly active coaches; % of sessions run through Curator.
  • L2 — Effort saved: coach minutes/session before vs. with Curator.
  • L2 — Continuity: % sessions where dossier opened pre-session; commitment-tracking usage.
  • Input: transcription quality (WER, diarization accuracy); consent capture rate.

Counter-Metrics (Goodhart-resistant)

Primary Gaming risk Counter-metric
Follow-ups sent Coach sends low-quality summaries to hit a number Client-side usefulness rating / reply rate
Edit distance "low" Output bland/generic so nothing to edit Coach-rated usefulness ≥ 8/10; specificity check
Reproducibility "high" Pipeline outputs vague text that's trivially stable Human-vs-AI agreement (F1) on concrete extractions
Weekly active coaches Coaches log in but don't run sessions % of real sessions processed

Leading / Lagging Pairs

Leading Lag Lagging outcome
Reproducibility score, usefulness rating days–weeks Trusted follow-ups sent (NSM)
Minutes saved/session weeks Coach retention / renewal intent
Dossier open rate weeks–months Continuity-driven lock-in

Experiment Plan (core reproducibility test)

  • Hypothesis: the analysis pipeline produces stable, human-comparable extractions.
  • Design: run the same consented, depersonalised session N=3 times; compare extraction sets; also compare AI vs. coach-labeled "gold" on a sample.
  • Metric: F1 ≥ 0.70 on attribution/extraction (target borrowed from plan page); inter-run agreement within a defined noise band; inter-annotator kappa on the gold sample.
  • Decision rule: below target after 3 iteration cycles → pipeline redesign or pivot (stage 25). At/above → proceed to v1.
  • Stat validity caveat: pilot N is tiny; treat as directional signal, power is low — do not over-claim significance.

Retention Design

Cohort coaches by pilot start; track weekly NSM per cohort; watch for PMF erosion (NSM decay after novelty). At concept stage this is a design, not yet runnable — it activates once pilot usage data exists (feeds stage 24/25).

Intervention Triggers (O→I→R→C→W)

  • Objective: keep trust + adoption rising.
  • Indicator: if reproducibility < target → Respond: pause scaling, run pipeline debugging; Constraint: no shipping untrusted output to clients; What's next: redesign or pivot.
  • If follow-ups-sent flat but logins high → adoption-friction investigation (UX), not more features.

Assumption Registry

Assumption Confidence If wrong
"Sent with light edits" = trust Med Need a direct trust survey instead
F1 ≥ 0.70 is "client-safe" threshold Low Recalibrate threshold with coaches
Tiny pilot N yields usable signal Low Extend pilot / more coaches

Self-Critique (≥3)

  1. NSM ("follow-ups sent") can be satisfied by a coach who sends sloppily — counter-metrics help but add measurement burden a small pilot may not sustain.
  2. The reproducibility threshold (F1 ≥ 0.70) is inherited from the plan page, not independently justified — it may be too low for "send to a client."
  3. Almost all targets are unbaselined; the first pilot job is to establish baselines, so early "results" will be noisy and easily over-interpreted.