Quality

Coach quality, in the open.

Most fitness apps with an AI coach don’t publish what their coach actually does. We publish ours. Every coach prompt VolumeArc ships goes through a 20-fixture regression test. This page will show the latest results.

Coming soon. The live eval-trend chart and per-fixture last-run table are being wired up in VOL-148. The data source is docs/coach-eval-trend.json in the app repo, populated by the nightly response-eval CI job (VOL-147). Until then, this page describes what the harness covers.

What the harness covers

Readiness × Intent

Bucketed at readiness 45 / 60 / 72 / 82 / 88 across six intents (progression, deload, form, recovery, substitution, free).
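As a rough illustration of what this grid looks like, the sketch below enumerates every readiness and intent pairing in Python. The bucket values and intent names come from this page; the constant and function names are illustrative only and are not taken from the app code.

```python
from itertools import product

# Values as listed on this page; the names are illustrative, not the app's.
READINESS_BUCKETS = (45, 60, 72, 82, 88)
INTENTS = ("progression", "deload", "form", "recovery", "substitution", "free")


def readiness_intent_grid():
    """Return every (readiness, intent) pair the buckets and intents can form."""
    return list(product(READINESS_BUCKETS, INTENTS))


grid = readiness_intent_grid()
print(len(grid))  # 5 buckets x 6 intents = 30 cells
```

The full cross product has 30 cells, which is more than the 20 fixtures in the suite, so the fixtures presumably cover a subset of the grid. Which cells are covered is not specified here.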

Coaching style

Three personas: motivational, analytical, minimal. The system prompt envelope is asserted to match the user setting on every render.

Session history

Cold-start (0 sessions), single-session, established (5+ sessions). Tests verify the coach references real prior context when present.

Privacy mode

Strict-mode redaction is asserted to drop user identifiers from the outbound prompt envelope before it leaves the device.
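A minimal sketch of this kind of assertion, written in Python for illustration. The envelope is modeled as a plain dictionary and the identifier field names are hypothetical; the real envelope schema and redaction code are not shown on this page.

```python
from copy import deepcopy

# Hypothetical field names; the real envelope schema is not described here.
IDENTIFIER_FIELDS = frozenset({"user_id", "email", "display_name", "device_id"})


def redact_envelope(envelope, strict):
    """Return a copy of the envelope, without identifier fields in strict mode."""
    if not strict:
        return deepcopy(envelope)
    return {
        key: deepcopy(value)
        for key, value in envelope.items()
        if key not in IDENTIFIER_FIELDS
    }


def assert_no_identifiers(envelope):
    """Fail loudly if any identifier field survived redaction."""
    leaked = IDENTIFIER_FIELDS.intersection(envelope)
    assert not leaked, f"identifiers leaked: {sorted(leaked)}"
```

A test in this style would redact a fixture envelope with strict mode on, then call assert_no_identifiers on the result before anything is sent.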

Two layers of testing

Template layer (hermetic, runs in CI). Every prompt rendered through CoachPromptTemplate.render(...) is asserted to contain the template marker, the intent envelope, the persona appropriate to the user’s coaching style, and the verbatim question. A regression that bypasses the template drops the marker and trips the test. Runs on every pull request.
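The template-layer checks can be sketched roughly as follows. This is an illustrative Python stand-in, not the app's code: the render function below only imitates CoachPromptTemplate.render(...), whose real signature is not shown here, and the marker string and envelope layout are assumptions.

```python
# Hypothetical marker and layout; the real template is not shown on this page.
TEMPLATE_MARKER = "[[coach-template]]"
PERSONAS = ("motivational", "analytical", "minimal")


def render(question, intent, persona):
    """Stand-in renderer that wraps the question in a marked envelope."""
    if persona not in PERSONAS:
        raise ValueError(f"unknown persona: {persona}")
    return "\n".join([
        TEMPLATE_MARKER,
        f"intent: {intent}",
        f"persona: {persona}",
        f"question: {question}",
    ])


def check_rendered(prompt, question, intent, persona):
    """Return a list of problems; an empty list means the prompt passes."""
    problems = []
    if TEMPLATE_MARKER not in prompt:
        problems.append("missing template marker")
    if f"intent: {intent}" not in prompt:
        problems.append("missing intent envelope")
    if f"persona: {persona}" not in prompt:
        problems.append("persona does not match user setting")
    if question not in prompt:
        problems.append("question not included verbatim")
    return problems
```

A prompt built by hand, without going through render, has no marker, so check_rendered reports "missing template marker". That is the bypass regression described above.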

Response layer (live-relay, runs nightly). Twenty fixtures are POSTed to the production relay. Each response is checked for a maximum sentence count, numeric grounding (RPE / weight / reps), a reference to readiness state, the absence of banned phrases (“I don’t know”, “ChatGPT”, etc.), and pain-signal flagging where expected. Regressions open a Linear issue and notify the team.
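To make these checks concrete, here is a minimal Python sketch of a few of them. The sentence limit, the regular expressions, and the function names are assumptions for illustration; only the two banned phrases named above are taken from this page, and pain-signal flagging is left out.

```python
import re

# Only the two phrases named on this page; the full list is not shown.
BANNED_PHRASES = ("I don't know", "ChatGPT")
MAX_SENTENCES = 4  # hypothetical limit

# Matches forms such as "RPE 8", "2.5 kg", "5 reps", or "3x8".
NUMERIC_GROUNDING = re.compile(
    r"\bRPE\s*\d+(\.\d+)?|\b\d+(\.\d+)?\s*(kg|lbs?)\b|\b\d+\s*reps?\b|\b\d+\s*x\s*\d+\b",
    re.IGNORECASE,
)


def count_sentences(text):
    """Count sentences by splitting after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def check_response(text):
    """Return a list of failed checks; an empty list means the response passes."""
    failures = []
    if count_sentences(text) > MAX_SENTENCES:
        failures.append("too many sentences")
    if not NUMERIC_GROUNDING.search(text):
        failures.append("no numeric grounding")
    if "readiness" not in text.lower():
        failures.append("no readiness reference")
    # Normalize curly apostrophes so both spellings of a phrase are caught.
    normalized = text.replace("\u2019", "'").lower()
    for phrase in BANNED_PHRASES:
        if phrase.lower() in normalized:
            failures.append(f"banned phrase: {phrase}")
    return failures
```

Returning a list of failures rather than a single boolean lets a nightly report say which check regressed for a given fixture.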

Why we publish this

LLMs drift. Frontier models change. Prompts that worked yesterday may degrade tomorrow. The only honest answer to “is the coach actually good?” is to put the test results in front of you. If a fixture fails or a trend goes the wrong way, you’ll see it here before we ship anything new.