Store data input
All metrics computed deterministically from these values
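Since every metric is a deterministic function of the raw store inputs, the derivation can be sketched directly. The field names and figures below are illustrative assumptions, not the app's actual schema:

```python
# Hypothetical raw store inputs (names and values are illustrative only).
store = {
    "total_revenue": 120_000.0,       # all revenue in the period
    "attributed_revenue": 90_000.0,   # revenue attributed to paid channels
    "marketing_spend": 30_000.0,
    "new_customers": 600,
    "orders": 2_400,
    "sessions": 60_000,
}

def compute_metrics(s: dict) -> dict:
    """Derive core metrics deterministically from raw store inputs."""
    return {
        "ROAS": s["attributed_revenue"] / s["marketing_spend"],  # return on ad spend
        "MER":  s["total_revenue"] / s["marketing_spend"],       # blended marketing efficiency
        "CAC":  s["marketing_spend"] / s["new_customers"],       # customer acquisition cost
        "AOV":  s["total_revenue"] / s["orders"],                # average order value
        "CVR":  s["orders"] / s["sessions"],                     # conversion rate
    }
```

Running the same inputs always yields the same metrics, which is what makes ground-truth comparison meaningful.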
| Agent | Metrics | Tests | Status |
|---|---|---|---|
| Paid agent | ROAS · MER · CAC · Attribution | 7 pending | — |
| Growth agent | LTV · Retention · Cohorts | 6 pending | — |
| CRO agent | CVR · AOV · Friction | 6 pending | — |
| Strategy agent | Profit · Unit economics | 7 pending | — |

Possible statuses: BLOCKED · WARN · —
Test results
| Test | Agent | Weight | Ground truth | Tolerance | AI output | Confidence | Status |
|---|---|---|---|---|---|---|---|

Load sample data, then run all tests to populate this table.
⚠ Agents blocked from Strategy synthesis
Industry benchmark awareness
DTC ecommerce · UK market
Each agent's outputs are checked against real DTC industry ranges. A mathematically correct answer can still be commercially nonsensical — these checks catch that. Run ground truth first to populate actuals.
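A range check of this kind can be sketched as below. The benchmark bands are illustrative placeholders, not the app's real DTC/UK figures:

```python
# Illustrative benchmark bands (lo, hi); the real DTC/UK ranges may differ.
BENCHMARKS = {
    "ROAS": (2.0, 6.0),
    "CVR":  (0.01, 0.04),
    "AOV":  (30.0, 120.0),
}

def benchmark_check(metric: str, value: float) -> str:
    """Return 'in-range' or 'out-of-range' against the industry band.

    A value can be mathematically correct yet commercially
    implausible; this is the check that catches it.
    """
    lo, hi = BENCHMARKS[metric]
    return "in-range" if lo <= value <= hi else "out-of-range"
```

For example, a computed CVR of 0.20 (20% of sessions converting) passes arithmetic checks but fails the benchmark band, so it gets flagged.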
Paid agent benchmarks
Growth agent benchmarks
CRO agent benchmarks
Strategy agent benchmarks
Benchmark summary findings
Run ground truth computation to see benchmark analysis.
Weighted confidence scoring
Per-agent 0–100 score
Each test carries a severity weight. Critical tests (ROAS, net profit, CVR) count for more than directional tests. The weighted score determines whether an agent's output is trusted, flagged, or blocked.
Paid weighted score: —
Growth weighted score: —
CRO weighted score: —
Strategy weighted score: —
Per-test confidence breakdown
Run tests to see confidence breakdown
Scoring methodology
Critical (weight 3)
ROAS, net profit, CVR, MER gate. Failure here blocks the agent entirely.
Important (weight 2)
CAC, AOV, gross margin, LTV:CAC. Failure triggers a warning to the Strategy agent.
Directional (weight 1)
Channel splits, repeat rate, benchmarks. Failure logged but doesn't affect the score materially.
Regression tracking
Save a baseline snapshot after each deploy. Every future run compares against it and flags drift above the threshold.
No baselines saved yet. Run a full test suite, then click "Save current as baseline" to record the starting point. Future runs will compare against it.
How regression tracking works
1. Save baseline
After a stable deploy, run the full suite and click "Save baseline". All ground truth values are stored with a timestamp.
2. Run on each deploy
After every future deploy, run the suite with the same dataset. The engine compares each metric against the baseline.
3. Flag drift
Any metric that drifts beyond the threshold (default 5%) is flagged. If 2+ critical metrics drift, the deploy is blocked.
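The three steps above reduce to a per-metric relative-drift comparison. A minimal sketch, assuming baseline and current runs are plain metric-name-to-value dicts and the critical-metric set matches the scoring methodology:

```python
def drift_flags(baseline: dict, current: dict, threshold: float = 0.05) -> dict:
    """Compare each metric to the saved baseline; return {metric: drift} for
    metrics whose relative drift exceeds the threshold (default 5%)."""
    flagged = {}
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None or base == 0:
            continue  # metric missing from this run, or baseline unusable
        drift = abs(cur - base) / abs(base)
        if drift > threshold:
            flagged[name] = drift
    return flagged

def deploy_blocked(flagged: dict, critical=("ROAS", "net_profit", "CVR")) -> bool:
    """Block the deploy when 2+ critical metrics drifted past the threshold."""
    return sum(1 for m in flagged if m in critical) >= 2
```

A run where ROAS and CVR both drift ~17% from baseline would be flagged on both metrics and, since both are critical, the deploy would be blocked.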