Ground truth 26
Benchmark awareness 24
Confidence scoring 4 agents
Regression tracking 0 baselines
Store data input · All metrics are computed deterministically from these values
BLOCKED · WARN
Paid agent
ROAS · MER · CAC · Attribution
7 tests pending
BLOCKED · WARN
Growth agent
LTV · Retention · Cohorts
6 tests pending
BLOCKED · WARN
CRO agent
CVR · AOV · Friction
6 tests pending
BLOCKED · WARN
Strategy agent
Profit · Unit economics
7 tests pending
Test results
Test · Agent · Weight · Ground truth · Tolerance · AI output · Confidence · Status
Load sample data then run all tests
⚠ Agents blocked from Strategy synthesis
Industry benchmark awareness · DTC ecommerce · UK market

Each agent's outputs are checked against real DTC industry ranges. A mathematically correct answer can still be commercially nonsensical — these checks catch that. Run ground truth first to populate actuals.
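A benchmark check of this kind can be sketched as a simple range test. The ranges below are illustrative placeholders, not the dashboard's actual DTC/UK benchmark tables:

```python
# Illustrative benchmark bands — NOT real industry figures.
BENCHMARKS = {
    "roas": (1.5, 6.0),       # blended ROAS band (assumed)
    "cvr": (0.01, 0.05),      # sitewide conversion rate (assumed)
    "aov_gbp": (30.0, 150.0), # average order value in GBP (assumed)
}

def benchmark_check(metric: str, value: float) -> str:
    """Return PASS if the value sits inside the industry band, else FLAG."""
    lo, hi = BENCHMARKS[metric]
    return "PASS" if lo <= value <= hi else "FLAG"
```

For example, a computed ROAS of 12.0 may be arithmetically correct yet falls outside any plausible band, so `benchmark_check("roas", 12.0)` flags it for review.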

Paid agent benchmarks
Growth agent benchmarks
CRO agent benchmarks
Strategy agent benchmarks
Benchmark summary findings
Run ground truth computation to see benchmark analysis.
Weighted confidence scoring · Per-agent 0–100 score

Each test carries a severity weight. Critical tests (ROAS, net profit, CVR) count for more than directional tests. The weighted score determines whether an agent's output is trusted, flagged, or blocked.
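A minimal sketch of the weighted score, assuming each test result is a (passed, weight) pair; the function name is illustrative:

```python
def weighted_score(results):
    """Compute a 0-100 score from (passed: bool, weight: int) pairs.

    Higher-weight (critical) tests move the score more than
    weight-1 (directional) tests, as described above.
    """
    total = sum(weight for _, weight in results)
    earned = sum(weight for passed, weight in results if passed)
    return round(100 * earned / total) if total else 0
```

Passing two heavy tests and failing one light one, e.g. `weighted_score([(True, 3), (True, 2), (False, 1)])`, yields 83, whereas an unweighted pass rate would give 67.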

Paid weighted score
Growth weighted score
CRO weighted score
Strategy weighted score
Per-test confidence breakdown
Run tests to see confidence breakdown
Scoring methodology
Critical (weight 3)
ROAS, net profit, CVR, MER gate. Failure here blocks the agent entirely.
Important (weight 2)
CAC, AOV, gross margin, LTV:CAC. Failure triggers a warning to the Strategy agent.
Directional (weight 1)
Channel splits, repeat rate, benchmarks. Failure logged but doesn't affect the score materially.
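The three severity tiers above map onto a simple gating rule. A sketch, assuming an agent's failed tests are reported as a list of their weights (status labels match the dashboard's):

```python
def agent_status(failed_weights):
    """Map failed-test weights (3=critical, 2=important, 1=directional)
    to an agent gate, per the methodology above."""
    if any(w == 3 for w in failed_weights):
        return "BLOCKED"  # any critical failure blocks the agent entirely
    if any(w == 2 for w in failed_weights):
        return "WARN"     # important failures warn the Strategy agent
    return "TRUSTED"      # directional failures are logged only
```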
Regression tracking
Save a baseline snapshot after each deploy. Every future run compares against it and flags drift above the threshold.
No baselines saved yet. Run a full test suite, then click "Save current as baseline" to record the starting point. Future runs will compare against it.
How regression tracking works
1. Save baseline
After a stable deploy, run the full suite and click "Save baseline". All ground truth values are stored with a timestamp.
2. Run on each deploy
After every future deploy, run the suite with the same dataset. The engine compares each metric against the baseline.
3. Flag drift
Any metric that drifts beyond the threshold (default 5%) is flagged. If 2+ critical metrics drift, the deploy is blocked.
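The three steps above can be sketched as a drift check against a saved baseline. Metric names and the critical set are illustrative assumptions:

```python
def drift_report(baseline, current, threshold=0.05,
                 critical=("roas", "net_profit", "cvr")):
    """Compare current metrics against a baseline snapshot.

    Returns (drifted, blocked): metrics whose relative change exceeds
    the threshold (default 5%), and whether the deploy should be
    blocked because 2+ critical metrics drifted.
    """
    drifted = {
        name: abs(current[name] - baseline[name]) / abs(baseline[name])
        for name in baseline
        if abs(current[name] - baseline[name]) > threshold * abs(baseline[name])
    }
    critical_drift = [name for name in drifted if name in critical]
    return drifted, len(critical_drift) >= 2
```

For example, if both ROAS and net profit move 10% against the baseline while CVR holds steady, two critical metrics have drifted and the deploy is blocked.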