Store data input
All metrics computed deterministically from these values
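Since every metric is a deterministic function of the raw store inputs, the derivation can be sketched directly. The field names and figures below are illustrative assumptions, not the app's actual schema:

```python
# Hypothetical raw store inputs (names and values are illustrative only).
store = {
    "total_revenue": 120_000.0,       # all revenue in the period
    "attributed_revenue": 90_000.0,   # revenue attributed to paid channels
    "marketing_spend": 30_000.0,
    "new_customers": 600,
    "orders": 2_400,
    "sessions": 60_000,
}

def compute_metrics(s: dict) -> dict:
    """Derive core metrics deterministically from raw store inputs."""
    return {
        "ROAS": s["attributed_revenue"] / s["marketing_spend"],  # return on ad spend
        "MER":  s["total_revenue"] / s["marketing_spend"],       # blended marketing efficiency
        "CAC":  s["marketing_spend"] / s["new_customers"],       # customer acquisition cost
        "AOV":  s["total_revenue"] / s["orders"],                # average order value
        "CVR":  s["orders"] / s["sessions"],                     # conversion rate
    }
```

Running the same inputs always yields the same metrics, which is what makes ground-truth comparison meaningful.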
| Agent | Metrics | Tests | Status |
|---|---|---|---|
| Paid agent | ROAS · MER · CAC · Attribution | 7 pending | — |
| Growth agent | LTV · Retention · Cohorts | 6 pending | — |
| CRO agent | CVR · AOV · Friction | 6 pending | — |
| Strategy agent | Profit · Unit economics | 7 pending | — |

Possible statuses: BLOCKED · WARN · —
Test results
| Test | Agent | Weight | Ground truth | Tolerance | AI output | Confidence | Status |
|---|---|---|---|---|---|---|---|

Load sample data, then run all tests to populate this table.
⚠ Agents blocked from Strategy synthesis
Industry benchmark awareness
DTC ecommerce · UK market
Each agent's outputs are checked against real DTC industry ranges. A mathematically correct answer can still be commercially nonsensical — these checks catch that. Run ground truth first to populate actuals.
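A range check of this kind can be sketched as below. The benchmark bands are illustrative placeholders, not the app's real DTC/UK figures:

```python
# Illustrative benchmark bands (lo, hi); the real DTC/UK ranges may differ.
BENCHMARKS = {
    "ROAS": (2.0, 6.0),
    "CVR":  (0.01, 0.04),
    "AOV":  (30.0, 120.0),
}

def benchmark_check(metric: str, value: float) -> str:
    """Return 'in-range' or 'out-of-range' against the industry band.

    A value can be mathematically correct yet commercially
    implausible; this is the check that catches it.
    """
    lo, hi = BENCHMARKS[metric]
    return "in-range" if lo <= value <= hi else "out-of-range"
```

For example, a computed CVR of 0.20 (20% of sessions converting) passes arithmetic checks but fails the benchmark band, so it gets flagged.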
Paid agent benchmarks
Growth agent benchmarks
CRO agent benchmarks
Strategy agent benchmarks
Benchmark summary findings
Run ground truth computation to see benchmark analysis.
Weighted confidence scoring
Per-agent 0–100 score
Each test carries a severity weight. Critical tests (ROAS, net profit, CVR) count for more than directional tests. The weighted score determines whether an agent's output is trusted, flagged, or blocked.
Paid weighted score: —
Growth weighted score: —
CRO weighted score: —
Strategy weighted score: —
Per-test confidence breakdown
Run tests to see confidence breakdown
Scoring methodology
Critical (weight 3)
ROAS, net profit, CVR, MER gate. Failure here blocks the agent entirely.
Important (weight 2)
CAC, AOV, gross margin, LTV:CAC. Failure triggers a warning to the Strategy agent.
Directional (weight 1)
Channel splits, repeat rate, benchmarks. Failure logged but doesn't affect the score materially.
Regression tracking
Save a baseline snapshot after each deploy. Every future run compares against it and flags drift above the threshold.
No baselines saved yet. Run a full test suite, then click "Save current as baseline" to record the starting point. Future runs will compare against it.
How regression tracking works
1. Save baseline
After a stable deploy, run the full suite and click "Save baseline". All ground truth values are stored with a timestamp.
2. Run on each deploy
After every future deploy, run the suite with the same dataset. The engine compares each metric against the baseline.
3. Flag drift
Any metric that drifts beyond the threshold (default 5%) is flagged. If 2+ critical metrics drift, the deploy is blocked.
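The three steps above reduce to a per-metric relative-drift comparison. A minimal sketch, assuming baseline and current runs are plain metric-name-to-value dicts and the critical-metric set matches the scoring methodology:

```python
def drift_flags(baseline: dict, current: dict, threshold: float = 0.05) -> dict:
    """Compare each metric to the saved baseline; return {metric: drift} for
    metrics whose relative drift exceeds the threshold (default 5%)."""
    flagged = {}
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None or base == 0:
            continue  # metric missing from this run, or baseline unusable
        drift = abs(cur - base) / abs(base)
        if drift > threshold:
            flagged[name] = drift
    return flagged

def deploy_blocked(flagged: dict, critical=("ROAS", "net_profit", "CVR")) -> bool:
    """Block the deploy when 2+ critical metrics drifted past the threshold."""
    return sum(1 for m in flagged if m in critical) >= 2
```

A run where ROAS and CVR both drift ~17% from baseline would be flagged on both metrics and, since both are critical, the deploy would be blocked.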