KPI Tree

Metric Definition

Experiment evaluation

Relative Lift = ((Variant Conversion Rate - Control Conversion Rate) / Control Conversion Rate) × 100

Variant Conversion Rate: The measured outcome for the group shown the new experience, such as conversion rate or revenue per user
Control Conversion Rate: The measured outcome for the group shown the existing experience

A/B testing analysis

A/B testing analysis is the process of comparing two or more variants shown to randomly assigned groups to decide whether a change produces a meaningfully better outcome on a defined success metric. It combines the observed difference between variants with the statistical confidence that the difference is real and not chance. Done well, it replaces opinion with evidence and tells you which changes are worth shipping.

What is A/B testing analysis?

A/B testing analysis is the process of comparing two or more variants shown to randomly assigned groups to decide whether a change produces a meaningfully better outcome on a defined success metric. The control group sees the existing experience, the treatment group sees the proposed change, and the performance of each is measured on a metric agreed before the test begins. If a checkout page converts at 4.0 per cent for the control and 4.4 per cent for the variant, the relative lift is 10 per cent.
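
The checkout example above can be reproduced with a few lines of Python. This is a minimal sketch of the relative lift formula; the function name `relative_lift` is an illustrative choice, not part of any particular tool.

```python
def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Relative lift of the variant over the control, as a percentage."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (variant_rate - control_rate) / control_rate * 100


# Checkout example from the text: 4.0 per cent control, 4.4 per cent variant.
lift = relative_lift(control_rate=0.040, variant_rate=0.044)
print(f"Relative lift: {lift:.1f}%")
```

Note that the absolute difference here is only 0.4 percentage points. Reporting both the absolute and the relative figure avoids confusion about which one is meant.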

The analysis has three parts that must be read together. The first is the observed difference, or lift, between variants. The second is statistical significance, which tells you whether the difference is larger than random noise alone would plausibly produce. The third is practical significance, which asks whether the lift is large enough to justify the cost of shipping the change. A result can be statistically significant but too small to matter, or it can show a large difference that is not significant because the sample was too small.

Significance is usually expressed as a p-value or a confidence interval. A p-value below 0.05, often described as 95 per cent confidence, is the common threshold. It means that if the change had no real effect, a difference at least as large as the one observed would appear less than 5 per cent of the time. The right threshold depends on the cost of being wrong. A pricing change may warrant 99 per cent confidence, while a copy tweak may be actionable at 90 per cent.
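
As an illustration of how a p-value might be computed for a conversion metric, the sketch below uses a pooled two-proportion z-test with only the Python standard library. The choice of test is an assumption, since the article does not prescribe one, and the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 20,000 users per group, converting at 4.0 and 4.4 per cent.
p = two_proportion_p_value(conv_a=800, n_a=20_000, conv_b=880, n_b=20_000)
print(f"p-value: {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

With these hypothetical counts the result falls just under the 0.05 threshold, which would pass for a copy tweak but not for a decision that demands 99 per cent confidence.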

Define the primary metric and the required sample size before the test goes live. Changing the success metric mid-test, or stopping the moment results look good, introduces bias that invalidates the result. Pre-registering the design is what separates rigorous experimentation from cherry-picking data.

How to calculate A/B testing analysis

The headline output of an A/B test is relative lift: the percentage by which the variant outperforms the control on the chosen metric. To trust that number you also need the sample size, the significance level, and a confidence interval around the lift. Lift alone is a point estimate, and a point estimate without a confidence interval can mislead.

Work through the inputs in order. Each one is a checkpoint that can invalidate the result if it is missing or wrong.

  1. Control and variant outcome

    Measure the success metric for each group, for example the conversion rate or revenue per user. The variant minus the control, divided by the control, gives the relative lift.

  2. Sample size per variant

    Count the number of users randomly assigned to each group. The required size is calculated before launch from the baseline rate, the minimum detectable effect, and the desired statistical power.

  3. Statistical significance

    Compute the p-value or confidence level for the observed difference. A p-value below the threshold agreed in advance means a difference this large would be unlikely if the change had no effect.

  4. Confidence interval

    Report the range the true lift is likely to fall within. A lift of 10 per cent with an interval of 4 to 16 per cent tells a far richer story than the single number.
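
The final checkpoint, the confidence interval, can be sketched as follows. This is one simple approach under stated assumptions: it builds a normal-approximation (Wald) interval on the absolute difference in conversion rates and scales it by the observed control rate, which ignores the uncertainty in the control rate itself. The counts are hypothetical and the function name is illustrative.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_c, n_c, conv_v, n_v, confidence=0.95):
    """Approximate (low, point, high) for relative lift, in per cent."""
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_v - p_c
    low, high = diff - z * se, diff + z * se
    # Scale by the observed control rate to express the bounds as relative lift.
    return low / p_c * 100, diff / p_c * 100, high / p_c * 100


# Hypothetical counts: 20,000 users per group, converting at 4.0 and 4.4 per cent.
low, point, high = lift_confidence_interval(800, 20_000, 880, 20_000)
print(f"Lift: {point:.1f}% (95% interval {low:.1f}% to {high:.1f}%)")
```

An interval this wide around a 10 per cent point estimate is a reminder that the headline lift can look far more precise than the data supports.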

A/B testing analysis in a metric tree

A single test result is hard to value in isolation. A metric tree lifts the focus from one experiment to the performance of the whole experimentation programme, connecting test velocity, win rate, and effect size to the cumulative business impact of the winners you ship.

The decomposition below shows how programme impact breaks down into the levers a team actually controls. Reading it top to bottom makes it clear why a programme can run many tests yet produce little impact: most tests do not reach significance, and the value comes from the few that do.
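
One way to make this kind of decomposition concrete is a small model that turns the three levers into a cumulative figure. The sketch below is an assumption about how the pieces might combine, not KPI Tree's own calculation: expected winners are tests launched multiplied by win rate, and shipped lifts are assumed to compound on the same headline metric. The input numbers are hypothetical.

```python
def programme_impact(tests_launched: int, win_rate: float, avg_winning_lift: float) -> dict:
    """Rough cumulative impact of an experimentation programme.

    win_rate and avg_winning_lift are fractions, e.g. 0.15 and 0.04.
    Assumes every winner ships and that lifts compound on one metric.
    """
    expected_winners = tests_launched * win_rate
    cumulative_lift = (1 + avg_winning_lift) ** expected_winners - 1
    return {
        "expected_winners": expected_winners,
        "cumulative_lift": cumulative_lift,
    }


# Hypothetical programme: 40 tests a year, 15 per cent win rate, 4 per cent average lift.
result = programme_impact(tests_launched=40, win_rate=0.15, avg_winning_lift=0.04)
print(f"Expected winners: {result['expected_winners']:.1f}")
print(f"Cumulative lift: {result['cumulative_lift']:.1%}")
```

In this model most of the 40 tests contribute nothing, which matches the point above that the value comes from the few tests that win.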

Metric tree insight

KPI Tree lets you model the experimentation programme as a tree where each branch has an accountable owner. Test velocity sits with the growth team, win rate with the design and research leads, and shipped impact with the product owner. When a winning test ships, its measured lift flows into the tree, and the verified impact loop checks over time, not just at launch, whether the headline metric actually moved as predicted.

A/B testing analysis benchmarks

There is no single benchmark for a test result, because the right lift depends on the surface and the baseline. What does benchmark well is programme health: how many tests reach significance, how often they win, and how large the winning effects are. The ranges below reflect typical mature web and product experimentation programmes.

| Programme measure | Below par | Healthy | Strong |
| --- | --- | --- | --- |
| Share of tests reaching significance | Under 20 per cent | 20 to 40 per cent | Over 40 per cent |
| Win rate of completed tests | Under 10 per cent | 10 to 25 per cent | Over 25 per cent |
| Average lift of winning tests | Under 2 per cent | 2 to 8 per cent | Over 8 per cent |
| Tests launched per quarter | Under 5 | 5 to 20 | Over 20 |
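
The benchmark ranges above can be encoded as a simple lookup so a programme can be rated automatically. This is a minimal sketch; the dictionary keys and the function name are hypothetical, and values that sit exactly on a boundary are treated as Healthy.

```python
# (lower bound of Healthy, upper bound of Healthy) for each programme measure.
BENCHMARKS = {
    "share_reaching_significance_pct": (20, 40),
    "win_rate_pct": (10, 25),
    "avg_winning_lift_pct": (2, 8),
    "tests_per_quarter": (5, 20),
}


def rate_programme_measure(measure: str, value: float) -> str:
    """Classify a programme measure as Below par, Healthy, or Strong."""
    healthy_min, healthy_max = BENCHMARKS[measure]
    if value < healthy_min:
        return "Below par"
    if value > healthy_max:
        return "Strong"
    return "Healthy"


print(rate_programme_measure("win_rate_pct", 15))
print(rate_programme_measure("tests_per_quarter", 3))
```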

How to improve A/B testing analysis

Improving A/B testing analysis means raising the quality and throughput of decisions, not just running more tests. The aim is more trustworthy results, a higher win rate, and faster learning. These four practices move the needle most.

Pre-register the design

Fix the primary metric, guardrail metrics, and required sample size before launch. This removes the temptation to peek and stop early, which is the single biggest source of false positives.
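
The effect of peeking can be seen in a small simulation of A/A tests, where both groups share the same true conversion rate and every significant result is therefore a false positive. The sketch below is illustrative only: the pooled z-test, the ten interim looks, the sample sizes, and the seed are all assumptions chosen to keep it quick to run.

```python
import math
import random


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_false_positive_rates(simulations=300, users_per_arm=1000, looks=10,
                                  rate=0.10, alpha=0.05, seed=7):
    """Return (peeking_rate, fixed_rate) for A/A tests with no true effect."""
    rng = random.Random(seed)
    step = users_per_arm // looks
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(simulations):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            # Add the next batch of users to each group.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                significant_at_any_look = True
        # A peeker stops at the first significant look; a fixed-sample
        # analyst only reads the p-value from the final look.
        peeking_hits += significant_at_any_look
        fixed_hits += p < alpha
    return peeking_hits / simulations, fixed_hits / simulations


peeking_rate, fixed_rate = simulate_false_positive_rates()
print(f"False positive rate with peeking: {peeking_rate:.1%}")
print(f"False positive rate at fixed sample: {fixed_rate:.1%}")
```

Because the final look is one of the interim looks, the peeking rate can never be lower than the fixed-sample rate, and with repeated looks it is usually well above the nominal threshold.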

Size tests properly

Calculate the sample needed from the baseline rate, minimum detectable effect, and power. Underpowered tests produce noisy results that cannot be trusted however clean the analysis looks.
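
A common textbook approximation for the required sample size of a two-sided test on two proportions is sketched below. It is one of several formulas in use, so treat the output as a planning estimate rather than an exact requirement. The defaults of 5 per cent significance and 80 per cent power are conventional assumptions, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group to detect a relative lift of relative_mde."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Baseline of 4 per cent, aiming to detect a 10 per cent relative lift.
n = sample_size_per_variant(baseline_rate=0.04, relative_mde=0.10)
print(f"Users needed per variant: {n:,}")
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why small expected lifts on low-traffic surfaces often cannot be tested in a reasonable time.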

Add guardrail metrics

Track downstream metrics like retention alongside the primary metric so a win on clicks does not quietly degrade purchases or long-term engagement.
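
A guardrail check can be folded into the ship decision, as in the sketch below. It is deliberately simplified: it compares the observed change in each guardrail against a tolerance rather than testing it statistically, and the metric names, the 1 per cent tolerance, and the function name are all hypothetical.

```python
def ship_decision(primary_p_value: float, primary_lift_pct: float,
                  guardrail_changes_pct: dict, alpha: float = 0.05,
                  max_guardrail_drop_pct: float = 1.0) -> str:
    """Decide whether to ship, given the primary result and guardrail changes."""
    if primary_p_value >= alpha or primary_lift_pct <= 0:
        return "no ship: primary metric did not win"
    # A guardrail is breached when it falls by more than the tolerated drop.
    breached = [name for name, change in guardrail_changes_pct.items()
                if change < -max_guardrail_drop_pct]
    if breached:
        return "no ship: guardrail breached (" + ", ".join(sorted(breached)) + ")"
    return "ship"


# A win on clicks that quietly degrades purchases should not ship.
print(ship_decision(0.01, 6.0, {"purchases": -3.0, "retention": 0.2}))
print(ship_decision(0.03, 4.0, {"purchases": 0.5, "retention": -0.2}))
```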

Prioritise by expected impact

Score proposed tests by potential lift and reach so traffic and engineering time go to the experiments most likely to produce a meaningful, shippable result.
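
A basic version of this scoring can be written as a sort over proposed tests. The sketch below assumes the score is simply expected lift multiplied by reach. Real prioritisation frameworks often add confidence and effort terms. The ideas and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExperimentIdea:
    name: str
    expected_lift_pct: float  # team's estimate of the relative lift
    weekly_reach: int         # users who would see the change each week

    @property
    def score(self) -> float:
        return self.expected_lift_pct * self.weekly_reach


def prioritise(ideas: list) -> list:
    """Order ideas from highest to lowest expected impact."""
    return sorted(ideas, key=lambda idea: idea.score, reverse=True)


backlog = [
    ExperimentIdea("Pricing page layout", expected_lift_pct=5.0, weekly_reach=8_000),
    ExperimentIdea("Checkout button copy", expected_lift_pct=2.0, weekly_reach=50_000),
    ExperimentIdea("Onboarding checklist", expected_lift_pct=3.0, weekly_reach=20_000),
]
for idea in prioritise(backlog):
    print(f"{idea.name}: score {idea.score:,.0f}")
```

In this example the smallest expected lift ranks first because it reaches the most users, which is the trade-off the scoring is meant to expose.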

Common mistakes when tracking A/B testing analysis

  1. Peeking and stopping early

    Checking results before the test reaches its required sample, then stopping when the number looks good, inflates the false positive rate. Run a fixed-sample test to completion, or adopt a sequential method designed for valid early stopping.

  2. Ignoring multiple comparisons

    Running five variants at 95 per cent confidence without correction pushes the chance of at least one false positive to roughly 23 per cent, not 5 per cent. Apply a Bonferroni or similar adjustment.

  3. Confusing statistical and practical significance

    A significant result with a tiny effect may not be worth shipping. Define the minimum effect that would justify the change before launch and treat anything below it as inconclusive.

  4. Optimising a proxy metric

    A test that lifts clicks while lowering purchases is a net loss. Pair every primary metric with guardrails so you do not win the proxy and lose the goal.
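
The multiple comparisons figure in the list above follows from a short calculation, sketched here along with the Bonferroni adjustment. The calculation assumes the five comparisons are independent, which is a simplification when variants share a control group.

```python
def family_wise_error_rate(comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(comparisons: int, alpha: float = 0.05) -> float:
    """Per-comparison threshold that keeps the overall rate at or below alpha."""
    return alpha / comparisons


print(f"Uncorrected, five variants: {family_wise_error_rate(5):.1%}")
print(f"Bonferroni threshold per variant: {bonferroni_alpha(5):.3f}")
print(f"Corrected, five variants: {family_wise_error_rate(5, bonferroni_alpha(5)):.1%}")
```

Bonferroni is conservative, so it costs some statistical power. That is the price of keeping the overall false positive rate at the level agreed before launch.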

Related metrics

Conversion rate (CVR)

Marketing Metrics

Conversion Rate = (Number of Conversions / Total Visitors or Leads) × 100

Conversion rate measures the percentage of visitors, users, or leads who take a desired action, such as making a purchase, signing up for a trial, or submitting a form. It is the fundamental metric for evaluating the effectiveness of any acquisition funnel, landing page, or marketing campaign.

Click-through rate (CTR)

Marketing Metrics

CTR = (Clicks / Impressions) × 100

Click-through rate measures the percentage of people who click on a link, ad, or call-to-action after seeing it. It is one of the most fundamental engagement metrics in digital marketing, connecting impressions to action and serving as an early indicator of campaign relevance and audience targeting quality.

Feature adoption rate

Product Metrics

Feature Adoption Rate = (Users Who Used the Feature / Total Active Users) × 100

Feature adoption rate measures the percentage of users who use a specific feature within a given period. It tells product teams whether new features are resonating with users and which existing features are underutilised, guiding investment decisions and roadmap priorities.

Retention rate

Product Metrics

Retention Rate = (Users Active at End of Period / Users Active at Start of Period) × 100

Retention rate measures the percentage of users or customers who continue to use your product over a given period. It is the most important growth metric because sustainable growth is impossible when users leave faster than they arrive.

How to run an A/B test with metric trees

This guide shows you how to structure an A/B test inside a metric tree so the experiment evaluation feeds directly into the metrics it is meant to move.

Metric trees for product teams

For product teams running experiments: how A/B testing analysis fits alongside the other metrics a product team owns and tracks.
