A/B testing analysis
A/B testing analysis is the process of comparing two or more variants shown to randomly assigned groups to decide whether a change produces a meaningfully better outcome on a defined success metric. It combines the observed difference between variants with the statistical confidence that the difference is real and not chance. Done well, it replaces opinion with evidence and tells you which changes are worth shipping.
What is A/B testing analysis?
An A/B test splits users at random into groups. The control group sees the existing experience, the treatment group sees the proposed change, and the performance of each is measured on a metric agreed before the test begins. If a checkout page converts at 4.0 per cent for the control and 4.4 per cent for the variant, the absolute difference is 0.4 percentage points and the relative lift is (4.4 − 4.0) / 4.0, or 10 per cent.
The analysis has three parts that must be read together. The first is the observed difference, or lift, between variants. The second is statistical significance, which tells you whether a difference this large would be unlikely to appear through random noise alone. The third is practical significance, which asks whether the lift is large enough to justify the cost of shipping the change. A result can be statistically significant but too small to matter, or it can show a large difference that is not significant because the sample was too small.
Significance is usually expressed as a p-value or a confidence interval. A p-value below 0.05, which corresponds to a 95 per cent confidence level, is the common threshold. It means that if the change truly had no effect, a difference at least as large as the one observed would appear less than 5 per cent of the time. It does not mean there is a 95 per cent chance the variant is better. The right threshold depends on the cost of being wrong. A pricing change may warrant 99 per cent confidence, while a copy tweak may be actionable at 90 per cent.
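As a minimal sketch of how such a p-value can be computed for conversion rates, the function below runs a two-sided two-proportion z-test using only the Python standard library. The function name and the user counts are illustrative assumptions, not figures from a real test, and the normal approximation it relies on is only reasonable when each group has plenty of conversions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a |z| at least this large if there were no real difference.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 4.0% vs 4.4% conversion at two different sample sizes.
p_large = two_proportion_p_value(800, 20_000, 880, 20_000)
p_small = two_proportion_p_value(200, 5_000, 220, 5_000)
print(f"20,000 users per group: p = {p_large:.3f}")
print(f" 5,000 users per group: p = {p_small:.3f}")
```

With these assumed counts the same 10 per cent lift clears the 0.05 threshold at 20,000 users per group (p is roughly 0.046) but not at 5,000 (p is roughly 0.32), which is the "large difference, sample too small" case described earlier.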
Define the primary metric and the required sample size before the test goes live. Changing the success metric mid-test, or stopping the moment results look good, introduces bias that invalidates the result. Pre-registering the design is what separates rigorous experimentation from cherry-picking data.
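The required sample size can be estimated before launch with the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test, equal group sizes, and a minimum detectable effect expressed as a relative lift; the function name and default values are illustrative choices rather than settings prescribed by this article.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate if the variant hits the MDE
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical design: 4% baseline, 10% relative MDE, 95% confidence, 80% power.
n = sample_size_per_variant(baseline=0.04, relative_mde=0.10)
print(f"Users needed per variant: {n:,}")
```

Under these assumptions the answer is a little under 40,000 users per variant. Halving the minimum detectable effect roughly quadruples the requirement, which is why small expected lifts need long tests.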
How to calculate A/B testing analysis
The headline output of an A/B test is relative lift: the percentage by which the variant outperforms the control on the chosen metric. To trust that number you also need the sample size, the significance level, and a confidence interval around the lift. Lift alone is a point estimate, and a point estimate without a confidence interval can mislead.
Work through the inputs in order. Each one is a checkpoint that can invalidate the result if it is missing or wrong.
1. Control and variant outcome: Measure the success metric for each group, for example the conversion rate or revenue per user. The variant minus the control, divided by the control, gives the relative lift.
2. Sample size per variant: Count the number of users randomly assigned to each group. The required size is calculated before launch from the baseline rate, the minimum detectable effect, and the desired statistical power.
3. Statistical significance: Compute the p-value or confidence level for the observed difference. A p-value below the threshold agreed in advance means the difference is unlikely to be chance.
4. Confidence interval: Report the range the true lift is likely to fall within. A lift of 10 per cent with an interval of 4 to 16 per cent tells a far richer story than the single number.
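The lift and its confidence interval from the steps above can be sketched as follows. This version puts the interval on the relative lift by applying a normal approximation to the log of the ratio of the two rates, which is one common approach among several. The function name and the counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_lift_with_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (lift, lower, upper) for the variant's relative lift over control."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    ratio = rate_b / rate_a
    # Standard error of log(rate_b / rate_a), via the delta method.
    se_log = sqrt((1 - rate_a) / conv_a + (1 - rate_b) / conv_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = exp(log(ratio) - z * se_log) - 1
    upper = exp(log(ratio) + z * se_log) - 1
    return ratio - 1, lower, upper


# Hypothetical counts: control 800 / 20,000 (4.0%), variant 880 / 20,000 (4.4%).
lift, low, high = relative_lift_with_interval(800, 20_000, 880, 20_000)
print(f"Relative lift: {lift:.1%} (95% interval {low:.1%} to {high:.1%})")
```

With these assumed counts the point estimate is 10 per cent, but the interval runs from just above zero to roughly 21 per cent. The wide range shows why the lift should not be reported without it.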
A/B testing analysis in a metric tree
A single test result is hard to value in isolation. A metric tree lifts the focus from one experiment to the performance of the whole experimentation programme, connecting test velocity, win rate, and effect size to the cumulative business impact of the winners you ship.
The decomposition below shows how programme impact breaks down into the levers a team actually controls. Reading it top to bottom makes it clear why a programme can run many tests yet produce little impact: most tests do not reach significance, and the value comes from the few that do.
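One way to make the relationship between these levers concrete is a small calculation. The sketch below is an assumed, simplified model and not the KPI Tree decomposition itself: it treats every winning test as delivering the same average lift to the same headline metric and compounds those lifts, which is optimistic because real wins overlap and measured lifts often shrink after launch.

```python
def winners_shipped(tests_launched, win_rate):
    """Expected number of winning tests in the period."""
    return tests_launched * win_rate


def cumulative_lift(tests_launched, win_rate, average_winning_lift):
    """Compounded lift on the headline metric under the simplified model."""
    winners = winners_shipped(tests_launched, win_rate)
    return (1 + average_winning_lift) ** winners - 1


# Hypothetical quarter: 20 tests, 20% win rate, 4% average lift per winner.
impact = cumulative_lift(tests_launched=20, win_rate=0.20, average_winning_lift=0.04)
print(f"Winners shipped: {winners_shipped(20, 0.20):.0f}")
print(f"Modelled cumulative lift: {impact:.1%}")
```

In this example only four of the twenty tests contribute anything, so doubling the win rate or the average effect size matters as much as doubling the number of tests launched.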
Metric tree insight
KPI Tree lets you model the experimentation programme as a tree where each branch has an accountable owner. Test velocity sits with the growth team, win rate with the design and research leads, and shipped impact with the product owner. When a winning test ships, its measured lift flows into the tree, and the verified impact loop checks whether the headline metric actually moved as predicted after the change shipped, not only during the test.
A/B testing analysis benchmarks
There is no single benchmark for a test result, because the right lift depends on the surface and the baseline. What does benchmark well is programme health: how many tests reach significance, how often they win, and how large the winning effects are. The ranges below reflect typical mature web and product experimentation programmes.
| Programme measure | Below par | Healthy | Strong |
|---|---|---|---|
| Share of tests reaching significance | Under 20 per cent | 20 to 40 per cent | Over 40 per cent |
| Win rate of completed tests | Under 10 per cent | 10 to 25 per cent | Over 25 per cent |
| Average lift of winning tests | Under 2 per cent | 2 to 8 per cent | Over 8 per cent |
| Tests launched per quarter | Under 5 | 5 to 20 | Over 20 |
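The bands in the table can be applied programmatically when reviewing a programme. The sketch below encodes the thresholds from the table; the dictionary keys and the function name are hypothetical labels chosen for this example.

```python
# (lower bound of "Healthy", upper bound of "Healthy") for each measure,
# taken from the benchmark table. Percentages are expressed as numbers 0-100.
BANDS = {
    "share_reaching_significance_pct": (20, 40),
    "win_rate_pct": (10, 25),
    "average_winning_lift_pct": (2, 8),
    "tests_per_quarter": (5, 20),
}


def rate_programme_measure(measure, value):
    """Classify a value as 'Below par', 'Healthy', or 'Strong'."""
    low, high = BANDS[measure]
    if value < low:
        return "Below par"
    if value > high:
        return "Strong"
    return "Healthy"


# Hypothetical programme figures for one quarter.
programme = {
    "share_reaching_significance_pct": 32,
    "win_rate_pct": 8,
    "average_winning_lift_pct": 9.5,
    "tests_per_quarter": 12,
}
for measure, value in programme.items():
    print(f"{measure}: {value} -> {rate_programme_measure(measure, value)}")
```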
How to improve A/B testing analysis
Improving A/B testing analysis means raising the quality and throughput of decisions, not just running more tests. The aim is more trustworthy results, a higher win rate, and faster learning. These four practices move the needle most.
Pre-register the design
Fix the primary metric, guardrail metrics, and required sample size before launch. This removes the temptation to peek and stop early, which is one of the most common sources of false positives.
Size tests properly
Calculate the sample needed from the baseline rate, minimum detectable effect, and power. Underpowered tests produce noisy results that cannot be trusted however clean the analysis looks.
Add guardrail metrics
Track downstream metrics like retention alongside the primary metric so a win on clicks does not quietly degrade purchases or long-term engagement.
Prioritise by expected impact
Score proposed tests by potential lift and reach so traffic and engineering time go to the experiments most likely to produce a meaningful, shippable result.
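As a rough illustration of scoring by expected impact, the sketch below ranks a backlog by the extra conversions each idea would produce per week if it won. The scoring formula, the class, and every figure in the backlog are assumptions made for this example; a real programme might also weigh confidence in the idea and the effort to build it.

```python
from dataclasses import dataclass


@dataclass
class ExperimentIdea:
    name: str
    expected_lift: float   # assumed relative lift if the test wins
    weekly_reach: int      # users who would see the change each week
    baseline_rate: float   # current conversion rate on that surface

    @property
    def expected_extra_conversions(self) -> float:
        # Reach x baseline gives current conversions; lift scales the gain.
        return self.weekly_reach * self.baseline_rate * self.expected_lift


def prioritise(ideas):
    """Sort ideas from highest to lowest expected weekly gain."""
    return sorted(ideas, key=lambda i: i.expected_extra_conversions, reverse=True)


# Hypothetical backlog.
backlog = [
    ExperimentIdea("Checkout button copy", 0.02, 50_000, 0.040),
    ExperimentIdea("Pricing page layout", 0.08, 8_000, 0.030),
    ExperimentIdea("Onboarding checklist", 0.05, 20_000, 0.100),
]
ranked = prioritise(backlog)
for idea in ranked:
    print(f"{idea.name}: {idea.expected_extra_conversions:.1f} extra conversions/week")
```

Note that the idea with the largest expected lift ranks last here because so few users would see it, which is the point of weighing lift against reach.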
Common mistakes when tracking A/B testing analysis
1. Peeking and stopping early: Checking results before the test reaches its required sample, then stopping when the number looks good, inflates the false positive rate. Run a fixed-sample test to completion, or adopt a sequential method designed for valid early stopping.
2. Ignoring multiple comparisons: Comparing five variants against a control, each at 95 per cent confidence and without correction, pushes the chance of at least one false positive to roughly 23 per cent, not 5 per cent. Apply a Bonferroni or similar adjustment.
3. Confusing statistical and practical significance: A significant result with a tiny effect may not be worth shipping. Define the minimum effect that would justify the change before launch and treat anything below it as inconclusive.
4. Optimising a proxy metric: A test that lifts clicks while lowering purchases is a net loss. Pair every primary metric with guardrails so you do not win the proxy and lose the goal.
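The multiple-comparisons figure above can be checked with a few lines of arithmetic. The sketch below assumes the comparisons are independent, which is the simplification behind the 23 per cent figure, and shows the Bonferroni adjustment as one possible correction. The function names are illustrative.

```python
def family_wise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison threshold that keeps the overall rate near alpha."""
    return alpha / comparisons


uncorrected = family_wise_error_rate(alpha=0.05, comparisons=5)
per_test = bonferroni_alpha(alpha=0.05, comparisons=5)
corrected = family_wise_error_rate(alpha=per_test, comparisons=5)

print(f"Uncorrected chance of a false positive: {uncorrected:.1%}")
print(f"Bonferroni threshold per comparison:    {per_test:.3f}")
print(f"Corrected chance of a false positive:   {corrected:.1%}")
```

With five comparisons the per-comparison threshold drops to 0.01, which brings the overall false positive rate back to just under 5 per cent at the cost of needing more data to detect a real effect.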
Related metrics
Conversion rate (CVR)
Marketing Metrics · Metric Definition
Conversion Rate = (Number of Conversions / Total Visitors or Leads) × 100
Conversion rate measures the percentage of visitors, users, or leads who take a desired action, such as making a purchase, signing up for a trial, or submitting a form. It is the fundamental metric for evaluating the effectiveness of any acquisition funnel, landing page, or marketing campaign.
Click-through rate (CTR)
Marketing Metrics · Metric Definition
CTR = (Clicks / Impressions) × 100
Click-through rate measures the percentage of people who click on a link, ad, or call-to-action after seeing it. It is one of the most fundamental engagement metrics in digital marketing, connecting impressions to action and serving as an early indicator of campaign relevance and audience targeting quality.
Feature adoption rate
Product Metrics · Metric Definition
Feature Adoption Rate = (Users Who Used the Feature / Total Active Users) × 100
Feature adoption rate measures the percentage of users who use a specific feature within a given period. It tells product teams whether new features are resonating with users and which existing features are underutilised, guiding investment decisions and roadmap priorities.
Retention rate
Product Metrics · Metric Definition
Retention Rate = (Users Active at End of Period / Users Active at Start of Period) × 100
Retention rate measures the percentage of users or customers who continue to use your product over a given period. It is the most important growth metric because sustainable growth is impossible when users leave faster than they arrive.
How to run an A/B test with metric trees
Metric Definition
This guide shows you how to structure an A/B test inside a metric tree so the experiment evaluation feeds directly into the metrics it is meant to move.
Metric trees for product teams
Metric Definition
Product teams running experiment evaluation will see how A/B testing analysis fits alongside the other metrics a product team owns and tracks.