When people think about quantitative testing in product experimentation, A/B testing tends to be the first method that springs to mind.
Although A/B testing is the most common form of product experimentation, there are many different forms of experiment.
These range from controlled experiment types, such as multi-variant testing and non-inferiority testing, to experiment proxies for product changes, such as comparing market A to market B, or running before-and-after tests.
Every quantitative test has its own use cases, benefits and limitations, as does quantitative testing as a whole within product development.
In this guide we’re going to take you through different types of quantitative tests, discussing when you might use them, and when you might choose not to.
We’ll also take you through alternative approaches to quantitative testing when it comes to shipping products with reasonable certainty, discuss the limitations of quantitative testing in general, and cover what you can do instead.
By the end of this guide you should feel comfortable with many different types of quantitative tests, know when to use each, and also feel comfortable in scenarios where you decide not to use them at all.
Introduction to quantitative testing
The two categories of quantitative tests are:
- Controlled experiments: randomizing the split of traffic across experience variants. A/B testing sits in this category, as does multi-variant testing. These are typically used by large-scale B2C sites; many B2B sites do not have the traffic to test this way.
- Experiment proxies: instead of randomly splitting controlled traffic, this approach compares data from two different groups. Examples include splitting traffic by market, or before-and-after time-frame analysis.
Every approach to experimentation has its own benefits, limitations, and scenarios that it best applies to.
Controlled product experimentation
These are often used in scenarios where smaller changes can have a big impact, and where a high degree of certainty is required.
Typically this is the case for bigger sites with a lot of traffic. Low traffic sites are often not able to test due to user volume.
Get the Hustle Badger Guide to When to A/B test
Benefits of controlled experiments
Controlled experiments are called controlled because only one or two factors are changed at a time, while all other conditions are kept constant.
This gives you a high degree of accuracy when measuring the impact of the changes you are making.
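To make this concrete, here is a minimal sketch of how the controlled split itself is often implemented: deterministically hashing the user ID together with the experiment name, so each user always sees the same variant and everything else stays constant. The function name and the even two-way split are illustrative assumptions, not a reference to any particular experimentation tool.

```python
import hashlib

# Illustrative sketch: deterministic bucketing for a controlled experiment.
# The same user always lands in the same variant, and the split is
# approximately even across the listed variants.
def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]
```

Because the assignment is a pure function of the user and experiment, a returning user never flips between variants mid-test, which is what keeps the comparison controlled.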
Limitations of controlled experiments
You can only change a small number of things at a time. This approach lends itself to optimization rather than big shifts.
Controlled experiments also aren’t feasible in a number of scenarios, such as when traffic is constrained, when testing SEO, or when there are legal or other constraints: for example, when you must ship a required permission to all users, or roll out a rebrand across the whole site at once.
Proxy product experimentation methods
These are often used when a controlled test isn’t feasible, or when a less rigorous testing method is enough to tell you whether a change has had an impact (positive or negative).
Benefits of proxy product experimentation methods
Using an experiment proxy allows you to move faster and ship more.
Using pre/post analysis, you can get an idea of how a change is performing if there is a big shift, and you can still spot any major negative impacts.
There are many scenarios where a controlled test isn’t possible, so proxy experimentation methods are useful parts of the toolkit.
Limitations of proxy product experimentation methods
Using an experiment proxy means results are not conclusive, and you don’t have certainty in what happened.
You cannot be confident as to whether the impact you’re seeing is due to a positive change in the experience, or an external factor (e.g. a change in the market, or seasonality).
It is also difficult to dig into the why behind a change. You can drill down into individual metrics, but without a conclusive result it is hard to untangle the overall impact.
Factors influencing which category of test to pick
Normally the deciding factor for which type of test you run is either a technical or legal constraint, or the amount of traffic you have.
Traffic constraints
When running a controlled experiment, you want to answer your hypothesis in a reasonable time frame. We suggest that time frame should be no longer than four weeks.
Controlled experiment run-time is influenced by:
- The metric you will measure: how many users reach the point in the journey where this metric is recorded?
- The uplift you expect to see on this metric: does it need to be a positive change, and if so, by how much should the metric move? The smaller the change you need to detect, the more users you need.
- The sample size: will all of your audience be exposed to the experiment, and where in the journey will it fire?
However in many scenarios you may find you don’t have enough traffic to achieve a result within a desirable run-time.
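As a rough sketch of how these factors combine, the standard two-proportion sample-size approximation shows how the baseline rate and the uplift you want to detect drive run-time. The specific numbers below (a 5% baseline conversion rate, a 10% relative uplift, 5,000 eligible users per day) are illustrative assumptions, not benchmarks:

```python
from statistics import NormalDist

def required_sample_size(baseline, rel_uplift, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect a relative uplift
    on a conversion rate (two-sided two-proportion z-test)."""
    p1, p2 = baseline, baseline * (1 + rel_uplift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_b = NormalDist().inv_cdf(power)          # statistical power
    pooled = (p1 + p2) / 2
    n = ((z_a * (2 * pooled * (1 - pooled)) ** 0.5
          + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / (p2 - p1) ** 2)
    return int(n) + 1

# Illustrative: 5% baseline conversion, 10% relative uplift,
# 5,000 eligible users per day split across two variants.
per_variant = required_sample_size(0.05, 0.10)
run_time_days = 2 * per_variant / 5_000
```

Note how sensitive the result is: halving the uplift you want to detect roughly quadruples the sample you need, which is why low-traffic sites so often blow past a four-week run-time.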
Ways to run controlled tests in a low traffic environment
If you have to run a controlled test in a low traffic environment, there are some workaround techniques that can allow your experiments to conclude in a reasonable time-frame:
Reduce the statistical significance you’re trying to reach.
Most companies aim for a 95% confidence level; in the early stages you might be happy to accept 80%. This can still give you a strong likelihood that your new variant outperforms the original.
Most experiment dashboards will also show you the maximum negative impact the change could have if the true result is actually negative.
This allows you to reach statistical significance more quickly while still weighing risk against growth. That said, while this makes the test more certain than a before-and-after proxy, it doesn’t make it fully robust.
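Both ideas, lowering the confidence level and reading off the worst plausible downside, can be sketched with a simple two-proportion confidence interval. The example figures (a 5.0% vs 5.6% conversion rate across 10,000 users each) are made up for illustration:

```python
from statistics import NormalDist

def uplift_interval(conv_a, n_a, conv_b, n_b, confidence=0.80):
    """Observed difference in conversion rate (B minus A) plus the
    lower bound of a two-sided z-interval at the chosen confidence.
    The lower bound is the worst plausible downside of shipping B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se

# Same data, two confidence levels.
diff, low_80 = uplift_interval(500, 10_000, 560, 10_000, confidence=0.80)
_, low_95 = uplift_interval(500, 10_000, 560, 10_000, confidence=0.95)
```

With these made-up numbers the 80% interval sits entirely above zero while the 95% interval still straddles it: exactly the trade-off described above, in that you conclude sooner, but with weaker guarantees.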
Aim for a bigger change
The bigger the variation between your original variant and the new variant, the quicker you will be able to see a conclusive result. If you’re in early-stage growth, it’s likely that you’ll be making big changes that have a big impact.
Test on high-traffic pages
Start running experiments on your ads or landing pages, and test changes above the fold. This way you’re maximizing the sample size of the traffic that you do have.
Having said that, even with this approach there will still be times when this is just unfeasible, and you need to use a proxy approach instead.
Technical, legal or commercial constraints
There are times when you have no option but to adopt a proxy test approach. Here are some of the most common reasons why:
- Legal / risk-driven features: in heavily regulated industries, or when overarching new legal requirements come in, sometimes you have no choice but to ship to all users. A good recent example of this was cookie banners that let users opt into some but not all cookies.
- Technical: there will always be times where you cannot technically split a feature, but additionally there are types of pages which cannot be tested. One of the most well known examples is SEO A/B testing: you can’t split the user experience for users who land on a single URL; and you can’t test SEO uplift for a single query across multiple URLs.
- Commercial: there may be times where your commercial function advocates for site wide shipping of a change. Common examples include rebrands or site wide pricing changes, where the downside to showing users a different experience on different pages outweighs the possible upside of controlled testing.
Let’s now go through both quantitative test categories in detail.
Controlled experiments
Controlled experiments are those where
- You have a high degree of control over who sees variations and what variations they see
- You gain a statistically reliable measurement of the effect that these tests have on users
There are multiple different types of controlled test:
- A/B testing: good for gaining a high degree of certainty in the impact of a specific change
- A/A testing: good for checking your test tooling works, and training your team on result fluctuations prior to statistical significance
- Multi-variant testing: A/B tests on steroids (sometimes referred to as A/B/C/D… tests); good for testing multiple different treatments at once
- Non-inferiority testing: to check the new treatment is no worse than the original
- Hold out groups: good for seeing the big picture over longer time periods