Chapter 2 How to Set Up & Run a CRO Hypothesis & Testing Program
Once all the data is in, it is time to compile the elements that need improvement and consider how to improve them.
Each website issue detected through research must be included in a comprehensive list. This list is continuously updated as research uncovers new issues. The next step is devising solutions for those issues.
When addressing site issues, there are two types of problems:
- Those that are actionable immediately without any testing
- Those that require testing
Immediately actionable problems are often technical issues that can only be solved in a single way — you just fix them, without the need to hypothesize or test. These types of issues must be fixed immediately and efficiently.
Other problems may present two or more alternative solutions. For example, you may discover that visitors do not scroll the entire length of the product page. To solve this issue, you may propose creating a short-format page with condensed copy and alternative layouts. This is the seed of a hypothesis.
A hypothesis aims to explain an observed trend and create a solution that will improve the result.
An idea itself is not enough for a hypothesis. You need to define the exact scope of the issue:
- How many visitors are affected?
- What areas of the site are affected?
- How much revenue is being lost (if possible to determine using analytics)?
- And how do you expect the proposed solution will improve the performance of the website?
Finally, you need to have a rough idea of how much effort it will take to implement the solution.
How ideas turn into hypotheses, hypotheses turn into tests, and the cycle repeats
A sound hypothesis, based upon the product-page scroll example from the beginning of this section, might look like this:
- Problem: Based on heat and scroll maps, quantitative data, and survey results that confirmed those findings, Product Page X is too long and visitors do not convert because they never scroll all the way to the call-to-action button.
- Proposed solution: Condense copy and make only key product features and CTA button visible above the fold.
- Effort required to implement solution: 6 hours
- Expected result: Increased conversions from the current 0.6% to the overall site average of 2.5%.
Once you’ve included these main elements, which make it possible to rank hypotheses according to their importance, required effort, and expected results, you have a valid hypothesis. All models for hypothesis prioritization are based on some combination of these factors.
1. Hypothesis priority list creation
Prioritizing your hypotheses is a critical step in the CRO process. Done properly, it creates the perfect preconditions for building an effective test program.
The idea of prioritization is to make it easy to decide what hypothesis to test first.
Because every hypothesis is built from the same standardized elements, prioritizing is easy. Just select the factors you deem most important, and create a ranking system.
Let’s briefly examine a few existing models for hypothesis prioritization and how to apply them.
PIE model
This model of hypothesis prioritization ranks hypotheses according to their Potential for improvement, Importance and Ease.
- “Potential for improvement” means how likely it is that the hypothesis will result in an overall improvement.
- “Importance” refers to the severity of the observed issue.
- “Ease” denotes the effort necessary to implement the hypothesis.
In the PIE model, each factor is scored on a scale of 1 to 10, and hypotheses are then ranked from highest to lowest priority by their combined score.
The weakness of this model is that the “Potential” part of the equation is often hard or impossible to estimate, and may be defined arbitrarily.
This leads to incorrect prioritization and possibly solving only minor issues on the assumption that they’ll achieve significant conversion improvements.
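As a rough illustration of how a PIE-style ranking could be kept in a script rather than a spreadsheet, here is a minimal Python sketch. The `Hypothesis` class, the example backlog, and the choice to combine the three factors as a simple unweighted average are all assumptions made for illustration; the scores themselves are made up.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    potential: int   # 1-10: how likely the change is to bring an improvement
    importance: int  # 1-10: severity of the observed issue
    ease: int        # 1-10: 10 means very little effort to implement

    @property
    def pie_score(self) -> float:
        # Unweighted mean of the three factors (an assumption of this sketch).
        return (self.potential + self.importance + self.ease) / 3


def prioritize(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    # Highest score first, so the top entry is the first test to run.
    return sorted(hypotheses, key=lambda h: h.pie_score, reverse=True)


# Hypothetical backlog with invented scores.
backlog = [
    Hypothesis("Shorten product page copy", potential=8, importance=9, ease=6),
    Hypothesis("Change CTA button colour", potential=3, importance=4, ease=10),
    Hypothesis("Redesign checkout flow", potential=9, importance=9, ease=2),
]

ranked = prioritize(backlog)
for h in ranked:
    print(f"{h.pie_score:.1f}  {h.name}")
```

Because the “Potential” score is a judgment call, it is worth re-scoring the list whenever new research data comes in.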
TIR model
This model uses Time, Impact, and Resources as its main factors. Ranking is similar to the PIE model, except the scale runs from 1 to 5.
Developed by CRO veteran Bryan Eisenberg, this model is tied to the research model Plan, Measure, Improve, and works best when applied with that model.
ICE model
The Impact, Confidence, and Ease model is very similar to PIE, except it uses a confidence factor in place of “potential”. As with PIE, this makes it highly susceptible to subjective opinions and therefore potentially risky.
PXL model
The PXL model was developed by ConversionXL, one of the leading CRO agencies.
This model is the most complex of the four, as it tries to take into account a number of objective and real indicators with which to rank hypothesis priority. Its only weakness is its complexity and the effort required to properly use it.
Conducting a test program becomes much easier when hypotheses are properly prioritized. That way, the best-ranking hypothesis becomes the first test to conduct.
The advantages of this approach are twofold:
- You start with the hypothesis expected to net the greatest improvement, and therefore the most revenue growth.
- Large wins at the outset increase confidence in the testing program and the effectiveness of CRO in general.
Here’s an example of a hypothesis priority list:
[chart]
This is an example of a hypotheses list created using the PXL model
While most fields are self-explanatory, the ‘Bucket’ field of the table refers to a category of solution.
2. Conducting the test program
Once you have formulated your hypotheses and graded them according to your preferred model, the next step is to establish a testing program.
A testing program is needed because each test has different technical requirements. For each variation, you must develop a mockup or wireframe of the page design that will be the subject of the experiment. Furthermore, each test needs to have a calculated sample size, based on the expected effect (we’ll discuss how to calculate sample size below).
Knowing these technical requirements in advance makes it possible both to more accurately predict the duration of the entire test program and to report on the implementation progress.
A test program lists the tests derived from hypotheses and includes the following information for each test:
- Description of the test and expected result of each experiment
- Target audience
- Type of test used (we’ll get into these below)
- Mockups of the variation proposals
- Estimate of the test sample size and duration
- Estimate of the potential for interference with other tests, to avoid conflicting results
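One way to keep these fields together, and to get a first estimate of duration, is a small record per test. The sketch below is only an illustration: the `PlannedTest` class, its field names, and all numbers are hypothetical, and it assumes traffic is steady and split evenly between variations.

```python
import math
from dataclasses import dataclass


@dataclass
class PlannedTest:
    description: str
    audience: str
    test_type: str              # e.g. "A/B", "multivariate", "bandit"
    variations: int             # number of versions, including the original
    sample_per_variation: int   # taken from a sample size calculation
    daily_visitors: int         # eligible visitors per day (assumed steady)

    def estimated_days(self) -> int:
        # Total visitors needed divided by the traffic available per day.
        total_needed = self.variations * self.sample_per_variation
        return math.ceil(total_needed / self.daily_visitors)


# Hypothetical entry based on the product page example.
plan = PlannedTest(
    description="Short-format product page vs. current long page",
    audience="All visitors to Product Page X",
    test_type="A/B",
    variations=2,
    sample_per_variation=12_000,
    daily_visitors=1_500,
)
print(plan.estimated_days())  # 24,000 visitors / 1,500 per day = 16 days
```

Summing the estimates of tests that cannot run in parallel gives a rough duration for the whole program.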
Testing programs have a dual purpose: to make it easier to follow the issues solved so far, and to provide a template for the final report on the number of tests conducted and the issues solved.
When you complete the test program and are certain that all your hypotheses have been added to it, you can start actually testing.
An important note: Even unsuccessful or inconclusive tests can serve a valuable purpose in the CRO process. They indicate where the process has gone off-track, and help you learn what does not work.
If a series of tests goes wrong late in the process, it can be the best indicator that the current design has reached its limits, and that to achieve better results, a radical design change is necessary.
Before you decide to do this, though, you should ensure that your hypotheses are sound, and that your tests are framed, implemented, and evaluated properly. Otherwise, you may ignore the real issues with the process itself, focusing instead on the website and perceived issues.
To understand testing better, let’s see how the tests themselves work.
Test types, proper test conduction + the fundamental statistics of testing
In order to understand the actual mechanics of testing, we will now examine some of the statistical concepts that are the basis for running a test and estimating its results.
A/B testing
Let’s start with basic A/B testing: the foundational type of test used in CRO. It means pitting two different variations of a web page against each other to determine which produces better results.
Version A may have a negative effect on conversion, while version B may produce positive results
Site traffic is split equally between the two variations, and visitors’ behavior is monitored. The one that results in better performance of the website toward a set metric is declared the winner.
A/B testing relies on two primary types of testing methods:
- Frequentist inference
- Bayesian inference
Both are statistical approaches applied to a given sample of results to determine whether the variations really differ, and to find out which one performs better.
When we want to find a solution to issues we’ve detected on a web page, only very rarely will we have a clear-cut, binary solution.
Most likely, there will be several different ways to solve the issue. For example, we may have solutions A, B, C and D — meaning that instead of running a test with two variations (A and B), we’ll run a four-variation test.
This changes very little in the testing process, except the traffic distribution. However, each new variation introduces additional uncertainty into the test and may lead to “cumulative alpha error”: a statistical concept that refers to the growing role of pure chance as you increase the number of test variations. Every extra variation is one more comparison that can produce a false winner by chance alone.
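To get a feel for how quickly cumulative alpha error builds up, the following sketch computes the chance of at least one false positive across several comparisons. It assumes each comparison against the original is independent and uses a 5% significance level; both are simplifying assumptions, and the function name is made up for illustration.

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


# One, two, or three challengers, each compared against the original.
for challengers in (1, 2, 3):
    rate = familywise_error_rate(0.05, challengers)
    print(f"{challengers} comparison(s): {rate:.1%} chance of a false winner")
```

With three challengers (an A/B/C/D test), the chance of at least one false winner is roughly 14% instead of 5%.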
A/B testing tools
A/B testing is performed using specialized tools that do a simple job of great importance: they split the traffic between different variations evenly (or according to user input), then evaluate and report the result.
Optimizely, Visual Website Optimizer, and Google Optimize are the most common tools used to conduct A/B testing, mostly due to their accessibility.
Tools such as Maxymiser or Adobe Target are very powerful, but their cost limits their use to large ecommerce stores that are able to finance the expense. There are a number of other tools available to compare.
All A/B testing tools work by leaving a tracking cookie on each visitor’s browser. When users delete their cookies, they may be served a different variation of an experiment the next time they visit the website. This is what causes sample pollution in tests that run for a long time.
If sample pollution were limited only to those who deleted their cookies, this would not be much of an issue. Alas, browsers also tend to clean cookies periodically, which automatically creates a significant sample pollution problem.
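The effect of a lost cookie can be illustrated with a small sketch. This is not how any particular tool is implemented; it simply assumes the cookie stores a random visitor ID and that the variation is derived from that ID by hashing. The function name, experiment name, and variation labels are hypothetical.

```python
import hashlib
import uuid

VARIATIONS = ["original", "variation_b"]


def assign_variation(visitor_id: str, experiment: str) -> str:
    """Map a visitor ID to a variation; the same ID always gets the same one."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return VARIATIONS[int(digest, 16) % len(VARIATIONS)]


# A returning visitor whose cookie survived keeps the same experience.
cookie_id = str(uuid.uuid4())
first_visit = assign_variation(cookie_id, "short-product-page")
second_visit = assign_variation(cookie_id, "short-product-page")
assert first_visit == second_visit

# After the cookie is cleared the visitor looks brand new, so the
# assignment is effectively re-randomised and may differ.
new_cookie_id = str(uuid.uuid4())
after_clearing = assign_variation(new_cookie_id, "short-product-page")
print(first_visit, after_clearing)
```

In a two-variation test, a visitor with a fresh ID has about an even chance of landing in the other variation, which is why long-running tests accumulate polluted samples.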
To overcome this problem, you should let tests run for a maximum of one month if the sample size is sufficient.
But this is just one of the rigorous statistical rules we need to follow in order to ensure valid test results. Let’s examine the other rules closely.
Frequentist inference
Frequentist inference relies on the idea that given a large enough sample, you will be able to derive an accurate prediction of future trends based on data gathered in the past.
By using measures such as averages, variances, and standard deviations, we can describe any sample of events with enough accuracy to predict the probable value of any further element that we add to the sample.
According to this theory, we can test whether our hypothesis influences the average values of any given sample. Using a statistical method called a T-test, we can observe a sample after the change is introduced to either confirm or disprove our hypothesis.
A T-test relies on the concept of statistical significance — this is the ability to register the change in a sample and attribute it to something other than pure chance.
If the measurements of the sample have changed by more than chance alone would explain, we take that as evidence that the hypothesis had the intended effect.
A simplified example would be to test if a coin toss is truly random and the coin itself has not been weighted or tampered with. Our basic hypothesis would be that the probabilities of a toss being heads or tails are equal (a 50% likelihood of each).
To test this hypothesis, toss a coin a number of times and note the result. If, after a sufficient number of tosses, the result is a roughly equal number of heads and tails, our hypothesis holds up. If not, and there are markedly more tails than heads or vice versa, then we can reasonably conclude that the coin itself is biased and somebody is cheating.
The tricky part of successful testing is to determine the “sufficient” number of observations (in the coin-toss example, observations = tosses). The simple answer to this question is: as many as you can afford to.
Would you be able to say that the coin is unfairly weighted if, after 10 tosses, six landed on heads and four on tails? No.
You can give this answer intuitively, based on assumption. It is obvious to everyone that it is possible to have a slight deviation in your test results due to pure chance and factors you can’t change: for example, the effects of gravity, wind, a finger flip, or any number of other effects unaccounted for.
After 100 tosses with a 60 – 40 result, though, you’d be wondering — and you’d probably decide that there is something wrong with the coin.
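The intuition can be checked with a few lines of Python. The sketch below uses an exact binomial test, which is one reasonable way to put a number on “how surprising is this result if the coin is fair?”; the function name is made up for illustration.

```python
from math import comb


def two_sided_p_value(heads: int, tosses: int) -> float:
    """Exact binomial test against a fair coin (p = 0.5).

    Adds up the probability of every outcome at least as far from
    tosses / 2 as the one observed.
    """
    distance = abs(heads - tosses / 2)
    extreme = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - tosses / 2) >= distance
    )
    return extreme / 2 ** tosses


print(f"6 heads in 10 tosses:   p = {two_sided_p_value(6, 10):.3f}")
print(f"60 heads in 100 tosses: p = {two_sided_p_value(60, 100):.3f}")
```

A fair coin produces a split at least as uneven as 6 to 4 about three times out of four, so that result tells us nothing. A split at least as uneven as 60 to 40 happens only about 6% of the time with a fair coin, which is why it starts to look suspicious.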
In this simple example, we can rely on intuition, and we wouldn’t be far off.
A/B testing for CRO, however, is invariably more complex. Conversions on a website are influenced by many different factors and many of them can only be guessed at.
Relying on intuition can be downright dangerous for the business.
If we are reduced to relying on intuition in CRO, it means we’d better forsake the entire exercise and save on expenses. Fortunately, and thanks to the elaborate science of statistics, we do not need to rely on intuition. We have mathematical formulas. (Don’t worry, you don’t need to learn them by heart.)
It is time for a little statistics lingo. Once again, don’t worry, we’ll only explain a few basic concepts necessary to understand statistics.
Statistical Lingo You Need to Know
Sample Size
Let’s begin with “sample size,” as it’s a critical determinant of the entire process.
The sample size in statistics is defined as the number of observations necessary to reach a decision. It ultimately depends on the size of the effect we want to measure. In CRO, conversions are the only important metric we want to measure, so the effect will most likely be the difference in conversion rate between the original and the variation. The larger the effect (or difference) we expect, the smaller the sample size we’ll need to spot that effect.
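The relationship between expected effect and sample size can be sketched with the standard normal-approximation formula for comparing two proportions. The defaults below (5% two-sided significance, 80% power) are common conventions assumed here, not values from this guide, and online calculators may use slightly different formulas and so give slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(
    baseline: float,
    expected: float,
    alpha: float = 0.05,   # two-sided significance level (assumed)
    power: float = 0.80,   # chance of detecting the effect if it is real (assumed)
) -> int:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pooled = (baseline + expected) / 2
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_beta * sqrt(baseline * (1 - baseline) + expected * (1 - expected))
    ) ** 2
    return ceil(numerator / (expected - baseline) ** 2)


# Hypothetical conversion rates.
small_lift = sample_size_per_variation(0.020, 0.022)  # 2.0% -> 2.2%
large_lift = sample_size_per_variation(0.020, 0.030)  # 2.0% -> 3.0%
print(small_lift, large_lift)
```

Under these assumptions, detecting a lift from 2.0% to 3.0% needs a few thousand visitors per variation, while detecting a lift from 2.0% to 2.2% needs tens of thousands.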
Null Hypothesis
Actual hypothesis testing means trying to prove that an original hypothesis or a ‘null hypothesis’ (meaning the one we start with) is wrong and needs to be replaced by an alternative one. The process of proving a hypothesis wrong is similar to a criminal case trial.
- The defendant (null hypothesis) is considered innocent (correct) until proven guilty beyond a reasonable doubt.
- The prosecution (you) gathers evidence to the contrary and examines witnesses to prove the null hypothesis’ guilt.
- The defense (assumptions of the null hypothesis) disputes the evidence and testimony.
- The jury (your A/B testing program) weighs evidence objectively and decides its credibility and relevance to the case.
Evidence
When enough evidence is in (AKA you’ve reached the required sample size), the defendant is proven guilty or not.
Evidence collection in A/B testing is sampling. It consists of making a number of measurements, or observations, and recalculating the sample attributes:
- Average
- Variance
- Standard deviation
For the purposes of CRO, we observe individual visitor interaction with the website and measure the change in visitors’ behavior — namely conversions.
We aim to reach the calculated sample size and statistical significance of the result to prove that the variation was more successful than the original.
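For conversion data, each observation is simply “converted” or “did not convert”, so the sample attributes are easy to compute. A minimal sketch using Python’s standard library, with made-up numbers:

```python
from statistics import fmean, pstdev, pvariance

# Hypothetical sample: 1 = visitor converted, 0 = visitor did not.
observations = [1] * 25 + [0] * 975   # 25 conversions from 1,000 visitors

conversion_rate = fmean(observations)   # the sample average
variance = pvariance(observations)      # equals p * (1 - p) for 0/1 data
std_dev = pstdev(observations)          # square root of the variance

print(f"rate={conversion_rate:.3f} variance={variance:.4f} std_dev={std_dev:.3f}")
```

The average of the 0/1 observations is the conversion rate itself, which is why testing tools can work from nothing more than visitor and conversion counts.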
Statistical Significance
Here is another important, nay, critical concept.
A result is statistically significant when an effect as large as the one you observed would be very unlikely to arise by pure chance if the null hypothesis were true.
In practice, this means that significance tells you whether the observed effect is probably real rather than a fluke of random variation. It does not, by itself, rule out other explanations for the effect.
Let’s illustrate this with a simple example using a six-sided die. (You may want to make sure the die you throw is not “loaded”.)
You start with the assumption that each side has a 1 in 6 chance of landing face up. You throw the die a number of times, say 1,200 times, and you notice that numbers 3 and 5 occur a bit more frequently than others.
Is there any other explanation than a loaded die to account for this?
Well, yes: imperfections of the surface on which you threw the die may have influenced its behavior. So before making the call that it’s loaded, you should try tossing it on another surface.
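Whether “a bit more frequently” is more than chance can be checked with a chi-square goodness-of-fit test, which is one standard choice for this kind of question. The counts below are invented purely for illustration; the critical value is the standard table value for 5 degrees of freedom at the 5% level.

```python
# Hypothetical counts for faces 1..6 after 1,200 throws (illustrative only).
observed = [190, 195, 225, 192, 228, 170]
expected = sum(observed) / len(observed)   # 200 per face for a fair die

chi_square = sum((count - expected) ** 2 / expected for count in observed)

# Critical value of the chi-square distribution with 5 degrees of freedom
# at the 5% significance level (from standard tables).
CRITICAL_VALUE_5_DF = 11.07

print(f"chi-square = {chi_square:.2f}")
print("more deviation than chance usually produces:",
      chi_square > CRITICAL_VALUE_5_DF)
```

Note what the test does and does not say: it indicates that the deviation is larger than chance usually produces, but not why. A loaded die and an uneven surface would both produce the same statistic.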
The same is true for A/B testing. You may implement an improvement, but without conducting an A/B test properly, you will never be sure that the result you observe is not the function of some other cause you did not take into account.
Frequentist A/B testing
To conduct T-tests, you need to define the sample size using your expected effect and a sample size calculator, such as the one by Evan Miller.
Once you define your sample size, you’ll set up a test. This step consists of making a variation, using an A/B testing tool, and making your variation live on the website. Once you do this, the testing tool takes over.
Splitting traffic between two (or more) variations, the testing tool conducts all the elaborate statistical mathematics for you and reports the results. However, to properly set up the test and correctly interpret the results, you need to have at least working knowledge of statistics.
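To make the “statistical mathematics” less of a black box, here is a sketch of a significance check for two conversion rates. It uses a two-proportion z-test, a common large-sample stand-in for the T-test when the metric is a conversion rate; it is not the exact method of any particular tool, and the visitor and conversion counts are made up.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both versions share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical results: original 200/10,000 (2.0%), variation 260/10,000 (2.6%).
p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"p-value = {p_value:.4f}")
```

A p-value below the chosen threshold (commonly 0.05) is what a frequentist tool reports as a statistically significant result.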
Bayesian inference
Until a few years ago, all of the major A/B testing tools used the frequentist method almost exclusively. This has begun to change recently in favor of Bayesian inference, and now there are a few tools that use Bayesian inference exclusively, such as Google Optimize and Visual Website Optimizer.
This shift in preference has been the result of two major factors:
- More computing power available
- More intuitive reporting of the results
The first factor limited wider application of Bayesian inference until recently. Bayesian inference differs from frequentist inference in that it relies on taking into account all the new data that becomes available as you run a test.
The elaborate and complex mathematics required to update the results regularly require a lot more computational power, which only became available relatively recently.
To use our coin-tossing example again, Bayesian inference observes the number of tosses much the same way frequentist inference would. However, each new toss result is added into the pool of available information as soon as it arrives. That way, instead of waiting for a fixed 100 tosses before deciding whether the probability is indeed 50% each, Bayesian inference updates its estimate after every toss and may reach a decision sooner, for example by toss number 50.
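The toss-by-toss updating can be sketched with a Beta-Binomial model, a textbook way to do Bayesian inference on a yes/no outcome. The flat starting prior and the toss sequence below are assumptions chosen for illustration.

```python
def update(prior_heads: float, prior_tails: float, toss: str) -> tuple[float, float]:
    """Beta-Binomial update: each toss adds one count to the matching side."""
    if toss == "H":
        return prior_heads + 1, prior_tails
    return prior_heads, prior_tails + 1


# Beta(1, 1) is a flat prior: every possible bias is equally plausible.
heads, tails = 1.0, 1.0
tosses = "HTHHTHTTHH"  # hypothetical sequence of ten tosses

for number, toss in enumerate(tosses, start=1):
    heads, tails = update(heads, tails, toss)
    estimate = heads / (heads + tails)   # posterior mean of P(heads)
    print(f"after toss {number:2d}: estimated P(heads) = {estimate:.3f}")
```

The estimate is revised after every toss rather than once at the end, which is the property that lets Bayesian tools report a running result.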
The major advantages of Bayesian testing are that it gets results faster, takes into account new data as it arrives, and is easier to interpret. Its only major drawback is that it is harder to calculate.
However, it is not that simple. While in theory, by using Bayesian testing, it is possible to reach the test result sooner, you still need to take into account things like days of the week, seasonal influences, length of the buying cycle, and other real-world effects. It is always prudent to let tests run for at least three to four full weeks.
Multivariate testing
Sometimes, we need to test different layouts of the page to see if different positions of calls to action, product images, headlines, descriptions, etc. will have a measurable impact on conversions.
To do this, we need to use what is called a multivariate test.
In this example, four versions of page layout were tested
Multivariate testing (MVT) means that we define all the possible different combinations of a page layout and test to see which one obtains the best result.
While this method is similar to A/B testing, MVT requires calculating the number of possible combinations and creating all of them. To find that number, multiply together the number of versions of each element being tested.
If we have two different versions of a headline, two call-to-action button versions, and four different product images, we end up needing 2 × 2 × 4 = 16 different variations. As you can see, the number of variations to build and test grows very quickly as you add more elements.
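The count is easy to verify, and the full list of combinations easy to generate, with a short script. The element names below are placeholders.

```python
from itertools import product

# Placeholder element versions matching the counts in the example above.
headlines = ["headline_1", "headline_2"]
cta_buttons = ["cta_1", "cta_2"]
product_images = ["image_1", "image_2", "image_3", "image_4"]

combinations = list(product(headlines, cta_buttons, product_images))
print(len(combinations))  # 2 * 2 * 4 = 16

for headline, cta, image in combinations[:3]:
    print(headline, cta, image)
```

Adding just one more element with three versions would triple the list to 48 combinations, each needing its own share of traffic.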
MVT can be useful to determine the precise layout of a product page, for example, but its usefulness is limited by the sheer effort necessary to design and deploy all the variations, as well as the traffic volume required to effectively run this type of test.
Bandit testing
Bandit testing is intended to solve a specific problem — implementing the best-performing variation in the shortest possible time and getting the most revenue growth.
The basic idea, like so many concepts in probability and statistics, comes from games of fortune. The statistical mechanism used for this type of test was devised by analyzing the problem of pulling the most winning levers of the slot machine called the “one-armed bandit”.
The end result of this analysis was the abstract machine called the multi-armed bandit, where the player tries to find and keep pulling the levers (arms) that pay out prizes most often.
How A/B testing compares to bandit testing
Translated into testing, it means sending the most visitors to the variation that has the largest-observed conversion rate.
While this sounds nice in theory, if you remember our short explanation of statistics, this route is fraught with danger.
- Without statistical significance and adequate sample size, it is not possible to know which of the variations is really better.
- If you use bandit testing over an extremely short period of time, you may fail to notice the impact of other factors, such as days of the week, on conversion.
Bandit testing can be extremely useful for testing things like ad campaign landing pages, which need to earn as many conversions as possible over a relatively short period of time, or really significant changes over a much longer period of time.
Otherwise, you run the risk of eliminating potentially winning variations too early on account of low initial results.
Bandit testing can work much better when used in combination with Bayesian testing, although you still have to take into account events unrelated to what you’re testing for.
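One well-known way to combine bandit testing with Bayesian inference is Thompson sampling, sketched below as a simulation. This is an illustration rather than the algorithm of any specific tool; the “true” conversion rates are made up (in a real test they are unknown), and the seed is fixed only to make the run repeatable.

```python
import random


def thompson_sampling(true_rates: list[float], visitors: int, seed: int = 42) -> list[int]:
    """Simulate a bandit test; returns how many visitors each variation received."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)     # conversions seen per variation
    losses = [0] * len(true_rates)   # non-conversions seen per variation

    for _ in range(visitors):
        # Draw a plausible conversion rate for each variation from its
        # Beta posterior, then send this visitor to the highest draw.
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        chosen = draws.index(max(draws))
        if rng.random() < true_rates[chosen]:
            wins[chosen] += 1
        else:
            losses[chosen] += 1

    return [w + l for w, l in zip(wins, losses)]


# Simulated rates: variation A converts at 5%, variation B at 10%.
traffic = thompson_sampling([0.05, 0.10], visitors=5_000)
print(traffic)
```

Because the choice is based on random draws rather than on the raw observed rate, a variation with a poor start still receives some traffic, which reduces the risk of eliminating a potential winner too early.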
3. Interpreting your test results
When the test is complete, the testing tool will report the results of the test. This is the time when your CRO hypothesis will (hopefully) be validated by the results.
To be sure that the results are correct, check how many visitors and transactions have been recorded by your testing tool, and verify that these numbers are at least equal to the calculated sample size, and that you’ve run the test for at least two weeks.
If, after the test reaches the required sample size and has run for long enough, the result has not reached statistical significance, the test has not shown that your change makes a difference, and you should revisit the hypothesis. Either the hypothesized solution is not actually a solution, or the problem diagnosed on the page is not what actually hinders conversions. For example, you may have hypothesized that people are not converting on your long-form page because of the length of the copy, when in fact it is the quality of the copy that is lacking.
The results of frequentist tests are usually relayed in a form such as “If there were no real difference between the versions, a result at least this extreme would occur by chance only 5% of the time”.
As you can see, this reports that in effect, the original version of the page has been outperformed by a variation. If this is the result we expected and wanted, we conclude that the new variation is a success and implement it.
Here is an example of how Optimizely, a popular frequentist A/B testing tool, displays the results.
How Optimizely reports test results
The typical result report from the A/B testing tool, as you can see, will show the improvement and the “chance to beat the baseline” as a percentage. In this example, a 72% chance is inconclusive, and if enough time has passed, you can discard this test.
The major difference between frequentist and Bayesian inference tools lies in how each tool reports results.
Unlike frequentist results, Bayesian test results contain the chance that an actual effect will be observed and that the variation will be better than the original. In effect, this makes it much easier to interpret and understand the results.
This is an example of the result screen of Google Optimize, an exclusively Bayesian inference tool:
Google Optimize uses Bayesian testing; the resulting report looks like this
As you can see, Bayesian results, unlike frequentist, show both the probability of beating the baseline and the probability that the variation will be the best. If you make multiple different variations, the second result will tell you how probable it is that one variation is the best.
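Figures like these can be approximated with a short simulation. The sketch below is not how Google Optimize or any other tool computes its report; it assumes a flat Beta(1, 1) prior, uses Monte Carlo sampling, and runs on made-up conversion data.

```python
import random


def probability_to_be_best(results: dict[str, tuple[int, int]],
                           draws: int = 20_000, seed: int = 7) -> dict[str, float]:
    """Monte Carlo estimate of each variation's chance of having the top rate.

    results maps a variation name to (conversions, visitors).
    """
    rng = random.Random(seed)
    best_counts = {name: 0 for name in results}
    for _ in range(draws):
        # Sample one plausible conversion rate per variation from its posterior.
        sampled = {
            name: rng.betavariate(conv + 1, visitors - conv + 1)
            for name, (conv, visitors) in results.items()
        }
        best_counts[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in best_counts.items()}


# Hypothetical data: (conversions, visitors) for the original and two variations.
chances = probability_to_be_best({
    "original": (100, 5_000),
    "variation_b": (150, 5_000),
    "variation_c": (120, 5_000),
})
print(chances)
```

When only the original and a single variation are passed in, the variation’s probability to be best is the same thing as its probability of beating the baseline.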
In the next chapter, we’ll go through the basic mechanics of testing.