I have been running a growth team for about two years now, and I have done a lot of A/B testing in that time. While the fundamentals of an A/B test are fairly simple, there are a lot of mistakes that can be made -- and believe me, I have made them all. However, the biggest mistake a growth team can make is not testing at all. Testing, if done correctly, not only helps you optimize your customers' experience, it also helps you understand your customers. Today, I am excited to share some of the things I have learned so far that can help you run A/B tests successfully.
Without further ado, 12 ways to screw up your A/B tests.
1. Assign people who will not get the test experience to an arm of the test
The biggest problem with this mistake is that it dilutes your test results. If in reality the test experience increases conversions by 50%, but only 1% of users assigned to that arm of the test actually get the experience, there is a good chance you won’t detect a significant impact at all. This mistake usually occurs in one of two ways:
- Assign users several steps before the actual test experience occurs.
- Assign users who are ineligible for the test experience.
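The arithmetic behind that dilution is worth spelling out. A quick sketch using the hypothetical numbers above (the baseline conversion rate here is an assumption for illustration):

```python
# Hypothetical numbers from the example above: the treatment lifts
# conversion by 50%, but only 1% of users assigned to the treatment
# arm actually see the experience.
baseline_rate = 0.10   # control conversion rate (assumed for illustration)
true_lift = 0.50       # real effect on users who actually see the treatment
exposure = 0.01        # fraction of the treatment arm actually exposed

treated_rate = baseline_rate * (1 + true_lift)
# The treatment arm's observed rate blends exposed and unexposed users.
observed_rate = exposure * treated_rate + (1 - exposure) * baseline_rate
observed_lift = observed_rate / baseline_rate - 1
print(f"Observed lift: {observed_lift:.2%}")  # 0.50% instead of 50%
```

A 50% real effect shrinks to a 0.5% measured effect, which is why you are unlikely to detect anything at all.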
2. Change your control arm
Pretty basic here, but it is not uncommon for people to want to add an “obvious fix” to the control arm because it simplifies the experiment or corrects a perceived error. I recommend you avoid doing so whenever possible, as it is difficult to assess the psychological or technical impact a change may have on a user or their machine. We test so we can find out.
3. Give users an inconsistent experience
Many tests we run only impact the user at a single point in time, as with the onboarding flow, but some tests, like adding a paywall, have a repeated impact on the user. For a clean analysis, it is important that your test is structured to give users a consistent experience. Some thought should also be given to how you transition users off a test when it is complete. This is particularly difficult for pricing tests, which may be seen by users both before and after they register. The areas where you should be particularly sensitive include:
- Re-assigning users to the test. This scenario is fairly easy to avoid when the user is logged in, but much more difficult if they are just a visitor.
- Members of the same team seeing substantially different things. Again this is very difficult pre-registration, but make sure everyone on an account gets the same test assignment.
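One way to guarantee both properties above is to make the assignment a pure function of a stable identifier, keyed at the account level so teammates always match. A minimal sketch under those assumptions; the test name, key format, and split are illustrative:

```python
import hashlib

def assign_arm(account_id: str, test_name: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket an account into a test arm.

    Hashing the account ID (not the individual user ID) keeps every
    member of the same team in the same arm, and re-running the
    function always returns the same answer -- no re-assignment.
    """
    key = f"{test_name}:{account_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Same inputs always produce the same arm, across sessions and servers.
assert assign_arm("acct-42", "paywall-test") == assign_arm("acct-42", "paywall-test")
```

Anonymous visitors are the hard case, as noted above; a first-party cookie holding a generated ID is the usual workaround, but it breaks down across devices.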
4. Make it difficult/impossible to turn off
Ideally, the team analyzing and implementing the test has the power to turn a test on or off. This sounds obvious; however, depending on your testing system, it may require an analyst to identify the change needed, an engineer to make a code change, and the ops team to do an emergency release. If you are testing as often as you should be, this type of flow quickly results in frustrated people and wasted time.
5. Fail to track all relevant actions for the test
If I had to identify the single greatest cause of emergency releases for our team, it would be failing to track a key metric that could clearly determine the efficacy of the test. When scoping out your tests, it is important to clearly identify what information you will need to get clean results. If you launch a test without a way to collect that information, add a way to do so ASAP or turn off the test. There is no reason to run an experiment if you can’t measure the impact. Another variation of this mistake is testing a change where you can’t articulate the positive impact you expect it to have (usually making styles consistent).
6. Fail to consider edge cases
Considering edge cases can sometimes be difficult to do, as your code may be fairly complex. For this reason, it is critical to have excellent engineers and QA people on the team. A good engineer will ensure the code handles multiple flows intelligently and will think about how the code will interact with your other tests. A good QA person will drive you crazy with edge cases that can make or break your results. Be particularly sensitive to edge cases that are highly correlated with value, such as team or enterprise experiences.
7. Introduce bias to the test results
No one does this intentionally, but it can happen pretty easily if you are filtering your results or changing your test distribution. At times we have started a test at 80:20 to be conservative with our test results, but then when we are comfortable we have changed to a 50:50 mix. If you do this (I actually try to do everything at 50:50 the entire time), you have to account for the additional cooking time the average control arm user had. One other variant of this mistake is to accidentally assign 100% of a subsegment to one arm of the test. This usually happens when a group of users cannot be allowed to experience the treatment. These users either need to be clearly flagged at the time of the test assignment, or better yet, not assigned to the test at all.
8. Fail to monitor the test closely during day one to ensure that nothing is broken in the implementation
This is a fairly controversial point, as statistical purists will complain that monitoring on day one will significantly increase your chance of a type I or type II error. But you aren’t checking the results after a day to make a final call; you are checking them to make sure that the test is deployed correctly and users are getting the expected behavior. We recently deployed a test to drive sharing during the standard download flow. While the test arm increased sharing on day one by 80%, it came at the cost of a 9% drop in downloads; both impacts were highly statistically significant. In raw numbers, this test added 100 shares but dropped 200 downloads. When we saw the magnitude of this impact, we knew we needed to soften the flow a bit. It was very beneficial to have an emergency release ready to go when we started getting complaints about the change.
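Even for a sanity check, it helps to know whether a day-one swing is plausibly noise. A minimal stdlib sketch of the standard two-proportion z-test; the counts below are made up for illustration, not our actual test data:

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)         # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))          # standard normal CDF
    return z, 2 * (1 - phi)                          # two-sided p-value

# Illustrative day-one counts: 910/10,000 downloads in treatment
# vs 1,000/10,000 in control (a 9% relative drop).
z, p = two_proportion_z(910, 10_000, 1_000, 10_000)  # z ~ -2.17, p ~ 0.03
```

A test like this tells you whether the drop is worth reacting to on day one; it does not license calling the experiment early.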
9. Call a test a win too early
At times you will be pressured to end a test early. Maybe it’s because you want to end the quarter with a solid win, or maybe it’s because an executive wants to capture as much of the value as possible. Fight this. If you are running a growth team, your lifeblood in the organization is credibility. If people trust your methods and conclusions, you are free to test all sorts of things and really make a big impact. Consequently, nothing is scarier than reviewing an old test that was called as a win too early. Calling a loss early is a whole different ballgame. If you think it is a lemon, and you have learned everything you can from it, cut your losses.
10. Fail to quickly dig into inconsistencies in test assignments
Before you launch your test, you should have a pretty good idea of how many people should be assigned to it. Check your test a few days in, and verify that the assignments meet your expectations. After noticing a few tests that favored a single arm for test assignments at a statistically significant level, we began a several-week witch hunt to find the culprit. After filtering for registration sources, we eventually discovered an error from our old demo flow that assigned everyone to the same arm of the test. Since these users behaved significantly differently compared to our standard users, we had introduced a systematic bias into our test results.
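The balance check itself is cheap to automate. A stdlib sketch that assumes a nominal 50:50 split; the counts are illustrative:

```python
from math import sqrt, erf

def check_split(n_treatment: int, n_control: int, expected_pct: float = 0.5) -> float:
    """p-value that the observed assignment split matches the expected one.

    A very small p-value suggests something upstream (an old demo flow,
    a caching layer, a bot filter) is forcing users into one arm.
    """
    n = n_treatment + n_control
    se = sqrt(expected_pct * (1 - expected_pct) / n)
    z = (n_treatment / n - expected_pct) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 5,200 vs 4,800 on a nominal 50:50 split -- p ~ 6e-5,
# far too lopsided to be chance; go find the culprit.
p = check_split(5200, 4800)
```

Running this as a daily automated alert is far cheaper than the several-week witch hunt described above.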
11. Fail to predefine success and ignore secondary effects
Some jerk once told me, “If you fail to plan, you plan to fail.” Well, in testing it seems more like, “If you fail to define success, you can choose to call whatever you did a success.” I have seen some pretty creative definitions of success that conveniently meet a nuanced result (not great for the old credibility though). A less offensive mistake, and a more common one, is to fail to consider any secondary impacts of your test. Usually, we predefine our core metric and balance it with the most likely negative result. For example: First Payments against Returning Use, First Payments against Quality of Payments, Shared Documents against Downloaded Documents. As a rule, we pair all payment metrics with engagement and quality of payments.
12. Optimize low-value channels or functionality
In my opinion, a bad test is not one that fails, but one that wouldn’t make a difference even if it succeeded. Make sure you are testing things that have the potential to be needle movers. We once spent a month fixing the conversion funnel for one of our onboarding flows, and in the end we increased conversions by 20%. Unfortunately, this flow represented a tiny minority of our registrations and had the worst conversion rate of all (now slightly less terrible). Our time would have been better spent optimizing a higher touch experience with more qualified users.
Even when avoiding these 12 pitfalls, executing a clean test is often more difficult than it sounds. Consequently, during the test design we usually find it helpful to use Lucidchart to diagram the process in a flowchart. Doing so helps us anticipate the various edge cases and ensures we have thought about the experiment more robustly.
Hopefully implementing these tips can help you successfully use A/B testing to accelerate your company’s growth!
About the author
Spencer Mann is the Senior Director of Growth at Lucid Software, makers of top-ranked productivity apps like Lucidchart. At Lucid, Spencer is responsible for leading the growth and business intelligence teams. Prior to working at Lucid, Spencer worked at Celanese where he was a product manager for the polyester product line. Spencer graduated with a BS degree in Biological Engineering from USU, an MS degree in Biosystems Engineering from OSU, and an MBA from BYU.