Providing a consistent user experience while A/B testing

When setting up an A/B testing framework, pay special attention to how users are assigned to different arms of a test (and which metrics you track). If you aren’t careful, you’ll give the same user a different experience each time they go through a tested flow. This creates a problem both for you, the person running the test, and for the user.

As the person running and analyzing the test, you lose data about anything but the most short-term effects of your test. When running a test, you are interested both in how the changed experience affects the specific interaction and in the overall, long-term effects the change could have. If a user has interacted with all of the arms of a test, it’s impossible to tell whether one of the arms helped or hurt in the long run.

Giving your users a different experience every time they use your app is confusing as well. Can you imagine using an app that changed where all the buttons were every time you used it? It would be impossible to learn how to use it. It is much safer to make sure users get the same experience every time.

The easy solution is to save which arm of each test the user is on, and then look it up whenever needed. However, this approach can quickly become unwieldy. At Lucid, it’s not uncommon to run a few tests on a page, and doing a database lookup for each test would require multiple network requests and database hits. Each additional test would make the page load more slowly.


function getArm(userId, testName) {
  // Every call waits for a network request that includes a database lookup.
  // On a bad day this could take a while.
  var arm = getArmFromServer(userId, testName);
  return arm;
}

Instead, we need a way to quickly decide what arm of a test a user should be on based on information we already have. In this case, we’ll use the user’s id, which we would have already loaded into our client.

To decide which arm of a test the user should be on, we take their user id, concatenate it with the name of the A/B test, generate the md5 hash of the resulting string, and convert the hash into a number. We then mod that number by 100 and compare it to the percentages in each arm of the test. For example, if we are running an evenly split two-armed test, the percentages would be 50-50. If the result of the mod is less than 50, the user would be assigned to the A arm, and if it is greater than or equal to 50, the user would be assigned to the B arm.
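As a rough illustration of the 50-50 case described above, here is a minimal sketch for Node.js. The names hashToNumber and assignTwoArmTest are hypothetical, and using only the first 32 bits of the md5 digest is an assumption made to keep the number within JavaScript's safe integer range; it is not necessarily how our production code converts the hash.

```javascript
// Sketch only: hashToNumber and assignTwoArmTest are hypothetical names.
const crypto = require('crypto');

// Hash a string with md5 and read the first 8 hex digits (32 bits) as a number.
function hashToNumber(str) {
  const hex = crypto.createHash('md5').update(str).digest('hex');
  return parseInt(hex.slice(0, 8), 16);
}

// Evenly split two-armed test: buckets 0-49 go to A, buckets 50-99 go to B.
function assignTwoArmTest(userId, testName) {
  const bucket = hashToNumber(userId + '|' + testName) % 100;
  return bucket < 50 ? 'A' : 'B';
}

// The same user id and test name always produce the same arm.
console.log(assignTwoArmTest(12345, 'amazingTest'));
```

Because the test name is part of the hashed string, the same user can land in different arms of different tests, while always staying in the same arm of any one test.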

The code below shows how to do this generally. With this approach, the user stays on the same arm for the duration of the test, and no additional storage is required. The assignment may not look very random, but we’ve done some empirical work to show that it behaves as if it were.


function getArm(userId, testConfig) {
  //testConfig contains the name of the test and a list of arms with their weights
  // for example {name: 'amazingTest', arms: {'T-A': 1, 'T-B': 1}}
  var totalWeight = 0;
  Object.keys(testConfig.arms).forEach(function(arm) {
    totalWeight += testConfig.arms[arm];
  });

  var idForTesting = userId + '|' + testConfig.name;

  // There are many methods and libraries for hashing a string
  // and then turning it into a javascript number
  // so I won't go into the details here
  var hashSum = getLongFromHash(idForTesting);
  var assignedValue = hashSum % totalWeight;

  var weightLeft = assignedValue;
  var keys = Object.keys(testConfig.arms);
  for (var i = 0; i < keys.length; i++) {
    var arm = keys[i];
    var weight = testConfig.arms[arm];
    weightLeft -= weight;
    if (weightLeft < 0) {
      return arm;
    }
  }
}
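The getLongFromHash helper is left undefined above. One possible version for Node.js is sketched below; it is an assumption, not the implementation we actually use. It keeps only the first 52 bits of the md5 digest so that the result fits in a regular JavaScript number without losing precision.

```javascript
// Hypothetical implementation of the getLongFromHash helper used above.
const crypto = require('crypto');

function getLongFromHash(str) {
  const hex = crypto.createHash('md5').update(str).digest('hex');
  // 13 hex digits = 52 bits, which stays below Number.MAX_SAFE_INTEGER (2^53 - 1),
  // so the later modulo operation is exact.
  return parseInt(hex.slice(0, 13), 16);
}

console.log(getLongFromHash('12345|amazingTest'));
```

Any stable hash would work here, as long as every client computes it the same way.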

To show that our test assignment was random, we ran a test with 750 different arms. The hypothesis was that if assignments were random, each arm would contain roughly the same number of people and each arm would behave roughly the same (no significant difference in how often users in any one arm registered, created documents, paid, etc.).

We’ve let the test run for over six months, occasionally checking whether the behavior has changed, and over 2.4 million users have been assigned an arm. The arm with the most users and the arm with the fewest users differ by fewer than 300 people, and there is no significant difference in behavior between arms. The assignment is random enough to form the backbone of an A/B test system.
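A much smaller version of this kind of check can be run offline. The sketch below assigns synthetic, sequential user ids to equally weighted arms and reports how evenly they spread. The function names, the test name 'uniformityCheck', the user and arm counts, and the 52-bit md5 conversion are all assumptions chosen for illustration; this is not the experiment described above and says nothing about user behavior.

```javascript
// Offline sketch of a uniformity check with synthetic user ids.
const crypto = require('crypto');

// Hypothetical hash helper: first 52 bits of the md5 digest as a number.
function getLongFromHash(str) {
  const hex = crypto.createHash('md5').update(str).digest('hex');
  return parseInt(hex.slice(0, 13), 16);
}

// Assign user ids 1..userCount to armCount equally weighted arms and count each arm.
function countAssignments(userCount, armCount, testName) {
  const counts = new Array(armCount).fill(0);
  for (let userId = 1; userId <= userCount; userId++) {
    counts[getLongFromHash(userId + '|' + testName) % armCount]++;
  }
  return counts;
}

const counts = countAssignments(100000, 50, 'uniformityCheck');
const expected = 100000 / 50;

// Pearson chi-square statistic against a perfectly even split.
const chiSquare = counts.reduce(function (sum, c) {
  return sum + Math.pow(c - expected, 2) / expected;
}, 0);

console.log('min:', Math.min.apply(null, counts),
            'max:', Math.max.apply(null, counts),
            'chi-square:', chiSquare.toFixed(1));
```

If assignment is close to uniform, the chi-square statistic should land near the number of arms minus one; a value far above that would suggest the hash is favoring some arms.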

With this framework in place, there’s virtually no limit to the number of arms our tests can have nor to the number of tests we can run at a time.

Disaster-proof A/B Testing, Pt. 2: Ensuring a consistent experience (on 750+ arms)
