As an engineer reviewing failing tests, there is almost nothing worse than debugging a flaky test: one that sometimes passes and sometimes fails when run repeatedly against the same code. From a developer's perspective, tests are supposed to just work. When they fail, they reveal a bug; when they pass, we can release with confidence. When that is not the case, the tests are worthless. The organization loses faith that the tests will actually catch bugs, developers waste time investigating problems that do not exist, and when a bug does reach production, people ask, "why didn't we catch this?" Plain and simple, flaky tests ruin testing suites.
At Lucid, we have experienced our fair share of flaky tests over the years and have spent hours investigating and debugging them. Hopefully by sharing our experience, you can avoid some of the pitfalls we encountered along the way.
Randomness in Code Under Test
Tests should be deterministic: the same input always produces the same output. Every test relies on this simple principle. It sets up a particular set of inputs, then compares the actual output to an expected output. When the two do not match, the test fails. One of the first flaky tests we ran into at Lucid was the result of a violation of this determinism rule.
This particular bug made every integration test for our editor page flaky. Each of those tests creates a document; every document is required to have at least one page, and every page has at least one panel. The bug could therefore surface in any editor test, yet it would almost never reproduce while someone was investigating it. Because all of our unique IDs are generated randomly, it was virtually impossible to figure out why all of our editor tests had become flaky.
When our panels were created, they were assigned a unique ID. It turned out that if the ID contained a ~ character, it would cause the document to enter an infinite loop and never load—absolutely terrible.
The core problem in debugging and understanding the ~ bug was that the code under test generated the IDs randomly. We had no way of tracking the differences between runs of the same test because the test code itself was exactly the same every time. The fix was to add optional detailed logging that can be switched on during testing, covering anything the code under test might do differently between runs. This lets us turn on the extra logging, run the same test hundreds or thousands of times if necessary, and debug from the logs.
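As a minimal sketch of the idea, assuming a Python code base: the `generate_panel_id` helper, the `ID_ALPHABET`, and the `DEBUG_RANDOM_IDS` flag below are all hypothetical, but they show how an opt-in log line around a random value makes otherwise-identical test runs diffable.

```python
import logging
import os
import random
import string

logger = logging.getLogger("panel-ids")

# Hypothetical flag: extra logging stays off in production and is
# switched on (e.g. via the environment) for repeated CI test runs.
DEBUG_RANDOM_IDS = os.environ.get("DEBUG_RANDOM_IDS") == "1"

# Illustrative alphabet; note '~' is included, which was the culprit.
ID_ALPHABET = string.ascii_letters + string.digits + "-~"

def generate_panel_id(rng: random.Random = random.Random()) -> str:
    """Generate a random panel ID, logging it when debug logging is on."""
    panel_id = "".join(rng.choice(ID_ALPHABET) for _ in range(12))
    if DEBUG_RANDOM_IDS:
        # With this line, hundreds of runs of the same test can be
        # diffed to find the random input that made one run fail.
        logger.info("generated panel id: %s", panel_id)
    return panel_id
```

Passing in an explicit `random.Random` instance also lets a test seed the generator and replay the exact ID that triggered a failure.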
Race Conditions
In our integration tests, we have a series of tests that perform a particular set of actions and then take a screenshot for comparison. This is particularly important for testing rendering on the main canvas of our editor. After one commit, all of these image comparison tests became flaky. The failure happened only about once in every 100 runs, and there was no way to predict which test would break next. In fact, it happened so rarely that it fell below our threshold for flaky tests, so at first we didn't even look at it. Then the user tickets started rolling in about text issues.
In our refactor of fonts, we had introduced a race condition for font loading that would make it so that some fonts never loaded in the editor. When this happened, all the user’s text would be rendered as black boxes, leaving the editor virtually unusable.
So why was this not caught by our tests? The simple answer is that we did catch it, but because the test had been marked as flaky and the underlying race condition rarely reproduced, we ignored it. Race conditions will always be hard to catch and reproduce with high-level integration testing. Instead, this bug led us to add a large number of functional tests around handling network connectivity problems. The functional tests execute callbacks in different orders, test what happens if network requests are interrupted, and use a mock timer to test a wide variety of possible responses to network requests.
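The mock-timer approach can be sketched as follows. This is an illustrative Python implementation, not our actual framework: the `MockTimer` class and the font/render callbacks are made up, but they show how a test can pin down every possible ordering of asynchronous events instead of hoping the race reproduces.

```python
import heapq
from typing import Callable

class MockTimer:
    """A controllable clock: tests schedule callbacks, then advance time
    explicitly, so each possible ordering is exercised deterministically."""

    def __init__(self) -> None:
        self.now = 0.0
        self._queue: list[tuple[float, int, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker so equal deadlines fire in schedule order

    def schedule(self, delay: float, callback: Callable[[], None]) -> None:
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, seconds: float) -> None:
        """Move the clock forward, firing any callbacks that come due."""
        deadline = self.now + seconds
        while self._queue and self._queue[0][0] <= deadline:
            when, _, callback = heapq.heappop(self._queue)
            self.now = when
            callback()
        self.now = deadline

# Example: a font response that arrives only after a render timeout,
# an ordering a test can force by choosing the scheduled delays.
events: list[str] = []
timer = MockTimer()
timer.schedule(5.0, lambda: events.append("font loaded"))
timer.schedule(2.0, lambda: events.append("render timeout"))
timer.advance(10.0)
# events is now ["render timeout", "font loaded"]
```

By swapping the two delays, the same test can also cover the healthy ordering, so both branches of the race are checked on every run.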
Shared Resources
Our most recent adventure in flaky tests came from parallelizing our tests so that a run that once took several hours could finish in minutes. This time the flakiness was introduced not by a code change but by our testing setup itself. There were two major issues, both involving shared resources, that caused flakiness as soon as we ran multiple tests at the same time.
The first issue was a shared Google account used in some of our tests. Our integration tests logged into Gmail both to verify the emails we send to users and to exercise our Google Drive integrations. In all, about 15 tests logged into the same Google account. When the tests ran sequentially, this was not a big deal, since each test logged out when it finished. Once we parallelized, there was no telling how many browsers were trying to log into the same account at the same time. Sometimes Google sent us to a warning page, and sometimes it logged us out of a different session because we had too many simultaneous sessions open. To resolve this, we started using a mail-testing service (such as Mailinator) for testing mailings and multiple Gmail accounts on a single domain for testing Google integrations.
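Where a shared resource cannot be eliminated, it can at least be locked. Here is a minimal sketch in Python of one way to do that: the `AccountPool` class and the account names are hypothetical, but the pattern guarantees that no two concurrently running tests ever hold the same account.

```python
import threading
from contextlib import contextmanager

class AccountPool:
    """Hand out test accounts so no two concurrent tests share one."""

    def __init__(self, accounts: list[str]) -> None:
        self._available = list(accounts)
        self._cond = threading.Condition()

    @contextmanager
    def checkout(self):
        with self._cond:
            while not self._available:
                self._cond.wait()  # block until a test returns an account
            account = self._available.pop()
        try:
            yield account
        finally:
            with self._cond:
                self._available.append(account)
                self._cond.notify()

# Placeholder accounts, not real test credentials.
pool = AccountPool(["drive-test-1@example.com", "drive-test-2@example.com"])

def run_drive_test() -> str:
    with pool.checkout() as account:
        # ... log in as `account` and exercise the Drive integration ...
        return account
```

With more parallel workers than accounts, surplus tests simply wait at `checkout()` rather than fighting over a live session.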
Flaky tests ruin testing suites. They are among the most painful bugs to debug, and they destroy an organization's confidence in its tests. Over the years, we have learned some lessons that help keep flaky tests out of a testing suite.
- Have optional detailed logging for code under test, especially around any random values it uses.
- Write functional tests, using a mock timer, for anything that might involve a race condition.
- Don’t use shared resources in different test cases unless they are locked.
With these three tips, we have been able to drastically cut down the number of flaky tests in each of our test runs. Hopefully, they can help you do the same.