Testers: you know this story well. You’ve tested a feature thoroughly in development and checked it in your staging environment, but now it has gone out to production and doesn’t work. Something behaves differently in production and is causing the feature to fail. How do you make sure that your product will behave the same way in your production environment as it does in your test environment?
Our solution: A production-like test environment. We refer to it as “preprod.” How did we get there? What are the benefits? Find out below.
Life before Preprod
We refer to our previous RC testing environment as “staging.” Structurally, it is an AWS instance that runs all of our services on the same box, which makes it very similar to our development environments. The code on staging is production-like and is intended for release to production after our regression cycle. Because staging is structured differently from production, we started to notice a specific class of bug that was not reproducible on staging, yet revealed itself when we released to production.
We have also had an issue with increasing load. When I first started at Lucid Software two years ago, staging was powerful enough to cope with the traffic from a crew of six testers. Now, with an expanding QA team and a robust automation suite, we have outgrown our staging environment.
How did we get here?
While we had the option to simply upgrade our staging environment, there were several factors that led to our decision to invest in a production-like test environment (what we now refer to as “preprod”).
One of the biggest changes to our infrastructure management has been Cumulus. Before Cumulus, setting up AWS instances for preprod would have been a laborious effort. With Cumulus, setting up preprod was a matter of copying our production infrastructure in Cumulus, making the necessary changes, and syncing those changes with AWS.
This meant that the process of standing up preprod was a rapid conversation between our DevOps team and testers. When we saw a configuration problem or a problem with communication between services, the fix would come about as a quick change in the codebase and a sync to AWS. Test, fix, test again was the story of creating preprod – rather than test, wait for a fix to propagate, then test again.
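The copy-then-modify step described above can be pictured with a small sketch. This is not Cumulus's actual configuration format or API; the dictionary layout, the service names, the instance types, and the `derive_environment` helper are all hypothetical, chosen only to show the idea of deriving a preprod definition from the production one and overriding just the parts that differ.

```python
import copy


def _merge(target, overrides):
    """Recursively apply `overrides` onto `target` in place."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge(target[key], value)
        else:
            target[key] = copy.deepcopy(value)


def derive_environment(production, overrides):
    """Return a deep copy of `production` with `overrides` merged in.

    The production definition itself is never modified.
    """
    derived = copy.deepcopy(production)
    _merge(derived, overrides)
    return derived


# Hypothetical production definition (illustrative only, not a real schema).
production = {
    "name": "production",
    "services": {
        "editor": {"instance_type": "m4.large", "min_size": 4, "max_size": 12},
        "documents": {"instance_type": "m4.large", "min_size": 2, "max_size": 6},
    },
}

# Only the differences are stated; everything else is inherited from production.
preprod_overrides = {
    "name": "preprod",
    "services": {
        "editor": {"min_size": 1, "max_size": 3},
        "documents": {"min_size": 1, "max_size": 2},
    },
}

preprod = derive_environment(production, preprod_overrides)
```

Keeping the overrides small is the point of the sketch: the fewer deliberate differences there are, the closer the test environment stays to production in structure.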
What is Preprod? What can it do?
Preprod is structured exactly the same as our production environment. This means that services run in scale groups on their own boxes instead of on a common box. This gives us the benefits of scaling while testing. We can now scale up our services with load, test our services at scale, and better handle any of our servers in a sick state.
When we had a sick service on staging, any attempts to fix the service resulted in downtime since it was a singleton service on that environment. Any real inspection of a service in a sick state blocked any further testing by the QA team. Now, with a preprod environment, we can freeze any server in a sick state, take it out of rotation, and spin up a new one – all without disrupting any test activity at all.
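To make the difference from a singleton service concrete, here is a toy in-memory model of the workflow just described. It does not call AWS and is not how our tooling is implemented; the `ScaleGroup` class, its `quarantine` method, and the instance names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ScaleGroup:
    """Toy model of a group of identical servers behind one service."""

    service: str
    desired: int
    in_rotation: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def __post_init__(self):
        # Launch instances until the group reaches its desired size.
        while len(self.in_rotation) < self.desired:
            self._launch()

    def _launch(self):
        instance = f"{self.service}-{next(self._ids)}"
        self.in_rotation.append(instance)
        return instance

    def quarantine(self, instance):
        """Take a sick instance out of rotation, keep it for inspection,
        and launch a replacement so capacity is unchanged."""
        self.in_rotation.remove(instance)
        self.quarantined.append(instance)
        return self._launch()


group = ScaleGroup("editor", desired=3)
replacement = group.quarantine("editor-2")
```

After the call, the group still serves traffic from three instances, while `editor-2` sits in `quarantined` where it can be inspected without blocking anyone's testing.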
Our staging server was often overloaded from testing, so any monitoring of that environment was inherently noisy. With preprod, we now have the ability to duplicate our production monitoring system while testing. We have the additional advantage of knowing exactly what kind of traffic we are generating to preprod – we control the input to our system and are easily able to determine the output on a systematic scale. This approach has led to a dramatic reduction in production alerts for our Ops team.
Our production releases are now rehearsed on preprod. We are able to rule out service dependencies and model the user impact of releasing new code to production. This has led to smoother production releases for our Ops team.
Finally, we have the ability to rehearse potential structural changes to production in preprod. In the past, when our engineering team wanted to make a change in the underlying structure of our production system, the only place to test if that change worked was in the production system. For example, if we made any changes to our security groups impacting how one service talks to another, any bugs in that configuration would have gone unfound in staging. On preprod, we have the exact same security group structure as we do in production and can immediately see the impact of any bugs around changes in security groups. We can now make changes to the preprod environment and try new ideas without any risk of impacting users.
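One way to keep that security group parity honest is to compare the rules of the two environments and report any drift. The sketch below assumes each environment's rules have already been exported into a plain dictionary; the group names, sources, ports, and the `diff_rules` helper are hypothetical, and nothing here queries AWS.

```python
def rule_set(groups):
    """Flatten {group: [(source, port), ...]} into a set of (group, source, port)."""
    return {
        (group, source, port)
        for group, rules in groups.items()
        for source, port in rules
    }


def diff_rules(production, preprod):
    """Report rules present in one environment but not in the other."""
    prod, pre = rule_set(production), rule_set(preprod)
    return {
        "missing_in_preprod": sorted(prod - pre),
        "extra_in_preprod": sorted(pre - prod),
    }


# Hypothetical exported rules: which source may reach which group on which port.
production_rules = {
    "editor": [("load-balancer", 443)],
    "documents": [("editor", 8080), ("export", 8080)],
}
preprod_rules = {
    "editor": [("load-balancer", 443)],
    "documents": [("editor", 8080)],
}

drift = diff_rules(production_rules, preprod_rules)
```

In this made-up example the `export` service can reach `documents` in production but not in preprod, which is exactly the kind of service-to-service difference that would make a test result untrustworthy.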
We have invested significant resources (including infrastructure-as-code efforts such as Cumulus) in creating our preproduction testing environment, and we have seen significant returns on that investment. If our struggles sound familiar to you, a preproduction environment might make sense for your testing organization as well.