In the spring of 2018, our data science team at Lucid decided to move our data warehouse from Redshift to Snowflake. We decided to make this change because we felt like Snowflake was a better fit for our data flow and analysis. We knew it would be a significant change that would impact a lot of people, so we had to implement this transition as smoothly as possible.
From inception to turning off the Redshift instance, the total transition took us about seven months, with a team of seven engineers. Here is a rough real-time breakdown of what we did, when, and how long it took:
- 3 months of talking to all involved teams, auditing and planning the reorganization of database structure.
- 3 months of converting old or creating 300 new ETL jobs, populating the new data warehouse, and verifying the new data.
- 1 month of using the new warehouse while keeping the old one available.
The majority of the work took place in the middle three months that required engineers to write the new jobs. Typically, one or two people can do most of the organizational work in the first step, and you should ideally have very little to do in that last month.
First step: Audit every table
The first step was to determine the full scope of the task. To do this, you need to go through every table in your existing warehouse and determine the following items:
How is the table populated?
Determine all the different ways data can end up in your warehouse. Is this table populated with an ETL job? Is it an intermediate table during some sort of data processing?
The longer a warehouse is used, the more ways employees find to populate and retrieve data from it. It’s important to determine how the table is populated so you can make the following decisions:
Is this data needed?
Identify the purpose of this table. Why does it exist? Do your analysts need it to try and answer questions? Or maybe it is used by accounting to track and predict your cash flow. You’ll need to talk to many different teams to find out who uses your warehouse and how.
Ideally, you will flush out some tables that are no longer needed because the table’s value is lower than expected, or the data is now derived through some other means. By asking this question you can save yourself some unneeded work during the migration.
What is the business cost of this table?
Decide who will be impacted and how severe the impact will be for day-to-day operations if the data in this table is incorrect or changed in any way. This question is somewhat related to the “Is this data needed?” question, in that if the business impact is low or non-existent, you may not need the data. But, another part of this question is to find out exactly what the impact is if the reporting changes.
For example, your business may be used to reporting data in a certain way, but while an engineer is moving the job, they may want to fix a bug or implement a computation they believe to be more accurate. While changes like this could make the data more accurate, changing the reporting of the new data could create an inconsistency with historical data that can look alarming on a chart used by executives who need to present it to a board of investors.
Second step: Reorganize
This is the ideal time for you to reorganize your tables and permissions. If your existing data warehouse is the first one you’ve ever created, your database organization is probably fairly flat with a couple of databases holding a majority of tables. Now that your business has grown and you have new needs and new regulations, you may need to reorganize your tables with permissions in mind so you can better control who has access to what data.
For example, Lucid recently formed a data science team for analytics purposes. In order to migrate the data they would need without compromising security, we needed to remove any Personally Identifying Information (PII) from the data sets they were working on. So, we created new tables with anonymized data that the data science team could use but kept the PII data in a secure table in a way where we could link the data back if there was a business case to do so. This structure also allowed us to comply with European GDPR regulations because we could sanitize the rows in the PII database and still have useable user data in the data science database.
It also helps to organize development databases that mirror your production databases to allow your engineers to develop your new ETL jobs in an environment that matches the production environment as closely as possible. It won’t be exact, however, since you will need to grant your developers permissions to edit all databases and tables in the sandbox while restricting them in production. At Lucid, we accomplished this simply by putting a prefix of “sandbox_” on all of the sandbox databases. So, the “users” database became the “sandbox_users.” The table names stayed the same, as they were already separated from the production tables by the database. Then, switching between “development” and “production” environments was simply a matter of changing configuration files with the different database names.
Third step: Start populating your new warehouse
Depending on how much of your existing ETL infrastructure you can reuse, you’ll probably spend most of your time on this step.
You’ll first need to determine how you get your data into your new warehouse. At Lucid, we were planning on using the same Scala based ETL system for Snowflake that we used for Redshift. Both allow a JDBC client to connect to their databases and stage the data in S3, so we were able to build our new Snowflake client wrapper in much of the same way we built our Redshift one.
Once you have your client and ETL process in place, you can start to populate tables in your new warehouse using your new database organization. You won’t be able to populate all the tables at once, so you’ll need to prioritize which tables to work on first. Start with the simple logic jobs first to verify that the pipeline to your new data warehouse works as expected. Then you’ll probably want to move on to jobs that have more complex logic or permissions. This is also a good time to do some performance timings to verify that your jobs will be able to get the throughput they need to keep up with the incoming data.
Your business won’t stop running while you make the transition to your new warehouse, so you need to keep the old warehouse up and running as is. You will be paying for and managing two warehouses short term, but having the data duplicated accomplishes two goals:
- Your day-to-day business won’t be impacted by any outages.
- You have a way to compare the data in the new warehouse with known, good existing data in the old warehouse while the new tables are populated.
Fourth step: Verify the data in your new warehouse
The verification should involve two things:
- Comparing data in the old warehouse with the data in the new warehouse.
- Migrate data warehouse users from the old to the new so they can verify their processes.
In this stage, communication is key. Make sure that those who are impacted are aware that the new warehouse is now the de-facto location for their data, and that they should begin migrating to the new warehouse. Getting people to change their habits is hard, so you’ll need to be very upfront and transparent about this change. Follow up with the affected employees or more technical users to verify they are using the new warehouse, that the data is working for them, and provide a reminder that the old warehouse will eventually be shut down.
The time it takes to complete this step can vary depending on the issues that crop up. This is where any minor variations in the data will be noticed by the consumers of the data. These changes may be intentional (such as accuracy) or unintentional (a bug fix). Either way, work with each team to do one of the following::
- Ensure the problem is fixed to their satisfaction; or
- Convince the consumer that this data is more accurate and better in some fashion.
This step will be an iterative process that will be time-consuming. You’ll probably find that people have been using the old warehouse data in ways you didn’t know and it will take some time for people to trust the new process.
Fifth step: Turn off jobs that populate the old warehouse
Once you are up and populating the new warehouse, you may want to wait a week or two before stopping the jobs that populate the old warehouse. There may be data that is only used once a week or once a month that people won’t have had a chance to evaluate in your new warehouse yet.
When you do decide to stop the old jobs, you can choose to turn all of them off at once or gradually, depending on how confident you are that you’ve communicated the change to all the relevant people, seeing as anyone using the old tables will now be looking at stale data.
Sixth step: Turn off the old warehouse
While this step would appear to be the end of your adventure, it probably won’t be.
There may be some old ETL job that nobody was aware of that will start to fail. It may not even start to fail immediately, as it could be a job that runs once a week or once per month. So, even if everything initially looks good, keep in mind that something may still pop up a month from now. If you’ve done all of the above steps, any job which starts to fail at this point is populating data which is rarely (if ever) used, so you may be able to kill the job completely. If not, there should be a small enough number of jobs to fix that it won’t be overwhelming.
Seventh step: Bask in the glow of your new warehouse
At Lucid, this conversion took about seven months to complete, from inception to turning off the Redshift instance. The three months we spent planning our database and table permissions—taking into consideration the data science team and GDPR regulations—were invaluable. We had seven engineers that converted about 300 Scala jobs in roughly three months and then waited one more month before shutting down the Redshift instance entirely. After we shut down the Redshift instance, we had a small number of jobs that started failing. We missed them in the initial pass because they never showed up as jobs people used, so we were able to just terminate them.
Taking the time to do things right allowed us to have a smooth transition during this data migration. In the end, that effort meant that all of our stakeholders understood why this transition was made. They were kept in the loop during the entire process while verifying their data and have been happy with the results in the new warehouse.