When working with Amazon’s Redshift for the first time, it doesn’t take long to realize it’s different from other relational databases. You have new options like COPY and UNLOAD, and you lose familiar helpers like key constraints. You can work faster with larger sets of data than you ever could with a traditional database, but there’s a learning curve to get the most out of it.
One area we struggled with when getting started was unhelpful disk full errors, especially when we knew we had disk space to spare. Over the last year, we’ve collected a number of resources on how to manage disk space in Redshift. We’ll share what we’ve learned to help you quickly debug your own Redshift cluster and get the most out of it.
Make sure you know how much disk space you actually have
If you’re getting a disk full error when running a query, one thing for certain has happened—while running the query, one or more nodes in your cluster ran out of disk space. This could be because the query is using a ton of memory and spilling to disk or because the query is fine and you just have too much data for the cluster’s hard disks. You can figure out which is the case by seeing how much space your tables are using by querying the stv_partitions table. I like to use this query from FlyData.
select sum(capacity)/1024 as capacity_gbytes, sum(used)/1024 as used_gbytes, (sum(capacity) - sum(used))/1024 as free_gbytes from stv_partitions where part_begin=0;
Ideally, you won’t be using more than 70% of your capacity. Redshift should continuing working well even when over 80% of capacity, but it could still be causing your problem. If it looks like you have plenty of space, continue to the next section, but if you’re using more than 90%, you definitely need to jump down to the “Encoding” section.
Join at your own risk
If the query that’s failing has a join clause, there’s a good chance that’s what’s causing your errors. When Redshift executes a join, it has a few strategies for connecting rows from different tables together. By default, it performs a “hash join” by creating hashes of the join key in each table, and then it distributes them to each other node in the cluster. That means each node will have to store hashes for every row of the table. When joining large tables, this quickly fills up disk space. However, if the two join keys for the query are on the same node, the whole query can happen in place without using any additional memory. By setting up the tables so their dist keys are the same, you may be able to avoid a disk full error. When setting up dist keys, though, be wary of skew, which is discussed in the next section.
If you can’t change the dist key because the dist key is optimized for another query, the new key would cause skew issues, or some other reason, you may be able to make some changes to your query so it can still be executed. Here are some options you can try:
- Use a subquery instead of a join. Some queries that use joins only need data from one of the tables but are using the join to verify some piece of information. In those cases, the join can often be replaced by an IN clause and a subquery. For example, a common query for us is to get some piece of information about users with subscriptions. Instead of joining the two tables, we can select users whose ids are in the subscriptions table. While holding on to the result of the subquery takes some memory, it is usually much less than what’s needed for a hash join.
- Create and join subtables. In many cases, we are only retrieving small subsets of data from the tables being joined but are doing a hash join of whole tables. In those cases, you can create a table, usually a temporary one, that is a subset of a table being joined but has whatever filtering you need already applied. That way, by joining two small tables, the hashes are much smaller. You may even be able to distribute them so that there is no hash join at all. Again, this option uses some memory, but it’s much less than a hash join of the full tables.
Skewered, the downside of dist keys
If you have dist keys set up, it’s possible that you are dealing with skew, which is when more rows from a table are put on one node than the others. Significant skew can lead to disk full errors on even routine queries because any additional disk space used can cause the one overloaded node to throw an error. This query from Amazon is great for checking for skewed tables. As noted in the link, if you have high values in the “skew” column or low values in the “slices populated” column, especially for large tables, then you will likely need to rethink your dist strategy for those tables.
When setting up distribution on our cluster for a large table, we had chosen a key that had a large number of possible values so the rows should have been distributed evenly across nodes. However, what we didn’t realize was that this column was null for many of the rows. All of those rows were then stored on the same node of the cluster, causing that node to throw a disk full error on almost any query, even though we were only using 75% of our disk space.
A unique feature of Redshift compared to traditional SQL databases is that columns can be encoded to take up less space. However, there is no automatic encoding, so the user has to choose how columns will be encoded when creating a table. There are a lot of options for encoding that you can read about in Amazon’s documentation. The easiest way to get started encoding is to use Amazon’s python script to analyze your tables and get recommendations. If you are running low on disk space and haven’t encoded your tables yet, you can recover a sizable amount of space this way. If you have encoded your tables, it may be worth checking the svv_table_info table to see if any tables have been added without encoding or rerunning the above script to see if any tables should have their encoding changed.
Clean up by vacuuming
When new rows are added to Redshift, they aren’t added in their specified sort order, which is important for some encoding types to work, and when rows are deleted, the space isn’t automatically freed up. Vacuuming handles both of those problems. When you run a vacuum command on a table, it is sorted, and space used by deleted rows is freed up. If you have added or removed a large number of rows from a table, vacuuming that table will free up some space. You can read about how to run a vacuum command and what options you have here.
The idea of vacuuming comes from Redshift’s parent project Postgres, but if you are familiar with Postgres, you may be surprised to find that vacuuming doesn’t occur automatically and the command must be run manually. It is also worth noting that only one table can be vacuumed at a time, so you need to schedule vacuums carefully. To save you from having to vacuum, you should prefer dropping a table or using the “truncate” command rather than the “delete” command when deleting large amounts of data, since those commands automatically free up disk space and a vacuum won’t be required.
Out of the woods, hopefully
If you’ve followed this guide, hopefully you have enough space on your cluster and have stopped seeing disk-full errors. If you still are, however, there are really only two options left: delete data or buy another node.
Don’t be afraid to look into deleting data. We occasionally audit our tables and clear out data that was used in experiments and now-defunct projects, saving us some space. Just remember to vacuum tables that you delete rows from. If you have any questions about managing a Redshift cluster or if you’ve found another way for managing space, feel free to let us know.