ETL on Billions of Rows in Redshift

Recently we needed to move some data around. In simple terms, we needed to redistribute a large table into S3 objects based on a shard key.

The problem was that the distribution key (let’s call this company_id) was not available on the large table (let’s call this order_events) we needed to export.

We explored several solutions before deciding to use Redshift to do this kind of one-time ETL. The process was pretty simple:

  1. Use DMS to load several tables into a new Redshift cluster.
  2. Run some SQL for the ETL.
  3. Sort the table by the new distribution key.
  4. Redshift UNLOAD into S3. We needed it in S3 in this case, but it could have been loaded anywhere quite easily.

The Redshift cluster consisted of a single node with 4 vCPUs, 31 GiB of RAM, 2 TB of magnetic storage and 0.40 GB/s of I/O throughput.

Unfortunately I only realised this would make a good blog post after the DMS load was complete, so I don't have specific stats on how long the initial load took. But if I remember correctly, it was on the order of hours rather than days.

So let’s begin from the point of the data already loaded into Redshift…
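First, a quick sanity check: count the rows in the table we need to export. A minimal sketch of that query, using the order_events name from above:

```sql
-- Count every row in the table we need to export.
SELECT COUNT(*) FROM order_events;
```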

2,046,706,417 rows in 5.3 seconds.

Great. So Redshift is pretty good at counting rows, most likely because it's a column-oriented database. The next operation should also be pretty quick:
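That operation is adding the empty company_id column to order_events. Roughly like this (the BIGINT type is an assumption):

```sql
-- Add the new shard key column; every existing row starts out with NULL here.
ALTER TABLE order_events ADD COLUMN company_id BIGINT;
```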

Done in 9.2 seconds. Wow!

Now we need to do the UPDATE JOIN that will fill in the new company_id on the 2 billion rows:
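Redshift supports the PostgreSQL-style UPDATE … FROM syntax for this. The sketch below assumes the company_id comes from an orders table loaded by the same DMS task, joined through a hypothetical order_id column:

```sql
-- Fill in company_id on every event by joining through the parent order.
-- The orders table and the order_id join column are assumptions for illustration.
UPDATE order_events
SET company_id = orders.company_id
FROM orders
WHERE order_events.order_id = orders.id;
```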

Done in just under 25 minutes. That is 1.4 million rows per second. Nice.

Next we need to split the table into one S3 object for each company_id (the new attribute we just added). This would be very slow while the table has no sort key, so we need to order it by company_id.

So far Redshift has not even broken a sweat. But sorting is an intensive process and we are using magnetic (not SSD) storage.

It is not possible to change the sort key on an existing table, so we have to create a similar table with the new sort key and insert all the records into the new table.
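As a sketch, with a deliberately shortened (and partly assumed) column list:

```sql
-- Same shape as order_events, but physically sorted by the new shard key.
CREATE TABLE order_events_sorted (
    id         BIGINT,
    order_id   BIGINT,
    company_id BIGINT,
    created_at TIMESTAMP
    -- ... remaining order_events columns ...
)
SORTKEY (company_id);

-- Copy everything across; Redshift sorts the data by company_id as part of the load.
INSERT INTO order_events_sorted
SELECT id, order_id, company_id, created_at
FROM order_events;
```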

The transaction (which is entirely consumed by the INSERT) took 7.5 hours to complete. Not bad. Things get more interesting when we look at the disk space used:

[Chart: cluster disk space used over the course of the INSERT transaction]

The INSERT itself happens in around 30 minutes; the remaining time is spent sorting the data on disk.
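Once the transaction finishes, you can confirm the result through Redshift's svv_table_info system view, which reports the table size and how much of it is still unsorted (the table name below is ours):

```sql
-- Size of the new table in 1 MB blocks and the percentage of rows still unsorted.
SELECT "table", size AS size_in_mb, unsorted AS pct_unsorted
FROM svv_table_info
WHERE "table" = 'order_events_sorted';
```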

One other thing was to make sure we did not have duplicates. This was not strictly necessary, since the data came from a table with a primary key in MySQL. But seeing what Redshift can do is fun, so here we go…
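The check itself is just a GROUP BY on the primary key, assuming it came across from MySQL as a column called id:

```sql
-- Any id that appears more than once is a duplicate.
SELECT id, COUNT(*) AS occurrences
FROM order_events_sorted
GROUP BY id
HAVING COUNT(*) > 1;
```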

As expected, no duplicate IDs. This query took 27 minutes.

Now we are able to export the new table to S3 with a bit of bash. And it worked a treat!
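The bash is just a loop over the distinct company_id values, with each iteration firing off an UNLOAD along these lines (the bucket, prefix, IAM role and the company_id value are all placeholders):

```sql
-- Export one company's rows to its own S3 prefix as gzipped files.
-- PARALLEL OFF writes a single output file (Redshift still splits it above 6.2 GB).
UNLOAD ('SELECT * FROM order_events_sorted WHERE company_id = 42')
TO 's3://example-bucket/order_events/company_id=42/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-unload'
GZIP
PARALLEL OFF
ALLOWOVERWRITE;
```

Because the table is now sorted on company_id, each of those UNLOADs only has to scan the blocks containing that company's rows rather than the whole table.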

Originally published at http://elliot.land on September 6, 2017.

