Removing Duplicate Data in Redshift

Redshift does not have indexes, and that includes primary keys. You can (and should) create them, and they are helpful to the query planner, but they do not physically exist. This also means the constraints are not enforced. When moving a lot of data around, it sometimes happens that processes overlap and leave duplicate records behind.
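
For example, a declared primary key won’t stop the same row from being loaded twice. A minimal sketch, using a made-up sales table (the column names here are only for illustration):

    CREATE TABLE sales (
        saleid     INTEGER NOT NULL,
        saledateid INTEGER NOT NULL,
        amount     DECIMAL(10, 2),
        PRIMARY KEY (saleid)
    )
    SORTKEY (saledateid);

    -- Both inserts succeed: the primary key is metadata for the query
    -- planner, not an enforced constraint.
    INSERT INTO sales VALUES (1, 2224, 9.99);
    INSERT INTO sales VALUES (1, 2224, 9.99);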

This was our problem. There are a few ways to solve this:

  1. Create a new table, SELECT DISTINCT into the new table, and do the old switch-a-roo (sketched just after this list). This would be very slow on such a large dataset, but more importantly, we literally didn’t have enough storage space to recreate the table.
  2. Use some external program or processor to go through the table and delete individual records or groups of records. Too hard.
  3. Use some crazy SQL statement with window functions to try to join on and delete specific rows. Probably possible, but likely very slow.
  4. Surgically find and delete the duplicate records. This seems like the best solution, and the one I used.
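
For reference, option 1 would look something like this (again with the made-up sales table standing in for the real one):

    -- Deep copy exactly one version of every row, then swap the tables.
    -- Needs enough free disk space to hold a full copy of the table.
    CREATE TABLE sales_new (LIKE sales);

    INSERT INTO sales_new
    SELECT DISTINCT * FROM sales;

    DROP TABLE sales;
    ALTER TABLE sales_new RENAME TO sales;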

It works like this (all wrapped in a transaction):

  1. Find all of the primary key values that belong to duplicate rows, and store those distinct IDs in a temporary table.
  2. Create another table with the same structure as the original table and load in the distinct rows for the specific IDs identified in the previous step.
  3. Delete all of the rows in the main table for the distinct IDs in the first step.
  4. Insert back all the single rows we extracted in step 2.
  5. Some cleanup (drop tables, etc.).

One bonus for us (if you could call it that) is that we knew the duplicate records fell within a specific week, so all of the queries below could have an added filter on a date range (saledateid), which is the main sort key. This speeds things up considerably.

If you have an extremely large table where even this method would not be feasible, it might make sense to divide your dataset into sections of the sort key and apply this technique over known sort key ranges, one range at a time.
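
The statements below stick with the made-up names from earlier: a sales table with a saleid primary key and the saledateid sort key. The week range in the filters is equally illustrative, so substitute your own table, columns, and values.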

Here we go:

    BEGIN;
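
    -- Step 1: record the primary keys of every row that appears more than
    -- once, limiting the scan to the affected date range on the sort key.
    CREATE TEMP TABLE duplicate_saleids AS
    SELECT saleid
    FROM sales
    WHERE saledateid BETWEEN 2224 AND 2231  -- illustrative week range
    GROUP BY saleid
    HAVING COUNT(*) > 1;

    -- Step 2: extract exactly one copy of each duplicated row into a
    -- table with the same structure as the original.
    CREATE TEMP TABLE new_sales (LIKE sales);

    INSERT INTO new_sales
    SELECT DISTINCT *
    FROM sales
    WHERE saledateid BETWEEN 2224 AND 2231
      AND saleid IN (SELECT saleid FROM duplicate_saleids);

    -- Step 3: delete every copy of the duplicated rows from the main table.
    DELETE FROM sales
    WHERE saledateid BETWEEN 2224 AND 2231
      AND saleid IN (SELECT saleid FROM duplicate_saleids);

    -- Step 4: insert the single copies back.
    INSERT INTO sales
    SELECT * FROM new_sales;

    -- Step 5: cleanup.
    DROP TABLE duplicate_saleids;
    DROP TABLE new_sales;

    COMMIT;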

When used correctly, Redshift once again proves it’s an absolute beast at processing huge sets of data.

Originally published at http://elliot.land on March 1, 2017.
