Building a Date Dimension Table in Redshift

Elliot Chance
4 min readJun 28, 2017
Photo by Estée Janssens on Unsplash

What Is It?

It is common practice in data warehousing and reporting to use date and time dimension tables.

A date dimension table assigns an index to each day from an arbitrary starting point. Where that starting point is depends on how far back you will need to go. Usually picking a date at or slightly before the earliest records in your database is a good choice.

A time dimension table is similar in that it will assign an index to individual seconds for one day (but have no date component). Used in combination the date dimension table these two integer indexes represent an exact second for any date.

Each will have their own separate table that contains the date/time ID and any information about that particular day or second.

Why Is It Needed?

Let’s say we have a table that contains sales information; lot’s of it. How would you go about handling reports that needed to filter or summarise (group) on?

  1. Only business days (Mon-Fri).
  2. All days except the last day of each month.
  3. Specific week numbers of the year.
  4. A fiscal quarter.

Trying to do these date calculations in your queries raises a few problems:

  1. Date calculations are complicated in the easiest cases and making sure you handle all the edge cases is very complicated and error prone.
  2. Once you start performing date calculations it’s very unlikely the database will be able to make optimisations or smart decisions that allow it to use an index. If a full table scan is required and each record needs to be calculated for the date function it can make queries extremely expensive.

I See. How Do I Set This Up?

Although Redshift is advertised as having almost the same features as PostgreSQL there is one big missing feature that make it very difficult to generate a date dimension table:

  1. Only the leader node can perform any useful date calculations. This returns a lot of errors when we do any calculation involving a current timestamp.

Despite the obstacles we can still do it, here is the table definition:

CREATE TABLE date_dimension (
"date_id" INTEGER NOT NULL PRIMARY KEY,
-- DATE
"full_date" DATE NOT NULL,
"au_format_date" CHAR(10) NOT NULL,
"us_format_date" CHAR(10) NOT NULL,
-- YEAR
"year_number" SMALLINT NOT NULL,
"year_week_number" SMALLINT NOT NULL,
"year_day_number" SMALLINT NOT NULL,
"au_fiscal_year_number" SMALLINT NOT NULL,
"us_fiscal_year_number" SMALLINT NOT NULL,
-- QUARTER
"qtr_number" SMALLINT NOT NULL,
"au_fiscal_qtr_number" SMALLINT NOT NULL,
"us_fiscal_qtr_number" SMALLINT NOT NULL,
-- MONTH
"month_number" SMALLINT NOT NULL,
"month_name" CHAR(9) NOT NULL,
"month_day_number" SMALLINT NOT NULL,
-- WEEK
"week_day_number" SMALLINT NOT NULL,
-- DAY
"day_name" CHAR(9) NOT NULL,
"day_is_weekday" SMALLINT NOT NULL,
"day_is_last_of_month" SMALLINT NOT NULL
) DISTSTYLE ALL SORTKEY (date_id);

And we populate it:

INSERT INTO date_dimension
SELECT
cast(seq + 1 AS INTEGER) AS date_id,
-- DATE
datum AS full_date,
TO_CHAR(datum, 'DD/MM/YYYY') :: CHAR(10) AS au_format_date,
TO_CHAR(datum, 'MM/DD/YYYY') :: CHAR(10) AS us_format_date,
-- YEAR
cast(extract(YEAR FROM datum) AS SMALLINT) AS year_number,
cast(extract(WEEK FROM datum) AS SMALLINT) AS year_week_number,
cast(extract(DOY FROM datum) AS SMALLINT) AS year_day_number,
cast(to_char(datum + INTERVAL '6' MONTH, 'yyyy') AS SMALLINT) AS au_fiscal_year_number,
cast(to_char(datum + INTERVAL '3' MONTH, 'yyyy') AS SMALLINT) AS us_fiscal_year_number,
-- QUARTER
cast(to_char(datum, 'Q') AS SMALLINT) AS qtr_number,
cast(to_char(datum + INTERVAL '6' MONTH, 'Q') AS SMALLINT) AS au_fiscal_qtr_number,
cast(to_char(datum + INTERVAL '3' MONTH, 'Q') AS SMALLINT) AS us_fiscal_qtr_number,
-- MONTH
cast(extract(MONTH FROM datum) AS SMALLINT) AS month_number,
to_char(datum, 'Month') AS month_name,
cast(extract(DAY FROM datum) AS SMALLINT) AS month_day_number,
-- WEEK
cast(to_char(datum, 'D') AS SMALLINT) AS week_day_number,
-- DAY
to_char(datum, 'Day') AS day_name,
CASE WHEN to_char(datum, 'D') IN ('1', '7')
THEN 0
ELSE 1 END AS day_is_weekday,
CASE WHEN
extract(DAY FROM (datum + (1 - extract(DAY FROM datum)) :: INTEGER +
INTERVAL '1' MONTH) :: DATE -
INTERVAL '1' DAY) = extract(DAY FROM datum)
THEN 1
ELSE 0 END AS day_is_last_of_month
FROM
-- Generate days for the next ~20 years starting from 2011.
(
SELECT
'2011-01-01' :: DATE + generate_series AS datum,
generate_series AS seq
FROM generate_series(0, 20 * 365, 1)
) DQ
ORDER BY 1;

Now we can easily answer the previous examples:

  1. Only business days (Mon-Fri): day_is_weekday = 1
  2. All days except the first day of each month: day_is_last_of_month <> 1
  3. Specific week numbers of the year: year_week_number BETWEEN 32 AND 36
  4. A fiscal quarter: fiscal_year_number = 2017 AND fiscal_qtr_number = 3

Here is a complete example:

SELECT SUM(amount)
FROM sales
JOIN date_dimension ON dateid = saledateid
WHERE fulldate BETWEEN '2017-02-01' AND '2017-02-31'
AND day_is_weekday = 1;

You should use saledateid as a part of your SORTKEY. If you are not using Redshift there should be an index on saledateid.

Originally published at http://elliot.land on June 28, 2017.

--

--

Elliot Chance

I’m a data nerd and TDD enthusiast originally from Sydney. Currently working for Uber in New York. My thoughts here are my own. 🤓 elliotchance@gmail.com