20 Jul 2017

Using generate_series() in Redshift

The Redshift documentation lists generate_series() (a commonly used Postgres function) as unsupported, so it gets very frustrating when you just want to fill a table with a lot of rows and have no data source to select from.

Solution (generates billions of integers on my test cluster):

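-- INSERT INTO tbl   -- uncomment to write the numbers into your own table
WITH x AS (
  SELECT 1
  -- three-way cross join: N rows in the log become N^3 rows
  FROM stl_connection_log a, stl_connection_log b, stl_connection_log c
  -- LIMIT 100       -- uncomment to cap the number of generated rows
)
SELECT row_number() OVER (ORDER BY 1) FROM x;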

On a Redshift cluster with even a basic level of login activity, this should generate enough rows. For example, on my test cluster, where I am the only user, it currently generates 4034866688 (about 4 billion) rows. That is 1592 cubed, so the connection log held 1592 rows at the time :) !
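If you want to know how many rows the cross join will produce before running it, a quick check along these lines may help. It is only a sketch: the column aliases are made up for illustration, and it assumes your user can read STL_CONNECTION_LOG.

```sql
-- Count the base rows and predict the size of the three-way cross join.
-- COUNT(*) is a BIGINT, so the product stays in integer arithmetic.
SELECT COUNT(*)                       AS base_rows,
       COUNT(*) * COUNT(*) * COUNT(*) AS rows_after_cross_join
FROM stl_connection_log;
```

If the predicted number is far more than you need, cap it with the LIMIT inside the CTE; if it is too small, add a fourth copy of the table to the join.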

Interestingly, despite what the documentation says, generate_series() actually does work on Redshift:

# select b from generate_series(1,3) as a(b);
┌───┐
│ b │
├───┤
│ 1 │
│ 2 │
│ 3 │
└───┘
(3 rows)

The reason this still won't let you insert any rows into your table is that generate_series() is a leader-node-only function, whereas INSERTs (on any multi-node Redshift cluster) run on the compute nodes, which don't know about this function.
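To make the difference concrete, here is a small sketch of the two cases. The table name tbl and its column n are placeholders for illustration, and the behaviour in the comments is what I would expect on a multi-node cluster given the explanation above, not output copied from a real run.

```sql
-- Hypothetical target table, for illustration only.
CREATE TABLE tbl (n INTEGER);

-- Touches no user table, so it can run entirely on the leader node
-- and returns the rows 1, 2, 3.
SELECT b FROM generate_series(1, 3) AS a(b);

-- Writes to a user table, so the work has to happen on the compute nodes,
-- where generate_series() is not available. Expect this to be rejected.
INSERT INTO tbl
SELECT b FROM generate_series(1, 3) AS a(b);
```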

The reason the solution above works is that ROW_NUMBER() and a CROSS JOIN let us generate a large number of rows. For that, though, the initial data set (here the STL_CONNECTION_LOG system table) must have at least some rows to multiply! If you need to, you could use any other system table that is available on the compute nodes.
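Putting it together, filling a table with a fixed number of integers could look something like the sketch below. The table name numbers, its column n, and the row count of one million are all assumptions chosen for the example; the LIMIT only gives you that many rows if the cross join produces at least as many.

```sql
-- Hypothetical target table, for illustration only.
CREATE TABLE numbers (n BIGINT);

INSERT INTO numbers
WITH x AS (
  SELECT 1
  -- The cross join supplies the raw rows; their content is irrelevant.
  FROM stl_connection_log a, stl_connection_log b, stl_connection_log c
  -- Cap the output at the number of integers actually wanted.
  LIMIT 1000000
)
-- ROW_NUMBER() turns the anonymous rows into 1, 2, 3, ...
SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM x;
```

Afterwards, SELECT MIN(n), MAX(n), COUNT(*) FROM numbers; is a cheap way to confirm that the table holds the range you expected.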

Play On!
