20 Jul 2017

Using generate_series() in Redshift

Considering that Redshift clearly states that it doesn't support (the commonly used postgres function) generate_series(), it gets very frustrating if you just want to fill a table with a lot of rows and can't without a valid data-source.

Solution (Generates a billion integers on my test-cluster):

  FROM stl_connection_log a, stl_connection_log b, stl_connection_log c
  -- LIMIT 100
  SELECT row_number() over (order by 1) FROM x;

For a Redshift server with even a basic level of login activity, this should generate enough rows. For e.g. On my test cluster, where I am the only user, this currently generates 4034866688 (4 billion) rows :) !

Interestingly, irrespective of the document, generate_series() actually does work on Redshift:

# select 1 as gen from generate_series(1,3);
│ gen │
│   1 │
│   1 │
│   1 │
(3 rows)

The reason why this wouldn't let you insert any rows to your table though, is that this is a Leader-Node-Only function, whereas INSERTs (on any non-single Redshift Cluster) are run on the Compute Nodes (which don't know about this function).

The reason why the above works, is ROW_NUMBER() and CROSS JOIN allow us to generate a large number of rows, but for that, the initial data-set (here the STL_CONNECTION_LOG System Table) should have at least some rows to multiply on! You could use any other system table (that is available on Compute Nodes) if required, for some other purpose.

Play On!

31 May 2017

Patch: Using --no-comments with pg_dump

Recently I submitted a patch for review that allows a non-superuser to practically use a backup taken from pg_dump.

Currently it is a kludge (and well known at that - Ref 1 / Ref 2 / Ref 3 / Ref 4) but since it's (the ideal solution) too big a byte to chew and not in high-demand, it has seen little to no traction in the past decade.

This patch should allow the above. But more importantly, it should also allow regular users of AWS RDS Postgres as well as Google Cloud Postgres databases (who do not get SuperUser access by-design) to reliably use the backups, instead of tinkering with the backup SQLs and remove things like COMMENT ON EXTENSION for it to even run during restoration.

The bad news is that since Postgres 10 has already branched off, I doubt this would see the light of the day (if at all) any time before Postgres 11 (unless there is consensus that it's helpful enough and gets 'back-patched' to Postgres 10 to be released around September 2017).

horntail it is :)

I had been planning to contribute to Postgres Buildfarm by running it on one of my VMs and finally got that done... and horntail it is.

A few things to note, before it fades from (my) memory:
  • It (obviously) requires quite a bunch of libraries, especially if you're using the defaults (for e.g. with ssl / krb5 etc..)
    • You may either disable some of those exotic options (not ideally recommended)
    • Or, you may use a silo'ed VM just for this purpose so that it doesn't slow your regular work
    • I did neither (of the above) but didn't care about performance of this VM
      • Probably would enable more Options / Branches down the line and re-submit
  • My distro was Ubuntu and so YMMV, but the packages weren't very obvious. If you see something fail and you check logs, you'd see that 'abc' was missing, but wouldn't have the faintest idea which package satisfied that
    • Some hiccup missing libraries were (These may be obvious for most, but were new to me ):
      • gss_init_sec_context
      • msgfmt
      • tclConfig
    • It'd be good if someone could catalog the packages required
      • For all Unix variants - obviously easier said that done, but
        • It's possible that there may be more people willing (like me) but unable to scale all the walls
      • At least the super-set of what's required when all flags are enabled, from which people could try what's needed for his / her set of enabled flags
  • As it always is, the documentation is relatively clear, just that persistence (especially for exotic Linux variants / issues) is needed in large quantities.
All the best :)

14 Mar 2017

Using pg_dumpall with Google Cloud Postgres

Luckily the recent update to pg_dumpall works (without any extra effort) to dump Globals off the recently launched Google Cloud PostgreSQL service well.

This is because this Postgres service seems to be using the same limitation (as AWS RDS Postgres does) in that none of the user accounts are SuperUser in the Postgres sense. Therefore, this platform too disallows accessing pg_authid.

The fallback here too, is to use pg_roles to read (as much as is possible) to dump Role related Globals. Read here for more info on the recent changes to pg_dumpall that makes this work.

12 Mar 2017

Redshift Limitations

Recently have been toying around with Redshift (a popular Data-Warehouse tool by AWS) a little bit and see some minor but obvious limitations in SQL limitations:

FULL OUTER JOIN + ON Func(x) = Func(y) : Not Supported

Although Redshift supports most variants of FULL OUTER JOIN, however, when the JOIN condition contains a function on both sides, it gives up complaining "ERROR:  XX000: could not devise a query plan for the given query".

CREATE TABLE x (a integer);

SELECT x1.a, x2.a
FROM x x1
     ON LOWER(x1.a) = LOWER(x2.a); -- Fails
  -- ON LOWER(x1.a) = x2.a; -- Works
  -- ON x1.a = LOWER(x2.a); -- Works
  -- ON x1.a = x2.a; -- Obviously Works

Workaround: The obvious way around it is to move one of the functions to a separate SQL (for e.g create a Temp Table and use that in the final computation), something like this:


SELECT x1.a, x2.a
ON x1.a = LOWER(x2.a); -- Works