15 Sep 2017

PsqlForks now supports PipelineDB

After working on this psql variant that intends to support all Postgres forks, I finally settled on a name for it.

Since this was essentially Psql (for) Forks, quite unintuitively I chose to name it PsqlForks.

Considering that until recently this fork supported only Amazon Redshift, the name didn't make much sense unless it supported at least 2 forks!

Thus, PsqlForks now supports PipelineDB!

$  /opt/postgres/master/bin/psql -U pipeline -p 5434 -h localhost pipeline
psql (client-version:11devel, server-version:9.5.3, engine:pipelinedb)
Type "help" for help.

pipeline=# \q

2 Sep 2017

psql \d now supports Interleaved / Compound SORTKEYs (in Redshift)

In continuation of the Redshift support series, Describe Table (e.g. \d tbl) now shows SORTKEY details. This resolves Issue #6 and shows both COMPOUND / INTERLEAVED variations along with all the column names.

This change was complicated because Redshift doesn't natively support the LISTAGG() function on System / Catalog tables, which meant I had to resort to a pretty verbose workaround. This in turn means that the patch shows only the first ten COMPOUND SORTKEY columns of a table. Frankly, it would take an extreme corner case for someone to genuinely require a SORTKEY with 10+ columns.

This is not a limitation for INTERLEAVED SORTKEYs, since Redshift only supports a maximum of 8 columns there anyway.
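
For the curious, the workaround emulates LISTAGG() by pivoting each SORTKEY position into its own MAX(CASE ...) term and concatenating the results, which is exactly why the column count is capped. A minimal sketch (not the patch's exact SQL; it assumes Redshift's pg_attribute exposes attsortkeyord, where negative values denote INTERLEAVED keys):

-- One MAX(CASE ...) term per SORTKEY position; shown here for 3 positions,
-- the patch carries 10 such terms
SELECT btrim(
         MAX(CASE WHEN abs(a.attsortkeyord) = 1 THEN a.attname::text || ',' ELSE '' END) ||
         MAX(CASE WHEN abs(a.attsortkeyord) = 2 THEN a.attname::text || ',' ELSE '' END) ||
         MAX(CASE WHEN abs(a.attsortkeyord) = 3 THEN a.attname::text || ',' ELSE '' END),
         ',') AS sortkey_columns
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
WHERE c.relname = 'tbl_compound'
  AND a.attsortkeyord <> 0;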

db=# \d tbl_pk
                                           Table "public.tbl_pk"
 Column  |   Type   | Encoding | DistKey | SortKey | Preload | Encryption | Collation | Nullable | Default
 custkey | smallint | lzo      | f       | 0       | f       | none       |           | not null |
 PRIMARY KEY, btree (custkey)

db=# CREATE TABLE tbl_compound(
db(#   custkey   SMALLINT                ENCODE delta NOT NULL,
db(#   custname  INTEGER DEFAULT 10      ENCODE raw NULL,
db(#   gender    BOOLEAN                 ENCODE RAW,
db(#   address   CHAR(5)                 ENCODE LZO,
db(#   city      BIGINT identity(0, 1)   ENCODE DELTA,
db(#   state     DOUBLE PRECISION        ENCODE Runlength,
db(#   zipcode   REAL,
db(#   tempdel1  DECIMAL                 ENCODE Mostly16,
db(#   tempdel2  BIGINT                  ENCODE Mostly32,
db(#   tempdel3  DATE                    ENCODE DELTA32k,
db(#   tempdel4  TIMESTAMP               ENCODE Runlength,
db(#   tempdel5  TIMESTAMPTZ             ENCODE DELTA,
db(#   tempdel6  VARCHAR(MAX)            ENCODE text32k,
db(#   start_date VARCHAR(10)            ENCODE TEXT255
db(# )
db-# DISTKEY (custname)
db-# COMPOUND SORTKEY (custkey, custname, gender, address, city, state, zipcode, tempdel1, tempdel2, tempdel3, tempdel4, tempdel5, start_date);
db=# \d tbl_compound
                                                                 Table "public.tbl_compound"
   Column   |            Type             | Encoding  | DistKey | SortKey | Preload | Encryption | Collation | Nullable |              Default
 custkey    | smallint                    | delta     | f       | 1       | f       | none       |           | not null |
 custname   | integer                     | none      | t       | 2       | f       | none       |           |          | 10
 gender     | boolean                     | none      | f       | 3       | f       | none       |           |          |
 address    | character(5)                | lzo       | f       | 4       | f       | none       |           |          |
 city       | bigint                      | delta     | f       | 5       | f       | none       |           |          | "identity"(494055, 4, '0,1'::text)
 state      | double precision            | runlength | f       | 6       | f       | none       |           |          |
 zipcode    | real                        | none      | f       | 7       | f       | none       |           |          |
 tempdel1   | numeric(18,0)               | mostly16  | f       | 8       | f       | none       |           |          |
 tempdel2   | bigint                      | mostly32  | f       | 9       | f       | none       |           |          |
 tempdel3   | date                        | delta32k  | f       | 10      | f       | none       |           |          |
 tempdel4   | timestamp without time zone | runlength | f       | 11      | f       | none       |           |          |
 tempdel5   | timestamp with time zone    | delta     | f       | 12      | f       | none       |           |          |
 tempdel6   | character varying(65535)    | text32k   | f       | 0       | f       | none       |           |          |
 start_date | character varying(10)       | text255   | f       | 13      | f       | none       |           |          |
 COMPOUND SORTKEY (address,tempdel2,start_date,custkey,zipcode,tempdel4,city,state,tempdel3,custname)

db=# CREATE TABLE tbl_interleaved(custkey SMALLINT) INTERLEAVED SORTKEY (custkey);
db=# \d tbl_interleaved
                                      Table "public.tbl_interleaved"
 Column  |   Type   | Encoding | DistKey | SortKey | Preload | Encryption | Collation | Nullable | Default
 custkey | smallint | none     | f       | 1       | f       | none       |           |          |

As a side-note, there is a consideration as to whether this should get a separate section of its own (and not sit under Indexes, which it clearly isn't). Maybe another day. Happy Redshifting :) !

31 Aug 2017

psql \d now supports DISTKEY / SORTKEY / ENCODING (in Redshift)

This is in continuation of my work on (my forked version of) psql to better support Redshift (read more here).

Now \d table shows some additional Redshift specific table properties, such as DISTKEY, SORTKEY and ENCODING:

t3=# CREATE TABLE customer(
  custkey   SMALLINT                ENCODE delta NOT NULL,
  custname  INTEGER DEFAULT 10      ENCODE raw NULL,
  gender    BOOLEAN                 ENCODE RAW,
  address   CHAR(5)                 ENCODE LZO,
  city      BIGINT identity(0, 1)   ENCODE DELTA,
  state     DOUBLE PRECISION        ENCODE Runlength,
  zipcode   REAL,
  tempdel1  DECIMAL                 ENCODE Mostly16,
  tempdel2  BIGINT                  ENCODE Mostly32,
  tempdel3  DATE                    ENCODE DELTA32k,
  tempdel4  TIMESTAMP               ENCODE Runlength,
  tempdel5  TIMESTAMPTZ             ENCODE DELTA,
  tempdel6  VARCHAR(MAX)            ENCODE text32k,
  start_date VARCHAR(10)            ENCODE TEXT255
)
DISTKEY (custname)
INTERLEAVED SORTKEY (custkey, custname);
t3=# \d customer
                                                                   TABLE "public.customer"
   Column   |            Type             | Encoding  | DistKey | SortKey | Preload | Encryption | Collation | Nullable |              Default
 custkey    | smallint                    | delta     | f       | 1       | f       | none       |           | not null |
 custname   | integer                     | none      | t       | 2       | f       | none       |           |          | 10
 gender     | boolean                     | none      | f       | 0       | f       | none       |           |          |
 address    | character(5)                | lzo       | f       | 0       | f       | none       |           |          |
 city       | bigint                      | delta     | f       | 0       | f       | none       |           |          | "identity"(493983, 4, '0,1'::text)
 state      | double precision            | runlength | f       | 0       | f       | none       |           |          |
 zipcode    | real                        | none      | f       | 0       | f       | none       |           |          |
 tempdel1   | numeric(18,0)               | mostly16  | f       | 0       | f       | none       |           |          |
 tempdel2   | bigint                      | mostly32  | f       | 0       | f       | none       |           |          |
 tempdel3   | date                        | delta32k  | f       | 0       | f       | none       |           |          |
 tempdel4   | timestamp without time zone | runlength | f       | 0       | f       | none       |           |          |
 tempdel5   | timestamp with time zone    | delta     | f       | 0       | f       | none       |           |          |
 tempdel6   | character varying(65535)    | text32k   | f       | 0       | f       | none       |           |          |
 start_date | character varying(10)       | text255   | f       | 0       | f       | none       |           |          |

Now that a few 'ToDos' are listed on Github Issues, the next step would probably be working on this ticket, which aims at showing elaborate SORTKEY details (such as INTERLEAVED / COMPOUND) when using Describe Table.

12 Aug 2017

Redshift support for psql

I am sure you know that psql doesn't go out of its way to support Postgres forks natively. I obviously understand the reasoning, and it allowed me to find a gap that I could fill here.

The existing features (in psql) that work with any Postgres fork (like Redshift) do so entirely because it is a fork of Postgres. Since I use psql heavily at work, last week I decided to begin maintaining a fork of the Postgres code-base that better supports Postgres forks, starting with Redshift. As always, unless explicitly mentioned, this is entirely an unofficial effort.

The 'redshift' branch of this Postgres code-base, is aimed at supporting Redshift in many ways:
  • Support Redshift related artifacts
    • Redshift specific SQL Commands / variations
    • Redshift Libraries
  • Support AWS specific artifacts
  • Support Redshift specific changes
    • For e.g. "\d table" etc.

The idea is:
  • Maintain this branch for the long-term
    • At least as long as I have an accessible Redshift cluster
  • Down the line, look at whether other Postgres forks (e.g. RDS Postgres) need such special attention
    • Although nothing much stands out yet
      • Except some rare exceptions like this or this, which do need to go through an arduously long wait / refinement process.
  • Change the default port to 5439 (or whatever the flavour supports)
    • ...with an evil grin ;)
  • Additionally, as far as possible:
    • Keep submitting Postgres related patches back to Postgres master
    • Keep this branch up to date with Postgres master

Update (31st August 2017)
  • Currently this branch tab-completes most Redshift specific SQL commands, and also returns non-SQL items like
    • ENCODINGs (a.k.a. compressions like ZSTD / LZO etc.)
    • REGIONs (e.g. US-EAST-1 etc.)
  • Of course, some complex variants (e.g. GRANT SELECT, UPDATE ON ALL TABLES IN SCHEMA TO GROUP xxx) don't automatically get tab-completion. This is primarily because psql's tab-complete feature isn't powerful enough to cater to all such scenarios, which in turn is because it isn't a full-fledged parser to begin with.
  • In a nutshell, this branch is now in pretty good shape to auto-complete the most common Redshift specific SQL syntax (see the mocked-up session below).
  • The best part is that this still merges perfectly with Postgres mainline!
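
As an illustration, here's a mocked-up session (the exact completion candidates naturally depend on the branch's word lists) showing column compression being tab-completed with the Redshift ENCODE keywords:

db=# CREATE TABLE t (a INTEGER ENCODE <TAB>
BYTEDICT   DELTA      DELTA32K   LZO        MOSTLY16   MOSTLY32   MOSTLY8
RAW        RUNLENGTH  TEXT255    TEXT32K    ZSTD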

Let me know if you find anything that needs inclusion, or if I missed something.

3 Aug 2017

Reducing Wires

Recently I got an additional monitor for my workstation@home and found that the following wires were indispensable:

• USB Mouse
• Monitor VGA / HDMI / DVI cable
• USB Hub cable (Pen Drive etc.)

I was lucky that this ($20 + used) Dell monitor was an awesome buy, since it came with a Monitor USB Hub (besides other goodies such as vertical rotate etc.).

After a bit of rejigging, this is how things finally panned out:

• 1 USB wire (from the laptop) for the MUH (Monitor USB Hub)
  • This is usually something like this.
• Use a USB->DVI converter to connect the MUH to the Monitor's DVI port
  • This is usually something like this.
• Plug the USB Mouse into the MUH
• With things working so well, I also plugged a Wireless Touchpad dongle into the MUH

So now when I need to do some office work, connecting 1 USB wire gets me up and running!

#LoveOneWires :)

Now if only I could find a stable / foolproof Wireless solution here ;)

29 Jul 2017

Symbols in Redshift User Passwords work just fine

Recently I read a few posts / discussions where people doubted whether Redshift accepts (works well with) ASCII symbols in user passwords.

It felt like a good time to write this short post showing that the Redshift engine seems to work fine with (non-alphanumeric) printable ASCII symbols.

You can see a few things in the sample output given below:

• All non-alphanumeric printable ASCII characters worked fine (at least all that my US-International / QWERTY keyboard could throw at it)
• For those who also need ' (single-quote) and " (double-quote), you could always use $$ as quote-delimiters
• You still need at least one of each of the following:
  • Upper-case English letter
  • Lower-case English letter
  • Digit / Numeral

# psql -U adminuser -h rs_cluster -p 5439 db

psql (9.6.3, server 8.0.2)
Type "help" for help.

rs_cluster adminuser@db-# alter user userb with password 'Aa1~!@#$%^&*()_+-`{}[]|";:,<.>/?';
Time: 237.012 ms
rs_cluster adminuser@db-# \q

# psql -U userb -h rs_cluster -p 5439 db
Password for user userb:

psql (9.6.3, server 8.0.2)
Type "help" for help.

rs_cluster userb@db-# alter user userb with password $$Aa1~!@#$%^&*()_+-`{}[]|";:,<.>/?'"$$;
Time: 191.505 ms
rs_cluster userb@db-# \q

# psql -U userb -h rs_cluster -p 5439 db
Password for user userb:

psql (9.6.3, server 8.0.2)
Type "help" for help.

rs_cluster userb@db-#

20 Jul 2017

Using generate_series() in Redshift

Considering that Redshift clearly states that it doesn't support the (commonly used Postgres function) generate_series(), it gets very frustrating if you just want to fill a table with a lot of rows and don't have a valid data-source.

Solution (generates a billion integers on my test-cluster):

--INSERT INTO tbl
WITH x AS (
  SELECT 1
  FROM stl_connection_log a, stl_connection_log b, stl_connection_log c
  -- LIMIT 100
)
SELECT row_number() OVER (ORDER BY 1) FROM x;

For a Redshift server with even a basic level of login activity, this should generate enough rows. For e.g., on my test cluster, where I am the only user, this currently generates 4034866688 (~4 billion) rows :) !

Interestingly, despite what the documentation says, generate_series() actually does work on Redshift:

# select b from generate_series(1,3) as a(b);
│ b │
│ 1 │
│ 2 │
│ 3 │
(3 rows)

The reason why this wouldn't let you insert any rows into your table, though, is that this is a Leader-Node-only function, whereas INSERTs (on any non-single-node Redshift cluster) run on the Compute Nodes (which don't know about this function).
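
To make this concrete, here's a sketch of the behaviour (the failure is indicative only; exact error wording varies across cluster versions):

-- Runs entirely on the Leader Node: works
SELECT generate_series(1, 3);

-- INSERT ... SELECT is planned for the Compute Nodes, which don't
-- implement this function, so on a multi-node cluster this fails
INSERT INTO tbl SELECT generate_series(1, 3);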

The reason why the above works is that ROW_NUMBER() and the CROSS JOIN allow us to generate a large number of rows, but for that, the initial data-set (here the STL_CONNECTION_LOG System Table) must have at least some rows to multiply on! You could use any other system table (that is available on the Compute Nodes) for some other purpose, if required.
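
As a quick sanity check of the multiplication: the 3-way self-CROSS-JOIN yields COUNT(*)³ rows, and the figure quoted above is an exact cube:

-- 1592 * 1592 * 1592 = 4034866688, so stl_connection_log evidently
-- held 1,592 rows when the CTE above ran
SELECT 1592::bigint * 1592 * 1592;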

Play On!

31 May 2017

Patch: Using --no-comments with pg_dump

Recently I submitted a patch for review that allows a non-superuser to practically use a backup taken with pg_dump.

The current way around this is a kludge (and a well-known one at that - Ref 1 / Ref 2 / Ref 3 / Ref 4), but since the ideal solution is too big a bite to chew and not in high demand, it has seen little to no traction in the past decade.

This patch should allow the above. But more importantly, it should also allow regular users of AWS RDS Postgres as well as Google Cloud Postgres databases (who do not get SuperUser access by design) to reliably use their backups, instead of tinkering with the backup SQLs and removing things like COMMENT ON EXTENSION just so the script can run during restoration.
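
Assuming the flag lands as submitted, usage would be along these lines (a sketch, with placeholder connection details):

# Skip all COMMENT statements in the dump, so that a non-superuser
# restore doesn't trip over e.g. COMMENT ON EXTENSION
pg_dump --no-comments -U someuser -h somehost somedb > somedb.sql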

The bad news is that since Postgres 10 has already branched off, I doubt this would see the light of day (if at all) any time before Postgres 11 (unless there is consensus that it's helpful enough to get 'back-patched' into Postgres 10, to be released around September 2017).

horntail it is :)

I had been planning to contribute to the Postgres Buildfarm by running it on one of my VMs and finally got that done... and horntail it is.

A few things to note, before it fades from (my) memory:

• It (obviously) requires quite a bunch of libraries, especially if you're using the defaults (e.g. with ssl / krb5 etc.)
  • You may either disable some of those exotic options (not ideally recommended)
  • Or, you may use a silo'ed VM just for this purpose so that it doesn't slow down your regular work
  • I did neither (of the above), but didn't care about the performance of this VM
    • Probably would enable more Options / Branches down the line and re-submit
• My distro was Ubuntu, so YMMV, but the packages weren't very obvious. If you see something fail and check the logs, you'd see that 'abc' was missing, but wouldn't have the faintest idea which package satisfied it
  • Some missing libraries that caused hiccups (these may be obvious to most, but were new to me):
    • gss_init_sec_context
    • msgfmt
    • tclConfig
  • It'd be good if someone could catalog the packages required
    • For all Unix variants - obviously easier said than done, but
      • It's possible that there may be more people willing (like me) but unable to scale all the walls
    • At least the super-set of what's required when all flags are enabled, from which people could try what's needed for their set of enabled flags
• As always, the documentation is relatively clear; it's just that persistence (especially for exotic Linux variants / issues) is needed in large quantities.

All the best :)

14 Mar 2017

Using pg_dumpall with Google Cloud Postgres

Luckily, the recent update to pg_dumpall works (without any extra effort) to dump Globals off the recently launched Google Cloud PostgreSQL service.

This is because this Postgres service seems to have the same limitation (as AWS RDS Postgres), in that none of the user accounts are SuperUsers in the Postgres sense. Therefore, this platform too disallows accessing pg_authid.

The fallback, here too, is to use pg_roles to read (as much as is possible of) the Role related Globals. Read here for more info on the recent changes to pg_dumpall that make this work.
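
For the curious, the difference is easy to demonstrate from any non-superuser session (output abbreviated; pg_roles simply masks the password column):

-- pg_authid is readable only by superusers, so this fails on such platforms:
SELECT rolname, rolpassword FROM pg_authid;
-- ERROR:  permission denied for relation pg_authid

-- pg_roles exposes the same role attributes, with the password masked:
SELECT rolname, rolpassword FROM pg_roles;
--  rolname  | rolpassword
-- ----------+-------------
--  someuser | ********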

12 Mar 2017

Redshift Limitations

Recently I have been toying around with Redshift (a popular Data-Warehouse tool by AWS) and see some minor but obvious limitations in its SQL support:

FULL OUTER JOIN + ON Func(x) = Func(y) : Not Supported

Although Redshift supports most variants of FULL OUTER JOIN, when the JOIN condition contains a function on both sides it gives up, complaining "ERROR:  XX000: could not devise a query plan for the given query".

CREATE TABLE x (a integer);

SELECT x1.a, x2.a
FROM x x1
FULL OUTER JOIN x x2
     ON LOWER(x1.a) = LOWER(x2.a); -- Fails
  -- ON LOWER(x1.a) = x2.a; -- Works
  -- ON x1.a = LOWER(x2.a); -- Works
  -- ON x1.a = x2.a; -- Obviously Works

Workaround: The obvious way around it is to move one of the functions to a separate SQL (e.g. create a Temp Table with the function pre-applied, and use that in the final computation), something like this:

-- Pre-apply one of the functions, so that only one side of the
-- JOIN condition carries a function
CREATE TEMP TABLE x1 AS SELECT LOWER(a) AS a FROM x;

SELECT x1.a, x2.a
FROM x1
FULL OUTER JOIN x x2
  ON x1.a = LOWER(x2.a); -- Works

7 Mar 2017

Using pg_dumpall with AWS RDS Postgres

My patch (allowing a non-super-user to dump Globals) just got committed to Postgres 10.0.

Besides the use mentioned above, this patch allows one to dump Globals from their AWS RDS Postgres database and have a complete backup as a part of their Disaster Recovery (DR) solution. This earlier required an inelegant hack (detailed here in another post of mine) that wasn't very convenient for a regular user.

For those interested, RDS Postgres (by design) doesn't allow you to read pg_authid, which was earlier necessary for pg_dumpall to work. With this patch checked in, pg_dumpall now uses a workaround (pg_roles) to read the same data (except passwords) and generate the same SQL script.

With that mentioned, let's get our hands dirty and see a working solution:

# Take RDS Postgres Dump
# We store the pgdb database in pgdb.sql and globals in pgdb_globals.sql
pg_dumpall --globals-only --no-role-passwords \
  -U rdssuperuser -h rdsinstance \
  2>stderr.txt > pgdb_globals.sql
pg_dump -U rdssuperuser -h rdsinstance pgdb 2>stderr.txt > pgdb.sql

# Restore on Local Machine
psql -U postgres -c "CREATE DATABASE pgdb;"

# rdsadmin database (even if empty) is required for the next script
psql -U postgres -c "CREATE DATABASE rdsadmin;"

psql -U postgres pgdb < pgdb_globals.sql
psql -U postgres pgdb < pgdb.sql

Note: The --no-role-passwords flag is still necessary since AWS RDS doesn't allow you to read passwords. Therefore, the above generates a script that resets all users to be password-less, but that's still better than having to restore Globals (like 'CREATE USER ...') manually by hand!
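
Since the restored roles come up password-less, a natural follow-up is to re-set passwords by hand on the target instance (hypothetical user below):

-- Passwords aren't part of the dump, so re-set them post-restore
ALTER USER someuser WITH PASSWORD 'SomeNewPassword1';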

Big thanks to Simon for ironing out a few issues, and to Stephen, David & Robert for their (critical) reviews.