20 Apr 2024

Unlock PostgreSQL Superpowers with pg_tle

pg_tle - A Must-Know for Developers

PostgreSQL is a fantastic database, packed with features. But sometimes, you need to add a little something extra – a custom function, a specialized data type, or maybe a procedure written in your favorite programming language. That's traditionally where PostgreSQL extensions come in. But for developers working with managed databases (from major cloud providers), while installing supported extensions is trivial, installing custom or unsupported extensions can be tricky or even impossible.

Enter pg_tle!

What exactly is pg_tle?

pg_tle stands for Trusted Language Extensions. Here's the breakdown:

  • Extension: A piece of software that adds new functionality to PostgreSQL.
  • Trusted Language: A programming language (like JavaScript, Perl, or PL/pgSQL) that the database 'trusts' because the language itself restricts unsafe operations (such as direct file-system access).
  • The Magic: pg_tle provides a framework to build, package, and install extensions in these trusted languages, even in environments where you can't normally touch the underlying server. A minimal sketch of this workflow follows below.
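
To make this concrete, here's a minimal sketch of the workflow using pg_tle's SQL API - the extension name 'my_demo' and the demo function are illustrative, and installing pg_tle itself (plus membership in the pgtle_admin role) is assumed:

CREATE EXTENSION IF NOT EXISTS pg_tle;

-- Register the extension's code with pg_tle (name, version, description, SQL body)
SELECT pgtle.install_extension(
  'my_demo',
  '1.0',
  'A tiny demo extension in a trusted language',
$_pgtle_$
  CREATE FUNCTION demo_add(a int, b int) RETURNS int
  LANGUAGE plpgsql AS
  $$ BEGIN RETURN a + b; END; $$;
$_pgtle_$
);

-- From here on, it behaves like any other extension
CREATE EXTENSION my_demo;
SELECT demo_add(2, 3);  -- returns 5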

Why is pg_tle Important for Application Developers?

  1. Unlock New Possibilities: pg_tle lets you create custom database functions tailored to your application's unique logic. Say you need complex data validation or even specialized calculations – pg_tle can make it happen.

  2. Bypass Restrictions: On managed database services, you often can't install traditional extensions. pg_tle works within these constraints, allowing you to add functionality even in these environments - see Use Cases below for more.

  3. Enhanced Security: Because pg_tle uses trusted languages, your extensions have limits on what they can do, which leads to a more secure database overall.

  4. Open Source: pg_tle is open source [3] and available on GitHub (Apache 2.0 License).

Use Cases - Just A Sample Of Ideas

  • Use Unsupported Extensions: This alone is reason enough to try out pg_tle! [4]
  • Custom Password Strength Checks: Go beyond basic password rules with an extension (see the sketch after this list).
  • Login Triggers: Build custom rules that fire every time a user logs into the database - not just on (the upcoming) v17 but also on older PostgreSQL versions!
  • Custom Data Types: Build custom data types that you can use to store data in tables, and reference in functions and views within your database [2].
  • Data Transformations: Perform complex data manipulations directly within the database.
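
To illustrate the password-check idea above, here's a hedged sketch of pg_tle's passcheck hook - the function name and the 12-character rule are made up for illustration, and the pgtle.enable_password_check setting must be turned on for the hook to fire:

-- A trusted-language (PL/pgSQL) check that rejects short passwords
CREATE FUNCTION block_short_passwords(
  username      text,
  password      text,
  password_type pgtle.password_types,
  valid_until   timestamptz,
  valid_null    boolean
) RETURNS void LANGUAGE plpgsql AS
$$
BEGIN
  IF length(password) < 12 THEN
    RAISE EXCEPTION 'password must be at least 12 characters long';
  END IF;
END;
$$;

-- Attach the function to the passcheck hook
-- (fires only when pgtle.enable_password_check is 'on' or 'require')
SELECT pgtle.register_feature('block_short_passwords', 'passcheck');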

The Trade-offs

  • Complexity: Creating pg_tle extensions can be more involved than basic SQL scripting.
  • Limitations: Trusted languages still have constraints compared to the full capability of extensions developed in lower-level languages (like C).

Should You Learn More?

If any of the following resonate with you, then deeper pg_tle knowledge could be a big win:

  • You crave more flexibility in how you work with data, and the extension you've found isn't yet natively supported in your PostgreSQL installation.
  • Your application has a few "extra tricky" calculations or data processing needs that are much easier to express in Perl, JavaScript, or PL/pgSQL.
  • You work with managed databases and miss those power-user extensions.

Conclusion

pg_tle is a powerful tool that adds flexibility and extensibility for PostgreSQL developers. While a bit more advanced, understanding pg_tle unlocks a new level of database customization for your applications.

If you're intrigued, keep an eye out for a follow-up post where I'll show a simple example of building a pg_tle extension!

References

  1. Trusted Language Extensions for PostgreSQL on Amazon Aurora and Amazon RDS - https://aws.amazon.com/blogs/aws/new-trusted-language-extensions-for-postgresql-on-amazon-aurora-and-amazon-rds/
  2. Creating custom data-types using Trusted Language Extensions - https://aws.amazon.com/blogs/database/create-custom-postgresql-data-types-using-trusted-language-extensions/
  3. pg_tle is open source! - https://github.com/aws/pg_tle
  4. pg_tle makes using custom extensions easier - https://aws.amazon.com/blogs/opensource/supabase-makes-extensions-easier-for-developers-with-trusted-language-extensions-for-postgresql/
  5. Installing pg_tle on Ubuntu - https://www.thatguyfromdelhi.com/2024/04/installing-pgtle-on-ubuntu-quick-guide.html

21 Dec 2021

Find Database DNS / Endpoint via SQL

How to get Database identifier using SQL

Often there is a need for client programs to find out "where am I logged into?". This blog post provides a few ways to answer that question.


As more and more applications break out of the regular (database + application) architecture, this becomes increasingly important in scenarios where you're not 100% sure "where" you just logged in.

Some common scenarios where this is helpful are:

  • Proxy between application and Database
  • Load Balancer (application Load Balancers & Database Load Balancers)
  • Serverless Database Clusters
  • Auto Scaling Instances (e.g. EC2-based)
  • To confirm whether the Database server quietly changed since the last invocation (in the current session)


PSQL only - :HOST variable

When connected via PSQL, the HOST variable can tell which server you're currently connected to.  

postgres10@postgres=> SELECT :'HOST' db;
                            db
----------------------------------------------------------
 postgres10.xyz.ap-southeast-2.rds.amazonaws.com
(1 row)


Caveats:

  • This variable can be set during a session, so it is probably a good idea to use it only in trusted environments.
  • This would need good testing with Proxy-based architectures, since the value returned is what the client 'thinks' it connected to.
  • Also note that if the client connected via an IP address, this method doesn't do a reverse DNS lookup, so what you'd get is the IP address (as the example below shows).


→ psql -U postgres10 -h 172.31.45.165
psql (15devel, server 10.13)
Type "help" for help.

postgres10@postgres=> SELECT :'HOST' db;
      db
---------------
 172.31.45.165
(1 row)

PSQL only - \conninfo

Similar to the method above, another psql-only method is to get connection info via the \conninfo command. This is more descriptive (and so may not be well suited for some needs), and additionally its output can't be consumed via a SELECT.

postgres10@postgres=> \conninfo
You are connected to database "postgres" as user "postgres10" on host "postgres10.xyz.ap-southeast-2.rds.amazonaws.com" (address "1.2.3.4") at port "54321".



RDS Only - Lambda invocation

Another way to get the Server Endpoint is to use a Lambda function and let that return the Server endpoint. You can read more about how to use Lambda with RDS servers here.


Aurora Postgres only - aurora_replica_status() function

When connected to Aurora Postgres, one additional way for a client to know which of the servers (in the cluster) it is connected to is the Aurora function aurora_replica_status(). An additional benefit of this mechanism is that it executes on the server, and thus returns a true picture of the servers present (and their roles) in the cluster.


apg2@pg=> select server_id from aurora_replica_status();
 server_id
-----------
 apg2b
 apg2a
(2 rows)


29 Dec 2020

Which SQL causes a Table Rewrite in Postgres?

EDIT: Updated to v17 (devel) - (Jan 2024).


While developing SQL based applications, it is commonplace to stumble on these 2 questions:

  1. What DDLs would block concurrent workload?
  2. Whether a DDL is going to rewrite the table (and in some cases may need ~ 2x disk space)?

Although completely answering Question 1 is beyond the scope of this post, one of the important pieces that helps answer both of these questions is whether a DDL is going to cause a relfilenode change.

For a brief background, each regular table in Postgres stores data in one or more files, each of which is referenced in the Postgres catalog with a relfilenode. A simple way to check whether a given DDL is going to create / refer to another copy (file) is to see whether the relfilenode changes. (TRUNCATE is a standout here: by design it purges the table data, so although the relfilenode does change, in total it obviously wouldn't consume anywhere close to 2x the disk space.)
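
For instance, pg_relation_filenode() makes this check easy. A minimal sketch follows (the DDLs chosen and the filenode values shown are illustrative; note that on v11+ an ADD COLUMN with a constant default avoids the rewrite):

postgres=# create table t (a int);
CREATE TABLE
postgres=# select pg_relation_filenode('t');
 pg_relation_filenode
----------------------
                16384
(1 row)

postgres=# alter table t add column b int default 0;  -- no rewrite on v11+
ALTER TABLE
postgres=# select pg_relation_filenode('t');          -- unchanged
 pg_relation_filenode
----------------------
                16384
(1 row)

postgres=# alter table t alter column a type bigint;  -- forces a table rewrite
ALTER TABLE
postgres=# select pg_relation_filenode('t');          -- changed: a new file
 pg_relation_filenode
----------------------
                16390
(1 row)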

The table below shows which DDLs would cause a table rewrite. As has been discussed here, we need some more info to completely answer Question 1, but in the meantime this table helps in making some concurrency / disk-usage related decisions for all Postgres versions supported today.



29 Feb 2020

Optimizations in GROUP BY vs SELECT DISTINCT

(This came out of something I was trying out + discussing with Postgres enthusiasts - thanks to all for clarifying doubts)

This article aims to highlight one aspect of how the query planner's implementation of SELECT * GROUP BY differs from SELECT DISTINCT.

For example:

SELECT b,c,d FROM a GROUP BY b,c,d;
vs
SELECT DISTINCT b,c,d FROM a;


We see a few scenarios where Postgres optimizes by removing unnecessary columns from the GROUP BY list (if a subset is already known to be Unique) and where Postgres could do even better. To highlight this difference, here I have an empty table with 3 columns:

postgres=# create table a (b integer, c text, d bigint);
CREATE TABLE


postgres=# \d a
                 Table "public.a"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 b      | integer |           |          |
 c      | text    |           |          |
 d      | bigint  |           |          |


On this table, we can see that SELECT * GROUP BY generates the exact same plan as SELECT DISTINCT. In particular, we're interested in the "Group Key" which is the same for both SQLs:

postgres=# explain select distinct b,c,d from a;
                         QUERY PLAN                        
------------------------------------------------------------
 HashAggregate  (cost=29.78..31.78 rows=200 width=44)
   Group Key: b, c, d
   ->  Seq Scan on a  (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN                        
------------------------------------------------------------
 HashAggregate  (cost=29.78..31.78 rows=200 width=44)
   Group Key: b, c, d
   ->  Seq Scan on a  (cost=0.00..21.30 rows=1130 width=44)
(3 rows)



Having said that, if the same table is created with a PRIMARY KEY, we see that GROUP BY becomes smarter: the "Group Key" now uses the Primary Key (here it is 'b') and correctly discards columns 'c' and 'd'. Nice 😄!

postgres=# create table a (b integer PRIMARY KEY, c text, d bigint);
CREATE TABLE
postgres=# explain select distinct b,c,d from a;
                         QUERY PLAN                        
------------------------------------------------------------
 HashAggregate  (cost=29.78..41.08 rows=1130 width=44)
   Group Key: b, c, d
   ->  Seq Scan on a  (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN                        
------------------------------------------------------------
 HashAggregate  (cost=24.12..35.42 rows=1130 width=44)
   Group Key: b
   ->  Seq Scan on a  (cost=0.00..21.30 rows=1130 width=44)
(3 rows)


Let's check whether we get the same optimization if we create a UNIQUE index on the column. The answer? Sadly, no! Furthermore, I went ahead and added a NOT NULL constraint, but that didn't change anything either. (Do note that, unlike a PRIMARY KEY, UNIQUE columns can have multiple rows with NULLs.)

postgres=# create table a (b integer unique not null, c text, d bigint);
CREATE TABLE


postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN                        
------------------------------------------------------------
 HashAggregate  (cost=29.78..41.08 rows=1130 width=44)
   Group Key: b, c, d
   ->  Seq Scan on a  (cost=0.00..21.30 rows=1130 width=44)
(3 rows)



Regarding the above, IIUC this is an obvious performance optimization that Postgres is still leaving on the table (as of v13+):

postgres=# select version();
                                                     version                                                     
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 13devel on i686-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 32-bit
(1 row)


Next, does it still optimize this if the PRIMARY KEY is not the first column in the GROUP BY? Answer? Yes! (The engine can optimize if any of the GROUPed BY columns is the Primary Key.) Noice!


postgres=# create table a (b integer, c text primary key, d bigint);
CREATE TABLE


postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN                        
------------------------------------------------------------
 HashAggregate  (cost=24.12..35.42 rows=1130 width=44)
   Group Key: c
   ->  Seq Scan on a  (cost=0.00..21.30 rows=1130 width=44)
(3 rows)


... and what if the PRIMARY KEY is a composite key over some of the columns in the GROUP BY list? YES again 😄!

postgres=# create table a (b int, c text, d bigint, primary key (c,d)) ;
CREATE TABLE

postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN                        
------------------------------------------------------------
 HashAggregate  (cost=26.95..28.95 rows=200 width=44)
   Group Key: c, d
   ->  Seq Scan on a  (cost=0.00..21.30 rows=1130 width=44)
(3 rows)


Lastly, although some of these "optimizations" are things to avoid when writing good SQL, the reality is that ORM-generated SQL isn't that smart yet, and so it's great that Postgres already implements these obvious optimizations.

8 Nov 2019

Compiling any C source on WSL (Linux on Windows)

This is a short post, in hopes that someone doesn't spend hours wondering why a fresh Postgres source clone (or any C code, for that matter) complains on the first (1st) non-comment line about something very trivial (see sample below) as soon as you trigger a ./configure:

$ ./configure
: not found: 18: ./configure:
./configure: 34: ./configure: Syntax error: newline unexpected (expecting ")")

This same Postgres code compiles beautifully in Ubuntu (on EC2), so this is not about "apt vs yum" or "Ubuntu vs CentOS" etc.

What took me a while to figure out was that this was one of the oldest issues between Mac/Linux and Windows OS.

Line-endings!

Running something as simple as the following, at the root of the source directory got the ball rolling:

find . -type f -print0 | xargs -0 dos2unix
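
If the source tree came from a git checkout, an alternative sketch (assuming git's line-ending conversion is what introduced the CRLFs) is to disable the conversion and re-checkout:

git config core.autocrlf input   # stop converting LF to CRLF on checkout
git rm --cached -r .             # forget the (CRLF) checked-out content
git reset --hard                 # re-checkout everything with LF endings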

Enjoy Linux on Windows :) !

18 Apr 2019

How about 1000 cascading Replicas :)

The other day, I remembered an old 9.0-era mail thread (when Streaming Replication had just launched) where someone had tried to daisy-chain Postgres Replicas to see how many (s)he could muster.

If I recall correctly, the OP could squeeze only ~120 or so, mostly because the Laptop memory gave way (and not really because of an engine limitation).

I couldn't find that post, but it was intriguing to see whether we could reach (at least) the thousand mark, and what kind of "Replica Lag" that would entail; thus NReplicas.

On a (very) unscientific test, my 4-Core 16G machine could spin up (create data folders and host processes for) 1000 Replicas in ~8m (and tear them down in another ~2m). Now I'm sure this could get better, but I'm not complaining, since this was a breeze to set up (in that it just worked without much tinkering ... besides lowering shared_buffers).
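
For context, each hop in such a daisy-chain can be created roughly as below - a sketch only, with illustrative paths/ports, assuming a recent PostgreSQL where pg_basebackup -R writes the standby's primary_conninfo into postgresql.auto.conf:

# Clone the previous replica (listening on port 6001) into the next one
pg_basebackup -h localhost -p 6001 -D /tmp/replica2 -R
echo "port = 6002" >> /tmp/replica2/postgresql.auto.conf
pg_ctl -D /tmp/replica2 start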

For those interested, a single UPDATE on the master could (nearly consistently) be seen on the last Replica in less than half a second, with top showing 65% CPU idle (and 2.5 on the 1-min load average) during a ~30 minute test.

Put in simple terms, the UPDATE change traveled from the Master to a Replica (let's call it Replica1), and Replica1 then cascaded the change on to Replica2 (and so on, 1000 times). The said row change could be seen on Replica1000 within half a second.

So although (I hope) this isn't a real-world use-case, I'm still impressed that this works right out of the box and stays way under the 1-second mark ... certainly worthy of a small post :) !


Host: 16GB / 4 core
Time to spin up (1000 Cascading Replicas): 8 minutes
Time to tear down: 2 minutes

Test type: Constant UPDATEs (autovacuum settings default)
Test Duration: 30 min
Time for UPDATE to propagate: 500 ms!! (on average)
CPU idle: ~65%
CPU 1-min load average: 2.5

17 Jan 2019

How to add remove DEFAULT PRIVILEGES

At times we need to do a DROP USER and are unable to do so because of existing "DEFAULT PRIVILEGES" associated with the user, which disallow the DROP USER from going ahead.

Common SQL such as this does not show DEFAULT PRIVILEGES.

You can find DEFAULT PRIVILEGES by using \ddp in psql. If you haven't heard of psql, probably that'd be a good place to start.

Once you have the privileges listed, you need to understand how permissions are assigned; some detailing on the cryptic letters and their meanings is given here. (For quick reference, in the output below: a = INSERT/append, r = SELECT/read, w = UPDATE/write, d = DELETE, D = TRUNCATE, x = REFERENCES, t = TRIGGER, U = USAGE; the role after the slash is the grantor.)

Once you have that, you essentially need to revert the GRANTs (using the REVOKE command) and remove those default privileges one by one.

A sample is given below:



pg postgres2@t3=> create group dbuser;
CREATE ROLE
pg postgres2@t3=> alter group dbuser add user jacob;
ALTER ROLE
pg postgres2@t3=> alter group dbuser add user postgres2;
ALTER ROLE
pg postgres2@t3=> alter default privileges for user dbuser  grant select on tables to jacob;
ALTER DEFAULT PRIVILEGES
pg postgres2@t3=> \ddp
             Default access privileges
┌────────┬────────┬───────┬───────────────────────┐
│ Owner  │ Schema │ Type  │   Access privileges   │
├────────┼────────┼───────┼───────────────────────┤
│ dbuser │        │ table │ jacob=r/dbuser        │
│        │        │       │ dbuser=arwdDxt/dbuser │
└────────┴────────┴───────┴───────────────────────┘
(1 row)

pg postgres2@t3=> alter default privileges for user dbuser  grant select on sequences to jacob;
ALTER DEFAULT PRIVILEGES
pg postgres2@t3=> alter default privileges for user dbuser  grant usage on sequences to jacob;
ALTER DEFAULT PRIVILEGES
pg postgres2@t3=> \ddp
              Default access privileges
┌────────┬────────┬──────────┬───────────────────────┐
│ Owner  │ Schema │   Type   │   Access privileges   │
├────────┼────────┼──────────┼───────────────────────┤
│ dbuser │        │ sequence │ jacob=rU/dbuser       │
│        │        │          │ dbuser=rwU/dbuser     │
│ dbuser │        │ table    │ jacob=r/dbuser        │
│        │        │          │ dbuser=arwdDxt/dbuser │
└────────┴────────┴──────────┴───────────────────────┘
(2 rows)

pg postgres2@t3=> drop user jacob;
ERROR:  2BP01: role "jacob" cannot be dropped because some objects depend on it
DETAIL:  privileges for default privileges on new sequences belonging to role dbuser
privileges for default privileges on new relations belonging to role dbuser
LOCATION:  DropRole, user.c:1045
pg postgres2@t3=> ALTER DEFAULT PRIVILEGES FOR USER dbuser REVOKE SELECT ON tables FROM jacob;
ALTER DEFAULT PRIVILEGES
pg postgres2@t3=> ALTER DEFAULT PRIVILEGES FOR USER dbuser REVOKE SELECT ON sequences FROM jacob;
ALTER DEFAULT PRIVILEGES
pg postgres2@t3=> ALTER DEFAULT PRIVILEGES FOR USER dbuser REVOKE USAGE ON sequences FROM jacob;
ALTER DEFAULT PRIVILEGES
pg postgres2@t3=> \du
                                                    List of roles
┌─────────────────┬────────────────────────────────────────────────────────────┬─────────────────────────────────────┐
│    Role name    │                         Attributes                         │              Member of              │
├─────────────────┼────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ dbuser          │ Cannot login                                               │ {}                                  │
│ jacob           │                                                            │ {dbuser}                            │
│ postgres2       │ Create role, Create DB                                     │ {rds_superuser,dbuser}              │
│ rds_superuser   │ Cannot login                                               │ {rds_replication}                   │
└─────────────────┴────────────────────────────────────────────────────────────┴─────────────────────────────────────┘

pg postgres2@t3=> \ddp
          Default access privileges
┌───────┬────────┬──────┬───────────────────┐
│ Owner │ Schema │ Type │ Access privileges │
├───────┼────────┼──────┼───────────────────┤
└───────┴────────┴──────┴───────────────────┘
(0 rows)

pg postgres2@t3=> drop user jacob;
DROP ROLE


This method should allow you to remove all DEFAULT PRIVILEGES (and only those) for this user.

Note: Importantly, you'd need to repeat the above steps for *each* database!
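
To see what default privileges exist in a given database (the same information \ddp formats), you can query the pg_default_acl catalog, which makes the per-database check scriptable:

SELECT pg_get_userbyid(defaclrole)   AS owner,
       defaclnamespace::regnamespace AS schema,
       defaclobjtype                 AS object_type,
       defaclacl                     AS access_privileges
FROM pg_default_acl;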
