That Guy From Delhi: hack


27 Oct 2024

What's in an empty table?

How much storage does an empty table in Postgres take?

This is a post about Postgres tables that store ... well basically ... Nothing.

The idea for this post came from this tweet that hinted that an empty table on most databases today takes 16 kB of storage. Now admittedly Franck was probably reminiscing about the good old days, so this is probably quite out of context, but it did get me thinking, and thus this post.

NB: Here's a video showing this in action!

A "regular" empty table in Production

Here's a regular small table that could be found in Production. It has a Primary Key, a text column and a JSONB column. Let's check the table size using the pg_total_relation_size() postgres function (you can read more about that function here).

db1=# create table t(id bigint primary key, b text, c jsonb);

CREATE TABLE

db1=# select pg_total_relation_size('t');

pg_total_relation_size

------------------------

16384

(1 row)

Hmmm, so that tweet did have a point. Given how cheap memory has become over the decades, it is easy to understand why databases today choose to optimize for speed over memory efficiency (more on this later), and so even an empty table in Postgres does consume 16 kB.

But "where" is the 16kb being used?

db1=# select pg_relation_size('t');

pg_relation_size

------------------

0

(1 row)

The relation itself isn't consuming any space! That is good (again, more on this later), but where is the space being used then?

db1=# \d t

Table "public.t"

Column | Type | Collation | Nullable | Default

--------+--------+-----------+----------+---------

id | bigint | | not null |

b | text | | |

c | jsonb | | |

Indexes:

"t_pkey" PRIMARY KEY, btree (id)

We see that the table has a Primary Key - and thus an index - t_pkey.

Is the index consuming the 16 kB?

db1=# select pg_relation_size('t_pkey');

pg_relation_size

------------------

8192

(1 row)

So the index is using some of it - 8kb - so that is progress - but who's using the other 8kb?

Let's start cutting the table down

Let's start cutting down the columns and see if the disk-usage goes down.

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY, b JSONB); select pg_total_relation_size('t');

DROP TABLE

CREATE TABLE

pg_total_relation_size

------------------------

16384

(1 row)

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY, b TEXT); select pg_total_relation_size('t');

DROP TABLE

CREATE TABLE

pg_total_relation_size

------------------------

16384

(1 row)

Hmmm, that didn't help at all. Dropping either the TEXT or the JSONB column made no difference. Let's look at the expanded version of this table to see if there's any similarity between the two columns. (I've clipped the output to make it easier to read.)

db1=# \d+ t

Table "public.t"

--------+--------+-----------+----------+---------+----------+-...

b | text | | | | extended | ...

c | jsonb | | | | extended | ...

Indexes:

"t_pkey" PRIMARY KEY, btree (id)

Access method: heap

Clearly, the two columns are "extended" (you can read more about it here). Basically, what happened here is that an extended column type resulted in the creation of a TOAST table (you can read more about TOAST here). Let's find the TOAST table for the table "t", and check whether it could be consuming the remaining 8 kB.

db1=# select oid, relname, reltoastrelid, reltoastrelid::regclass from pg_class where relname = 't';

oid | relname | reltoastrelid | reltoastrelid

-------+---------+---------------+-------------------------

18300 | t | 18303 | pg_toast.pg_toast_18300

(1 row)

db1=# select pg_relation_size('pg_toast.pg_toast_18300');

pg_relation_size

------------------

0

(1 row)

Hmmm, it's not the TOAST table. But just like the main table "t", could it be that the TOAST table has supporting relations that are to blame?

db1=# select pg_total_relation_size('pg_toast.pg_toast_18300');

pg_total_relation_size

------------------------

8192

(1 row)

db1=# \d pg_toast.pg_toast_18300

TOAST table "pg_toast.pg_toast_18300"

Column | Type

------------+---------

chunk_id | oid

chunk_seq | integer

chunk_data | bytea

Owning table: "public.t"

Indexes:

"pg_toast_18300_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

db1=# select pg_relation_size('pg_toast.pg_toast_18300_index');

pg_relation_size

------------------

8192

(1 row)

Yes! So we see above that the TOAST table implicitly has a primary key of its own, backed by an index, which has 1 page (8 kB) assigned to it.

In a nutshell, the empty table 't' above, consumes 16kb and the storage allocation takes this shape:

Main relation - 0 bytes - 't'
Main relation Index - 8kb - 't_pkey'

Toast relation - 0 bytes - 'pg_toast_18300'
Toast relation Index - 8kb - 'pg_toast_18300_index'
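
The breakdown above can be tallied in a few lines of Python. This is only a sketch: the dictionary simply hard-codes the sizes that pg_relation_size() reported in this session, and the variable names are mine, not anything Postgres exposes.

```python
# Sizes in bytes, as reported by pg_relation_size() in the session above.
# The dictionary layout and names are illustrative only.
PAGE_SIZE = 8192

relation_sizes = {
    "t": 0,                             # main relation (heap)
    "t_pkey": PAGE_SIZE,                # primary key index
    "pg_toast_18300": 0,                # TOAST relation
    "pg_toast_18300_index": PAGE_SIZE,  # TOAST relation's index
}

# The four parts should add up to what pg_total_relation_size('t') reported.
total = sum(relation_sizes.values())
print(total)  # 16384
```

So the whole 16 kB is index pages; neither the heap nor the TOAST heap has been allocated a page yet.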

Cut Cut Cut

Okay, let's see if we can reduce the table size further by dropping both the extended columns.

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY); select pg_total_relation_size('t');

DROP TABLE

CREATE TABLE

pg_total_relation_size

------------------------

8192

(1 row)

Okay, that makes sense. Now the main table (heap) is still not consuming anything, but the index still does.

Let's reduce further.

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT); select pg_total_relation_size('t');

DROP TABLE

CREATE TABLE

pg_total_relation_size

------------------------

0

(1 row)

0 bytes!!

Nice! But seriously - 0 bytes?

It kind of makes sense: since Postgres doesn't yet have anything to store besides the metadata of the table (and the metadata is stored in the system catalogs, e.g. in the pg_catalog schema), there isn't anything to store in the main relation (heap) as yet.

Disk Usage - Check filesystem

Hmmm - Nah, what if I don't trust Postgres?

Let's skip Postgres functions and ask the filesystem directly - and see if the table is actually 0 bytes. Here we first find the file path of the table in question using the postgres function pg_relation_filepath() and then ask the file-system for the file-size.

db1=# select pg_relation_filepath('t');

pg_relation_filepath

----------------------

base/17727/18356

(1 row)

db1=# \! ls -la /home/robins/proj/localpg/data/base/17727/18356

-rw------- 1 robins robins 0 Sep 23 10:19 /home/robins/proj/localpg/data/base/17727/18356

So the file corresponding to the table, actually is using 0 bytes. Nice!

Now, when a table is created, some entries are added to the system catalog. Let's see if the database size is significantly more than that of a blank database.

postgres=# create database db1;

CREATE DATABASE

postgres=# \c db1

You are now connected to database "db1" as user "robins".

db1=# CREATE TABLE t (id BIGINT);

CREATE TABLE

db1=# select pg_database_size('db1');

pg_database_size

------------------

7482515

(1 row)

db1=# create database db2;

CREATE DATABASE

db1=# \c db2

You are now connected to database "db2" as user "robins".

db2=# select pg_database_size('db2');

pg_database_size

------------------

7482515

(1 row)

Good. So this somewhat confirms that a blank database and a database with an "empty" table use the same disk-space.

Postgres is hiding something

Technically though, I'm lying. Well, actually Postgres is ~~lying~~ hiding something from the file-system (i.e. every new table does make the database grow logically, just a tad, but most of the time the filesystem doesn't get the memo).

Under the covers, when Postgres stores data in a table (a catalog table or any user table), it takes disk space a page at a time (usually 8 kB), even though logically the data may occupy only a small part of that page. This is very helpful when more rows need to be stored in the table: when more rows come in, Postgres reuses the same (first) page to store them, and as far as the filesystem is concerned no extra pages were requested. This is what I was hinting at earlier: today's databases use disk space (and thus memory cache) in larger (8 kB) chunks and keep using a page until a new one is needed. Further below, I show a brief example of how all of this works.

But to summarize, it is unfair to say that the system catalog did not grow at all when a new table was added. Some catalogs are guaranteed to have grown (e.g. the metadata of the new table is stored as an extra row in pg_class etc.), just within the disk blocks already allocated to that catalog.
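
To make the "logical growth without filesystem growth" idea concrete, here is a toy model in Python. It is a sketch under stated assumptions, not how Postgres is implemented: it assumes a default build (8192-byte pages, a 24-byte page header, a 4-byte line pointer per row), ignores fillfactor and free-space bookkeeping, and uses a hypothetical 32-byte row.

```python
# Toy model of page-at-a-time allocation (assumed default-build figures).
PAGE_SIZE = 8192     # bytes per page
PAGE_HEADER = 24     # bytes reserved at the start of every page
LINE_POINTER = 4     # bytes of per-row bookkeeping inside the page


def rows_per_page(tuple_bytes):
    """How many rows of the given size fit on one page in this model."""
    return (PAGE_SIZE - PAGE_HEADER) // (tuple_bytes + LINE_POINTER)


def file_size(row_count, tuple_bytes):
    """File size the filesystem would see: always a multiple of PAGE_SIZE."""
    if row_count == 0:
        return 0
    per_page = rows_per_page(tuple_bytes)
    pages = -(-row_count // per_page)  # ceiling division
    return pages * PAGE_SIZE


# The row count keeps growing, but the file only grows at page boundaries.
for n in (0, 1, 100, 226, 227):
    print(n, file_size(n, tuple_bytes=32))
```

In this model the file stays at 8192 bytes from the 1st row through the 226th, and only the 227th row makes the filesystem notice anything.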

Let's squeeze a little more?

db1=# create table a();

CREATE TABLE

Right off the bat, that might seem completely wrong. Does Postgres allow a table with no Columns?

Yes !! 😎

All databases allow creation of an empty table (obviously), but Postgres allows a new table even if there are no columns! Let's verify this from the Postgres documentation. Although subtle, we can see in the syntax section of the CREATE TABLE page that column_name data_type is enclosed in square brackets [], which implies that columns are, in fact, optional. What's more, this "feature" has been a part of Postgres for at least the past 20 years!

Now the utility of this table is arguable (we'll explore that below), but it is now clear that this syntax is legal and works just fine.

Let's see what such a table looks like with psql's \d:

db1=# \d a

Table "public.a"

Column | Type | Collation | Nullable | Default

--------+------+-----------+----------+---------

That's it! That's the complete output - Since the table has no columns, the output above (rightly) doesn't show anything.

Let's go a little deeper

The obvious next question is: what on earth could a table like this be used for? That is a perfectly good question, and the answer is probably not much. However, if you instead ask whether "squeeze a little more" means that a no-column table takes less storage space, the answer (it depends on the input data) is most probably yes. Let's see how it helps Postgres to know that you don't want to store any column in the table.

db1=# create table a();

CREATE TABLE

db1=# \dt+ a

List of relations

--------+------+-------+--------+-------------+---------------+---------+-------------

(1 row)

db1=# insert into a select;

INSERT 0 1

db1=# \dt+ a

List of relations

--------+------+-------+--------+-------------+---------------+------------+-------------

(1 row)

Nothing new here. We see that an empty (0 column) table consumes 0 bytes of storage and, like a regular table, as soon as the first row is inserted the table uses 1 page, which in my test database (and probably 99.99% of Postgres databases world-wide) is 8192 bytes. This is expected, but do note that the way logical rows are stored in a Postgres page is a little odd (and for good reason). There is a lllllooooottttt of detail here, but I wouldn't blame you if you'd want to keep that aside for a cold winter morning, armed with a cup of hot coffee.

For now, we see below that each row that is inserted into the table, consumes 24 bytes - in that 8kb page.

db1=# create extension pageinspect ;

CREATE EXTENSION

db1=# truncate table a;

TRUNCATE TABLE

db1=# insert into a select FROM generate_series(1,3);

INSERT 0 3

db1=# SELECT * FROM heap_page_items(get_raw_page('a', 0));

lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data

----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------

1 | 8168 | 1 | 24 | 85105 | 0 | 0 | (0,1) | 0 | 2048 | 24 | | | \x

2 | 8144 | 1 | 24 | 85105 | 0 | 0 | (0,2) | 0 | 2048 | 24 | | | \x

3 | 8120 | 1 | 24 | 85105 | 0 | 0 | (0,3) | 0 | 2048 | 24 | | | \x

(3 rows)
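
The lp_off values in that output follow a simple pattern: rows are packed from the end of the page backwards. Here is a small Python sketch of that arithmetic. It assumes an 8192-byte page with nothing reserved at its tail and 8-byte alignment of each row; the function names are made up for illustration.

```python
PAGE_SIZE = 8192  # bytes per page (default build assumed)
MAXALIGN = 8      # rows are assumed to be padded to 8-byte boundaries


def maxalign(n):
    """Round n up to the next multiple of MAXALIGN."""
    return -(-n // MAXALIGN) * MAXALIGN


def tuple_offsets(tuple_lengths):
    """Offset of each row when rows are placed back-to-front on a page."""
    offsets = []
    upper = PAGE_SIZE
    for length in tuple_lengths:
        upper -= maxalign(length)
        offsets.append(upper)
    return offsets


print(tuple_offsets([24, 24, 24]))  # [8168, 8144, 8120]
```

Each 24-byte row lands 24 bytes below the previous one, which is exactly the 8168, 8144, 8120 sequence in the lp_off column.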

Still going Deeper - but on a Tangent

So does the above imply that adding more columns to a table would mean Postgres consumes more bytes-per-row? Let's verify:

db1=# drop table t;

DROP TABLE

db1=# create table t(id bigint);

CREATE TABLE

db1=# truncate table t; insert into t select generate_series(1,3); vacuum full t; \dt+ t

TRUNCATE TABLE

INSERT 0 3

VACUUM

List of relations

--------+------+-------+--------+-------------+---------------+------------+-------------

(1 row)

db1=# SELECT * FROM heap_page_items(get_raw_page('t', 0));

lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data

----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------------------

1 | 8160 | 1 | 32 | 85100 | 0 | 0 | (0,1) | 1 | 2816 | 24 | | | \x0100000000000000

2 | 8128 | 1 | 32 | 85100 | 0 | 0 | (0,2) | 1 | 2816 | 24 | | | \x0200000000000000

3 | 8096 | 1 | 32 | 85100 | 0 | 0 | (0,3) | 1 | 2816 | 24 | | | \x0300000000000000

(3 rows)

Here we see that each row is now consuming 32 bytes, which is 8 bytes more than earlier. Chances are that the only column we've added is the reason for the extra 8 bytes. Let's verify that using the pg_column_size() function (you can read more about it here):

db1=# select pg_column_size(1::bigint);

pg_column_size

----------------

8

(1 row)

But wait, there's one more twist here:

db1=# INSERT INTO t SELECT;

INSERT 0 1

db1=# SELECT * FROM heap_page_items(get_raw_page('t', 0));

lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data

----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+----------+-------+--------------------

1 | 8160 | 1 | 32 | 85111 | 0 | 0 | (0,1) | 1 | 2816 | 24 | | | \x0100000000000000

2 | 8128 | 1 | 32 | 85111 | 0 | 0 | (0,2) | 1 | 2816 | 24 | | | \x0200000000000000

3 | 8096 | 1 | 32 | 85111 | 0 | 0 | (0,3) | 1 | 2816 | 24 | | | \x0300000000000000

4 | 8072 | 1 | 24 | 85113 | 0 | 0 | (0,4) | 1 | 2049 | 24 | 00000000 | | \x

(4 rows)

So wait, see row 4. Although the table has a column, just because the column didn't have a value (i.e. it is NULL), the row actually consumed only 24 bytes (the minimum)?

Is this scalable? I mean, can I have a 5-column table where Postgres still stores a row but consumes only the bare minimum 24 bytes? Let's see:

db1=# drop table h;

DROP TABLE

db1=# create table h(c1 bigint, c2 bigint, c3 bigint, c4 bigint, c5 bigint);

CREATE TABLE

db1=# insert into h select;

INSERT 0 1

db1=# SELECT * FROM heap_page_items(get_raw_page('h', 0));

lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data

----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+----------+-------+--------

1 | 8168 | 1 | 24 | 86262 | 0 | 0 | (0,1) | 5 | 2049 | 24 | 00000000 | | \x

(1 row)

So yes, that does work and it does scale to "many" columns, but with a minor variation. There's more detail in the code, but basically the row header carries a NULL bitmap with 1 bit per column, and the header grows in 8-byte steps. So for, say, a 100-column table (with no data), Postgres consumes ~40 bytes per row.

db1=# drop table h; create table h(); select 'alter table h add column c' || n || ' bigint;' from generate_series(1,100) e(n); \gexec

ALTER TABLE

db1=# select count(*) from pg_attribute where attrelid = 'h'::regclass;

count

-------

106

(1 row)

db1=# insert into h select; vacuum full h; SELECT * FROM heap_page_items(get_raw_page('h', 0));

INSERT 0 1

VACUUM

lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data

----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+----------------------------------------------------------------------------------------------------------+-------+--------

1 | 8152 | 1 | 40 | 88376 | 0 | 0 | (0,1) | 100 | 2817 | 40 | 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 | | \x

(1 row)

Given the task at hand, I'd say that's still decently crisp.
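
The 24-byte and 40-byte figures seen above can be reproduced with a little arithmetic. The Python sketch below assumes a 23-byte fixed row header, a NULL bitmap of 1 bit per column that is present only when the row contains a NULL, and padding of the header to a multiple of 8 bytes. The function name is hypothetical, and this only models the header size (the t_hoff column), not the column data.

```python
import math

FIXED_HEADER = 23  # bytes of fixed row header (assumed, default build)
MAXALIGN = 8       # the header is padded to a multiple of this


def tuple_header_bytes(n_columns, has_nulls=True):
    """Estimated row header size for a table with n_columns columns."""
    # One bit per column, rounded up to whole bytes; absent if no NULLs.
    bitmap = math.ceil(n_columns / 8) if has_nulls else 0
    raw = FIXED_HEADER + bitmap
    return -(-raw // MAXALIGN) * MAXALIGN  # round up to MAXALIGN


for cols in (1, 5, 8, 9, 100):
    print(cols, tuple_header_bytes(cols))
```

With these assumptions, up to 8 columns fit in the single spare byte (24 bytes), the 9th column pushes the header to 32 bytes, and 100 columns need a 13-byte bitmap, giving 23 + 13 = 36, padded to 40.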

So does a zero column table, squeeze max rows per page?

Yes, and No. So going back to a 0 column table - let's try to fill the whole page with rows and see how many can be stuffed on a single page.

If you're running low on coffee: the page header is 24 bytes and each row takes another 24 bytes, and so back-of-the-envelope math suggests that the number of rows we can squeeze into a page should be (8192 bytes in a page - 24 bytes for the page header) / 24 bytes per row = ~340 rows.

db1=# truncate table a; insert into a select FROM generate_series(1,340); vacuum full a; \dt+ a

TRUNCATE TABLE

INSERT 0 340

VACUUM

List of relations

--------+------+-------+--------+-------------+---------------+-------+-------------

(1 row)

Something's not right! That size should have stayed at 8 kB. Why did Postgres use 16 kB (two pages instead of one)?

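That's because of this tiny bit of trivia: the maximum number of tuples that Postgres can squeeze onto a heap page is capped at 291 for an 8 kB page (the compile-time constant MaxHeapTuplesPerPage). The back-of-the-envelope math above forgot that every row also needs a 4-byte line pointer, so the real divisor is 28 bytes, and (8192 - 24) / 28 = 291. That was interesting to know, but to clarify, Heap-Only-Tuples (the HOT feature) can effectively force a few more rows on a running database; we'll go deeper into that some other day.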
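
For the record, here are both calculations side by side in Python. The constants are the default-build figures (8192-byte page, 24-byte page header, 24-byte minimum row, 4-byte line pointer); the variable names are mine.

```python
PAGE_SIZE = 8192     # bytes per page
PAGE_HEADER = 24     # bytes of page header
TUPLE_HEADER = 24    # bytes for a row with no column data
LINE_POINTER = 4     # bytes of line pointer needed for every row

# Naive estimate: only the rows themselves are counted.
naive = (PAGE_SIZE - PAGE_HEADER) // TUPLE_HEADER

# Each row also needs its line pointer, so the divisor is 28, not 24.
actual = (PAGE_SIZE - PAGE_HEADER) // (TUPLE_HEADER + LINE_POINTER)

print(naive, actual)  # 340 291
```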

So let's go back and confirm if that understanding is correct - that Postgres can in fact squeeze at max only 291 rows onto the same page.

db1=# truncate table a; insert into a select FROM generate_series(1,291); vacuum full a; \dt+ a

TRUNCATE TABLE

INSERT 0 291

VACUUM

List of relations

--------+------+-------+--------+-------------+---------------+------------+-------------

(1 row)

db1=# truncate table a; insert into a select FROM generate_series(1,292); vacuum full a; \dt+ a

TRUNCATE TABLE

INSERT 0 292

VACUUM

List of relations

--------+------+-------+--------+-------------+---------------+-------+-------------

(1 row)

Here we see that:

When 291 rows are inserted, the table stays at 1 page (8 kB)
Whereas when 292 (291+1) rows are inserted, the table expands to 2 pages (16 kB)

Utility

So, that's all fine, but what's this table good for?

Well beyond understanding Postgres :) not much. This table is unhelpful for most database tasks. It can't store values - it wouldn't allow columns / indexes / selective deletes (let's avoid ctid hacks for now) etc.

But if I were forced to conjure an idea: for a high-contention ticker app (that only needs to store +1s) via Postgres functions, maybe (and that's a BIG maybe) this table could be used to store +1s, with a back-off algorithm on the application side. Chances are that a good DBA would do this much better in many other ways (the simplest of which is at the application end, or as a value in a column etc.), but it'd be better than nothing.

There are a few other possible use-cases discussed here, e.g. if (for some reason) you'd want to add columns to a table programmatically after a base table is created (like we did above in the 100-column table test), or if you want to reserve a table name (in a multi-user setup) months / years in advance etc.

Finally

Much Ado About Nothing... This was a good exercise where we learnt something new about how Postgres stores table data - when ironically - there's nothing to store :)

Hope you had fun! Comments more than welcome.

6 Jul 2024

Ollama is missing --rate-limits on downloads

I am just starting my AI journey, and trying to get Ollama to work on my linux box, was an interesting non-AI experience.

I noticed that every time I tried out something new, my linux box reliably got stuck whenever I pulled a new model. htop helped point out that each time I did an ollama pull or ollama run, it spun up a ton of threads.

Often things got so bad, that the system became quite unresponsive. Here, you can see "when" I triggered the pull:

Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=8ms TTL=64
Reply from 192.168.85.24: bytes=32 time=65ms TTL=64
Reply from 192.168.85.24: bytes=32 time=286ms TTL=64
Reply from 192.168.85.24: bytes=32 time=286ms TTL=64
Reply from 192.168.85.24: bytes=32 time=304ms TTL=64

A little searching led me to this on-going Github thread where a feature like --rate-limit was requested for multiple reasons. Some people were unhappy with how a pull clogged their routers, some were unhappy with how it jammed all other downloads / browsing on the machine. I was troubled since my linux box (a not-so-recent but still 6.5k BogoMIPS 4vCPU i5) slowed to a crawl.

While the --rate-limit feature takes shape, here are two solutions that did work for me :

As soon as I started the fetch (ollama run or ollama pull etc.), I used iotop to change the ionice priority to idle. This made the issue go away completely (or at least made the system quite usable). However, it was still frustrating since (unlike top and htop) one had to type the PIDs... and as you may have guessed already, Ollama creates quite a few when it does such a fetch.

Note that doing something like nice -n 19 did not help here. This was because the ollama processes weren't actually consuming (much) CPU for this task at all!

Then I tried launching Ollama under ionice, which didn't work either! Ollama uses threads, and ionice (used this way) didn't take effect on the threads within the parent process. So something like the following did not work for me:

# These did not help!

robins@dell:~$ nice -n 19 ollama run mistral # Did not work!
robins@dell:~$ ionice -c3 ollama run mistral # Did not work either!!

After some trial-and-error, a far simpler solution was to just run a series of commands immediately after triggering a new model fetch. Essentially, it gets the parent PID, and then sets ionice for each of the threads of that parent:

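# Find the PID of the "ollama run" process that triggered the fetch
pid=$(ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}')
echo $pid
# Put every thread (SPID) of that process into the idle I/O class
sudo ionice -c3 -p $(ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' ')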

This worked something like this:

robins@dell:~$ pid=`ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}'` && [ ${#pid} -gt 1 ] && ( sudo ionice -c3 -p `ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' '` ; echo "done" ) || echo "skip"
skip

robins@dell:~$ pid=`ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}'` && [ ${#pid} -gt 1 ] && ( sudo ionice -c3 -p `ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' '` ; echo "done" ) || echo "skip"
done

After the above, iotop started showing idle in front of each of the ollama processes:

Total DISK READ:         0.00 B/s | Total DISK WRITE:         3.27 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      36.76 K/s
    TID  PRIO  USER     DISK READ DISK WRITE>    COMMAND
2692712 idle ollama      0.00 B/s  867.62 K/s ollama serve
2705767 idle ollama      0.00 B/s  852.92 K/s ollama serve
2692707 idle ollama      0.00 B/s  849.24 K/s ollama serve
2693740 idle ollama      0.00 B/s  783.07 K/s ollama serve
      1 be/4 root        0.00 B/s    0.00 B/s init splash
      2 be/4 root        0.00 B/s    0.00 B/s [kthreadd]
      3 be/4 root        0.00 B/s    0.00 B/s [pool_workqueue_release]
      4 be/0 root        0.00 B/s    0.00 B/s [kworker/R-rcu_g]
      5 be/0 root        0.00 B/s    0.00 B/s [kworker/R-rcu_p]
      6 be/0 root        0.00 B/s    0.00 B/s [kworker/R-slub_]

While at it, it was funny to note that the fastest way to see whether the unresponsive system was "going to" recover (because of what I had just tried) was to keep a separate ping session open to the linux box. On my local network, I knew the system was going to come back to life in the next few seconds when I noticed the pings begin ack'ing in 5-8 ms instead of the ~100+ ms during the logjam.

So yeah, +10 on the --rate-limit or something similar!

Reference:

https://github.com/ollama/ollama/issues/2006

18 Apr 2019

How about 1000 cascading Replicas :)

The other day, I remembered an old 9.0-era mail thread (when Streaming Replication had just launched) where someone had tried to daisy-chain Postgres Replicas and see how many (s)he could muster.

If I recall correctly, the OP could squeeze only ~120 or so, mostly because the Laptop memory gave way (and not really because of an engine limitation).

I couldn't find that post, but I was intrigued to know whether we could reach (at least) the thousand mark and see what kind of "Replica Lag" that would entail; thus NReplicas.

On a (very) unscientific test, my 4-Core 16G machine can spin up (create data folders and host processes for all) 1000 Replicas in ~8m (and tear them down in another ~2m). Now I'm sure this could get better, but I'm not complaining since this was a breeze to set up (in that it just worked without much tinkering ... besides lowering shared_buffers).

For those interested, a single UPDATE on the master, could (nearly consistently) be seen on the last Replica in less than half a second, with top showing 65% CPU idle (and 2.5 on the 1-min CPU metric) during a ~30 minute test.

Put in simple terms, what this means is that the UPDATE change traveled from the Master to a Replica (let's call it Replica1) and then from Replica1 it cascaded on to Replica2 (and so on, a 1000 times). The said row change can be seen on Replica1000 within half a second.
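
As a quick sanity check on what that implies per hop, here is the division in Python. It assumes the lag is spread evenly across the chain, which is a simplification I have not measured.

```python
def per_hop_ms(total_ms, hops):
    """Average propagation time per replica, assuming an even spread."""
    return total_ms / hops


# 500 ms end-to-end across 1000 cascaded replicas.
print(per_hop_ms(500, 1000))  # 0.5
```

So each replica is, on average, passing the change along in about half a millisecond.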

So although (I hope) this isn't a real-world use-case, I still am impressed that this is right out-of-the-box and still way under the 1 second mark.... certainly worthy of a small post :) !

Host: 16GB / 4 core

Time to spin up (1000 Cascading Replicas): 8 minutes

Time to tear down: 2 minutes

Test type: Constant UPDATEs (AV settings default)

Test Duration: 30min

Time for UPDATE to propagate: 500 ms!! (on average)

CPU Utilization: ~65%

CPU 1-min ratio: 2.5

20 Nov 2017

Update: RDS Prewarm script updated to fetch FSM / VM chunks

(This post is in continuation to my previous post regarding Initializing RDS Postgres Instance)

This simple SQL "initializes" the EBS volume linked to an RDS Instance, something which otherwise isn't possible without sending it workload (and experiencing high latency on the first run).

Key scenarios, where this is really helpful are:

Create a Read-Replica (or Hot Standby in Postgres terms)
Restore a new RDS Instance from a Snapshot

Update: The script now also does the following:

Fetches disk blocks related to the FSM / VM of all tables
Fetches all indexes

Limitations that still exist:

~~TOAST tables are still directly inaccessible in RDS~~

~~Indexes for TOAST columns also fall under this category~~
~~Trying hard to see if this last hurdle can be worked around~~

~~Anyone with any ideas?!~~

Script needs to be run once per Database Owner

Not sure if there is any magic around this

Object ownership is a Postgres property

RDS Postgres does not give Superuser access

I'll try to ease this in the future

By creating a script to list the Users that this needs to run as
The other possibility is to use DBLink to run this for separate Users in a single run

I'll update here, in case I make any significant changes.

Sample Run

-[ RECORD 1 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:40:08.291891-05

table_size | 13 GB

freespace_map_size | 3240 kB

visibility_map_size | 408 kB

blocks_prefetched | 1639801

current_database | pgbench

schema_name | public

table_name | pgbench_accounts

-[ RECORD 2 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:43:37.703711-05

table_size | 2142 MB

freespace_map_size | 0 bytes

visibility_map_size | 0 bytes

blocks_prefetched | 274194

current_database | pgbench

schema_name | public

table_name | pgbench_accounts_pkey

-[ RECORD 3 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.899115-05

table_size | 440 kB

freespace_map_size | 24 kB

visibility_map_size | 8192 bytes

blocks_prefetched | 59

current_database | pgbench

schema_name | public

table_name | pgbench_tellers

-[ RECORD 4 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.901088-05

table_size | 240 kB

freespace_map_size | 0 bytes

visibility_map_size | 0 bytes

blocks_prefetched | 30

current_database | pgbench

schema_name | public

table_name | pgbench_tellers_pkey

-[ RECORD 5 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.905107-05

table_size | 40 kB

freespace_map_size | 0 bytes

visibility_map_size | 0 bytes

blocks_prefetched | 5

current_database | pgbench

schema_name | public

table_name | pgbench_branches_pkey

-[ RECORD 6 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.907089-05

table_size | 40 kB

freespace_map_size | 24 kB

visibility_map_size | 8192 bytes

blocks_prefetched | 9

current_database | pgbench

schema_name | public

table_name | pgbench_branches

-[ RECORD 7 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.907142-05

table_size | 0 bytes

freespace_map_size | 0 bytes

visibility_map_size | 0 bytes

blocks_prefetched | 0

current_database | pgbench

schema_name | public

table_name | pgbench_history
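
Reading the sample run, blocks_prefetched appears to be the table, free space map and visibility map sizes added together and divided by the 8 kB block size. That is my reading of the output above rather than something the script spells out, so treat this Python check as a sketch; the function name is made up.

```python
BLOCK_KB = 8  # Postgres block size in kB (default build assumed)


def expected_blocks(table_kb, fsm_kb=0, vm_kb=0):
    """Blocks needed to hold a relation plus its FSM and VM forks."""
    return (table_kb + fsm_kb + vm_kb) // BLOCK_KB


# Figures taken from records 3 to 6 of the sample run (sizes in kB).
print(expected_blocks(440, 24, 8))  # pgbench_tellers        -> 59
print(expected_blocks(240))         # pgbench_tellers_pkey   -> 30
print(expected_blocks(40))          # pgbench_branches_pkey  -> 5
print(expected_blocks(40, 24, 8))   # pgbench_branches       -> 9
```

All four match the blocks_prefetched values reported for those relations.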

7 Nov 2017

Prewarming / Initializing an RDS Postgres instance (from S3)

UPDATE: Read this for recent updates. Now the SQL successfully fetches *all* disk blocks on most RDS PostgreSQL (read post for the rare exceptions).

As many of you know, AWS RDS Postgres uses EBS, which has an interesting feature called Lazy Loading that allows it to instantiate a disk (the size of which can be mostly anything from 10GB to 6TB) and have it come online within a matter of minutes. Although a fantastic feature, this can lead to unexpected outcomes when high-end production load is thrown at a newly launched RDS Postgres instance immediately after restoring from a Snapshot.

One possible solution is to use the pg_prewarm Postgres Extension that is well supported in RDS Postgres, immediately after Restoring from a Snapshot, thereby reducing the side-effects of Lazy Loading.

Although pg_prewarm was originally meant for populating buffer-cache, this extension (in this specific use-case) is heaven-sent to initialize (fetch), (almost) the entire snapshot from S3 on to the RDS EBS volume in question. Therefore, even if you use pg_prewarm to run through all tables etc., thereby effectively evicting the recent run for the previous table from buffer-cache, it still does the job of initializing all disk-blocks with respect to the EBS volume.

I've just checked in the SQL to this repository that seems to do this magic pretty well. It also lists why this would only take you ~70% of the way, owing to restrictions / limitations (as per my current understanding).

In the Sample below, I restored a new RDS Postgres instance from a Snapshot and immediately thereafter ran this SQL on it.

Notice that the first table (pgbench_accounts) takes about 22 seconds to load the first time, and less than a second to load the second time.
Similarly the second table (pgbench_history) takes about 11 seconds to load the first time and less than a second, the second time :) !

pgbench=> SELECT clock_timestamp(), pg_prewarm(c.oid::regclass),
pgbench-> relkind, c.relname
pgbench-> FROM pg_class c
pgbench-> JOIN pg_namespace n
pgbench-> ON n.oid = c.relnamespace
pgbench-> JOIN pg_user u
pgbench-> ON u.usesysid = c.relowner
pgbench-> WHERE u.usename NOT IN ('rdsadmin', 'rdsrepladmin', 'pg_signal_backend', 'rds_superuser', 'rds_replication')
pgbench-> ORDER BY c.relpages DESC;
clock_timestamp | pg_prewarm | relkind | relname
-------------------------------+------------+---------+-----------------------
2017-11-07 11:41:44.341724+00 | 17903 | r | pgbench_accounts
2017-11-07 11:42:06.059177+00 | 6518 | r | pgbench_history
2017-11-07 11:42:17.126768+00 | 2745 | i | pgbench_accounts_pkey
2017-11-07 11:42:21.406054+00 | 45 | r | pgbench_tellers
2017-11-07 11:42:21.645859+00 | 24 | r | pgbench_branches
2017-11-07 11:42:21.757086+00 | 2 | i | pgbench_branches_pkey
2017-11-07 11:42:21.757653+00 | 2 | i | pgbench_tellers_pkey
(7 rows)

pgbench=>
pgbench=> SELECT clock_timestamp(), pg_prewarm(c.oid::regclass),
pgbench-> relkind, c.relname
pgbench-> FROM pg_class c
pgbench-> JOIN pg_namespace n
pgbench-> ON n.oid = c.relnamespace
pgbench-> JOIN pg_user u
pgbench-> ON u.usesysid = c.relowner
pgbench-> WHERE u.usename NOT IN ('rdsadmin', 'rdsrepladmin', 'pg_signal_backend', 'rds_superuser', 'rds_replication')
pgbench-> ORDER BY c.relpages DESC;
clock_timestamp | pg_prewarm | relkind | relname
-------------------------------+------------+---------+-----------------------
2017-11-07 11:42:33.914195+00 | 17903 | r | pgbench_accounts
2017-11-07 11:42:33.917725+00 | 6518 | r | pgbench_history
2017-11-07 11:42:33.918919+00 | 2745 | i | pgbench_accounts_pkey
2017-11-07 11:42:33.919412+00 | 45 | r | pgbench_tellers
2017-11-07 11:42:33.919427+00 | 24 | r | pgbench_branches
2017-11-07 11:42:33.919438+00 | 2 | i | pgbench_branches_pkey
2017-11-07 11:42:33.919443+00 | 2 | i | pgbench_tellers_pkey
(7 rows)
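
The load times can be read off the clock_timestamp column. The Python sketch below does the subtraction for the two largest tables, assuming each row's timestamp marks the moment that relation's prewarm started (so the next row's timestamp marks its end). The helper name is made up, and the "+00" timezone suffix is dropped since only differences matter.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two timestamps copied from the psql output."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# (relation, its own timestamp, the next row's timestamp) from the first run
first_run = [
    ("pgbench_accounts", "2017-11-07 11:41:44.341724", "2017-11-07 11:42:06.059177"),
    ("pgbench_history", "2017-11-07 11:42:06.059177", "2017-11-07 11:42:17.126768"),
]

for name, start, end in first_run:
    print(name, round(elapsed_seconds(start, end), 1))
```

The same subtraction on the second run gives a few milliseconds per relation, which is the whole point of initializing the volume up front.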