That Guy From Delhi: optimization

Showing posts with label optimization. Show all posts

3 Dec 2025

Speed up JOIN Planning - upto 16x Faster!

The hidden cost of knowing too much. That's one way to describe what happens when your data is skewed, Postgres statistics targets are set high, and the planner tries to estimate a join.

For over 20 years, Postgres used a simple O(N^2) loop to compare (equi-join) Most Common Values (MCVs) during join estimation. It worked fine when statistics targets are small (default_statistics_target defaults to 100). But in the modern era - we often see Postgres best-practices recommend cranking that up. Customers are known to be using higher values (1000 and sometimes even higher) to handle complex data distributions + throw a 10 JOIN query to the mix - and this "dumb loop" can easily become a silent performance killer during planning.

That changes in Postgres 19.

The Problem: It's Quadratic!

When you join two tables, the planner needs to estimate how many rows will match. If both columns have MCV lists (lists of the most frequent values), eqjoinsel() tries to match them up to get a precise estimate.

Historically, it did this by comparing every item in list A with every item in list B.

If you have 100 MCVs, that's 10,000 comparisons. Fast.
If you have 10,000 MCVs (max stats target), that's 100,000,000 comparisons. Not so fast.

This meant that simply *planning* a query could take significantly longer than executing it, especially for simple OLTP queries.

The Solution: Hash It Out

The fix is elegant and effective.

Instead of a nested loop, the planner now:

1. Checks if the total number of MCVs is greater than 200 (100 each side).

2. If so, it builds a Hash Table of the MCVs from one (smaller) side.

3. It then probes this hash table with the MCVs from the other side.

The threshold of 200 was chosen because hashing has a startup cost (allocating memory, computing hashes). For smaller lists, the simple loop is actually faster. But once you cross that threshold, the hash table wins.

This transforms the complexity from O(N^2) to O(N), making the estimation step virtually instantaneous even with the largest statistics targets.

Let's Benchmark

I wanted to verify this myself, so I set up a worst-case scenario: two tables with 100,000 rows, but only 10,000 distinct values, and I cranked the statistics target to the maximum (10,000) to force a massive MCV list.

Setup


CREATE TABLE t1 (x int);
CREATE TABLE t2 (x int);

-- Scenario A: 3M rows, Stats 10000
INSERT INTO t1 SELECT x % 10000 FROM generate_series(1, 3000000) x;
INSERT INTO t2 SELECT x % 10000 FROM generate_series(1, 3000000) x;

-- Maximize statistics target
ALTER TABLE t1 ALTER COLUMN x SET STATISTICS 10000;
ALTER TABLE t2 ALTER COLUMN x SET STATISTICS 10000;

ANALYZE t1;
ANALYZE t2;

Results

I ran the following query 10 times on both versions, interleaved to ensure fairness and used median for the results:


EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM t1 JOIN t2 ON t1.x = t2.x;

Scenario A: High Stats (10k MCVs)

Rows: 3 Million
Statistics Target: 10,000
MCV Count: 10,000

Before (Postgres 18): ~27.8 ms
After (Postgres 19): ~1.75 ms
Speedup: ~16x

Scenario B: Medium Stats (1k MCVs)

Rows: 300,000
Statistics Target: 1,000
MCV Count: 1,000

Before (Postgres 18): ~0.85 ms
After (Postgres 19): ~0.60 ms
Speedup: ~1.4x (40% speed-up)

Scenario C: Default Stats (100 MCVs)

Rows: 30,000
Statistics Target: 100 (Default)
MCV Count: 100

Before (Postgres 18): ~0.40 ms
After (Postgres 19): ~0.43 ms
Speedup: None (Optimization correctly skipped)

Since the total MCV count in Scenario C (100 + 100 = 200) did not exceed the 200 threshold, Postgres 19 correctly chose the simpler loop, avoiding the overhead of building a hash table. This confirms that the patch is smart enough to only kick in when it matters.

The "Quadratic Curve" in Action

To visualize the O(N^2) vs O(N) difference, I ran a benchmark across a wide range of statistics targets (from 10 to 10,000). In each test, the number of rows was set to 300x the statistics target to ensure the MCV list was fully populated and relevant.

Stats Target	PG 18 (ms)	PG 19 (ms)	Speedup
10	0.40	0.38	-
100	0.40	0.45	-
200	0.45	0.43	-
500	0.53	0.52	-
1000	0.85	0.63	1.3x
2000	1.73	0.68	2.5x
5000	7.54	1.14	6.6x
8000	17.97	1.57	11.4x
10000	27.56	1.92	14.3x

As you can see, up to a target of 500, the difference is negligible (and the optimization might not even kick in). But as the target grows, the quadratic cost of the old method explodes, while the new hash-based method scales linearly and remains extremely fast.

This patch is a classic example of modernizing legacy assumptions. The code written 20 years ago assumed MCV lists would be short (in all fairness the default was 100 for a long time). Today's hardware and data requirements have pushed those boundaries, and Postgres is evolving to meet them.

Thanks to Ilia Evdokimov for the patch, David Geier for co-authoring, and Tom Lane for the review.

Note: This feature is currently in the master branch for Postgres 19. As with all development features, it is subject to change before the final release.

References

Commit: Speed up eqjoinsel() with lots of MCV entries
Discussion: Mailing List Thread

29 Nov 2025

Teaching Query Planner to See Inside C Functions

The Postgres planner just got a new superpower: X-ray vision for your black-box functions.

For years, Postgres has been able to "inline" simple SQL functions. If you wrote CREATE FUNCTION ... LANGUAGE SQL AS 'SELECT ...', the planner would often rip out the function call and paste the raw SQL directly into the main query. This was magic. It meant that WHERE clauses could be pushed down, indexes could be used, and performance was excellent.

But if you stepped outside the safety of simple SQL functions—say, into the world of C extensions or complex dynamic SQL—you were often stuck with a "black box." The planner saw a function, executed it blindly, and only *then* applied your filters.

That changes in Postgres 19.

The "Black Box" Problem

Imagine you have a function that returns a large dataset, but you only want a few rows:


SELECT * FROM my_complex_function() WHERE id = 42;

Before Postgres 19, if my_complex_function wasn't a simple SQL function, the database would:

1. Run my_complex_function() to completion (generating, say, 1 million rows).

2. Filter the result to find id = 42.

This is painfully inefficient. You wanted an Index Scan, but you got a full execution followed by a filter.

The Solution: `SupportRequestInlineInFrom`

With this feature, Postgres now allows function authors (specifically in C extensions) to provide a "support function" that tells the planner: *"Hey, if anyone calls me in a FROM clause, here is the raw SQL query tree that represents what I'm doing."*

The planner can then take that query tree and inline it, just like it does for standard SQL functions.

The Impact

This is a game-changer for:

Extensions: Many extensions provide table-like interfaces (e.g., foreign data wrappers, set-returning functions). They can now expose their internal logic to the planner.
Dynamic SQL: Functions that generate SQL strings and execute them can now potentially participate in optimization.

Before vs. After

Scenario: A PL/pgSQL function foo_from_bar that executes a dynamic query on text_tbl (containing 1 million rows).


CREATE TABLE text_tbl (f1 text);
INSERT INTO text_tbl SELECT 'row_' || generate_series(1, 1000000);
INSERT INTO text_tbl SELECT 'common_row' FROM generate_series(1, 100);
CREATE INDEX text_tbl_idx ON text_tbl(f1);
ANALYZE text_tbl;

Here is the PL/pgSQL function I used for testing (designed to return 100 rows). Note that it uses dynamic SQL, which is normally opaque to the planner, and then linked it to the C support function test_inline_in_from_support_func using ALTER FUNCTION.


CREATE OR REPLACE FUNCTION foo_from_bar(colname TEXT, tablename TEXT, filter TEXT)
RETURNS SETOF TEXT
LANGUAGE plpgsql
AS $function$
DECLARE
  sql TEXT;
BEGIN
  sql := format('SELECT %I::text FROM %I', colname, tablename);
  IF filter IS NOT NULL THEN
    sql := CONCAT(sql, format(' WHERE %I::text = $1', colname));
  END IF;
  RETURN QUERY EXECUTE sql USING filter;
END;
$function$ STABLE;

ALTER FUNCTION foo_from_bar(TEXT, TEXT, TEXT)
  SUPPORT test_inline_in_from_support_func;

And here is the query I ran:


SELECT * FROM foo_from_bar('f1', 'text_tbl', 'common_row');

Before (PG 18 and earlier)


->  Function Scan on public.foo_from_bar  (cost=0.25..10.25 rows=1000 width=32)
      Output: foo_from_bar
      Function Call: foo_from_bar('f1'::text, 'text_tbl'::text, 'common_row'::text)
      Buffers: shared hit=32 read=4
      Execution Time: 0.353 ms

The function is treated as a black box. The planner estimates 1000 rows (default) and scans the function output.

After (PG 19)


->  Index Only Scan using text_tbl_idx on public.text_tbl  (cost=0.42..8.44 rows=1 width=10)
      Output: text_tbl.f1
      Index Cond: (text_tbl.f1 = 'common_row'::text)
      Buffers: shared hit=1 read=3
      Execution Time: 0.064 ms

The planner "sees through" the function, realizes it's just a query on text_tbl, and uses the index directly. The execution time drops significantly (5.5x faster in this case).

I also ran a pgbench test (1000 iterations) to measure the end-to-end latency improvement:

Before: 0.107 ms average latency
After: 0.087 ms average latency
Improvement: ~19% faster end-to-end (including network overhead)

Bonus: Outer WHERE Pushdown - Orders of magnitude faster!

Next, I ran a test where NULL was passed to the function (returning all 1 million rows) and applied an external WHERE clause:


SELECT * FROM foo_from_bar('f1', 'text_tbl', NULL) WHERE foo_from_bar = 'common_row';

Before (PG 18):


->  Function Scan on public.foo_from_bar  (cost=0.25..12.75 rows=5 width=32)
      Output: foo_from_bar
      Function Call: foo_from_bar('f1'::text, 'text_tbl'::text, NULL::text)
      Filter: (foo_from_bar.foo_from_bar = 'common_row'::text)
      Rows Removed by Filter: 1000000
      Execution Time: 136.267 ms

The database scanned all 1 million rows from the function, then filtered them down to 100.

After (PG 19):


->  Index Only Scan using text_tbl_idx on public.text_tbl  (cost=0.42..8.44 rows=1 width=10)
      Output: text_tbl.f1
      Index Cond: (text_tbl.f1 = 'common_row'::text)
      Execution Time: 0.055 ms

The planner pushed the WHERE clause *inside* the function logic, triggering an index lookup (~2400x speedup).

Similarly, I tested adding a LIMIT 5 clause to the query and saw a ~20x speedup!!

Before: The database scanned the function output until it got 5 rows (~1.091 ms).
After: The planner pushed the limit down into the index scan, stopping immediately after finding 5 rows (~0.051 ms).

Testing Methodology

The benchmarks above were conducted using pgbench with the following parameters:

Iterations: 1000 transactions per client.
Concurrency: Single client (1 thread).
Data Volume: The target table text_tbl contained 1,000,100 rows.
Return Size: The query returned 100 rows per execution.
Metric: Average latency reported by pgbench (excluding connection overhead).

Commits tested: c0bc9af (before) and b140c8d (after).

The C support function (test_inline_in_from_support_func) used in this example is defined in src/test/regress/regress.c that is a part of the regression tests. To know more about support functions, see PostgreSQL Documentation on Function Optimization Information.

Discussion & Credits

Coincidentally a Linkedin thread mentioned that Set-Returning Functions (SRF) began their journey in Postgres 7.3 - that was 17 years ago! With this patch, the planner should be able to see through such a function and develop much better execution plans around them! Thanks to Paul A. Jungwirth (for initating the idea with a working patch) and Tom Lane (for in-depth review).

Note: This feature is currently in the master branch for Postgres 19. As with all development features, it is subject to change before the final release.

27 Nov 2025

Settling COUNT(*) vs COUNT(1) debate in Postgres 19

A recent commit to the PostgreSQL master branch brings a nice quality-of-life optimization for a very common SQL pattern - improving performance by up to 64% for SELECT COUNT(h) where h is a NOT NULL column.

If you've ever wondered whether you should use COUNT(*) or COUNT(1), or if you've been dutifully using COUNT(id) on a non-null column, this change is for you.

Note: This feature is currently on the PostgreSQL master branch (committed in November 2025). As with any commit on master, it is subject to change or even rollback before the final release, although this is relatively rare for committed features. If all goes well, this change will be part of the PostgreSQL 19 major version release.

The Optimization

Historically in PostgreSQL, COUNT(*) has been faster than COUNT(1) or COUNT(column).

COUNT(*): The planner knows it just needs to count rows. It doesn't need to check if any specific column is null.
COUNT(column): The executor has to fetch the value of the column for each row and check if it is NULL before counting it. This involves "deforming" the tuple (extracting the data), which adds overhead.
COUNT(1): Often used as a "faster" alternative by users, but treated similarly to COUNT(column) where the "column" is the constant 1.

The new commit, Have the planner replace COUNT(ANY) with COUNT(*), when possible, changes this behavior.

The planner can now detect when you are counting a generic expression (COUNT(ANY)) that:

1. Cannot be NULL: For example, a constant like 1 or a column defined as NOT NULL.

2. Has no extra clauses: No DISTINCT or ORDER BY inside the aggregate.

In these cases, the planner will automatically transform the query to use COUNT(*).

Performance Impact

This optimization eliminates the overhead of tuple deforming and null checking for these cases.

I ran a benchmark on a local PostgreSQL instance to verify the impact, comparing COUNT(h) (not null column), COUNT(1) (constant), and COUNT(*) (rows) on a table with 10 million rows.

Before Patch:

SELECT count(h) FROM t: ~195 ms
SELECT count(1) FROM t: ~126 ms
SELECT count(*) FROM t: ~119 ms
*Observation:* COUNT(h) is significantly slower (~64% overhead vs count(*)). COUNT(1) is faster than COUNT(h) but still lags behind COUNT(*).

After Patch:

SELECT count(h) FROM t: ~117 ms
SELECT count(1) FROM t: ~116 ms
SELECT count(*) FROM t: ~114 ms
*Observation:* COUNT(h) sees the major win, improving dramatically to match COUNT(*). COUNT(1) sees a minor speedup, also converging to the optimal COUNT(*) performance.

This means users who prefer COUNT(1) for stylistic reasons, or who count specific non-null columns, will now get the optimal performance of COUNT(*) automatically.

(The patch author, David Rowley, demonstrated similar speedups in the initial discussion).

What about Nullable Columns?

I also tested COUNT(nullable_col) to see if the optimization applies. As expected, it does not.

Before Patch: COUNT(nullable_col): ~162 ms
After Patch: COUNT(nullable_col): ~158 ms (No significant change)

Since the planner cannot guarantee the column is free of NULLs, it must still scan and check every value, so the performance remains unchanged. The optimization strictly targets columns where the planner *knows* the value cannot be NULL.

What about DISTINCT or ORDER BY?

I also tested COUNT(DISTINCT h) and COUNT(h ORDER BY h) to see if the optimization applies. As expected, it does not.

Before Patch: COUNT(DISTINCT h): ~2663 ms (Median of 10 runs)
After Patch: COUNT(DISTINCT h): ~2708 ms (Median of 10 runs)

Similarly for ORDER BY, the executor still needs to process the values to sort them, so no optimization is possible.

Test Methodology

Versions:

* Before: Tested on commit dbdc717ac6743074c3a55fc5c380638c91d24afd.

* After: Tested on commit 42473b3b31238b15cc3c030b4416b2ee79508d8c.

Data: Table t with 10 million rows, populated via INSERT INTO t (h) SELECT 1 FROM generate_Series(1,10000000);.
Procedure: Each query was executed 10 times in an interleaved pattern (A, B, C, A, B, C...) to ensure fairness and mitigate caching bias. The reported time is the median of those 10 runs.
System: Intel Core i7-4770 CPU @ 3.40GHz (4 cores/8 threads), 16GB RAM.

Detailed Analysis

To understand *why* the performance improved, we can look at the EXPLAIN plans and table statistics.

Table Size: The table t (10 million rows) is approximately 346 MB.
Disk vs. Memory: The benchmark results (median of 10 runs) indicate mostly in-memory / hot-cache performance (~3 GB/s throughput). However, even with some disk I/O involved (as seen in single-run EXPLAIN (ANALYZE, BUFFERS) tests showing ~28k buffer reads), the CPU overhead of deforming the tuple was the dominant differentiator.

The Key Difference:

In the Before scenario, COUNT(h) required the scanner to fetch the column data (width=4), whereas COUNT(*) did not (width=0).


-- Before Patch: COUNT(h)
->  Parallel Seq Scan on t  (... rows=4166657 width=4) ...

In the After scenario, the planner transforms COUNT(h) into COUNT(*), allowing it to skip fetching the column data entirely (width=0).


-- After Patch: COUNT(h)
->  Parallel Seq Scan on t  (... rows=4166657 width=0) ...

This reduction in "width" means the executor doesn't need to extract the integer value from the row, saving significant CPU cycles.

Community Discussion & Trivia

Thanks to David Rowley for working on the patch and Corey Huinker / Matheus Alcantara for reviewing this. The discussions debated on trade-offs between what is fastest and which of these is an "anti-pattern", but it is great to see that now all three are now equivalently optimized.

Future Possibilities:

The infrastructure added (SupportRequestSimplifyAggref) opens the door for further optimizations. For instance, COUNT(NULL) could theoretically be optimized to instantly return 0 without scanning any rows - although that specific change wasn't included in this commit to keep the scope focused.

References

Commit Message: Have the planner replace COUNT(ANY) with COUNT(*), when possible
Discussion Thread: Have the planner convert COUNT(1) / COUNT(not_null_col) to COUNT(*)

11 May 2025

FOR KEY SHARE optimization and the SLRU Trap

Optimizing database concurrency is a constant balancing act. We often tweak locking strategies in PostgreSQL, aiming to allow more simultaneous operations without compromising data integrity. A common scenario involves shifting from stricter row-level locks to more lenient ones. But sometimes, what seems like a straightforward optimization can lead to unexpected performance bottlenecks in less obvious parts of the system.

This post explores one such scenario: moving from SELECT FOR NO KEY UPDATE to SELECT FOR KEY SHARE, the potential for subsequent MultiXactOffsetSLRU wait events, and how PostgreSQL 17 offers a direct solution.

The Locking Strategy Shift: Aiming for Higher Concurrency

Let's start with the locks themselves. Within a transaction, you might initially use:

SELECT FOR NO KEY UPDATE: This acquires a moderately strong row lock. It prevents others from deleting the row, updating key columns, or acquiring FOR UPDATE / FOR NO KEY UPDATE locks on it. However, it does allow concurrent non-key updates and weaker FOR SHARE / FOR KEY SHARE locks. Importantly (and we’ll see why later), only one transaction can hold this lock (or FOR UPDATE) on a given row at a time.

To potentially increase concurrency, especially if you only need to prevent key changes or deletions (like ensuring a foreign key reference remains valid), you might switch to:

SELECT FOR KEY SHARE: This is a weaker, shared lock. It blocks deletions and key updates but allows concurrent non-key updates and even other concurrent SELECT FOR KEY SHARE (or FOR SHARE) locks on the exact same row.

The intended outcome of switching to FOR KEY SHARE is often to reduce blocking and allow more transactions to proceed in parallel, particularly if the main concern is referential integrity rather than preventing all concurrent modifications.

The Unforeseen Bottleneck: Enter MultiXacts and SLRU Caches

While the switch does allow higher concurrency at the row-lock level, it can create pressure elsewhere. Here’s the chain reaction:

Increased Shared Lock Concurrency: Your application now has more situations where multiple transactions hold a shared lock (FOR KEY SHARE) on the same row simultaneously.
The MultiXact System: How does PostgreSQL track that multiple transactions (potentially dozens or hundreds) have a shared interest in a single row? It uses a mechanism called MultiXact IDs (Multi-Transaction IDs). Instead of just one transaction ID locking the row, PostgreSQL assigns a special MultiXact ID that represents the group of transactions currently sharing a lock on it.
SLRU Caches: Managing this MultiXact metadata efficiently requires quick access. PostgreSQL uses specialized SLRU (Simple Least Recently Used) caches in shared memory for this. These caches store the mappings (offsets) from rows to their MultiXact member lists (MultiXactOffsetSLRU) and the member lists themselves (MultiXactMemberSLRU).
The Bottleneck (PG 16 and older): Before PostgreSQL 17, these SLRU caches had relatively small, fixed sizes determined at compile time. When the workload switch dramatically increased the demand for MultiXact tracking (due to more concurrent shared locks), these small caches could easily become overwhelmed.
The Symptom (MultiXactOffsetSLRU Waits): An overloaded SLRU cache leads to performance degradation manifesting as specific Wait Events. You might see high MultiXactOffsetSLRU waits, indicating processes are frequently:

Waiting for disk I/O because the required MultiXact offset data wasn't found in the small cache (cache miss).
Waiting to acquire low-level locks needed to access or update the cache's shared memory buffers, because many processes are trying to use the limited cache concurrently (lock contention).
Many backends appear hung - as can be seen in this recent community thread.

So trying to increase concurrency at the row-level, created a bottleneck in the underlying mechanism to manage that very concurrency!

PostgreSQL 17 to the Rescue: Configurable SLRU Buffers

Recognizing this potential bottleneck, the PostgreSQL developers introduced direct solutions in PostgreSQL 17 (released September 2024). These come in the form of new configurable parameters:

multixact_offset_buffers:

This parameter controls the size (in buffer pages) of the MultiXactOffset SLRU cache in shared memory. The default value is very small (16) and this allows administrators to allocate more RAM to cache the crucial row-to-member-list mappings. This significantly increases the cache hit rate, reduces disk I/O for MultiXact offsets, and directly alleviates the pressure causing MultiXactOffsetSLRU waits.

multixact_member_buffers:

This parameter controls the size of the MultiXactMember SLRU cache, which stores the actual lists of transaction IDs. This is possibly less directly tied to the Offset wait event, ensuring in-cache member lists improves the overall performance and throughput of the entire MultiXact lookup process, which is essential when handling high shared-lock concurrency.

You can learn more about these new parameters in this fantastic discussion - https://pganalyze.com/blog/5mins-postgres-17-configurable-slru-cache

These parameters allow DBAs to tune the memory allocated to these critical caches based on their specific workload, moving away from the one-size-fits-all limitation of previous versions.

Conclusion

Switching locking strategies in PostgreSQL requires careful consideration not just of the direct blocking rules but also of the potential impact on underlying mechanisms. Moving from SELECT FOR NO KEY UPDATE to the more concurrency-friendly SELECT FOR KEY SHARE can be beneficial, but it increases the load on the MultiXact system. In versions before PostgreSQL 17, this could lead to performance bottlenecks manifesting as MultiXactOffsetSLRU wait events due to contention on small, fixed-size caches.

Thankfully, PostgreSQL 17 provides the tools needed to manage this directly with the multixact_offset_buffers and multixact_member_buffers GUCs. If you encounter these specific wait events after increasing shared lock usage, upgrading to PostgreSQL 17+ and tuning these parameters should be a key part of your resolution strategy. As always, monitor your system's wait events and performance metrics closely when making changes to locking or configuration.

29 Feb 2020

Optimizations in GROUP BY vs SELECT DISTINCT

(This came out of something I was trying out + discussing with Postgres enthusiasts - thanks to all for clarifying doubts)

This article aims at highlighting one aspect of how the query planner implementation of SELECT * GROUP BY differs from SELECT DISTINCT.

For example:

SELECT b,c,d FROM a GROUP BY b,c,d;
vs
SELECT DISTINCT b,c,d FROM a;

We see a few scenarios where Postgres optimizes by removing unnecessary columns from the GROUP BY list (if a subset is already known to be Unique) and where Postgres could do even better. To highlight this difference, here I have an empty table with 3 columns:

postgres=# create table a (b integer, c text, d bigint);
CREATE TABLE

postgres=# \d a
                 Table "public.a"
Column | Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
b      | integer |           |          |
c      | text    |           |          |
d      | bigint |           |          |

On this table, we can see that SELECT * GROUP BY generates the exact same plan as SELECT DISTINCT. In particular, we're interested in the "Group Key" which is the same for both SQLs:

postgres=# explain select distinct b,c,d from a;
                         QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=29.78..31.78 rows=200 width=44)
   Group Key: b, c, d
   -> Seq Scan on a (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=29.78..31.78 rows=200 width=44)
   Group Key: b, c, d
   -> Seq Scan on a (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

Having said that, if the same table is created with a PRIMARY KEY, we see that GROUP BY becomes smarter, in that we can see that the "Group Key" uses the Primary Key (here it is 'b') and correcty discards columns 'c' and 'd'. Nice 😄!

postgres=# create table a (b integer PRIMARY KEY, c text, d bigint);
CREATE TABLE
postgres=# explain select distinct b,c,d from a;
                         QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=29.78..41.08 rows=1130 width=44)
   Group Key: b, c, d
   -> Seq Scan on a (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=24.12..35.42 rows=1130 width=44)
   Group Key: b
   -> Seq Scan on a (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

Let's check if we get the same optimization if we create a UNIQUE index on the column. The answer? Sadly No! Furthermore, I went ahead and created a NOT NULL constraint, but that didn't change anything either. (Do note that UNIQUE columns can have multiple rows with NULLs).

postgres=# create table a (b integer unique not null, c text, d bigint);
CREATE TABLE

postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=29.78..41.08 rows=1130 width=44)
   Group Key: b, c, d
   -> Seq Scan on a (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

Regarding the above, IIUC this is an obvious performance optimization that Postgres is still leaving on the table (as of v13+):

postgres=# select version();
                                                     version
------------------------------------------------------------------------------------------------------------------
PostgreSQL 13devel on i686-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 32-bit
(1 row)

Next, does it still optimize this, if the PRIMARY KEY is not the first column in the GROUP BY? Answer? Yes! (The engine can optimize if any of the GROUPed BY column is a Primary Key! Noice !

postgres=# create table a (b integer, c text primary key, d bigint);
CREATE TABLE

postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=24.12..35.42 rows=1130 width=44)
   Group Key: c
   -> Seq Scan on a (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

... and what if the PRIMARY KEY is a composite key of any of the columns in the GROUP BY column list? YES again 😄 !

postgres=# create table a (b int, c text, d bigint, primary key (c,d)) ;
CREATE TABLE
postgres=# explain select b,c,d from a group by b,c,d;
                         QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=26.95..28.95 rows=200 width=44)
   Group Key: c, d
   -> Seq Scan on a (cost=0.00..21.30 rows=1130 width=44)
(3 rows)

Lastly, although some of these "optimizations" are things-to-avoid when writing good SQL, the reality is that ORM generated SQLs aren't that smart yet and then it's great that Postgres already implements these obvious optimizations.

20 Nov 2017

Update: RDS Prewarm script updated to fetch FSM / VM chunks

(This post is in continuation to my previous post regarding Initializing RDS Postgres Instance)

This simple SQL "Initializes" the EBS volume linked to an RDS Instance, something which isn't possible to do without sending workload (and experience high Latency in the first run).

Key scenarios, where this is really helpful are:

Create a Read-Replica (or Hot Standby in Postgres terms)
Restore a new RDS Instance from a Snapshot

Update: The Script, now also does the following:

Now also fetches disk blocks related to FSM / VM of all tables
Now fetches all Indexes

Limitations that still exist:

~~TOAST tables are still directly inaccessible in RDS~~

~~Indexes for TOAST columns also fall under this category~~
~~Trying hard to see if this last hurdle can be worked around~~

~~Anyone with any ideas?!~~

Script needs to be run once per Database Owner

Not sure if there is any magic around this

Object ownership is a Postgres property

RDS Postgres does not give Superuser access

I'll try to ease this in the future

By creating a script to list the Users that this needs to run as
The other possibility is to use DBLink to run this for separate Users in a single run

I'll update here, in case I make any significant changes.

Sample Run

-[ RECORD 1 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:40:08.291891-05

table_size | 13 GB

freespace_map_size | 3240 kB

visibility_map_size | 408 kB

blocks_prefetched | 1639801

current_database | pgbench

schema_name | public

table_name | pgbench_accounts

-[ RECORD 2 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:43:37.703711-05

table_size | 2142 MB

freespace_map_size | 0 bytes

visibility_map_size | 0 bytes

blocks_prefetched | 274194

current_database | pgbench

schema_name | public

table_name | pgbench_accounts_pkey

-[ RECORD 3 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.899115-05

table_size | 440 kB

freespace_map_size | 24 kB

visibility_map_size | 8192 bytes

blocks_prefetched | 59

current_database | pgbench

schema_name | public

table_name | pgbench_tellers

-[ RECORD 4 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.901088-05

table_size | 240 kB

freespace_map_size | 0 bytes

visibility_map_size | 0 bytes

blocks_prefetched | 30

current_database | pgbench

schema_name | public

table_name | pgbench_tellers_pkey

-[ RECORD 5 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.905107-05

table_size | 40 kB

freespace_map_size | 0 bytes

visibility_map_size | 0 bytes

blocks_prefetched | 5

current_database | pgbench

schema_name | public

table_name | pgbench_branches_pkey

-[ RECORD 6 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.907089-05

table_size | 40 kB

freespace_map_size | 24 kB

visibility_map_size | 8192 bytes

blocks_prefetched | 9

current_database | pgbench

schema_name | public

table_name | pgbench_branches

-[ RECORD 7 ]-------+------------------------------

clock_timestamp | 2017-11-19 15:44:12.907142-05

table_size | 0 bytes

freespace_map_size | 0 bytes

visibility_map_size | 0 bytes

blocks_prefetched | 0

current_database | pgbench

schema_name | public

table_name | pgbench_history

3 Dec 2025

The Problem: It's Quadratic!

The Solution: Hash It Out

Let's Benchmark

Setup

Results

Scenario A: High Stats (10k MCVs)

Scenario B: Medium Stats (1k MCVs)

Scenario C: Default Stats (100 MCVs)

The "Quadratic Curve" in Action

References

29 Nov 2025

The "Black Box" Problem

The Solution: SupportRequestInlineInFrom

The Impact

Before vs. After

Before (PG 18 and earlier)

After (PG 19)

Bonus: Outer WHERE Pushdown - Orders of magnitude faster!

Testing Methodology

Discussion & Credits

27 Nov 2025

The Optimization

Performance Impact

What about Nullable Columns?

What about DISTINCT or ORDER BY?

Test Methodology

Detailed Analysis

Community Discussion & Trivia

References

11 May 2025

FOR KEY SHARE optimization and the SLRU Trap

The Locking Strategy Shift: Aiming for Higher Concurrency

The Unforeseen Bottleneck: Enter MultiXacts and SLRU Caches

PostgreSQL 17 to the Rescue: Configurable SLRU Buffers

Conclusion

29 Feb 2020

20 Nov 2017

Sample Run

The Solution: `SupportRequestInlineInFrom`