That Guy From Delhi: Speed up JOIN Planning

The hidden cost of knowing too much. That's one way to describe what happens when your data is skewed, Postgres statistics targets are set high, and the planner tries to estimate a join.

For over 20 years, Postgres used a simple O(N^2) loop to compare (equi-join) Most Common Values (MCVs) during join estimation. It worked fine when statistics targets are small (default_statistics_target defaults to 100). But in the modern era - we often see Postgres best-practices recommend cranking that up. Customers are known to be using higher values (1000 and sometimes even higher) to handle complex data distributions + throw a 10 JOIN query to the mix - and this "dumb loop" can easily become a silent performance killer during planning.

That changes in Postgres 19.

The Problem: It's Quadratic!

When you join two tables, the planner needs to estimate how many rows will match. If both columns have MCV lists (lists of the most frequent values), eqjoinsel() tries to match them up to get a precise estimate.

Historically, it did this by comparing every item in list A with every item in list B.

If you have 100 MCVs, that's 10,000 comparisons. Fast.
If you have 10,000 MCVs (max stats target), that's 100,000,000 comparisons. Not so fast.

This meant that simply *planning* a query could take significantly longer than executing it, especially for simple OLTP queries.

The Solution: Hash It Out

The fix is elegant and effective.

Instead of a nested loop, the planner now:

1. Checks if the total number of MCVs is greater than 200 (100 each side).

2. If so, it builds a Hash Table of the MCVs from one (smaller) side.

3. It then probes this hash table with the MCVs from the other side.

The threshold of 200 was chosen because hashing has a startup cost (allocating memory, computing hashes). For smaller lists, the simple loop is actually faster. But once you cross that threshold, the hash table wins.

This transforms the complexity from O(N^2) to O(N), making the estimation step virtually instantaneous even with the largest statistics targets.

Let's Benchmark

I wanted to verify this myself, so I set up a worst-case scenario: two tables with 100,000 rows, but only 10,000 distinct values, and I cranked the statistics target to the maximum (10,000) to force a massive MCV list.

Setup


CREATE TABLE t1 (x int);
CREATE TABLE t2 (x int);

-- Scenario A: 3M rows, Stats 10000
INSERT INTO t1 SELECT x % 10000 FROM generate_series(1, 3000000) x;
INSERT INTO t2 SELECT x % 10000 FROM generate_series(1, 3000000) x;

-- Maximize statistics target
ALTER TABLE t1 ALTER COLUMN x SET STATISTICS 10000;
ALTER TABLE t2 ALTER COLUMN x SET STATISTICS 10000;

ANALYZE t1;
ANALYZE t2;

Results

I ran the following query 10 times on both versions, interleaved to ensure fairness and used median for the results:


EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM t1 JOIN t2 ON t1.x = t2.x;

Scenario A: High Stats (10k MCVs)

Rows: 3 Million
Statistics Target: 10,000
MCV Count: 10,000

Before (Postgres 18): ~27.8 ms
After (Postgres 19): ~1.75 ms
Speedup: ~16x

Scenario B: Medium Stats (1k MCVs)

Rows: 300,000
Statistics Target: 1,000
MCV Count: 1,000

Before (Postgres 18): ~0.85 ms
After (Postgres 19): ~0.60 ms
Speedup: ~1.4x (40% speed-up)

Scenario C: Default Stats (100 MCVs)

Rows: 30,000
Statistics Target: 100 (Default)
MCV Count: 100

Before (Postgres 18): ~0.40 ms
After (Postgres 19): ~0.43 ms
Speedup: None (Optimization correctly skipped)

Since the total MCV count in Scenario C (100 + 100 = 200) did not exceed the 200 threshold, Postgres 19 correctly chose the simpler loop, avoiding the overhead of building a hash table. This confirms that the patch is smart enough to only kick in when it matters.

The "Quadratic Curve" in Action

To visualize the O(N^2) vs O(N) difference, I ran a benchmark across a wide range of statistics targets (from 10 to 10,000). In each test, the number of rows was set to 300x the statistics target to ensure the MCV list was fully populated and relevant.

Stats Target	PG 18 (ms)	PG 19 (ms)	Speedup
10	0.40	0.38	-
100	0.40	0.45	-
200	0.45	0.43	-
500	0.53	0.52	-
1000	0.85	0.63	1.3x
2000	1.73	0.68	2.5x
5000	7.54	1.14	6.6x
8000	17.97	1.57	11.4x
10000	27.56	1.92	14.3x

As you can see, up to a target of 500, the difference is negligible (and the optimization might not even kick in). But as the target grows, the quadratic cost of the old method explodes, while the new hash-based method scales linearly and remains extremely fast.

This patch is a classic example of modernizing legacy assumptions. The code written 20 years ago assumed MCV lists would be short (in all fairness the default was 100 for a long time). Today's hardware and data requirements have pushed those boundaries, and Postgres is evolving to meet them.

Thanks to Ilia Evdokimov for the patch, David Geier for co-authoring, and Tom Lane for the review.

Note: This feature is currently in the master branch for Postgres 19. As with all development features, it is subject to change before the final release.

References

Commit: Speed up eqjoinsel() with lots of MCV entries
Discussion: Mailing List Thread

That Guy From Delhi

3 Dec 2025

Speed up JOIN Planning - upto 16x Faster!