FOR KEY SHARE Optimization and the SLRU Trap
Optimizing database concurrency is a constant balancing act. We often tweak locking strategies in PostgreSQL, aiming to allow more simultaneous operations without compromising data integrity. A common scenario involves shifting from stricter row-level locks to more lenient ones. But sometimes, what seems like a straightforward optimization can lead to unexpected performance bottlenecks in less obvious parts of the system.
This post explores one such scenario: moving from SELECT FOR NO KEY UPDATE to SELECT FOR KEY SHARE, the potential for subsequent MultiXactOffsetSLRU wait events, and how PostgreSQL 17 offers a direct solution.
The Locking Strategy Shift: Aiming for Higher Concurrency
Let's start with the locks themselves. Within a transaction, you might initially use:
SELECT FOR NO KEY UPDATE: This acquires a moderately strong row lock. It prevents other transactions from deleting the row, updating it (even non-key columns, since any UPDATE itself acquires at least FOR NO KEY UPDATE on the row), or taking FOR UPDATE, FOR NO KEY UPDATE, or FOR SHARE locks on it. The only lock it tolerates concurrently is FOR KEY SHARE. Importantly (and we’ll see why later), only one transaction at a time can hold this lock (or FOR UPDATE) on a given row.
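A minimal sketch of the stronger lock in action, assuming a hypothetical orders table (id as the primary key, status as a non-key column):

-- Session A
BEGIN;
SELECT * FROM orders WHERE id = 1 FOR NO KEY UPDATE;
-- The row is now locked in this mode: any other UPDATE or DELETE of it,
-- or a FOR UPDATE / FOR NO KEY UPDATE / FOR SHARE attempt, will block
-- until Session A commits or rolls back.

-- Session B
BEGIN;
SELECT * FROM orders WHERE id = 1 FOR NO KEY UPDATE;  -- blocks, waiting for A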
To potentially increase concurrency, especially if you only need to prevent key changes or deletions (like ensuring a foreign key reference remains valid), you might switch to:
SELECT FOR KEY SHARE: This is a weaker, shared lock. It blocks deletions and key updates but allows concurrent non-key updates and even other concurrent SELECT FOR KEY SHARE (or FOR SHARE) locks on the exact same row.
The intended outcome of switching to FOR KEY SHARE is often to reduce blocking and allow more transactions to proceed in parallel, particularly if the main concern is referential integrity rather than preventing all concurrent modifications.
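Here is the same scenario with the weaker lock, again using the hypothetical orders table. Both sessions proceed immediately, and a non-key update slips through as well:

-- Session A
BEGIN;
SELECT * FROM orders WHERE id = 1 FOR KEY SHARE;  -- returns immediately

-- Session B
BEGIN;
SELECT * FROM orders WHERE id = 1 FOR KEY SHARE;  -- also returns immediately

-- Session C
UPDATE orders SET status = 'shipped' WHERE id = 1;  -- non-key update: proceeds
DELETE FROM orders WHERE id = 1;  -- needs FOR UPDATE: blocks on A and B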
The Unforeseen Bottleneck: Enter MultiXacts and SLRU Caches
While the switch does allow higher concurrency at the row-lock level, it can create pressure elsewhere. Here’s the chain reaction:
Increased Shared Lock Concurrency: Your application now has more situations where multiple transactions hold a shared lock (FOR KEY SHARE) on the same row simultaneously.
The MultiXact System: How does PostgreSQL track that multiple transactions (potentially dozens or hundreds) have a shared interest in a single row? It uses a mechanism called MultiXact IDs (Multi-Transaction IDs). Instead of a single transaction ID locking the row, PostgreSQL assigns a special MultiXact ID that represents the whole group of transactions currently sharing a lock on it (a pgrowlocks sketch further below shows one in the wild).
SLRU Caches: Managing this MultiXact metadata efficiently requires quick access. PostgreSQL uses specialized SLRU (Simple Least Recently Used) caches in shared memory for this. These caches store the mappings (offsets) from rows to their MultiXact member lists (MultiXactOffsetSLRU) and the member lists themselves (MultiXactMemberSLRU).
The Bottleneck (PG 16 and older): Before PostgreSQL 17, these SLRU caches had relatively small, fixed sizes determined at compile time. When the workload switch dramatically increased the demand for MultiXact tracking (due to more concurrent shared locks), these small caches could easily become overwhelmed.
The Symptom (MultiXactOffsetSLRU Waits): An overloaded SLRU cache leads to performance degradation that shows up as specific wait events (the monitoring queries after this list show how to spot them). You might see high MultiXactOffsetSLRU waits, indicating processes are frequently:
Waiting for disk I/O because the required MultiXact offset data wasn't found in the small cache (cache miss).
Waiting to acquire low-level locks needed to access or update the cache's shared memory buffers, because many processes are trying to use the limited cache concurrently (lock contention).
Many backends can appear hung, as seen in this recent community thread.
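If you suspect this is happening, two quick checks (a sketch: pg_stat_slru exists since PostgreSQL 13, and the cache names in it vary slightly across versions, hence the pattern match):

-- Which backends are waiting on MultiXact SLRU access right now?
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event IN ('MultiXactOffsetSLRU', 'MultiXactMemberSLRU');

-- Cumulative SLRU statistics: a poor hit ratio on the MultiXact caches
-- (blks_hit versus blks_read) points to an undersized cache.
SELECT name, blks_hit, blks_read, blks_written
FROM pg_stat_slru
WHERE name ILIKE '%multixact%';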
So the attempt to increase concurrency at the row level created a bottleneck in the very mechanism that manages that concurrency!
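Before turning to the fix, you can confirm that MultiXacts are really in play with the contrib pgrowlocks extension (a sketch, again against the hypothetical orders table):

CREATE EXTENSION IF NOT EXISTS pgrowlocks;

-- With Sessions A and B from the earlier example still holding FOR KEY SHARE:
SELECT locked_row, locker, multi, xids, modes
FROM pgrowlocks('orders');
-- multi = true means 'locker' is a MultiXact ID rather than a plain
-- transaction ID, and 'xids' lists the member transactions sharing the lock.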
PostgreSQL 17 to the Rescue: Configurable SLRU Buffers
Recognizing this potential bottleneck, the PostgreSQL developers introduced direct solutions in PostgreSQL 17 (released September 2024). These come in the form of new configurable parameters:
multixact_offset_buffers:
This parameter controls the size (in buffer pages, 8 kB each by default) of the MultiXactOffset SLRU cache in shared memory. The default is very small (16 pages), and raising it lets administrators dedicate more RAM to caching the crucial row-to-member-list mappings. A larger cache increases the hit rate, reduces disk I/O for MultiXact offsets, and directly alleviates the pressure behind MultiXactOffsetSLRU waits.
multixact_member_buffers:
This parameter controls the size of the MultiXactMember SLRU cache, which stores the actual lists of member transaction IDs. It is less directly tied to the Offset wait event, but keeping member lists in cache speeds up the entire MultiXact lookup path, which is essential when handling high shared-lock concurrency.
You can learn more about these new parameters in this fantastic discussion - https://pganalyze.com/blog/5mins-postgres-17-configurable-slru-cache
These parameters allow DBAs to tune the memory allocated to these critical caches based on their specific workload, moving away from the one-size-fits-all limitation of previous versions.
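A configuration sketch (the values here are arbitrary illustrations, not recommendations; size them against your workload and available shared memory):

ALTER SYSTEM SET multixact_offset_buffers = 128;  -- default is 16 pages
ALTER SYSTEM SET multixact_member_buffers = 256;  -- default is 32 pages

-- Both parameters can only be set at server start, so a restart is required:
--   pg_ctl restart -D "$PGDATA"

SHOW multixact_offset_buffers;  -- verify after the restart

After raising the buffers, watch the blks_hit / blks_read ratio in pg_stat_slru to confirm that the cache misses actually go away.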
Conclusion
Switching locking strategies in PostgreSQL requires careful consideration not just of the direct blocking rules but also of the potential impact on underlying mechanisms. Moving from SELECT FOR NO KEY UPDATE to the more concurrency-friendly SELECT FOR KEY SHARE can be beneficial, but it increases the load on the MultiXact system. In versions before PostgreSQL 17, this could lead to performance bottlenecks manifesting as MultiXactOffsetSLRU wait events due to contention on small, fixed-size caches.
Thankfully, PostgreSQL 17 provides the tools needed to manage this directly with the multixact_offset_buffers and multixact_member_buffers GUCs. If you encounter these specific wait events after increasing shared lock usage, upgrading to PostgreSQL 17+ and tuning these parameters should be a key part of your resolution strategy. As always, monitor your system's wait events and performance metrics closely when making changes to locking or configuration.