
18 May 2025

Taming ReorderBufferWrite - Boost Logical Decoding in Postgres



Performance bottlenecks in Postgres logical replication or Change Data Capture (CDC) streams can be subtle, but one specific wait event, ReorderBufferWrite, often points directly at how your application interacts with the database during these processes. Let's unpack this wait event and see how your application's workload patterns can trigger it.


Core Concepts: Logical Decoding and WAL

To understand the problem, we first need some context:

  • Write-Ahead Log (WAL) is PostgreSQL's transaction log; every change is written here first, to ensure durability and to enable recovery and replication.

  • Logical Decoding is a powerful feature that reads this WAL. Instead of replaying physical changes (as physical replication does), it translates WAL records into a logical, easy-to-understand stream of row changes grouped by transaction, which is then consumed by logical replication subscribers or CDC systems.
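For a concrete feel, here is a minimal sketch of inspecting a logical change stream by hand using the contrib test_decoding output plugin; the slot name is purely illustrative, and a real CDC setup would use a proper output plugin such as pgoutput:

-- Create a logical replication slot using the built-in test_decoding plugin
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Peek at the decoded, transaction-ordered change stream without consuming it
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

-- Drop the slot when done, so it doesn't hold back WAL removal
SELECT pg_drop_replication_slot('demo_slot');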


The Reorder Buffer: Ensuring Transactional Order

Logical decoding needs to present changes in the exact order transactions were committed. However, changes from different concurrent transactions are interleaved within the WAL. To solve this, PostgreSQL uses an in-memory area called the Reorder Buffer. Its job is to collect decoded changes belonging to transactions that are still in progress. Only when a transaction commits are its changes released from the buffer in the correct sequence.

Analogy: The Assembly Line QC Station

For an analogy, imagine an assembly line where components (WAL entries) arrive continuously and workers assemble different products (transactions). Some products are quick builds, others complex multi-step assemblies. Completed products move to a Quality Control (QC) holding station (Reorder Buffer). The QC inspector (logical decoding process) ships out batches of products, but only when all products belonging to a specific batch (committed transaction) have passed inspection. The batches must be shipped in the strict order they were completed.


The Bottleneck: ReorderBufferWrite

The Reorder Buffer uses memory capped by the logical_decoding_work_mem setting (64MB by default). What happens if this QC holding station (Reorder Buffer) gets completely filled up with products waiting for their batch-mates (changes from active transactions), especially for those complex, slow builds (long-running transactions)?

The system can't simply discard these waiting items; the best it can do is move them temporarily to an overflow warehouse (spill them to disk). This action, writing the buffer's contents to slower disk storage, is precisely what the ReorderBufferWrite wait event records.

Waiting for disk I/O is significantly slower than operating in memory. Frequent ReorderBufferWrite events directly translate to increased replication lag and reduced throughput for the logical change stream. Retrieving items from the overflow warehouse (disk) drastically slows down the QC inspector's (logical decoding) shipping rate.
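Both the wait event and the spill activity are visible from SQL. A small monitoring sketch (the per-slot spill counters need PostgreSQL 14 or newer):

-- Per-slot spill statistics (PostgreSQL 14+)
SELECT slot_name, spill_txns, spill_count,
       pg_size_pretty(spill_bytes) AS spilled_to_disk
FROM pg_stat_replication_slots;

-- Sessions currently waiting on the spill-to-disk wait event
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event = 'ReorderBufferWrite';

-- The reorder buffer's in-memory budget
SHOW logical_decoding_work_mem;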


Workloads That Cause Spills to Disk

ReorderBufferWrite waits are primarily triggered by workload patterns that overwhelm the in-memory Reorder Buffer. Key culprits include:

  1. Very Large Transactions: A single transaction modifying hundreds of thousands or millions of rows generates a massive volume of changes that must sit in the Reorder Buffer until the final commit. This can rapidly exhaust the available memory. Think of assembling a huge, complex product requiring many components – it occupies a lot of space at the QC station while being built.

  2. Long-Running Transactions: A transaction might not change much data, but if it stays open for a long time (perhaps due to complex calculations, waiting for external input, or slow queries within it), its changes linger in the Reorder Buffer. Meanwhile, other transactions complete and add their changes. The buffer fills up with changes from many transactions, bottlenecked by the one(s) taking a long time to commit (the sketch after this list shows how to spot them). This is like a single idle product assembly holding up the shipment of multiple completed batches at the QC station.

  3. High Volume of Concurrent Changes: Even with reasonably sized and timed transactions, a very high rate of change across many concurrent sessions can collectively generate data faster than the logical decoding process can handle within the allocated Reorder Buffer memory, leading to spills. Imagine hundreds of small, quick products arriving at the QC station so fast it still gets overwhelmed.

    • Sudden workload spike: A related cause is a sudden surge of application writes that produces changes faster than the logical decoding process can normally keep up with. To an experienced DBA, the workload pattern often looks like: a connection spike => replica lag => a slow recovery of that lag.

  4. Insufficient logical_decoding_work_mem: If this setting is too low for your workload's typical peak demands, spills will occur more readily, even for moderately busy workloads.
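As mentioned in point 2, long-running transactions are easy to spot from pg_stat_activity; the five-minute threshold below is just an illustrative starting point:

-- Transactions that have been open longer than five minutes (threshold is illustrative)
SELECT pid, usename, state,
       now() - xact_start AS xact_age,
       left(query, 60)    AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_age DESC;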


What can be done?

As an application developer, you have significant control over the transaction patterns if ReorderBufferWrite waits are causing issues:

  1. Audit & Split Large Transactions: Review bulk DML operations performed within a single transaction and refactor them into smaller batches. For example, instead of one transaction touching 100k rows, use 10 transactions of 10k rows each (see the sketch after this list). This drastically reduces peak Reorder Buffer usage.

  2. Minimize Transaction Duration: Identify code paths where transactions are held open unnecessarily long. This could be end-user applications waiting for user input, doing non-database work, or calling slow external services inside a database transaction that modifies replicated tables. Keep database transactions short and focused: perform reads, calculations, or external calls before starting the write transaction, or after it commits.

  3. Optimize Queries Within Transactions: Slow SELECT, INSERT, UPDATE, or DELETE queries will naturally extend transaction duration. Use query analysis tools (EXPLAIN ANALYZE) to find and optimize slow queries involved in your transactions.

  4. Isolate Read-Only Work: If long-running processes only read data, ensure they aren't unnecessarily starting read-write transactions on the primary where logical decoding is happening. Similarly use read-only transactions or more permissive isolation levels where possible.

  5. Collaborate on Configuration: If application-side optimizations aren't sufficient, discuss the observed workload with your DBA or platform team. It would help to provide them with information about your transaction patterns. This could help them review whether an increase in logical_decoding_work_mem is warranted and safe within the server's overall resource constraints. Tuning memory should usually follow workload optimization.
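For point 1, here is one hedged sketch of batching a bulk update. The orders table, its columns, and the 10k batch size are all hypothetical, and transaction control inside a DO block requires PostgreSQL 11+ and must run outside an explicit transaction:

-- Hypothetical example: archive old rows in 10k-row transactions instead of one huge one
DO $$
DECLARE
    batch_rows integer;
BEGIN
    LOOP
        UPDATE orders
           SET status = 'archived'
         WHERE id IN (SELECT id
                        FROM orders
                       WHERE status <> 'archived'
                         AND created_at < now() - interval '1 year'
                       LIMIT 10000);
        GET DIAGNOSTICS batch_rows = ROW_COUNT;
        EXIT WHEN batch_rows = 0;
        COMMIT;  -- each commit releases this batch's changes from the Reorder Buffer
    END LOOP;
END $$;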


Conclusion

ReorderBufferWrite isn't just an obscure PostgreSQL internal wait event. It's a performance indicator directly linked to application workload patterns during logical decoding. By designing your application to use shorter, smaller, and more efficient transactions, especially on tables involved in replication, you can minimize these disk spill events and keep your logical change stream performant.


11 May 2025

FOR KEY SHARE optimization and the SLRU Trap

 



Optimizing database concurrency is a constant balancing act. We often tweak locking strategies in PostgreSQL, aiming to allow more simultaneous operations without compromising data integrity. A common scenario involves shifting from stricter row-level locks to more lenient ones. But sometimes, what seems like a straightforward optimization can lead to unexpected performance bottlenecks in less obvious parts of the system.

This post explores one such scenario: moving from SELECT FOR NO KEY UPDATE to SELECT FOR KEY SHARE, the potential for subsequent MultiXactOffsetSLRU wait events, and how PostgreSQL 17 offers a direct solution.


The Locking Strategy Shift: Aiming for Higher Concurrency

Let's start with the locks themselves. Within a transaction, you might initially use:

  • SELECT FOR NO KEY UPDATE: This acquires a moderately strong row lock. It prevents others from deleting the row, updating key columns, or acquiring FOR UPDATE / FOR NO KEY UPDATE locks on it. However, it does allow concurrent non-key updates and weaker FOR SHARE / FOR KEY SHARE locks. Importantly (and we’ll see why later), only one transaction can hold this lock (or FOR UPDATE) on a given row at a time.

To potentially increase concurrency, especially if you only need to prevent key changes or deletions (like ensuring a foreign key reference remains valid), you might switch to:

  • SELECT FOR KEY SHARE: This is a weaker, shared lock. It blocks deletions and key updates but allows concurrent non-key updates and even other concurrent SELECT FOR KEY SHARE (or FOR SHARE) locks on the exact same row.

The intended outcome of switching to FOR KEY SHARE is often to reduce blocking and allow more transactions to proceed in parallel, particularly if the main concern is referential integrity rather than preventing all concurrent modifications.
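To illustrate the difference, here is a small sketch using a hypothetical accounts table; the behaviour follows PostgreSQL's documented row-level lock conflict rules:

-- Session A: shared lock that only protects the row's key (e.g. an FK reference)
BEGIN;
SELECT * FROM accounts WHERE id = 42 FOR KEY SHARE;

-- Session B, concurrently: another FOR KEY SHARE on the same row succeeds
BEGIN;
SELECT * FROM accounts WHERE id = 42 FOR KEY SHARE;        -- does not block

-- Session C, concurrently: a non-key update also proceeds, but a DELETE
-- (or an update of a key column) would block until A and B finish
UPDATE accounts SET balance = balance + 10 WHERE id = 42;  -- proceeds
DELETE FROM accounts WHERE id = 42;                        -- blocks

-- Note: as soon as more than one transaction locks the same row,
-- PostgreSQL records the lockers in a MultiXact (see the next section).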


The Unforeseen Bottleneck: Enter MultiXacts and SLRU Caches

While the switch does allow higher concurrency at the row-lock level, it can create pressure elsewhere. Here’s the chain reaction:

  1. Increased Shared Lock Concurrency: Your application now has more situations where multiple transactions hold a shared lock (FOR KEY SHARE) on the same row simultaneously.

  2. The MultiXact System: How does PostgreSQL track that multiple transactions (potentially dozens or hundreds) have a shared interest in a single row? It uses a mechanism called MultiXact IDs (Multi-Transaction IDs). Instead of just one transaction ID locking the row, PostgreSQL assigns a special MultiXact ID that represents the group of transactions currently sharing a lock on it.

  3. SLRU Caches: Managing this MultiXact metadata efficiently requires quick access. PostgreSQL uses specialized SLRU (Simple Least Recently Used) caches in shared memory for this. These caches store the mappings (offsets) from rows to their MultiXact member lists (MultiXactOffsetSLRU) and the member lists themselves (MultiXactMemberSLRU).

  4. The Bottleneck (PG 16 and older): Before PostgreSQL 17, these SLRU caches had relatively small, fixed sizes determined at compile time. When the workload switch dramatically increased the demand for MultiXact tracking (due to more concurrent shared locks), these small caches could easily become overwhelmed.

  5. The Symptom (MultiXactOffsetSLRU Waits): An overloaded SLRU cache leads to performance degradation manifesting as specific Wait Events. You might see high MultiXactOffsetSLRU waits, indicating processes are frequently:

  • Waiting for disk I/O because the required MultiXact offset data wasn't found in the small cache (cache miss).

  • Waiting to acquire low-level locks needed to access or update the cache's shared memory buffers, because many processes are trying to use the limited cache concurrently (lock contention).

  • Many backends can appear hung, as reported in a recent community thread.
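If you suspect this, both the SLRU cache counters and the wait events are observable. A minimal sketch (pg_stat_slru exists since PostgreSQL 13; the SLRU names differ slightly across major versions, hence the loose match):

-- Hit/miss counters for the MultiXact SLRU caches (PostgreSQL 13+)
SELECT name, blks_hit, blks_read, blks_written
FROM pg_stat_slru
WHERE name ILIKE 'multixact%';

-- Sessions currently stuck on MultiXact-related wait events
SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event LIKE 'MultiXact%';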




So trying to increase concurrency at the row level created a bottleneck in the very mechanism that manages that concurrency!



PostgreSQL 17 to the Rescue: Configurable SLRU Buffers

Recognizing this potential bottleneck, the PostgreSQL developers introduced direct solutions in PostgreSQL 17 (released September 2024). These come in the form of new configurable parameters:

  • multixact_offset_buffers: controls the size (in buffer pages) of the MultiXactOffset SLRU cache in shared memory. The default is very small (16), and raising it lets administrators dedicate more RAM to caching the crucial row-to-member-list mappings. A higher cache hit rate reduces disk I/O for MultiXact offsets and directly relieves the pressure behind MultiXactOffsetSLRU waits.

  • multixact_member_buffers: controls the size of the MultiXactMember SLRU cache, which stores the actual lists of member transaction IDs. It is less directly tied to the Offset wait event, but keeping member lists in cache improves the throughput of the whole MultiXact lookup path, which matters when handling high shared-lock concurrency.

You can learn more about these new parameters in this fantastic discussion - https://pganalyze.com/blog/5mins-postgres-17-configurable-slru-cache

These parameters allow DBAs to tune the memory allocated to these critical caches based on their specific workload, moving away from the one-size-fits-all limitation of previous versions.
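A hedged sketch of what that tuning might look like; the values are purely illustrative, not recommendations, and since these settings size shared-memory caches, a server restart may be needed for them to take effect:

-- PostgreSQL 17+, illustrative values only
ALTER SYSTEM SET multixact_offset_buffers = 64;
ALTER SYSTEM SET multixact_member_buffers = 128;

-- After applying the change (restart if required), verify:
SHOW multixact_offset_buffers;
SHOW multixact_member_buffers;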

Conclusion

Switching locking strategies in PostgreSQL requires careful consideration not just of the direct blocking rules but also of the potential impact on underlying mechanisms. Moving from SELECT FOR NO KEY UPDATE to the more concurrency-friendly SELECT FOR KEY SHARE can be beneficial, but it increases the load on the MultiXact system. In versions before PostgreSQL 17, this could lead to performance bottlenecks manifesting as MultiXactOffsetSLRU wait events due to contention on small, fixed-size caches.

Thankfully, PostgreSQL 17 provides the tools needed to manage this directly with the multixact_offset_buffers and multixact_member_buffers GUCs. If you encounter these specific wait events after increasing shared lock usage, upgrading to PostgreSQL 17+ and tuning these parameters should be a key part of your resolution strategy. As always, monitor your system's wait events and performance metrics closely when making changes to locking or configuration.


8 Jul 2024

On-Prem AI chatbot - Hello World!

In continuation of the recent posts...


Finally got an on-premise chatbot running! Once downloaded, the Linux box is able to spin the interface up or down in a second.

(myvenv) ai@dell:~/proj/ollama$ time ollama run mistral
>>> /bye

real    0m1.019s
user    0m0.017s
sys     0m0.009s

That, on a measly ~$70 Marketplace i5/8GB machine, is appreciable (given what I had read about the NVidia RTX 4090s etc.). Now obviously this doesn't do anything close to 70 tokens per second, but am okay with that.

(myvenv) ai@dell:~/proj/ollama$ sudo dmesg | grep -i bogo
[sudo] password for ai:
[    0.078220] Calibrating delay loop (skipped), value calculated using timer frequency.. 6585.24 BogoMIPS (lpj=3292624)
[    0.102271] smpboot: Total of 4 processors activated (26340.99 BogoMIPS)

Next, I wrote a small hello-world script to test the bot. Now where's the fun if it just printed static text?!

(myvenv) ai@dell:~/t$ cat a.py
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
result=llm.invoke("Why is 42 the answer to everything? Keep it very brief.")
print (result)

And here's the output, in just ......... 33 seconds :)

(myvenv) ai@dell:~/t$ time python a.py
A popular question! The joke about 42 being the answer to everything originated from Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy." In the book, a supercomputer named Deep Thought takes 7.5 million years to calculate the "Answer to the Ultimate Question of Life, the Universe, and Everything," which is... 42!

real    0m33.299s
user    0m0.568s
sys     0m0.104s
(myvenv) ai@dell:~/t$

And, just for kicks, works across languages / scripts too. Nice!

(myvenv) ai@dell:~/t$ ollama run mistral
>>> भारत की सबसे लंबी नदी कौन सी है?
 भारत की सबसे लंबी नदी गंगा है, जिसका पूरण 3670 किमी होता है। यह एक विश्वमित्र नदी है और बहुप्रकार से कई प्रदेशों के झिल्ले-ढाल में विचलित है।

>>>

Again, am pretty okay with this for now (for the record, the model answers, in Hindi, that the Ganga is India's longest river). I'll worry about speed tomorrow, when I have a script that's able to test the limits, and that's not today.

Hello World!

7 Jul 2024

Installing Ollama on an old linux box

Trying out Ollama - Your 10 year old box would do too.

TLDR

  • Yes, you CAN install an AI engine locally
  • No, you DON'T need to spend thousands of dollars to get started!
  • Agreed, your AI engine wouldn't be snappy, but it's still great to get started.

Server

You'd realise that any machine should get you going.

  • I had recently bought a second-hand desktop box (Dell OptiPlex 3020) from FB Marketplace and repurposed it here.
  • For specs, it was an Intel i5-4590 CPU @ 3.30GHz with 8GB of RAM and 250 GB of disk, nothing fancy.
  • It came with an AMD Radeon 8570 (2GB RAM) [4], and the Ollama install process recognized and optimized for the decade old GPU. Super-Nice!
  • For completeness, the box cost me $70 AUD (~50 USD) in May 2024. In other words, even for a cash-strapped avid learner, there's a very low barrier to entry here.

Install

The install steps were pretty simple [1] but as you may know, the models themselves are huge.

For example, look at these [3]:

  • mistral-7B - 4.1 GB
  • gemma2-27B - 16 GB
  • Code Llama - 4.8 GB

Given that, I'd recommend switching to a decent internet connection. If work allows, this may be a good time to go to the office instead of WFH for this one. (Since I didn't have that luxury, my trusty but slow 60Mbps ADSL+ meant I really had to work on my patience this weekend.)

The thing that actually tripped me up was that Ollama's threaded downloads are seriously fast, and they ended up clogging my test server (see my earlier blog post that goes into some detail [2]).

Run with Nice

With system resources in short supply, it made good sense to ensure that once Ollama is installed, it is spun up with the lowest priority.

On an Ubuntu server, I did this by modifying the ExecStart config for Ollama's systemd script.

ai@dell:~$ sudo service ollama status | grep etc
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)

ai@dell:~$ cat /etc/systemd/system/ollama.service | grep ExecStart
ExecStart=nice -n 19 /usr/local/bin/ollama serve

So when I do end up asking some fun questions, ollama is always playing "nice" :D




Enjoy ...

Reference:

  1. Install + Quick Start: https://github.com/ollama/ollama/blob/main/README.md#quickstart

  2. Model downloads made my server unresponsive: https://www.thatguyfromdelhi.com/2024/07/ollama-is-missing-rate-limits-on.html

  3. Model sizes are in GBs: https://github.com/ollama/ollama/blob/main/README.md#model-library

  4. Radeon 8570: https://www.techpowerup.com/gpu-specs/amd-radeon-hd-8570.b1325
