6 Jul 2024

Ollama is missing --rate-limits on downloads

I am just starting my AI journey, and trying to get Ollama to work on my linux box, was an interesting non-AI experience.

I noticed, that everytime I was trying out something new, my linux box got reliably stuck every single time I pulled a new model. htop helped point out, that each time I did a ollama pull or ollama run, it spun up a ton of threads.

Often things got so bad, that the system became quite unresponsive. Here, you can see "when" I triggered the pull:

Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=8ms TTL=64
Reply from 192.168.85.24: bytes=32 time=65ms TTL=64
Reply from 192.168.85.24: bytes=32 time=286ms TTL=64
Reply from 192.168.85.24: bytes=32 time=286ms TTL=64
Reply from 192.168.85.24: bytes=32 time=304ms TTL=64

A little searching, led me to this on-going Github thread where a feature like --rate-limit were requested for multiple reasons. Some people were unhappy with how a pull clogged their routers, some were unhappy with how it jammed all other downloads / browsing on the machine. I was troubled since my linux box (a not-so-recent but still 6.5k BogoMIPS 4vCPU i5) came to a crawl.

While the --rate-limit feature takes shape, here are two solutions that did work for me :

  1. As soon as I started the fetch (ollama run or ollama pull etc), I used iotop to change the ionice priority to idle. This made the issue go away completely (or at least made the system quite usable). However, it was still frustrating since (unlike top and htop) one had to type the PIDs... and as you may have guessed it already, Ollama creates quite a few when it does such the fetch.

Note that doing something like nice -n 19 did not help here. This was because the ollama processes weren't actually consuming (much) CPU for this task at all!

Then I tried to use ionice, which didn't work either! Note that since Ollama uses threads, the ionice tool didn't work for me. This was because ionice doesn't work with threads within a parent process. So this meant, something like the following did not work for me:

# These did not help!

robins@dell:~$ nice -n 19 ollama run mistral # Did not work!
robins@dell:~$ ionice -c3 ollama run mistral # Did not work either!!
  1. After some trial-and-error, a far simpler solution was to just run a series of commands immediately after triggered a new model fetch. Essentially, it got the parent PID, and then set ionice for each of the child processes for that parent:
pid=`ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}'`
echo $pid
sudo ionice -c3 -p `ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' '`

This worked something like this:

robins@dell:~$ pid=`ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}'` && [ ${#pid} -gt 1 ] && ( sudo ionice -c3 -p `ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' '` ; echo "done" ) || echo "skip"skip
robins@dell:~$ pid=`ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}'` && [ ${#pid} -gt 1 ] && ( sudo ionice -c3 -p `ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' '` ; echo "done" ) || echo "skip"done

After the above, iotop started showing idle in front of each of the ollama processes:

Total DISK READ:         0.00 B/s | Total DISK WRITE:         3.27 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      36.76 K/s
    TID  PRIO  USER     DISK READ DISK WRITE>    COMMAND                                                                                                                                                                                                                      2692712 idle ollama      0.00 B/s  867.62 K/s ollama serve
2705767 idle ollama      0.00 B/s  852.92 K/s ollama serve
2692707 idle ollama      0.00 B/s  849.24 K/s ollama serve
2693740 idle ollama      0.00 B/s  783.07 K/s ollama serve
      1 be/4 root        0.00 B/s    0.00 B/s init splash
      2 be/4 root        0.00 B/s    0.00 B/s [kthreadd]
      3 be/4 root        0.00 B/s    0.00 B/s [pool_workqueue_release]
      4 be/0 root        0.00 B/s    0.00 B/s [kworker/R-rcu_g]
      5 be/0 root        0.00 B/s    0.00 B/s [kworker/R-rcu_p]
      6 be/0 root        0.00 B/s    0.00 B/s [kworker/R-slub_]

While at it, it was funny to note that the fastest way to see whether the unresponsive system is "going to" recover (because of what I just tried) was by keeping a separate ping session to the linux box. On my local network, I knew the system is going to come back to life in the next few seconds, when I noticed that the pings begin ack'ing in 5-8ms instead of ~100+ ms during the logjam.

So yeah, +10 on the --rate-limit or something similar!

Reference:

  1. https://github.com/ollama/ollama/issues/2006

No comments:

On-Prem AI chatbot - Hello World!

In continuation of the recent posts... Finally got a on-premise chat-bot running! Once downloaded, the linux box is able to spin up / down t...