8 Jul 2024

On-Prem AI chatbot - Hello World!

In continuation of the recent posts...


Finally got an on-premise chatbot running! Once the model is downloaded, the Linux box can spin the interface up and down in about a second.

(myvenv) ai@dell:~/proj/ollama$ time ollama run mistral
>>> /bye

real    0m1.019s
user    0m0.017s
sys     0m0.009s

That, on a measly ~$70 Marketplace i5/8GB machine, is appreciable (given everything I had read about needing NVIDIA RTX 4090s and the like). Obviously this doesn't do anything close to 70 tokens per second, but I'm okay with that.

(myvenv) ai@dell:~/proj/ollama$ sudo dmesg | grep -i bogo
[sudo] password for ai:
[    0.078220] Calibrating delay loop (skipped), value calculated using timer frequency.. 6585.24 BogoMIPS (lpj=3292624)
[    0.102271] smpboot: Total of 4 processors activated (26340.99 BogoMIPS)
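
If you want an actual tokens-per-second figure rather than a gut feel, the CLI can report one itself; if I recall correctly, passing --verbose to ollama run prints token counts and eval rates after each response:

# Print timing stats (prompt/eval token counts and rates) after the response
ollama run mistral --verbose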

Next, I wrote a small hello-world script to test the bot. Where's the fun if it just printed static text?!

(myvenv) ai@dell:~/t$ cat a.py
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
result=llm.invoke("Why is 42 the answer to everything? Keep it very brief.")
print (result)

And here's the output, in just ......... 33 seconds :)

(myvenv) ai@dell:~/t$ time python a.py
A popular question! The joke about 42 being the answer to everything originated from Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy." In the book, a supercomputer named Deep Thought takes 7.5 million years to calculate the "Answer to the Ultimate Question of Life, the Universe, and Everything," which is... 42!

real    0m33.299s
user    0m0.568s
sys     0m0.104s
(myvenv) ai@dell:~/t$
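
For what it's worth, you don't strictly need Python (or langchain) for this. Ollama exposes a local REST API (port 11434 by default) that streams the response as it is generated, so you see tokens trickle in instead of staring at a blank prompt for 33 seconds. A minimal sketch, assuming the default install; as far as I understand, the langchain wrapper above talks to this same endpoint under the hood:

# Each line of output is a small JSON object carrying the next chunk of text
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is 42 the answer to everything? Keep it very brief."
}'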

And, just for kicks, it works across languages and scripts too (here, asking in Hindi which is India's longest river). Nice!

(myvenv) ai@dell:~/t$ ollama run mistral
>>> भारत की सबसे लंबी नदी कौन सी है?
 भारत की सबसे लंबी नदी गंगा है, जिसका पूरण 3670 किमी होता है। यह एक विश्वमित्र नदी है और बहुप्रकार से कई प्रदेशों के झिल्ले-ढाल में विचलित है।

>>>

Again, I'm pretty okay with this for now. I'll worry about speed tomorrow, once I have a script that can properly test the limits; that's not today's problem.

Hello World!

7 Jul 2024

Installing Ollama on an old linux box

Trying out Ollama - your 10-year-old box would do too.

TLDR

  • Yes, you CAN install an AI engine locally
  • No, you DON'T need to spend thousands of dollars to get started!
  • Agreed, your AI engine won't be snappy, but it's still a great way to get started.

Server

You'll realise that pretty much any machine will get you going.

  • I had recently bought a second-hand desktop box (Dell OptiPlex 3020) from FB Marketplace and repurposed it here.
  • For specs, it was an Intel i5-4590 CPU @ 3.30GHz with 8GB of RAM and 250 GB of disk, nothing fancy.
  • It came with an AMD Radeon HD 8570 (2GB RAM) [4], and the Ollama install process recognized and optimized for the decade-old GPU. Super nice!
  • For completeness, the box cost me $70 AUD (~50 USD) in May 2024. In other words, even for a cash-strapped avid learner, there's a very low barrier to entry here.

Install

The install steps were pretty simple [1], but as you may know, the models themselves are huge.

For example, look at these sizes [3]:

  • mistral-7B - 4.1 GB
  • gemma2-27B - 16 GB
  • Code Llama - 4.8 GB

Given that, I'd recommend switching to a decent internet connection. If work allows, this may be a good time to go in to the office instead of WFH for this one. (Since I didn't have that luxury, my trusty but slow 60Mbps ADSL+ connection meant that I really worked on my patience this weekend.)
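
Once the download does finish, the CLI makes it easy to see what landed on disk; a quick sketch (the model name is just an example):

# Pull a model (this is the multi-GB step), then list what's installed and how big it is
ollama pull mistral
ollama list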

The thing that actually tripped me up was that Ollama's threaded downloads really scream, and they ended up clogging my test server (see my earlier blog post that goes into the details [2]).

Run with Nice

With system resources in short supply, it made good sense to ensure that once Ollama is installed, it is spun up with the lowest priority.

On an Ubuntu server, I did this by modifying the ExecStart line in Ollama's systemd service unit.

ai@dell:~$ sudo service ollama status | grep etc
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)

ai@dell:~$ cat /etc/systemd/system/ollama.service | grep ExecStart
ExecStart=nice -n 19 /usr/local/bin/ollama serve

So when I do end up asking some fun questions, ollama is always playing "nice" :D
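
If you'd rather not edit the unit file that the installer wrote, a systemd drop-in override gets you the same result; a sketch (systemd also has a built-in Nice= directive under [Service], which avoids wrapping the binary altogether):

# Create an override instead of editing /etc/systemd/system/ollama.service in place
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   ExecStart=
#   ExecStart=nice -n 19 /usr/local/bin/ollama serve
# Reload and restart so the change takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama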




Enjoy ...

Reference:

  1. Install + Quick Start: https://github.com/ollama/ollama/blob/main/README.md#quickstart

  2. Model downloads made my server unresponsive: https://www.thatguyfromdelhi.com/2024/07/ollama-is-missing-rate-limits-on.html

  3. Model sizes are in GBs: https://github.com/ollama/ollama/blob/main/README.md#model-library

  4. Radeon 8570: https://www.techpowerup.com/gpu-specs/amd-radeon-hd-8570.b1325

6 Jul 2024

Ollama is missing --rate-limits on downloads

I am just starting my AI journey, and trying to get Ollama to work on my Linux box was an interesting non-AI experience.

I noticed that every time I tried out something new, my Linux box reliably got stuck whenever I pulled a new model. htop helped point out that each time I did an ollama pull or ollama run, it spun up a ton of threads.

Often things got so bad that the system became quite unresponsive. Here, you can see "when" I triggered the pull:

Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=7ms TTL=64
Reply from 192.168.85.24: bytes=32 time=8ms TTL=64
Reply from 192.168.85.24: bytes=32 time=65ms TTL=64
Reply from 192.168.85.24: bytes=32 time=286ms TTL=64
Reply from 192.168.85.24: bytes=32 time=286ms TTL=64
Reply from 192.168.85.24: bytes=32 time=304ms TTL=64

A little searching led me to an ongoing GitHub thread [1] where a feature like --rate-limit has been requested for multiple reasons. Some people were unhappy with how a pull clogged their routers, others with how it jammed all other downloads and browsing on the machine. I was troubled because my Linux box (a not-so-recent but still 6.5k BogoMIPS 4vCPU i5) came to a crawl.

While the --rate-limit feature takes shape, here are two workarounds that did work for me:

  1. As soon as I started the fetch (ollama run or ollama pull etc.), I used iotop to change the ionice priority to idle. This made the issue go away completely (or at least made the system quite usable). However, it was still frustrating since (unlike top and htop) one had to type in the PIDs... and as you may have guessed, Ollama creates quite a few of them when it does such a fetch.

Note that doing something like nice -n 19 did not help here. This was because the ollama processes weren't actually consuming (much) CPU for this task at all!

Then I tried ionice, which didn't work either! Since Ollama does its work in threads, and ionice (pointed at the parent) doesn't reach the threads inside it, something like the following did not help:

# These did not help!

robins@dell:~$ nice -n 19 ollama run mistral # Did not work!
robins@dell:~$ ionice -c3 ollama run mistral # Did not work either!!
  2. After some trial and error, a far simpler solution was to run a short series of commands immediately after triggering a new model fetch. Essentially, it grabs the parent PID and then sets ionice for each of the threads under that parent:
# Find the PID of the "ollama run" client
pid=`ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}'`
echo $pid
# Set the idle I/O class on every thread (SPID) under that PID
sudo ionice -c3 -p `ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' '`

In practice, it looked something like this:

robins@dell:~$ pid=`ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}'` && [ ${#pid} -gt 1 ] && ( sudo ionice -c3 -p `ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' '` ; echo "done" ) || echo "skip"
skip
robins@dell:~$ pid=`ps -ef | grep "ollama run" | grep -v grep | awk '{print $2}'` && [ ${#pid} -gt 1 ] && ( sudo ionice -c3 -p `ps -T -p $pid | awk '{print $2}' | grep -v SPID | tr '\r\n' ' '` ; echo "done" ) || echo "skip"
done

After the above, iotop started showing idle in front of each of the ollama processes:

Total DISK READ:         0.00 B/s | Total DISK WRITE:         3.27 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      36.76 K/s
    TID  PRIO  USER     DISK READ DISK WRITE>    COMMAND
2692712 idle ollama      0.00 B/s  867.62 K/s ollama serve
2705767 idle ollama      0.00 B/s  852.92 K/s ollama serve
2692707 idle ollama      0.00 B/s  849.24 K/s ollama serve
2693740 idle ollama      0.00 B/s  783.07 K/s ollama serve
      1 be/4 root        0.00 B/s    0.00 B/s init splash
      2 be/4 root        0.00 B/s    0.00 B/s [kthreadd]
      3 be/4 root        0.00 B/s    0.00 B/s [pool_workqueue_release]
      4 be/0 root        0.00 B/s    0.00 B/s [kworker/R-rcu_g]
      5 be/0 root        0.00 B/s    0.00 B/s [kworker/R-rcu_p]
      6 be/0 root        0.00 B/s    0.00 B/s [kworker/R-slub_]

While at it, it was funny to note that the quickest way to see whether the unresponsive system was "going to" recover (because of whatever I had just tried) was to keep a separate ping session open to the Linux box. On my local network, I knew the system would come back to life within the next few seconds once the pings started getting acknowledged in 5-8 ms instead of the ~100+ ms during the logjam.

So yeah, +10 on the --rate-limit or something similar!

Reference:

  1. https://github.com/ollama/ollama/issues/2006
