Your work is an inspiration as always!! My n00b question is: what do you think i...

vikp · on July 23, 2023

I would use textsynth (https://bellard.org/ts_server/) or llama.cpp (https://github.com/ggerganov/llama.cpp) if you're running on CPU.

  - I wouldn't use anything higher than a 7B model if you want decent speed.
  - Quantize to 4-bit to save RAM and run inference faster.

Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.

pedrovhb · on July 23, 2023

I've been playing with running some models on the free tier Oracle VM machines with 24GB RAM and Ampere CPU and it works pretty well with llama.cpp. It's actually surprisingly quick; speed doesn't scale too well with the number of threads on CPU, so even the 4 ARM64 cores on that VM, with NEON, run at a similar speed to my 24-core Ryzen 3850X (maybe about half reading speed). It can easily handle Llama 2 13B, and if I recall correctly I did manage to run a 30B model in the past too. Speed for the smaller ones is ~half reading speed or so.

It's a shame the current Llama 2 jumps from 13B to 70B. In the past I tried running larger stuff by making a 32GB swap volume, but it's just impractically slow.

brucethemoose2 · on July 23, 2023

Prompt ingestion is too slow on the Oracle VMs.

Also its really tricky to even build llama.cpp with a BLAS library, to make prompt ingestion less slow. The Oracle Linux OpenBLAS build isnt detected ootb, and it doesn't perform well compared to x86 for some reason.

LLVM/GCC have some kind of issue identifying the Ampere ARM architecture (march=native doesn't really work), so maybe this could be improved with the right compiler flags?

pedrovhb · on July 24, 2023

Not sure if that's still the case. I remember having trouble building it a couple of months ago, had to tweak the Makefile because iirc it assumed ARM64 <=> Mac, but I recently re-cloned the repo and started from scratch and it was as simple as `make DLLAMA_BLAS=1`. I don't think I have any special setup other than having installed the apt openblas dev package.

brucethemoose2 · on July 24, 2023

IDK. A bunch of basic development packages like git were missing from my Ubuntu image when I tried last week, and I just gave up because it seemed like a big rabbit hole to go down.

I can see the ARM64 versions on the Ubuntu web package list, so... IDK what was going on?

On Oracle Linux, until I changed some env variables and lines in the makefile, the openblas build would "work," but it was actually silently failing and not using OpenBLAS.

jvickers · on July 23, 2023

Is it any easier when using Ubuntu on ARM Oracle servers?

brucethemoose2 · on July 24, 2023

Nah, I tried Ubuntu too.

The OpenBLAS package was missing on ARM, along with some other dependencies I needed for compilation.

At the end of the day, even with many tweaks and custom compilation flags, the instance was averaging below 1 token/sec as a Kobold Horde host, which is below the threshold to even be allowed as a llm host.

summarity · on July 24, 2023

If you're running on Ampere, using llama.cpp is probably not ideal. While it's optimized for ARM, Ampere has native acceleration for workloads like this: https://cloudmarketplace.oracle.com/marketplace/en_US/adf.ta...

Y_Y · on July 23, 2023

It might be more expensive to get a GPU instance but at a guess I'd say it's more cost-effective considering that the CPU computation will be less efficient and take much longer. I bet someone's done this out with real numbers, I just haven't seen it.

franga2000 · on July 23, 2023

This only matters if you're scaling to meet demand and demand is higher than your spare resources, which often isn't the case for hobby projects. The 10€/mo VPS I've had for over 6 years now still has a few cores and GBs or RAM spare, so running a small model on the CPU for a personal project that only me and a few friends occasionally use wouldn't cost me a cent more.

immibis · on July 24, 2023

FYI, the going rate for "smallest possible VPS" is now more like 3€/mo.

bg24 · on July 23, 2023

It depends on your use case, correct? If you do not have a heavy inferencing requirement, then CPU is good enough.