> Developers should be mindful that Stable LM 3B is a base model. That means it needs to be adjusted for safe performance in specific applications, such as a chat interface. Depending on their use case, developers must evaluate and fine-tune the model before deployment.
I don't know how to fine-tune an LLM. Does anyone have good resources on how to do this?
That’s where things like PEFT/LoRA, GPTQ, and Accelerate come in: they make training faster and use less VRAM, so you can do it on a consumer GPU with 16-24 GB.
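To make that concrete, here's a minimal sketch of attaching LoRA adapters to a Hugging Face causal LM with PEFT. The model name, target module names, and hyperparameters are illustrative assumptions, not a recipe:

```python
# Minimal LoRA fine-tuning setup sketch (assumes recent transformers, peft,
# and accelerate installed). Names and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "stabilityai/stablelm-3b-4e1t"  # example model from this thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # halves memory vs. fp32
    device_map="auto",           # requires accelerate
)

# LoRA trains small adapter matrices instead of the full weights,
# which is what keeps VRAM usage in consumer-GPU range.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projection names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

From there you'd tokenize your dataset and hand the wrapped model to a normal `Trainer` loop; the point is just that only the adapter weights get gradients.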
Then for tips and tricks, either Reddit, as suggested by a sibling comment, or Discord communities. The Hugging Face, EleutherAI, and LAION Discord servers are all great and have super helpful, friendly, and knowledgeable people.
/r/LocalLLaMA/ is such an interesting mix of academia/researchers (it was called out in a recent paper, regarding context length, IIRC) and odd anarcho-futurist weirdos. And I kind of love it for that.
Ha, I've noticed Stable Diffusion is very similar: people understanding and thriving with new techniques and advanced workflows, mixed in with people who want to generate boobies in new ways.
I found they can be very unappreciative of the data sources used for training. Any mention that scraping art from sites like DA without the artists' consent is unethical is met with arguments about how it's exactly the same as a human looking at art and replicating the style.
They refuse to even acknowledge why those people might be annoyed that their work was used, without consent or compensation, to put them out of work.
For getting started, I'd recommend https://huggingface.co/autotrain - you should be able to find docs on the CLI, which lets you do your first training run with a single command (including automatically pulling the model and training dataset from Hugging Face). As long as you have an Nvidia GPU with enough VRAM, it should take only a few hours to tune a 3B model on a small dataset like Alpaca.
> You need to agree to share your contact information to access this model
> This repository is publicly accessible, but you have to accept the conditions to access its files and content.
> Login or Sign up to review the conditions and access this model content.
How does that work when you just call `tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")`?
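For what it's worth, gated repos work with that same call once you're authenticated. A sketch of the two common ways, assuming you've already accepted the conditions on the model page and created an access token in your Hugging Face settings:

```python
# Sketch: loading from a gated repo. Assumes you've accepted the conditions
# on the model page and have an access token (huggingface.co/settings/tokens).
from huggingface_hub import login
from transformers import AutoTokenizer

# Option 1: log in once; the token is cached and picked up automatically.
login()  # or run `huggingface-cli login` in a shell

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")

# Option 2: pass the token explicitly (older transformers used `use_auth_token`).
tokenizer = AutoTokenizer.from_pretrained(
    "stabilityai/stablelm-3b-4e1t",
    token="hf_...",  # placeholder, not a real token
)
```

Without a token (or before you've accepted the conditions), the download just fails with an authorization error.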
Depends on your definition of edge or smart device.
My phone can run a 7B parameter model at 12 tokens per second, which is probably faster than most humans are comfortable reading, and definitely faster than a virtual assistant would speak.
Out of curiosity, I tested a 3B parameter model, and it runs at about 21 tokens per second on my phone.
Generating text fast enough for a human to read is, imo, only the bare minimum. New classes of use cases would become possible if you could generate hundreds of tokens per second: for example, on-device classification of emails/texts, an LLM-powered recommendation system based on your local data (going beyond simply parsing dates out of text, for example), or a context-aware text/email auto-responder ("I'm sorry, I can't reply as I'm driving/in a meeting", "I'm not home next week, can you deliver to this address instead?"), etc.
Many of these use cases are possible today, either with specialized models or with old-school rule-based approaches. Being able to have an LLM apply soft judgment on a device that generates so much contextual information, completely locally and privately, is bound to make smartphones an entirely new kind of device.
Many of the use cases you’re describing can be done offline, such as when the phone is charging overnight, although not all of them. An email autoresponder could still work in real time at these token rates, and it would still be faster than most humans at responding to an email.
7 hours × 3,600 sec/hr × 21 tokens/sec ≈ 530,000 tokens per night on this hardware, assuming no thermal throttling. (I don't have data on what the sustained rate would be; throttling could happen.)
Agreed on overnight batch processing. My vision for it, though, is a sort of local service that provides "intelligence" on demand for other apps, which might request it concurrently; at that point, double-digit throughput might become limiting.
There are other reasons to want higher throughput. For retrieval or a chain-of-thought approach, you typically need to run several prompts per user prompt, which directly affects the user-perceived performance of your LLM-based solution.
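A rough back-of-the-envelope, reusing the on-phone throughput quoted above (the call count and tokens per call are assumptions for illustration only):

```python
# User-perceived latency when one user turn triggers several LLM calls
# (e.g. retrieval query rewrite + chain-of-thought + final answer).
tokens_per_second = 21        # the on-phone 3B figure quoted upthread
calls_per_user_prompt = 3     # assumption: rewrite + reasoning + answer
tokens_per_call = 150         # assumption

latency_s = calls_per_user_prompt * tokens_per_call / tokens_per_second
print(f"~{latency_s:.0f} s before the user sees a complete reply")  # ~21 s
```

At hundreds of tokens per second the same pipeline drops to a couple of seconds, which is the difference between "feature" and "unusable".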