> Developers should be mindful that Stable LM 3B is a base model. That means it needs to be adjusted for safe performance in specific applications, such as a chat interface. Depending on their use case, developers must evaluate and fine-tune the model before deployment.
I don't know how to fine-tune an LLM. Does anyone have good resources on how to do this?
That’s where things like PEFT/LoRA, GPTQ, and Accelerate come in: they make training faster and use less VRAM, so you can do it on a consumer GPU with 16-24 GB.
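To make that concrete, here's a minimal sketch of attaching LoRA adapters to a Hugging Face causal LM with PEFT. The model name, target module names, and hyperparameters are illustrative assumptions, not a recipe:

```python
# Minimal LoRA fine-tuning setup sketch (assumes recent transformers, peft,
# and accelerate installed). Names and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "stabilityai/stablelm-3b-4e1t"  # example model from this thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # halves memory vs. fp32
    device_map="auto",           # requires accelerate
)

# LoRA trains small adapter matrices instead of the full weights,
# which is what keeps VRAM usage in consumer-GPU range.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projection names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

From there you'd tokenize your dataset and hand the wrapped model to a normal `Trainer` loop; the point is just that only the adapter weights get gradients.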
Then for tips and tricks, either Reddit, as suggested by a sibling comment, or Discord communities. The Hugging Face, EleutherAI, and LAION Discord servers are all great and have super helpful, friendly, and knowledgeable people.
/r/LocalLLaMA/ is such an interesting mix of academia/researchers (it was called out in a recent paper, regarding context length, IIRC) and odd anarcho-futurist weirdos. And I kind of love it for that.
Ha, I've noticed Stable Diffusion is very similar: people understanding and thriving with new techniques and advanced workflows, mixed in with people who want to generate boobies in new ways.
I found they can be very unappreciative of the data sources used for training. Any mention that scraping art from sites like DA without the artists' consent is unethical is met with arguments about how it's exactly the same as a human looking at art and replicating the style.
They refuse to even acknowledge why those people might be annoyed that their work was used, without consent or compensation, to put them out of work.
For getting started, I'd recommend https://huggingface.co/autotrain - you should be able to find docs on the CLI, which lets you do your first training run with a single command (including automatically pulling the model and training dataset from Hugging Face). As long as you have an Nvidia GPU with enough VRAM, it should take only a few hours to tune a 3B model on a small dataset like Alpaca.
> You need to agree to share your contact information to access this model
> This repository is publicly accessible, but you have to accept the conditions to access its files and content.
> Login or Sign up to review the conditions and access this model content.
How does that work when you just call `tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")`?
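For what it's worth, gated repos work with that same call once you're authenticated. A sketch of the two common ways, assuming you've already accepted the conditions on the model page and created an access token in your Hugging Face settings:

```python
# Sketch: loading from a gated repo. Assumes you've accepted the conditions
# on the model page and have an access token (huggingface.co/settings/tokens).
from huggingface_hub import login
from transformers import AutoTokenizer

# Option 1: log in once; the token is cached and picked up automatically.
login()  # or run `huggingface-cli login` in a shell

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")

# Option 2: pass the token explicitly (older transformers used `use_auth_token`).
tokenizer = AutoTokenizer.from_pretrained(
    "stabilityai/stablelm-3b-4e1t",
    token="hf_...",  # placeholder, not a real token
)
```

Without a token (or before you've accepted the conditions), the download just fails with an authorization error.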
Depends on your definition of edge or smart device.
My phone can run a 7B parameter model at 12 tokens per second, which is probably faster than most humans are comfortable reading, and definitely faster than a virtual assistant would speak.
Out of curiosity, I tested a 3B parameter model, and it runs at about 21 tokens per second on my phone.
Generating text fast enough for a human to read is, imo, only the bare minimum. New classes of use cases would become possible if you could generate hundreds of tokens per second: for example, on-device classification of emails/texts, an LLM-powered recommendation system based on your local data (going beyond simply parsing dates out of text, for example), or a context-aware text/email auto-responder ("I'm sorry, I can't reply as I'm driving/in a meeting", "I'm not home next week, can you deliver to this address instead?"), etc.
Many of these use cases are possible today, either with specialized models or with old-school rule-based approaches. Being able to have an LLM apply soft judgment on a device that generates so much contextual information, completely locally and privately, is bound to make smartphones an entirely new kind of device.
Many of the use cases you’re describing can be done offline, such as when the phone is charging overnight, although not all of them. An email autoresponder could still work in real time at these token rates, and it would still be faster than most humans at responding to an email.
7 hours × 3,600 sec/hr × 21 tokens/sec ≈ 530,000 tokens per night on this hardware, assuming no thermal throttling. (I don't have data on what the sustained rate would be; throttling could happen.)
Agreed on overnight batch processing. My vision for it, though, is a sort of local service that provides "intelligence" on demand for other apps, which might request it concurrently; at that point, double-digit throughput might become limiting.
There are other reasons to want higher throughput. For retrieval or a chain-of-thought approach, you typically need to run several prompts per user prompt, which directly affects the user-perceived performance of your LLM-based solution.
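A rough back-of-the-envelope, reusing the on-phone throughput quoted above (the call count and tokens per call are assumptions for illustration only):

```python
# User-perceived latency when one user turn triggers several LLM calls
# (e.g. retrieval query rewrite + chain-of-thought + final answer).
tokens_per_second = 21        # the on-phone 3B figure quoted upthread
calls_per_user_prompt = 3     # assumption: rewrite + reasoning + answer
tokens_per_call = 150         # assumption

latency_s = calls_per_user_prompt * tokens_per_call / tokens_per_second
print(f"~{latency_s:.0f} s before the user sees a complete reply")  # ~21 s
```

At hundreds of tokens per second the same pipeline drops to a couple of seconds, which is the difference between "feature" and "unusable".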