This is fantastic work. The focus on a local, sandboxed execution layer is a huge piece of the puzzle for a private AI workspace. The `coderunner` tool looks incredibly useful.
A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.
(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.
Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."
Yeah, that's a fair point at first glance. 50GB might not sound like a huge burden for a modern SSD.
However, the 50GB figure was just a starting point for emails. A true "local Jarvis" would need to index everything: all your code repositories, documents, notes, and chat histories. That raw data can easily be hundreds of gigabytes.
For a 200GB text corpus, a traditional vector index can swell to >500GB. At that point, it's no longer a "meager" requirement. It becomes a heavy "tax" on your primary drive, which is often non-upgradable on modern laptops.
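To see where a >500GB figure can come from, here is a hedged back-of-envelope sketch in Python. The chunk size, overlap, and embedding dimension are assumed illustrative parameters, not measured numbers from any particular system:

```python
# Back-of-envelope: why a flat-embedding index can dwarf the text it indexes.
# All parameters below are assumptions for illustration.

corpus_bytes = 200 * 10**9      # 200 GB of raw text
chunk_chars  = 1000             # assumed chunk size
overlap      = 0.2              # assumed 20% overlap between chunks
dim          = 768              # assumed embedding dimension
bytes_per_f  = 4                # float32

effective_chunk = chunk_chars * (1 - overlap)   # net new text per chunk
n_chunks = corpus_bytes / effective_chunk       # ~250 million chunks
embed_bytes = n_chunks * dim * bytes_per_f      # raw vectors only

print(f"chunks:     {n_chunks / 1e6:.0f} M")
print(f"embeddings: {embed_bytes / 1e9:.0f} GB")  # graph links add more on top
```

Under these assumptions the raw vectors alone come to roughly 768GB before any graph structure is added, which is in the same ballpark as the >500GB claim.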
The goal for practical local AI shouldn't just be that it's possible, but that it's also lightweight and sustainable. That's the problem we focused on: making a comprehensive local knowledge base feasible without forcing users to dedicate half their SSD to a single index.
You already need very high-end hardware to run useful local LLMs; I don't know if a 200GB vector database will be the dealbreaker in that scenario. But I wonder how small you could get it with compression and quantization on top.
I've worked in other domains my whole career, so I was astonished this week when we put a million 768-len embeddings into a vector db and it was only a few GB. Napkin math said ~25 GB, and intuition said a long list of widely distributed floats would be fairly incompressible. HNSW is pretty cool.
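For what it's worth, the arithmetic for the raw vectors alone (assuming float32, before any index overhead) already lands in "a few GB" territory:

```python
# Raw size of 1M 768-dim embeddings, assuming float32 and no index overhead.
n_vectors = 1_000_000
dim = 768
bytes_per_float = 4   # float32

raw_gb = n_vectors * dim * bytes_per_float / 1e9
print(f"{raw_gb:.2f} GB")   # ~3.07 GB, even with no compression at all
```

HNSW graph links (typically tens of neighbor IDs per node) add on top of that, but the vectors themselves dominate, so "a few GB" is what uncompressed storage predicts.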
You can already do A LOT with an SLM running on commodity consumer hardware. Also it's important to consider that the bigger an embedding is, the more bandwidth you need to use it at any reasonable speed. And while storage may be "cheap", memory bandwidth absolutely is not.
> You already need very high end hardware to run useful local LLMs
A basic MacBook can run gpt-oss-20b and it's quite useful for many tasks. And fast. Of course, Macs have a huge advantage for local LLM inference due to their shared memory architecture.
The mid-spec 2025 iPhone can run “useful local LLMs” yet has 256GB of total storage.
(Sure, this is a spec distortion due to Apple’s market-segmentation tactics, but due to the sheer install-base, it’s still a configuration you might want to take into consideration when talking about the potential deployment-targets for this sort of local-first tech.)
Question: would it be possible to invert the problem? I.e., rather than decreasing the size of the RAG — use the RAG to compress everything other than the RAG index itself.
E.g., design a filesystem so that the RAG index is part of / managed internally within the metadata of the filesystem itself; and then, for each FS inode data-extent, give it two polymorphic on-disk representations:
1. extents hold raw data; rag-vectors are derivatives and updated after extent is updated (as today)
2. rag-vectors are canonical; extents hold residuals from a predictive-coding model that took the rag-vectors as input and tried to regenerate the raw data of the extent. When extent is read [or partially overwritten], use predictive-coding model to generate data from vectors and then repair it with residue (as in modern video-codec p-frame generation.)
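Representation (2) above can be sketched as a toy round-trip. Everything here is hypothetical: `predict_bytes` stands in for a real predictive-coding model conditioned on the RAG vectors (here it's a trivial placeholder), and the residual is a simple XOR against the prediction, analogous to p-frame repair:

```python
# Toy sketch of representation (2): rag-vectors canonical, extent stores only
# the residual needed to repair the model's reconstruction.

def predict_bytes(vectors, length):
    # Placeholder "generative model": a real system would decode the extent's
    # data from the vectors; here we just derive a deterministic byte stream.
    seed = sum(int(v * 255) for v in vectors) % 256
    return bytes((seed + i) % 256 for i in range(length))

def write_extent(raw: bytes, vectors):
    predicted = predict_bytes(vectors, len(raw))
    # Residual = what the model got wrong; this is all that hits the disk.
    return bytes(a ^ b for a, b in zip(raw, predicted))

def read_extent(residual: bytes, vectors):
    predicted = predict_bytes(vectors, len(residual))
    # Regenerate from vectors, then repair with the stored residual.
    return bytes(a ^ b for a, b in zip(residual, predicted))

vectors = [0.12, 0.87, 0.33]
raw = b"quarterly report: revenue up 4%"
assert read_extent(write_extent(raw, vectors), vectors) == raw
```

The storage win would depend entirely on how predictable the data is from the vectors: a good model yields a highly compressible, mostly-zero residual; a bad one stores the full extent anyway.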
———
Of course, even if this did work (in the sense of providing a meaningful decrease in storage use), this storage model would only really be practical for document files that are read entirely on open and atomically overwritten/updated (think Word and Excel docs, PDFs, PSDs, etc), not for files meant to be streamed.
But, luckily, the types of files this technique is amenable to are exactly the types of files that a “user’s documents” RAG would have any hope of indexing in the first place!
While your aims are undoubtedly sincere, in practice the 'local AI' crowd building their own rigs usually has 4TB or more of fast SSD storage.
The bottom tier (not meant disparagingly) are people running diffusion models, as these do not have the high VRAM requirements. They generate tons of images or video, going from one-click installs like Easy Diffusion to very sophisticated workflows in ComfyUI.
For those going the LLM route, which would be your target audience, they quickly run into the problem that, to go beyond toying around with small, highly quantized models with small context windows, the hardware and software requirements and expertise grow exponentially.
In light of the typical enthusiast investments in this space, a few TB of fast storage will pale in comparison to the rest of the expenses.
Again, your work is absolutely valuable, it is just that the storage space requirement for the vector store in this particular scenario is not your strongest card to play.
Everyone benefits from focusing on efficiency and finding better ways of doing things. Those people with 4TB+ of fast storage can now do more than they could before as can the "bottom tier."
It's a breath of fresh air anytime someone finds a way to do more with less rather than just wait for things to get faster and cheaper.
Of course. And I am not arguing against that at all. Just like if someone makes an inference runtime that is 4% faster, I'll take that win. But would it be the decisive factor in my choice? Only if that was my bottleneck, my true constraint.
All I tried to convey was that for most of the people in the presented scenario (personal emails etc.), a 50 or even 500GB storage requirement is not going to be the primary constraint. So my suggestion was that the marketing for this use case might be better served by also spotlighting something else.
You are glossing over the fact that for RAG you need to search over those 500GB+ which will be painfully slow and CPU-intensive. The goal is fast retrieval to add data to the LLM context. Storage space is not the sole reason to minimize the DB size.
You’re not searching over 500GB, you’re searching an index of the vectors. That’s the magic of embeddings and vector databases.
Same way you might have a 50TB relational database, but “select id, name from people where country=‘uk’ and name like ‘benj%’” might only touch a few MB of storage at most.
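A toy illustration of the same point for vectors: a greedy graph search (the core idea behind HNSW) visits only a small fraction of the stored points rather than scanning everything. The graph here uses random links, which is a deliberate simplification; a real HNSW builds much better ones, but the visit count tells the story either way:

```python
# Toy greedy graph search over random vectors: count how many nodes we touch
# versus the total, to show that ANN search doesn't scan the whole store.
import random

random.seed(0)
N, DIM, DEGREE = 5000, 8, 16
points = [[random.random() for _ in range(DIM)] for _ in range(N)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Crude neighbor lists: random links (a real HNSW builds far better ones).
graph = {i: random.sample(range(N), DEGREE) for i in range(N)}

def greedy_search(query, start=0):
    current, visited = start, {start}
    while True:
        visited.update(graph[current])
        nxt = min(graph[current], key=lambda j: dist(points[j], query))
        if dist(points[nxt], query) >= dist(points[current], query):
            return current, len(visited)   # local minimum reached
        current = nxt

query = [random.random() for _ in range(DIM)]
best, touched = greedy_search(query)
print(f"visited {touched} of {N} nodes")
```

Even with these terrible random links, the search terminates after touching a small slice of the dataset; the layered, well-constructed links of a real HNSW make the fraction far smaller still.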
The DGX Spark being just $3-4,000 with 4TB of storage, 128GB unified memory, etc (or the Mac Studio tbh) is a great indicator that Local AI can soon be cheap and, along with the emerging routing and expert mixing strategies, incredibly performant for daily needs.
Why is it like that, currently?
There is no information added by a vector index compared to the original text. And the text is highly redundant and compressible even with lossless methods. Furthermore, a vector index is already lossy and approximate. So conceptually it should at least be possible to have an index that is a fraction of the size of what it indexes?
There is some information added, depending on the vector db and context (some systems will add permissions related metadata so that the LLM won’t pull chunks that the user didn’t have access to).
The vector itself is pretty large (512 dimensions).
The chunks have an overlap (iirc 30% but someone feel free to correct me).
I don’t _think_ the data is typically compressed (not sure why but I assume performance).
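Putting the points above together, here is a hedged sketch of why stored chunks alone can exceed the source text before a single vector is counted. The chunk size and 30% overlap are the assumed figures from this thread, not a spec of any particular system:

```python
# Sketch: overlapping chunks duplicate text, so chunk storage alone exceeds
# the original document, before adding an embedding vector per chunk.
# Chunk size and overlap are assumed parameters.

def chunk(text: str, size: int = 200, overlap: float = 0.3):
    step = int(size * (1 - overlap))
    return [text[i:i + size] for i in range(0, max(len(text) - size, 0) + 1, step)]

doc = "x" * 10_000
chunks = chunk(doc)
stored_chars = sum(len(c) for c in chunks)

print(f"original: {len(doc)} chars, stored in chunks: {stored_chars} chars")
# ~1.4x the original text here; each chunk then also carries its own vector
# (e.g. 512 floats), which is where the index outgrows the data.
```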
It feels weird that the search index is bigger than the underlying data, weren't search indexes supposed to be efficient formats giving fast access to the underlying data?
Exactly. That's because instead of just mapping keywords, vector search stores the rich meaning of the text as massive data structures, and LEANN is our solution to that paradoxical inefficiency.
Good point! Maybe indexing is a bad term here, and it's more like feature extraction (and since embeddings are high dimensional we extract a lot of features). From that point of view it makes sense that "the index" takes more space than the original data.
Why would the embeddings be higher-dimensional than the data? I'd imagine the embeddings contain relatively higher entropy (and thus lower redundancy) than many types of source data.
I guess for semantic search (rather than keyword search), the index is larger than the text because we need to embed it into a huge semantic space, which makes sense to me.
Nonclustered indexes in RDBMS can be larger than the tables. It’s usually poor design or indexing a very simple schema in a non-trivial way, but the ultimate goal of the index is speed, not size. As long as you can select and use only a subset of the index based on its ordering it’s still a win.
Why is it considered desirable to have a RAG of people's digital traces burdening every single interaction they have with a computer?
Having similar grounds distributed locally is one thing. Pushing everyone too much into their own information bubble is another, orthogonal topic.
When someone's mind recalls that email from years before, having the option to find it again in a few instants can be interesting. But when the device starts to funnel you through past traces, it doesn't matter much whether the solution is local or remote: the spontaneous flow of thought is hijacked.
Since it’ll be local, this behavior can be controlled. I for one find the option of it digging through my personal files to give me valuable personal information attractive.
Thank you for the pointer to LEANN! I've been experimenting with RAGs and missed this one.
I am particularly excited about using RAG as the knowledge layer for LLM agents/pipelines/execution engines to make it feasible for LLMs to work with large codebases. It seems like the current solution is already worth a try. It really makes it easier that your RAG solution already has Claude Code integration![1]
Has anyone tried the above challenge (RAG + some LLM for working with large codebases)? I'm very curious how it goes (thinking it may require some careful system-prompting to push agent to make heavy use of RAG index/graph/KB, but that is fine).
I think I'll give it a try later (using cloud frontier model for LLM though, for now...)
This is annoyingly Apple-only though. Even though my main dev machine is a Macbook, this would be a LOT more useful if it was a Docker container.
I'd still take a Docker container over an Apple container, because even though Docker is not VM-level-secure, it's good enough for running local AI-generated code. You don't need DEF CON Las Vegas levels of security for that.
And also because Docker runs on my windows gaming machine with a fast GPU with WSL ubuntu, and my linux VPS in the cloud running my website, etc etc. And most people have already memorized all the basic Docker commands.
This would be a LOT better if it was just a single docker command we can copy paste, run it a few times, and then delete if necessary.
I’m no expert on these things, but since Apple Containerization uses OCI images, I’d think you’d be able to sub in Docker (or Podman, etc) as the runtime pretty trivially. Like Podman, it uses a very similar command line interface to Docker’s.
Edit: Oh, I see now that Coderunner is Apple Containerization-specific.
Oh, the number was memory space? That changes the maths a little bit. But I do have 50GB available for a model, no problem whatsoever. 384GB is the new 32GB.
Code: https://github.com/yichuan-w/LEANN
Paper: https://arxiv.org/abs/2405.08051