tredre3's comments | Hacker News

American car manufacturers have extremely small market shares outside N.A., and many (all?) of them required multiple government bailouts over the past few decades.

If you think that keeping China out is good for the consumer, you'll have to present a stronger case than "we must protect our companies".


Try to look outside your bubble. Millions of people still have monthly quotas on their internet plan, and 4GB can be a big chunk of it (or even exceed it entirely, in some cases).

Google should have asked.


The model is loaded once and can be used for multiple sessions, and even for parallel requests.

llama.cpp uses a unified KV cache that is shared between requests (whether they happen in parallel or not). As new requests come in, they'll first evict branches that are no longer referenced, then fall back to evicting the least recently used entries, and so on.

If you come back to a session that's been evicted, its prompt will just be processed again. This is mostly a problem on very long context sessions, but it can still bite you.

So one way to reduce such evictions (and significantly reduce KV cache size as a bonus) is to reduce the number of KV cache checkpoints.

Checkpoints allow you to branch a session at any point and not have to recompute it from the start. If you find that you rarely branch a conversation, or if you rely entirely on a coding harness, then setting ctx-checkpoints to 0 or 1 will save tons of VRAM and allow more different sessions to stay in VRAM. This is especially true for models with very large checkpoints (such as Gemma 4).
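
As a concrete sketch, a server launch that trades branching flexibility for more resident sessions could look like this (the model path and context size are placeholders; the checkpoint flag spelling follows the option named above):

    # keep at most one checkpoint per slot to free VRAM for more sessions
    llama-server -m ./model.gguf \
      --ctx-size 32768 \
      --parallel 4 \
      --ctx-checkpoints 1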


On the first three questions I could see reasonable people arguing either way, with legitimate papers backing their "side", so I'm happy to at least see an even split.

For the autism question I agree with you: people simply believing their government is reasonable.

But I am quite worried about the last question. That 25% of people believe vaccines are used for population control is worrisome, no matter how you spin it.


Note that while it is not a vaccine, there is an injectable birth control, and immunocontraceptives (which would actually be vaccines) are in development. Further, there have been well-documented cases where people were sterilized against their will under the guise of other medical procedures, including by receiving such shots. It really doesn't take much whisper-down-the-lane for that to be misunderstood, and once you get it in your head that a vaccination program could be used for population control, I imagine it would be pretty hard to find evidence to convince yourself otherwise.

I'm sure you have ways to entirely purge a crate. And the situation will arise that you need to do so. In which case all the old code will, indeed, break.

Vendoring is the only solution to this but it's really discouraged in rust-land and there is no first-party support for it. You can kind of manually vendor your deps with cargo, and there are third party tools. But compare that to go-land where `go mod vendor` gets you 95-100% of the way there.


> I'm sure you have ways to entirely purge a crate.

No, the lesson from left-pad that every centralized package manager learned was that you cannot allow users to remove uploaded packages at their leisure. Outright code removal can only be done manually by the admins themselves, and it's unlikely to happen outside of some legal compulsion.

> Vendoring is the only solution to this but it's really discouraged in rust-land and there is no first-party support for it.

This is completely incorrect. Cargo ships with `cargo vendor` out of the box, it's neither discouraged nor unsupported by first-party tools: https://doc.rust-lang.org/cargo/commands/cargo-vendor.html
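
For anyone unfamiliar, the whole workflow is two steps (the TOML below is the snippet `cargo vendor` itself prints when it finishes):

    # copy every dependency into ./vendor
    cargo vendor

    # then add the printed snippet to .cargo/config.toml:
    #
    #   [source.crates-io]
    #   replace-with = "vendored-sources"
    #
    #   [source.vendored-sources]
    #   directory = "vendor"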


What you are describing happens all the time. Usually the toolchain provider will continue updating a list of known issues for some time after EOL. Beyond that you have third parties that do it for decades, if the platform is big enough. They collect bug reports from the industry, investigate them, then create lists that you subscribe to. Those lists include detailed examples, explanations, and usually linter rules to detect code that could trigger the bug.

The truth is: if the toolchain was good enough to ship your product, had time to go EOL, and only then did a patch surface an esoteric toolchain bug, then the odds are that you'll know exactly what triggered the bug and can work around it by writing different code.

Because even if the newer shinier compiler/toolchain had the issue fixed, most companies wouldn't upgrade to it at that point. It's almost never desirable to change your toolchain for a shipping product, you're just introducing more unknowns.


> Because even if the newer shinier compiler/toolchain had the issue fixed, most companies wouldn't upgrade to it at that point. It's almost never desirable to change your toolchain for a shipping product, you're just introducing more unknowns.

This reaction to toolchain stability is quite defensive, and was needed for C, but it isn't universally necessary. C toolchain updates could break your product because of how loose the C language can be; I've had code whose undefined behaviour was benign until a toolchain update brought in an optimisation that broke it.

Another outcome of a toolchain update could be "no bugs introduced, existing bugs in your codebase now found by diagnostics".


Do you have any reason to believe that Granite is more immune to the effects of quantization than other tiny models? Otherwise it seems odd to judge a tiny model's true capabilities by using its 4-bit quant.

This model is small enough that it might be sensible to try the same prompts against all of the quant sizes to try and spot any differences.
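
A quick way to do that, as a sketch (the GGUF filenames here are hypothetical; substitute whatever quants you actually have on disk):

    # run the same prompt against each quant level
    for q in Q2_K Q4_K_M Q8_0 F16; do
      echo "=== $q ==="
      llama-cli -m granite-tiny-$q.gguf -p "Explain TCP slow start." -n 256
    done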


That was interesting, almost like a weird little modern art gallery. I’m surprised that the BF16 one looks so bad…

> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.

All the Q quants from big quant providers are importance-weighted (imatrix) nowadays.

The main (possibly only?) difference between Q and IQ today is that IQ uses a lookup table to achieve better compression. That is also why IQ suffers more when it can't fully fit into VRAM.

It's important to teach people the distinction and not perpetuate outdated assumptions. If one needs/wants static (non-imatrix) quants, avoiding the IQ_ prefix isn't enough.
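
To make the distinction concrete: the same importance matrix can feed either a Q or an IQ quant, which is why the prefix alone doesn't tell you whether a file is importance-weighted. A sketch using llama.cpp's tools (filenames are placeholders):

    # build an importance matrix from calibration text...
    llama-imatrix -m model-F16.gguf -f calibration.txt -o model.imatrix
    # ...then use it for both a K quant and an IQ quant
    llama-quantize --imatrix model.imatrix model-F16.gguf model-Q4_K_M.gguf Q4_K_M
    llama-quantize --imatrix model.imatrix model-F16.gguf model-IQ4_XS.gguf IQ4_XS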


Thanks for bringing this up! I looked into it, and if I understood correctly:

- Q4_0 (not a K quant) is the traditional flat quantization
- Q4_K (4-bit K quant) uses an imatrix, and important weights get higher precision (5-6 bits instead of 4, but still largely 4 bits)
- IQ4 uses an imatrix, and important weights get an optimized scale to avoid clipping at 4-bit, but all the weights are still 4-bit

And yeah, most quants nowadays are K quants, which are importance-weighted.


An idle GPU consumes almost nothing; a loaded (server-class) GPU can consume over 2 kW.

Admittedly a single request isn't a full load, but claiming that a request makes no difference vs idle is misguided, in my opinion.
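
For anyone who wants to see the gap on their own hardware, nvidia-smi can sample power draw directly (the 1-second loop interval here is just for illustration):

    # sample power draw and utilization once per second
    nvidia-smi --query-gpu=power.draw,utilization.gpu --format=csv -l 1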


OpenAI's GPUs won't be idle for long because they have all the other requests to serve. Over time there will be a certain % of idle GPUs, amortized across the hundreds of millions of requests they receive.

And the idle % is causally connected to whether you make a request or not, surely? I don't understand how your mental model works.

I agree with you that an LLM is perfectly capable of explaining its actions.

However, it cannot do so after the fact. If there's a reasoning trace, it could extract a justification from it. But if there isn't, or if the reasoning trace makes no sense, then the LLM will just lie and make up reasons that sound about right.


So it is equal to what neuroscientists and psychologists have proven about human beings!

How was it proven?
