
Anecdotally, I will say that for my toughest jobs, GPT-5+ High in `codex` has been the best tool I've used - CUDA->HIP porting, finding bugs in torch, websockets, etc. It's able to test, reason deeply, and find bugs. It can't make UI code for its life, however.

Sonnet/Opus 4.5 is faster, generally feels like a better coder, and makes much prettier TUI/FEs, but in my experience, for anything tough, any time it tells you it understands now, it really doesn't...

Gemini 3 Pro is unusable - I've found the same thing: opinionated in the worst way, unreliable, doesn't respect my AGENTS.md, and for my real-world problems I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say that I wasn't impressed w/ Max, hopefully 5.2 xhigh improves things). I've heard from colleagues working on FE that it can do some magic, but I'll just have to take their word for it.


We do live in an age of frontier LLMs... For fun, I'll just use Kimi K2 (on Kagi Assistant).

> Can you explain what this means and its significance? Assume that I'm a layperson with no familiarity with LLM jargon so explain all of the technical terms, references, names. https://github.com/MoonshotAI/Kimi-Linear

Imagine your brain could only “look at” a few words at a time when you read a long letter. Today’s big language models (the AI that powers chatbots) have the same problem: the longer the letter gets, the more scratch paper they need to keep track of it all. That scratch paper is called the “KV cache,” and for a 1 000 000-word letter it can fill a small library.

Kimi Linear is a new way for the AI to read and write that throws away most of that scratch paper yet still understands the letter. It does this by replacing the usual “look at every word every time” trick (full attention) with a clever shortcut called linear attention. The shortcut is packaged into something they call Kimi Delta Attention (KDA).

What the numbers mean in plain English

    51.0 on MMLU-Pro: on a 4 000-word school-test set, the shortcut scores about as well as the old, slow method.
    84.3 on RULER at 128 000 words: on a much longer test it keeps the quality high while running almost four times faster.
    6 × faster TPOT: when the AI is writing its reply, each new word appears up to six times sooner than with the previous best shortcut (MLA).
    75 % smaller KV cache: the scratch paper is only one-quarter the usual size, so you can fit longer conversations in the same memory.
Key pieces explained

    Full attention: the old, accurate but slow “look back at every word” method.
    KV cache: the scratch paper that stores which words were already seen.
    Linear attention: a faster but traditionally weaker way of summarising what was read.
    Gated DeltaNet: an improved linear attention trick that keeps the most useful bits of the summary.
    Kimi Delta Attention (KDA): Moonshot’s even better version of Gated DeltaNet.
    Hybrid 3:1 mix: three layers use the fast KDA shortcut, one layer still uses the old reliable full attention, giving speed without losing smarts.
    48 B total, 3 B active: the model has 48 billion total parameters but only 3 billion “turn on” for any given word, saving compute.
    Context length 1 M: it can keep track of about 1 000 000 words in one go—longer than most novels.
Bottom line: Kimi Linear lets an AI read very long documents or hold very long conversations with far less memory and much less waiting time, while still giving answers as good as—or better than—the big, slow models we use today.
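To make the KV-cache point concrete, here's a rough back-of-the-envelope calc. The layer/head/dim numbers below are made-up placeholder values, not Kimi Linear's actual architecture; the point is just that a full-attention cache grows with context while a linear-attention state stays a fixed size:

    # Rough illustration: KV cache grows with context length, a linear-attention
    # state does not. All model dimensions here are hypothetical placeholders.
    def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
        # full attention stores a key and a value vector per token, per layer
        return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_val

    def linear_state_bytes(n_layers=32, n_heads=32, head_dim=128, bytes_per_val=2):
        # a linear-attention layer keeps a fixed-size (head_dim x head_dim) state
        # per head, independent of how many tokens have been read
        return n_layers * n_heads * head_dim * head_dim * bytes_per_val

    for tokens in (4_000, 128_000, 1_000_000):
        print(f"{tokens:>9} tokens: full-attention KV cache ~ "
              f"{kv_cache_bytes(tokens) / 2**30:6.1f} GiB, "
              f"linear-attention state ~ {linear_state_bytes() / 2**30:4.2f} GiB")

With these toy numbers, the full-attention cache goes from ~0.5 GiB at 4K tokens to ~122 GiB at 1M, while the linear state stays at ~0.03 GiB regardless; the 3:1 hybrid keeps some full-attention layers, which is where the "75% smaller" figure comes from.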


I'd previously been giving Hyprland a try, but after lots of customization work there were still a bunch of things I wasn't happy with, and I ended back on GNOME as a "just let me get work done" thing (I use multiple workspaces, always have dozens or hundreds of browser windows open, and depend on a bunch of tray extensions). That being said, GNOME just updated versions and broke all my extensions again, so I've decided to recommit to fixing anything that isn't working for my workflow and ditching GNOME forever (I was previously much happier on Openbox, but well, Wayland).

With this latest go I gave River, QTile, and Niri a try. After a bit of swapping back and forth, I've settled on Niri and am slowly adding functionality I'm missing.

- I like multiple dynamic workspaces (grouped by function) and don't see much point beyond a split or two, so Niri worked pretty well, and I was able to largely config all the keyboard shortcuts to something that made sense to me

- I'm using waybar and swaync for my other DE bits

I've also been using long-running Claude Code/Codex sessions in a workspace to build a number of custom scripts:

- niri-workspaces - dynamically generate a workspace display on my waybar showing windows and activity (a rough sketch of the approach is below, after this list)

- niri-workspace-names - integrate w/ fuzzel to let me rename workspaces

- niri-alttab - getting app cycling working in a way that makes sense to me; this is probably a larger project if I want live thumbnails and the like

- niri-terminal-below - I often want to have a new vertical terminal split and it's a bit hacky but works (have to punch out a new terminal, then bring it below, and move back if on the right side)
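For reference, niri-workspaces is basically a thin wrapper around niri's IPC feeding a waybar custom module. A minimal sketch of the idea, assuming `niri msg --json workspaces` / `niri msg --json windows` on your niri version - the JSON field names (`idx`, `output`, `is_focused`, `workspace_id`, etc.) are best-effort guesses and may differ between releases, so check the actual output on your install:

    #!/usr/bin/env python3
    # Minimal sketch: emit a waybar custom-module JSON blob describing niri workspaces.
    # Field names below are assumptions; verify against `niri msg --json workspaces`.
    import json
    import subprocess

    def niri(*args):
        out = subprocess.run(["niri", "msg", "--json", *args],
                             capture_output=True, text=True, check=True).stdout
        return json.loads(out)

    def main():
        workspaces = niri("workspaces")
        windows = niri("windows")

        # count windows per workspace so empty vs. busy workspaces render differently
        counts = {}
        for w in windows:
            ws_id = w.get("workspace_id")
            counts[ws_id] = counts.get(ws_id, 0) + 1

        parts = []
        for ws in sorted(workspaces, key=lambda ws: (ws.get("output") or "", ws.get("idx") or 0)):
            label = ws.get("name") or str(ws.get("idx"))
            n = counts.get(ws.get("id"), 0)
            marker = "*" if ws.get("is_focused") else ("+" if ws.get("is_active") else "")
            parts.append(f"{marker}{label}({n})" if n else f"{marker}{label}")

        # waybar "custom" modules with return-type=json accept an object with a "text" field
        print(json.dumps({"text": "  ".join(parts)}))

    if __name__ == "__main__":
        main()

Wired into waybar as a custom module with `return-type = json`, either on a short interval or (less polling-y) triggered off niri's event-stream IPC if your version has it.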

I haven't gone through all the docs or done much looking around, but one nice thing about these new coding agents is that they can just go and do a passable job that I can then tweak as I want.


Re: app cycling, you might also be interested in https://github.com/isaksamsten/niriswitcher.


Looks great, thanks for the suggestion!


In Linux, you can set it as high as you want, although you should probably have a swap drive and still be prepared for your system to die if you set it to 128GiB. Here's how you'd set it to 120GiB:

    # This is deprecated, but can still be referenced
    # (gttsize is specified in MiB: 120 GiB = 122880 MiB)
    options amdgpu gttsize=122880

    # This specifies GTT by # of 4KB pages:
    #   31457280 * 4KB / 1024 / 1024 = 120 GiB
    options ttm pages_limit=31457280


RDNA3 CUs do not have FP8 support and its INT8 runs at the same speed as FP16 so Strix Halo's max theoretical is basically 60 TFLOPS no matter how you slice it (well it has double INT4, but I'm unclear on how generally useful that is):

    512 ops/clock/CU * 40 CU * 2.9e9 clock / 1e12 = 59.392 FP16 TFLOPS
Note, even with all my latest manual compilation bells and whistles and the latest TheRock ROCm builds, the best I've gotten mamf-finder up to is about 35 TFLOPS, which is still not amazing efficiency (most Nvidia cards are at 70-80%), although it's a huge improvement over the single-digit TFLOPS you might get ootb.
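For the curious, the efficiency math is just arithmetic on the figures above (nothing measured beyond the 35 TFLOPS result):

    # Strix Halo (RDNA3, 40 CUs): theoretical peak vs. measured mamf-finder result
    peak_tflops = 512 * 40 * 2.9e9 / 1e12   # 512 FP16 ops/clock/CU * 40 CU * ~2.9 GHz
    measured_tflops = 35                    # best mamf-finder result mentioned above
    print(f"peak: {peak_tflops:.1f} TFLOPS, efficiency: {measured_tflops / peak_tflops:.0%}")
    # -> peak: 59.4 TFLOPS, efficiency: 59%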

If you're not training, your inference speed will largely be limited by available memory bandwidth, so the Spark's token generation will be about the same as the 395's.
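A rough way to see why: the bandwidths below are approximate theoretical peaks (roughly 256 GB/s for Strix Halo's 256-bit LPDDR5X-8000, roughly 273 GB/s for the Spark), and this naive model just assumes token generation is bound by streaming the active weights once per token, ignoring KV cache reads and real-world efficiency:

    # Naive bandwidth-bound token-generation estimate: tok/s ~ bandwidth / bytes read per token.
    # Bandwidths are approximate theoretical peaks; adjust for your actual hardware.
    def est_tok_per_s(bandwidth_gbps, active_params_b, bytes_per_param):
        bytes_per_token = active_params_b * 1e9 * bytes_per_param  # weights streamed per token
        return bandwidth_gbps * 1e9 / bytes_per_token

    for name, bw in [("Strix Halo (~256 GB/s)", 256), ("Spark (~273 GB/s)", 273)]:
        # e.g. a hypothetical 32B-active MoE at 8-bit
        print(f"{name}: ~{est_tok_per_s(bw, 32, 1):.0f} tok/s upper bound")

Both land around the same single-digit tok/s ceiling for that example, which is the point.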

On general utility, I will say that the 16 Zen 5 cores are impressive. They beat my 24C EPYC 9274F in both single- and multithreaded workloads by about 25%.


Apple actually makes a lot more acquisitions than you think, but they are rarely very high profile/talked about: https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitio...


For fungible things, it's easy to cost out. But not all things can be broken down just in token cost, especially as people start building their lives around specific models.

Even beyond privacy, availability itself is out of your control - you can look at r/ChatGPT's collective spasm yesterday when 4o was taken from them, but basically, you have no guarantees of access to services, and for LLM models in particular, "upgrades" can completely change behavior/services that you depend on.

Google has been even worse here in the past - I've seen them deprecate model versions with one month's notice. It seems a lot of model providers are doing dynamic model switching/quanting/reasoning-effort adjustments based on load now.


You're right on ratios, but actually the ratio is much worse than 6:1 since they are MoEs. The 20B has 3.6B active, and the 120B has only 5.1B active, only about 40% more!


A few people have mentioned looking at the vLLM docs and blog (recommended!). I'd recommend SGLang's docs and blog as well.

If you're interested in a bit of a deeper dive, I can highly recommend reading some of what DeepSeek has published: https://arxiv.org/abs/2505.09343 (and actually quite a few of their Technical Reports and papers).

I'd also say that while the original GPT-4 was a huge model when it was released (rumored 1.7T-A220B), these days you can get (original release) "GPT-4-class" performance at ~30B dense/100B sparse MoE - and almost all the leading MoEs have between 12-37B activations no matter how big they get (Kimi K2, at 1T total params, has only 32B activations). If you do basic quants (FP8/INT8) you can easily push 100+ tok/s on pretty bog-standard data center GPUs/nodes. You can quant even lower for even better speeds (tg is just MBW) without much quality loss (although with open-source kernels you usually don't get much overall throughput or latency improvement).

A few people have mentioned speculative decoding, if you want to learn more, I'd recommend taking a look at the papers for one of the (IMO) best open techniques, EAGLE: https://github.com/SafeAILab/EAGLE
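If it helps, here's the basic shape of speculative decoding in a few lines of Python. This is a deliberately simplified greedy-verification version, not EAGLE itself (EAGLE drafts at the feature level and uses a proper acceptance rule that preserves the target distribution); `draft_model` and `target_model` are stand-in callables, not any real API:

    # Simplified speculative decoding sketch (greedy verification, stand-in callables).
    def speculative_step(draft_model, target_model, prompt_tokens, k=4):
        # 1. cheap draft model proposes k tokens autoregressively
        draft = []
        ctx = list(prompt_tokens)
        for _ in range(k):
            t = draft_model(ctx)          # returns one next-token id
            draft.append(t)
            ctx.append(t)

        # 2. expensive target model scores all k positions in ONE forward pass;
        #    target_preds[i] is the target's next token after prompt + draft[:i]
        #    (this batching is where the speedup comes from)
        target_preds = target_model(list(prompt_tokens), draft)  # k+1 predictions

        # 3. accept the longest prefix where the target agrees with the draft,
        #    then take one "free" token from the target at the first disagreement
        accepted = []
        for i, t in enumerate(draft):
            if target_preds[i] == t:
                accepted.append(t)
            else:
                accepted.append(target_preds[i])
                break
        else:
            accepted.append(target_preds[k])
        return accepted  # 1..k+1 tokens per target forward pass instead of 1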

The other thing that is often ignored, especially for multiturn, and that I haven't seen mentioned yet, is better caching: specifically prefix caching (radix-tree, block-level hash) or tiered/offloaded KV caches (LMCache as one example). If you search for those keywords, you'll find lots there as well.
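The core idea behind block-level prefix caching is tiny: chunk the token stream into fixed-size blocks, key each block by a hash of its contents chained with the previous block's hash, and reuse any already-computed KV blocks for a matching prefix. A toy sketch (bookkeeping only, no actual KV tensors, and not any particular engine's implementation):

    # Toy block-level prefix cache: maps chained block hashes -> "KV block" handles.
    # Real engines (vLLM, SGLang) do this over paged KV memory; this only shows the lookup.
    import hashlib

    BLOCK = 16  # tokens per block

    def block_hashes(tokens):
        hashes, prev = [], b""
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            h = hashlib.sha256(prev + str(tokens[i:i + BLOCK]).encode()).digest()
            hashes.append(h)
            prev = h  # chaining makes each hash depend on the whole prefix
        return hashes

    cache = {}  # hash -> precomputed KV block (here just a placeholder string)

    def prefill(tokens):
        reused = 0
        for h in block_hashes(tokens):
            if h in cache:
                reused += BLOCK                      # KV for this block already exists
            else:
                cache[h] = f"kv-block-{len(cache)}"  # would run attention prefill here
        return reused

    system_prompt = list(range(100))                 # shared multiturn prefix
    print(prefill(system_prompt + [1, 2, 3]))        # 0 tokens reused on first request
    print(prefill(system_prompt + [7, 8, 9, 10]))    # 96 tokens of shared prefix reused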


In Linux, you can allocate as much as you want with `ttm`:

In 4K pages for example:

    options ttm pages_limit=31457280
    options ttm page_pool_size=15728640
This will allow up to 120GiB to be allocated and pre-allocate 60GiB (you could preallocate none or all of it, depending on your needs and fragmentation). I believe `amdgpu.vm_fragment_size=9` (2MiB) is optimal.
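If you want to target a different size, the conversion is just pages-to-bytes arithmetic (4 KiB pages); a quick helper, using the 120/60 GiB split above as the example:

    # Convert a desired GTT size in GiB to ttm page counts (4 KiB pages).
    PAGE = 4 * 1024

    def pages_for_gib(gib):
        return gib * 1024**3 // PAGE

    print("options ttm pages_limit=%d" % pages_for_gib(120))     # 31457280 -> 120 GiB cap
    print("options ttm page_pool_size=%d" % pages_for_gib(60))   # 15728640 -> 60 GiB preallocated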

