
You're missing the point. The effective tax rate of many billionaires is lower than ours. "Musk with a fortune of $244 billion, paid an average effective tax rate of 24% from 2018 to 2020"[1]. In other years it was as low as 3%[2].

[1] https://www.cbsnews.com/news/income-taxes-billionaire-tax-ra...

[2] https://www.propublica.org/article/the-secret-irs-files-trov...


I too found it interesting that Apple's Neural Engine doesn't work with local LLMs. Seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?


NPUs are almost universally too weak for serious LLM inference. Most of the time you get better performance-per-watt out of GPU compute shaders, so the majority of NPUs end up as dark silicon.

Keep in mind - Nvidia has no NPU hardware because that functionality is baked into their GPU architecture. AMD, Apple, and Intel are all in this awkward NPU boat because they wanted to avoid competing with Nvidia and continue shipping simple raster designs.


Apple is in this NPU boat because they are optimized for mobile first.

Nvidia does not optimize for mobile first.

AMD and Intel were forced by Microsoft to add NPUs in order to sell “AI PCs”. Turns out the kind of AI that people want to run locally can’t run on an NPU. It’s too weak, like you said.

AMD and Intel both have matmul acceleration directly in their GPUs. Only Apple does not.


Nvidia's approach works just fine on mobile. Devices like the Switch have complex GPGPU pipelines and don't compromise whatsoever on power efficiency.

Nonetheless, Apple's architecture on mobile doesn't have to define how they approach laptops, desktops and datacenters. If the mobile-first approach is limiting their addressable market, then maybe Tim's obsessing over the wrong audience?


MacBooks benefit from mobile optimization. Apple just needs to add matmul hardware acceleration into their GPUs.


Perhaps due to sizes? AI/NN models before LLMs were orders of magnitude smaller, as evidenced by effectively all LLMs carrying "Large" in their names regardless of relative size differences.


I guess that hardware doesn’t make things faster (yet?). If it did, I guess they would have mentioned it in https://machinelearning.apple.com/research/core-ml-on-device.... That page is updated for Sequoia and says:

“This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”
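
For anyone curious what that looks like in code, here's a minimal sketch (mine, not from the Apple post) of the basic coremltools flow: trace a PyTorch module, convert it, and request Neural Engine execution. The real Llama recipe in the linked article layers stateful KV caches and quantization on top of this, so treat the names and shapes here as illustrative only.

    # Minimal sketch, not Apple's full recipe: convert a traced PyTorch
    # module to Core ML and ask for Neural Engine execution.
    import torch
    import coremltools as ct

    class TinyMLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(512, 2048),
                torch.nn.GELU(),
                torch.nn.Linear(2048, 512),
            )

        def forward(self, x):
            return self.net(x)

    example = torch.randn(1, 512)
    traced = torch.jit.trace(TinyMLP().eval(), example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=example.shape)],
        # Schedule supported ops on the ANE, falling back to CPU otherwise.
        compute_units=ct.ComputeUnit.CPU_AND_NE,
        minimum_deployment_target=ct.target.macOS14,
    )
    print(mlmodel.predict({"x": example.numpy()}))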


If it uses a lot less power it could still be a win for some use cases: while on battery you might still want to run transformer-based speech-to-text, RTX Voice-style microphone denoising, or image generation/infill in photo editing programs. In other cases, like running that denoising during multiplayer gaming, you might want the GPU left free for the game, even if the game still takes some memory bandwidth hit from having it running.


There is no NPU "standard".

Llama.cpp would have to target every hardware vendor's NPU individually and those NPUs tend to have breaking changes when newer generations of hardware are released.

Even Nvidia GPUs often have breaking changes moving from one generation to the next.


I think OP is suggesting that Apple / AMD / Intel do the work of integrating their NPUs into popular libraries like `llama.cpp`. Which might make sense. My impression is that by the time the vendors support a certain model with their NPUs the model is too old and nobody cares anyway. Whereas llama.cpp keeps up with the latest and greatest.


I think I saw something that got Ollama to run models on it? But it only works with tiny models. Seems like the neural engine is extremely power efficient but not fast enough to do LLMs with billions of parameters.


I am running Ollama with 'SimonPu/Qwen3-Coder:30B-Instruct_Q4_K_XL' on an M4 Pro MBP with 48 GB of memory.

From Emacs/gptel, it seems pretty fast.

I have never used the proper hosted LLMs, so I don't have a direct comparison. But the above LLM answered coding questions in a handful of seconds.
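
For reference, this is roughly what gptel is doing under the hood; you can poke the same local model directly through Ollama's HTTP API (default port 11434). A minimal sketch, assuming the model above has already been pulled:

    # Hit the local Ollama REST API directly (same model as above).
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "SimonPu/Qwen3-Coder:30B-Instruct_Q4_K_XL",
            "prompt": "Write a Python function that reverses a linked list.",
            "stream": False,  # one JSON blob instead of a token stream
        },
        timeout=300,
    )
    print(resp.json()["response"])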

The cost of memory (and disk) upgrades in Apple machines is exorbitant.




Very interesting model. Some key points from the blog:

* NVIDIA is also releasing most of the data they used to create it, including the pretraining corpus

* The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers. For the architecture, please refer to the Nemotron-H tech report. The model was trained using Megatron-LM and NeMo-RL.

At this size and with only 4 attention layers, it should run very fast locally on cheap 12GB GPUs.
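
If you want to try it locally, a rough sketch with Hugging Face transformers is below. The repo id is a placeholder (grab the real checkpoint name from NVIDIA's release page), and hybrid Mamba-2 models generally want a recent transformers build; this is a sketch, not NVIDIA's recommended setup.

    # Hypothetical sketch: the repo id below is a placeholder, not a real
    # checkpoint name -- substitute the one from NVIDIA's release.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "nvidia/<nemotron-checkpoint>"  # placeholder

    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # pick a quantized variant if this overflows VRAM
        device_map="auto",
        trust_remote_code=True,
    )

    prompt = "Explain Mamba-2 state-space layers in one sentence."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))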


Higgs Audio V2 is an advanced, open-source audio generation model developed by Boson AI, designed to produce highly expressive and lifelike speech with robust multi-speaker dialogue capabilities.

Some Highlights:

* Trained on 10M hours of diverse audio — speech, music, sound events, and natural conversations

* Built on top of Llama 3.2 3B for deep language and acoustic understanding

* Runs in real-time and supports edge deployment — smallest versions run on Jetson Orin Nano

* Outperforms GPT-4o-mini-tts and ElevenLabs v2 in prosody, emotional expressiveness, and multi-speaker dialogue

* Zero-shot natural multi-speaker dialogues — voices adapt tone, energy, and emotion automatically

* Zero-shot voice cloning with melodic humming and expressive intonation — no fine-tuning needed

* Multilingual support with automatic prosody adaptation for narration and dialogue

* Simultaneous speech and background music generation — a first for open audio foundation models

* High-fidelity 24kHz audio output for studio-quality sound on any device

* Open source and commercially usable — no barriers to experimentation or deployment

Model on Huggingface: https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-...


How is this the trolley problem for someone with a terminal disease? I assume the sick population are the people in the trolley and the experimental patient is the person on the track? In this scenario, by not pulling the lever you just extend the lives of the people in the trolley until the end of the ride, whereas pulling the lever could save the life of the person on the track, and the people in the trolley too, if the drug is successful.

What am I missing? For non-terminal diseases, it's a bit murkier, but still I don't follow the analogy.


Some people do go into remission from a terminal cancer diagnosis, either because the diagnosis was wrong or because they live long enough for an approved treatment to come on the market. Also, that you have terminal cancer doesn’t say anything about how long you’re going to live. You can live for many years with terminal cancer.

I do think we’re overly cautious with drug approvals and I think we should be more open to leaving the decision to patients and their medical teams, but it’s not as simple as saying someone’s terminally ill, so just do whatever. Reducing it down to the trolley problem makes it seem much more black and white and immediate than it really is.


Why not recommend atop? When a system is unresponsive, I want a high-level tool that immediately shows which subsystem is under heavy load. It should show CPU, memory, disk, and network usage. The other tools you listed are great once you know what the cause is.


My preference is tools that give a rolling output, as they let you capture the time-based pattern and share it with others, including in JIRA tickets and SRE chatrooms, whereas the top-style tools generally clear the screen. atop by default also sets up logging and runs a couple of daemons via systemd, so it's more than just a handy tool when needed; it's adding itself to the operating table. (I think I did at least one blog post about performance monitoring agents causing performance issues.) Just something to consider.

I've recommended atop in the past for catching short-lived processes because it uses process accounting, although the newer bpf tools provide more detail.


All of the features described in the article are supported by clang except maybe the constexpr keyword. By your own list, neither one supports all the features. Also, gcc only supports about 4 or 5 features more than clang, and clang supports a few gcc doesn't. Hardly "much better".


For the Internet as a whole, yes, but not for this narrow case of legitimate news organizations. If Facebook can have a building full of moderators reviewing flagged content on their site, Google could easily fix this issue with a team that handles reports of low-quality AI content. Not to mention how most of these sites sprang up out of nowhere; that alone should be a red flag for the algorithm to de-rank them until a human can review. And the example of reworded blog spam with identical photos, posted at a later date, should again be trivial for a team that cared to catch, whether manually (via flags) or through an algorithm.

It's not over, unless Google doesn't care enough about Google News to fix it.


So you prefer the DSL over GPLv2? The Doom Source License (DSL) is the original source code license under which the Doom source code was released in late 1997.[1] I'll take this over the years and years of trouble the past licensing caused, any day.[2]

[1] https://doom.fandom.com/wiki/Doom_Source_License#:~:text=The....

[2] https://www.doomworld.com/forum/topic/44358-doom-license-amp...

