Software people, in my very direct experience, are terrible at hardware... While in jest, I do think most software engineers' understanding of hardware abstractions is pretty poor and does a disservice to the hardware they run on.
As a software dev who started at a hardware-focused company... I don't think it needs to be in jest, nor offensive. Hardware and software are different disciplines, even when they do overlap in embedded. It just seems to me, having been at a hardware company that failed to pivot to software and went out of business (while a new, software-first competitor became Zoom), that the mindsets are too different. Hardware requires far more planning; software, far faster iteration. In software too much planning is a death sentence. In hardware insufficient planning is a death sentence. I think a single person absolutely could do both well, but in my relatively basic estimation, I don't see it being a common trait. Hardware is cool and impressive, but I could never do it. And in my experience, many of the hardware folks I know don't seem to like how software development works either.
I don't think it means anything for this particular move; good leaders know what they know and what they don't know; they know how to motivate and select the right people, they know what to delegate and what to control. Having a track record of success of any kind is IMHO always the best start. I'm excited to see what kind of changes the transition from an operations person to a more technical leader may bring. Especially given how awesome Apple's hardware has consistently been.
Generally speaking, I think both are true. Most people seem to have an affinity for either hardware or software, but rarely for both. Those who do are extremely rare. I don't mean that as an insult to anyone, just as an observation having worked in both (and personally I am much better at software than hardware, even though I enjoy both).
And then there is IC versus leadership. They're opposites. Lead times and supply chains are a headache in hardware, but tangible deadlines are great for keeping the project grounded. In software you have to invent your own discipline to keep the team on pace, and bend over backwards explaining to physical-minded stakeholders why something with no lead times still can't be built overnight.
Hardware and software have VERY different deployment cost functions and lifecycles. Having "affinity" for one requires a mindset not really suitable for the other, and being able to juggle mindsets, especially short- versus long-term focus, is rare in itself.
My experience studying 'Computing and Electronics' - a combined degree - was that we could get practically any extensions or leniency we wanted by blaming the other specialism. To each the other was mistrusted and magic.
I agree - at university there were software people and hardware people and a small number who studied mechatronics (hardware and software). But even the mechatronic people were really hardware people who just tolerated software.
I find both interesting but have been working in software for over a decade now.
Honestly, the thing that pushed me into software dev was the fact that hardware tools were absolutely garbage. Verilog felt like a joke of a language designed to torment rather than help the user.
Verilog is not the best and that’s not even the worst part - tools like ISE/Vivado and Quartus are even worse!
It’s really amazing that there are at least some fully open flows for FPGAs these days; unfortunately they don’t support SystemVerilog. (I think this is still the case?)
Yeah, at university we had to do some hardware stuff in our software course. I know there were better debug tools available, as some students purchased them, but playing with microprocessors was no fun.
It actually stands for "lizard brain"... it is (or at least was) an Infineon Aurix control and monitoring microcontroller, they may have changed to a newer one.
I was put on it in 2015 after an acquaintance of mine that was previously on the list recommended me… I only heard from Forbes a few days before the list came out, they asked me for a photo and asked if I approved the 2 sentence blurb they prepared, and that was it. For years afterwards they would try to get me to come to their events, but I never had any interest, and I assume that was how they made money… but I never paid anything to be on the list or had real interest in being on it, and I don’t think it led to anything other than my technically illiterate parents thinking that it was impressive.
Based on their S-1 filing and public statements, the average cost per WSE system for their largest customer (~90% of their total revenue) is ~$1.36M, and I’ve heard “retail” pricing of $2.5M per system. They are also 15U and, due to power and additional support equipment, take up an entire rack.
The other thing people don’t seem to be getting in this thread is that just holding the weights for 405B at FP16 requires 19 of their systems, since it is SRAM only… rounding up to 20 to account for program code + KV cache for the user context would mean 20 systems/racks, so well over $20M. The full rack (including support equipment) also consumes 23 kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B.
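For anyone who wants to sanity-check those numbers, here is a back-of-envelope version; the ~44 GB of on-wafer SRAM per WSE system is my assumption (it is not stated above), and everything else comes from the figures quoted:

    # Rough sizing sketch; 44 GB of SRAM per system is an assumption
    import math

    weights_gb = 405e9 * 2 / 1e9            # 405B params at FP16 -> ~810 GB
    sram_per_system_gb = 44                  # assumed SRAM per WSE system
    systems_for_weights = math.ceil(weights_gb / sram_per_system_gb)  # 19
    systems = systems_for_weights + 1        # round up for code + KV cache -> 20
    power_kw = systems * 23                  # ~460 kW across the racks
    cost_millions = systems * 1.36           # ~$27M at the implied per-system price
    print(systems, power_kw, cost_millions)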
Thank you, far better answer than mine! Those are indeed wild numbers, although interestingly "only" 23 kW; I'd expect the same level of compute in GPUs to draw quite a lot more than that, or at least at a higher power density.
> For 969 tok/s in int8, you need 392 TB/s memory bandwidth
I think that math is only valid for batch size = 1. When those 969 tokens/second come from multiple sessions in the same batch, loaded model tensor elements are reused to compute many tokens for the entire batch. With large enough batches, you can even saturate the compute throughput of the GPU instead of bottlenecking on memory bandwidth.
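Rough arithmetic for where that 392 TB/s figure comes from and how batching amortizes it (a sketch only; it ignores KV-cache and activation traffic, and the batch size below is hypothetical):

    weight_bytes = 405e9 * 1                # 405B params at int8 -> ~405 GB per step
    tok_per_s = 969
    bw_batch1_tbs = weight_bytes * tok_per_s / 1e12   # ~392 TB/s at batch size 1
    batch = 32                              # hypothetical batch size
    bw_per_seq_tbs = bw_batch1_tbs / batch  # weights are read once per step and
                                            # reused across the whole batch
    print(bw_batch1_tbs, bw_per_seq_tbs)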
Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.
Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.
From what I have read, it is a maximum of 23 kW per chip, and each chip goes into a 16U enclosure. That said, you would need at least 460 kW of power to run the setup you described.
As for retail pricing being $2.5 million, I read $2 million in a news article earlier this year. $2.5 million makes it sound even worse.
Do you think that the 16k GPUs get used once and then are thrown away? Llama 405B was trained over 56 days on the 16k GPUs; if I round that up to 60 days and assume the current mainstream rate of $2/H100/hour from the Neoclouds (which are obviously making margin), that comes out to a total cost of ~$47M. Obviously Meta is training a lot of models on their GPU fleet and would expect it to be in service for at least 3 years, and their cost is obviously lower than public cloud pricing.
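For what it's worth, the arithmetic behind that ~$47M; I've assumed 16,384 GPUs, since the comment above only says 16k:

    gpus = 16384                 # assumed exact count; "16k" above
    days = 60                    # 56 days of training, rounded up
    dollars_per_gpu_hour = 2.0   # mainstream H100 cloud pricing cited above
    cost = gpus * days * 24 * dollars_per_gpu_hour   # ~$47.2M
    print(f"~${cost / 1e6:.1f}M")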
+1 this commenter. I just visited the UK for the first time at the beginning of this month and had a fantastic ~3 hours at Bletchley Park, but felt I had to cram TNMOC, the amazing Colossus live demonstration (where I asked a million questions), and everything else in the museum into the 90 minutes I was there. If other HN readers are like me, I would dedicate at least 2.5-3 hours to TNMOC to actually get a chance to see and play around with their extensive collection of vintage machines.
Founder of REX Computing here; I highly recommend checking out my interview on the Microarch Club podcast linked elsewhere in the thread. I'll also answer questions in this thread if anyone has them.
The teaser reminds me a lot of other *failed* high-performance/high-efficiency architecture redesigns that collapsed under the unreasonable effort required to squeeze out a useful fraction of the promised gains, e.g. Transputer and Cell. Can you link to written documentation of how existing code can be ported? I doubt you can just recompile ffmpeg or libx264, but what level of toolchain support can early adopters expect? Does it require manually partitioning code+data and mapping it to the on-chip network topology?
We had a basic LLVM backend that supported a slightly modified clang frontend and a basic ABI. We tried to make it drastically easier for both the programmer and compiler to handle memory by having all memory (code+data) be part of a global flat address space across the chip, with guarantees being made to the compiler by the NoC on the latency of all memory accesses across one or multiple chips. We tested this with very small programs that could fit in the local memory of up to two chips (128KB of memory), but in theory it could have scaled up to the 64 bit address space limit. Compilation time for programs was long, but fully automated, specifically to improve upon problems faced by Cell and other scratchpad memory architectures… some of our original funding in 2015 from DARPA was actually for automated scratchpad memory management techniques on Texas Instruments DSPs and Cell (our paper: https://dl.acm.org/doi/pdf/10.1145/2818950.2818966)
This was all designed a decade ago, and REX has effectively been in hibernation since the end of 2017, after successfully taping out our 16-core test chip back in 2016 but being unable to raise additional funding to continue. I have continued to work on architectures that leverage scratchpad memories in different ways, including on cryptocurrency and machine learning ASICs, most recently at my current startup, Positron AI (https://positron.ai).
Between Moore's Law and Gates's Law, I know which one I would prefer to be the industry standard... https://en.wikipedia.org/wiki/Andy_and_Bill%27s_law