
Is there really that big a difference in TFLOPS between the GB100 and GB202 chips? The GB100 has fewer SMs than the GB202, so I'm confused about where the 10x performance would be coming from?


You're asking a really good question but it's not a question with an easy answer.

There's a lot more to performance computing than FLOPs. FLOPs are a good, high-level, easy-to-understand metric, but they're a small part of the story once you're in the weeds.

To help make sense of this, look at CPU frequencies. I think most people on HN know that two CPUs with the same frequency can have dramatically different outcomes on benchmarks, right? You might know that some of this comes down to things like IPC (instructions per cycle) or the cache structure. There's even more to it, but we know it's not so easy to measure, right?

On a GPU all of that is true, but there's even more complexity. Your GPU is more like a whole motherboard where your PCIe connection is a really, really fast network connection. There are plenty of faults in this analogy, but it's closer than just comparing TFLOPs.

Nvidia's moat has always been "CUDA". Quotes because even that is a messier term than most think (Cutlass, cuBLAS, cuDNN, CuTe, etc). The new cards are just capable of things the older ones aren't. It's a mix of hardware and software.

I know this isn't a great answer, but there isn't one. You'll probably get some responses, and many of them will have parts of the story, but it's hard to paint a really good picture in a comment. There's no answer that is both good and short.


No, GPUs are a lot simpler. You can mostly just take the clock rate and scale it directly for the instruction being compared.


There's a 2x performance hit from the weird restriction on fp32 accumulation, plus the fact that 5090 has "fake" Blackwell (no tcgen05) which limits the size and throughput of matrix multiplication through the tensor cores.


Land in the Western Australia wheat belt sells for less than $1000/acre. Is that very expensive?


Yes that bugged me too. If you replace 'precisely' with 'approximately' everywhere in the article it becomes much improved ;)


There's basically a difference in philosophy. GPU chips have a bunch of cores, each of which is semi-capable, whereas TPU chips have (effectively) one enormous core.

So GPUs have ~120 small systolic arrays, one per SM (aka, a tensor core), plus passable off-chip bandwidth (aka 16 lanes of PCIe).

Whereas TPUs have one honking big systolic array, plus large amounts of off-chip bandwidth.

This roughly translates to GPUs being better if you're doing a bunch of different small-ish things in parallel, but TPUs are better if you're doing lots of large matrix multiplies.


The hand-waving explanation: The slower you're going, the easier (cheaper) it is to change direction. And for elliptical orbits, the outer-most part of the orbit is where you're going slow.

So to make a drastic change in direction (aka, a very different orbit):

1. First, burn to push the far point of your orbit way out from the body you're orbiting, so that when you get out there you're going very slowly.

2. Then burn to make a large change in direction (orbit).

3. Then wait until you cross your desired final orbit, and burn again to close it.

The tradeoff is that these types of orbit change are very slow (because you want to be going very slow for the middle burn, which means you take ages to get there).
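
A rough sketch of the numbers (plain Python, vis-viva equation, illustrative orbit sizes rather than any particular mission):

    import math

    MU = 3.986e14  # Earth's gravitational parameter, m^3/s^2

    def speed(r, a):
        # Vis-viva: orbital speed at radius r on an orbit with semi-major axis a
        return math.sqrt(MU * (2 / r - 1 / a))

    r_low = 6_771e3       # ~400 km circular orbit (illustrative)
    r_far = 200_000e3     # far-away apoapsis after the first burn (illustrative)
    a_transfer = (r_low + r_far) / 2

    v_low = speed(r_low, r_low)        # ~7.7 km/s in the low circular orbit
    v_far = speed(r_far, a_transfer)   # ~0.36 km/s way out at apoapsis

    # Rotating the velocity vector by theta costs 2*v*sin(theta/2),
    # so a 60-degree change of direction:
    theta = math.radians(60)
    print(2 * v_low * math.sin(theta / 2))  # ~7.7 km/s if done down low
    print(2 * v_far * math.sin(theta / 2))  # ~0.36 km/s if done way out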


Not quite: It's taking advantage of (1+a)(1+b) = 1 + a + b + ab. And when a and b are both small-ish, ab is really small and can just be ignored.

So it turns the (1+a)(1+b) into 1+a+b. Which is definitely not the same! But it turns out, machine guessing apparently doesn't care much about the difference.
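
A two-line sketch of how small the dropped term is (plain Python, made-up small values):

    a, b = 0.03, -0.02              # both small-ish
    exact = (1 + a) * (1 + b)       # 1 + a + b + a*b
    approx = 1 + a + b              # drop the a*b cross term
    print(exact, approx, exact - approx)  # error is a*b = -0.0006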


You might then as well replace the multiplication with addition in the original network. In that case you're not even approximating anything.

Am I missing something?


They're applying that simplification to the exponent bits of an 8-bit float. The range is so small that the approximation to multiplication is going to be pretty close.
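
For intuition, here's the same "add the representations instead of multiplying" idea applied to float32 bit patterns (a well-known trick; this is a sketch of the flavour of it, not the paper's exact FP8 L-Mul algorithm):

    import struct

    def bits(x):    # reinterpret a float32 as a uint32
        return struct.unpack('<I', struct.pack('<f', x))[0]

    def unbits(n):  # and back again
        return struct.unpack('<f', struct.pack('<I', n))[0]

    BIAS = 0x3F800000  # bit pattern of 1.0; cancels the doubled exponent bias

    def approx_mul(a, b):
        # Integer-add the bit patterns: exponents add exactly,
        # mantissas add instead of multiplying (the dropped a*b term).
        return unbits(bits(a) + bits(b) - BIAS)

    print(approx_mul(1.5, 2.25), 1.5 * 2.25)  # ~3.25 vs 3.375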


Plus the 2^-l(m) correction term.

Feels like multiplication shouldn't be needed for convergence, just monotonicity? I wonder how well it would perform if the model was actually trained the same way.


This trick is used a ton when doing hand calculations in engineering as well. It can save a lot of work.

You're going to have tolerance on the result anyway, so what's a little more error. :)


I had the same thought: Just eye-balling the graphs, the result of the subtraction looks very close to just reducing the temperature.

They're effectively doing softmax with a fixed temperature, but it's unclear that this work is going to do better than just learning a per-head temperature parameter.

c.f. https://arxiv.org/abs/2010.04245 which shows an improvement by learning per-head temperature.

The other way to think about this is that it looks like a hacked-up kinda-sorta gated attention. If that's the case, then doing softmax(alpha * q_1 k_1^T - log_sigmoid(beta * q_2 k_2^T)) might be better? (where alpha, beta are learned temperatures).
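
For concreteness, a minimal sketch of plain softmax attention with a per-head temperature (NumPy, made-up shapes and alpha values, not the cited paper's implementation):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    H, T, D = 4, 8, 16                      # heads, tokens, dims per head
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(H, T, D)) for _ in range(3))

    # one temperature per head (hand-picked here; it would be learned in training)
    alpha = np.array([0.5, 1.0, 2.0, 4.0]).reshape(H, 1, 1)

    scores = alpha * np.einsum('htd,hsd->hts', q, k)  # alpha * q k^T
    out = softmax(scores, axis=-1) @ v                # (H, T, D)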


The paper says "... optimized on next-word prediction only". Which is absolutely correct in 2023.

ChatGPT (and indeed all recent LLMs) uses much more complex training methods than simply 'next-word prediction'.


This passage makes two claims

* One claim, applicable to current language models (ChatGPT being one of them), is that "they fail to capture several syntactic constructs and semantic properties" and that "their linguistic understanding is superficial". It gives an example, "they tend to incorrectly assign the verb to the subject in nested phrases like ‘the keys that the man holds ARE here’", which is not the kind of mistake that ChatGPT makes.

* The other claim is that "when text generation is optimized on next-word prediction only" then "deep language models generate bland, incoherent sequences or get stuck in repetitive loops". Only this second claim is about next-word prediction.


Yeah, that struck me too. I followed one of the refs at random and it was to a 2020 paper about RNNs.


Like most things, it's more complex than that, and as a result it can be either faster or slower than 'median(RTT to each DC in quorum)'.

It's a delicate balance based on the locations that rows are being read and written from. In the case where a row is being repeatedly written from only one location and not being read from a different location, the writes can be significantly faster than you'd naively expect.


> Like most things, it's more complex than that,

Sure, no doubt. My point wasn't really about the particularities. It was around the mistaken idea that I see sometimes where people believe that TrueTime allows for synchronized global writes without any need for consensus.


The speed of light in vacuum is a hard upper limit. Most signal paths will be dominated by fibre optics (about 70% of c) and switching (adding more delay).

But yes, TrueTime will not magically allow data to propagate at faster-than-light speeds.
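
Back-of-the-envelope numbers (illustrative great-circle distance, not a real fibre route):

    C_KM_S = 299_792           # speed of light in vacuum, km/s
    FIBRE_KM_S = 0.7 * C_KM_S  # rough signal speed in fibre optics

    def one_way_ms(km, speed_km_s):
        return km / speed_km_s * 1000

    km = 5_600  # roughly New York -> London, straight line (illustrative)
    print(one_way_ms(km, C_KM_S))      # ~18.7 ms, hard lower bound
    print(one_way_ms(km, FIBRE_KM_S))  # ~26.7 ms, before any switching delay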


> Gradient accumulation doesn't work with batch norms so you really need that memory.

Last I looked, very few SOTA models are trained with batch normalization. Most of the LLMs use layer norms, which can be accumulated? (Precisely because of the need to avoid the memory blowup.)

Note also that batch normalization can be done in a memory-efficient way: it just requires aggregating the batch statistics outside the gradient accumulation.
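
A minimal sketch of gradient accumulation itself (PyTorch-style, hypothetical tiny model and random data): with LayerNorm the accumulated gradients match one big batch, whereas BatchNorm would still compute its statistics per micro-batch, which is the problem the parent is pointing at.

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(32, 32), torch.nn.LayerNorm(32), torch.nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    accum_steps = 4  # 4 micro-batches of 8 ~ one batch of 32, gradient-wise

    opt.zero_grad()
    for _ in range(accum_steps):
        x, y = torch.randn(8, 32), torch.randn(8, 1)   # one micro-batch
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()                # gradients sum across micro-batches
    opt.step()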


wav2vec2, whisper, HifiGAN, Stable Diffusion, and Imagen all use BatchNorm.

