Hacker News: kevmo314's comments

Is this an Android thing? My US iOS works fine with digital Suica.

Apple doesn't make regional variants of the phone, so all models have the technology built-in, even if it's disabled by default. Android phones outside of Japan lack Suica support.

And Pixel phones have the tech, but you need to flash a Japanese ROM to be able to use it.

Ahh interesting. I wonder why they (non-Apple) did that?

Suica uses Sony's FeliCa NFC standard (which predates the international NFC standards), and Sony charges a per-device license fee for it.

Apple has the margins to just pay the license fee for every iPhone; Android makers try to keep their costs down more.


There's a chip required for it that they cheaped out on.

Isn't this turning a GPU into a slower CPU? It's not like CPUs are slow; in fact they're quite a bit faster than any single GPU thread. If code is written in a GPU-unaware way, it's not going to take advantage of the reasons for being on the GPU in the first place.

We have this issue in GFQL right now. We wrote the first OSS GPU Cypher query language implementation, where we make a query plan of GPU-friendly collective operations... but today the steps are coordinated via Python, which has high constant overheads.

We are looking to shed some of the Python<->C++<->GPU overheads by pushing macro steps out of Python and into C++. However, it'd probably be way better to skip all the CPU<->GPU back-and-forth by coordinating the task queue on the GPU to begin with. It's 2026, so ideally we can use modern tools and type safety for this.
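A toy cost model of why on-GPU coordination helps (the constants here are made up for illustration, not GFQL measurements):

```python
# Hypothetical per-dispatch cost model: each Python->C++->GPU round trip
# pays a fixed overhead on top of the useful kernel time, so fusing N
# steps into one GPU-resident plan amortizes that overhead once.
PER_CALL_OVERHEAD_US = 20.0  # assumed constant dispatch cost (illustrative)
KERNEL_TIME_US = 5.0         # assumed useful GPU work per step (illustrative)

def python_coordinated(n_steps: int) -> float:
    # every step pays the dispatch overhead
    return n_steps * (PER_CALL_OVERHEAD_US + KERNEL_TIME_US)

def gpu_coordinated(n_steps: int) -> float:
    # one dispatch; subsequent steps are chained on-device
    return PER_CALL_OVERHEAD_US + n_steps * KERNEL_TIME_US

print(python_coordinated(100))  # 2500.0
print(gpu_coordinated(100))     # 520.0
```

With 100 steps, the constant overhead dominates the Python-coordinated path, which is the effect described above.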

Note: I looked at the company's GitHub and didn't see any relevant OSS, which changes the calculus for a team like ours. Sustainable infra is hard!


We are the maintainers of https://github.com/rust-gpu/rust-gpu and https://github.com/Rust-GPU/Rust-CUDA FWIW. We haven't upstreamed the VectorWare work yet as it is still being cleaned up and iterated on.

> It's not like CPUs are slow, in fact they're quite a bit faster than any single GPU thread.

This was overwhelmingly true ten years ago, not so much now.

Modern GPU threads run at about 3 GHz; CPUs are still slightly faster in theory, but the larger amounts of local fast memory make GPU threads pretty competitive in practice.


Are you writing this from the future? The latest-gen Nvidia GPUs sit at around 2-2.5 GHz and the latest-gen AMD CPUs sit at around 4-5 GHz.

That matches my personal experience too: naive CUDA code that doesn't take advantage of parallelism runs at roughly half the speed of the same code on the CPU.


I've seen this objection pop up every single time and I still don't get it.

GPUs run 32, 64, or even 128 vector lanes at once. If you have a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence, etc., how is it supposed to be slower?

Consider the following:

You have a hyperoptimized matrix multiplication kernel and you also have your inference engine code that previously ran on the CPU. You now port the critical inference engine code to directly run on the GPU, thereby implementing paged attention, prefix caching, avoiding data transfers, context switches, etc. You still call into your optimized GPU kernels.

Where is the magical slowdown supposed to come from? The mega kernel researchers are moving more and more code to the GPU and they got more performance out of it.

Is it really that hard to understand that the CUDA-style programming model is inherently inflexible and limiting? I think the fundamental problem here is that Nvidia marketing gave an incredibly misleading impression of how the hardware actually works. GPUs don't have thousands of cores, as the "CUDA core" marketing suggests; they have about a hundred "barrel CPU"-like cores.

The RTX 5090 is advertised as having 21760 CUDA cores. This number is meaningless in practice, since "CUDA cores" are purely a software concept that doesn't exist in hardware; the vector processing units are not cores. The RTX 5090 actually has 170 streaming multiprocessors, each with its own instruction pointer that you can target independently, just like a CPU. The key restriction is that if you want maximum performance you need to take advantage of all 128 lanes, and you also need enough thread copies that differ only in the subset of data they process, so that the GPU can switch between them while it is working on multi-cycle instructions (memory loads and the like). That's it.
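A back-of-the-envelope check of the arithmetic above, using the figures from the comment:

```python
# RTX 5090 figures as stated in the comment above.
SMS = 170            # streaming multiprocessors: the actual independent cores
LANES_PER_SM = 128   # vector lanes per SM

# The advertised "CUDA core" count is just the product of the two:
print(SMS * LANES_PER_SM)  # 21760
```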

Here is what you can do: take a bunch of streaming multiprocessors, let's say 8, and use them to run your management code on the GPU side without having to transfer data back to the CPU. When you want to do heavy lifting you are in luck, because you still have 162 streaming multiprocessors left to do whatever you want. You proceed to call into cuDNN and get great performance.


> a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

But the library is using a warp as a single thread


Each SM should have 4 independent SMSPs (32 lanes each), no? Effectively a "4-core" task-parallel system per SM.

SMSP = Streaming Multiprocessor Sub-Partition, found in recent NVIDIA architectures: each streaming multiprocessor is effectively partitioned into multiple complete sub-cores with separate register files and program counters, but access to the same local memory. (AMD architectures have a similar development with 'dual' compute units.) This creates overhead when running very large warps, since they only have access to a fraction of the complete SM. But warps under the VectorWare model should be fairly small (running CPU-like code with fairly limited use of lane parallelism), so this doesn't have much impact from that POV.
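The sub-partition arithmetic implied above, sketched out (per the figures in this thread):

```python
# Recent NVIDIA SM layout as described in the comment above.
SMSPS_PER_SM = 4   # sub-partitions, each with its own PC and register file
WARP_WIDTH = 32    # lanes per warp; a warp is scheduled on one SMSP

print(SMSPS_PER_SM * WARP_WIDTH)  # 128 lanes across the whole SM

# A warp confined to one SMSP only sees that sub-partition's share
# of the SM's register file:
print(1 / SMSPS_PER_SM)  # 0.25
```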

> a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

Sure, if you have that then of course it would be fast. But that’s not what this library is proposing.


I really appreciate the way you've explained this. Are there any resources you recommend to reach your level of understanding?

Additionally there is still too much performance left on the table by not properly using CPU vector units.

SIMD performance in modern Intel and AMD cpus is so bad that it is useless outside very specific circumstances.

This is mainly because vector instructions are implemented by sharing resources with other parts of the CPU, which more or less stalls pipelines, significantly reduces IPC, and makes out-of-order execution ineffective.

The shared resources often involve floating-point registers and compute, so it's a double whammy.


Yet it is still faster than not using it at all, or than calling into the GPU, on workloads where bus traffic takes the majority of execution time.

The comparison is often just plain old linear code.

For example, one simd instruction vs multiple arithmetic instructions.

  x1 += y1
  x2 += y2
  x3 += y3
  x4 += y4
  
We have fifty years of CPU design optimizing for this. More often than not, you'll find this works better than vector instructions in practice.

The concept behind vector instructions is great, and it starts to work out for larger widths like 512 bits. But it's extremely tricky to take advantage of that much SIMD with a compiler or manually.
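A rough sketch of why width matters, counting 32-bit float lanes at common SIMD register widths:

```python
# float32 lanes per vector register at common SIMD widths.
for bits in (128, 256, 512):
    print(bits, "->", bits // 32, "lanes")

# A 128-bit register covers 4 floats, which is no better than the 4-way
# scalar unrolling shown above; a 512-bit register covers 16, which is
# where SIMD starts to pull ahead of the scalar pipeline.
```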


Hello, where can I read more about this? It's the first time I hear SIMD has drawbacks and I'm interested in hearing more different opinions.

Yet there are gains from doing e.g. string searches with SIMD, which you naturally aren't going to do in CUDA.
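The SIMD flavor of such a search can be sketched in pure Python: scan fixed-size chunks and only do per-byte work inside a chunk that might match, mirroring how a SIMD search compares many bytes per instruction (the `in` test below stands in for a vector compare; the function and chunk size are illustrative, not from any particular library):

```python
def chunked_find(haystack: bytes, needle: int, chunk: int = 32) -> int:
    """Find the first index of byte `needle`, scanning 32-byte chunks."""
    for base in range(0, len(haystack), chunk):
        block = haystack[base:base + chunk]
        if needle in block:  # stands in for a SIMD compare-and-mask
            return base + block.index(needle)
    return -1

data = b"x" * 1000 + b"!" + b"x" * 7
print(chunked_find(data, ord("!")))  # 1000
```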

For sure, it makes sense for nice well defined problems that execute in isolation.

Think of the situation where the string search is running on a system that has hyper threading and a bunch of cores, and a normal amount of memory bandwidth.

It'll be faster, but at the same time make everything else worse if you overuse vector instructions.

(also cherry on top: some modern CPUs automagically lower the clock when they encounter vector instructions!!!)


If you are interested in doing this in golang I wrote a library to avoid needing cgo: https://github.com/kevmo314/go-usb

I use this to access UVC devices correspondingly without cgo: https://github.com/kevmo314/go-uvc



Was this shoveled out with Claude/Codex to try to ride off the Bonsai release?


If someone else finds a cryptocurrency vulnerability, they too will reallocate as much of your allocation as they can and cash it out.


often, a mind capable of doing something like this is not the kind that gives a lot of sh*t about things like "money" so I would put a chance of your statement being true at ... 12.78% :)


A fool and their money are easily parted.


This might be some journalistic confusion. If you go to the CERN documentation at https://twiki.cern.ch/twiki/bin/view/CMSPublic/AXOL1TL2025 it states

> The AXOL1TL V5 architecture comprises a VICReg-trained feature extractor stacked on top of a VAE.


What is git if not a database for source code?


Meh, then filesystems are databases for bytes. Airplanes are buses for flying.

I could make that argument, but I wouldn’t believe it.


Both of those statements are true.


It is funny that the AI's counterarguments amount to "you're hallucinating"


Hahaha, probably right though.


Someone with the ability to register .gov domains is trying to make a sneaky buck.


It used to be that anyone with a fax machine could register a .gov domain, but they fixed that after a news report about it.


I was curious about this, so for anyone else who’s interested, here’s a KrebsOnSecurity article from 2019 about how easy it was to fraudulently register a .gov domain:

https://krebsonsecurity.com/2019/11/its-way-too-easy-to-get-...

I haven’t seen any follow-up reporting, but it looks like the process now requires some semblance of identity verification:

https://get.gov/domains/before/


Not a lawyer, but as I understand it, copyright is bound to distribution, so if the person's perfect memorization of a book results in them reproducing it verbatim, then probably yes.

