
You're likely going to have a rough time with 64-bit arithmetic in any GPU. (At least on Nvidia GPUs, the instruction set doesn't give you anything but a 32-bit add-with-carry to help.) But my understanding is that a lot of the arithmetic hardware used for 53-bit double-precision ops can also be used for 32-bit integer ops, which hasn't always been the case.


The PTX ISA for Nvidia GPUs supports 64-bit integer arithmetic:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

It needs to support 64-bit integer arithmetic to handle 64-bit address calculations efficiently. The SASS ISA since Volta has explicit 32I-suffixed integer instructions alongside the regular integer instructions, so I would expect the regular instructions to be 64-bit, although the documentation leaves something to be desired:

https://docs.nvidia.com/cuda/cuda-binary-utilities/index.htm...


Hmm, it looks like the underlying SASS does have 64-bit integer instructions now, but only with the 12.0 capability level in the recent Blackwell processors. Older versions emulate it via chained 32-bit instructions. Take this example kernel:

  __global__ void add(uint64_t *res, uint64_t x) {
      *res = x + 0x12345;
  }
Compiled with -arch=sm_120, I get the SASS:

  /*0000*/  LDC R1, c[0x0][0x37c]                      ?trans8;
  /*0010*/  LDC.64 R2, c[0x0][0x388]          &wr=0x0  ?trans1;
  /*0020*/  LDCU.64 UR4, c[0x0][0x358]        &wr=0x1  ?trans7;
  /*0030*/  LDC.64 R4, c[0x0][0x380]          &wr=0x1  ?trans1;
  /*0040*/  IADD.64 R2, R2, 0x12345           &req={0} ?WAIT6_END_GROUP;
  /*0050*/  STG.E.64 desc[UR4][R4.64], R2     &req={1} ?trans1;
  /*0060*/  EXIT                                       ?trans5;
  /*0070*/  BRA 0x70;
But with -arch=sm_100, the IADD.64 is broken up into a UIADD3 and a UIADD3.X (contrast with the IADD3 that a regular 32-bit addition would produce):

  /*0000*/  LDC R1, c[0x0][0x37c] ;
  /*0010*/  LDCU.64 UR4, c[0x0][0x388] ;
  /*0020*/  LDC.64 R2, c[0x0][0x380] ;
  /*0030*/  LDCU.64 UR6, c[0x0][0x358] ;
  /*0040*/  UIADD3 UR4, UP0, UPT, UR4, 0x12345, URZ ;
  /*0050*/  UIADD3.X UR5, UPT, UPT, URZ, UR5, URZ, UP0, !UPT ;
  /*0060*/  MOV R4, UR4 ;
  /*0070*/  MOV R5, UR5 ;
  /*0080*/  STG.E.64 desc[UR6][R2.64], R4 ;
  /*0090*/  EXIT ;
  /*00a0*/  BRA 0xa0;
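The UIADD3 / UIADD3.X pair above is the standard 32-bit carry chain: the first add produces a carry-out predicate, which the .X variant folds into the high-half add. A rough C mirror of that decomposition (a sketch of the idea, not the exact SASS semantics; function and variable names are mine):

```c
#include <stdint.h>

/* Emulate a 64-bit add of a 32-bit immediate using only 32-bit
 * operations, the way the chained UIADD3 / UIADD3.X pair does:
 * low add first, carry-out feeds the high add. */
uint64_t add64_chained(uint64_t x, uint32_t imm) {
    uint32_t lo = (uint32_t)x + imm;            /* low 32-bit add, may wrap */
    uint32_t carry = lo < (uint32_t)x;          /* carry-out of the low add */
    uint32_t hi = (uint32_t)(x >> 32) + carry;  /* high add with carry-in */
    return ((uint64_t)hi << 32) | lo;
}
```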
So if you want real 64-bit support, have fun getting your hands on a 5070! But even on sm_120, something like multiplying a 64-bit value by a 32-bit immediate takes a UIMAD.WIDE.U32 + UIMAD + UIADD3 sequence, so the support isn't all that complete.
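A plausible reason the 64-bit × 32-bit multiply still takes a three-instruction sequence: it has to be split into halves, with a widening 32×32→64 multiply for the low half and a truncating multiply for the high half. A C sketch of that decomposition (names are mine; this mirrors the idea, not the exact SASS):

```c
#include <stdint.h>

/* 64-bit x 32-bit multiply built from 32-bit pieces:
 *   a * b mod 2^64 = (a_lo * b)            widening 32x32 -> 64
 *                  + ((a_hi * b) << 32)    truncating 32x32 -> 32
 */
uint64_t mul_u64_u32(uint64_t a, uint32_t b) {
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint64_t wide = (uint64_t)alo * b;   /* like UIMAD.WIDE.U32 */
    uint32_t hi = ahi * b;               /* high product, truncated */
    return wide + ((uint64_t)hi << 32);  /* final combining add */
}
```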

(I've been looking into the specifics of CUDA integer arithmetic for some time now, since I've had the mad idea of doing 'horizontal' 448-bit integer arithmetic by storing one word in each thread and using the warp-shuffle instructions to send carries up and down. Given that the underlying arithmetic is all 32-bit, it doesn't make any sense to store more than 31 bits per thread. Then again, I don't know whether this mad idea makes any sense in the first place, until I implement and profile it.)
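For what it's worth, the 31-bits-per-thread choice can be sketched on the CPU side: with limbs below 2^31, the sum of two limbs plus a carry always fits in an unsigned 32-bit word, so the carry is just bit 31. A sequential C sketch of that limb arithmetic (limb count and names are my assumptions; on the GPU each limb would live in one thread and the carry would travel by warp shuffle, with the loop below playing that role):

```c
#include <stdint.h>

#define NLIMBS 15                 /* hypothetical: 15 x 31 bits >= 448 bits */
#define LIMB_MASK 0x7FFFFFFFu     /* low 31 bits of each word */

/* Add two multi-word integers stored 31 bits per 32-bit limb.
 * Because each limb is < 2^31, a[i] + b[i] + carry cannot overflow
 * a 32-bit word, so the carry out is simply s >> 31 -- the reason
 * to keep only 31 bits per thread on a 32-bit adder. */
void add31(const uint32_t *a, const uint32_t *b, uint32_t *out) {
    uint32_t carry = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint32_t s = a[i] + b[i] + carry;  /* fits in 32 bits */
        out[i] = s & LIMB_MASK;
        carry = s >> 31;
    }
}
```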


I'm less concerned about it being 32-bit and more about the instructions being exclusively scalar, with no vector equivalents, meaning they're only useful for uniforms, not thread-specific data.

[Update: I remembered and double-checked. While there are only scalar 32-bit integer instructions, you can use 24-bit integer vector instructions, which essentially ignore the exponent part of the floats.]
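The 24-bit trick works because a float's 24-bit significand represents every integer up to 2^24 exactly, so the floating-point datapath can do exact small-integer arithmetic. A C illustration of the underlying property (not the actual vector instructions; the function name is mine):

```c
#include <stdint.h>

/* Exact integer addition through the float datapath: as long as the
 * result stays below 2^24, both operands and the sum are exactly
 * representable in a float's 24-bit significand, so no rounding
 * occurs. This is the property 24-bit integer vector ops exploit. */
uint32_t add24_via_float(uint32_t a, uint32_t b) {
    /* precondition (assumed): a + b < 2^24 */
    float fa = (float)a, fb = (float)b;
    return (uint32_t)(fa + fb);
}
```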


The programming model is that all threads in the warp / thread block run the same instruction (barring masking for branch divergence). SIMD instructions at the thread level are a rarity, because SIMD is implemented across warps / thread blocks (groups of warps). Such instructions do exist, but only within 32-bit words and really only for limited use cases, since the proper way to do SIMD on the GPU is to have all of the threads execute the same instruction:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

Note that I am using the Nvidia PTX documentation here. I have barely looked at the AMD RDNA documentation, so I cannot cite it without doing a bunch of reading.


I know all of that. I was talking about RDNA2, which is AMD. There, instructions come in two flavours:

1. Scalar - run once per thread group, acting only on uniform (scalar) data. So these won't be SIMD.

2. Vector - run across all threads, each thread accessing its own copy of the variables. This is what you typically think of GPU instructions doing.


That does sound like it would be a pretty big limitation. But there appear to be plenty of vector instructions for 32-bit integers in RDNA2 and RDNA3 [0] [1]. They're named V_*_U32 or V_*_I32 (e.g., V_ADD3_U32), even including things like a widening multiply V_MAD_U64_U32. The only thing missing is integer division, which is apparently emulated using floating-point instructions.
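For reference, the operation V_MAD_U64_U32 performs is a 32×32→64 widening multiply plus a 64-bit addend, which in C is just (the function name is mine):

```c
#include <stdint.h>

/* Sketch of what a widening multiply-add like V_MAD_U64_U32 computes:
 * both 32-bit operands are widened, multiplied to a full 64-bit
 * product, and a 64-bit accumulator is added. */
uint64_t mad_u64_u32(uint32_t a, uint32_t b, uint64_t c) {
    return (uint64_t)a * (uint64_t)b + c;
}
```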

[0] https://www.amd.com/content/dam/amd/en/documents/radeon-tech..., p. 259, Table 83, "VOP3A Opcodes"

[1] https://www.amd.com/content/dam/amd/en/documents/radeon-tech..., p. 160, Table 85, "VOP3 Opcodes"



