That's why we need a row of ALUs in the RAM chips: read an entire DRAM row and feed it straight into a vector operation. Since row reads are slow anyway, each ALU could take many cycles per operation to keep its area small.
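A back-of-the-envelope sketch of that tradeoff, with all numbers (row width, row cycle time, clock rate) being illustrative assumptions rather than specs for any real part: as long as each per-column ALU finishes within one row cycle, it never becomes the bottleneck, so it can be made slow and tiny.

```python
# Model: how many cycles can a per-element ALU take before it, rather
# than the DRAM row read, limits throughput? Numbers are assumptions.

ROW_BITS = 8192          # bits activated per DRAM row (assumed)
ELEM_BITS = 32           # vector element width (assumed)
ELEMS_PER_ROW = ROW_BITS // ELEM_BITS   # elements available per row read

T_RC_NS = 45.0           # row cycle time: activate + restore (assumed)
CLK_NS = 2.0             # internal PIM clock period, 500 MHz (assumed)

# One multi-cycle ALU per element, all operating in parallel: each ALU
# only needs to finish before the next row read completes.
max_alu_cycles = int(T_RC_NS // CLK_NS)

# Effective throughput is then set by the row read, not the ALU.
elems_per_sec = ELEMS_PER_ROW / (T_RC_NS * 1e-9)

print(f"elements per row read: {ELEMS_PER_ROW}")
print(f"ALU may take up to {max_alu_cycles} cycles per op")
print(f"throughput: {elems_per_sec / 1e9:.2f} G elements/s")
```

Under these assumed numbers the ALU gets a budget of ~22 cycles per operation for free, which is why a compact multi-cycle (even bit-serial) design can keep up with the row buffer.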
On GPUs that pull a kilowatt while running, yes. This might actually work on an FPGA, provided the addition doesn't take too many clock cycles, unlike the matmuls, which were too slow.