
What is a typical workload you speak of where CPUs are better?

We've been implementing GPU support in Presto/Velox for analytical workloads, and I've yet to see a use case where we wouldn't pull ahead.

The DRAM-VRAM transfer isn't really a bottleneck on GH/GB platforms (you can pull 400+ GB/s across the NVLink-C2C link). And on NVL8 systems like the typical A100/H100 deployments out there, running real workloads where the data arrives over the network, you're toast without GPUDirect RDMA anyway.



Even without NVLink-C2C, a GPU with 16x PCIe 5.0 lanes to the host gives you 128 GB/s theoretical and 100+ GB/s practical bidirectional bandwidth (half that in each direction), so you still come out ahead with pipelining.

Of course prefix sums are often used within a series of other operators, so if these are already computed on GPU, you come out further ahead still.
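
To make the pipelining point concrete, here's a minimal sketch of the usual pattern (the chunk size, the two streams, and the placeholder kernel are all illustrative, not Velox code): chunked cudaMemcpyAsync into pinned buffers on a couple of streams, so the PCIe transfers overlap with kernel execution.

    // Hedged sketch: hide host<->device copies behind compute by double-buffering
    // across two CUDA streams. The kernel is a stand-in for a real operator.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void process(const float* in, float* out, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;  // placeholder work
    }

    int main() {
        const size_t total = 1 << 26, chunks = 8, chunk = total / chunks;
        float *h_in, *h_out, *d_in, *d_out;
        cudaMallocHost((void**)&h_in,  total * sizeof(float));  // pinned memory is
        cudaMallocHost((void**)&h_out, total * sizeof(float));  // required for async copies
        cudaMalloc((void**)&d_in,  2 * chunk * sizeof(float));  // double buffers on device
        cudaMalloc((void**)&d_out, 2 * chunk * sizeof(float));
        for (size_t i = 0; i < total; ++i) h_in[i] = 1.0f;

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (size_t c = 0; c < chunks; ++c) {
            size_t b = c % 2;                        // ping-pong between buffers/streams
            float* din  = d_in  + b * chunk;
            float* dout = d_out + b * chunk;
            cudaMemcpyAsync(din, h_in + c * chunk, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process<<<(unsigned)((chunk + 255) / 256), 256, 0, s[b]>>>(din, dout, chunk);
            cudaMemcpyAsync(h_out + c * chunk, dout, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        cudaDeviceSynchronize();
        printf("h_out[0] = %f\n", h_out[0]);

        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
        cudaFree(d_in); cudaFree(d_out);
        cudaFreeHost(h_in); cudaFreeHost(h_out);
        return 0;
    }

With operations on each stream serialized, buffer reuse is safe, and the copy on one stream runs while the other stream's kernel executes.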


Haha... GPUs are great. But do you mean to suggest we should swap a single ARM core for a top-line GPU with 10k+ cores and compare numbers on that basis? Surely not.

Let's consider this in terms of throughput-per-$ so we have a fungible measurement unit. I think we're all agreed that this problem's bottleneck is the host memory<->compute bus, so the question is: for $1, which server architecture lets you pump more data from memory to a compute core?

It looks like you can get an H100 GPU with 16x PCIe 5.0 (128 GB/s theoretical, ~100 GB/s realistic) for $1.99/hr from RunPod.

With an m8g.8xlarge instance (32 ARM CPU cores) you should get much better RAM<->CPU throughput (~175 GB/s realistic) for $1.44/hr from AWS.


A GH200 is $1.50/hr at Lambda and can do 450 GB/s to the GPU, so it still seems to come out cheaper per unit of throughput?
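
A rough GB/s-per-dollar comparison using only the numbers quoted in this thread (realistic bandwidth divided by hourly price; the prices and bandwidths are as stated above, not independently verified):

    // Back-of-the-envelope: host->compute bandwidth per dollar-hour, using the
    // figures from the comments above (none independently verified).
    #include <cstdio>

    int main() {
        struct Option { const char* name; double gbps; double usd_per_hr; };
        const Option opts[] = {
            {"H100, 16x PCIe 5.0 (RunPod)",     100.0, 1.99},
            {"m8g.8xlarge, 32 ARM cores (AWS)", 175.0, 1.44},
            {"GH200, NVLink-C2C (Lambda)",      450.0, 1.50},
        };
        for (const Option& o : opts)
            printf("%-35s %6.1f GB/s per $/hr\n", o.name, o.gbps / o.usd_per_hr);
        return 0;
    }

On that metric the GH200 comes out around 300 GB/s per dollar-hour, versus roughly 120 for the m8g instance and 50 for the PCIe-attached H100.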


By "typical" I meant adoption within commonly deployed TSDBs like Prometheus, InfluxDB, etc.

GB/GH are actually ideal targets for my code: both architectures integrate Neoverse V2 cores, the same core I developed for. They are superchips with 144/72 CPU cores respectively.

The perf numbers I shared are for one core, so multiply the numbers I gave by 144/72 to get expected throughput on GB/GH. Since you (apparently?) have access to this hardware, I'd sincerely appreciate it if you could benchmark my code there and share the results.


GB is CPU+2xGPU.

GH is readily available to anybody at $1.50 per hour on Lambda; GB is harder to get, and we're only just starting to experiment with it.


Each Grace CPU has multiple cores: https://www.nvidia.com/en-gb/data-center/grace-cpu-superchip

This superchip (which might be different from the one you're referring to) has 2 CPUs (144 cores): https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip...



