H100 can do well over 1500 TFLOPS in fp16.

nulltype · on Nov 12, 2024

Which H100 and how much over 1500 TFLOP/s?

The datasheet for the H100 SXM seems to indicate that it can only do ~1000 TFLOP/s peak.

saagarjha · on Nov 12, 2024

I just went to Nvidia’s site and downloaded the data sheet: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor.... It says 1600/1900 in half precision?

wtallis · on Nov 12, 2024

Read the fine print: "With sparsity". They double the claimed throughput by assuming that half of the FLOPs can be skipped.
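
As a rough sketch of that arithmetic: if the "with sparsity" figure is simply double the dense figure, halving it recovers the dense peak. The inputs below are the approximate 1600/1900 numbers quoted in this thread, not exact datasheet values, and `dense_peak_tflops` is a hypothetical helper name used only for illustration.

```python
# Assumption: the "with sparsity" figure is exactly 2x the dense figure.
SPARSITY_SPEEDUP = 2.0


def dense_peak_tflops(with_sparsity_tflops: float) -> float:
    """Convert a 'with sparsity' peak figure into the implied dense peak."""
    return with_sparsity_tflops / SPARSITY_SPEEDUP


# Approximate figures quoted in the thread (not exact datasheet values).
quoted = {"lower figure": 1600.0, "higher figure": 1900.0}
dense = {name: dense_peak_tflops(value) for name, value in quoted.items()}

for name, value in dense.items():
    print(f"{name}: ~{value:.0f} TFLOPS dense")
```

Under that assumption, neither quoted figure supports "well over 1500" for dense fp16 matmuls.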

menaerus · on Nov 12, 2024

I also recently went through the specs and noticed "with sparsity", but I didn't quite understand what it specifically refers to. As far as I can tell, the premise is that a lot of the weights in matmul operations will be zero or insignificant (such matrices are known as sparse matrices), and in that case the A100/H100 has circuitry that can boost throughput by up to 2x, essentially "skipping" half of the FLOPs, as you say.

I am not an expert in LLMs, but I don't think you can end up with a significant share of zeroed weights (~50%) in a converged network, so I think it is safe to say that the theoretical throughput for 99% of cases is really ~800 TFLOPS and not ~1600 TFLOPS as advertised.
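
For context, the sparsity feature on A100/H100 is usually described as 2:4 structured sparsity: in every group of four consecutive weights, at most two are nonzero. The sketch below shows what pruning a row of weights to that pattern could look like in plain Python. `prune_2_of_4` is a hypothetical helper, and picking survivors by magnitude is an assumption made for illustration; it says nothing about how much accuracy a real network would keep after such pruning.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Made-up example weights, two groups of four.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

The point of the pattern is that the zeros sit in predictable positions, so the hardware can skip them; a network whose weights merely happen to be small does not qualify unless it has been pruned this way.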

saagarjha · on Nov 12, 2024

Oh, that is really annoying. Thanks for catching that!

pama · on Nov 12, 2024

There are two populations of people reading the NVIDIA specs (and you have just switched groups). If NVIDIA ever changes its marketing strategy and the asterisk comes to denote something else, there might be a third population, because I know a lot of people who I suspect will keep dividing those starred FLOPS by two :-)