
H100 can do well over 1500 TFLOPS in fp16.


Which H100 and how much over 1500 TFLOP/s?

The datasheet for the H100 SXM seems to indicate that it can only do ~1000 TFLOP/s peak.


I just went to Nvidia’s site and downloaded the data sheet: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor.... It says 1600/1900 in half precision?


Read the fine print: "With sparsity". They double the claimed throughput by assuming that half of the FLOPs can be skipped.


I also recently went through the specs and noticed "with sparsity", but I didn't quite understand what it specifically refers to. The premise is that many of the weights in matmul operations will be zero or insignificant (i.e., the matrices are sparse), and in that case the A100/H100 have circuitry that can boost throughput by up to 2x, essentially "skipping" half of the FLOPs, as you say.
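To make the "skipping half" idea concrete, here is a minimal sketch. It assumes the 2:4 structured pattern (at most two nonzero values in every group of four consecutive weights), which is my understanding of what the sparse tensor cores expect; the function name `prune_2_of_4` is made up for illustration and is not an NVIDIA API.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four.

    Illustrative only: assumes a 2:4 structured-sparsity pattern.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Half of the entries are now zero, so half of the multiply-adds can in principle be skipped. Whether a trained network tolerates this pruning without losing accuracy is a separate question.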

I am not an expert in LLMs, but I don't think you end up with a significant fraction of zeroed weights (~50%) in a converged network, so I think it is safe to say that the theoretical peak throughput for 99% of cases is really ~800 TFLOPS and not the ~1600 TFLOPS advertised.
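The arithmetic is just a division by two. As a rough sketch, using the approximate figures quoted in this thread rather than exact datasheet values (`dense_tflops` is a hypothetical helper name):

```python
def dense_tflops(advertised_with_sparsity):
    """Estimate dense peak throughput from a "with sparsity" figure.

    Assumes the advertised number is exactly 2x the dense number.
    """
    return advertised_with_sparsity / 2


# Approximate half-precision figures mentioned in the thread.
for advertised in (1600, 1900):
    print(f"{advertised} TFLOPS with sparsity -> ~{dense_tflops(advertised):.0f} TFLOPS dense")
```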


Oh, that is really annoying. Thanks for catching that!


There are two populations of people reading the NVIDIA specs (and you have now switched groups). If NVIDIA ever changes its marketing strategy and the asterisk comes to denote something else, there might be a third population, because I know a lot of people who I suspect will keep dividing those starred FLOP/s by two :-)



