I also recently went through the specs and noticed "with sparsity" but I didn't quite understand what it specifically refers to - the premise is that a lot of weights in matmul operations will be zero or insignificant - also known as sparse matrices - and in that case A100/H100 has a circuitry that can boost the throughput up to 2x, essentially "skipping" half of the FLOPS as you say.
I am not an expert in LLM but I don't think you can end up having a significant amount of zeroed weights (~50%) in a converged network so I think it is safe to say that the theoretical throughput for 99% of cases is really ~800 TFLOPS and not ~1600 TFLOPS as advertised.
There are two populations of people reading the NVIDIA specs (and now you switched groups). If NVIDIA ever changes their marketing strategy and the asterisk denotes something else, there might be a third population because I know a lot of people that I suspect will keep dividing those starred FLOPS/s by two :-)