
Just because their tools are the best doesn't mean they are designed well.


I've used DSPs, custom boards with compute hardware (FPGA image processing), and various kinds of GPUs. I would have a very hard time comparing the NVIDIA toolkit to what else is out there and not coming away with a massive sense of relief. For the most part 'it just works': the programming model is generic enough that you can actually get pretty close to the TDP on your own workloads with custom software, and yet specific enough that you'll find stuff that makes your work easier most of the time.

I really can't complain. Now, FPGAs, on the other hand... If there ever is a company that comes out and improves substantially on this, I'll be happy for sure, but if you asked me off the bat what they should improve, I honestly wouldn't know; especially considering that this was an incremental effort over ~2 decades, that it originated in an industry that has nothing to do with the main use case today, and that it took some detours into unrelated industries besides (crypto, for instance).

Covering fluid dynamics, FEA, crypto, gaming, genetics, AI, and many other fields with a single generic architecture, while delivering very good performance, is no mean feat.

I'd love to hear in what way you would improve on their toolset.


Not the guy you replied to, but here are some improvements that feel obvious:

1. Memory indexing. It's a pain to avoid bank conflicts and to implement cooperative loading on transposed matrices (see the sketches after this list). To improve this, (1) pop up a warning when bank conflicts are detected, and (2) have the compiler solve cooperative loading. It wouldn't be too hard to add a second form of indexing, memory_{idx}, for which the compiler solves a linear programming problem to maximize throughput (do you spend more thread cycles on cooperative loading, or are bank conflicts fine because you have other things to work on?)

2. Why is there no warning when shared memory is read before being assigned? It isn't hard to check whether you're accessing an index that might never have been given a value (a minimal example of the hazard is sketched below). The compiler should pop out a warning and assign it 0.0, or maybe even just throw an error.

3. Timing: there's no built-in support. Pretty much the gold standard is to run your kernel 10_000 times in a loop and subtract the timestamps taken before and after the loop (the event-based version of that loop is sketched below). This isn't terribly important; I'm just getting flashbacks to before I learned `timeit` was a thing in Python.
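To make item 1 concrete, here's a minimal sketch of the padding workaround you currently apply by hand when loading a tile that gets read back transposed. Nothing here is NVIDIA's API; TILE_DIM, BLOCK_ROWS, and the kernel name are just illustrative choices:

    #define TILE_DIM 32
    #define BLOCK_ROWS 8

    // Classic manual fix for bank conflicts on transposed access: pad the
    // shared tile by one column so column-wise reads hit different banks.
    __global__ void transpose_padded(float *out, const float *in,
                                     int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // +1 = padding column

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;

        // Cooperative load: each thread copies several rows of the tile.
        for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
            if (x < width && (y + j) < height)
                tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

        __syncthreads();

        // Write the transposed tile; reads walk down a column of `tile`,
        // which is exactly where the padding avoids bank conflicts.
        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;
        for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
            if (x < height && (y + j) < width)
                out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
    }

That padding-and-index dance is exactly the kind of boilerplate a compiler pass could in principle take over.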
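For item 2, a small made-up example of the hazard: only some threads write the shared array, every index of it gets read, and today this compiles without a peep:

    __global__ void sum_first(const float *in, float *out, int count)
    {
        __shared__ float buf[256];

        if (threadIdx.x < count)              // only `count` entries are written
            buf[threadIdx.x] = in[threadIdx.x];
        __syncthreads();

        if (threadIdx.x == 0) {
            float s = 0.0f;
            for (int i = 0; i < blockDim.x; ++i)  // indices >= count are garbage
                s += buf[i];
            *out = s;
        }
    }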
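And for item 3, this is roughly the event-based timing loop you end up writing yourself every time; my_kernel, the launch config, and the iteration count are placeholders:

    #include <cuda_runtime.h>

    __global__ void my_kernel() { /* placeholder workload */ }

    float time_kernel_ms(dim3 grid, dim3 block)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        my_kernel<<<grid, block>>>();      // warm-up launch, not measured

        cudaEventRecord(start);
        for (int i = 0; i < 10000; ++i)    // average over many launches
            my_kernel<<<grid, block>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms / 10000.0f;              // average time per launch
    }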


Those are good and actionable suggestions. Have you passed these on to NVIDIA?

https://forums.developer.nvidia.com/c/accelerated-computing/...

They regularly have threads asking for such suggestions.

But I don't think they support the general conclusion that the tooling is bad.



