The 754 spec has moved well past the 8087 at this point (three revisions later). A lot has been fixed, including the whole language around exceptions (which used to be defined in terms of "traps" - a very processor-specific idea rather than an arithmetic-centered one). I am hoping we can be free of (required) exceptions in 2028.
As I understand it, your other complaints tend to center around overflow to infinity and precise summation of vectors. For applications that really care about that precision, there are ways to do it in floating point without a quire register - sorting before summing is the naive approach, but look into ReproBLAS for some better algorithms.
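To make that concrete, here is a minimal sketch of compensated summation - the Neumaier (Kahan-Babuška) variant, which carries the rounding error along in a second accumulator instead of needing a quire. This is the naive end of the spectrum; ReproBLAS does something more sophisticated, and `neumaier_sum` is just an illustrative name, not anything from that library.

```python
import math

def neumaier_sum(xs):
    # Running sum plus a compensation term that accumulates
    # the low-order bits lost to rounding at each step.
    s = 0.0
    c = 0.0
    for x in xs:
        t = s + x
        if abs(s) >= abs(x):
            c += (s - t) + x  # low bits of x were lost
        else:
            c += (x - t) + s  # low bits of s were lost
        s = t
    return s + c

xs = [1e16, 1.0, -1e16]
print(sum(xs))           # naive left-to-right sum loses the 1.0
print(neumaier_sum(xs))  # compensated sum recovers it
print(math.fsum(xs))     # Python's built-in exact summation agrees
```

Note that plain Kahan summation actually fails on this particular input (the cancellation in the last step discards the compensation), which is why the Neumaier variant is shown; `math.fsum` is the fully exact option if you're in Python anyway.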
Also, I can't help but wonder if the memory wall idea here is centered only around synthetic benchmarks like gigantic dot products. A lot of code leans heavily on caches these days, which makes the energy cost of operand fetches a lot lower, and pretty much everything short of massive dot products benefits from them. I imagine you would have to make a very nuanced argument about why a 1k-element fixed-point sum is saving energy here. Even matmuls are pretty cache-efficient now.
Elsewhere in computing, we are actually generally moving away from tightly-packed structs in performance-sensitive code despite the memory retrieval cost, because they are just easier to deal with in both hardware and software, and locality picks up all the slack.