In the HPC world, things are not as clear cut as benchmarks, or the vendors' own marketing materials and numbers, would suggest.
First of all, the application you're running may have been developed for a specific compiler, and the code sometimes depends on that compiler's optimization behavior. So changing compilers changes a lot of things. This is why we fully support both Intel's tools and the GCC toolchain. For example, IIRC, LAPACK and its siblings take compiler behavior and CPU specifications into account at build time to maximize performance.
Also, there's no guarantee that Intel's compilers are fastest on Intel hardware. In the days of the Opteron 6100s, we were able to beat Intel processors of the same era using Intel's compilers. You heard that right: compile with the Intel compiler using specific flags, run on AMD CPUs, get higher performance, profit!
Intel's AVX-512 is well used and abused in the HPC world; however, AMD's HPC performance is not as bad as jandrewrogers implied in his comment [0]. AMD is originally an FPU company, and while their scalar instructions may look lacking on paper, they run really fast.
In the HPC world, the CPU/board architecture becomes irrelevant beyond a certain point. SpecCPU benchmarks are the ultimate benchmarks, because their behavior is compiler agnostic and they push every aspect of the CPU very, very hard. If you can match an Intel part's SpecFP score, you can get more or less the same performance on real workloads.
If you have any other questions, you can AMA. I'll try my best to answer.
Funny addendum: we have some applications widely used by our users, and when fully optimized, some older Intel CPUs outpace the newer ones by a significant margin. This takes some heavy-handed, exotic optimization.
> Intel's AVX-512 is well used and abused in the HPC world; however, AMD's HPC performance is not as bad as jandrewrogers implied in his comment [0].
I know earlier AMD processors didn't actually have 256-bit support, so AVX instructions were implemented by combining two 128-bit lanes (it helps that AVX doesn't have many instructions that actually move data between the two 128-bit halves of a 256-bit vector). For their AVX-512 performance not to be absolutely horrible, I take it they've actually built real AVX-512 units at some point?
They're doing the same thing with AVX-512 that they do now with AVX2. So 512-bit instructions will translate into two uops.
It's still useful to implement the AVX-512 instructions because they fill in some holes in the existing AVX instruction sets (e.g. the lack of scatter and embedded-broadcast instructions) and introduce a new SIMT-like op-masking facility.
From what I've read now [0], it looks like AMD still uses 2 x 128-bit AVX units to execute AVX2 instructions. Also, AMD is always a generation behind Intel in FP instruction sets, so Zen doesn't support AVX-512.
According to WikiChip [4], Zen 2 actually has 256-bit FPU datapaths. I was unable to find a credible benchmark for Zen 2, so I can't speak to its performance. However, when analyzed from the perspective I've given below, it's not hard to assume that Zen 2 will be a heavy hitter in floating-point performance.
However, the interesting part is that when you look at SpecCPU 2017 FP Rate [1], an AMD Epyc 7601 [2] system has per-core performance similar to a much bigger Intel Xeon Platinum 8180 [3] system.
Why interesting?
* AMD's per core base (lowest) rate is 4.1875.
* Intel's per core base (lowest) rate is 4.3482.
* AMD is running GCC compiled code.
* Intel is running Intel compiled code.
* Intel has higher clock speed.
Intel has some CPUs (like the Gold 5118 and Gold 6148) with per-core base rates of ~5.125. These are the CPUs considered HPC processors, and they're used by a lot of people.
As I said before, it looks like Zen 2 is going to be a better HPC processor than Zen. For now, Zen looks like a very good enterprise processor.
So, with my HPC hat on, I can conclude that not having 512-bit hardware is not a crippling omission.
Addendum: I forgot to say that Intel has something called the "AVX frequency". Since AVX, AVX2 and AVX-512 have tremendous power requirements compared to other operations, Intel drops the CPU to an undisclosed lower frequency. When I last checked, the AVX frequencies of the Intel CPUs we use weren't in the technical guides and were not public in any way. So the peak SpecFP Rate is not very different from the base one.
Also, since the CPU's thermal budget is very constrained during AVX operations, the other ports' speed is reduced as well. So at the end of the day, AVX-512 is not a free turbo boost under heavy, continuous HPC loads.
A large part of the reason AMD can reach sustained real-world throughput similar to Intel's, despite having a fraction of the FPU throughput, is that they run the FPU as a separate unit on different issue ports, and their core is slightly wider when measured by instructions retired per cycle.
So even though the Intel CPU can in theory do 4x the vector computation AMD can, in reality even the tightest vector code does all kinds of things besides vector math, right in the middle of the vector work: computing addresses for loads and stores, managing loop variables, and so on. On AMD, those intermixed scalar instructions go to separate scalar ports; on Intel, they occupy the same issue slots the vector code uses.
Then, on top of that, memory bandwidth is a great equalizer. It doesn't matter how many multiplies you can compute if you cannot load the operands, and AMD systems are much closer to Intel there than they are in pure computation, especially as they have a lot more L3 cache per core.
On Zen 2, AMD does two big things that are going to really help them in HPC loads. They are doubling vector unit width, and they are doubling the amount of L3 per core. I honestly think the second change will help more than the first.
You're right. Also Intel's AVX implementation is very power heavy, and they need to lower CPU frequency to fit into their thermal budget (see "Addenda:" in my previous comment).
Also yes, AMD's memory subsystem has much lower latency and higher bandwidth. Their direct-attach approach is also better than Intel's. I'd forgotten that advantage, TBH :)
However, I can argue about L3's effect on speed. In some cases the code and data are so small, and the computation so heavy, that you can fit almost everything into the caches. I had a 2MB binary that required 200MB of memory at most, yet it completely saturated the CPU in every way imaginable.
So in some cases caches have a great effect on speed, especially if the data you're invalidating and pulling in is huge. However, if the turnover is slow, a faster FPU always trumps a bigger cache.
> Also yes, AMD's memory subsystem has much lower latency
No, AMD's latency on Zen chips is generally worse than Intel's. Here's the first example I could find on Google [1], but the same trend repeats across many benchmarks.
My overall impression is that the typical gap is 5-10 ns.
> Also yes, AMD's memory subsystem has much lower latency
Out of interest, what do you mean by this? Are you talking Zen 1 or Zen 2? In my experience playing with Zen 1 EPYC, memory latency was worse than on Broadwell Xeons, and on top of that you had worse NUMA issues: cores not directly attached to the memory saw additional latency, more so than on the Xeons I was comparing against.
Unfortunately, neither. The last AMDs I was able to play with were the Opteron 6xxx series. The later ones weren't as fast, and Zen 1 was not easy to obtain, so we were unable to acquire any.
The last ones I used were better than their competitors of the era. I also had a desktop system from that era which was way better, at least for my workloads.
I'd love to play with Zen 1/2 and compare "benchmarks" to "real workloads", because as I said before, in HPC, benchmarks are just numbers.
e.g. your memory bandwidth may be low, but if it's low latency and you're hammering the bus, bandwidth may not be the limiting factor. OTOH, if you're streaming something continuously, your latency becomes moot, because the bus has already queued up everything you need and can keep piling up data while you process what's at hand. For the second scenario, I once listened to a talk about an embedded system whose developers accelerated it 10x by using an in-CPU accelerator unit to copy the required memory segments into cache independently of the CPU.
That's true; universities, for example, will often use Intel compilers on academic licenses, and other sites may use other performance-oriented commercial compilers such as PGI.