I don't think ARM really has any inherent "power advantage" over x86; this paper...

Symmetry · on May 6, 2014

Quite true, the one ISA advantage that ARM has is that it's much easier to decode. That does make a difference, but it's not a huge factor. On the other hand, x86 tends to be a bit more compact, but not as much as you might think.

ISA doesn't make a huge difference here, but Intel currently has a pretty big lead in both process and architecture. Either because of superior research, the ability to impose more constraints than a merchant silicon house can, or both, Intel has been reaching new process nodes sooner and getting better performance on them than its competitors.

And Intel currently has a big architecture advantage too. Owning the whole PC market gives you the money to hire a lot of good engineers. It might be that there's a second disadvantage to x86 in that it's complicated enough that you need more engineering to get an equivalent architecture. I say this because IBM has managed to keep parity despite the fact that, IIRC, they're able to invest less engineering. Power 7 had higher single threaded performance than anything Intel had when it came out, and it looks like Power 8 is doing a similar leapfrog. If you look at SPEC int_rate and divide by the number of threads you'll find that Intel comes out a factor of 3 better today and 2 when Power 7 came out, but that's due to IBM having 4 threads per core to Intel's 2.

In theory there isn't anything preventing an ARM-64 chip from having performance as good as an x86 or POWER chip, but in practice Intel and IBM have a lot of experience in designing high performance chips while ARM doesn't. AMD sort of does, but they haven't been executing well at the high end recently and these will be their first ARM cores.

*http://www.spec.org/cpu2006/results/cpu2006.html

userbinator · on May 6, 2014

Power 7 had higher single threaded performance than anything Intel had when it came out

IBM Power 795 (4.25 GHz, 128 core, SLES) 5350 base, 512 threads -> 2.46 result/thread/GHz

The Power 7 came out in 2010, so we can look at the x86 that were available around that time - the Nehalem era; e.g. this one

IBM BladeCenter HX5 (Intel Xeon E7540 - 2GHz) 490 base, 48 threads, 5.10 result/thread/GHz

For AMD, this one I picked turns out better than Intel:

IBM System x3755 M3, AMD Opteron 6134 (2.3GHz) 638 base, 48 threads, 5.78 result/thread/GHz

The Power 7's single-threaded efficiency is less than half that of competitive x86 CPUs at the time. The TDP is 200W+ as well - around double that of the x86s which are ~100W - so power efficiency isn't that great. The high clock frequencies probably have something to do with it.
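The result/thread/GHz figures above are each system's base result divided by its thread count and its clock in GHz. A minimal Python sketch of that arithmetic, using only the numbers quoted in this comment (the function name and dictionary labels are just for illustration):

```python
def per_thread_per_ghz(base_result, threads, clock_ghz):
    """Normalize a SPECint_rate base result by thread count and clock speed."""
    return base_result / threads / clock_ghz

# (base result, threads, clock in GHz) exactly as quoted in the comment
systems = {
    "IBM Power 795 (POWER7)": (5350, 512, 4.25),
    "Intel Xeon E7540": (490, 48, 2.0),
    "AMD Opteron 6134": (638, 48, 2.3),
}

normalized = {name: per_thread_per_ghz(*spec) for name, spec in systems.items()}

for name, value in normalized.items():
    print(f"{name}: {value:.2f} result/thread/GHz")
```

Note that this treats every hardware thread as an equal unit of work, which is the assumption the rest of the thread argues about.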

In theory there isn't anything preventing an ARM-64 chip from having performance as good as an x86 or POWER chip

True, but as the paper I linked suggests, power efficiency is going to suffer if they're optimising for raw performance. There haven't really been any aggressively high-performance ARM chips before, unlike the other traditional RISCs (SPARC, POWER), so that's why I'm really interested to see what AMD does with it.

Symmetry · on May 7, 2014

You're not looking at single-threaded performance with those numbers; you're looking at, um, multi-threaded performance per thread which is a metric people don't use for very good reason.

In computer architecture, it's very rare for anything to scale linearly. If you take a chip and double the frequency it runs at you won't get double the performance, because there are all sorts of latencies you haven't improved. If you double the number of cores in your chip you won't get double the performance, because they're contending for the same limited pool of off-socket memory bandwidth. If you add more sockets then some memory accesses will be to other sockets, increasing latency. And if you double the number of threads per core, you're lucky to get even a 20% increase in performance because now your threads are in contention for both the same execution and memory resources.

So when you compare a 32 socket, 4 thread per core system to a 4 socket, 2 thread per core system on the basis of thread performance you're being ludicrously unfair. Would you claim that a non-hyperthreaded Intel i5 has much better single threaded performance than a hyperthreaded Intel i7?
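To put a rough number on why dividing by hardware threads penalizes designs with more threads per core, here is a sketch that assumes a core's total throughput rises by the 20% mentioned above when its thread count doubles. That 20% is the optimistic figure from this comment, not a measurement, and the function name is made up for illustration.

```python
def per_thread_throughput(core_throughput, threads_per_core):
    """Throughput attributed to each hardware thread when a core's total is split evenly."""
    return core_throughput / threads_per_core

# Baseline: one thread per core, core throughput normalized to 1.0.
single = per_thread_throughput(1.0, 1)

# Assumption: doubling threads per core raises total core throughput by 20%.
smt2 = per_thread_throughput(1.0 * 1.2, 2)

print(f"1 thread/core:  {single:.2f} per thread")
print(f"2 threads/core: {smt2:.2f} per thread")
```

Under that assumption the core does more total work with the second thread enabled, yet each thread looks 40% slower by the per-thread metric.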

If you follow the link I gave, you can find the actual single-threaded SPECint results at the top. You'll find two base results for Power 780 (29 and 44) and many results for E7540s, which seem to be around 24 for the first ten results I checked.

Sure, the SPECint rate results show a different story if you divide by the number of threads, but that isn't what people mean when they talk about single threaded performance.

userbinator · on May 8, 2014

multi-threaded performance per thread which is a metric people don't use for very good reason.

Maybe I used the wrong term, but I'm referring to the idea of how much work can be done by a single instruction stream (thread) in a fixed number of clock cycles.

but that isn't what people mean when they talk about single threaded performance.

Then what do they mean?

I understand what you mean about scaling not being linear with the number of threads, but even with the same (very large) number of threads:

POWER7@3.44GHz, 384 threads (16 chips, 6 cores/chip, 4 threads/core) result 3560

Xeon X7542@2.67GHz, 384 threads (64 chips, 6 cores/chip, 1 thread/core) result 8190

Symmetry · on May 8, 2014

How much work an instruction stream can do in a fixed number of clock cycles is going to be hugely dependent on what other instruction streams executing at the same time might be doing. That's why the convention is, when measuring single threaded performance, to only use a single thread.

Nothing says that you have to run the same number of threads in your workload as you have hardware threads. Operating systems are there to multiplex software threads over hardware threads, and part of SPEC is a test of the operating system and compiler as well as the chips and motherboards and memory. There's nothing to prevent someone from taking the Xeon system in your post and running it with 30,000 threads, producing a performance per thread result much, much lower than running it with 384 threads.
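As a rough illustration of that last point, here is a sketch assuming (optimistically) that the Xeon system's aggregate result stays at the quoted 8190 no matter how many software threads the OS multiplexes onto its 384 hardware threads. In practice scheduling overhead would likely lower the total, so treat this as an upper bound; the function name is hypothetical.

```python
def result_per_thread(total_result, software_threads):
    """Aggregate throughput result divided by the number of software threads run."""
    return total_result / software_threads

XEON_RESULT = 8190  # aggregate result quoted earlier in the thread

# Assumption: the aggregate result does not change with oversubscription.
matched = result_per_thread(XEON_RESULT, 384)
oversubscribed = result_per_thread(XEON_RESULT, 30_000)

print(f"384 threads:    {matched:.2f} per thread")
print(f"30,000 threads: {oversubscribed:.3f} per thread")
```

The hardware is identical in both cases; only the divisor changed, which is why a per-thread figure says little about the chip unless the thread count is fixed by convention.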

The interesting results are which systems can achieve the absolutely highest throughput and single thread performance, and which can achieve more throughput or single thread performance per unit price or unit power consumption.

auselen · on May 6, 2014

'I don't think ARM really has any inherent "power advantage" over x86'

The paper states, roughly, "ISA being CISC or RISC doesn't matter... performance differences are generated by ISA-independent microarchitecture differences".

So being ARM or x86 matters for "power advantage", but being "CISC or RISC" doesn't.

fulafel · on May 6, 2014

The established terminology is "architecture" for instruction set architecture and "microarchitecture" for the user-invisible implementation details. So, microarchitecture means for example AMD K7 vs Intel P6.

So the quote is saying that ARM vs x86 doesn't matter, but e.g. Cortex-A8 vs Cortex-A10 does.

auselen · on May 6, 2014

I was using the paper as a basis since that's what was linked (though I found it to be a poor one).

"We find that ARM and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy efficient in one ISA class or the other. The ISA being RISC or CISC seems irrelevant."

So if you think of them as X and Y, of course they don't matter. However, we are talking about ARM vs x86, which have measurable properties in this whole discussion's context and time, and that matters as "a phone powered with ARM" vs "a phone powered with x86" or "a server powered with ARM/x86". That's the level of terminology we are using.