
Overclocking isn't just limited by heat dissipation, and dealing with the additional heat generation is one of the easier issues to overcome. Much more limiting is the fact that the electronic components in the hardware (everything from RAM to the more "integrated" components on the PCBs) need a certain amount of time to tie voltages high or low, to either read or output a bit. This transition time is normally tiny compared to the time a voltage stays tied high or low, so it isn't a problem; and while you can shorten how long a voltage stays high or low by overclocking, the transition time generally isn't negotiable. Increasing clock speeds to the point where these transition times dominate means you're going to get garbage reads and garbage writes out of a component.

This is part of why if you're overclocking your computer, you end up experiencing errors or unexpected crashes even when temperature doesn't go too high. Past reasonable tolerances, you can't just cool it more and expect it to work.
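
A rough back-of-the-envelope sketch of that limit (all numbers here are invented for illustration, not from any datasheet): the clock period has to cover the slowest path's propagation delay plus the time the input must be stable, otherwise a latch captures a value mid-transition.

    # Toy illustration: max clock vs. propagation delay (numbers hypothetical)
    logic_delay_ns = 0.8   # worst-case propagation delay through a stage
    setup_time_ns  = 0.1   # time the signal must be stable before the clock edge

    min_period_ns = logic_delay_ns + setup_time_ns
    max_freq_ghz  = 1.0 / min_period_ns   # ~1.1 GHz for these numbers

    # Overclocking shrinks the clock period; push it below min_period_ns and
    # the latch samples a signal that is still transitioning -> garbage bits.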



Transition times decrease as voltage increases. The amount of power being dissipated is roughly proportional to V^2 f, and V is roughly proportional to f, so the tradeoff kinda sucks (double frequency = 8x power consumption), but you can eke out some additional performance if you are willing to dissipate more heat, within limits.
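
A quick sanity check of that scaling, treating V proportional to f as an exact rule (it's only roughly true in practice):

    # P ~ V^2 * f, and if V ~ f then P ~ f^3
    f_scale = 2.0                    # double the clock
    v_scale = f_scale                # assume voltage scales with frequency
    p_scale = v_scale**2 * f_scale   # 2^2 * 2 = 8x the power
    print(p_scale)                   # 8.0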


> need a certain amount of time to tie voltages high or low, to either read or output a bit.

When you overclock you also add voltage, which helps with this and goes a long way. You would often hit thermal limits first, so heat dissipation is still a major factor.


That was not my experience back in the Athlon XP days. Even with extra voltage and enough cooling, the CPU started to encounter many internal errors and needed to re-run a lot of instruction sequences to get sane results.

This caused heat and load spikes, and while you saw no errors, the increased frequency didn't give you any real-world performance gains. Returning to a slower configuration actually gave a much snappier and more performant system.


Athlon was 20 years ago.

It feels unfair to generalize across time and architectures.


It's not about architecture. It's about physics. Once you pass a certain threshold, current leakage starts to wreak havoc inside any silicon. So beyond voltage, current and temperature, you're limited by the silicon itself.

I stopped overclocking systems after Athlon XP. This is why I gave that example.

Even without overclocking and overheating, I've seen, and am still seeing, partially cooked processors which shut down half of their FPU pipelines to stay reliable, albeit with orders of magnitude lower performance.

CPUs are more complex than ever, with MCE and more advanced microcode structures, and there's much more than meets the eye.


> This is part of why if you're overclocking your computer, you end up experiencing errors or unexpected crashes even when temperature doesn't go too high. Past reasonable tolerances, you can't just cool it more and expect it to work.

People push CPUs to absurd limits with LN2 cooling, so that's just not true.


It's not like Apple would ever do it, but the M1 should have heaps of OC headroom. Given that the RAM is already on package, all they need to do is turn up the CPU multiplier.

AMD chips are doing similar clocks to Intel on TSMC N7, so Apple could (but won't) have a chip running way higher than the clocks they are currently shipping with.

Also, it's kinda inaccurate to imply any overclocked setup will crash; there's plenty of room unless they come turned up to the max from stock like the 12900K.


Overclocking headroom means you have timing slack, and timing slack means you have faster (~= leakier) circuits than necessary, or stages which aren't filled with work, which is also an inefficiency.

I expect Apple has an extreme focus on power efficiency and especially idle / leakage power, much more than Intel, considering the core is basically the same as the one they use in their phones. They also have a different approach to turbo / dvfs. So I would expect M1 to actually be a lot tighter than Intel and not have so much OC headroom.

Obviously you can buy timing with voltage to some degree, so there would be something there probably. Modern nodes are running into more problems with voltage-induced breakdown though, so the OC limit looks very different from what you can ship in a product. Has anyone measured M1's VDD?

> AMD chips are doing similar clocks to Intel on TSMC N7, so Apple could (but won't) have a chip running way higher than the clocks they are currently shipping with.

Not with their existing microarchitecture, though. They do nearly 2x the work per clock as AMD chips which necessitates more logic per stage. Getting a microarchitectural edge means making less logic do more work, and it's very possible Apple has some edge there; it just wouldn't be near 2x IMO.

The silicon technology of course plays into it, but when you look at how fast individual transistors and the shortest poly to connect them can switch, speeds over 100GHz have been possible on 90nm. Today's cutting edge is probably over 200GHz (e.g., search ring oscillator). So it's not a fundamental switching speed limit of the tech that gets you.
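
For a sense of where numbers like that come from: a ring oscillator is just an odd number of inverters wired in a loop, and its frequency follows directly from the per-stage delay. The delays below are illustrative guesses, not measured values for any particular node:

    # Ring oscillator: N (odd) inverters in a loop oscillate at f = 1 / (2 * N * t_inv)
    stages   = 3      # minimal ring
    t_inv_ps = 1.5    # hypothetical per-inverter delay on an older node

    period_ps = 2 * stages * t_inv_ps
    freq_ghz  = 1e3 / period_ps   # ~111 GHz for these made-up numbers

Real logic paths chain dozens of gates plus wires between flops, which is why shipping CPUs clock orders of magnitude below that.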

I would say Apple could probably redo the physical design and synthesis work, with minimal logic changes, to target a faster and leakier device that's not suitable for phones but might be a somewhat fairer comparison. It wouldn't put it at a 5-6GHz frequency, but could easily be enough to re-take these benchmarks and still be ahead on efficiency.


> They do nearly 2x the work per clock as AMD chips which necessitates more logic per stage

Not all work is created equal. Decoders on Arm are definitely parallel (= don't have more serial logic for wider decoder) compared to the variable length decode x86 is stuck with. And your backend ports are also parallel (although maybe scheduling isn't?). The only places where wider always means more logic per stage are caches - the register file, L1/L2/L3, BTB. For example, Apple managed to work some magic with a 3-cycle 192kB L1. AMD and Intel are at 4 and 5 cycles for much smaller L1s. Part of the reason for that is probably that Apple doesn't need to hit 5GHz and can afford more logic per stage.
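
To put those cycle counts into absolute time (the latencies are the cycle counts above; the clock speeds are rough assumptions, not exact figures):

    # L1 load-to-use latency in nanoseconds = cycles / frequency (GHz)
    apple_m1 = 3 / 3.2   # ~0.94 ns for the 3-cycle 192kB L1
    amd      = 4 / 4.9   # ~0.82 ns at 4 cycles, assuming ~4.9 GHz boost
    intel    = 5 / 5.2   # ~0.96 ns at 5 cycles, assuming ~5.2 GHz boost

    # Similar absolute latency across all three, but Apple covers a much
    # larger L1 in that window because the lower clock allows more logic
    # (and SRAM access) per stage.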

And in any case, it's very likely you could just shove more voltage through the chip and get it to clock higher, since the current 3.2GHz is very far from what we know TSMC N7/5 can do. I don't think you'd need a rework unless Apple wanted to target 4.5+GHz.


> Not all work is created equal.

> And in any case, it's very likely you could just shove more voltage through the chip and get it to clock higher,

Yes.

> since the current 3.2GHz is very far from what we know TSMC N7/5 can do.

N7/N5 can "do" 200GHz. 90nm could do 100GHz. The limit a device can do depends most highly on the logic.

> I don't think you'd need a rework unless Apple wanted to target 4.5+GHz.

3.2->4.4? I doubt it with any reasonable voltage that could actually ship in a device. Very hard to predict these things unless you've at least got basic shmoo plots and things like that in front of you.


> Decoders on Arm are definitely parallel (= don't have more serial logic for wider decoder) compared to the variable length decode x86 is stuck with

x86 decode is parallel too


It is parallel in the sense that you can decode 4-6 instructions in parallel, yes. It is not parallel in the sense that variable length decoding requires each of your decoders to talk to the other ones to coordinate on instruction length boundaries, which means there is going to be a lot of serial logic in your decoder circuit.
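
A toy sketch of that dependency (a hypothetical instruction stream, not real x86 encoding): each decoder can only start once it knows where the previous instruction ended, whereas a fixed-width ISA can slice the byte stream up front.

    # Variable-length decode: finding instruction starts is inherently serial,
    # because each start depends on the previous instruction's length.
    def decode_variable(bytes_in, length_of):
        starts, pos = [], 0
        while pos < len(bytes_in):
            starts.append(pos)
            pos += length_of(bytes_in, pos)   # must resolve before the next start is known
        return starts

    # Fixed-width (e.g. 4-byte) decode: every start is known up front,
    # so all decoders can work on their own slice independently.
    def decode_fixed(bytes_in, width=4):
        return list(range(0, len(bytes_in), width))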


It doesn't, if your L1$ predecodes at fill time and stores instruction lengths.

Something, somewhere does have to do a serial length decoding of course. But when you look at the L2 access latency and throughput (which is the minimum L1 fill latency), it's clear you could afford to do that part of the decode over more cycles.

New designs are not just predecoding lengths but entire uops into the first-level instruction cache now, which is the same concept; they just call it an L0 and L1 rather than an L1 and L2.
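
A sketch of the predecode idea under the same toy model as above (function names are hypothetical): do the serial length walk once when the line is filled into the I-cache, store the boundaries alongside it, and decode in parallel on every subsequent fetch.

    # At L1 fill time (off the critical path) walk the line once, serially,
    # and store each instruction's start offset next to the cache line.
    def predecode_on_fill(line_bytes, length_of):
        marks, pos = [], 0
        while pos < len(line_bytes):
            marks.append(pos)
            pos += length_of(line_bytes, pos)
        return marks

    # On later fetches the decoders just read the stored marks, so they can
    # all start at known offsets in parallel, like a fixed-width ISA.
    def decode_parallel(line_bytes, marks):
        return [line_bytes[m:] for m in marks]   # each decoder gets its own slice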


The M1 family can already only sustain 3.2GHz with one core per cluster active, so there is some bottleneck there, either due to power dissipation/density or power delivery. If the latter, upping the voltage would give you more oomph, but that comes with diminishing returns. I wouldn't expect Apple to give the chips more than a trivial clock bump (3.4?) this generation, if that.


cryogenic cooling tho?


Wouldn't solve the problem. Basically you need the integrated components to have shorter transition times. These are an inherent property of the component: speeding up the transition time basically means "replace this component with a different component that has a shorter transition time".


Overvolting the component and pushing more current through its transistors tends to speed them up, though at the cost of significantly more power.

Overclocking is therefore often combined with overvolting (which also means more current), allowing you to increase clock speeds.


As they say, "the candle that burns twice as bright burns half as long", or, in this case, "the transistor that runs at thrice the voltage runs for about 1.3 seconds"


Overvolting is how this is solved, and it works with adequate cooling.


Google "propagation delay": your pipeline's length and the delay of the devices within that pipe determine the max frequency your circuit can run at (a very high-level overview).


How is pipeline length related to max frequency?


It’s related indirectly. A longer pipeline tends to result in shorter stages within the pipeline. (This is the whole point of pipelining.)
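
A rough illustration of that relationship (all numbers invented): splitting the same total logic across more stages shrinks the per-stage delay, which is what lets the clock go up, while the fixed flop/setup overhead per stage caps the benefit.

    # Toy model: fixed total logic delay split across N stages,
    # plus a fixed per-stage register/setup overhead.
    total_logic_ns = 4.0
    overhead_ns    = 0.1

    for stages in (5, 10, 20):
        period_ns = total_logic_ns / stages + overhead_ns
        print(stages, round(1.0 / period_ns, 2), "GHz")
    # 5 -> ~1.11 GHz, 10 -> ~2.0 GHz, 20 -> ~3.33 GHz (diminishing returns)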


OK, yes, there's often a correlation, but that's it. A longer pipeline doesn't affect max circuit frequency.

The other sibling comment regarding distance isn't true, regardless of how insane your feedbacks are. Physical placement tools have always handled this in my experience (as a chip designer for 10+ years).


There is no point in lengthening the pipeline if you don't get higher frequency from it. It would just add complexity and a higher branch mispredict penalty.


Physical length the signal has to travel.



