
Area. Golden Cove (12-series) and Raptor Cove (13-series, except for some of the lower SKUs, which get rebranded Golden Cove) are obscenely massive. A Golden Cove core is something close to 2x the area of a zen3 core, and zen3 is not even on a 5nm-tier node like Intel 7! Logic density went up 1.8x between TSMC N7 and N5, so this means something like 3.2x the transistor count at the high end. Achieved shrink will be a bit lower, but let's say 3x the transistor count of zen3.

And probably this tends to understate things because that's Golden not Raptor cove. Intel went obscenely huge on caches with Raptor Cove too - they did the same thing as NVIDIA with Ada, and dumped an assload of L1/L2 cache on it too. I don't know off the top of my head but let's say 10-20% bigger for Raptor cores.

https://www.reddit.com/r/hardware/comments/qlcptr/m1_pro_10c...

In contrast Gracemont is much smaller - it is not quite "4x" as advertised, 4 is actually the number of cores in a Gracemont CCX/cluster, but the cluster is somewhat bigger than a Golden Cove core. So the actual core area is 3.26 Gracemont cores per Golden Cove core, and again, Raptor Cove is significantly bigger.

--

So the tradeoff is like this: they could have done a 12P0E or something like that for about the same area as an 8P12E, which would most likely still lose to a 16P0E Zen3 in multithreaded workloads.
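As a rough check on that area claim, here is a back-of-the-envelope sketch in Python. It uses only the 3.26 Gracemont-per-Golden-Cove figure quoted above and treats it as an all-in ratio, which is an assumption. The function name `p_core_equivalents` is made up for illustration, and uncore, iGPU and ring stops are ignored entirely.

```python
# Back-of-the-envelope core-area comparison, measured in "P-core equivalents".
# E_PER_P is the ratio quoted in the comment above; the rest is illustrative.
E_PER_P = 3.26  # Gracemont cores per Golden Cove core, by area


def p_core_equivalents(p_cores: int, e_cores: int) -> float:
    """Return the core area of a P/E mix in units of one Golden Cove core."""
    return p_cores + e_cores / E_PER_P


if __name__ == "__main__":
    for p, e in [(12, 0), (8, 12), (8, 16)]:
        print(f"{p}P{e}E ~ {p_core_equivalents(p, e):.2f} P-core areas")
```

Under that assumption 8P12E comes out at roughly 11.7 P-core areas, which is why it is "about the same area" as 12P0E.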

That's the game Intel is playing - 8P is generally enough for games, but it's area-inefficient to keep scaling like that. But you have these other bulk tasks that just like tons of cores and don't care about peak performance, so, you have a mix of both. The E-cores give you more perf/area and the p-cores give you more peak perf for games/etc. So theoretically it's the best of both worlds, it's not as slow as a full e-core chip would be but it has a lot more MT performance than an all-P-core would.

Unspoken underlying problem being that Intel's P-cores are much, much, much bigger than the competition. Hence they have a much greater need to come up with a "compact" alternative than AMD does. Using that 1.8x logic scaling factor (which is optimistic), a Zen3-on-5nm design would be 1.72 mm² which is just about the same size as Gracemont. So Intel's "e-core" is about as big as AMD's p-core! Hence why they are much more focused on a whole new core design, where AMD just densifies the existing one (high efficiency/high-density libraries, reduced cache, back to 4-core CCX, etc). Squeeze that last 30% and call it a day.
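The numbers in this comment can be cross-checked against each other with a few lines of arithmetic. This is only a sketch: the 1.8x, 1.72 and 3.26 figures come from the text, while setting Gracemont to exactly 1.72 mm² is an assumption (the text only says "just about the same size"), and all variable names are invented here.

```python
# Sanity check of the area figures quoted in the comment.
LOGIC_SCALING_N7_TO_N5 = 1.8  # from the comment (stated to be optimistic)
ZEN3_ON_N5_MM2 = 1.72         # from the comment: hypothetical Zen3 core on 5nm
E_PER_P = 3.26                # from the comment: Gracemont cores per Golden Cove

# ASSUMPTION: Gracemont is taken as exactly the size of the shrunk Zen3 core.
GRACEMONT_MM2 = ZEN3_ON_N5_MM2

# Undo the hypothetical shrink to get the implied Zen3 core area on N7.
zen3_on_n7_mm2 = ZEN3_ON_N5_MM2 * LOGIC_SCALING_N7_TO_N5
# Implied Golden Cove core area from the E-per-P ratio.
golden_cove_mm2 = GRACEMONT_MM2 * E_PER_P
# Raw area ratio, then the density-adjusted ("transistor count") ratio.
area_ratio = golden_cove_mm2 / zen3_on_n7_mm2
transistor_ratio = area_ratio * LOGIC_SCALING_N7_TO_N5

if __name__ == "__main__":
    print(f"implied Zen3 core on N7: {zen3_on_n7_mm2:.2f} mm^2")
    print(f"implied Golden Cove core: {golden_cove_mm2:.2f} mm^2")
    print(f"area ratio: {area_ratio:.2f}x")
    print(f"density-adjusted ratio: {transistor_ratio:.2f}x")
```

With these inputs the area ratio lands around 1.8x and the density-adjusted ratio around 3.26x, which lines up with the "close to 2x" and "3.2x" figures at the top of the thread.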

On a more tactical level, I think it also is a move to force people to use Gracemont and start writing code for it. Long-term, your P-core being 3x the size of your competitors' is not sustainable and they need to pivot away from the existing P-core design (lakes/coves), it obviously is just a mess internally from 3 decades of tech-debt. Nobody really cares about the atom chips, despite them being pretty good for a long time now (my J5005 NUC made a great thin client during the pandemic, I use them for HTPCs, etc). Well, now you have to care, or you're leaving performance on the table on the mainstream intel chips. It's not just "intel loves big.little" or "needs big.little for area" but also "big.little" is a way for them to start getting the "little" cores into running real-world code, because in the long term they need to kill the coves off (and perhaps replace them with a mont-derived alternative).

(my suspicion is that this is a case of Conway's Law in action, and the architecture of the Lake/Cove family resembles the Intel organizational chart, and since Intel is a giant knot, that's the processor architecture they produce, and they've been doing that for at least 20 years. In hindsight Pentium 4 was the warning sign of the internal rot, they got it back together for a while but after the sandy bridge era they collapsed and everything since then is probably just more and more tech debt and kludges stacked on.)

--

Also, frankly, the e-core's "CCX" design makes sense. Tiering your interconnect/cache is what AMD has done very successfully - you have 2 CCXs per CCD (on zen2), 8 CCDs per socket. And that lets them decompose the interconnects into manageable pieces - 4 cores per CCX is a simple all-connected topology. Those talk to 4 quadrants on the IO die, which is a simple topology. If you want to talk to the other CCX, you have to go through the quadrant/IO die, so there is no "special case" there. It's all just a composition of simple pieces.
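To make the "composition of simple pieces" idea concrete, here is a small Python sketch of that hierarchy. The counts (4 cores per CCX, 2 CCXs per CCD, 8 CCDs per socket, 4 IO-die quadrants) are from the paragraph above. The linear core numbering, the pairing of consecutive CCDs into quadrants, and the names `Location`, `locate` and `route` are all assumptions made for illustration; real core enumeration differs.

```python
from dataclasses import dataclass

CORES_PER_CCX = 4    # from the comment (zen2)
CCX_PER_CCD = 2      # from the comment (zen2)
CCDS_PER_SOCKET = 8  # from the comment
QUADRANTS = 4        # from the comment
# ASSUMPTION: consecutive CCDs share an IO-die quadrant.
CCDS_PER_QUADRANT = CCDS_PER_SOCKET // QUADRANTS
TOTAL_CORES = CORES_PER_CCX * CCX_PER_CCD * CCDS_PER_SOCKET


@dataclass(frozen=True)
class Location:
    quadrant: int
    ccd: int
    ccx: int
    core: int


def locate(core_id: int) -> Location:
    """Map a (hypothetical) linear core id onto the tiered hierarchy."""
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError(f"core_id must be in [0, {TOTAL_CORES})")
    ccx_global, core = divmod(core_id, CORES_PER_CCX)
    ccd, ccx = divmod(ccx_global, CCX_PER_CCD)
    return Location(ccd // CCDS_PER_QUADRANT, ccd, ccx, core)


def route(a: int, b: int) -> str:
    """Classify which tier traffic between two cores has to cross."""
    la, lb = locate(a), locate(b)
    if (la.ccd, la.ccx) == (lb.ccd, lb.ccx):
        return "within CCX"
    # Even the sibling CCX on the same CCD goes through the IO die,
    # so there is no special case for "same CCD".
    if la.quadrant == lb.quadrant:
        return "via IO die, same quadrant"
    return "via IO die, across quadrants"


if __name__ == "__main__":
    for pair in [(0, 3), (0, 4), (0, 16)]:
        print(pair, "->", route(*pair))
```

Note there are only three cases to reason about no matter how many cores the socket has, which is the whole point of the tiering.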

Ringbuses get annoying/inefficient past about 10-12 cores, which is why Intel abandoned them in server after broadwell-EP (with its "dual ring" design). But a mesh of individual cores also has this huge latency penalty, and consumes a bunch more area, and (in practical configurations) still tends to be very bottlenecked unless you spend an even higher amount of area on it.

What's the middle-ground? You group the cores into clusters/CCXs and you either have a mesh-of-CCX or a ring-of-CCX or some other tiered structure. And you can break the "tile" idea down into tiers too - a tile is a ring or mesh of cores, and then you have a mesh of tiles, but these are separate logical tiers and don't need to interact.

It is the usual HPC networking problem - connecting 1, 2, or 4 nodes is easy with simple all-connected or hypercube topologies and a small number of links. For 4 nodes, a hypercube requires only 2 links per node, and an all-connected topology requires only 3 links per node. And you can solve for modestly higher numbers with something like a ringbus (which gets a lot of flak but it's an extremely performant network structure, and AMD uses them too for their 8-core CCX). But that falls apart with higher numbers of nodes too, and big switched-fabric networking chips or backbone switches are some of the largest and most expensive chips manufactured. A 32-port 400GbE switch (idk, whatever) is gonna be a big beefy boy in itself; that type of thing often hits 750 mm²+ of silicon on the latest nodes.
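The link counts are easy to reproduce. Below is a minimal Python sketch using the standard textbook formulas for hypercube, all-connected and ring topologies; the function names and the dictionary layout are just illustrative choices, and it counts links and worst-case hops only, not bandwidth or area.

```python
def hypercube(n: int) -> dict:
    """Hypercube on n = 2**d nodes: each node has d links."""
    if n < 1 or n & (n - 1):
        raise ValueError("hypercube needs a power-of-two node count")
    d = n.bit_length() - 1
    return {"links_per_node": d, "total_links": n * d // 2, "max_hops": d}


def all_connected(n: int) -> dict:
    """Every node linked directly to every other node."""
    if n < 1:
        raise ValueError("need at least one node")
    return {
        "links_per_node": n - 1,
        "total_links": n * (n - 1) // 2,
        "max_hops": 1 if n > 1 else 0,
    }


def ring(n: int) -> dict:
    """Bidirectional ring: two links per node regardless of size."""
    if n < 3:
        raise ValueError("a ring needs at least 3 nodes")
    return {"links_per_node": 2, "total_links": n, "max_hops": n // 2}


if __name__ == "__main__":
    for n in (4, 8, 64):
        print(n, "hypercube", hypercube(n))
        print(n, "all-connected", all_connected(n))
        print(n, "ring", ring(n))
```

At 4 nodes everything is cheap (2 or 3 links per node). At 64 nodes the all-connected case needs 63 links per node and the ring needs up to 32 hops, which is the "falls apart" part.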

You need something that both scales in terms of network hardware/area/power, and also performs in terms of actual latency and throughput. That's super difficult (and the best topologies are de-facto "tiered" anyway like hypercube or butterfly), so the best strategy is to introduce this tiering. And AMD has meticulously stayed in the limit of "the IO die is a simple hypercube topology of quadrants" and "the CCX is a 4C all-connected or a 8C ringbus", and just composed these simple things together with tiering.

I think low-key Gracemont is important because it's Intel tinkering with the same concept - and they're doing mesh-of-tiles with sapphire rapids too (not sure what topology is inside each tile though). Because they can't have 14+ stops on the ringbus (memory controller, 12 cores, iGPU, etc) and the purist "mesh of single cores" topology obviously didn't work with skylake-SP.

https://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-re...

https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd...

https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/2

https://www.anandtech.com/show/16529/amd-epyc-milan-review/4

--

Anyway, I wish they would do all-P-core too, and theoretically that exists: it's Sapphire Rapids, and there is a workstation/HEDT variant. It's just super expensive and has massive power transient problems (you need 500W of headroom, literally, you will crash if you don't have a 1kW+ PSU, they are not kidding about 1300W being the recommended) that might be sending them back to the drawing board for another stepping. And there are all-e-core chips too, that's Sierra Forest... but it seems like a tentpole customer pulled out (rumored to be facebook iirc) because Bergamo, AMD's compact-core based Epyc server chip, is more attractive. And so they have reduced the scope of Sierra Forest: it now tops out at 2 of the medium chiplets, and the big chiplets are canceled entirely (where they planned to use up to 4 of the big ones).

MLID is such an unreliable source that I hesitate to recommend him, I like his content and listen to him a lot, but you really need to understand the broader context of the market/etc to know whether what he's saying makes sense. But he does tend to have some interesting guests who are usually way better than he is, and one of his recent guests was a boutique PC builder who specializes in digital audio workstations (which need to be super low latency/etc). And they talk about Sapphire Rapids workstation and some of the things being discussed around it.

https://www.youtube.com/watch?v=_HJu5xt43iQ&t=3603s (and the previous segment too)

Sierra forest discussion: https://www.youtube.com/watch?v=QlTZCDEFUFg&t=4200s

General intel discussion: https://www.youtube.com/watch?v=BNXlRdAKWTE



As someone who knows only the very basics of CPU architecture, this was extremely informative, thank you. It's given me lots to read up on.


I stayed at a holiday inn express last night

(PS don't forget agner fog's microarchitecture, you might be one of today's lucky 10,000! https://www.agner.org/optimize/microarchitecture.pdf )

Consider also scihub:9780849337581 for a general overview of architecture approaches.

It is funny that we keep reapproaching this "barrel processor" design that the CDC 6600 started so long ago. That "Vector processor + peripheral processor" design is super interesting. Nerds are attracted to this design like moths to a flame - it has been repeated and echoed in Sun Niagara, AMD Bulldozer, Xeon Phi, and now Royal Cores/rentable units/zen4c/5c/etc. We can just make one thing run fast, and have a bunch of workers servicing it, that are simple and slow and cheap, right?

https://cs.uwaterloo.ca/~mashti/cs850-f18/papers/cdc6600.pdf

https://archive.computerhistory.org/resources/text/CDC/cdc.6...

http://www.bitsavers.org/pdf/cdc/cyber/cyber_70/60045000_660...

http://ygdes.com/CDC/cdc6600.html

Very interesting source material etc, see how they talked about their own processors. A lot of older systems were exhaustively documented and the info is available now.

--

https://en.wikipedia.org/wiki/UltraSPARC_T1

https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing


Thanks - this will take me a while to digest :).



