Currently there's a big problem in GPGPU computing: high latency before the computation results are available, which can be tens of milliseconds. This significantly limits the types of tasks you can efficiently offload to the GPU. I understand AMD's hUMA/HSA is supposed to address this problem.
But there's another problem: current CPU memory buses are connected to two or more DDR3 memory channels, and DDR3 simply doesn't have sufficient bandwidth for high-performance graphics and GPGPU computing, especially when shared with the CPU.
Intel's Haswell will have the CPU and GPU on-package together with a shared 128 MB eDRAM "L4 cache" at 64 GB/s. I believe that should enable low-latency, high-bandwidth memory sharing.
I don't understand AMD's bandwidth story. Does the GPU share one memory controller with the CPU and have another private one, for example connected to GDDR5? I don't see how hUMA could work efficiently over the PCIe bus either, so I guess hUMA applies only to APUs (CPU and GPU on the same die). How does AMD provide the bandwidth?
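For concreteness, here is a minimal sketch of the round trip that causes that latency, in the classic explicit-copy model (OpenCL 1.x style; the context, queue, kernel, and names are assumed to be set up elsewhere and are purely illustrative, not any specific AMD code):

    /* Minimal sketch of the classic offload round trip: copy in over PCIe,
     * run the kernel, copy the results back out. Context/queue/kernel creation
     * and error checking are assumed to exist elsewhere; names are made up. */
    #include <CL/cl.h>

    void offload_classic(cl_context ctx, cl_command_queue q, cl_kernel k,
                         float *host, size_t n)
    {
        cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    n * sizeof(float), NULL, NULL);

        /* Copy in, compute, copy out -- each blocking step adds latency and
         * burns bus bandwidth before the CPU can see the result. */
        clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, n * sizeof(float), host,
                             0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &dev);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dev, CL_TRUE, 0, n * sizeof(float), host,
                            0, NULL, NULL);

        clReleaseMemObject(dev);
    }

The copies in the middle are exactly what a shared address space is supposed to eliminate.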
> DDR3 simply doesn't have sufficient bandwidth for high-performance graphics and GPGPU computing, especially when shared with the CPU
This is an overgeneralization. Real-time 3D rendering is bandwidth-hungry, but there are plenty of cache-friendly or latency-bound codes in areas that can benefit from GPGPU.
Though one problem is that we don't yet have a lot of popular GPGPU apps on the mass market, and the ones that we do have exploit the strengths and weaknesses of the old-school GPUs (= high bandwidth, limited communication needs with the CPU).
I think hUMA/HSA is a value proposition and not meant for high-end graphics or GPGPU, which can easily stream through GBs of data very quickly (high-end cards have > 4 GB on board). Even Haswell strikes me as a value product; everything is great until your problem doesn't fit into your cache.
The transparent memory hierarchy is still quite expensive, and there are lots of performance benefits, at least at the high end, to managing it yourself.
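To make "managing it yourself" concrete, here is a rough sketch in OpenCL C (assuming a work-group size of 256; the kernel name and the reduction are just illustrative) of explicitly staging data into on-chip local memory instead of relying on a transparent cache:

    /* Sketch: explicit staging into the GPU's on-chip scratchpad (__local
     * memory) rather than trusting a transparent cache. Assumes a
     * work-group size of 256; name and reduction are illustrative. */
    __kernel void sum_tiles(__global const float *in, __global float *out)
    {
        __local float tile[256];            /* programmer-managed "cache" */
        size_t lid = get_local_id(0);

        tile[lid] = in[get_global_id(0)];   /* explicit fill from global memory */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* simple tree reduction done entirely out of local memory */
        for (size_t stride = 128; stride > 0; stride >>= 1) {
            if (lid < stride)
                tile[lid] += tile[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            out[get_group_id(0)] = tile[0];
    }

At the high end, that kind of hand placement is where much of the performance comes from.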
>I think hUMA/HSA is a value proposition and not meant for high-end graphics or GPGPU, which can easily stream through GBs of data very quickly (high-end cards have > 4 GB on board).
I think the use of it in gaming consoles goes against that a bit. I also don't see why you're tying the architecture to specific hardware. You're thinking of an x86 CPU with an embedded-class GPU in it. What happens if they build a high-end GPU with a CPU in it? Nothing stops them from putting 4 GB of DRAM on the same package, like they do with GPUs (or did historically with SRAM on slot-based CPUs). You can certainly imagine a market for both; the latter would just be the expensive / high-end model of the former.
And if you really, actually need a dedicated GPU for some highly specialized workload, I imagine you'll still be able to buy one. But now you're talking about the market where people currently put four GPUs into one box, which is not exactly mass market.
There are a lot of performance benefits to managing the memory hierarchy yourself. But I think that the demise of the Cell demonstrated that not enough people are willing to do it to justify it as an architectural decision.
I was talking about a single chip. There was a lot of interest in Cell in the HPC market, but that also died.
For applications that can tolerate the latency, offloading computation to GPUs is such a performance multiplier that people put up with manually managing the hierarchy - but even there, it's rather coarse memory management.
I'm surprised it's taken this long to get unified memory access across CPU and GPU. Carmack has been asking for it for a while now, and that's just for the traditional application of a GPU (graphics). For GPGPU this would be massive. It's also probably one of the few places AMD can really compete with Intel, as they have both CPU and GPU tech.
I think Nvidia will do it too, starting with Tegra 6 (Denver/Maxwell) and beyond, but in mobile devices. I think ARM can already share memory between Cortex-A15 CPUs and their Mali-T600 line of GPUs right now (ARM is also part of the HSA Foundation).
That's a good point. I was thinking about this in terms of the traditional x86 fight between Intel and AMD. I could see Nvidia going beyond mobile with it though. Consider this playbook for Nvidia:
1) Build a 64-bit ARM chip with the latest GPU tech (not the generations-behind stuff they've been using on mobile) and unified addressing, with the memory controller on die.
2) Stick on a bunch of fast RAM, some fast flash for storage, and a gigabit Ethernet chip to build a blade server.
3) Start selling these to people who today have GPGPU-type loads. This should easily be cheaper and more power-efficient than the equivalent Intel solution.
4) As 64-bit ARM becomes more performance-competitive with x86 (is that happening?) and people get used to developing for ARM, move to take over more traditional CPU workloads.
It would make sense for them, as they've already tried to enter the x86 market with their motherboard/chipset business before and been mildly successful. Back then they were just trying to build a better x86 platform and got squeezed by Intel; here they'd be using the classic Innovator's Dilemma strategy of coming from a lesser product (ARM) to dominate the market.
One of the most interesting points in Ars' previous articles about AMD was that before they decided to buy ATI they were actually considering Nvidia, but the sticking point was that Nvidia's CEO wanted to be CEO of the joint company. It makes you wonder what could have come out of AMD+Nvidia with Jen-Hsun Huang at the helm.
>As 64-bit ARM becomes more performance-competitive with x86 (is that happening?)
Correct me if I'm wrong, but I don't think there are actually any 64-bit ARM cores available to consumers yet. A quick Google search suggests they are just getting these cores into ICs now:
Interesting, I'd not seen that before, thanks. It's not clear to me if I can actually buy one yet; their contact page doesn't list the 64-bit servers, which is weird. Anyway, I've pinged them for more information and would be interested in playing with one.
The low-barrier intermixing of GPU/CPU code is a pretty ambitious project on the software side; they need to implement it well and get developers on board. I hope AMD can pull it off.
Their fate depends on leveraging their GPU lead over Intel, and it's much easier to port code over to the GPU if you don't have to completely rewrite it around old-school GPU data-shuffling requirements.
This might have a better chance on the PS4/Xbox 720 side, with one-size-fits-all hardware and hopefully fewer driver problems.
> hUMA addresses this, too. Not only can the GPU in a hUMA system use the CPU's addresses, it can also use the CPU's demand-paged virtual memory. If the GPU tries to access an address that's been written out to disk, the CPU springs into life, calling on the operating system to find the relevant bit of data and load it into memory.
Is demand-paging actually relevant, or just a poorly chosen example of what would theoretically be possible? I'd think that in an application where the worry is memory transfer speed, one wouldn't ever want to be swapping to disk. Better a swift death by the OOM killer than drowning in molasses.
More generally, do swap files still have a useful role to play in high performance computing? I'd think the window between "fits in RAM so no need for a swap file" and "ever so slightly larger than RAM so we can quickly page in what we need" is thin and growing thinner.
Sharing a virtual address space, transferring directly to and from RAM, and hardware cache synchronization sound like real advantages, though.
It allows people to treat memory on the GPU just as they do on CPUs. You are correct that you will probably not achieve high performance if you are paging in from disk all the time. But what if you only want it done once every ten minutes? Consider the amount of programmer effort required to do that manually. This is part of sharing the virtual address space.
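As a sketch of what that shared virtual address space looks like to the programmer -- using OpenCL 2.0 fine-grained SVM as the illustrative API here, assuming a device and runtime that support it; the helper name is made up and setup/error handling are omitted:

    /* Sketch: one allocation visible to both CPU and GPU. The kernel
     * dereferences the same pointer the CPU uses -- no explicit copies.
     * Requires fine-grained SVM support; setup/error handling omitted. */
    #include <CL/cl.h>

    void svm_example(cl_context ctx, cl_command_queue q, cl_kernel k, size_t n)
    {
        float *data = (float *)clSVMAlloc(ctx,
                          CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                          n * sizeof(float), 0);

        for (size_t i = 0; i < n; ++i)         /* CPU writes through the pointer */
            data[i] = 1.0f;

        clSetKernelArgSVMPointer(k, 0, data);  /* hand the GPU the same pointer */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clFinish(q);

        float first = data[0];                 /* CPU reads the GPU's results */
        (void)first;
        clSVMFree(ctx, data);
    }

The demand-paging case from the article is the same idea seen from the GPU side: if some of those pages had been swapped out, the OS services the fault instead of the programmer staging the data back in by hand.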
Doesn't sharing a virtual memory context with the GPU increase the cost of context switching? Also, which CPU core shares context with the GPU? Or are we talking about a fixed mapping (like the kernel)?
This sounds really, really, really freaking cool. I am overjoyed to see AMD not throwing in the towel and conceding the entire high-end CPU market to Intel. A monopoly there would threaten Moore's law.
I can think of a lot of cool things to do with hUMA. I might have to get one and dust off my once very strong interest in evolutionary computation (strongly biomorphic genetic algorithms, artificial life, etc.). EC can do very interesting things -- it's the only "AI" technique I am aware of that can be genuinely creative -- but it eats CPU cycles for breakfast.
It would also be great for creating a practical fully-homomorphic-cryptosystem-based virtual machine for "blind cloud computing" -- where the VM host has no idea what the VM is doing. All kinds of neato stuff is waiting on this kind of computing platform to be practical.
AMD (Intel too, sometimes) often goes on talking about its new technology, yet customers have shown they don't care. There are literally only two things that matter to buyers: price and "speed." Anything else is just PR hype for the investors.
Unified memory is possibly going to bring performance gains because you won't need to do all the crazy memory shuffling that people do on GPUs these days. Also, making programming simpler makes it cheaper. So it seems that this is a win if they can get the price point right.