Currently there's a big problem in GPGPU computing: high latency before the computation results are available, which can be tens of milliseconds. This significantly limits the types of tasks you can efficiently offload to the GPU. I understand AMD's hUMA/HSA is supposed to address this problem.
But there's another problem: current CPU memory buses are connected to two or more DDR3 memory channels, and DDR3 simply doesn't have sufficient bandwidth for high-performance graphics and GPGPU computing, especially when shared with the CPU.
Intel's Haswell will have the CPU and GPU on-package together with a shared 128 MB eDRAM "L4 cache" at 64 GB/s. I believe that should enable low-latency, high-bandwidth memory sharing.
I don't understand AMD's bandwidth story. Does the GPU share one memory controller with the CPU and have another private one, for example connected to GDDR5? I don't see how hUMA could work efficiently over the PCIe bus either, so I guess hUMA applies only to APUs (CPU and GPU on the same die). How does AMD provide the bandwidth?
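For concreteness, here is a minimal sketch of the round trip that causes that latency, in the classic explicit-copy model (OpenCL 1.x style; the context, queue, kernel, and names are assumed to be set up elsewhere and are purely illustrative, not any specific AMD code):

    /* Minimal sketch of the classic offload round trip: copy in over PCIe,
     * run the kernel, copy the results back out. Context/queue/kernel creation
     * and error checking are assumed to exist elsewhere; names are made up. */
    #include <CL/cl.h>

    void offload_classic(cl_context ctx, cl_command_queue q, cl_kernel k,
                         float *host, size_t n)
    {
        cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    n * sizeof(float), NULL, NULL);

        /* Copy in, compute, copy out -- each blocking step adds latency and
         * burns bus bandwidth before the CPU can see the result. */
        clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, n * sizeof(float), host,
                             0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &dev);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dev, CL_TRUE, 0, n * sizeof(float), host,
                            0, NULL, NULL);

        clReleaseMemObject(dev);
    }

The copies in the middle are exactly what a shared address space is supposed to eliminate.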
> DDR3 simply doesn't have sufficient bandwidth for high-performance graphics and GPGPU computing, especially when shared with the CPU
This is an overgeneralization. Real-time 3D rendering is bandwidth-hungry, but there are plenty of cache-friendly or latency-bound codes in areas that can benefit from GPGPU.
Though one problem is that we don't yet have a lot of popular GPGPU apps on the mass market, and the ones that we do have exploit the strengths and weaknesses of the old-school GPUs (= high bandwidth, limited communication needs with the CPU).
I think hUMA/HSA is a value proposition and not meant for high-end graphics or GPGPU, which can easily stream through GBs of data very quickly (high-end cards have > 4 GB on board). Even Haswell strikes me as a value product; everything is great until your problem doesn't fit into your cache.
The transparent memory hierarchy is still quite expensive, and there are lots of performance benefits, at least at the high end, to managing it yourself.
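To make "managing it yourself" concrete, here is a rough sketch in OpenCL C (assuming a work-group size of 256; the kernel name and the reduction are just illustrative) of explicitly staging data into on-chip local memory instead of relying on a transparent cache:

    /* Sketch: explicit staging into the GPU's on-chip scratchpad (__local
     * memory) rather than trusting a transparent cache. Assumes a
     * work-group size of 256; name and reduction are illustrative. */
    __kernel void sum_tiles(__global const float *in, __global float *out)
    {
        __local float tile[256];            /* programmer-managed "cache" */
        size_t lid = get_local_id(0);

        tile[lid] = in[get_global_id(0)];   /* explicit fill from global memory */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* simple tree reduction done entirely out of local memory */
        for (size_t stride = 128; stride > 0; stride >>= 1) {
            if (lid < stride)
                tile[lid] += tile[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            out[get_group_id(0)] = tile[0];
    }

At the high end, that kind of hand placement is where much of the performance comes from.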
>I think hUMA/HSA is a value proposition and not meant for high-end graphics or GPGPU, which can easily stream through GBs of data very quickly (high-end cards have > 4 GB on board).
I think the use of it in gaming consoles goes against that a bit. I also don't see why you're tying the architecture to specific hardware. You're thinking of an x86 CPU with an embedded-class GPU in it. What happens if they build a high-end GPU with a CPU in it? Nothing stops them from putting 4 GB of DRAM on the same package, like they do with GPUs (or did historically with SRAM on slot-based CPUs). You can certainly imagine a market for both; the latter would just be the expensive / high-end model of the former.
And if you really, actually need a dedicated GPU for some highly specialized workload, I imagine you'll still be able to buy one. But now you're talking about the market where people currently put four GPUs into one box, which is not exactly mass market.
There are a lot of performance benefits to managing the memory hierarchy yourself. But I think that the demise of the Cell demonstrated that not enough people are willing to do it to justify it as an architectural decision.
I was talking about a single chip. There was a lot of interest in Cell in the HPC market, but that also died.
For applications that can tolerate the latency, offloading computation to GPUs is such a performance multiplier that people put up with manually managing the hierarchy - but even there, it's rather coarse memory management.
I'm surprised it's taken this long to get unified memory access across CPU and GPU. Carmack has been asking for it for a while now, and that's just for the traditional application of a GPU (graphics). For GPGPU this would be massive. It's also probably one of the few places AMD can really compete with Intel, as they have both CPU and GPU tech.
I think Nvidia will do it too, starting with Tegra 6 (Denver/Maxwell) and beyond, but in mobile devices. I think ARM can already share memory between Cortex-A15 CPUs and their Mali-T600 line of GPUs right now (ARM is also part of the HSA Foundation).
That's a good point. I was thinking about this in terms of the traditional x86 fight between Intel and AMD. I could see Nvidia going beyond mobile with it though. Consider this playbook for Nvidia:
1) Build a 64-bit ARM chip with the latest GPU tech (not the generations-behind stuff they've been using on mobile) and unified addressing, with the memory controller on die.
2) Stick on a bunch of fast RAM, some fast flash for storage, and a gigabit Ethernet chip to build a blade server.
3) Start selling these to people who today have GPGPU-type loads. This should easily be cheaper and more power-efficient than the equivalent Intel solution.
4) As 64-bit ARM becomes more performance-competitive with x86 (is that happening?) and people get used to developing for ARM, move to take over more traditional CPU workloads.
It would make sense for them, as they've already tried to enter the x86 market with their motherboard/chipset business before and been mildly successful. Back then they were just trying to build a better x86 platform and got squeezed by Intel; here they'd be using the classic Innovator's Dilemma strategy of coming from a lesser product (ARM) to dominate the market.
One of the most interesting points in Ars' previous articles about AMD was that before they decided to buy ATI they were actually considering Nvidia, but the sticking point was that Nvidia's CEO wanted to be CEO of the joint company. It makes you wonder what could have come out of AMD+Nvidia with Jen-Hsun Huang at the helm.
>As 64-bit ARM becomes more performance-competitive with x86 (is that happening?)
Correct me if I'm wrong, but I don't think there are actually any 64-bit ARM cores available to consumers yet. A quick Google search suggests they are just getting these cores into ICs now:
Interesting, I'd not seen that before, thanks. It's not clear to me if I can actually buy one yet; their contact page doesn't list the 64-bit servers, which is weird. Anyway, I've pinged them for more information and would be interested in playing with one.
The low-barrier intermixing of GPU/CPU code is a pretty ambitious project on the software side; they need to implement it well and get developers on board. I hope AMD can pull it off.
Their fate depends on leveraging their GPU lead over Intel, and it's much easier to port code over to the GPU if you don't have to completely rewrite it around old-school GPU data-shuffling requirements.
This might have a better chance on the PS4/Xbox 720 side, with one-size-fits-all hardware and hopefully fewer driver problems.
> hUMA addresses this, too. Not only can the GPU in a hUMA system use the CPU's addresses, it can also use the CPU's demand-paged virtual memory. If the GPU tries to access an address that's been written out to disk, the CPU springs into life, calling on the operating system to find the relevant bit of data and load it into memory.
Is demand-paging actually relevant, or just a poorly chosen example of what would theoretically be possible? I'd think that in an application where the worry is memory transfer speed, one wouldn't ever want to be swapping to disk. Better a swift death by the OOM killer than drowning in molasses.
More generally, do swap files still have a useful role to play in high performance computing? I'd think the window between "fits in RAM so no need for a swap file" and "ever so slightly larger than RAM so we can quickly page in what we need" is thin and growing thinner.
Sharing a virtual address space, transferring directly to and from RAM, and hardware cache synchronization sound like real advantages, though.
It allows people to treat memory on the GPU just as they do on CPUs. You are correct that you will probably not achieve high performance if you are paging in from disk all the time. But what if you only want it done once every ten minutes? Consider the amount of programmer effort required to do that manually. This is part of sharing the virtual address space.
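As a sketch of what that shared virtual address space looks like to the programmer -- using OpenCL 2.0 fine-grained SVM as the illustrative API here, assuming a device and runtime that support it; the helper name is made up and setup/error handling are omitted:

    /* Sketch: one allocation visible to both CPU and GPU. The kernel
     * dereferences the same pointer the CPU uses -- no explicit copies.
     * Requires fine-grained SVM support; setup/error handling omitted. */
    #include <CL/cl.h>

    void svm_example(cl_context ctx, cl_command_queue q, cl_kernel k, size_t n)
    {
        float *data = (float *)clSVMAlloc(ctx,
                          CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                          n * sizeof(float), 0);

        for (size_t i = 0; i < n; ++i)         /* CPU writes through the pointer */
            data[i] = 1.0f;

        clSetKernelArgSVMPointer(k, 0, data);  /* hand the GPU the same pointer */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clFinish(q);

        float first = data[0];                 /* CPU reads the GPU's results */
        (void)first;
        clSVMFree(ctx, data);
    }

The demand-paging case from the article is the same idea seen from the GPU side: if some of those pages had been swapped out, the OS services the fault instead of the programmer staging the data back in by hand.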
Doesn't sharing a virtual memory context with the GPU increase the cost of context switching? Also, which CPU core shares context with the GPU? Or are we talking about a fixed mapping (like the kernel)?
This sounds really, really, really freaking cool. I am overjoyed to see AMD not throwing in the towel and conceding the entire high-end CPU market to Intel. A monopoly there would threaten Moore's law.
I can think of a lot of cool things to do with hUMA. I might have to get one and dust off my once very strong interest in evolutionary computation (strongly biomorphic genetic algorithms, artificial life, etc.). EC can do very interesting things -- it's the only "AI" technique I am aware of that can be genuinely creative -- but it eats CPU cycles for breakfast.
It would also be great for creating a practical fully-homomorphic-cryptosystem-based virtual machine for "blind cloud computing" -- where the VM host has no idea what the VM is doing. All kinds of neato stuff is waiting on this kind of computing platform to be practical.
AMD (Intel too, sometimes) often goes on talking about its new technology, yet customers have shown they don't care. There are literally only two things that matter to buyers: price and "speed." Anything else is just PR hype for the investors.
Unified memory is possibly going to bring performance gains because you won't need to do all the crazy memory shuffling that people do on GPUs these days. Also, making programming simpler makes it cheaper. So it seems that this is a win if they can get the price point right.