A dive into the AMD driver workflow (geohot.github.io)
153 points by tikkun on July 30, 2023 | 76 comments


A possible prescription for AMD regarding AI and CUDA:

1) Open-source driver development as mentioned in this post

2) Set up 24/7 free tech support on Discord, maybe for all use cases, maybe only AI use cases. Do the support via screen sharing, and have a writer join every call so that each solved issue becomes a blog post

3) Have employees run all popular AI tools and get them working on AMD hardware, publish written guides and videos showing how to do it.


The problem is not that people within the company lack good ideas for improving the deep-learning end-user experience; it's just not a priority for AMD. That's annoying as a potential customer, but arguably, whatever their overall strategy is, it's working.


4) Release a consumer GPU with 32-80 GB of VRAM.


It sounds silly, but people would endure a lot of pain to fit significantly larger models into RAM.


It could even be 2-3 generations behind and that thing would still sell.


Or at least make it so that you could rent the larger gpus by the hour.

As far as I know, there isn't a single service that offers bare-metal access to MI210s/MI250s.


AMD has CPUs with built-in GPUs with hundreds of ALUs and unified memory (the GPU uses system memory). Couldn't such a CPU be used for ML tasks without buying an expensive graphics card? 64 GB of ordinary RAM is far cheaper than 64 GB of VRAM.

Currently such built-in GPUs seem to be optimized for games only.


Memory bandwidth is a major factor. Even DDR5 with four channels would be slow compared to current GDDR.
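To put rough numbers on that (a back-of-the-envelope sketch; the DDR5-5600 and 960 GB/s figures are illustrative assumptions, not from the thread -- check your actual parts' datasheets):

```python
# Peak-bandwidth comparison: quad-channel DDR5 vs a high-end GDDR6 card.
ddr5_mts = 5600                               # DDR5-5600, mega-transfers/s (assumed)
channels = 4
ddr5_gbps = ddr5_mts * 8 * channels / 1000    # 8 bytes (64 bits) per channel per transfer

gddr6_gbps = 960                              # e.g. a current flagship's quoted peak (assumed)

print(f"DDR5 x{channels}: {ddr5_gbps:.1f} GB/s vs GDDR6: {gddr6_gbps} GB/s "
      f"({gddr6_gbps / ddr5_gbps:.1f}x)")
```

Even with four channels, system RAM lands around 180 GB/s against roughly 5x that for the GDDR on a discrete card, which is why bandwidth-bound workloads suffer on unified-memory APUs.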


That's correct, but if you are, for example, multiplying two 1024x1024 matrices (typical for ML workloads), you perform roughly two billion operations on only a few million numbers. There is no need to load a lot of data, so we can get by without fast memory in this case, can't we? Hopefully the GPU's caches and registers can handle it.
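The trade-off above can be sketched as an arithmetic-intensity estimate (a rough model assuming ideal data reuse; real kernels tile through caches, and small or skinny matrices fare much worse):

```python
# Arithmetic intensity of C = A @ B for 1024x1024 float32 matrices.
n = 1024
flops = 2 * n**3                  # one multiply + one add per inner-loop step
bytes_moved = 3 * n * n * 4       # read A and B, write C once; 4 bytes per float32
intensity = flops / bytes_moved   # FLOPs per byte of memory traffic

print(f"{flops / 1e9:.1f} GFLOP over {bytes_moved / 1e6:.1f} MB "
      f"-> {intensity:.0f} FLOP/byte")
```

At ~170 FLOP/byte under ideal reuse, a large square matmul is compute-bound on most hardware, so it tolerates slower memory far better than bandwidth-bound ops like elementwise kernels or token-by-token inference do.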


What are you planning to play on that thing?


Llama 2.


I think Hogwarts Legacy could benefit from it. It regularly crashes on my 16GB system due to running out of video memory. It might be lying about the cause, but that's what it claims.


Yes, let's build an open source community on top of a closed platform. It's not like twitter, reddit, Facebook etc have taught us anything.

Open a mailing list like every project that survives more than 5 years. Hell, the barrier to entry will ensure you get people who can use a text editor, and save you half the questions like 'how do I install this on an Intel a2 Mac???'


Yes, absolutely do not use Discord for this; Matrix would be better.


> A note culturally, I do sadly feel like what they responded to was george is upset and saying bad things about us which is bad for our brand and not holy shit we have a broken driver released panicking thousands of people’s kernels and crashing their GPUs. But it’s a start.


George is right.

But

He has a history of being an arrogant prick. That will color people's perceptions of you even if it's not relevant to the immediate interaction.


I found his video pretty dumb. Everybody is assuming the driver is shitty, but he tried an officially unsupported[1] GPU with ROCm and it failed. Big whoop.

He is a serial shit stirrer and this is no exception.

Did he try Windows?

As much as I hate that AMD is not really supporting consumer GPUs for compute, presenting his problems as some variant of production drivers breaking is a stretch at best.

[1]: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h...


Leaving consumer GPUs unsupported is part of the problem and AMD deserves to have its shit stirred for it.

They need to be better than nvidia, not "you'll take what you can get." We can get nvidia. That's the bar. NVidia charges a premium but their shit works (comparatively speaking). AMD has been half-assing their compute offerings for 15 years, it finally became important, and now they need to play catch-up not drag their heels and toss out excuses as to why that's OK. My prediction: AMD won't be second place in this race for long, someone who actually wants the #2 spot will take it from them and AMD will sit around wondering how they blew a 15 year head start.


Remember, geohot was going to write the driver -- he came out of the gate saying "AMD has ROCm but you can't use it on consumer cards, so I'm going to write my own software stack that works with AMD's consumer grade GPUs". He raised $5.1m on that promise.

So far, with all that money, he has compiled their driver, ran it in an unsupported configuration, and then had a complete public mental breakdown because it didn't work. Something he already knew going in.

Should AMD support ROCm on its consumer grade GPUs? Probably. But that's really not geohot's choice to make, unless he wants to get his hands dirty and actually write the software he promised to write.

Having worked on GPU drivers, it's not a couple-lines fix situation, it would be a pretty big investment to add stable ROCm support for consumer GPUs. AMD higher-ups responding with anything other than "lol what it's unsupported what did you expect" is extending a pretty long olive branch here.


The problem is the mismatch between the spec sheet and customer needs, not a mismatch between the spec sheet and the card. I don't know whether this is a management problem at AMD or an engineering problem at AMD or both. I don't really care, either -- a "wasn't me!" from engineering is completely uninteresting to me. The problem is that AMD's consumer cards don't run ROCm while NVidia's consumer cards all run CUDA.

I am amazed that Geohot reached out to AMD to extend an olive branch so far as to force their failure of a product to work despite itself, because frankly I'd just have expected AMD to spin excuses like you did. I am encouraged to see that they took a higher road; hopefully that translates to actual execution. We'll see.


We tried ROCm on the MI50 years ago. Complete shit show, crashes, lackluster support by AMD. I guess it was because we would've just bought a few dozen, so maybe their competent staff was doing support for some startup with huge VC backing buying a few hundred dozen for a new game streaming service.


> "AMD has ROCm but you can't use it on consumer cards, so I'm going to write my own software stack that works with AMD's consumer grade GPUs". He raised $5.1m on that promise.

Having not followed this, I don't understand what the promise was. ROCm works just fine on at least some consumer cards like the 6xxx, and by fine I mean "as bad as on AMD's pro cards", but at least it works out of the box.

Certainly it's not supported, and so therefore they are not shipping precompiled binaries, but it does seem to work...


To give him credit I don't think he was saying he was going to write new drivers, just build a ML software stack that worked on whatever they ship for the consumer cards. Certainly a large project by itself with questionable value at this point, but not writing new drivers.


The AI race is so hot right now that AMD will sit in #2 and pick up all of the people who can't get access to NVIDIA cards. There is a huge centralization issue around only writing your code for NVIDIA... and that is a huge business risk. People are going to wake up to that fast as the supplies of NVIDIA go to zero.

That said, I don't think people realize that there is literally no more large scale tier 3 (redundant) data center power in the US. Even if you are sitting on 1000 or 10k NVIDIA cards, you can't deploy them anywhere.

They also need to be in the same data center for speed. You can't just colo in 2-3 data centers to get what you want. If you want to train a large model across 1000 gpus? You're screwed.

Think you can just go to the cloud? Go try to sign up for coreweave. They are full and not taking any more customers. A lot of the other sites advertising nvidia gpus are just reselling coreweave under the covers.

Forget the software problems. There are far bigger issues and they are not getting better any time soon.


The problem is AMD is not #2 in this field, not by a long shot.

Google and Amazon offer their own hardware that is both better supported and actually available for customers. Apple has fast inference hardware in every computer and mobile device they offer.

I raised this issue with AMD 8 years ago at a technology conference, and the answer I got back then was a shoulder shrug: "we don't think this is an important market." Eight years later, they have all but lost the war.


G/A can't be used for large scale training, nobody is going to give their data to them. Major trust issues there.

Apple is Apple. Not public. Let's also not mix consumer needs with enterprise.

You're correct, 8 years ago and up until recently, AMD only cared about gamers. They are waking up fast though.

ROCm 5.6 is a visible first step in that regard. MI300 will blow the A100/H100 out of the water.

But again, hardware/software isn't the problem here. The problem is much deeper than that... even if you have those things resolved, you can't put them anywhere.


Nobody is going to give OpenAI or X or Meta access to their models but frankly Google/Amazon are at a scale where they’ve already bypassed this trust issue. People already give their code, their operations, etc to large cloud providers, it’s been that way for 10+ years now.

Your shit isn’t so good that google is going to peek under the covers and steal your shit, because that would actually implode their business when they got caught doing it. The net present value of all of google’s future decades of operation is a lot higher than your hot dog detector app, or even critical F500 business operations.


What you're saying is logical, but the perceived reality is different. There are large-scale AI customers out there who absolutely refuse to use the large public cloud providers for training, on the grounds of protecting their data. They want 100% control over it, and they want their own segregated data centers.


> Let's also not mix consumer needs with enterprise.

NVidia mixed them and now everything is written in CUDA. Lol.


And we now have hipcc to go back to AMD. Sweet!


Have fun with that. I burned my hand badly enough on OpenCL that I now know to wait for proof, not promises.


People are doing benchmarks on older ROCm releases and it is looking pretty good.

https://www.mosaicml.com/blog/amd-mi250

Waiting on the updates.

I'll add that I have learned over time not to discount motivation. If AMD is motivated, they can do it. This has been proven already with their dominance of the server CPU market.


hipcc is a joke; it doesn't handle everything CUDA is capable of, especially not the polyglot capabilities.


What do you mean by polyglot? As in multiple hardware, or mixed-source? HIP is mostly API-compatible with CUDA, so you can just mix host code and device code with it.

That said, ROCm does indeed work with machine code instead of IR. You can compile fat binaries with more than one type of machine code and they'll work on any of the chips you compiled for, at the cost of the binaries becoming, well, basically obese if you want a decent range of hardware supported.


> G/A can't be used for large scale training, nobody is going to give their data to them. Major trust issues there.

And yet, that's what a lot of the big AI startups are doing. Granted, it's not what everyday business are doing (yet). But TPUs offer pretty impressive perf/cost ratio, so I'd be surprised if it's actually "nobody".


> that's what a lot of the big AI startups are doing

They don't have any other choice or they are just dumb...

https://www.popsci.com/technology/google-ai-lawsuit/


The fact that this lawsuit exists doesn’t prove anything.

Real evidence that Google or Amazon actually introspected the contents of their cloud platform customer’s VMs, Databases, GPUs, disks, blob storage buckets etc. would be far more convincing, but such evidence doesn’t exist - because it doesn’t happen.


It is enough to scare people away and that is all that matters in the grand scheme of things. I know this for a fact.


> no more large scale tier 3 (redundant) data center power in the US.

This is interesting to me, what is the constraining factor? Raw generator output? Transmission lines getting power to the right places?


Both. Transformers are a big one.

All the large FAANGs have been sucking up availability.


AMD's AI cookie-cutter business model is getting million-dollar contracts with a handful of companies. The peasants don't get access to AMD's data center cards.


And why do you think that is?


Your comment reminds me of another comment (by SilverBirch, I believe) that AMD can do this and will get away with it, and that George Hotz is unimportant to AMD. Then a few weeks later Lisa Su tweeted this: https://twitter.com/LisaSu/status/1669848494637735936.


> As much as I hate that AMD is not really supporting consumer GPUs for compute,

Keep in mind this post is around two months old now, and since it was published AMD has already officially announced plans to support (at least some) consumer GPUs in ROCm.


Everybody saying "poor AMD is such an underdog" for years = perma-broken drivers

George being a prick one time = suddenly working driver appears?

My perception is indeed colored but in the opposite direction.


I didn't even know who Hotz was, but that YouTube rant on this very issue made me extremely skeptical.

Should someone who will just drop a GPU vendor if they get flustered really be leading a ML framework?


Hotz is known for hacking the iPhone and PS3. But you should look deeper into that before giving him any credit.

His post is mostly hot air. AMD driver development is very open, excepting only ROCm. Not knowing where things are discussed or how the various driver systems, excluding Nvidia, interact on Linux is not an excuse to rant.

Corporate AMD kernel development starts in amd-staging-drm-next, with internal AMD pull requests available on the mailing list or here: https://patchwork.freedesktop.org/project/amd-xorg-ddx/serie..., before going through airlied, the drm maintainer in the mainline kernel, and then to Linus.

Everything user-space regarding OpenGL/Vulkan and some rusty OpenCL, aside from the proprietary amdgpu-pro driver which should almost never be used, is in Mesa.

ROCm is the only thing with huge code dumps, obviously because it's a new effort, and it's said outright that it's unsupported for consumer GPUs. Yes, it's a buggy mess. Did his issue warrant the corporate response he got? No.


Hotz followed a guide published by another team’s research and made the exploit into something that could be easily accessed by the public, which the original researchers weren’t too happy about because it brought Sony down on their heads and tightened security up again.

But yes, this one sits squarely in “nobody involved looks real good” territory. AMD puts out shit software in general, and ROCm is exceptionally far behind the standard in this field, and often difficult to set up and broken even in “supported” configurations. And the ROCm team have a habit of doing this closed-door development, splitting the actual work into the AMDGPU-PRO driver while neglecting the open one (even in “supported” configurations), and generally just having a process that is not compatible with delivering high-quality open software.

And Hotz is being a manchild who melts down about a driver he knew was bad, and berates the people trying to rush him patches while they do their day jobs getting the ROCm release out (which hopefully fixes a lot of the complaints around lack of Windows support, lack of consumer GPU support, and general (near-total) lack of polish and stability).

To their credit AMD finally posted a job listing for like 30 engineers for ROCm. It’s just something they should have done like 4 years ago when ROCm first got rolling. And we are well past the “poor lil amd couldn’t afford that” in 2019 or whatever. If you can’t afford it then don’t start the ROCm initiative and make the promises.

Even 2-3 years ago it was clear that the foundation work was done and that if you ever wanted it to go bigger you’d have to do some hiring. And frankly it’s been just as bad with OpenCL before it - support was never good, the runtime constantly had bugs and paper features, etc.

AMD didn’t care about GPGPU compute until AI turned into a billion-dollar industry overnight. And that’s a 15-year story there.


Well said.

I remember the HSA push too...


To be fair one of the issues with AMD and ML is they have been pretending it works for years now, when nothing actually works.

Calling that out publicly is probably needed at this point, or they will keep putting AI in their earnings reports without anyone actually being able to train a model on an AMD chip.


You're welcome to fork and find another leader, I guess.


I don't think some stability or a formal decision-making structure is too much to ask. TVM, MLIR-based projects, and GGML all seem to have stable leadership.


You can say a lot about nVidia, but for me all their products mostly just work on Linux (I use CUDA a lot). I don't understand why AMD has such trouble doing the same. Likewise with CPUs. It is ironic that I have to use Intel's math libraries to get good performance out of my AMD CPU.


This hasn’t been my experience on Linux. The nvidia drivers appear to “just work” but actually caused a lot of instability. My desktop no longer crashes every other day (or more often when gaming) since switching to an AMD GPU.


They work absolutely fine if you stick with what worked five years ago and don't update anything until some Nvidia blog says you should.

Just don't use Wayland, don't use too many screens, don't use a laptop, don't run a recent kernel and don't expect software features like their special sauce screen recorder, or that trick where you can get a free camera inside a game, or anything else packed into their gaming toolkit on Windows. Oh, and accept a very high idle power draw if you choose not to go with Windows.

With all of that, most games work out of the box. CUDA works, video encoding and decoding works (though you're severely limited in terms of the number of simultaneous streams without hex editing the driver).

I do get the occasional Nvidia related crash, but it's been a while. Still, I don't think I'll ever consider buying Nvidia for a Linux device again. I was a fool to think "Nvidia has come a long way, I'm sure a laptop with an Nvidia GPU will work fine with a few tweaks".


I assume that how an Nvidia GPU works on a laptop depends a lot on the laptop.

On a Lenovo gaming laptop, I lost two days figuring out how to configure Linux around Nvidia Optimus, but after that it has worked fine.

On the other hand, on several Dell Precision laptops with Nvidia GPUs (sold with Ubuntu, though I wiped their Ubuntu and installed another Linux distribution from scratch), the Nvidia GPUs have worked perfectly out of the box, without any effort.

I have not tried Wayland, but I have used 3 monitors for many years, with various Nvidia GPUs and without any problems. The configuration of multiple monitors with the NVIDIA X Server Settings program is much simpler than in Linux systems with AMD or Intel GPUs.

Because almost every new Linux kernel version breaks out-of-tree device drivers, it is unavoidable that some time passes until NVIDIA releases a compatible driver version, though that might change soon, when their new open-source kernel module is integrated into the kernel sources.

Nevertheless, the NVIDIA driver has always supported the latest long-term kernel, so if you update only between long-term versions there are no problems with incompatible NVIDIA drivers.


> Just don't use Wayland

Still reasonable

> don't use too many screens,

Is there such a thing as too many?

> don't use a laptop

Never bought one with a dGPU due to NVH concerns

> don't run a recent kernel

Using nvidia drivers on Arch for years now, what seems to be the problem, officer?

> don't expect software features like their special sauce screen recorder, or that trick where you can get a free camera inside a game, or anything else packed into their gaming toolkit on Windows.

Yeah their desktop-software on Linux is clearly the same stuff they had in 2004, right down to the Qt 3.1 version it's built with and the gamma ramp editor.

> Oh, and accept a very high idle power draw if you choose not to go with Windows.

Power management has worked exactly the same as on Windows with every nVidia card I've ever seen on Linux.

> a laptop

I think this is the actual salient issue here. Firmware quality varies wildly on laptops, not just for stuff like this, but even much more... basic things. Intel AX2xx wifi cards cause crashes and freezes in both Windows and Linux when used in some laptops (Thinkpads, namely), but are perfectly fine and dandy in others, or on desktops.

I suspect firmware quality correlates strongly with how much the OEM decided to write, which is to say, it gets worse with every line of code added by the OEM. I think that's why Thinkpads, HPs, Dells etc. are so notorious for their shitty firmware and trashy ECs, while practical no-names have none of these problems, simply because they're more or less just sticking the Intel (or AMD) reference platform in a box - and suddenly it "just works".


I always see this comment, but I have used NVidia GPUs for gaming on Linux for a decade and they have always worked perfectly with the proprietary drivers.


The thing with Linux is that it’s extremely fragmented. We could be running totally different setups from both a hardware and a software perspective.

When using Linux i’ve regularly encountered the “it just works” argument and then been disappointed when it in fact did not.

The only thing I can do is accept that my own experience with Linux is unique and try to optimize for my own situation. Nvidia has historically been decent for me but AMD has been significantly better.


I've found them to be rock solid stable, and of course unlike AMD they are accelerated.

NVidia are literally the only game in town for video editing, because AMD won't provide compute acceleration in Linux.


For gaming and desktop usage, the situation is completely the opposite. Nvidia is plagued by not having upstreamed drivers, and AMD just works.


Really, on Linux you would be better off having a primary AMD or Intel GPU and then a secondary Nvidia GPU just for compute/CUDA tasks. But alas, many new AM5 motherboards don't even have a secondary x16 PCIe slot capable of even PCIe 4.0 x4 bandwidth anymore. I guess we should thank this cloud compute craze for that...


Some, like ASRock, reduced the number of PCIe slots and increased the number of USB4 ports. So there is at least the option to use an external GPU now, since that's where they seem to be re-allocating the available bandwidth.


Having run an Nvidia Linux gaming setup, from my experience it doesn't work well at all: broken Wayland, no hardware video encoding, etc.

Sure, some of those could be solved by spending weeks trying hacks like VA-API-over-NVDEC emulation, but at the end of the day this is a waste of time. tl;dr: everything works on Nvidia/Windows, many things are broken on Nvidia/Linux.

Nvidia on Linux is only good for non-desktop use: CUDA/compute/ML over SSH.

AMD is the exact opposite: desktop things work better, except there is no CUDA/compute/ML.

So if you want something that works over the whole spectrum -- desktop, Wayland, compute/ML, video enc/dec, screen sharing, etc. -- it doesn't exist on Linux.


> And in order to beat them, you must be playing to win, not just playing not to lose. Combine the driver openness with public hardware docs and you have a competitive advantage.

This is so immensely true. I’ve been so enthusiastically loyal to intel gpus in the last ten years or so (avoiding laptops with any kind of discrete gpus) because dealing with closed drivers is so much pain.

I’m still skeptical about amd gpus, even though i hear good stuff about it.

I just want hardware I can trust, knowing I won't have dumb driver issues.


I knew for years that AMD’s development model leaves a lot to be desired.

They like to throw big balls of source over the wall, and very soon after the bugs that keep haunting the previous generation of hardware just stop getting fixed. You’re SOL unless Dave Airlie himself runs into identical problems on his personal gear and gets angry enough about it to make a fix.


Yeah, I've ranted about this here in the past. I'm glad someone high-profile enough is finally doing the same, as that might actually lead somewhere.

Their closed development process is so moronic. It means their code is always ahead and out of sync with what is in public. There have been user provided fixes and improvements to ROCm on GitHub with no reaction from AMD. Probably because it wouldn't apply cleanly to whatever they currently have. It's sad to see your customers having to fix your drivers. It's even sadder to see you ignore it.


This is the greatest travesty of closed/open models, IMHO.

I get exactly what happens -- devs are shipping their sprint pipeline internally, and pull requests or issues from the current open source head never make it onto their radar.

This gatekeeps bug visibility behind whatever PM is running sprint planning, which always leads to delivering what management wants instead of what users are dealing with.

In complex, diverse environments, you're always going to have a myriad of bugs that are caused by a specific configuration.

Absent a path to actually fix that, you end up with an enterprise-only product that's only stable in extremely specific configurations.

And the greatest evil... you're also ignoring helpful reports of configurations where it's broken!


I'm so thoroughly confused about why AMD wouldn't be falling over themselves to enable geohot and his followers to build an alternative to CUDA and NVIDIA. This feels like a conversation that geohot is attempting with feckless product and software managers who certainly can't make bold decisions. Has the CEO of AMD effectively spoken about this problem?


Why would they spend limited resources on a fight they don't want?

Part of leading a company is knowing which markets to pass up.


You say that like CEOs are infallible. Thorsten Heins and Antonio M. Perez would like to have a word with you.


ROCm is a bit of a mess. I have used it successfully on a 6700 XT without issue, though it was confusing because I thought I would need a "pro" driver, which I did not.


Knowledgeable people, please explain. Vulkan allows executing code on the GPU. Can it be used for typical ML tasks (e.g. large matrix multiplication)? Why do we need drivers like CUDA or ROCm then?


To answer your second question directly: CUDA predates Vulkan.

So if your question then becomes: Why did we need Vulkan if we have CUDA? OpenGL, DirectX, Metal, and Vulkan all have lots of graphics-specific concepts built in to their APIs. While you could probably implement most of this in CUDA, at some level you need hardware support for your graphics APIs because in the end the data in your framebuffer goes from the GPU's memory to your monitor via HDMI or DisplayPort.

CUDA has no interface for that graphics-specific stuff (technically there are CUDA-OpenGL interop APIs inside CUDA) because it has to be able to run on AI acceleration hardware in datacenters (although that hardware didn't exist when CUDA was invented either, to be fair).


I think we'll get there but it will take years to arrive at a robust solution, because of the complexity of the ecosystem. The cooperative matrix extension, the portable and standard way to do what is usually marketed as "tensor cores," just landed in the spec and drivers a couple months ago. It will take people time to figure out how to use them effectively, partly because you have to query which options are supported at runtime.

There's still a driver and you're still at mercy of it (bugs and all), it's just much more likely to be installed on your Windows or Linux machine than a ROCm stack.


That's why "ROCm support" is pure marketing and FUD. And people rightfully won't buy into it.



