This is a nothingburger, though I don't mean that in a derisive way.
MPI and Spark solve very different problems. The overlap is basically zero, and the fact that MPI interest is flat while Spark's fluctuates shows this. HPC is a small, small fraction of the job market; the number of folks involved in HPC is tiny compared to all the click-log-analyzing Spark and Hadoop programmers.
Both of those systems are bulk parallel systems, with the majority of folks aggregating low-information-density data. That is not what MPI clusters are doing when they model weather, calculate subatomic interactions, or simulate wind tunnels.
> The idea that the people at Google doing large-scale machine learning problems (which involves huge sparse matrices) are oblivious to scale and numerical performance is just delusional.
Scale is not latency! Google scales to sizes many orders of magnitude larger than MPI clusters, but it does not run workloads with the connectivity needs that MPI workloads have. It doesn't run supercomputers; it runs massively parallel, mostly embarrassingly parallel bulk-processing machines.
Chapel is a language; MPI is a transport. The author obviously has skills, but the two shouldn't be conflated. Chapel can use MPI.
Chapel supports OFI, MPI, uGNI, and GASNet. This is not unlike saying "don't use SCTP, use Python!" I am not being charitable.
I think one of the most glaring problems with the argument is HPC maintains tons of legacy simulations which may not even have the original authors around. Adding Spark support wouldn’t make this easier.
no, ICI is a fiber optic (or electrical) network between TPUs and it doesn't have any switch functionality. I used to work for Google (on TPUs and ICI). Anyway, the comparisons don't matter that much. The long and short of it is that Google built their own supercomputers (finally) that can do allreduce and alltoall patterns, as well as low-latency broadcast.
Do you have a point? There's nothing technically wrong with saying that TPUs with ICI are a supercomputer; they are very similar in design to the T3E I used decades ago.
Please make your point instead of trying to negate what I said.
Medium matters a lot: because supercomputers are fundamentally constrained by physics in terms of power dissipation, transistor density, and the speed of light, all-optical networks have slightly lower latencies than electrical ones, and also let you build larger systems (longer cables).
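For a rough sense of the scale involved, here is a back-of-the-envelope sketch (assuming a signal velocity of roughly 0.68c in fiber; the exact figure varies by cable type):

    # Rough one-way propagation delay for a cable, to illustrate why cable
    # length and medium show up in supercomputer latency budgets.
    C = 299_792_458           # speed of light in vacuum, m/s
    V_FIBER = 0.68 * C        # assumed signal velocity in optical fiber (~0.68c)

    def propagation_ns(length_m: float, velocity: float = V_FIBER) -> float:
        """One-way propagation delay in nanoseconds for a cable of length_m metres."""
        return length_m / velocity * 1e9

    for length in (2, 10, 50, 100):   # intra-rack up to cross-row cable lengths
        print(f"{length:>4} m  ->  {propagation_ns(length):6.1f} ns one way")

Even at 50 m that's a couple hundred nanoseconds before any switching or serialization, which is why topology and cable length matter at this scale.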
I am so happy this paper is finally out :-) I led the SRE work during their NPI. It's a pain when you have to wait 3 years to discuss what you worked on.
https://dl.acm.org/doi/pdf/10.1145/3360307
Very fast (hundreds of gigabits per connection) networks that attach TPUs on the same board, as well as between boards. It's wired up point to point, forming a logical grid (technically a 2D or 3D torus with wraparound links).
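For illustration, here is a minimal sketch of what wraparound means for neighbor addressing in such a torus; the (4, 4, 4) shape below is an arbitrary example, not an actual TPU pod geometry:

    # Neighbor addressing in a 3D torus with wraparound links.
    from itertools import product

    SHAPE = (4, 4, 4)   # nodes per dimension (hypothetical)

    def neighbors(coord, shape=SHAPE):
        """Return the 2 * ndim neighbors of a node, wrapping around each dimension."""
        result = []
        for dim, size in enumerate(shape):
            for step in (-1, +1):
                nbr = list(coord)
                nbr[dim] = (nbr[dim] + step) % size   # wraparound link
                result.append(tuple(nbr))
        return result

    print(neighbors((0, 0, 3)))   # the +1 step in z wraps back around to z=0
    assert all(len(neighbors(c)) == 6 for c in product(*map(range, SHAPE)))

The wraparound links are what keep the worst-case hop count down without needing a central switch.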
I think a central element of the article is its focus on the needs of genomics researchers. In this field there are many tasks that are highly parallel but not embarrassingly parallel. Ideally you want to scale out to as many processes as possible, but some basic IPC is mandatory.
Though not in the HPC field myself, I've been in a position of helping researchers working with an HPC cluster. On that hardware something like 16 cores per node was typical, so you run out of scaling room quickly with traditional threading libraries. And you really want to scale beyond this: jobs could otherwise take weeks or months. That means using MPI, because that's all the HPC cluster supported. This was a massive pain, because MPI is complicated in ways that are unfamiliar to many researchers, most of whom have no low-level language experience at all.
What the article calls for are HPC techniques that move communication down to a lower-level API rather than exposing it to the programmer / end user. I think that's exactly the right idea for genomics.
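To make the pain point concrete, here is a minimal mpi4py sketch of the boilerplate a researcher ends up writing just to fan work out and pull results back; the per-chunk "analysis" is a placeholder, and this is exactly the layer that could be hidden below a friendlier API:

    # Typical scatter/compute/gather plumbing with mpi4py.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == 0:
        samples = list(range(1000))                       # hypothetical work items
        chunks = [samples[i::size] for i in range(size)]  # one chunk per rank
    else:
        chunks = None

    chunk = comm.scatter(chunks, root=0)    # hand each rank its share
    local = [x * x for x in chunk]          # stand-in for the real analysis
    results = comm.gather(local, root=0)    # collect partial results on rank 0

    if rank == 0:
        flat = [y for part in results for y in part]
        print(f"processed {len(flat)} items across {size} ranks")

Run with something like "mpiexec -n 16 python analyze.py". None of this is genomics-specific, which is rather the point.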
Disclaimer: I'm neither an expert in HPC nor genomics.
LLM training (at least for the runs we have research details on) synchronizes gradients (or updated weights) across all workers every step, so it has very high networking and latency demands. That's why every cloud vendor competes on interconnect speeds for GPU machines.
So they are basically classic supercomputing workloads.
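As a sketch of that per-step sync, expressed here with MPI rather than NCCL or a proprietary interconnect (the gradient array is hypothetical, but the collective has the same shape):

    # Data-parallel gradient averaging: every rank contributes its gradients
    # and receives the sum, once per optimizer step.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    grads = np.random.randn(1_000_000).astype(np.float32)  # this rank's gradients
    synced = np.empty_like(grads)

    comm.Allreduce(grads, synced, op=MPI.SUM)
    synced /= comm.Get_size()   # average used for the next optimizer step

Because this sits on the critical path of every step, interconnect latency and bandwidth directly bound training throughput.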
I don't see a lot of traditional "HPC engineers" in the ml infrastructure space though.
I do wonder if, in hindsight, using a Slurm cluster for scheduling, Lustre for data, and MPI for any connectivity that's not covered by NCCL would have been better than trying to make object storage, gRPC, Kubernetes, Ray, etc. work.
“HPC engineering” and “Optimizing LLM training” are similar. It’s about profiling, performance modeling, finding bottlenecks, rewriting what’s needed, etc. Lots of overlap, and people from more traditional HPC doing it too. Obviously if you’re a 50yo university professor doing airplane simulations you won’t switch to LLMs today though…
Some mixture of MPI, NCCL, Gloo, and whatever proprietary stuff TPU clusters do. All of these are basically either trad-HPC in style or literally from the supercomputing community. Interconnects tend to be InfiniBand or the like, which, again, is straight out of big iron.
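In practice that mixture usually shows up behind a single API. For example, torch.distributed lets you pick nccl (GPU), gloo (CPU), or mpi as the backend while the collective calls stay the same; a minimal sketch, assuming launch via torchrun:

    # Same all_reduce call regardless of which backend does the transport.
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")

    t = torch.ones(4)
    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
        t = t.cuda()

    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # sums across all ranks in place
    print(f"rank {dist.get_rank()}: {t.tolist()}")
    dist.destroy_process_group()

The semantics are the familiar MPI collectives; only the transport underneath changes.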
>> HPC is a small, small fraction of the job market; the number of folks involved in HPC is tiny compared to all the click-log-analyzing Spark and Hadoop programmers.
You're basically saying HPC = MPI users, which is really dismissive of a whole bunch of other people making use of vast compute resources.
I for one can't wait for these masses to convince chip makers that IEEE 754 floating point really has a lot of crap that we don't need.
A lot of MPI's (ab)use in HPC boils down to distributed task management in lieu of a work-queue system available to users. People have embarrassingly parallel jobs but need to coordinate the task management, because many HPC centers don't provide resources for a long-lived service to execute near the cluster (or even general connectivity outside the cluster).
The problem is that you do have to support true parallel MPI jobs in those shared clusters though, so MPI just becomes the hammer for everyone else.
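The (ab)use pattern the parent describes tends to look like this: rank 0 standing in for the work-queue service the center doesn't provide. A rough mpi4py sketch (the tasks are placeholders; run with at least two ranks):

    # MPI pressed into service as a task farm: rank 0 hands out work on demand.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    TASK, STOP = 1, 2   # message tags

    if rank == 0:                                  # dispatcher
        tasks = list(range(100))                   # hypothetical independent jobs
        workers = comm.Get_size() - 1
        status = MPI.Status()
        while tasks or workers:
            comm.recv(source=MPI.ANY_SOURCE, status=status)   # a worker asks for work
            if tasks:
                comm.send(tasks.pop(), dest=status.Get_source(), tag=TASK)
            else:
                comm.send(None, dest=status.Get_source(), tag=STOP)
                workers -= 1
    else:                                          # worker
        status = MPI.Status()
        while True:
            comm.send(None, dest=0)                # request a task
            task = comm.recv(source=0, status=status)
            if status.Get_tag() == STOP:
                break
            _ = task * task                        # stand-in for the real job

All of that is scheduling logic, not science, which is why a proper queueing service (or a scheduler that provides one) is the better home for it.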
Managing the resources a level higher (all resources live in a k8s cluster, with Slurm under k8s) seems to be the best way to really accommodate both types of loads, but most HPC centers are far from implementing that.
(I think that presentation has some misconceptions about k8s, since most k8s clusters are elastic up to a max size, and it sounds like they really want to control most scheduling, but it gives an overview of merging the systems.)
Roughly 20 years ago, the Condor high-throughput computing system gained "glide-ins" to do this sort of repurposing. Before that, Condor was mostly persistent runners on desktop fleets. After, you would submit a batch job to an HPC cluster and for the duration of the job, those HPC nodes became additional runners for an existing Condor scheduler.
Around that time, there was also a period of reservation-based "advanced scheduling" where HPC centers were flirting with making future scheduling promises. The right way to think of these would be like guarantees to get bare metal machine capacity during a certain wall-clock period. In my opinion, the commercial pre-cloud/cloud/virtualization stuff then infected everyone and regressed to time-sharing with fuzzy QoS and lots of over-subscription and dynamic rescheduling.
Of course, these different approaches will all be isomorphic in the end if they explore the full space of application requirements. The traditional paths were just approaching from very different economic priorities. The IaaS folks are incrementally adding more QoS and pricing options which could eventually provide HPC IaaS if carried to full fruition. I.e. future guarantees of significant hardware resources. But as far as I know, those are still in the realm of "talk to a sales rep" and not some automated IaaS request flow at this point.
During the years I was active in grid computing (before cloud computing became huge) Miron Livny (creator of Condor) would basically attend every talk and explain how "condor already does this, why are you reinventing the wheel?"
Hah, yes. And Oracle reps used to say the grid was inside their cluster. A lot of folks did not appreciate the decentralization of the grid. The grid computing concepts were about federation of disparate organizations and their resources, not about how sprawling of a system a single vendor or HPC center could build.
The PITA part of Condor/grid was software management before containers. Sure, everyone was running at least RHEL 4/5/6 (or SL 4/5/6), in many cases AFS worked, and the more advanced operators were adding VM execution, but it was (and still is) annoying to deal with. Most annoying right now is that nobody can agree on a container runtime: there's Docker/Shifter, Singularity, Charliecloud. It should all just be Podman now, but everybody is still holding on.
(I have worked with Slurm, condor, Torque/PBS, gridEngine, DIRAC, and LSF)
Plugging a different scheduler into k8s might be an interesting way of solving this; when I last looked there seemed to be a lot of work on scheduler plugins. Some of the issues are similar in the cloud too, e.g. co-scheduling by latency.
There's at least some incentive not to solve this; I remember k8s on Mesos being popular, and of course we know how that played out.
I believe the terminology is/was "advance reservations" (as in reservations of resources such as CPU, memory, disk space, I/O, and network bandwidth, made in advance on clusters otherwise freely available to ad-hoc jobs) rather than "advanced" anything, or at least it was with the Torque scheduler I reviewed for a clickstream-analysis customer project.
With regards to Spark, consider whether you could possibly do the processing on a single machine. For many workloads, a single-threaded program running on a laptop will beat an entire Spark cluster.
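For example, a group-by over a few tens of GB is often just a streaming pass with the standard library (the file name and columns below are made up):

    # Single-machine, single-threaded streaming aggregation over a large CSV.
    import csv
    from collections import Counter

    counts = Counter()
    total = 0.0

    with open("clicks.csv", newline="") as f:       # hypothetical input file
        for row in csv.DictReader(f):
            counts[row["country"]] += 1             # group-by-key
            total += float(row["revenue"])          # running aggregate

    print(counts.most_common(10), total)

No cluster, no shuffle, and the bottleneck is usually just disk read speed.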
This was a great point to make at the time, when people thought “my data has exceeded Excel’s row limit, therefore I should set up a Hadoop cluster and run Spark jobs against it”.
Since then … it’s become a bit of a meme, unfortunately. Definitely there still exist workloads assigned to Spark clusters that could run on a laptop, especially if the data happens to be there already. But the space as a whole provides immense value, both enabling jobs that really don’t fit on laptops, and moving the compute for laptop sized jobs to where the data happens to be.
The datasets they test against are 6 GB and 15 GB, and I get that those are the two that one of their references uses, but that's clearly not multi-node territory. Also, as they point out, graph computation is not trivially parallelized. Spark is more for doing long-running transformations on independent data in a fault-tolerant way.