This is a nothingburger, though I don't mean that in a derisive way.
MPI and Spark solve very different problems. The overlap is basically zero, and the fact that MPI interest is flat while Spark's fluctuates shows this. HPC is a small, small fraction of the job market; the number of folks involved in HPC is tiny compared to all the click-log-analyzing Spark and Hadoop programmers.
Both of those systems are bulk parallel systems, with the majority of folks aggregating low-information-density data. That is not what MPI clusters are doing when they model weather, calculate subatomic interactions, or simulate wind tunnels.
> The idea that the people at Google doing large-scale machine learning problems (which involves huge sparse matrices) are oblivious to scale and numerical performance is just delusional.
Scale is not latency! Google scales to sizes many orders of magnitude larger than MPI clusters, but it does not run workloads with the connectivity needs that MPI workloads have. It doesn't run supercomputers; it runs massively parallel, mostly embarrassingly parallel bulk-processing machines.
Chapel is a language; MPI is a transport. The author obviously has skills, but the two shouldn't be conflated. Chapel can use MPI.
Chapel supports OFI, MPI, uGNI, and GASNet. This is not unlike saying "don't use SCTP, use Python!" I am not being charitable.
I think one of the most glaring problems with the argument is HPC maintains tons of legacy simulations which may not even have the original authors around. Adding Spark support wouldn’t make this easier.
no, ICI is a fiber optic (or electrical) network between TPUs and it doesn't have any switch functionality. I used to work for Google (on TPUs and ICI). Anyway, the comparisons don't matter that much. The long and short of it is that Google built their own supercomputers (finally) that can do allreduce and alltoall patterns, as well as low-latency broadcast.
Do you have a point? There's nothing technically wrong with saying that TPUs with ICI are a supercomputer; they are very similar in design to the T3E I used decades ago.
Please make your point instead of trying to negate what I said.
Medium matters a lot: because supercomputers are fundamentally constrained by physics in terms of power dissipation, transistor density, and the speed of light, all-optical networks have slightly lower latencies than electrical ones, and also let you build larger systems (longer cables).
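For a rough sense of the scale involved, here is a back-of-the-envelope sketch (assuming a signal velocity of roughly 0.68c in fiber; the exact figure varies by cable type):

    # Rough one-way propagation delay for a cable, to illustrate why cable
    # length and medium show up in supercomputer latency budgets.
    C = 299_792_458           # speed of light in vacuum, m/s
    V_FIBER = 0.68 * C        # assumed signal velocity in optical fiber (~0.68c)

    def propagation_ns(length_m: float, velocity: float = V_FIBER) -> float:
        """One-way propagation delay in nanoseconds for a cable of length_m metres."""
        return length_m / velocity * 1e9

    for length in (2, 10, 50, 100):   # intra-rack up to cross-row cable lengths
        print(f"{length:>4} m  ->  {propagation_ns(length):6.1f} ns one way")

Even at 50 m that's a couple hundred nanoseconds before any switching or serialization, which is why topology and cable length matter at this scale.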
I am so happy this paper is finally out :-) I led the SRE work during their NPI. It's a pain when you have to wait 3 years to discuss what you worked on.
https://dl.acm.org/doi/pdf/10.1145/3360307
Very fast (hundreds of gigabits per connection) networks that attach TPUs on the same board, as well as between boards. It's wired up point to point, forming a logical grid (technically a 2D or 3D torus with wraparound links).
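For illustration, here is a minimal sketch of what wraparound means for neighbor addressing in such a torus; the (4, 4, 4) shape below is an arbitrary example, not an actual TPU pod geometry:

    # Neighbor addressing in a 3D torus with wraparound links.
    from itertools import product

    SHAPE = (4, 4, 4)   # nodes per dimension (hypothetical)

    def neighbors(coord, shape=SHAPE):
        """Return the 2 * ndim neighbors of a node, wrapping around each dimension."""
        result = []
        for dim, size in enumerate(shape):
            for step in (-1, +1):
                nbr = list(coord)
                nbr[dim] = (nbr[dim] + step) % size   # wraparound link
                result.append(tuple(nbr))
        return result

    print(neighbors((0, 0, 3)))   # the +1 step in z wraps back around to z=0
    assert all(len(neighbors(c)) == 6 for c in product(*map(range, SHAPE)))

The wraparound links are what keep the worst-case hop count down without needing a central switch.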
I think a central element of the article is its focus on the needs of genomics researchers. In this field there are many tasks that are highly parallel but not embarrassingly parallel. Ideally you want to scale out to as many processes as possible, but some basic IPC is mandatory.
Though not in the HPC field myself, I've been in a position of helping researchers working with an HPC cluster. On that hardware something like 16 cores per node was typical, so you run out of scaling room quickly with traditional threading libraries. And you really want to scale beyond this: jobs could otherwise take weeks or months. That means using MPI, because that's all the HPC cluster supported. This was a massive pain, because MPI is complicated in ways that are unfamiliar to many researchers, most of whom have no low-level language experience at all.
What the article calls for are HPC techniques that move communication down to a lower-level API rather than exposing it to the programmer / end user. I think that's exactly the right idea for genomics.
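To make the pain point concrete, here is a minimal mpi4py sketch of the boilerplate a researcher ends up writing just to fan work out and pull results back; the per-chunk "analysis" is a placeholder, and this is exactly the layer that could be hidden below a friendlier API:

    # Typical scatter/compute/gather plumbing with mpi4py.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == 0:
        samples = list(range(1000))                       # hypothetical work items
        chunks = [samples[i::size] for i in range(size)]  # one chunk per rank
    else:
        chunks = None

    chunk = comm.scatter(chunks, root=0)    # hand each rank its share
    local = [x * x for x in chunk]          # stand-in for the real analysis
    results = comm.gather(local, root=0)    # collect partial results on rank 0

    if rank == 0:
        flat = [y for part in results for y in part]
        print(f"processed {len(flat)} items across {size} ranks")

Run with something like "mpiexec -n 16 python analyze.py". None of this is genomics-specific, which is rather the point.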
Disclaimer: I'm neither an expert in HPC nor genomics.
LLM training (at least for the runs we have research details on) synchronizes gradients (or updated weights) across all workers every step, so it has very high networking and latency demands. That's why every cloud vendor competes on interconnect speeds for GPU machines.
So they are basically classic supercomputing workloads.
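As a sketch of that per-step sync, expressed here with MPI rather than NCCL or a proprietary interconnect (the gradient array is hypothetical, but the collective has the same shape):

    # Data-parallel gradient averaging: every rank contributes its gradients
    # and receives the sum, once per optimizer step.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    grads = np.random.randn(1_000_000).astype(np.float32)  # this rank's gradients
    synced = np.empty_like(grads)

    comm.Allreduce(grads, synced, op=MPI.SUM)
    synced /= comm.Get_size()   # average used for the next optimizer step

Because this sits on the critical path of every step, interconnect latency and bandwidth directly bound training throughput.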
I don't see a lot of traditional "HPC engineers" in the ml infrastructure space though.
I do wonder if, in hindsight, using a Slurm cluster for scheduling, Lustre for data, and MPI for any connectivity that's not covered by NCCL would have been better than trying to make object storage, gRPC, Kubernetes, Ray, etc. work.
“HPC engineering” and “Optimizing LLM training” are similar. It’s about profiling, performance modeling, finding bottlenecks, rewriting what’s needed, etc. Lots of overlap, and people from more traditional HPC doing it too. Obviously if you’re a 50yo university professor doing airplane simulations you won’t switch to LLMs today though…
Some mixture of MPI, NCCL, Gloo, and whatever proprietary stuff TPU clusters do. All of these are basically either trad-HPC in style or literally from the supercomputing community. Interconnects tend to be InfiniBand or the like, which, again, is straight out of big iron.
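In practice that mixture usually shows up behind a single API. For example, torch.distributed lets you pick nccl (GPU), gloo (CPU), or mpi as the backend while the collective calls stay the same; a minimal sketch, assuming launch via torchrun:

    # Same all_reduce call regardless of which backend does the transport.
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")

    t = torch.ones(4)
    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
        t = t.cuda()

    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # sums across all ranks in place
    print(f"rank {dist.get_rank()}: {t.tolist()}")
    dist.destroy_process_group()

The semantics are the familiar MPI collectives; only the transport underneath changes.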
>> HPC is a small, small fraction of the job market; the number of folks involved in HPC is tiny compared to all the click-log-analyzing Spark and Hadoop programmers.
You're basically saying HPC = MPI users, which is really dismissive of a whole bunch of other people making use of vast compute resources.
I for one can't wait for these masses to convince chip makers that IEEE 754 floating point really has a lot of crap that we don't need.
A lot of MPI's (ab)use in HPC boils down to distributed task management in lieu of a work-queue system available to users. People have embarrassingly parallel jobs but need to coordinate the task management, because many HPC centers don't provide resources for a long-lived service to execute near the cluster (or even general connectivity outside the cluster).
The problem is that you do have to support true parallel MPI jobs in those shared clusters though, so MPI just becomes the hammer for everyone else.
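The (ab)use pattern the parent describes tends to look like this: rank 0 standing in for the work-queue service the center doesn't provide. A rough mpi4py sketch (the tasks are placeholders; run with at least two ranks):

    # MPI pressed into service as a task farm: rank 0 hands out work on demand.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    TASK, STOP = 1, 2   # message tags

    if rank == 0:                                  # dispatcher
        tasks = list(range(100))                   # hypothetical independent jobs
        workers = comm.Get_size() - 1
        status = MPI.Status()
        while tasks or workers:
            comm.recv(source=MPI.ANY_SOURCE, status=status)   # a worker asks for work
            if tasks:
                comm.send(tasks.pop(), dest=status.Get_source(), tag=TASK)
            else:
                comm.send(None, dest=status.Get_source(), tag=STOP)
                workers -= 1
    else:                                          # worker
        status = MPI.Status()
        while True:
            comm.send(None, dest=0)                # request a task
            task = comm.recv(source=0, status=status)
            if status.Get_tag() == STOP:
                break
            _ = task * task                        # stand-in for the real job

All of that is scheduling logic, not science, which is why a proper queueing service (or a scheduler that provides one) is the better home for it.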
Managing the resources a level higher (all resources live in a k8s cluster, with Slurm under k8s) seems to be the best way to really accommodate both types of loads, but most HPC centers are far from implementing that.
(I think that presentation has some misconceptions about k8s, since most k8s clusters are elastic up to a max size, and it sounds like they really want to control most scheduling, but it gives an overview of merging the systems.)
Roughly 20 years ago, the Condor high-throughput computing system gained "glide-ins" to do this sort of repurposing. Before that, Condor was mostly persistent runners on desktop fleets. After, you would submit a batch job to an HPC cluster and for the duration of the job, those HPC nodes became additional runners for an existing Condor scheduler.
Around that time, there was also a period of reservation-based "advanced scheduling" where HPC centers were flirting with making future scheduling promises. The right way to think of these would be like guarantees to get bare metal machine capacity during a certain wall-clock period. In my opinion, the commercial pre-cloud/cloud/virtualization stuff then infected everyone and regressed to time-sharing with fuzzy QoS and lots of over-subscription and dynamic rescheduling.
Of course, these different approaches will all be isomorphic in the end if they explore the full space of application requirements. The traditional paths were just approaching from very different economic priorities. The IaaS folks are incrementally adding more QoS and pricing options which could eventually provide HPC IaaS if carried to full fruition. I.e. future guarantees of significant hardware resources. But as far as I know, those are still in the realm of "talk to a sales rep" and not some automated IaaS request flow at this point.
During the years I was active in grid computing (before cloud computing became huge) Miron Livny (creator of Condor) would basically attend every talk and explain how "condor already does this, why are you reinventing the wheel?"
Hah, yes. And Oracle reps used to say the grid was inside their cluster. A lot of folks did not appreciate the decentralization of the grid. The grid computing concepts were about federation of disparate organizations and their resources, not about how sprawling of a system a single vendor or HPC center could build.
The PITA part of Condor/grid was software management before containers. Sure, everyone was running at least RHEL 4/5/6 (or SL 4/5/6), in many cases AFS worked, and the more advanced operators were adding VM execution, but it was (and still is) annoying to deal with. Most annoying right now is that nobody can agree on a container runtime: there's Docker/Shifter, Singularity, Charliecloud. It should all just be Podman now, but everybody is still holding on.
(I have worked with Slurm, condor, Torque/PBS, gridEngine, DIRAC, and LSF)
Plugging a different scheduler into k8s might be an interesting way of solving this; when I last looked there seemed to be a lot of work on scheduler plugins. Some of the issues are similar in the cloud too, e.g. co-scheduling by latency.
There's at least some incentive not to solve this; I remember k8s on Mesos being popular, and of course we know how that played out.
I believe the terminology is/was "advance reservations" (as in reservations of resources such as CPU, memory, disk space, I/O, and network bandwidth, made in advance on clusters otherwise freely available to ad-hoc jobs) rather than "advanced" anything, or at least it was with the Torque scheduler I reviewed for a clickstream-analysis customer project.
With regards to Spark, consider whether you could possibly do the processing on a single machine. For many workloads, a single-threaded program running on a laptop will beat an entire Spark cluster.
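For example, a group-by over a few tens of GB is often just a streaming pass with the standard library (the file name and columns below are made up):

    # Single-machine, single-threaded streaming aggregation over a large CSV.
    import csv
    from collections import Counter

    counts = Counter()
    total = 0.0

    with open("clicks.csv", newline="") as f:       # hypothetical input file
        for row in csv.DictReader(f):
            counts[row["country"]] += 1             # group-by-key
            total += float(row["revenue"])          # running aggregate

    print(counts.most_common(10), total)

No cluster, no shuffle, and the bottleneck is usually just disk read speed.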
This was a great point to make at the time, when people thought “my data has exceeded Excel’s row limit, therefore I should set up a Hadoop cluster and run Spark jobs against it”.
Since then … it’s become a bit of a meme, unfortunately. Definitely there still exist workloads assigned to Spark clusters that could run on a laptop, especially if the data happens to be there already. But the space as a whole provides immense value, both enabling jobs that really don’t fit on laptops, and moving the compute for laptop sized jobs to where the data happens to be.
The datasets they test against are 6 GB and 15 GB, and I get that those are the two that one of their references uses, but that's clearly not multi-node territory. Also, as they point out, graph computation is not trivially parallelized. Spark is more for doing long-running transformations on independent data in a fault-tolerant way.