A Comparison of Distributed Machine Learning Platforms (muratbuffalo.blogspot.com)
116 points by ipsum2 on July 31, 2017 | 12 comments


I would like to encourage everyone to read these kinds of "evaluations" with a bit of skepticism.

When doing evaluations there are a lot of nuances.

I will comment on a real world example from our own experience.

Disclaimer: I am biased; I run a company implementing its own platform. I will try to stick to topics folks should "look for" when evaluating these things rather than inviting a "my tool vs yours" debate.

For deep learning on spark specifically, there are 2 workloads: training and inference.

Your data pipeline here matters a lot. A job that pulls data from hive will, surprise, be bottlenecked by hive. By extension, the kind of data and small things like the number of records you fetch or cache on, say, hdfs matter a TON.

On our platform, we implement 2 versions of distributed communications: spark driver and parameter server. The parameter server is faster. The latter is what mxnet and tensorflow do as well. Tensorflow uses grpc underneath. mxnet writes its own.
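To make the contrast concrete, a driver-based pass looks roughly like this in pyspark (a simplified sketch, not our actual implementation; the gradient function and sizes are made up):

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext()
    dim, lr = 10, 0.1
    weights = np.zeros(dim)
    rng = np.random.default_rng(0)
    data_rdd = sc.parallelize(rng.standard_normal((1000, dim)).tolist())

    def compute_gradient(record):
        return np.asarray(record) * 0.5   # stand-in for a real per-record gradient

    grad_sum = data_rdd.treeAggregate(
        np.zeros(dim),
        lambda acc, rec: acc + compute_gradient(rec),
        lambda a, b: a + b,
        depth=2)
    weights -= lr * grad_sum / data_rdd.count()
    # every iteration funnels all gradients back through the driver;
    # a parameter server instead lets workers push/pull asynchronously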

While both spark and these frameworks are data flow graphs, the latter are optimized for linear algebra and, I would say, have very different requirements (for example gpus, which are another level of complexity).

When using these parameter servers, you should look at what guarantees these things provide. For example, do you need consistent latency? Fault tolerance? At what scale?

Something as simple as "do these parameter servers do p2p on each node?"

Spark isn't well designed for that. A lot of its workers are meant to be stateless.

With spark, another gotcha (especially on the linear algebra part) is on-heap memory. When you are using, say, pyspark, how you handle data movement matters a lot and, again, surprise, ends up being your bottleneck.

We use pyjnius instead in our python interface for linear algebra: https://github.com/deeplearning4j/jumpy

We implemented this because we liked the idea of being able to pass numpy pointers straight to the JVM rather than copying data. I believe pytorch and co (which are just straight C/C++) do this as well. That is the right way to do it.

The default is py4j (which overall is more versatile but not meant for DL)
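The numpy side of that idea looks like this (jumpy's real API differs; this just shows where the pointer comes from):

    import numpy as np

    arr = np.ascontiguousarray(np.random.rand(1000, 1000))
    address = arr.ctypes.data            # raw pointer into the numpy buffer
    meta = (address, arr.shape, str(arr.dtype))
    # hand `meta` to the JVM (via pyjnius) and wrap it in an off-heap
    # buffer there: no serialization, no copy. py4j would instead
    # serialize the whole array through a socket.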

Don't even get me started on a normal etl pipeline vs "images", which is typically what you see.

Another big complexity with, say, spark is gpu usage. When using mesos, you can only reserve a whole gpu, not fractions.

Another thing is worker setup. For deep learning specifically: what BLAS did they use? How was it compiled? Was it optimized for the right cpu architecture?

How do the workers communicate with the gpu, if any?

If you're using a JVM, how do they handle the FFI?

How are workers configured for cpu/gpu? Things like OMP_NUM_THREADS on each worker matter a lot.

If you're using spark, did you configure k small workers or 1 big one? How did that interact with your multi threaded linear algebra?
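Concretely, the two extremes look something like this (real spark config keys; the instance and thread counts are made up):

    from pyspark import SparkConf

    # one fat executor per node; give the multi-threaded BLAS all the cores
    conf_fat = (SparkConf()
                .set("spark.executor.instances", "4")
                .set("spark.executor.cores", "16")
                .set("spark.executorEnv.OMP_NUM_THREADS", "16"))

    # many small executors: pin BLAS to 1 thread each, otherwise every
    # executor spawns a full thread pool and oversubscribes the box
    conf_small = (SparkConf()
                  .set("spark.executor.instances", "64")
                  .set("spark.executor.cores", "1")
                  .set("spark.executorEnv.OMP_NUM_THREADS", "1"))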

Hopefully this gives people an idea of how complicated these things get.


Before reading the blog's summary of the paper, I thought: hey, they're going to analyze the distributed algorithms [1][2] exclusively and give conclusions on how heterogeneous hardware usage of GPU/CPU works (best) when spread over the network/cloud. My guess was that MXNet would win, because Amazon [3][4] and benchmarks favored it [5]. Unfortunately, surveys like this are too complex and out of date when released, due to the fast progress in our field.

>> @agibsonccc What exactly would YOU, your company and your team like to see reviewed and what would you make different?

A standard benchmark, like the open-source automated benchmarks provided by the Phoronix Test Suite [6] or the benchmarks game [7], is what we need. Everybody should be able to contribute easily, or even anonymously, as with Steam's hardware survey [8].

The main problem with distributing operations with Caffe, for example, is the need to learn another system for a different programming flavor.

-- References

[1] https://en.wikipedia.org/wiki/Distributed_algorithm

[2] https://en.wikipedia.org/wiki/Category:Distributed_algorithm...

[3] http://www.infoworld.com/article/3144025/cloud-computing/why...

[4] http://www.allthingsdistributed.com/2016/11/mxnet-default-fr...

[5] See Page 52 - http://alex.smola.org/talks/NIPS15.pdf

[6] https://www.phoronix-test-suite.com/

[7] http://benchmarksgame.alioth.debian.org/

[8] http://store.steampowered.com/hwsurvey/


Yes indeed! Thanks for the follow-up. I just wanted to give folks a survey of the things they should look at when evaluating charts like these that claim to show how fast something is.

The casual reader won't know to look for these things (it's actually not their fault, a lot of this knowledge is very field specific)

I've made comparisons to deep learning flow graphs and spark's computation graphs in the past, but there are very different considerations and use cases for both.

I think what I was commenting on was more of the minute details that get lost in summaries like this.

If anything, this is a pitch for an end-to-end specialized system that understands your data pipeline all the way down to your linear algebra calculations. That way all of these variables are taken care of.

A simple example of where this matters a ton is just the ETL into a neural network.

You're loading data from some disk (that may or may not be networked).

What's the input format of that data? What's the block size?

Say it's image blobs stored on hdfs. You know ahead of time your gpu can fit an optimal batch size of 32 into memory because your image resolution is high. You then need to go back and prepare the data for loading. To get optimal performance, you need the data loading to be aligned with the hdfs block size (otherwise you cause an additional hdfs seek) and the data saved in whatever your favorite vector format is. Then, to evaluate how fast a "minibatch" will be processed, you need to figure out how fast your architecture is in addition to how the communication with the gpu is handled.
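Back-of-the-envelope version of that alignment (hypothetical sizes, just to show the arithmetic):

    block_size   = 128 * 1024 * 1024      # default hdfs block: 128 MB
    record_bytes = 3 * 224 * 224 * 4      # one float32 image, ~0.6 MB
    batch_size   = 32

    records_per_block = block_size // record_bytes       # 222
    full_batches      = records_per_block // batch_size  # 6
    # pack 6 * 32 = 192 records per block so no minibatch straddles
    # a block boundary (each straddle costs an extra hdfs seek)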

I think overall this is just a hairy subject to cover in a small blog post.

While these "programming flavors" are very similar, they are optimized differently in practice.

If you think about spark, the "nodes" are more general purpose operations with a ton of JVM based code gen happening.

It knows about different "edges/ops" in addition to having a registered schema. It's lazily initialized with a DAG built with code gen after the declarations are done.
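You can watch that laziness and plan-building directly (`spark` here is a SparkSession; the path is hypothetical):

    df  = spark.read.parquet("hdfs:///some/table")
    out = df.groupBy("key").count()     # nothing has executed yet
    out.explain(True)   # prints the parsed/analyzed/optimized plans
                        # and the physical plan spark will run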

A "tensor graph" doesn't have types beyond knowing what device it's running on and perhaps a data type like float/double/half.


Do you see tensorflow extending its reach into some of the stuff that spark is doing today? For example, MLlib and GraphX?

The problem that I see is that Spark (especially pyspark) makes it very easy to reason about dataframes in a distributed way. The primitives let you think about it.

I have a hard time doing the same in Tensorflow. Maybe it was not built for the generic computation stuff that Spark does - but it seems so close.


Not really. It's just different graph structures and optimizations.

Spark in general just isn't well suited for tensorflow graphs.

You're hitting on exactly what I'm talking about. It's very different use cases.

1 is tensors, the other is data types with schemas.

Tensors are also n-dimensional. Schemas are inherently 2d. The ops are also different. What makes spark's op graph good is very different from what makes tensorflow's good.

Tensorflow in general tends to lean on pandas and the like. They are sorta trying to do that with pipelines, but at the end of the day, I think it would overextend tensorflow's scope to "care" about anything outside of math.

"ETL" and dataframes is already pretty well served.

If you want something that does this well, look at arrow which is a great communications layer. It even has a tensor type now.
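For example, pyarrow already exposes it:

    import numpy as np
    import pyarrow as pa

    arr    = np.random.rand(32, 3, 224, 224).astype(np.float32)
    tensor = pa.Tensor.from_numpy(arr)   # zero-copy wrap, n-dimensional
    # arrow can then hand this buffer to another process or language
    # without re-serializing it into a row/schema format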

Disclaimer: Biased rant below. I compete with databricks, google cloud, tensorflow, mxnet and a lot of the spark ecosystem.

The other problem is... pyspark itself is just bad for linear algebra. Period. You need specialized communications for tensors because the data often lives elsewhere and is in a blob-type format. To get benefits from simd and the like you shouldn't be adding networked communications on top of this.

That's why we made jumpy for our library. We interop directly at the pointer level and then use our parameter server and gpu stack.

Using arrow I think mapd is attempting to do this as well.

I don't see spark ever becoming good at linear algebra anytime soon despite what the folks at berkeley are pushing.

As for "reasoning about data frames" spark underneath basically turns your dataframe declarations into a DAG. It does this in a very different way than tensorflow's XLA though.

XLA's code gen is focused on math ops. Schemas are a whole different animal. The great thing about spark is it already has "typed rdds" like the dataframes and datasets it uses internally. It gets to rely on the JVM's type system and JIT for this.

Whereas with math, if you constrain the domain and can make assumptions, you can do codegen assuming that all you're doing is "for loops", which generalizes well for cuda and for cpu with simd/AVX etc.

You can't make the assumption that everything is a "for loop" as much with spark.
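A toy picture of what that "everything is a for loop" assumption buys (not XLA itself, just the shape of the code such a generator emits):

    import numpy as np

    n = 1024
    x, w, b = (np.random.rand(n).astype(np.float32) for _ in range(3))

    # unfused: three passes over memory, two temporaries
    y1 = np.maximum(x * w + b, 0)

    # fused: the single loop an XLA-style code generator would emit as
    # one kernel (vectorized with simd/AVX on cpu, one launch on gpu)
    y2 = np.empty_like(x)
    for i in range(n):
        y2[i] = max(0.0, x[i] * w[i] + b[i])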

We write and maintain datavec: https://github.com/deeplearning4j/DataVec

and nd4j: https://github.com/deeplearning4j/nd4j

which are integrated together and have a "tensor data type" which uses nd4j directly. This is very similar to what arrow does.

We define a "transform process" which you could say: do ffts with.

That kind of ETL isn't supported with dataframes directly without a lot of hacks.

TLDR: TF will mainly stick to linear algebra


That's interesting - but you mentioned that TF does a better job than spark at parameter servers (grpc, etc.)

So instead of pushing spark towards linear algebra, isn't it easier to push TF towards schemas? Or are the architectures so different that they will live separate lives?


TF is mainly "for loops". TF is also missing a whole schema setup.

TF is mainly a linear algebra library. I don't see google moving TF towards "schemas".

Spark (and flink and other competitors to it) are general purpose execution engines.

I think what's missing here is the target execution paradigms are different.

If you are curious about these things, I would look at how database internals are implemented. Those are a mix of "simd" and "schemas".

Generally you have to make storage format assumptions among other things.

One of the biggest advantages of spark and flink are "partitions". The DAG engine for spark has this out of the box. Partitions are a lot more generalizable than say: "model parallelism on devices" which is only for floats anyways.

There's also a usability factor here. Spark's graphs are a lot more expressive/extensible.

It's like I mentioned earlier with py4j vs pyjnius/pointers.

1 is more expressive/general purpose but has overhead. TF is specialized.

So if we kill the "TF cuz hype" discussion here and focus on "linear algebra", extending this discussion to pytorch, our library, caffe, etc., there's just not much incentive to "put lipstick on a pig" so to speak.

Allowing writing arbitrary code with a type system is by definition not going to help any of these "computation graph linear algebra" libraries.


Depends on how you implement the Parameter Server. You could, for example, fork a Python process and implement your own parameter server. No Spark required.
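A toy single-machine version of that (hypothetical sketch: no fault tolerance, no network, fake gradients; assumes the fork start method, the Linux default):

    import numpy as np
    from multiprocessing import Process, Lock
    from multiprocessing.sharedctypes import RawArray

    DIM   = 1000
    store = RawArray('f', DIM)   # shared parameter vector (float32)
    lock  = Lock()

    def worker(wid):
        rng = np.random.default_rng(wid)
        for _ in range(100):
            grad = rng.standard_normal(DIM).astype(np.float32)  # fake gradient
            with lock:                                          # "push"
                params = np.frombuffer(store, dtype=np.float32)
                params -= 0.01 * grad
            # a "pull" is just reading the shared buffer

    if __name__ == '__main__':
        procs = [Process(target=worker, args=(i,)) for i in range(4)]
        for p in procs: p.start()
        for p in procs: p.join()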


Why would you pull data from Hive if you are running Spark on HDFS?


Legacy data sources. People tend to have a mix of technologies. You can use spark sql to do this: https://spark.apache.org/docs/latest/sql-programming-guide.h...

Spark can also access hdfs. You are typically mixing data sources as well.
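E.g. mixing both in one job (the table and path are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .enableHiveSupport()     # talk to the hive metastore
             .getOrCreate())

    legacy = spark.sql("SELECT * FROM warehouse.events")  # hive
    fresh  = spark.read.parquet("hdfs:///data/events/")   # raw hdfs
    df     = legacy.union(fresh)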


To be honest, I think the blog post lacks a lot of details and gives only a - very - basic overview of the field (or I'm missing the point). I've been working on distributed optimization systems for over a year now [5], and the field has evolved drastically in this time. The most profound contribution is from [1] imho, where they show that asynchronous optimization can be defined in terms of implicit momentum.

Nevertheless, during my research I found that local work matters a lot! Especially when you have large communication constraints. I addressed this in my master's thesis [2], where I introduce 2 novel distributed optimization techniques based on a redefinition of parameter staleness. In this redefinition, I define staleness in terms of the distance (or difference) between two parameterizations. In a sense, this allows you to automatically tune the amount of work done by a worker. For instance, if multiple workers compute gradients based on older parameterizations which were close to each other, they do not inflict negative work. Imagine it as follows: at the start of an optimization process, you have relatively large gradients compared to when you are close to an optimum. As a result, having a lot of asynchronous workers at the start of the optimization process (using DOWNPOUR) can really hurt convergence, and can even cause divergence (for GIFs see [3] and [4]).

To counter this, I introduce AGN (Accumulated Gradient Normalization) as a mechanism to compute a better single gradient update based on increased local exploration. Basically, you compute a sequence of first-order gradients, and then normalize them by the number of "exploration steps" to produce a better parameter server update (this also reduces the magnitude of the vector, reducing the amount of staleness injected in the system; see Chapter 3 of my thesis).
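Roughly, the worker side looks like this (a simplified sketch of the idea; see the thesis for the exact formulation):

    import numpy as np

    def agn_update(theta, batches, lr, steps, gradient):
        """One AGN round on a worker: explore locally, send back a
        normalized update."""
        theta_local = theta.copy()
        accumulated = np.zeros_like(theta)
        for batch in batches[:steps]:
            g = gradient(theta_local, batch)   # first-order gradient
            theta_local -= lr * g              # local exploration step
            accumulated += g
        # normalizing by the number of exploration steps shrinks the
        # update's magnitude, so less staleness enters the system
        return accumulated / steps             # sent to the parameter server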

Nevertheless, divergent behavior can still be observed as the number of asynchronous workers increases. This can be countered by constructing a per-weight learning rate based on the difference between the parameterization of the central variable (parameterization held by the parameter server), and the - current - parameterization of the worker (see Chapter 4). Intuitively, this mechanism will nullify the contributions of workers which are "too far" from the current central variable.
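One plausible instantiation of that per-weight scaling (the thesis has the exact form; this only shows the shape of it):

    import numpy as np

    def distance_scaled_update(theta_central, theta_worker, grad, lr):
        # per-weight distance between the central variable and the
        # (possibly stale) parameterization the worker started from
        distance = np.abs(theta_central - theta_worker)
        # shrink the step where the worker is far behind; updates from
        # "too far" workers tend toward zero
        return theta_central - (lr / (1.0 + distance)) * grad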

[1] http://stanford.edu/~imit/tuneyourmomentum/theory/ (paper in blog post)

[2] https://github.com/JoeriHermans/master-thesis

[3] Animation, convergence: http://joerihermans.com/media/blog/ddl/downpour.gif

[4] Animation, divergence: http://joerihermans.com/media/blog/ddl/downpour_momentum.gif

[5] https://github.com/cerndb/dist-keras


I'm not convinced that SGD is the answer for large scale distributed problems. It is certainly the best answer for single node applications, but it might not be the best for distributed training. The search for effective optimization algorithms in machine learning is ongoing.

The big problems for SGD at massively parallel scale:

* SGD does not parallelize well

* SGD is a first order method affected by ill-conditioning

* SGD is too noisy when high accuracy is desired



