Hacker News: JoeriHer's comments

Depends on how you implement the Parameter Server. You could, for example, fork a Python process and implement your own parameter server. No Spark required.


To be honest, I think the blog post lacks a lot of details and gives only a very basic overview of the field (or I'm missing the point). I've been working on distributed optimization systems for over a year now [5], and the field has evolved drastically in that time. The most profound contribution is [1] imho, where the authors show that asynchronous optimization can be described in terms of implicit momentum.

Nevertheless, during my research I found that local work matters a lot, especially under tight communication constraints. I address this in my master's thesis [2], where I introduce two novel distributed optimization techniques based on a redefinition of parameter staleness: staleness is defined in terms of the distance (or difference) between two parameterizations. In a sense, this lets you automatically tune the amount of work a worker does. For instance, if multiple workers compute gradients based on older parameterizations that were close to each other, they do not contribute negative work. Think of it as follows: at the start of an optimization process, gradients are relatively large compared to when you are close to an optimum. As a result, having many asynchronous workers at the start (using DOWNPOUR) can really hurt convergence, and can even cause divergence (for animations, see [3] and [4]).
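The redefinition of staleness described above can be sketched in a few lines. This is a hypothetical illustration of the idea (distance between the parameterization a gradient was computed against and the current central variable), not the exact formulation from the thesis:

```python
import numpy as np

def distance_staleness(theta_grad_basis, theta_central):
    # Staleness as the Euclidean distance between the parameterization a
    # worker computed its gradient against and the current central variable.
    return np.linalg.norm(theta_central - theta_grad_basis)

# Early in training: updates are large, so the central variable drifts fast
# and even a one-step-old parameterization is far away.
early = distance_staleness(np.array([0.0, 0.0]), np.array([1.0, 1.0]))

# Near an optimum: updates are small, so "old" parameters are still close
# and the same wall-clock delay injects far less staleness.
late = distance_staleness(np.array([5.0, 5.0]), np.array([5.01, 5.01]))

print(early > late)  # True
```

The point is that a distance-based measure captures what a step-count-based measure misses: the same number of stale steps hurts much more early in training than near convergence.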

To counter this, I introduce AGN (Accumulated Gradient Normalization) as a mechanism to compute a better single gradient update based on increased local exploration. Basically, you compute a sequence of first-order gradients and then normalize the accumulated result by the number of "exploration steps" to produce a better parameter-server update. This also reduces the magnitude of the update vector, which reduces the amount of staleness injected into the system (see Chapter 3 of my thesis).
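A sketch of that accumulate-then-normalize step, under my own simplifying assumptions (a toy quadratic objective; `agn_update` is an illustrative name, and the thesis version may differ in details):

```python
import numpy as np

def agn_update(theta, grad_fn, batches, eta):
    # Take several local first-order steps ("exploration steps"), then
    # normalize the accumulated displacement by the number of steps.
    theta_local = theta.copy()
    for batch in batches:
        theta_local -= eta * grad_fn(theta_local, batch)
    return (theta_local - theta) / len(batches)

# Toy objective per "batch": f(theta) = (theta - target)^2.
grad_fn = lambda theta, target: 2.0 * (theta - target)

# Four local exploration steps toward target 0, starting from theta = 4.
update = agn_update(np.array([4.0]), grad_fn, [np.array([0.0])] * 4, eta=0.1)
print(update)  # [-0.5904]
```

Dividing by the number of steps is what keeps the update's magnitude comparable to a single gradient step, so a worker that explored a lot locally does not inject a proportionally larger (and staler) jump at the parameter server.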

Nevertheless, divergent behavior can still be observed as the number of asynchronous workers increases. This can be countered by constructing a per-weight learning rate based on the difference between the parameterization of the central variable (the parameterization held by the parameter server) and the current parameterization of the worker (see Chapter 4). Intuitively, this mechanism nullifies the contributions of workers that are "too far" from the current central variable.
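One way that per-weight scaling could be realized is an exponential decay in the per-weight distance. To be clear, this particular decay rule is my own hypothetical sketch of the idea, not the rule from Chapter 4:

```python
import numpy as np

def per_weight_lr(eta, theta_central, theta_worker):
    # Shrink each weight's learning rate as that weight drifts further from
    # the central variable; contributions from far-away workers are
    # effectively nullified, weight by weight.
    return eta * np.exp(-np.abs(theta_central - theta_worker))

# A worker close to the central variable keeps (almost) the full rate.
close = per_weight_lr(0.01, np.array([1.0, 1.0]), np.array([1.0, 1.01]))

# A worker far away on one weight has that weight's rate driven toward zero.
far = per_weight_lr(0.01, np.array([1.0, 1.0]), np.array([1.0, 9.0]))
```

Because the scaling is per weight, a worker is only penalized on the coordinates where it has actually drifted, rather than having its whole update rejected.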

[1] http://stanford.edu/~imit/tuneyourmomentum/theory/ (paper linked in the blog post)

[2] Master's thesis: https://github.com/JoeriHermans/master-thesis

[3] Convergence animation: http://joerihermans.com/media/blog/ddl/downpour.gif

[4] Divergence animation: http://joerihermans.com/media/blog/ddl/downpour_momentum.gif

[5] https://github.com/cerndb/dist-keras


I'm not convinced that SGD is the answer for large-scale distributed problems. It is certainly the best answer for single-node applications, but it might not be the best for distributed training. The search for effective optimization algorithms in machine learning is ongoing.

The big problems for SGD at massively parallel scale:

* SGD does not parallelize well

* SGD is a first order method affected by ill-conditioning

* SGD is too noisy when high accuracy is desired



