I'm interested in how they 'hibernate'/save the state of the instances within the shutdown time limit. I was looking into this for myself too; there are ways of using Docker to save in-process memory, à la hibernate, which would work well with this. But especially for GCP, where you only get ~60 seconds between the shutdown signal and the hard stop, I was worried it wouldn't save fast enough. I often work on pretty high-RAM instances and figured even dumping RAM to disk would take too long for 150-300 GB of RAM.
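Rough back-of-envelope for why the 60-second window feels tight (the throughput numbers below are my own assumptions, not measured on any particular instance type):

```python
# Assumed sustained write throughputs -- placeholders, not benchmarks.
ram_gb = 300  # worst-case in-memory state to dump

throughputs_gb_per_s = {
    "local NVMe SSD (~2 GB/s)": 2.0,
    "fast gen4 NVMe (~5 GB/s)": 5.0,
    "network-attached disk (~0.5 GB/s)": 0.5,
}

for name, gbps in throughputs_gb_per_s.items():
    seconds = ram_gb / gbps
    fits = "fits" if seconds <= 60 else "does NOT fit"
    print(f"{name}: ~{seconds:.0f}s to dump {ram_gb} GB -> {fits} in a 60s window")
```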
I hadn't heard of Nimbo; maybe I can read how they're doing it since it's open source. Does anyone have any idea how they're saving state so fast (NVMe SSD disk)?
It uses a mounted EBS (Elastic Block Store) volume, so all the checkpoints, data, etc. are already in persistent storage. The volume is simply re-attached to the next spot/on-demand instance after an interruption.
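Not Nimbo's actual code, just a minimal sketch of what "re-attach the EBS volume to the next instance" looks like with boto3; the volume ID, instance ID, and device name are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"       # hypothetical persistent volume holding checkpoints/data
NEW_INSTANCE_ID = "i-0123456789abcdef0"   # hypothetical replacement spot/on-demand instance

# Wait for the volume to come free after the interrupted instance terminates...
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# ...then attach it to the new instance (it still has to be mounted inside the instance).
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NEW_INSTANCE_ID, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])
```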
Edit: Also no GPU support AFAIK, but https://github.com/twosigma/fastfreeze looks really nice, turnkey. I wonder if writing to a fast persistent disk would let me checkpoint more RAM than going over the network.
(Or, hacking on the checkpoint idea: have a daemon periodically 'checkpoint' other programs, so even if a final save is too slow for the 60-second window, you just revert to the last checkpoint. Maybe even an rsync-like approach that only sends the changes. Something like the sketch below.)
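Roughly what I mean by the periodic-checkpoint daemon; a sketch that shells out to CRIU (assuming CRIU is installed, the target PID is checkpointable, and you have root), writing each dump to a timestamped directory on a persistent mount:

```python
import subprocess
import time
from pathlib import Path

TARGET_PID = 12345                                       # hypothetical PID of the long-running job
CHECKPOINT_ROOT = Path("/mnt/persistent/checkpoints")    # hypothetical persistent-disk mount
INTERVAL_S = 300                                         # checkpoint every 5 minutes

def checkpoint_once(pid: int, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Dump the process tree but leave it running so the job keeps making progress.
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", str(out_dir),
         "--leave-running", "--shell-job"],
        check=True,
    )

while True:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    checkpoint_once(TARGET_PID, CHECKPOINT_ROOT / stamp)
    time.sleep(INTERVAL_S)
```

On restore you'd point `criu restore` at the newest directory. I believe CRIU also has incremental dump support, which would map onto the rsync idea, but I haven't tried it.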
Oh, I didn't see much in Nimbo at a quick glance, but reading more closely:
> We immediately resume training after interruptions, using the last model checkpoint via persistent EBS volume.
Makes sense, just save checkpoints to disk. What I'm doing is more CPU-bound and not straight ML, so it's less easily checkpointed, sadly. Cool though, it's worth jumping through hoops for a 70% reduction.
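For the ML case the pattern is pretty simple; a minimal sketch (the paths and the training step are made up) of saving periodically to the EBS mount and resuming from the newest checkpoint on restart:

```python
import pickle
from pathlib import Path

CKPT_DIR = Path("/mnt/ebs/checkpoints")   # hypothetical persistent EBS mount point
CKPT_DIR.mkdir(parents=True, exist_ok=True)

def latest_checkpoint():
    ckpts = sorted(CKPT_DIR.glob("step-*.pkl"))
    return ckpts[-1] if ckpts else None

# Resume from the newest checkpoint if one exists, otherwise start fresh.
ckpt = latest_checkpoint()
state = pickle.loads(ckpt.read_bytes()) if ckpt else {"step": 0, "model": None}

while state["step"] < 10_000:
    state["step"] += 1
    # ... one training step updating state["model"] goes here ...

    if state["step"] % 500 == 0:
        # Write to a temp file then rename, so an interruption never leaves a half-written checkpoint.
        tmp = CKPT_DIR / "tmp.pkl"
        tmp.write_bytes(pickle.dumps(state))
        tmp.rename(CKPT_DIR / f"step-{state['step']:08d}.pkl")
```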