I'm interested in how they 'hibernate'/save the state of the instances within the shutdown time limit. I was looking into this for myself too; there are ways of using Docker to save in-process memory, à la hibernate, which would work well with this. But especially for GCP, where you only get ~60 seconds between the shutdown signal and the hard stop, I was worried it wouldn't save fast enough. I often work on pretty high-RAM instances and figured even dumping RAM to disk would take too long for 150-300 GB of RAM.
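Rough back-of-envelope for why the 60-second window feels tight (the throughput numbers below are my own assumptions, not measured on any particular instance type):

```python
# Assumed sustained write throughputs -- placeholders, not benchmarks.
ram_gb = 300  # worst-case in-memory state to dump

throughputs_gb_per_s = {
    "local NVMe SSD (~2 GB/s)": 2.0,
    "fast gen4 NVMe (~5 GB/s)": 5.0,
    "network-attached disk (~0.5 GB/s)": 0.5,
}

for name, gbps in throughputs_gb_per_s.items():
    seconds = ram_gb / gbps
    fits = "fits" if seconds <= 60 else "does NOT fit"
    print(f"{name}: ~{seconds:.0f}s to dump {ram_gb} GB -> {fits} in a 60s window")
```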
I hadn't heard of Nimbo; maybe I can read how they're doing it since it's open source. Does anyone have any idea how they're saving state so fast (NVMe SSD disk)?
It uses a mounted EBS (Elastic Block Store) volume, so all the checkpoints, data, etc. are already in persistent storage. The volume is simply re-attached to the next spot/on-demand instance after an interruption.
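Not Nimbo's actual code, just a minimal sketch of what "re-attach the EBS volume to the next instance" looks like with boto3; the volume ID, instance ID, and device name are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"       # hypothetical persistent volume holding checkpoints/data
NEW_INSTANCE_ID = "i-0123456789abcdef0"   # hypothetical replacement spot/on-demand instance

# Wait for the volume to come free after the interrupted instance terminates...
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# ...then attach it to the new instance (it still has to be mounted inside the instance).
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NEW_INSTANCE_ID, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])
```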
Edit: Also no GPU support AFAIK, but https://github.com/twosigma/fastfreeze looks really nice, turnkey. I wonder if writing to a fast persistent disk would let me checkpoint more RAM than going over the network.
(Or, hacking on the checkpoint idea: have a daemon periodically 'checkpoint' other programs, so even if a final save is too slow for the 60-second window, you just revert to the last checkpoint. Maybe even an rsync-like approach that only sends the changes. Something like the sketch below.)
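Roughly what I mean by the periodic-checkpoint daemon; a sketch that shells out to CRIU (assuming CRIU is installed, the target PID is checkpointable, and you have root), writing each dump to a timestamped directory on a persistent mount:

```python
import subprocess
import time
from pathlib import Path

TARGET_PID = 12345                                       # hypothetical PID of the long-running job
CHECKPOINT_ROOT = Path("/mnt/persistent/checkpoints")    # hypothetical persistent-disk mount
INTERVAL_S = 300                                         # checkpoint every 5 minutes

def checkpoint_once(pid: int, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Dump the process tree but leave it running so the job keeps making progress.
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", str(out_dir),
         "--leave-running", "--shell-job"],
        check=True,
    )

while True:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    checkpoint_once(TARGET_PID, CHECKPOINT_ROOT / stamp)
    time.sleep(INTERVAL_S)
```

On restore you'd point `criu restore` at the newest directory. I believe CRIU also has incremental dump support, which would map onto the rsync idea, but I haven't tried it.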
Oh, I didn't see much in Nimbo at a quick glance, but reading more closely:
> We immediately resume training after interruptions, using the last model checkpoint via persistent EBS volume.
Makes sense, just save checkpoints to disk. What I'm doing is more CPU-bound and not straight ML, so it's less easily checkpointed, sadly. Cool though, it's worth jumping through hoops for a 70% reduction.
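For the ML case the pattern is pretty simple; a minimal sketch (the paths and the training step are made up) of saving periodically to the EBS mount and resuming from the newest checkpoint on restart:

```python
import pickle
from pathlib import Path

CKPT_DIR = Path("/mnt/ebs/checkpoints")   # hypothetical persistent EBS mount point
CKPT_DIR.mkdir(parents=True, exist_ok=True)

def latest_checkpoint():
    ckpts = sorted(CKPT_DIR.glob("step-*.pkl"))
    return ckpts[-1] if ckpts else None

# Resume from the newest checkpoint if one exists, otherwise start fresh.
ckpt = latest_checkpoint()
state = pickle.loads(ckpt.read_bytes()) if ckpt else {"step": 0, "model": None}

while state["step"] < 10_000:
    state["step"] += 1
    # ... one training step updating state["model"] goes here ...

    if state["step"] % 500 == 0:
        # Write to a temp file then rename, so an interruption never leaves a half-written checkpoint.
        tmp = CKPT_DIR / "tmp.pkl"
        tmp.write_bytes(pickle.dumps(state))
        tmp.rename(CKPT_DIR / f"step-{state['step']:08d}.pkl")
```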