I read through the homepage, but I'm still not entirely sure about one thing:
Why not just train on Spot Instances with a retry implemented?
I see that SpotML has a configurable fallback to On-Demand instances, and perhaps their value prop is that it saves the state of your run up to the interruption and resumes it on the On-Demand instance, but why not just set a retry on the Spot Instance if it's interrupted?
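To make the question concrete, here is a minimal sketch of what "Spot with a retry" could look like when the job also checkpoints its progress. This is not SpotML's or Metaflow's implementation; `SpotInterruption`, `train`, `run_with_retry`, and the JSON checkpoint format are all hypothetical names made up for illustration, and the interruption is simulated rather than detected from AWS.

```python
import json
import os


class SpotInterruption(Exception):
    """Hypothetical stand-in for whatever signals a reclaimed Spot Instance."""


def train(checkpoint_path, total_steps, interrupt_at=None):
    # Resume from the last saved step if a checkpoint already exists.
    step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]

    while step < total_steps:
        # Simulated interruption (assumption: a real job would react to a signal).
        if interrupt_at is not None and step == interrupt_at:
            raise SpotInterruption(f"interrupted at step {step}")
        step += 1  # placeholder for one real training step
        with open(checkpoint_path, "w") as f:
            json.dump({"step": step}, f)
    return step


def run_with_retry(checkpoint_path, total_steps, max_retries=3, interrupt_at=None):
    # Returns (final_step, number_of_retries_used).
    for attempt in range(max_retries + 1):
        try:
            # Only the first attempt simulates an interruption.
            simulated = interrupt_at if attempt == 0 else None
            return train(checkpoint_path, total_steps, simulated), attempt
        except SpotInterruption:
            continue
    raise RuntimeError("ran out of retries")
```

With this shape, a retry only repeats the work done since the last checkpoint. Without the checkpoint, a plain retry would start the run from step 0 every time.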
Interesting, thanks, we weren't aware of Metaflow.
I've read through the docs, and the one difference that comes to mind is the automatic fallback to On-Demand and the resume back to Spot when it becomes available. I can't readily see a way to do this in Metaflow yet, but it's possible I've missed something.
There are a few different ways to deal with Spot interruptions. First, it is a good idea to specify multiple instance types in your compute environment, so that even if some instance types become unavailable on Spot, Batch can automatically use another type.
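As a rough sketch of the multiple-instance-types suggestion, the helper below builds the request for an AWS Batch managed Spot compute environment. The function name `spot_compute_environment`, the default vCPU limit, and the choice of `SPOT_CAPACITY_OPTIMIZED` are assumptions for illustration. The dictionary keys follow the parameters of boto3's Batch `create_compute_environment` call, but the code only builds the dictionary and does not talk to AWS.

```python
def spot_compute_environment(name, instance_types, subnets,
                             security_group_ids, instance_role, max_vcpus=64):
    # Listing a single type defeats the purpose: Batch would have no alternative
    # to fall over to when that type is unavailable on Spot.
    if len(instance_types) < 2:
        raise ValueError("specify at least two instance types")

    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "SPOT",
            # Assumption: prefer pools with spare capacity over lowest price.
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,
            "maxvCpus": max_vcpus,
            "instanceTypes": list(instance_types),
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role,
        },
    }
```

The result could then be passed along the lines of `boto3.client("batch").create_compute_environment(**env)`, with real subnet, security group, and instance role values for your account.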
I'm failing to see what is different about SpotML vs Metaflow's @retry decorator and using AWS Batch: https://docs.metaflow.org/metaflow/failures#retrying-tasks-w...
If you're still in the comments, Vishnu, I'd love to hear your thoughts.