Gatesyp's comments

Gatesyp · on Oct 3, 2021

Read through the homepage, but not entirely sure --

Why not just train on Spot Instances with a retry implemented?

I see that SpotML has a configurable fallback to On-Demand instances, and perhaps their value prop is that it saves the state of your run up to the interruption and resumes it on the On-Demand instance. But why not just set a retry on the Spot Instance if it's interrupted?

I'm failing to see what is different about SpotML vs Metaflow's @retry decorator and using AWS Batch: https://docs.metaflow.org/metaflow/failures#retrying-tasks-w...

If you're still in the comments, Vishnu, I'd love to hear your thoughts.
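
To make the question concrete, here is a minimal sketch of "retry plus resume from saved state" in plain Python. It is not SpotML's or Metaflow's implementation. `SpotInterruption`, `train`, `run_with_retry`, and the JSON checkpoint file are all hypothetical names made up for illustration, and the interruption is simulated rather than coming from AWS.

```python
import json
import os


class SpotInterruption(Exception):
    """Hypothetical stand-in for a Spot reclaim; not an AWS or Metaflow type."""


def train(total_steps, checkpoint_path, executed, interrupt_at=None):
    # Resume from the last checkpoint if one exists, otherwise start at step 0.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["step"]
    for step in range(start, total_steps):
        if interrupt_at is not None and step == interrupt_at:
            raise SpotInterruption(f"interrupted before step {step}")
        executed.append(step)  # one unit of training work would go here
        # Persist progress so a retried attempt can pick up from this point.
        with open(checkpoint_path, "w") as f:
            json.dump({"step": step + 1}, f)


def run_with_retry(total_steps, checkpoint_path, max_retries=3, interrupt_at=None):
    executed = []
    for attempt in range(1, max_retries + 2):
        try:
            # Only the first attempt is interrupted in this simulation.
            fault = interrupt_at if attempt == 1 else None
            train(total_steps, checkpoint_path, executed, fault)
            return attempt, executed
        except SpotInterruption:
            continue  # retry; train() reloads the checkpoint on the next attempt
    raise RuntimeError("retries exhausted")
```

With a checkpoint in place, the retried attempt continues from the interrupted step instead of repeating earlier ones. Without the checkpoint, a plain retry would start again from step 0.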

vishnukool · on Oct 3, 2021

Interesting, thanks, we weren't aware of Metaflow.

I've read through the docs. The one difference that comes to mind is the automatic fallback to On-Demand and the switch back to Spot when it becomes available. I can't readily see a way to do this yet in Metaflow, but it's possible I've missed something.

vtuulos · on Oct 4, 2021

There are a few different ways to deal with spot interruptions. First, it is a good idea to specify multiple instance types in your compute environment, so that even if some instance types become unavailable in spot, Batch can use another type automatically.

Second, you can rely on Spot Fleets, which handle both spot and on-demand instances seamlessly: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fle...
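
As a rough illustration of the first suggestion, the sketch below builds the `computeResources` portion of an AWS Batch compute environment request with several instance types. The helper name `spot_compute_resources` and every ID and role name are hypothetical placeholders. The dictionary keys follow the Batch `CreateComputeEnvironment` request shape as I understand it, so check them against the current AWS documentation before use. Nothing here calls AWS.

```python
def spot_compute_resources(instance_types, max_vcpus):
    """Build a computeResources dict for a managed Spot compute environment.

    All IDs and role names below are placeholders, not real resources.
    """
    if len(instance_types) < 2:
        raise ValueError("list several instance types so Batch has alternatives")
    return {
        "type": "SPOT",
        # Ask Batch to prefer Spot pools that are less likely to be interrupted.
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": max_vcpus,
        "instanceTypes": list(instance_types),
        "subnets": ["subnet-PLACEHOLDER"],
        "securityGroupIds": ["sg-PLACEHOLDER"],
        "instanceRole": "PLACEHOLDER-instance-profile",
    }


resources = spot_compute_resources(["p3.2xlarge", "g4dn.xlarge", "g5.xlarge"], 64)
```

The resulting dictionary is the kind of value one would pass as `computeResources` when creating the compute environment. The instance types chosen here are only examples.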

Gatesyp · on Aug 7, 2020

What is the link to source code?

kawicoder · on Aug 7, 2020

https://github.com/Geczy/tinder-autopilot/