Your first sentence seems unrelated to your second, though it sounds like you think they're connected?
Any platform that advertises something like "you don't have to spend time defining your training environment," whether it's Databricks or an out-of-the-box deep learning VM on GCP, is a liability waiting to happen. You always need to define your own training environment, because you'll almost always need specific (likely pinned) versions of all your dependencies, including system dependencies, to manage model training as part of a production life cycle. Very often you also need, e.g., a custom-compiled TensorFlow, custom GPU settings, or particular drivers. Basing that environment on whatever comes out of the box is very foolish.
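To make the "define your own environment" point concrete, here's a minimal sketch of what pinning looks like in practice. Everything in it is illustrative: the base image tag, package names, and version numbers are hypothetical choices, not recommendations.

```dockerfile
# Hypothetical pinned training environment; image tag and versions
# are illustrative only.
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

# Pin system dependencies, not just Python ones.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Exact versions, not ranges, so training is reproducible across the
# model's production life cycle. A custom-compiled TensorFlow wheel
# would be COPY'd and installed here instead of the PyPI build.
RUN pip3 install --no-cache-dir \
        tensorflow==2.15.0 \
        numpy==1.26.4

COPY train.py /app/train.py
WORKDIR /app
CMD ["python3", "train.py"]
```

The point isn't this particular stack; it's that every version above is an explicit decision you own, rather than whatever a managed platform happened to ship that week.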
Spark also is not universally useful. For training small models many times, like a workload that trains hundreds of small models all day (a production use case my company pilot-tested with Databricks), the overhead of the Py4J connector is brutal. It's a terrible paradigm for Python software, and Scala is a miserable ecosystem for production machine learning. On top of all that, Spark MLlib has huge gaps in functionality, and whole classes of problems (e.g., large-scale MCMC inference) aren't solvable in Spark in a way that's seriously comparable to tools like Stan and PyMC running off-Spark with simple multiprocessing.
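The "hundreds of small models with simple multiprocessing" pattern I'm describing is roughly this shape (a stdlib-only sketch: the per-key "fit" here is a toy closed-form regression standing in for a real Stan/PyMC/sklearn fit, and all names are made up for illustration):

```python
# One tiny model per key, fit in parallel with plain multiprocessing,
# no JVM or Py4J round-trips involved.
from concurrent.futures import ProcessPoolExecutor

def fit_small_model(args):
    """Fit one small model for one key.

    Toy stand-in for a real fit: ordinary least squares slope through
    the origin, slope = sum(x*y) / sum(x*x).
    """
    key, xs, ys = args
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return key, slope

def train_all(datasets, workers=4):
    """datasets: iterable of (key, xs, ys) tuples, one per small model."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fit_small_model, datasets))

if __name__ == "__main__":
    # Hypothetical workload: many small per-series datasets.
    data = [(f"series_{i}", [1.0, 2.0, 3.0], [2.0 * i, 4.0 * i, 6.0 * i])
            for i in range(1, 6)]
    print(train_all(data))
```

Each task is independent and the data per model is small, so process-level parallelism covers it; there's nothing here that needs a cluster scheduler, a shuffle, or serialization across a Python-JVM boundary.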