Unfortunately, in ML "public data" typically just means "available to the public", even if it's pirated, like much of the data in the Books3 dataset, which is a big part of some other very prominent datasets.
So basically YouTube all over again? I.e., bootstrap and become popular by using whatever media is widely available (pirated via crowdsourced piracy), and then many years later, once it's popular and dominant, turn around, "do things right", and guard copyrights.