I've been taking the opposite approach by playing around with different hardware bottlenecks, and the models I can run on base-level machines (e.g. a GTX 1050 Ti with a 7th-gen i5) are all useful for specific tasks if I deconstruct a process far enough.

If you know exactly what you want and when you need the results, calculating a hardware floor becomes deterministic. If you don't need the results right away, or if you're comfortable gluing together results yourself, pretty much any box released in the past decade can be doing work for you.
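To make the "hardware floor" idea concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names, the idea of measuring work in output tokens, and the example numbers are mine, not measurements from any real machine.

```python
import math

def required_tokens_per_second(num_items, tokens_per_item, deadline_seconds):
    """Throughput the whole setup must sustain to finish the job on time."""
    return num_items * tokens_per_item / deadline_seconds

def boxes_needed(required_tps, measured_tps_per_box):
    """How many identical machines it takes to reach the required throughput."""
    return math.ceil(required_tps / measured_tps_per_box)

# Hypothetical job: 10,000 items, ~200 output tokens each, due in 24 hours.
floor = required_tokens_per_second(10_000, 200, 24 * 3600)
print(f"need about {floor:.1f} tokens/s overall")

# Hypothetical old box that you measured at 5 tokens/s on this task.
print(f"machines needed: {boxes_needed(floor, 5.0)}")
```

Stretch the deadline and the floor drops in proportion, which is why a slow box is fine when nobody is waiting on the answer.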

For a quick experiment, grab a few images and a one-line prompt, e.g. "describe these pictures", and bounce that set off every model/quant you can reach. If you record the quality, rate, and cost of each request, there might be regions of

`"good enough", "good enough", "electricity + sweat"`

in the resulting spreadsheet. And this is multimodal. Single-mode classification is dirt cheap.
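If you'd rather script the spreadsheet than fill it in by hand, something like the sketch below would do. It is only a harness under my own assumptions: `call` stands in for whatever actually hits a model, `score` is your own quality judgment, and `cost_per_second` is whatever you estimate for electricity or API spend. None of those names come from a real library.

```python
import csv
import time
from dataclasses import dataclass, asdict, fields

@dataclass
class Trial:
    model: str      # label for the model/quant you tried
    quality: float  # your own score, e.g. 0.0 to 1.0
    seconds: float  # wall-clock time for the request
    cost: float     # estimated cost of the request

def run_trial(model, call, score, cost_per_second):
    """Time one request and turn it into a spreadsheet row."""
    start = time.perf_counter()
    output = call()  # hypothetical: your code that sends images + prompt
    elapsed = time.perf_counter() - start
    return Trial(model, score(output), elapsed, elapsed * cost_per_second)

def good_enough(trials, min_quality, max_seconds, max_cost):
    """Keep only the rows that clear all three thresholds."""
    return [
        t for t in trials
        if t.quality >= min_quality
        and t.seconds <= max_seconds
        and t.cost <= max_cost
    ]

def write_csv(trials, path):
    """Dump every row to a CSV you can open as a spreadsheet."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fl.name for fl in fields(Trial)])
        writer.writeheader()
        writer.writerows(asdict(t) for t in trials)
```

Run `run_trial` once per model/quant, pass the list to `write_csv`, and use `good_enough` with whatever thresholds match your task to see which rows survive.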