It does not sound like a realistic capacity plan. The reason this works in the cloud is that the inference can be run in parallel on a huge amount of hardware for a short time. Running those kinds of models on your rinky-dink computer would take forever.
An Nvidia 3090 GPU can run OpenAI's Whisper at 17x realtime[0]. They're not exactly cheap (~$500?), but they're cheap enough that running the transcription end at home is quite feasible. And it includes translation, so you don't have to speak English to use it.
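For a sense of what that looks like in practice, here's a minimal sketch using the open-source openai-whisper package (pip install openai-whisper); the audio file name and model size are just placeholders:

    import whisper

    # Load a Whisper model; bigger models need more VRAM but are more accurate.
    model = whisper.load_model("medium")

    # Transcribe in whatever language was spoken ("recording.wav" is a placeholder).
    result = model.transcribe("recording.wav")
    print(result["text"])

    # The same model can translate non-English speech straight into English text.
    translated = model.transcribe("recording.wav", task="translate")
    print(translated["text"])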
Searching a downloaded copy of Wikipedia wouldn't be that computationally expensive either, if the assistant has hot words it picks up on to trigger a lookup.
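As a rough illustration (the table layout and sample row are made up), a local full-text index over article summaries is enough for that kind of lookup, e.g. with SQLite's FTS5, which most Python builds ship with:

    import sqlite3

    conn = sqlite3.connect("wiki.db")
    # Full-text index over locally stored article titles and summaries.
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")
    conn.execute("INSERT INTO articles VALUES (?, ?)",
                 ("Raspberry Pi", "The Raspberry Pi is a series of small single-board computers..."))
    conn.commit()

    # When the assistant catches a hot word, query the local index instead of the cloud.
    query = "raspberry pi"
    rows = conn.execute(
        "SELECT title, snippet(articles, 1, '[', ']', '...', 10) "
        "FROM articles WHERE articles MATCH ? ORDER BY rank LIMIT 3", (query,))
    for title, snip in rows:
        print(title, "->", snip)

In a real setup you'd bulk-load a Wikipedia dump (e.g. the abstracts file) rather than inserting rows by hand.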
It's also possible to use local AI processing chips like Coral or Gyrfalcon for this.
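Something like this is roughly what driving a Coral Edge TPU from Python looks like with the pycoral library; the model file is a placeholder and you'd need an actual Edge TPU (USB, M.2 or PCIe) plugged in:

    import numpy as np
    from pycoral.utils.edgetpu import list_edge_tpus, make_interpreter

    # Enumerate whatever Edge TPUs are attached (USB stick, M.2, or a PCIe card full of them).
    print(list_edge_tpus())

    # Load a model compiled for the Edge TPU ("keyword_model_edgetpu.tflite" is hypothetical).
    interpreter = make_interpreter("keyword_model_edgetpu.tflite")
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    features = np.zeros(inp["shape"], dtype=inp["dtype"])  # stand-in for real audio features
    interpreter.set_tensor(inp["index"], features)
    interpreter.invoke()

    out = interpreter.get_output_details()[0]
    print(interpreter.get_tensor(out["index"]))             # raw class scores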
Could just load up a PCIe card full of these chips if necessary. A local home AI would be such a boon to people, not just the average person but the elderly as well. Combined with a refined GPT etc., it could respond to requests conversationally rather than with most assistants' current request->response "I am a robot" scheme:
>Your son called when you were asleep to ask if you wanted to get coffee today, shall I call him back for you or put you through to him?
>X, you've fallen! Please let me know you're okay or I will call emergency services for you
It's sad that we have the technology to do this already but haven't built it.
One second of Google Cloud TPU time provides roughly the same number of floating point operations as 4 hours of Raspberry Pi 4B time.
So 3 minutes of Cloud TPU time already covers your whole month of Raspberry Pi usage. Pretty sure it costs them less than $5 as well, since they have the hardware anyway.
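Back-of-the-envelope, using rough public peak figures (assumptions, not measurements: ~420 TFLOPS for a Cloud TPU v3 board, ~30 GFLOPS for a Pi 4B):

    tpu_flops = 420e12        # assumed peak for one Cloud TPU v3 board
    pi_flops = 30e9           # assumed practical peak for a Raspberry Pi 4B

    ratio = tpu_flops / pi_flops            # ~14,000x
    print(ratio / 3600)                     # ~3.9 -> one TPU-second ~ 4 Pi-hours

    pi_month = 30 * 24 * 3600               # seconds in a month of always-on Pi time
    print(pi_month / ratio / 60)            # ~3.1 -> about 3 minutes of TPU time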
"The cloud" is also massively parallel software. If I run a Google search, many thousands of CPUs will be brought to bear on my query, and a gazillion DIMMs, and all the throughput of a hell of a lot of SSDs, and so on. If you just happened to have a copy of the web, and an index of it, on "a computer" no matter how big, it would be impossible to get prompt answers.
If Google (or whomever) needs to run voice models, they take your query and all the other queries that arrive in the same millisecond, smoosh them all together and shove the batch into a TPU and run it. You don't have any TPUs and you also don't have any traffic you can use to amortize the cost of your infrequent queries.
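In toy form, batching looks something like this (shapes are made up, a real speech model is far larger); the point is that one pass through the weights serves every query in the batch, while a home box pays the same fixed cost to serve a single stream:

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.standard_normal((4000, 1000))     # stand-in for one model layer

    # 256 queries that arrived in the same millisecond, smooshed into one matrix.
    queries = [rng.standard_normal(4000) for _ in range(256)]
    batch = np.stack(queries)                       # shape (256, 4000)
    activations = batch @ weights                   # one pass serves all 256 users

    # A home server only ever sees its own query, so nothing amortizes the overhead.
    single = queries[0] @ weights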
The idea that you could run these kinds of ML inference tasks at home is economically fanciful. You would need a huge investment in hardware and the opex would be ridiculous.
> The idea that you could run these kinds of ML inference tasks at home is economically fanciful. You would need a huge investment in hardware and the opex would be ridiculous.
Google, Apple, Amazon and even Sonos are all releasing voice assistants that work locally on their relatively low-powered speakers.
Apple seems to be ahead on what runs locally, while Google seems to be the smartest. (Sonos doesn’t have a cloud, but it’s not ‘general purpose’ afaik.)
Sure, you can’t amortize your queries across a bunch of TPUs, but instead they can ship custom hardware. A TPU needs to be big and support parallel streams; a home server may only ever need to serve one stream. There are Arduino-style devices that can run basic TensorFlow audio models in real time now. And obviously most phones can do this locally now, so depending on your opinion that may be considered affordable.
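For example, here's a rough sketch of checking whether a small keyword-spotting model keeps up with real time on a plain CPU using tflite_runtime (the model file is a placeholder, e.g. a quantized "speech commands" style model):

    import time
    import numpy as np
    from tflite_runtime.interpreter import Interpreter

    interpreter = Interpreter(model_path="keyword_spotter.tflite")  # hypothetical model
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    window = np.zeros(inp["shape"], dtype=inp["dtype"])  # stand-in for 1 s of audio features
    start = time.perf_counter()
    for _ in range(100):
        interpreter.set_tensor(inp["index"], window)
        interpreter.invoke()
    latency = (time.perf_counter() - start) / 100

    print(interpreter.get_tensor(out["index"]))           # raw keyword scores
    print(f"{latency*1000:.1f} ms per 1 s window -> ~{1.0/latency:.0f}x realtime")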