We had speech recognition back in the 1990s on computers less powerful than a Raspberry Pi V1. We're talking 200-400 MHz 32-bit Intel boxes. So yes, the cloud dependency is very dubious.
If leveraging a lot of data allows for better speech recognition, why can't your computer access a remote speech recognition data set that stores and shares the results of its machine learning algorithms rather than uploading actual audio data? Instead of sending actual audio, send and receive very non-personalized non-specific derived model data to/from a repository somewhere (or even peer to peer).
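To make that concrete, here's a minimal sketch in the style of federated averaging: each device trains on its own private audio and only ships an anonymized weight delta to a shared repository. All the names and numbers here are invented for illustration.

```python
import numpy as np

def local_update(weights, gradients, lr=0.1):
    # Train on private audio locally; only the resulting weight delta
    # ever leaves the device, never the audio itself.
    return weights - lr * gradients

def aggregate(deltas):
    # A repository (or peer swarm) averages anonymized deltas from many users.
    return np.mean(deltas, axis=0)

shared = np.zeros(3)                       # everyone starts from the same model
d1 = local_update(shared, np.array([1.0, 0.0, 2.0])) - shared
d2 = local_update(shared, np.array([3.0, 2.0, 0.0])) - shared
new_shared = shared + aggregate([d1, d2])  # improved model, no audio exchanged
```

The point is just that what crosses the wire is derived model data, which is far less personal than raw recordings.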
>We had speech recognition back in the 1990s on computers less powerful than a Raspberry Pi V1. We're talking 200-400 MHz 32-bit Intel boxes. So yes, the cloud dependency is very dubious.
And did you ever use it? Forget sentences, it used to struggle on even a handful of keywords. Even now, offline recognizers are way behind the online ones. I have PocketSphinx installed on my Raspberry Pi, and even in a quiet room it has false positives with just a list of 10 keywords. Oh, what I would do to have an offline recognition system that is on par with Cortana/Siri/Google Now.
Not sure if you're being facetious or not but if you were right then we would just do it on our existing phones now.
In the 90s we had slow voice recognition that took a long time to train, that would only ever work for a single user, in a silent room... If it worked at all... Which wasn't very common.
> Not sure if you're being facetious or not but if you were right then we would just do it on our existing phones now.
The point is, some of us don't believe that this was an engineering choice.
> In the 90s we had slow voice recognition that took a long time to train, that would only ever work for a single user, in a silent room... If it worked at all... Which wasn't very common.
And in the 2000s we had fast voice recognition that took a little bit of time to train and that would work over a crappy microphone with loud music playing in the room, all of that running alongside other software on a $500 PC. I know because in 2007 I made my own Star Trek-like voice recognition system (with the proper computer sound and voice feedback) that I used to control music played on Hi-Fi speakers. It took me like 20 minutes to train and it worked pretty much flawlessly from anywhere in the room. The voice was captured by a crappy mic I soldered myself from parts and placed on top of a wardrobe.
And the single-user-only mode? That's actually a feature, not a bug.
Hear hear. Of course the cloud is not necessary for good speech recognition. There is no magic there, it's just servers running against a corpus that gets updated often. No reason why this couldn't be done locally, and text queries sent out for non-local requests (such as, what's the weather gonna be).
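A local-first split like that could be as simple as the following sketch (the intent list and function names here are hypothetical): recognition runs on-device, and only the text of queries that genuinely need outside data ever leaves the machine.

```python
LOCAL_INTENTS = {"play music", "set alarm", "pause"}  # hypothetical local commands

def route(transcript):
    # Recognition already happened on-device; only text leaves the machine,
    # and only for queries that need external data (e.g. a weather lookup).
    if transcript in LOCAL_INTENTS:
        return ("local", transcript)
    return ("remote_text_query", transcript)
```

Sending a short text query is a very different privacy proposition from streaming raw audio.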
But I gotta say, I have the feeling that the pendulum is gonna swing back pretty soon. I'm noticing more and more (regular) people being fed up with, and creeped out by, the massive harvesting that Google, Facebook and Microsoft are doing. Opportunity awaits!
... on a Pentium I, using 1990s machine learning algorithms, sure.
Nobody's answered my question as to why The Cloud is the magic pixie dust that solves this problem, and why it could not be solved locally with modern compute power and modern ML techniques.
There are several tremendous advantages to server-based speech recognition.
Firstly, the models (particularly the language models) needed for state of the art performance are huge. It's not atypical for papers to discuss using a billion n-grams, for example ( https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenTerm1201415/s... ). That's several gigabytes of memory and storage at the very least, and you'd need a copy of that for every spoken language you'd want to support. Plus you need to keep that up to date with new words and phrases; it's much easier to keep models fresh on a server than on everyone's computer.
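Back-of-envelope, assuming an optimistic 8 bytes per stored n-gram (packed word IDs plus a quantized probability), the "several gigabytes per language" claim checks out:

```python
# One billion n-grams at an optimistic 8 bytes each:
ngrams = 1_000_000_000
bytes_per_entry = 8
total_gb = ngrams * bytes_per_entry / 1e9   # per language

languages = 10                              # a modest language lineup
fleet_gb = total_gb * languages             # storage every client would need
```

Real toolkits fight hard to compress below this, but the order of magnitude is the problem: shipping and refreshing tens of gigabytes to every client is much harder than updating one server-side copy.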
Power and CPU time are also a concern. Big beefy server farms can have trouble keeping up with state-of-the-art speech recognition algorithms; a laptop, tablet, or phone, especially when running off a battery, is at a huge disadvantage.
But the biggest advantage to server-based speech recognition is indeed that more data is critical to improving accuracy and performance. There's no data like more data. And you don't just need more data, you need a lot more data. You can get big gains just from doing unsupervised training on 20 million utterances rather than 2 million: http://static.googleusercontent.com/media/research.google.co... There's simply no way you're going to get anything like 20 million utterances without getting data from millions of real world users.
The large data size affects the training, but the model itself is pretty small now (after some hard work on Google's part).
The thing everyone seems to be missing is that Android's (English) voice recognizer is offline[1]. While you can use the online model, I suspect that is more about continual updating of the model (so it understands new words, changing accents, etc.) rather than recognition.
Because many people speak similarly. If enough people who sound like you have trained it, it can learn how you will say words you haven't even said to it yet, because a number of other people already have.
Machine learning algorithms haven't changed that much since the 90s; what's changed is the amount of data we have access to, and the amount of data we can process.
When you're training it yourself, the data is what's limited. The fact that we can process more data doesn't matter if we don't have access to more data, because you can't speak any faster.
But if you have millions of people speaking to it, then we can take advantage of the fact that we can process so much more data.
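A toy illustration of why pooling matters, using an invented "canonical" acoustic value and simulated noisy utterances: the estimate from millions of pooled samples lands much closer to the truth than what one user could record alone.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0   # pretend "canonical" acoustic feature for some phoneme

one_user = rng.normal(true_value, 1.0, size=20)       # what you can record alone
pooled = rng.normal(true_value, 1.0, size=1_000_000)  # millions of users' speech

err_single = abs(one_user.mean() - true_value)
err_pooled = abs(pooled.mean() - true_value)
```

Averaging error shrinks roughly with the square root of the sample count, which is why 20 million utterances beat 20.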
And why not have all the features that can be done locally be done locally? If it's possible for my computer to understand me entering an appointment, why should that go to an MS server to be stored forever?
How do you think your computer can understand 'entering an appointment'?
There's a lot more that goes into understanding than JUST speech recognition. First of all, speech recognition by itself isn't exactly trivial, and that's become more and more obvious as we've seen the smallest accent mess with the digital assistants on all the major phones. Yes, technically, Dragon NaturallySpeaking existed a decade ago and worked somewhat, but it needed a LOT of training, and was dumb as a brick. It doesn't compare.
But beyond that, understanding the meaning of the spoken word is difficult too. Yes, NLTs exist, and they can be very good, but you really need something that a team is administering. They can identify pain points and do regular updates to help... things like an odd band name that is ALWAYS misunderstood, some odd combination of words that confuses a question with a 911 call, etc., otherwise you're just going to end up frustrated.
I should also mention that a digital assistant really needs the power of a full search engine behind it. This allows for auto-correction of mispronounced words, but it also allows near-instant lookups for relevant information. If this were running on your local machine, not only would the processing be slow for some things, it would also be more limited in its ability to fully process all possible meanings, and it would need to be updated CONSTANTLY.
These companies, by putting the language processing in the cloud, are throwing teams and hardware at the problem, and yet they STILL have embarrassing difficulties when it comes to actually understanding sometimes. Consider that for a moment... hundreds, even thousands of servers running the latest software for processing natural language for many millions of people aren't capable of getting your meaning 100% of the time.
Incidentally, I realize that there are some open source projects out there that do some rudimentary voice recognition and processing; however, they suffer from the same issues addressed above and are MUCH more limited in many, many ways. Many of them still make use of cloud-based services for processing the audio, btw. The one advantage, I will say, is that you have the ability to add your own custom commands and actions, which the major systems obviously don't allow.
I mean, my problem has more to do with the fact that it's an open door than with what they will actually do. It doesn't say they will send audio data; at most it says "associated input data", which for all I know could be a database from their algorithms, or it could be a live 24/7 stream from my webcam and audio device.
I guess the thing is that some things are not acceptable, and whether there's a disclaimer or not, people aren't going to like it if we find out that all of our audio is being recorded and uploaded to Microsoft. But it's not, not as far as anyone can tell yet.
But again, we're only worried because they're what, giving us the option to opt out? I mean, if they wanted to they could just go ahead and stick somewhere in the privacy policy something like "from time to time microsoft will upload certain input data for improvement of service quality, depersonalized information may be sent to partners." down in paragraph 24.c.iii. Or they could just not mention it at all.
The question is whether you're willing to trust the OS. I mean, hell, at one point Ubuntu Linux sent all of your search information to Amazon without even giving you the option to opt out in the install process. It could be disabled, but unless you knew about it in advance there was no option to do so. And Ubuntu is open source.
I can see use cases for it, and one actually ties into the location services. Say you're from a region with a specific accent. If the system can tell how you speak, and how other people speak around you, it might be able to create an accent subset for you based on the collective data from all of those speakers. It might be able to guess from a few sentences and your location that you're Glaswegian and start to understand you, not because you trained it, but because across the region many people have trained it a bit. Then with the location to tie the regional accent together, even if you're in the US once you've spoken a few phrases it might be able to identify you as belonging to that regional language group.
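That regional-accent idea boils down to something like the following sketch. The locations, accent labels, and counts are entirely made up; the point is just that a new user inherits whatever accent model the majority of speakers in their region have collectively trained.

```python
from collections import Counter, defaultdict

# Invented toy data: (coarse location, accent cluster inferred from training)
observations = [
    ("glasgow", "scottish"), ("glasgow", "scottish"), ("glasgow", "rp"),
    ("boston", "new_england"), ("boston", "new_england"),
]

by_region = defaultdict(Counter)
for region, accent in observations:
    by_region[region][accent] += 1

def accent_for(region):
    # A new user starts from the accent model most of their region trained.
    return by_region[region].most_common(1)[0][0]
```

None of this requires shipping anyone's raw audio anywhere, just aggregate counts per region.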
But uploading of all spoken data to Microsoft would be silly, not just because it would piss people off, but because it wouldn't be something you could hide, and it would end up being quite a lot of data that's really not that useful.
But could it be possible? Sure. But they could also do it without tipping you off or giving you the ability to opt out.