Hacker News

I have an RSI and I've been coding by voice exclusively for about 7 years. I used a system built on top of Dragon for most of that and in the last year switched to Talon.

I think there are multiple reasons:

* The obvious market is dictation of natural language, but that isn't what you want for voice control. If you try to use long descriptive phrases as your command language, everything takes forever. So instead you end up making your own mini command language where all of your common actions are a single syllable, but now it's no longer the English (or other natural language) that users already know. So now your product has a substantial learning curve, just like learning a new keyboard layout.

* Everything other than Talon has terrible latency. Most existing speech recognition engines were not designed for the kind of latency you want for quick one-syllable commands.

* In order for it to be really effective you need the cooperation of applications (this is why I've written extensive emacs integration). Some tools like Windows Speech Recognition try to hook in at the UI layer to figure out what text is in dialog boxes and such, but in practice they do a pretty terrible job. Windows Speech Recognition has a very hard time consistently understanding which links you are trying to get it to click, for example. There's also a long tail of applications that do their own custom UI rendering inside a blank canvas, where no hook is possible.

* Good speech recognition, even if not specifically targeting computer voice control, is a genuinely hard research problem, and standard benchmarks for accuracy are misleading. You see "95% accuracy" and think: wow, that's a high percentage, computers almost have this speech recognition thing solved. Then you think about it harder and realize that's one mistake every 20 words! Maybe you are still impressed, but then you have to take into account that when the computer does the wrong thing you'll need to issue more commands to correct it, and those are also likely to be misinterpreted. When you make a typo with a keyboard the mistakes rarely cascade; you just hit backspace.
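To make the command-language point concrete, here's a minimal sketch (invented command words for illustration; not Talon's or any real system's grammar) of the kind of mini language that emerges, where frequent actions get short, phonetically distinct tokens instead of descriptive English:

```python
# Hypothetical spoken-token-to-action table. One-syllable, distinct-sounding
# words keep recognition fast and reliable, at the cost of having to learn them.
COMMANDS = {
    "slap": "press Enter",       # one syllable beats "press the enter key"
    "trog": "go to start of line",
    "yank": "copy selection",
    "dell": "delete line",
}

def interpret(utterance: str) -> str:
    """Map a recognized token to its action; treat anything else as dictation."""
    return COMMANDS.get(utterance, f"dictate: {utterance}")
```

The tension described above is visible here: none of these words mean anything to a new user, so the vocabulary has to be memorized like a keyboard layout.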
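The cascading-correction problem in the last point can be put into a back-of-envelope model. Assume (simplistically) that every command, including each correction, is misrecognized independently with the same probability, and each misrecognition costs one more correction command:

```python
def expected_commands_per_action(p_error: float) -> float:
    """Expected number of commands to land one action when each command
    (including each correction) fails with probability p_error and every
    failure requires one more correction. This is the geometric series
    1 + p + p^2 + ... = 1 / (1 - p)."""
    return 1.0 / (1.0 - p_error)

# At "95% accuracy" the per-command overhead looks modest, but this toy model
# ignores that a real misrecognition often also needs undo/repair commands,
# each of which can itself be misrecognized.
```

The point is that recognition errors compound in a way keyboard typos don't, so a headline accuracy number understates the pain.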



"Everything other than talon has terrible latency": False! I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend, which has extremely low latency. You can adjust how aggressive the VAD (voice activity detection) is to suit your preference, but the speech engine latency can be almost negligible, especially for voice commands (vs prose dictation). However, I agree that "most existing speech recognition engines were not designed with the kind of latency you want for quick one syllable commands", and that low latency is pivotal to being productive with voice commands. I also agree with your other points.
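As a rough illustration of the VAD-aggressiveness/latency tradeoff mentioned above, here is a toy endpointer (pure Python over boolean speech/silence frames; not kaldi-active-grammar's actual implementation, which runs a real VAD over audio):

```python
def utterance_end_index(frames, silence_frames_required):
    """Return the number of frames consumed before an endpointer would declare
    the utterance finished: the first point where `silence_frames_required`
    consecutive non-speech frames follow speech. A more aggressive setting
    (fewer required silence frames) commits sooner, cutting latency at the
    risk of clipping a trailing word.

    frames: sequence of booleans, True = speech detected in that frame.
    """
    silent = 0
    seen_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            seen_speech = True
            silent = 0
        elif seen_speech:
            silent += 1
            if silent >= silence_frames_required:
                return i + 1
    return len(frames)
```

With 30 ms frames, requiring 3 trailing silence frames instead of 8 shaves ~150 ms off every command, which is exactly the kind of margin that matters for one-syllable commands versus prose dictation.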


I built a similar app using Kaldi's nnet3 models running embedded; the thing was so responsive that our demo to an SVP went sideways: when he gave a query, the app responded almost immediately after the sentence ended. The SVP did not realize it had already responded, because the expectation for voice interaction systems was a 2-5 second wait for an answer, which gave the impression that the system was not working properly.

So, moral of the story: if you do too good a job of making a fast speech engine, especially for multi-turn dialogues, add some delay so it resembles human dialogue more.
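That pacing could be sketched like this (hypothetical helper and illustrative numbers, not from any shipped system):

```python
import time

def paced_reply(reply: str, query_end: float, min_latency_s: float = 0.6) -> str:
    """Hold a ready answer until at least min_latency_s after the query ended,
    so a near-instant response doesn't read as the system having ignored you.
    query_end: time.monotonic() timestamp of when the user stopped speaking."""
    remaining = min_latency_s - (time.monotonic() - query_end)
    if remaining > 0:
        time.sleep(remaining)
    return reply
```

Note this is the opposite advice from the command-latency discussion above: for discrete voice commands you want the response as fast as possible, while conversational turns benefit from human-like pauses.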


Sorry, should have said everything I have tried :)

At some point when I have enough free time I will have to take a look at this! Thanks for putting time into this kind of thing!



