I can't really understand speech these days without captions to go with it. But I encounter discrepancies with AI generated captions very often. As in, I heard something and from context I know I'm right and the AI is wrong. Whisper and other deep learning based speech systems in particular can generate very plausible misinterpretations: something that sounds similar and is grammatically plausible, but is not what was said. These are errors of a kind that a person with semantic understanding of what's going on would not make. So I am a little leery of them for that reason. I rely on them every day for generating captions for video and so on, but I haven't found any iteration I've tried reliable or comfortable for interactive use.
> I encounter discrepancies with AI generated captions very often. As in, I heard something and from context I know I'm right and the AI is wrong.
I've been noticing this as well. It's becoming a common problem. Also, many times I've noticed that if I hadn't heard the speech being captioned and only had the captioning to go by, I would have had little chance of correctly understanding what was actually said.
[Applause] on YouTube transcripts, short two or three syllable sentence fragments, and absolute nonsense are the only ones I'd be able to reliably detect sans audio. But I doubt YouTube captions are state of the art, given how poor they are.