
"Each time you make a Voice Call on Telegram, a neural network learns from your and your device‘s feedback (naturally, it doesn’t have access to the contents of the conversation, it has only technical information such as network speed, ping times, packet loss percentage, etc.). The machine optimizes dozens of parameters based on this input, improving the quality of future calls on the given device and network."

What sort of parameters are adjusted?



I have the feeling that's there solely for the "AI-Powered" hype-train benefit, versus actually being used as a genuine feedback loop to improve the service.

Logs of speed, ping times, and packet loss would likely be more useful in good old non-AI reports... to identify regional issues, peering opportunities, etc.

I don't think you need AI for VoIP calls that upgrade/degrade quality based on network health.
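
A dumb rule-based loop gets you most of the way. Rough sketch in Python, with thresholds and bitrate steps that are purely illustrative (not anyone's real implementation):

    # Step the codec bitrate down when the network struggles,
    # probe back up when it's clean. All constants are made up.
    BITRATES = [8_000, 16_000, 24_000, 32_000, 64_000]  # bits/sec

    def next_bitrate(current: int, loss_pct: float, rtt_ms: float) -> int:
        i = BITRATES.index(current)
        if loss_pct > 5.0 or rtt_ms > 400:      # struggling: back off
            return BITRATES[max(i - 1, 0)]
        if loss_pct < 0.5 and rtt_ms < 150:     # healthy: probe upward
            return BITRATES[min(i + 1, len(BITRATES) - 1)]
        return current                          # otherwise hold steady

    rate = 32_000
    for loss, rtt in [(0.1, 90), (7.2, 300), (6.0, 450), (0.2, 100)]:
        rate = next_bitrate(rate, loss, rtt)
        print(rate)  # 64000, 32000, 24000, 32000

No neural network required; it's a thermostat.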


Sounds like marketing-speak for variable bitrate encoding and/or adjusting compression aggressiveness.


> variable bitrate encoding

I hope not... VBR can easily leak all sorts of information (including the actual content of the conversation).


Whoa, I had never thought of monitoring VBR as an attack vector for recovering audio.

Do you have a link discussing this?


Sure, here are a couple papers on the topic:

https://www.cs.jhu.edu/~cwright/oakland08.pdf

https://www.cs.jhu.edu/~cwright/voip-vbr.pdf

It's fundamentally very similar to the sorts of issues you end up with if you compress then encrypt. If the attacker can make some educated guesses about the plaintext prior to the compression, the compression ratio can be a very powerful tool in their arsenal.
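
Toy demo of the same size side channel, with zlib standing in for the codec and an invented secret. Only the output length matters; a length-preserving cipher wouldn't hide it:

    import zlib

    SECRET = b"session_token=hunter2;"

    def observed_length(guess: bytes) -> int:
        # The attacker sees len(encrypt(compress(secret || injected guess)));
        # under a length-preserving cipher that's just the compressed length.
        return len(zlib.compress(SECRET + b"session_token=" + guess))

    for guess in (b"hunter", b"aaaaaa", b"qwerty"):
        print(guess, observed_length(guess))
    # The correct guess extends zlib's back-reference into the secret, so it
    # typically compresses a few bytes shorter. (Six-byte guesses here just to
    # make the gap obvious; the real attack extends one byte at a time.)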


Wire implemented CBR for their encrypted calls, upstreamed it to WebRTC, and submitted a patch to Signal: https://medium.com/wire-news/call-security-constant-bit-rate...
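
For anyone who wants the same on a stock WebRTC stack: Opus exposes this as the "cbr" fmtp parameter (RFC 7587), so you can force it by munging the SDP before it's sent. Sketch below assumes payload type 111 for Opus rather than parsing the rtpmap, which real code should do:

    def force_opus_cbr(sdp: str) -> str:
        # Append or flip cbr=1 on the Opus fmtp line (payload type assumed).
        out = []
        for line in sdp.splitlines():
            if line.startswith("a=fmtp:111"):
                if "cbr=" not in line:
                    line += ";cbr=1"
                else:
                    line = line.replace("cbr=0", "cbr=1")
            out.append(line)
        return "\r\n".join(out) + "\r\n"

    example = ("a=rtpmap:111 opus/48000/2\r\n"
               "a=fmtp:111 minptime=10;useinbandfec=1\r\n")
    print(force_opus_cbr(example))
    # fmtp line becomes: a=fmtp:111 minptime=10;useinbandfec=1;cbr=1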


Silent Phone has used CBR since day 1.


Correct; the last article I recall reading about deciphering VBR from packet sizes alone reported something in the neighborhood of a 50% success rate.


Then why not compress really efficiently by just transmitting packet sizes?


Because 50% of them won't be understood?


On the other hand, if you could do it, you'd probably have invented a convoluted speech-to-text system (where the text is an index into a dictionary of words). Note that you would also likely lose things like inflection, voice, accent, etc., so while it might work as a texting system with voice input, it would be a poor substitute for voice chat.


Those codecs have tons of parameters to tweak (source: private conversation with Pavel Durov).


But what sort of parameters are adjusted?

This is HN. A link to an example would be appreciated.

Edit: To clarify, I work with audio codecs too, and can't really think of parameters (other than the compression level?) that would make much sense to adjust on the fly.

If "AI" is used for more than just a buzzword here, then I imagine the answer must be quite interesting.


They probably adjust the incoming / outgoing buffer sizes (and therefore the audio delay, since it's live) to account for packet loss.

They might also prioritize traffic depending on how full your buffers are.

I can only assume YouTube and Netflix do similar parameter tweaks to optimize their video delivery based on the connection (totally filling the buffer to a max size all the time would waste bandwidth, but a client with lots of packet loss needs a larger safety net).
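
The jitter-buffer half of that is easy to caricature. Purely illustrative constants, not from any real client:

    # Grow the buffer (accepting more latency) when loss/jitter is high,
    # shrink it back when the network is stable.
    def target_buffer_ms(current_ms: float, loss_pct: float,
                         jitter_ms: float) -> float:
        if loss_pct > 2.0 or jitter_ms > 30:
            return min(current_ms * 1.5, 400)   # bigger safety net
        if loss_pct < 0.2 and jitter_ms < 10:
            return max(current_ms * 0.9, 40)    # claw back latency
        return current_ms

    buf = 80.0
    for loss, jitter in [(3.0, 35), (3.5, 40), (0.1, 5), (0.1, 5)]:
        buf = target_buffer_ms(buf, loss, jitter)
        print(round(buf))  # 120, 180, 162, 146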


Right, so it looks like we're up to maybe six parameters. The claim was "dozens", which I take to mean at least 24, possibly 36, as the lower bound.



