Testing Privacy-Preserving Telemetry with Prio (hacks.mozilla.org)
46 points by feross on Oct 30, 2018 | hide | past | favorite | 13 comments


Great strategy here to aggregate results by splitting and distributing the data to different nodes.

As for them vetting third-parties as mentioned in the article, I'd actually prefer Mozilla to host different telemetry collection nodes itself and just route them round-robin anonymously within their own network.

IMO having to extend intermediary trust might be taking one step forward two steps back here if who collects the data != who analyzes it.
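The splitting described above is additive secret sharing. A minimal sketch in Python (illustrative only: the function names and field modulus are made up, and a real Prio submission also carries a zero-knowledge proof that the shares encode a valid value):

```python
import secrets

# Field modulus: an assumed parameter for illustration, not Prio's actual choice.
P = 2**61 - 1

def split(value, n_servers=2):
    """Split `value` into shares that are individually uniform but sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def combine(shares):
    """Only a party holding ALL the shares can reconstruct the value."""
    return sum(shares) % P

shares = split(42)
# Each share on its own is a uniformly random field element, revealing nothing.
assert combine(shares) == 42
```

Each node sees only its own share, so no single collector learns anything about the raw value.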


You don't need to trust all of the third parties. Crypto magic guarantees that all you need is for at least ONE server to be honest.

If you trust Mozilla, then you have nothing to worry about, except that the proof or the program may be wrong ;)

Third parties are for someone who doesn't trust Mozilla, or who worries that Mozilla could be compromised by an unknown adversary.


The security model of the Prio protocol requires two or more non-colluding servers, so having Mozilla control both would be pretty questionable.


Part of the threat model is that individual orgs can be compelled in various ways to turn over individual user data, and having different orgs holding their own private keys helps to mitigate this.


It is probably just me, but I don't understand what the advantages are here, or what that picture is supposed to convey. Data is transmitted and aggregated, albeit in a fancier way. What does it have over Telemetry?


Telemetry is reported centrally to Mozilla, so we could theoretically observe individual responses as they came in, even if we only wanted to analyze the data in aggregate.

Prio uses cryptography to ensure that no one, not even the receiving servers, can see individual responses; the only way to view the data is in aggregate.

We need telemetry to make competitive products, but we also want to ensure that we don't see any of your personal data that you don't explicitly opt-in to us seeing. Similarly, we store all Firefox Sync data in a way that makes it impossible for us to decrypt: your data is your data. In contrast, Chrome defaults to allowing Google to read your synced browsing data, which Google explicitly uses to profile you for targeted advertisements.

Edit: If you're interested, you can find Mozilla's Data Collection policies and procedures at https://wiki.mozilla.org/Firefox/Data_Collection


even if we only wanted to analyze the data in aggregate.

Does moz://a do this? Analyze data in aggregate exclusively? Otherwise, am I right to understand that Prio is inferior in granularity to Telemetry?

How should one interpret data aggregation of Category 4 “Highly sensitive data”? Put another way, is it possible to trace data back to individuals from the aggregations, or in any other way?

We need telemetry to make competitive products

Personally I disagree with this. I consider privacy and the lack of bundled (spy|mal|ad|telemetry)ware a competitive edge.


Not exclusively, but quoting the article, "for most purposes we don’t need to collect individual data, but rather only aggregates. [...] Once we’ve validated that [Prio]’s working as expected and provides the privacy guarantees we require, we can move forward in applying it where it is needed most."

So we'll be able to apply Prio where it makes sense, but stick with the current Telemetry pipeline where it doesn't.

I'm not a cryptographer, but from skimming the paper and slides at https://www.henrycg.com/pubs/nsdi17prio/, my naive understanding is that it would be practically impossible to trace back individual data from those aggregations as long as TLS works and at least one of the receiving servers is operating honestly.


"is it possible to trace back data to individuals from aggregations or any other way?"

It depends on what the aggregation function is and how many user inputs you collected before computing the aggregate. Strictly speaking you are asking a differential privacy question, but a typical use of something like Prio would be to compute an average, which is a very lossy function. If I tell you that among 10000 users it takes an average of 500ms to load a web page, you are not learning much about individuals (especially since I said nothing about the variance).

What Prio does is to ensure that nobody will see raw user data, while still allowing aggregate information to be computed and allowing the inputs to be validated. In other words, the fact that you are focusing on what the output will reveal is a sign that on some level you trust Prio to do exactly what it is designed to do as a mechanism for protecting user data.
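To make the "lossy average" concrete: in a Prio-style scheme each server only ever sums the shares it holds, and only the combined partial sums are revealed. A toy two-server sketch (hypothetical values and names, validity proofs omitted):

```python
import secrets

P = 2**61 - 1  # assumed field modulus, for illustration only

def split(value):
    """Two-way additive secret sharing mod P."""
    a = secrets.randbelow(P)
    return a, (value - a) % P

# Each client splits its page-load time (ms) into two shares.
load_times_ms = [480, 510, 530, 495]
server_a, server_b = [], []
for t in load_times_ms:
    a, b = split(t)
    server_a.append(a)
    server_b.append(b)

# Each server sums only the shares it holds; it never sees a raw value.
sum_a = sum(server_a) % P
sum_b = sum(server_b) % P

# Combining the two partial sums reveals only the aggregate.
total = (sum_a + sum_b) % P
average_ms = total / len(load_times_ms)   # 503.75
```

The servers learn the average, but no individual load time ever exists in one place.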


Here’s a cool example of secret sharing from the paper[1]:

> App store. A mobile application platform (e.g., Apple’s App Store or Google’s Play) can run one Prio server, and the developer of a mobile app can run the second Prio server. This allows the app developer to collect aggregate user data without having to bear the risks of holding these data in the clear.

However, I would like to know why Firefox doesn’t just adopt RAPPOR instead. It seems like a better fit for general telemetry, and it’s already used by Chromium.

I guess you can intersect and union Prio datasets in ways you can’t with RAPPOR?

[1] https://crypto.stanford.edu/prio/paper.pdf


Google's RAPPOR technique adds noise to the report on the client (user) side. The noisy reports are then collected by a central server for analysis.

Doing this guarantees differential privacy, a strong privacy protection.

However, the major disadvantage of RAPPOR is that we get a collection of very noisy data points. Therefore, we need a very large collection to learn anything useful at all!

If the number of clients/users/records is small, RAPPOR is practically useless.

Prio, on the other hand, adds NO noise to clients' reports. It uses crypto magic to compute some functions (mean/sum/...) over the collection of reports via a network of servers. Only the outputs of these functions become known. No reports are leaked as long as at least one server is honest!

However, Prio has no guarantee that the computed output doesn't leak information about clients' reports (in fact, it surely leaks some information), whereas RAPPOR guarantees this protection statistically!

A current research direction is to add noise to the output of Prio so that it guarantees differential privacy.
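The client-side noise RAPPOR relies on boils down to randomized response. A simplified sketch (real RAPPOR additionally uses Bloom filters and two stages of randomization; the flip probability and population here are made up):

```python
import random

def report(true_bit, p=0.25):
    """Randomized response: with probability p, flip the bit before reporting."""
    return true_bit ^ 1 if random.random() < p else true_bit

def estimate(reports, p=0.25):
    """Debias the noisy reports to estimate the true fraction of 1s."""
    observed = sum(reports) / len(reports)
    # E[observed] = q*(1-p) + (1-q)*p, so solve for the true fraction q.
    return (observed - p) / (1 - 2 * p)

random.seed(0)
truth = [1] * 3000 + [0] * 7000   # 30% of clients truly have the property
noisy = [report(b) for b in truth]
est = estimate(noisy)
# With 10,000 clients, est lands near 0.30; with only 10 it would be useless.
```

The per-client noise is what protects individuals, and also exactly why small populations yield nothing useful.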


RAPPOR relies on differential privacy only, including to protect the privacy of individual inputs; by necessity this limits the utility of the output. With Prio the inputs are protected cryptographically, which means you can have higher utility outputs (i.e. you can add less noise and achieve the same level of differential privacy). Combining DP and MPC techniques like Prio is an interesting research topic at the moment and comes up in a lot of "secure aggregation" settings.
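A sketch of that combination: because noise is added once to the cryptographically protected aggregate rather than to every report, far less of it is needed for the same epsilon. The Laplace mechanism below is one standard way to do it (sensitivity and epsilon are made-up parameters, not from any deployed system):

```python
import math
import random

def laplace_noise(scale):
    """Inverse-CDF sample of a Laplace(0, scale) variate."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_sum(values, sensitivity=1.0, epsilon=0.5):
    """Release sum(values) with epsilon-DP via the Laplace mechanism."""
    return sum(values) + laplace_noise(sensitivity / epsilon)

random.seed(1)
bits = [1] * 600 + [0] * 400      # e.g. 600 of 1000 clients have a flag set
noisy_total = dp_sum(bits)
# noisy_total is close to 600: a single Laplace(scale=2) draw of noise,
# versus per-report noise in a purely local scheme like RAPPOR.
```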


I worked on the Firefox integration (per the article) if anyone has specific questions.

If you want more detail on Prio itself, I'd suggest http://blog.ezyang.com/2017/03/prio-private-robust-and-scala... as a more gentle introduction than the research paper.



