This looks like a pretty good technique, and that's coming from someone who has collected 240GB+ of user:password dumps.
I certainly wouldn't buy 16TB of disks just to hold it if the pool were ever leaked.
Bummer (not for me :p) that you guys went the route of patenting it and keeping it proprietary, available only through an API.
I think it would be adopted in no time if it were open source, and I'd definitely like to see something like this available as a service on clouds like GCP/AWS/Azure/etc for my day job.
The approach has an economy of scale: a shared pool can secure many sites' hashes at very low cost to each individual site, while the sum total can fund a very large data pool. I would love to grow this to 1PB and beyond. The idea behind the patent is to give us a chance to try to grow exactly that service.
Fundamentally the technique is quite simple and easy to copy, yet IMO it is better than computational/iterative hashing in every way -- cost, performance, scalability, and security. It seemed to me a perfect example of something worth patenting. If we're ultimately not successful in commercializing it, I would want to relinquish the patent to the public domain.
The most important part -- and what's kept me working at this for years now -- is that it protects even weak passwords after a company is breached. It takes the onus (and a lot of the blame) off the end user, and solves the usability problem with passwords.
By the way, the same technique works equally well for adding BlindHash to the KDF used to decrypt your SSH key, your laptop, or your TrueCrypt volume. We can also add additional checks when running the BlindHash call for a given AppID to enforce things like:
1. Must first reply to an SMS or enter a TOTP code
2. Request must come from a certain IP range or during certain hours
3. Request only valid after date X (time lock)
So this can be used to shore up password-based encryption as well in some very interesting ways.
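As a rough sketch of what such a per-AppID policy gate could look like (the field names here are invented for illustration, not our actual API):

    from datetime import datetime, timezone
    from ipaddress import ip_address, ip_network

    # Hypothetical per-AppID policy of the kind listed above.
    POLICY = {
        "require_second_factor": True,                    # SMS reply or TOTP code
        "allowed_nets": [ip_network("203.0.113.0/24")],   # IP-range restriction
        "not_before": datetime(2018, 1, 1, tzinfo=timezone.utc),  # time lock
    }

    def policy_allows(source_ip: str, second_factor_ok: bool, now=None) -> bool:
        """Gate to run before servicing a BlindHash call for this AppID."""
        now = now or datetime.now(timezone.utc)
        if POLICY["require_second_factor"] and not second_factor_ok:
            return False
        if not any(ip_address(source_ip) in net for net in POLICY["allowed_nets"]):
            return False
        return now >= POLICY["not_before"]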
1. We don't partition it into fixed-size blocks, but rather index directly into the array.
2. The site calculates a salted hash and sends us just the hash. We recommend at least a 32-byte CSPRNG salt.
3. We HMAC the hash with a 64-byte site-specific token (AppID) to produce the seed.
4. We generate 64 uniformly distributed locations from the seed and perform 64 reads of 64 bytes each to form a 4096-byte buffer, which we HMAC with the AppID to produce a second salt.
5. The site uses this second salt to HMAC their original hash, and stores that.
This design allows multiple sites to securely share a single data pool, and also means that our service a) does not see usernames or passwords, b) does not know if a login is valid/invalid, and c) cannot do anything to make an invalid login look valid to the site.
There are some additional details to handle upgrading hashes as the data pool grows, and also to provide virtual private data pools for each site (so I can give you a copy of your data pool if you ever want to self-host). This is all detailed in [1] above.
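For concreteness, here's a rough Python sketch of steps 2-5 over a file-backed pool. The offset-derivation scheme and hash choices here are illustrative stand-ins, not our exact spec:

    import hashlib
    import hmac
    import os
    import struct

    READS, READ_LEN = 64, 64   # 64 reads of 64 bytes -> 4096-byte buffer

    def blindhash(site_hash: bytes, app_id: bytes, pool, pool_size: int) -> bytes:
        """Service side (steps 3-4): turn the site's salted hash into a
        pool-derived second salt. `pool` is any seekable file-like object."""
        # Step 3: seed = HMAC(AppID, hash).
        seed = hmac.new(app_id, site_hash, hashlib.sha512).digest()

        # Step 4a: expand the seed into 64 read offsets (modulo bias is
        # negligible against a multi-terabyte pool).
        offsets, counter = [], 0
        while len(offsets) < READS:
            block = hmac.new(seed, struct.pack(">I", counter), hashlib.sha512).digest()
            counter += 1
            for i in range(0, len(block), 8):
                if len(offsets) < READS:
                    offsets.append(int.from_bytes(block[i:i + 8], "big")
                                   % (pool_size - READ_LEN))

        # Step 4b: 64 reads of 64 bytes each, HMAC'd with the AppID.
        buf = b""
        for off in offsets:
            pool.seek(off)
            buf += pool.read(READ_LEN)
        return hmac.new(app_id, buf, hashlib.sha512).digest()

    def site_enroll(password: str, app_id: bytes, pool, pool_size: int):
        """Site side (steps 2 and 5): what the site stores for one user."""
        salt = os.urandom(32)                                  # 32-byte CSPRNG salt
        h1 = hmac.new(salt, password.encode(), hashlib.sha256).digest()
        salt2 = blindhash(h1, app_id, pool, pool_size)         # only h1 leaves the site
        stored = hmac.new(salt2, h1, hashlib.sha256).digest()  # step 5
        return salt, stored

Verification just recomputes h1 from the stored salt, repeats the same BlindHash call, and compares; the service never learns whether the result matched.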
That is an excellent idea. But why 16TB of random data? Why not encrypt some high-entropy value (digits of pi, whatever) with a 100-character password and generate 16TB that way? You then use the 16TB as a password, but you could regenerate and recover it from a scrap of paper.
You can do either. But if you generate the data pool from a seed that you retain, then you're back to trying to protect a 256-bit value from leaking.
Generating the data pool with constantly cycled and discarded keys (i.e. /dev/urandom) means the only way to have the pool is to go and get every single bit of it.
We went the second route because I like sleeping at night and it just felt like retaining a seed would defeat the whole purpose of bounded retrieval.
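Roughly, the difference between the two options looks like this (a sketch; the SHAKE-based expansion for the seeded variant is just illustrative):

    import hashlib
    import os

    def pool_from_seed(seed: bytes, size: int, path: str, chunk: int = 1 << 20):
        """Seeded pool: regenerable from a scrap of paper, but the seed is
        now a 256-bit single point of failure."""
        with open(path, "wb") as f:
            written, counter = 0, 0
            while written < size:
                block = hashlib.shake_256(seed + counter.to_bytes(8, "big")).digest(chunk)
                f.write(block[:size - written])
                written += len(block)
                counter += 1

    def pool_from_urandom(size: int, path: str, chunk: int = 1 << 20):
        """Unseeded pool: the keys behind /dev/urandom are cycled and
        discarded, so the only way to have the pool is to take every bit."""
        with open(path, "wb") as f:
            written = 0
            while written < size:
                n = min(chunk, size - written)
                f.write(os.urandom(n))
                written += n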
Sure, but that's a 256-bit value that does not have to be present at the point of use. So it's a lightweight anchor! It's extremely heavy when someone else tries to move it, and yet when you move it yourself, it easily fits in your wallet on the tiniest of SD cards, or even on a scrap of paper.
How about this? Take the old Blowfish block encryption algorithm and eliminate the key expansion and expand it so that the s-boxes and p-array take up 16TB of data? What you'd wind up with is a block cipher that has a 16TB key. Since Blowfish is clearly "prior art," and is unencumbered by patents, this might make this approach harder to attack using patent law.
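Sketching it loosely (the 16-bit index width here is only for illustration; a real 16TB variant would widen the indices and the block further):

    import struct

    M32 = 0xFFFFFFFF

    def sbox(pool, pool_size: int, box: int, idx: int) -> int:
        """One 'S-box' lookup: a 32-bit word read from quarter `box` of the
        pool file, instead of Blowfish's usual 256-entry in-memory table."""
        quarter = pool_size // 4
        pool.seek(box * quarter + (idx % (quarter // 4)) * 4)
        return struct.unpack(">I", pool.read(4))[0]

    def f(x: int, pool, pool_size: int) -> int:
        """Blowfish's F function, ((S1[a] + S2[b]) ^ S3[c]) + S4[d] mod 2^32,
        with the inputs widened from 8-bit bytes to 16-bit chunks."""
        a, b = (x >> 48) & 0xFFFF, (x >> 32) & 0xFFFF
        c, d = (x >> 16) & 0xFFFF, x & 0xFFFF
        t = (sbox(pool, pool_size, 0, a) + sbox(pool, pool_size, 1, b)) & M32
        t ^= sbox(pool, pool_size, 2, c)
        return (t + sbox(pool, pool_size, 3, d)) & M32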
I don't think it's a bummer that they patented it. In October of 2037 (assuming they received it today), it will be available for the whole world to use freely. Until then, it will still be available for the whole world to use, just for a small licensing fee. In the meantime, alternatives can also be developed.
This technique could have been invented and promoted starting in 1997 (20 years ago), but only through the protectionism of the patent regime do you get this beautiful write-up and promotion of it by researchers pushing it forward: it's the patent regime working in action.
It works EVEN WITH WEAK PASSWORDS. That is pretty amazing if you ask me.
I am glad they patented it and are promoting it.
"But wait, it's so simple".
Let me give you an example of a $684.23B company that you've heard of making a security mistake that even a small child could detect and correct, but for which there is no proprietary solution in the space pushing them forward.
The company is Google, and their silly security mistake concerns plus addressing. When I give out "jsmith543+weeklytechupdate@gmail.com" (where my true address is jsmith543@gmail.com) to sign up for the Weekly Tech Update newsletter, afraid they could start spamming me or sell my address to any number of third parties who would, Gmail tags the incoming mail with "weeklytechupdate". Pretty clever. The only issue is that it is possible to strip the +____ part, and spammers actually do that. Here are examples of HN people saying they actually do it: https://news.ycombinator.com/item?id=15396446
>I’ve run a fair amount of email campaigns where we strip out the + if gmail is the domain to ensure it doesn’t end up in some weird filter.
The solution is extremely simple. Allow me to create, from the Gmail interface, a key-value pair that pairs a generated high-entropy key with a value I choose.
Deliver all mail addressed to that key to my inbox, tagged with the value I chose, until I start marking it as spam.
Very easy. Example: I go to Gmail, I click "generate rescindable address", I am given affj3fjd, and I assign it "weeklytechupdate". I see that affj3fjd@gmail.com gets tagged with weeklytechupdate, and if I need to give my email address to that web site in the future, I can always look it up in some list. Easy. Gmail doesn't do it, and its spam solution is broken.
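A toy model of the feature, just to show how little machinery it needs (all names here are made up):

    import secrets

    class RescindableAliases:
        """Toy model: random alias -> chosen tag, delivered until rescinded."""

        def __init__(self):
            self.aliases = {}          # alias local-part -> tag
            self.revoked = set()

        def create(self, tag: str) -> str:
            alias = secrets.token_hex(4)   # an 8-character key, like affj3fjd
            self.aliases[alias] = tag
            return alias + "@gmail.com"

        def deliver(self, to_local: str, message: str):
            if to_local in self.revoked or to_local not in self.aliases:
                return None                # unknown or rescinded: drop it
            return (self.aliases[to_local], message)   # tag it, deliver it

        def rescind(self, to_local: str):
            self.revoked.add(to_local)     # the "I marked it as spam" path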
The only thing is: nobody has come up with something clever enough to patent in this space, and then promote the @#$# out of. If they had, I could give my email addresses out in confidence to whoever I want.
Actually, I made a whole separate Gmail address dedicated only to spam. The problem is I can NEVER read the stuff that goes there, as I just don't even look. I just looked: the last piece of spam delivered to it arrived 7 days ago. There are just 2 pieces of mail in its inbox.
That means Google's spam filter is very, very, very good. Wait, what? So good that it silently filters spam that I expect to get, that I explicitly gave out my email address for? (Okay, I just looked, and there are 2 messages from 4 days ago, nothing more recent, in the "promotions" tab.)
No, that's not what it means. It means that some of the sites I give my address out to aren't able to email me at all. They're just not getting through, because Gmail's spam filters are too draconian.
When I give out "jsmith543+weeklytechupdate@gmail.com" I expect ALL of the mail sent there to go through, not to be caught by the spam filter. Instead, presumably what happens is Gmail throws away most mail that isn't sent to an individual by an individual.
Sorry to rant on this aside; I just wanted to show, in action, the difference between a patented solution that a company promotes and an EASY solution that would WORK, which Gmail doesn't implement. It actively does something broken. Nobody has come up with and promoted some fancy solution that works, so instead of the weak solution that works, they use nothing, only a broken, non-working security-through-obscurity scheme that you can see HN'ers actively strip out in order to spam effectively.
And this is Google. So it's as clear as day why I don't mind patented novel algorithms with companies behind them licensing and promoting them. I do somewhat mind when it's a race to the patent office with new technology, but the grandparent poster's technique is one that could have been done in 1997, so I don't really buy that excuse. I like that they're patenting it and promoting it. It's a good way to get companies to use better solutions. Companies just don't do it by themselves, as my Google example shows.
>Until then, it will still be available for the whole world to use, just for a small licensing fee
They don't appear to be selling right-to-use licenses. Most of the text on their site suggests a cloud-based service, which I suspect will be usage-based.
All that to say, it is perhaps too soon to judge the end-user cost as small. Maybe it will be, maybe not.
But my point in this case is that if they hadn't patented it and been pushing it, we wouldn't even be talking about this. It promotes it OR alternatives.
The impact on consumers is positive even if they only get meager access for 20 years. (For example, the patent owner could just be bad at economics and set their price too high, thinking it would yield more profit than wide adoption would: they might not set it at the monopolist's profit-maximizing price point.)
Even so, everyone gets it after a while (20 years.)
Simply filter all email sent to the non-plus address into spam, and then only give out random oh_sigh+aslkdfjslkdjf@gmail.com addresses. Now, if a spammer strips the tag off, they get put directly into your spam box. Where stupid regexes don't like the plus, you have Gmail's dot allowance, where foobar@gmail, f.oobar@, f.o.obar@, foob.a.r@, etc. all get routed to the first address. Gmail lets you have user names up to 30 characters, so you can encode on the order of 2^28 = ~268M unique emails into that. But the sites that reject the plus are very rare.
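To see where a count like that comes from: each gap between characters may or may not hold a dot, so an n-character name has 2^(n-1) equivalent spellings (a 29-character base name gives 2^28 ≈ 268M). A quick enumeration sketch:

    from itertools import combinations

    def dot_variants(local: str):
        """Yield every Gmail-equivalent spelling of `local` formed by
        inserting dots between characters (Gmail ignores dots)."""
        n = len(local)
        for r in range(n):                      # how many dots to insert
            for gaps in combinations(range(1, n), r):
                parts, prev = [], 0
                for g in gaps:
                    parts.append(local[prev:g])
                    prev = g
                parts.append(local[prev:])
                yield ".".join(parts) + "@gmail.com"

    # A 7-character name yields 2**6 = 64 spellings.
    assert sum(1 for _ in dot_variants("foobar7")) == 64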
1. Out of curiosity, do you actually do this? (The first part you propose.)
2. As a theoretical solution, it is a bit weaker than the "simple" solution I think Google should obviously implement, because under your proposal different spammers can coordinate, invalidating your privacy. (You didn't tell two different unrelated sites that you're the same person, but you actually are, which they could build into a targeted profile if they coordinate or, for example, are owned by the same parent company.) Granted, this is a theoretical concern, but it is there.
I have two emails: super_private@gmail.com which is only handed out to people I know in real life...I've had this one since 2004 and I still get zero unwanted emails on that address. Then, I have another address, super_public@gmail.com which mass forwards all mail to my super_private email, which then filters it according to the rules I've set up.
The reason I have the extra layer of indirection is that it wouldn't be very user-friendly to force someone you know to email you with a plus sign and then some junk. This way I can give a 'normal' email address to normal people, and my filtering address to auto-signups and things like that.
2. You're right - I guess I'm not too worried about a profile being built on me, but this definitely would not handle that issue. I also use disposable email services like getnada.com if I am signing up for something which would be particularly embarrassing if it got out, but that is rare.
You cannot patent the use of a really long salt. That's like patenting the hashing of any string longer than 2000 chars. It's a trivial operation. They may think they have a patent, but I trust it not to hold up. Go build your own 12TB pool of data to use for salting hashes. I trust they'll never find out, or have any grounds to sue you if they do.
Their patent is meaningless.
A patent for using 16TB of data as a salt is trivial.
password + salt + password or salt + password + salt are known and trivial patterns in hashing. Unpatentable, and even if a patent were somehow granted, unenforceable.
If your salt is 5 characters, it can certainly be 500000000 characters instead, without the patent overlords having any slimy grounds to come after you.