The article says "then the challenge protocol was changed" so that's why people ...

tptacek · on Sept 10, 2020

You know this, but for the benefit of the thread: to say "tls-sni-01 is safe unless there are bulk hosting sites that break it" is to say that tls-sni-01 is unsafe. The "crazy" sites you're referring to included AWS and Heroku.

This all happened 2 years ago, so it's a bit odd to see it litigated today.

schoen · on Sept 10, 2020

We briefly describe this history on page 6 of

https://jhalderm.com/pub/papers/letsencrypt-ccs19.pdf

in case anyone is more interested (there are also references there for further details). Twice, methods that seemed plausible for proving control over domain names turned out to make assumptions that were potentially violated by shared hosting environments.

Jacques, I'm really sorry for the hassle that these changes caused you.

tialaramex · on Sept 10, 2020

Thanks for the link, Seth. I wasn't aware this existed, and it's nice to have something specific to cite, as well as convenient that it's all in one place like this.

Edited to add: Wow the Sankey diagram (showing changes in which CA if any is used by a site) is something I hadn't seen anywhere else and is especially useful. Thanks again.

tialaramex · on Sept 10, 2020

Heroku and (so far as I can tell) Cloudfront independently re-invented this stupidity. But if it was "just" say Heroku and Cloudfront you can imagine plausibly notifying those two providers to fix their broken infrastructure and then you're good.

Apache makes it unsalvageable by sheer numbers the same way it had already for HTTPS in http-01, so that's why I focused on Apache.

It's entirely possible for some fool to ship an exciting new cloud service that lets people bind to arbitrary ALPN values on a shared service and thereby re-introduce this problem for tls-alpn-01 - but unlike with tls-sni-01 that's not a bug common to hundreds of small bulk hosts using out of box Apache so I assume we'd tell the exciting upstart to knock it off and warn their customers what they're doing is inherently unsafe, rather than requiring Let's Encrypt to stop offering tls-alpn-01.

In fact we're already on the other side of this for the ordinary version of http-01 for a different reason. Apache really does potentially let an attacker who controls aaa-aardvark.example at some bulk host perform http-01 challenges for www.some-custom-site.example that has created A records pointing to the bulk host but hasn't currently actually got them serving www.some-custom-site.example maybe due to a typo or unpaid bill.

But most bulk hosts have specifically configured Apache to show a default "Did you pay? / Have you configured your hosting properly?" type site, which is harmless in this case. For the few that haven't, users can understand that if they visit www.some-custom-site.example in their browser they get to the attacker's site, so yeah, that's where the problem is, nothing new with http-01.
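To make the fallback concrete, here is a toy model of it, not Apache itself. It assumes a simplified rule where a request whose Host header matches no configured site is served by the first site in the list; the `VHost` class, the owner labels, the token, and the key authorization string are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VHost:
    server_name: str
    owner: str
    files: dict = field(default_factory=dict)  # request path -> response body


def select_vhost(vhosts, host_header):
    """Return the vhost whose name matches, else fall back to the first one.

    This is a simplified stand-in for name-based virtual host selection.
    """
    for vh in vhosts:
        if vh.server_name == host_header:
            return vh
    return vhosts[0]


def http01_response(vhosts, domain, token):
    """Simulate the validator fetching the http-01 challenge path for `domain`."""
    vh = select_vhost(vhosts, domain)
    return vh.owner, vh.files.get(f"/.well-known/acme-challenge/{token}")


TOKEN = "hypothetical-token"  # placeholder, not a real ACME token

attacker = VHost(
    "aaa-aardvark.example",
    "attacker",
    {f"/.well-known/acme-challenge/{TOKEN}": "hypothetical-key-authorization"},
)
other_customer = VHost("bbb-bakery.example", "other customer")
placeholder = VHost("default.invalid", "hosting provider")  # "Did you pay?" page

# Out-of-the-box ordering: the attacker's site happens to come first.
naive = [attacker, other_customer]
# Hardened ordering: the provider's harmless placeholder site comes first.
hardened = [placeholder, attacker, other_customer]

# www.some-custom-site.example points at the host but is not configured there.
naive_result = http01_response(naive, "www.some-custom-site.example", TOKEN)
hardened_result = http01_response(hardened, "www.some-custom-site.example", TOKEN)

print(naive_result)     # the attacker's vhost answers the challenge
print(hardened_result)  # the placeholder answers and has no challenge file
```

In this sketch the only difference between the two cases is which site sits first in the list, which is the point about bulk hosts configuring a harmless default site.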

jacquesm · on Sept 10, 2020

I did provide an email address, never got any mail (I did actually check that).

> it's just that new things can't be launched against this already deprecated service.

Yes, I noticed. So, I now have the entirely unforced option of re-imaging a machine that is working just fine besides this little detail, which is in fact just one very small thing of a whole pile of much bigger things that run on that particular box. Not to mention migrating twenty-nine years of email to a new mail server.

I'm sure there is a lesson in there somewhere, but I'm not sure I'm overly receptive today, I had a lot of other stuff on my agenda.

sergiosgc · on Sept 10, 2020

If you let a server lag in OS version, at some point in time you're going to hit this kind of problem. If not with Let's Encrypt, then with some other dependency. I know, I've been in the exact same spot. I just don't blame the dependency, and have included server OS updates as part of a yearly maintenance cycle.

jacquesm · on Sept 10, 2020

I find that really ridiculous. Not you, but the fact that an OS needs to be upgraded because of some application level stuff that has to do with a protocol that is being run on some other server.

That's the kind of dependency snowball that we should work hard to avoid, not accept as some kind of new normality.

Servers should be able to live for years without re-imaging.

yebyen · on Sept 10, 2020

Is there a reason you can't just upgrade that one component on the server, why do you have to re-image it from scratch?

If you have external dependencies they are going to move around from time to time throughout their lifetimes, especially if they are beta. LetsEncrypt may not have signaled beta with v1, but I've been a cert-manager user for years in pre-1.0 and I've known that meant I might need to come up for air and read the docs for a specific upgrade instruction from one pre-1.0 minor version to another at any time.

Now cert-manager is 1.0+ and my expectations can change. It should remain backwards compatible until the next major version (hopefully for a while! And they will provide a migration path when that comes, with clear instructions and a fairly long sunset, God willing).

But cert-manager depends on letsencrypt, and I depend on cert-manager, all of which depends on a protocol called acme, and this is the arrangement. We made this deal because it was going to turn out less complicated than managing the certificates by hand, and they made that deal because it was going to turn out better than rolling their own protocol from absolutely scratch, similarly. Eyes on the prize.

If you didn't want LetsEncrypt as a dependency there are other ways to connect cert-manager or another tool like it, including other acme providers... they all depend on the acme protocol, (or there might be some other protocol that you can use, with its own characteristics of change or stability, or roll your own) at some point you have to roll the dice and bet on something.

Occasionally these things happen. You suggest that servers should be able to go for years, (but they have allowed years for this transition! What more can be expected, realistically?)

jacquesm · on Sept 10, 2020

> Is there a reason you can't just upgrade that one component on the server, why do you have to re-image it from scratch?

Yes, I did this now and I have it working. But it leaves things in a messed up state and I don't like that so I will go back to this in a short while and fix it properly.

What I still wonder about is why their warning email never reached me, that I really need to figure out because then at least I would have dealt with this under a lot less time pressure.

> If you didn't want LetsEncrypt as a dependency there are other ways to connect cert-manager or another tool like it, including other acme providers...

There are some very good suggestions in this thread, I will probably adopt one of them.

> You suggest that servers should be able to go for years, (but they have allowed years for this!)

And somehow I missed that memo. Even so, I am still not convinced of the necessity, it is possible that it exists but I have yet to see a valid reason for shutting down the old protocol for new registrations like this. There also seems to be some confusion with people saying it should have worked for the same account, which I can prove did not work.

yebyen · on Sept 10, 2020

> But it leaves things in a messed up state and I don't like that so I will go back to this in a short while and fix it properly.

You say this with confidence, I wish my own situation provided me with the confidence to say this and mean it. We do not have reproducible systems and depend in many ways wholly on backup images of live production systems. Someone is going to say this makes my life simpler than yours by some twisted math, but I have a doubt about that myself.

We are still talking about migrating from Amazon Linux v1 to Amazon Linux v2, and with a recent announcement from AWS, the pressure is off! We'll be able to continue talking about this transition for a good long time to come. Again, a mixed blessing: is it better to have an operating system that can crawl along on life support? For those that can't upgrade, sure, it is better to get security maintenance than to have zombie servers which are not upgradeable, but who is to say what opportunity costs will arise because we are not on a formally supported leading-edge version of the platform?

jacquesm · on Sept 10, 2020

Agreed, reproducible systems are an absolute must and it is a shame that we are still not even close to having a solid foundation under all this mission critical stuff we build.

It feels like we are building these huge castles on quicksand.

At the same time I think the whole 'treat your servers like cattle, not like pets' is exactly because we don't know how to do this properly. It is the cloud equivalent to hitting ctrl-alt-delete to solve issues.

moeffju · on Sept 10, 2020

I know you solved your issue; for others in the same boat, look into acme.sh. It's a shell-only implementation: no Python, no loads of dependencies. I used that to keep Let's Encrypt running on an ancient server (firewalled) that I cannot upgrade for reasons.

hibbelig · on Sept 11, 2020

I decided to go with acme.sh instead of certbot on some servers because I am hoping that upgrading acme.sh will cause fewer headaches. But who knows...