Say you are debugging a memory leak in your own code that only shows up in produ...

joshuamorton · 2026-01-12T02:08:53 1768183733

I will say that, with very few exceptions, this is how a lot of $BigCo manage every day. When I run into an issue like this, I will do a few things:

- Roll back, or investigate the changelog between the current and prior version, to see which code paths are relevant

- Use our observability infra that is equivalent to `perf`, but samples ~everything, all the time, again to see which code paths are relevant

- Potentially try to push additional logging or instrumentation

- Try to better repro in a non-prod/test env where I can do more aggressive forms of investigation (debugger, sanitizer, etc.) but where I'm not running on production data

I certainly can't strace or run raw CLI commands on a host in production.

reactordev · 2026-01-12T02:38:13 1768185493

Combined with stack traces of the events, this is the way.

If you have a memory leak, wrap the suspect code in more instrumentation. Write unit tests that exercise that suspect code. Load test that suspect code. Fix that suspect code.
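As a rough illustration of "wrap the suspect code in more instrumentation", here is a minimal sketch using Python's standard-library `tracemalloc`. It assumes the service is written in Python, which nobody in this thread actually said; `leaky_handler` and `top_growth` are made-up names for the example, not anyone's real code.

```python
import tracemalloc


def leaky_handler(cache, request_id):
    # Hypothetical suspect code: the cache grows and nothing is ever evicted.
    cache[request_id] = bytearray(10_000)


def top_growth(workload, limit=5):
    """Run workload() and return the source lines whose allocations grew most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    workload()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to() returns StatisticDiff entries, largest size change first.
    return after.compare_to(before, "lineno")[:limit]


cache = {}
stats = top_growth(lambda: [leaky_handler(cache, i) for i in range(1000)])
for stat in stats:
    print(stat)
```

The same wrapper can be pointed at a load test or a unit test that exercises the suspect path, so the allocation diff comes from a non-prod run rather than from poking at a live host.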

I’ll also add that while I build clusters and throw away the ssh keys, there are still ways to gain access to a specific container to view the raw logs and execute commands, but like all container environments, it’s ephemeral. There’s spice access.

zinodaur · 2026-01-12T04:19:38 1768191578

> I certainly can't strace or run raw CLI commands on a host in production.

Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries on to them to debug/repro/repair?

To me, being without those capabilities just feels crippling.

joshuamorton · 2026-01-12T05:15:42 1768194942

> Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries on to them to debug/repro/repair?

A lot of the problems I enjoy solving specifically relate to consistently minimizing privilege, not from a security perspective (though there are obvious upsides to this), but from a debugging/clarity perspective. If you have a relatively small and statically verifiable set of (networked) dependencies, and minimize which resources which containers can access, reasoning about the system as a whole becomes a lot easier.

I can think of lots of cases where I've witnessed good outcomes from moving towards more fine-grained resource access, and very few cases where life has gotten better by saying "everyone has access to everything".

zinodaur · 2026-01-12T21:02:21 1768251741

> A lot of the problems I enjoy solving specifically relate to consistently minimizing privilege

You are my perfect foil :)

> very few cases where life has gotten better by saying "everyone has access to everything"

I should have been more clear - I like the dev env where people have access to the things they are responsible for. E.g., as a maintainer/operator of service X, you can do all the things service X can do. So it's not like random employees are running binaries that interact with your db - only the small set of experts responsible for maintaining that service (also the people most inclined to be cautious, since they own the impact).

It does require you to trust the people operating their services, and requires those people to be careful and competent, but it can yield spectacular results.

The hacker thing mentioned by a sibling comment is definitely true though. I airgap my work machine, never browse the web on it, and require fingerprint scans whenever sshing/rsyncing into prod, but even then it's pretty sketch.

I feel like it's important to remember how powerful it is though - I want something like ssh/rsync access to a machine with a vlan tag that only lets it perform "safe" db/service interactions - hashing PII and stopping writes. But instead I get "observability" and half-assed web UIs, stale/redacted datalakes, and minutes-long read-eval-print loop iterations with a coworker PR stamp required each iteration.
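To make the "hashing PII" half of that concrete, here is a minimal sketch of keyed hashing with Python's standard `hmac` and `hashlib` modules. The field names, the `pseudonymize` helper, and the key are all assumptions for illustration; this is not a description of any real proxy or vlan setup, and it does not cover the "stopping writes" half.

```python
import hashlib
import hmac

# Hypothetical set of columns treated as PII.
PII_FIELDS = {"email", "phone"}


def pseudonymize(record, key):
    """Return a copy of record with PII fields replaced by HMAC-SHA256 digests."""
    out = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
        else:
            out[field] = value
    return out


key = b"example-key-not-a-real-secret"  # assumption: a real key would come from a secret store
row = {"id": 7, "email": "user@example.com", "phone": None, "plan": "pro"}
safe_row = pseudonymize(row, key)
print(safe_row)
```

Using a keyed hash rather than a bare SHA-256 means the same input still maps to the same token (so joins and repro steps keep working), while someone without the key cannot rebuild the mapping just by hashing guessed emails.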

reactordev · 2026-01-12T12:54:28 1768222468

If you can do those things in production, so can Lee Hong Quag in North Korea. I’d rather not have that capability in production and rely on proper CI/CD to deploy resources into the cloud. The way you like to work is like giving hackers a complete jump box into your organization. You are bound to get hacked, it’s only a matter of time.