Hacker News | lanstin's comments

I just can’t understand people whose life goal is to be a CTO by some date; I have never tried for a promotion since I first shifted into a job where I was paid to make software. Due to the random walk of promotion rules (and most likely the good work of bosses who believed in me), I have been promoted enough that I had to put “thought leader” into my LinkedIn, but I do believe that making sure whatever you are working on is useful to other people and successful is worry enough; if you are skilled in software and in thinking, the other things will work themselves out.

This sounds like much better advice. Trying to plan out a tech career over decades seems like very premature optimization. Being curious and making sure you keep learning is not only very pleasant, it’s useful. And when the tech changes, fine, no problem. Most of the big features of my life have not been plannable ahead of time.

Exactly. You can “have a vision,” accelerate at full speed toward a hard wall, and then, just before going full throttle, be offered an opportunity to enter an open door around the corner that you had never even thought about. And that door can help you discover a new vision, one that might stick for a lifetime.

Also, while the original advice about “vision” sounds reasonable, it also sounds a bit dogmatic. The flipside of “career vision” is “tunnel vision”. And life is not deterministic; it has a much more probabilistic nature. Hence, curiosity and an open mind.


And everything in the logging path, from the API to the network to the ingestion pipeline, needs to be best effort: configure a capacity and ruthlessly drop messages as needed, at all stages. Actually a nice case for UDP :)
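To make “configure a capacity and ruthlessly drop” concrete, here is a minimal Go sketch (all names made up, not from any particular library) of a stage that never blocks its caller and never grows past a fixed buffer:

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    // bestEffortLogger is a bounded buffer: once it is full, new
    // messages are dropped and counted instead of blocking callers
    // or growing memory without limit.
    type bestEffortLogger struct {
        ch      chan string
        dropped atomic.Int64
    }

    func newBestEffortLogger(capacity int) *bestEffortLogger {
        l := &bestEffortLogger{ch: make(chan string, capacity)}
        go func() {
            for msg := range l.ch {
                _ = msg                      // ship to the next hop here
                time.Sleep(time.Millisecond) // simulate a slow downstream
            }
        }()
        return l
    }

    func (l *bestEffortLogger) Log(msg string) {
        select {
        case l.ch <- msg: // room in the buffer: enqueue
        default: // buffer full: drop and count, never block the caller
            l.dropped.Add(1)
        }
    }

    func main() {
        lg := newBestEffortLogger(1024)
        for i := 0; i < 100000; i++ {
            lg.Log(fmt.Sprintf("event %d", i))
        }
        fmt.Println("dropped:", lg.dropped.Load())
    }

The same shape works at every hop: a bounded queue plus a drop counter you can alert on.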

It depends. Some cases like auditing require full fidelity. Others don’t.

Plus, if you’re offering a logging service to a customer, the customer’s expectation is that once logs are successfully ingested, your service doesn’t drop them. If you’re violating that expectation, it needs to be clearly communicated to, and assented to by, the customer.


1. Those ingested logs are not logs for you; they are customer payload, which is business critical. 2. I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed. Also, the alternative to best effort / shedding excess load isn't 100% availability; it's catastrophic failure when capacity is reached.

Auditing has the requirement that events are mostly not lost, but most importantly that they cannot be deleted by people on the host. And on the capacity side, again the design question is "what happens when incoming events exceed our current capacity: do all the collectors/relays balloon their memory and become much, much slower, effectively unresponsive, or do they immediately close the incoming sockets, lower downstream timeouts, and so on?" Hopefully the audit traffic is consistent enough that you don't get spikes and can over-provision with confidence.
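As a toy illustration of the "immediately close the incoming sockets" option (my own sketch, not modeled on any real collector), a relay can cap concurrent senders with a counting semaphore and fail fast once it is full:

    package main

    import (
        "log"
        "net"
    )

    // serve accepts at most maxConns concurrent connections and
    // immediately closes the rest, so overload is visible to senders
    // instead of ballooning the relay's memory.
    func serve(addr string, maxConns int) error {
        ln, err := net.Listen("tcp", addr)
        if err != nil {
            return err
        }
        sem := make(chan struct{}, maxConns) // counting semaphore
        for {
            conn, err := ln.Accept()
            if err != nil {
                return err
            }
            select {
            case sem <- struct{}{}:
                go func(c net.Conn) {
                    defer func() { c.Close(); <-sem }()
                    handle(c)
                }(conn)
            default:
                conn.Close() // at capacity: fail fast, don't queue unboundedly
            }
        }
    }

    // handle would parse events from the connection and forward them
    // downstream; here it just drains bytes until the sender hangs up.
    func handle(c net.Conn) {
        buf := make([]byte, 4096)
        for {
            if _, err := c.Read(buf); err != nil {
                return
            }
        }
    }

    func main() {
        log.Fatal(serve(":5140", 512))
    }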


> Those ingested logs are not logs for you; they are customer payload, which is business critical

Why does that make any difference? Keep in mind that at large enough organizations, even though the company is the same, there will often be an internal observability service team (frequently, but not always, as part of a larger platform team). At a highly-functioning org, this team is run very much like an external service provider.

> I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed.

You should take a look at CloudWatch Logs. I'm unaware of any time in its 17-year history that it has successfully ingested logs and subsequently lost or corrupted them. (Disclaimer: I work for AWS.) Also, I didn't say anything about delays, which we often accept as a tradeoff for durability.

> And on the capacity side, again the design question is "what happens when incoming events exceed our current capacity: do all the collectors/relays balloon their memory and become much, much slower, effectively unresponsive, or do they immediately close the incoming sockets, lower downstream timeouts, and so on?"

This is one of the many reasons why buffering outgoing logs in memory is an anti-pattern, as I noted earlier in this thread. There should always -- always -- be some sort of non-volatile storage buffer in between a sender and remote receiver. It’s not just about resilience against backpressure; it also means you won’t lose logs if your application or machine crashes. Disk is cheap. Use it.
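A minimal sketch of the disk-buffer idea (the path and names are mine, purely illustrative): the application appends to a local spool file, and a separate shipper tails that file and forwards lines to the receiver, tracking its own offset:

    package main

    import (
        "log"
        "os"
    )

    func main() {
        // Append-only spool file; a crash loses at most the lines not
        // yet flushed, not an entire in-memory buffer.
        spool, err := os.OpenFile("./logs.spool",
            os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
        if err != nil {
            log.Fatal(err)
        }
        defer spool.Close()

        write := func(line string) {
            if _, err := spool.WriteString(line + "\n"); err != nil {
                log.Printf("spool write failed: %v", err) // degrade, don't crash
            }
        }

        write(`{"level":"info","msg":"request handled","status":200}`)
        // Call spool.Sync() per line for stronger durability at the cost
        // of throughput; batching syncs is the usual compromise.
    }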


If those errors cause your customers to ditch your product, but calling them and saying "your calls are all getting 4xx because you are not putting the state code into the call parameters" would keep them as customers, then you would be wise to make that communication.

But first ensure that the input error is properly reported to the client in the response body (ideally in a structured way), so the client could have figured it out on their own.

If a fix is needed on your side for this matter, having a conversation with the customer might be useful before breaking more stuff. ("We have no state code in the EU. Why is that mandatory?")
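For instance, a structured body for that 4xx might look like this (a hypothetical Go handler; field names are invented, and RFC 7807 problem+json is another common shape):

    package main

    import (
        "encoding/json"
        "net/http"
    )

    // apiError tells the caller exactly which parameter failed and why,
    // so they can fix it without a phone call.
    type apiError struct {
        Code    string `json:"code"`
        Message string `json:"message"`
        Field   string `json:"field,omitempty"`
    }

    func handlePayment(w http.ResponseWriter, r *http.Request) {
        if r.URL.Query().Get("state_code") == "" {
            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(http.StatusUnprocessableEntity)
            json.NewEncoder(w).Encode(apiError{
                Code:    "missing_parameter",
                Message: "state_code is required for US addresses",
                Field:   "state_code",
            })
            return
        }
        w.WriteHeader(http.StatusNoContent)
    }

    func main() {
        http.HandleFunc("/v1/payments", handlePayment)
        http.ListenAndServe(":8080", nil)
    }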


If you are trying to sell a product, it is sometimes useful to solve people's problems for them, rather than counting on them to figure them out on their own.

Unless it is logging more warnings because your new code is failing somehow; maybe it stopped parsing the reply from an "is this request rate limited?" service correctly, so it is only returning 429 to callers and never accepting work.

It continues to become more realistic with the passing of time.

Also, everywhere I have worked there are transient network glitches from time to time. Timeouts can often be caused by these.

While it is fun to have your code run for 500 days without a restart, it is a bad architecture. You should be able to move load from host to host or network to network without losing any work. This involves gracefully draining the old instance and then shutting it down.
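With Go's standard library, for instance, draining looks roughly like this (a sketch; port and timeout are arbitrary):

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

        go func() {
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatalf("listen: %v", err)
            }
        }()

        // On SIGTERM, stop accepting new connections, let in-flight
        // requests finish (up to a deadline), then exit so the load
        // balancer can shift traffic to the new host.
        stop := make(chan os.Signal, 1)
        signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
        <-stop

        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("forced shutdown: %v", err)
        }
    }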

For impossible errors, exiting and sending the dev team as much info as possible (thread dump, memory dump, etc.) is helpful.

In my experience, logs are good for finding out what is wrong once you know something is wrong. Also, if the server is written to have enough (but not too much) logging, you can read the logs over and get a feel for normal operation.


Also, put the fucking data that led to the decision to emit the log in the message. I can't remember how many times I have had a three-part check trigger a log like "blah: called with illegal parameters, shouldn't happen" and the illegal parameters were not logged.
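Something like this (a made-up validator, just to show the shape): the log line carries the actual values that failed the check, not just "shouldn't happen":

    package main

    import (
        "errors"
        "log/slog"
    )

    var errBadRequest = errors.New("bad request")

    // transfer is hypothetical; the point is that the log includes the
    // offending parameters alongside the "illegal parameters" message.
    func transfer(amount int64, currency, account string) error {
        if amount <= 0 || currency == "" || account == "" {
            slog.Error("rejecting transfer: illegal parameters",
                "amount", amount, "currency", currency, "account", account)
            return errBadRequest
        }
        return nil
    }

    func main() {
        _ = transfer(-5, "", "acct-123")
    }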

The frequency is important, and so is the answer to "could we have done something different ourselves to make the request work?" For example, in credit card processing, if the remote network declines, then at first it seems like it's not your problem. But it turns out that for many BINs there are multiple processing choices, and you could add dynamic routing when one backend starts declining more than normal. Not a 5xx and not a fault in your process, but a chance to make your customer experience better.
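A toy sketch of the routing idea (names and numbers invented; a real version would use a sliding window and per-BIN stats):

    package main

    import "fmt"

    type processorStats struct {
        approved, declined int
    }

    func (s processorStats) declineRate() float64 {
        total := s.approved + s.declined
        if total == 0 {
            return 0 // no data yet: treat as healthy
        }
        return float64(s.declined) / float64(total)
    }

    // pickProcessor routes the next authorization to whichever
    // processor has been declining the least recently.
    func pickProcessor(stats map[string]processorStats) string {
        best, bestRate := "", 2.0
        for name, s := range stats {
            if r := s.declineRate(); r < bestRate {
                best, bestRate = name, r
            }
        }
        return best
    }

    func main() {
        stats := map[string]processorStats{
            "processorA": {approved: 480, declined: 120}, // declines spiking
            "processorB": {approved: 500, declined: 25},
        }
        fmt.Println("route to:", pickProcessor(stats))
    }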
