Hogwash. I’ll agree that it’s not as simple with logs, but amazingly powerful, a...

KaiserPro · on July 11, 2024

You're missing my main point: logs should not be your primary source of information.

> Without logs, I would not have been able to pinpoint multiple issues that plagued our systems.

Logs are great for finding out what went wrong, but terrible at telling you there is a problem. This is what I mean by primary information source. If you are sifting through TBs of logs to pinpoint an issue, it sucks. Yes, there are tools, but it's still hard.

Logs are shit for deriving metrics: it usually requires some level of bespoke processing, which is easy to break silently, especially for rarer messages.
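A minimal sketch of that silent failure mode, assuming a regex-based log-to-metric step. The log format, the pattern and the `count_timeouts` helper are all made up for illustration:

```python
import re

# Hypothetical message format and pattern, invented for this example.
TIMEOUT_RE = re.compile(r"ERROR payment timeout after (\d+)ms")

def count_timeouts(lines):
    """Count lines matching the pattern; anything else is skipped without complaint."""
    return sum(1 for line in lines if TIMEOUT_RE.search(line))

old_logs = [
    "INFO request ok",
    "ERROR payment timeout after 3000ms",
    "ERROR payment timeout after 4500ms",
]

# The same events after someone rewords the message in the service.
new_logs = [
    "INFO request ok",
    "ERROR payment timed out after 3000ms",
    "ERROR payment timed out after 4500ms",
]

old_count = count_timeouts(old_logs)  # matches both timeout lines
new_count = count_timeouts(new_logs)  # matches nothing, and nothing raises
```

The derived metric drops to zero while the real error rate is unchanged, and for a rare message a flat zero looks perfectly normal on a dashboard.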

_boffin_ · on July 11, 2024

> You're missing my main point: logs should not be your primary source of information.

I think you're missing my point. They're both needed. Metrics are the view from outside the black box and logs are the view from inside -- you need both. I don't recall saying that logs should be the primary source.

> Logs are shit for deriving metrics, it usually requires some level of bespoke processing which is easy to break silently, especially for rarer messages.

Truthfully, you're probably just doing it wrong if you can't derive actionable metrics from logs / tracing. I'm willing to hear you out though. Are you using structured logs? If so, please tell me more about the issues you're having deriving metrics from them. If not, that's your first problem.
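For reference, a minimal sketch of deriving a metric from structured logs, assuming one JSON object per line. The field names `service` and `level` and the `error_counts` helper are assumptions for illustration, not anything from a specific logging stack:

```python
import json
from collections import Counter

def error_counts(lines):
    """Count error-level records per service from JSON-lines logs."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip valid JSON that is not an object
        if record.get("level") == "error":
            counts[record.get("service", "unknown")] += 1
    return counts

# Hypothetical sample records.
sample = [
    '{"service": "checkout", "level": "error", "msg": "timeout"}',
    '{"service": "checkout", "level": "info", "msg": "ok"}',
    '{"service": "search", "level": "error", "msg": "bad query"}',
    "not json at all",
]

counts = error_counts(sample)
```

Because the query keys on fields rather than message text, rewording a message does not change the count. Renaming or dropping a field still would.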

> logs are great for finding out what went wrong, but terrible at telling there is a problem

see prior comment.

KaiserPro · on July 11, 2024

> Truthfully, you're probably just doing it wrong if you can't derive actionable metrics from logs

I have ~200 services, each composed of many sub-services, each made up of a number of processes. Something like 150k processes in total.

Now, we are going to ship all those logs, where every transaction emits something like 500-2000 bytes of data. Storing that is easy, even storing it in a structured way is easy. Making sure we don't leak PII is a lot harder, so we have to have fairly strict ACLs.

Now, I want to process them to generate metrics and then display them. But that takes a lot of horsepower. Moreover, when I want to have metrics for more than a week or so, the amount of data I have to process grows linearly. I also need to back up that data, and the derived metrics. We are looking at a large cluster just for processing.
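A back-of-the-envelope sketch of that linear growth. The transaction rate here is a made-up assumption (no rate is given above); the per-transaction size is picked from inside the 500-2000 byte range:

```python
def raw_log_bytes(tx_per_second, bytes_per_tx, retention_days):
    """Raw bytes that must be stored, and re-read to recompute metrics, for the window."""
    seconds = retention_days * 24 * 60 * 60
    return tx_per_second * bytes_per_tx * seconds

# ASSUMPTION: 10_000 transactions/second is hypothetical, chosen only for illustration.
# 1_000 bytes/transaction sits inside the 500-2000 byte range quoted above.
week = raw_log_bytes(10_000, 1_000, 7)
month = raw_log_bytes(10_000, 1_000, 28)

print(f"7 days:  {week / 1e12:.2f} TB")
print(f"28 days: {month / 1e12:.2f} TB")
```

Quadrupling the window quadruples the bytes to scan, before counting backups or the derived metrics themselves.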

Now, if we make sure that our services emit metrics for all useful things, the infra for recording, processing and displaying that is much smaller, maybe two or three instances. Not only that, but custom queries are way quicker, and much more resistant to PII leaking. Just like structured logging, it does require some dev effort.
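A minimal sketch of what "services emit metrics" can look like in-process, assuming a tiny hand-rolled counter registry. The `Metrics` class, the metric names and `handle_request` are hypothetical; a real setup would use a proper metrics client:

```python
import threading
from collections import Counter

class Metrics:
    """Tiny in-process counter registry (illustrative only, not a real library)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def inc(self, name, amount=1):
        # Only a metric name and a number are recorded, never request payloads.
        with self._lock:
            self._counts[name] += amount

    def snapshot(self):
        # Copy of the current counters, e.g. for a scraper or periodic push.
        with self._lock:
            return dict(self._counts)

metrics = Metrics()

def handle_request(ok):
    """Hypothetical request handler that counts outcomes as they happen."""
    metrics.inc("requests_total")
    if not ok:
        metrics.inc("requests_failed_total")

for ok in (True, True, False):
    handle_request(ok)

snapshot = metrics.snapshot()
```

Each process ships a handful of numbers instead of 500-2000 bytes per transaction, which is why the recording side stays small and why there is little room for PII to leak through it.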

At no point is it _impossible_ to use logs as the data store/transport, it's just either fucking expensive, fragile, or dogshit slow.

or to put it another way:

old system == >£1 million in licenses and servers (yearly)

metric system == £100k in licenses and servers + £12k for the metrics servers (yearly)