
As an engineer I generally want logs so I can dive into problems that weren't anticipated. Debugging.

I get a lot of pushback from ops folks. They often don't have the same use case. The logs are for the things that'll be escalated beyond the ops folks to the people that wrote the bug.

Yes, most (> 99.99%) of them will never be looked at. But storage is supposed to be cheap, right? If we can waste bytes on loading a copy of Chromium for each desktop application, surely we can waste bytes on this.

My argument is completely orthogonal to "do we want to generate metrics from structured logs".



Most probably, said ops folks have quite a few war stories to share about logs.

Maybe a JVM-based app went haywire, producing 500GB of logs within 15 minutes, filling the disk, and breaking a critical system because no one anticipated that a disk could go from 75% free to 0% free in 15 minutes.

Maybe another JVM-based app went haywire inside a managed Kubernetes service, producing 4 terabytes of logs, and the company's monthly Google Cloud bill went from $5,000 to $15,000, because storing bytes is only cheap when they are bytes, not when they are terabytes.

I completely agree that logs are useful, but developers often do not consider what to log and when. Check your company's cloud costs. I bet you the cost of keeping logs is at least 10%, maybe closer to 25% of the total cost.


Agreed, you need to engineer the logging system and not just pray. "The log service slowed down and our writes to it are synchronous" is one I've seen a few times.
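A minimal sketch of what I mean by engineering it, in Python with the stdlib's queue-based handlers (the log-service endpoint here is made up): application threads only pay for an in-memory enqueue, and the slow network write happens on the listener thread.

    import logging
    import logging.handlers
    import queue

    # Bounded queue: if the log service falls behind, overflowing records are
    # dropped (via handleError) instead of blocking the application threads.
    log_queue = queue.Queue(maxsize=10_000)

    # Hypothetical slow sink, e.g. an HTTP log-ingestion service.
    slow_handler = logging.handlers.HTTPHandler(
        "logs.example.internal", "/ingest", method="POST")

    listener = logging.handlers.QueueListener(
        log_queue, slow_handler, respect_handler_level=True)
    listener.start()

    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(log_queue))
    root.setLevel(logging.INFO)

    logging.info("request handled")  # returns immediately even if the sink is slow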

On "do not consider what to log and when" .. I'm not saying don't think about it at all, but if I could anticipate bugs well enough to know exactly what I'll need to debug them, I'd just not write the bug.


Just saw this at work recently: 94% of log disk space on domain controllers was filled by logging which groups users were in (I don't know the specifics, but group membership is pretty static, and if a log-on fails I assume the missing group is logged as part of that failure message).


Sounds like really bad design choices here. #1, logs shouldn't go on the same machine that's running the app; they should be reported to another server, and if you want local logs, then properly set up log rotators. Both would be good.
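If you do keep local logs, the size-capped rotation is the part people forget. Rough Python sketch (the path and sizes are arbitrary):

    import logging
    from logging.handlers import RotatingFileHandler

    # Rotate at ~100 MB per file and keep 5 old files: worst case ~600 MB on disk,
    # no matter how fast the app goes haywire.
    handler = RotatingFileHandler(
        "/var/log/myapp/app.log",          # hypothetical path
        maxBytes=100 * 1024 * 1024,
        backupCount=5,
    )
    logging.getLogger().addHandler(handler)
    logging.getLogger().setLevel(logging.INFO)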


Something I’ve discovered is that Azure App Insights can capture memory snapshots when an exception happens. You can download these with a button press and open in Visual Studio with a double-click.

It’s magic!

The stack variables, other threads, and most of the heap are right there, as if you had set a breakpoint and it were an interactive debug session.

IMHO this eliminates the need for 99% of the typical detailed tracing seen in large complex apps.


I simply doubt that most of these logs (or anyone’s, usually) are that useful.

I worked at a SaaS observability company (Datadog competitor) that was ingesting, IIRC, multiple GBps of metrics, spread across multiple regions, dozens upon dozens of cells, etc. Our log budget was 650 GB/day.

I have seen, entirely too many times, DEBUG logging left running endlessly in prod, messages that are clearly INFO at best classified as ERROR, etc. Not to mention cases where a 3rd-party library spams the same line continuously and no one bothers to track down why and stop it.


You probably don't need full-text search, but only exact-match search and very efficient time-based retrieval of contiguous log fragments. As an engineer spending quite a lot of time debugging and reading logs, I've found our OpenSearch almost useless (and a nightmare for our ops folks), since it can miss searches on terms like filenames, and the OpenSearch Dashboards UX is slow and generally unpleasant. I'd rather have 100 MB of text logs downloaded locally.

Please enlighten me, what are use cases for real full-text search (with fuzzy matching, linguistic normalization etc.) in logs and similar machine-generated transactional data? I understand its use for dealing with human-written texts, but these are rarely in TB range, unless you are indexing the Web or logs of some large-scale communication platform.
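For the record, "exact match plus a time window over locally downloaded logs" covers most of my debugging. A throwaway Python sketch, assuming each line starts with an ISO-8601 timestamp (the file name and format are just for illustration); ISO-8601 strings compare lexicographically, so plain string comparison is enough for the window:

    import sys

    # Usage: python window_grep.py app.log 123456789 2024-05-01T10:00 2024-05-01T11:00
    path, needle, start, end = sys.argv[1:5]

    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            ts = line[:16]                       # "YYYY-MM-DDTHH:MM" prefix
            if start <= ts <= end and needle in line:
                sys.stdout.write(line)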


I agree that fuzzy matching etc. are usually not needed, but in my experience I need at least substring match. A log message may say "XYZ failed for FOO id 123456789" and I want to be able to search the logs for 123456789 to see all related information (plus the trace id, if available).

In systems that deal with asynchronous actions, log entries relating to "123456789" may be spread over minutes, hours, or even days. When researching issues, I have found search tools like OpenSearch, Splunk, etc. invaluable and think the additional cost is worth it. But we also don't have petabytes of logs to handle, so there may be a point where the cost is greater than the benefit.


This is why you should always do structured logging. Finding logs using string match can be fragile.
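Agreed. Something like the sketch below (Python, field names made up) is what makes the difference: the ID you want to search for is a field, not a substring of free text, so "find everything with order_id=123456789" is exact and cheap.

    import json
    import logging
    import sys
    import time

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            payload = {
                "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
                "level": record.levelname,
                "msg": record.getMessage(),
            }
            # Anything passed via `extra=` becomes a first-class, filterable field.
            for key in ("order_id", "trace_id"):
                if hasattr(record, key):
                    payload[key] = getattr(record, key)
            return json.dumps(payload)

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("orders")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("XYZ failed", extra={"order_id": "123456789", "trace_id": "abc-123"})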


My response to that would be that you can enable logging locally, or in your staging environment, but not in production. If an error occurs, your telemetry tooling should gather a stack trace and all related metadata, so you should be able to reproduce or at least locate the error.

But all other logs produced at runtime are breadcrumbs that are only ever useful when an exception occurs, anyway. Thus, you don’t need them otherwise.


Storage is not cheap at this scale. That would be hundreds of thousands a year at the very least. (How do I know? I work in an identical area and have huge budget problems with random verbose logging.)


Compared to: how much are they spending on dev salaries? On cloud or infra overall?


100 PB on single-zone S3, plus those index/processing/caching nodes, is about $12-14 million a year.
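Back-of-envelope, under one set of assumptions (single-AZ infrequent-access pricing around $0.01/GB-month; actual S3 pricing varies by storage class and region):

    storage_gb = 100 * 1000 * 1000         # 100 PB expressed in GB
    per_gb_month = 0.01                    # USD, assumed single-AZ IA price
    print(storage_gb * per_gb_month * 12)  # ~12,000,000 USD/year for storage alone

which lines up with the 12-14 million figure once the index/processing nodes are added on top.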

That's excluding the dev time needed to keep those queries useful and insightful.


Spends in the hundreds of thousands are always a target, no matter what your budget.


Error-level logging can coexist with a metrics-focused approach.


My system has a version number, its inputs, and a known starting state DB-wise. Now, assuming I have deterministic, reproducible state, is a log just a replay of that game engine at work?


Interesting you should mention inputs. One of the things I've often found useful to log is the data that feed into a decision the code is about to make. This can be difficult to reconstruct after the fact, especially if there is a cache between my code and the source of truth.
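A hypothetical example of what that looks like in practice (all names invented): log the inputs the branch actually saw, including the possibly stale cached value, not just the outcome.

    import logging

    log = logging.getLogger("pricing")

    def apply_discount(order, tier_cache):
        # The cache may lag behind the source of truth, so record what we actually read.
        tier = tier_cache.get(order.customer_id)
        log.info("discount decision inputs: order_id=%s cached_tier=%s total=%s",
                 order.id, tier, order.total)
        return order.total * (0.9 if tier == "gold" else 1.0)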



