One thing to be aware of is that up/down alerting bakes downtime into the incident detection and response process, so literally anything anyone can do to get away from that will help.

A lot of the details are pretty application-specific, but the metrics I care about can be broadly classified as "pressure" metrics: CPU pressure, memory pressure, I/O pressure, network pressure, etc.

Something that's "overpressure" can manifest as, e.g., excessively paging in and out, a lot of processes/threads stuck in "defunct" state, DNS resolutions failing, and so on.
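For what it's worth, one way to get at these "pressure" numbers on Linux is the kernel's pressure stall information (PSI) files under /proc/pressure. That is an assumption about the platform, not something the comment above specifies, and the helper names `parse_psi` and `read_pressure` are made up for illustration. A minimal sketch:

```python
from pathlib import Path


def parse_psi(text):
    """Parse the contents of one /proc/pressure/<resource> file.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns e.g. {"some": {"avg10": 0.0, ...}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Every remaining field is a key=value pair; store values as floats.
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_pressure(resource, root="/proc/pressure"):
    """Read PSI for "cpu", "memory", or "io".

    Returns None when the file is missing (non-Linux host, or a kernel
    built or booted without PSI support).
    """
    path = Path(root) / resource
    if not path.exists():
        return None
    return parse_psi(path.read_text())
```

Calling `read_pressure("memory")` on a PSI-enabled host returns the "some" and "full" stall averages, which could be alerted on well before a plain up/down check would fire. PSI has no network file, so network pressure would need a different source.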

I don't have much of an opinion about push versus pull metrics collection as long as it doesn't melt my switches. They both have their place. (That said, programmable aggregation on the metrics exporter is something that's nice to have.)



What you call pressure is often called saturation. Saturation means the resource is at 100% utilization.

But saturation is not the same as errors.


I'm talking beyond saturation.

There are actually quite a few resources for which I'd like to maintain something resembling steady-state saturation, like CPU and RAM utilization. However, it's when I've overcommitted those resources (e.g., for RAM, when there are no more cache pages that can simply be purged to make room for RSS) that I start to see problems. (Of note, if I start paging in and out too much, that can also affect task switching, which leaves the kernel doing way more work, which itself can lead to a fun cascade of problems.)


Saturation is not a boolean; it's how far beyond 100% utilization the resource is.

https://www.brendangregg.com/usemethod.html
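To make the "not a boolean" point concrete, here is a rough sketch that reports saturation as a degree rather than a flag. The function names are hypothetical, and using the 1-minute load average as a stand-in for CPU demand is an assumption and only a rough proxy (on Linux it also counts tasks in uninterruptible sleep).

```python
import os


def saturation(demand, capacity):
    """Excess demand as a fraction of capacity.

    0.0 means the resource is keeping up; 0.5 means demand is 50%
    beyond what the resource can service, so work is queueing.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return max(0.0, demand - capacity) / capacity


def current_cpu_saturation():
    """Rough CPU saturation from the 1-minute load average (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    return saturation(load_1min, cpus)
```

With 4 CPUs, a load of 2.0 gives 0.0 and a load of 6.0 gives 0.5, so an alert can key off how much work is queued rather than whether utilization has merely touched 100%.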



