
Also: 1) Maximum # of open file descriptors

2) Whether your slave DB stopped replicating because of some error.

3) Whether something is screwed up in your SOLR/ElasticSearch instance so that it doesn't respond to search queries, but still responds to simple heartbeat pings.

4) Whether your Redis DB stopped saving to disk because of a lack of disk space, not enough memory, or because you forgot to set overcommit memory.

5) If you're running out of space in a specific partition where you usually store random stuff, like /var/log.

I've had my ass bitten by all of the above :)



6) Free inodes (as distinct from space) per filesystem.
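Several of the items above (fd limits, disk space, inodes, the Redis overcommit setting) can be spot-checked from a shell. A minimal sketch, assuming Linux-ish tools; the 90% thresholds are illustrative, not recommendations:

```shell
#!/bin/sh
# (1) Per-process open file descriptor soft limit.
fd_limit=$(ulimit -n)
echo "fd soft limit: $fd_limit"

# (4) The overcommit setting Redis complains about (Linux only).
cat /proc/sys/vm/overcommit_memory 2>/dev/null

# (5) Partitions more than 90% full on space.
df -P | awk 'NR > 1 && $5+0 > 90 { print "low space:", $6, $5 }'

# (6) Partitions more than 90% full on inodes.
df -Pi | awk 'NR > 1 && $5+0 > 90 { print "low inodes:", $6, $5 }'
```

In a real setup you'd run something like this from cron or feed the numbers into whatever monitoring you already have, rather than eyeballing the output.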


Similar to free inodes, you should also check for the maximum number of directories. The dir_index option helps, but I've seen it become a problem.


There's a maximum number of directories? On what filesystem is that?


ext3 without dir_index has a limit of 32K directories in any one directory.

Where I saw it crop up was 32K folders under /tmp on a cluster system. So no, it's not a limit on the number of directories overall (that's inodes), but rather on how many subdirectories a single directory can have.

http://en.wikipedia.org/wiki/Ext4#Features <-- Fixes 32K limit


ext3/4 has really poor large-directory performance, even with dir_index, especially if you are constantly removing and re-adding nodes. I would highly recommend XFS for large-directory use cases.


I got bitten by this once. I think it was related to a maximum of 32K hardlinks per inode, which effectively sets a limit of 32K subdirs, since each subdir has a hardlink to "..".
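You can see that "hardlink to '..'" accounting directly: a directory's link count is (number of subdirectories + 2), one link for its own entry in its parent and one for its own ".". A quick sketch, assuming GNU stat's `-c %h` (hardlink count):

```shell
# Create a scratch directory with 3 subdirectories and inspect its link count.
d=$(mktemp -d)
mkdir "$d/a" "$d/b" "$d/c"
links=$(stat -c %h "$d")
echo "link count: $links"   # 3 subdirs + 2 = 5
rm -rf "$d"
```

On ext3 without dir_index it's this per-inode link count that hits the ~32K cap.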


> Maximum # of open file descriptors

Augh. I ran one of my servers hard into that wall, and now it's something I watch. At least I learned from that mistake.


Related to this, if you've ever built/run anything on Solaris, you probably found out the hard way that even in modern times, fdopen() in 32-bit apps only allows up to 255 fds, because they so badly want to preserve an ages-old ABI. Funny bug to hit at runtime in production when you aren't aware of this compatibility "feature".


I learned the hard way that MySQL creates a file descriptor for every database partition you create. Someone had a script that created a new partition every week...


So after 5000 years you were running out?


I forget the details, but practically speaking the database keeled over after some 200 or 500 files were open at the same time.
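If you suspect a process (mysqld or anything else) is creeping up on its fd limit, counting its entries under /proc is a cheap check. A sketch, assuming Linux's /proc; `self` is used here so the example inspects its own shell, but any pid works:

```shell
# Count open file descriptors for a pid via the /proc pseudo-filesystem.
pid=self
nfds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
echo "open fds: $nfds"
```

Comparing that number against `ulimit -n` (or the limit in /proc/$pid/limits) tells you how close you are to the wall.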


X) Number of cgroups. We were getting slow performance, apparently related to slow IO, but nothing stood out as the culprit. It turned out that since vsftpd was creating cgroups and not removing them, the pseudo-filesystem /sys/fs/cgroup had myriad subdirectories (each representing a cgroup), and whenever something wanted to create a new cgroup or access the list of cgroups, it had to list that pseudo-directory, which counted as IO.

Fixed by using the undocumented option isolate_network=NO in vsftpd.conf.


Feels like this list (and the original post) are problems caused by:

* Lack of proper/default monitoring advocated for your tools (2), (4).

* Choosing poor (default/recommended) settings (1), (4).

* Keeping stateful servers/instances when you don't need to (5), (6).

* Not tracking performance as part of monitoring (3), (4).

Admittedly, I have made the same mistakes too.

edit: formatting



