- I never needed to scan the directory to find "all users" (after all, if you have millions of users, that's going to take a while whatever directory structure you use).
- Modern filesystems use a tree or hash structure to identify files, meaning that lookup by name, and creating or deleting files, is quick even with millions of files in one directory.
- Given that directory nesting offered no performance benefit, I always went with the simplest option, i.e. having everything in one directory.

(I blogged about this here: http://www.databasesandlife.com/flat-directories/)

But no doubt the HN developers had a reason for making this change; I'd love to know what it is (e.g. they need to do something I never needed to do, or they need to do the same things but I was wrong).
I read your blog entry. Your experience was with Tru64, and you also mention ZFS. These and other filesystems may indeed use data structures that make filename lookup performant.
But traditionally, UFS and ext2/3/4 (without dir_index) have to perform a linear scan through a linked list for each lookup, so they do indeed grow slower as the number of files increases. This is likely where the fanout strategy originated.
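For anyone unfamiliar with the technique being discussed: a minimal sketch of hash-prefix fanout in Python (my own illustration; the actual layout and hash HN uses are not described anywhere in this thread):

```python
import hashlib
import os

def fanout_path(root, username, levels=2, width=2):
    """Map a name to a nested path via a hash prefix, so files spread
    evenly across subdirectories and no single directory grows huge.

    e.g. "pg" might map to root/3f/a2/pg (hypothetical layout).
    """
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, username)

# The mapping is deterministic, so lookups need no directory scan:
print(fanout_path("/var/data/users", "pg"))
```

With `levels=2, width=2` each level has 256 possible subdirectories, so a million files average about 15 per leaf directory, keeping every directory small even on filesystems with linear-scan lookups.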
So as usual, YMMV and you should test on your file system of choice.
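To make the "test it yourself" advice concrete, here's a rough benchmark sketch (my own, not from the thread) that creates N empty files either flat or fanned out across subdirectories, then times creation plus a single name lookup. Absolute numbers are meaningless across machines; the point is to compare the two layouts at large N on your filesystem of choice:

```python
import os
import tempfile
import time

def bench(n_files, fanout):
    """Create n_files empty files, flat (fanout=1) or spread across
    `fanout` subdirectories, and time creation plus one os.stat lookup.
    Results depend heavily on the filesystem (ext4 with dir_index,
    ZFS, UFS+dirhash, ...) and on cache state."""
    with tempfile.TemporaryDirectory() as root:
        t0 = time.perf_counter()
        for i in range(n_files):
            d = os.path.join(root, "%02x" % (i % fanout)) if fanout > 1 else root
            os.makedirs(d, exist_ok=True)
            open(os.path.join(d, "u%d" % i), "w").close()
        create = time.perf_counter() - t0
        # look up one file by name in the now-populated tree
        name = os.path.join(root, "00", "u0") if fanout > 1 else os.path.join(root, "u0")
        t0 = time.perf_counter()
        os.stat(name)
        return create, time.perf_counter() - t0

print("flat:  ", bench(5000, 1))
print("fanout:", bench(5000, 256))
```

Remember to test the cold-cache case too (the in-memory dirhash point below), which this warm-cache sketch doesn't capture.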
Personally, I don't consider that fanout adds much complexity, and I'd be surprised if it hurt performance.
edit: HN runs on FreeBSD. Not sure if they are using ZFS or UFS, but I'm going to guess UFS. UFS apparently has a dirhash that improves directory lookups, but it's an in-memory structure, so it won't help in the cold-cache case after a reboot, and it can be purged in low-memory situations too.
Sure, but that's the least helpful possible response you could have made. We've got an observation:
> I never needed to scan the directory to find "all users"
> lookup by name, and creating/deleting files, is quick, even if you have millions of files.
And a question: given these observations, where do the benefits of filesystem fanout come from? Is it not true that looking up a file by name is fast no matter how many other files sit in the same directory? Is HN doing something weird?
You can't answer the question "where do the performance benefits come from?" by saying "look, the performance benefits exist".
> You can't answer the question "where do the performance benefits come from?" by saying "look, the performance benefits exist".
I think what he is trying to say is that the parent poster's observations must be wrong. After all, we are talking about an unsubstantiated claim ("there's no benefit to fanning out files") that directly contradicts another claim for which we have data ("HN is 5x faster after fanning out files").
Again, when someone asks why they're wrong, it's not useful to tell them "but you're wrong". Parent poster already acknowledged that the combination of his ideas and the facts on the ground didn't make sense. What good does it do anyone to repeat it back to him?
I guess when you're absolutely sure you're right, but the observation proves you wrong, you have to be prepared to consider the possibility that you're wrong.
The comment I was replying to was saying that the file system takes care of it automatically, so there's no purpose to arranging millions of files into directories. I'm not going to speculate how it all works under the hood.
A large number of files in a single directory was a killer on Windows. I can't remember the full details, as it was years ago, but once you went above 10,000 or 20,000 files, performance just died. I believe it was because several of the main API calls for accessing files in directories were inefficient.