Using less virtual memory to store user data doesn't imply better cache use, only a smaller cache footprint. The tradeoff is memory for CPU: with a database you're spending more CPU. It's a fair bet that your memory capacity will increase at a greater rate than your CPU, not to mention cost less to power and/or cool, and leave you with a simpler software architecture to support. Wasting memory is a cheap hack to increase performance and decrease complexity.
> This of course before saying anything about transactional safety of writing directly to the filesystem
1) transactions aren't made or broken by what or when they're written, they're made or broken by being verified after being written, and 2) this is a user forum for people to comment on news stories, not an e-commerce site. Worst case the filesystem's journal gets replayed and you lose some pithy comments.
That's incorrect: you're potentially trading several system calls (open, read, close) and their associated copies, which have high fixed costs, for, with the right database, no system calls at all. I've spent most of the past year working with LMDB, and can say decisively that filesystems cannot be competitive with an embedded database, by virtue of the UNIX filesystem interface alone.
> this is a user forum for people to comment on news stories, not an e-commerce site
That much is true, though based on what we've learned in the parent post, until today all passwords on the site were stored in one file. Many popular Linux filesystems exhibit surprising behavior when rewriting files unless you're extremely careful with fsync and the like. For example, http://lwn.net/Articles/322823/ is a famous case where the decades-old traditional approach of writing out "foo.tmp" before renaming it to "foo" could result in complete data loss should an outage occur at just the right moment.
So you're saying LMDB looking up a user-specific record and returning it will always be faster than either an lseek() and read() on a cached mmapped file [old model] or an open(), read(), close() on a cached file [new model]? Is the Linux VFS that slow?
In terms of transaction guarantees, I thought the commenter was talking about the newer model, where each profile is an independent (and tiny) file. If that's the case, deleting and renaming files wouldn't be necessary, and any failed writes could be rolled back via the journal rather than leaving a file that is now nonexistent or renamed. From what I understand, the most the ext4 issue would do to this newer model is revert newly created profile files, which again I think would be a minor setback for this forum.
A serious database can use raw partitions with no filesystem at all for storage. Even when storing data on a filesystem, a database is unlikely to use a single file for each entry; it might make one mmap system call when it starts and none thereafter (a simplified example). The point is that a database can do O(1) system calls for n queries, whereas using the filesystem with a separate file for each entry requires O(n) system calls.
You could of course avoid this problem by using a single large file, but that has its own problems (aforementioned possibility of corruption). Working around those problems probably amounts to embedding a database in your application.
In the read-only case, pretty much any embedded DB with a large userspace cache configured won't read data back in redundantly.
In the specific case of LMDB, this is further extended since read transactions are managed entirely in shared memory (no system calls or locks required), and the cache just happens to be the OS page cache.
Per a post a few weeks back, the complete size of the HN dataset is well under 10GB; it comfortably fits in RAM.