Start with Tokyo Cabinet and only switch when you hit a ceiling. My system can d...

smokinn · on May 8, 2009

Tokyo Cabinet (TC) is what I was going to recommend as well. It's crazy fast and quite reliable. It also does zlib compression so your dataset will be smaller than with most other key/value stores. Be careful picking an open source key/value store because a lot of them are kinda half baked right now. (But TC is very solid.)

My first attempt would be as suggested above, to use a single instance of TC. If it ever started to get slow or you simply ran out of space you could even put a consistent hash in front of a cluster of machines. Though I wouldn't be surprised if a single machine would be enough for your current needs (20 billion).

The hash doesn't have to be very complicated, depending on your key. If your key is just a numeric auto incremented id for example you could implement a "hash" with modulo which just maps 0-x billion to machine 1, x-2x billion to machine 2, etc.

If you do this though you'll lose the ability to generate keys to search through your data. If you need to generate keys and your data is distributed I highly recommend sphinx search ( http://www.sphinxsearch.com/ ). Technically, it's a full-text search engine but I use it for normal searching. You can build an index distributed across multiple machines and even multiple cores of multiple machines. For example, if you have 2 boxes with 4 cores each, you can split your data into 8 chunks, 4 of the chunks running on one machine, 4 on the other with a single "distributed index" that ties them all together. When you do a search on your distributed index for say 100 items sorted by date, 8 threads will search the 8 chunks, return the 100 best results and the distributed index will filter out duplicates and return the 100 best results. And it'll do it incredibly quickly.

We're using sphinx with mysql. I'm not sure how you'd get the data from TC to sphinx (there may or may not be adapters for that I'm not sure) but if you have to pick between sphinx and TC you'd have to decide what's most important for you, low storage space with very quick lookup or very low latency search. If you need low latency search I suggest you simply use mysql for storage and index your data with sphinx.

gfodor · on May 8, 2009

Seconding this. TC is solid.

aaronblohowiak · on May 9, 2009

Sphinx has an XML input format, IIRC.

http://www.sphinxsearch.com/docs/current.html#xmlpipe2

oconnor0 · on May 8, 2009

Why Tokyo Cabinet?

How quickly can you populate your database locally? Adding 10 billion objects at 300/seconds will take too long.

mahmud · on May 8, 2009

1 million records in 0.4 seconds. 2.5M queries per second.

http://www.scribd.com/doc/12016121/Tokyo-Cabinet-and-Tokyo-T...

page 4.

ntoshev · on May 8, 2009

You realize these numbers are not for transactional (ACID) updates, right? They are with write cache enabled, meaning a power failure would cause data reported as written not to be available. I hope it doesn't mean data corruption, if someone has tested this I'd like to hear from them.

For ACID storage, one transaction per disk rotation is pretty much a theoretical limit for sustained performance.

anamax · on May 8, 2009

> For ACID storage, one transaction per disk rotation is pretty much a theoretical limit for sustained performance.

Why? There's nothing about "all or nothing" that forbids performing independent application-level commits with a single implementation-level commit. And, it's even possible to do application-level-dependent commits at the same time. (Yes, it's tricky.)

And, you can have commits on different spindles. And, you can do multiple writes during a single rotation.

leej · on May 8, 2009

Those numbers are probably from non-real world benchmark tests of sequential read/write. You probably never reach to those levels and when put TT to the equation # will decrease even more.

crad · on May 8, 2009

If I read it correctly, 300 records per second was his production rate. I've benched TokyoCabinet out to be much, much faster.