I would have expected any series on optimizing Erlang applications to introduce and use cprof, eprof, and fprof.
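For anyone unfamiliar, here's a minimal sketch of what using one of them looks like from the Erlang shell (eprof here; `my_mod:slow_fun/0` is a hypothetical stand-in for whatever code you suspect is hot):

```erlang
%% Start eprof, profile a single call, then print a per-function
%% time breakdown. my_mod:slow_fun/0 is hypothetical.
eprof:start(),
eprof:profile(fun() -> my_mod:slow_fun() end),
eprof:analyze(total),
eprof:stop().
```

Roughly speaking, cprof just counts calls with very low overhead, eprof gives per-function time like the sketch above, and fprof builds full call graphs at a much higher cost, so you'd usually reach for them in that order.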
That said, looking forward to additional installments.
Also, I can totally hear somebody saying, "That's premature optimization, don't worry about it." with respect to that horrible way of getting the application's version number. Stuff like that drives me nuts, and this particular issue is a perfect example of why that blanket statement is so misguided.
Yeah, your broken, slow, locally isolated implementation of a thing is innocuous in and of itself... but then some functionality in the critical path uses it and builds on top of it, and then there end up being dozens or hundreds of similar problems and internal dependencies... before you know it you've got death by a trillion tiny cuts.
In the pathological case those trillion tiny cuts start to look like noise rather than signal when profiling because nothing interesting just jumps out as critically broken, and you're left just assuming, "Well I guess it's slow just because the language/runtime/machine/whatever is slow, not because I've ground performance down to a fine dust through accreted questionable decisions."
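On the version-number example above: I don't know exactly what the code in question did, but OTP already keeps the version string in the application metadata, so a sketch of the cheap approach is just to read it once from there (assuming the application is named `couch` here):

```erlang
%% The vsn key from the .app file; read it once and keep it,
%% rather than recomputing it on every request.
{ok, Vsn} = application:get_key(couch, vsn).
```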
Slightly off topic, but the general hacker community seems to have somehow missed it -- the creators of CouchDB formed a company and merged with the creators of memcached, and the new company is called Couchbase. This is the best NoSQL database going: memcached built in, CouchDB views, scalable (really, actually scalable, not MongoDB "scalable"), etc.
I've long thought the hacker culture ignored databases and just picked something because it was popular (e.g. MySQL) even though there were superior (objectively superior -- we are engineers, after all) solutions out there.
Erlang is one of those languages that is objectively superior -- I've yet to meet another language that does concurrency right -- yet many hackers just ignore it because it's not got Java's syntax. Which is silly.
Don't make the same mistake with next generation databases.
Couchbase may have involved the creators of Apache CouchDB, but they are only lightly related these days (mostly some remnants of replication and map-reduce concepts). Both have moved some distance since then, so it's best to assume that CouchDB != Couchbase. (A most annoying name clash, but that's that.)
What is being talked about here is the 2.0 release, which is under development. It integrates IBM Cloudant's clustering layer, which was previously called BigCouch. It's been a long road, but it's good to see the C in Couch (Cluster Of Unreliable Commodity Hardware) finally get supported. IMO, the HTTP API code in the project is probably the biggest jungle left mostly untouched from the earlier days of CouchDB, and it is ripe for a major cleanup. It's not surprising to see bottlenecks like this, and it's great to see the author find a reasonable fix for the time being.
> Erlang is one of those languages that is objectively superior -- I've yet to meet another language that does concurrency right -- yet many hackers just ignore it because it's not got Java's syntax. Which is silly.
> that which is best suited to the problem at hand
also best suited... according to our understanding of the problem space, our capability and competency to tackle it, the resources presently available to allocate, our worldview and preferences, any external constraints such as customer requirements, and so on.
A lot of people like async map-reduce. If you need to perform aggregation on a lot of data, it's constantly growing, and you need the results to be current, async map-reduce is great. In the best case scenario, the results are precomputed. In the worst case scenario, they are a few seconds out of date. However, you have the option of forcing an update if need be. Either way, it's a hell of a lot faster than running the full aggregation every time it's requested.
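As a concrete illustration, CouchDB views can even be written in Erlang via the native query server; this hypothetical view emits each document's `total` field, and pairing it with the built-in `_sum` reduce gives you exactly that incrementally maintained aggregate:

```erlang
%% Hypothetical native-Erlang map function: Doc is a proplist of
%% the document's fields, and Emit/2 is provided by the query
%% server. Combined with the _sum reduce, the aggregate is updated
%% incrementally as documents change.
fun({Doc}) ->
    case proplists:get_value(<<"total">>, Doc) of
        undefined -> ok;
        Total -> Emit(null, Total)
    end
end.
```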
Redis is great, but a) the memcached protocol is well established and b) Redis is more than a simple cache.
BSON vs. JSON, what's the point here?
A query doesn't have to hit every index node. That doesn't make any sense. In fact, it's quite the opposite. With local indexes, you would, in fact, hit every single node. With global secondary indexes, you hit the index node with the right index.
Are you talking about partial updates? If so, yes, that will be available in the next developer preview. Stay tuned.
There are a handful of databases that implement map-reduce one way or another - CouchDB, Couchbase, and MongoDB off the top of my head. Views might be a CouchDB/Couchbase concept, but not incremental map-reduce.
In what way is JSON inefficient? Are we talking about size?
GSI indexes may or may not be partitioned. With GSI, depending on the index size and resources available, you would most likely NOT partition the index - that's the recommendation. You can create an index on user_id and column_b, place it on a specific node, and you'd only be hitting that node for a query. Especially if it's a covering index. Again, databases without GSI indexes have an index partition on every single node - that means hitting every single one for every single query. I'm still not sure what you're trying to get at.
I'm guessing you are referring to MongoDB shards and routers. However, that example doesn't make sense. If user_id is the shard key, then yes, the router sends the query to the right node. The same thing happens with Couchbase. Given the key, you get the document straight from the node that has it. However, if you have user_id, why are you querying on column_b too? Now, if user_id is not the shard key, then no, the router does not send the query to the right node; it sends the query to every single node.
I'm generalizing, but key-value databases are best for key-value operations on arbitrary data. Document databases understand JSON and, as such, can provide access via queries. With Couchbase, you can choose from views, N1QL (SQL), geospatial (built on views), or full-text search (preview). Pretty far off from a key-value store.
That, and it already has support for partial updates via N1QL. However, my assumption was that you were talking about partial updates via key-value operations.
Does CouchDB actually store the data in JSON format on disk? Or does it use a more specialized format?
Elasticsearch, for example, has a full-featured JSON API, but stores documents on disk in Lucene -- not that you'd ever know that if you only perused the API.
My new year resolution was to learn Erlang. I am implementing a simple REST service with it and I'm loving its approach to concurrency and terseness. It is indeed a beautiful language.
Obligatory "try Elixir too; it has the same semantics, a more Ruby-like syntax (which is a matter of preference) and actual macros, while still compiling down to the same BEAM!"
Just being extremely pedantic here, but for the least statistically inclined of us, I believe this is important.
Just having lower numbers in your benchmark doesn't actually mean that your software runs faster; you should run a statistical test and see.
In this case, if we run a standard test, the confidence that performance improved at all is basically 1, so that claim is definitely accepted.
However, for the claim of an 8% performance improvement, the confidence is only around 83%, which is quite high but lower than the 95% usually expected in peer-reviewed journals.
When I think of optimisation, my first thought isn't to go and patch the source code for a large open source application! I'm impressed ... just not sure it would be my first step.