What makes this graph database "highly scalable, distributed"? There are difficu...

cbsmith · on June 14, 2012

You could have said the same thing about generalized distributed computing as well. Much like with MapReduce, the trick is to recognize that there is an 80% solution that works quite well (based around BSP):

http://kowshik.github.com/JPregel/pregel_paper.pdf

http://googleresearch.blogspot.com/2009/06/large-scale-graph...

The trick is not to build a generalized graph operator solution. It's to have a specialized graph operator solution, and then see how many solutions you can fit to the specialized graph operators. Turns out you can do a lot with a little.

jandrewrogers · on June 14, 2012

Approaches like Pregel have been the canonical way of dealing with large graph problems for many years. Unfortunately, most interesting graph analytic problems do not fit into that model because the "graph-like" aspect is still limited to problem sizes that fit conventional algorithms.

For example, if you can reduce a trillion-edge graph analysis problem into a billion-edge graph plus some other stuff (usually materialized document structures) then you can fit that into something like Pregel. That is how almost all real-world graph analysis is done today.

But by doing so, you've lost the ability to do graph analysis on the other ~trillion edges for the sake of tractability in a narrow case. You can't do relationship analysis across the attached documents. There are many, many graph analytic problems that require a true graph that is orders of magnitude larger than what can be partitioned even after accounting for graph reduction techniques such as those used in Pregel.

The Holy Grail is still the ability to run ad hoc graph analytic queries against a massively distributed graph representation. There are no shortcuts around this for many interesting applications. Right now, we are limited to mere billions of edges for most practical purposes and all of the hacks and workarounds are designed to keep the number of true edges to around this number even when the data model is much larger.

cbsmith · on June 14, 2012

> Unfortunately, most interesting graph analytic problems do not fit into that model because the "graph-like" aspect is still limited to problem sizes that fit conventional algorithms.

"interesting" != "useful"

Again, the same thing has been true with distributed computing in general. MapReduce & Hadoop are pretty much the antithesis of where distributed computing research for the last ~20 years had been working, because MapReduce solves what is nearly an "embarrassingly parallel" problem.

> There are many, many graph analytic problems that require a true graph that is orders of magnitude larger than what can be partitioned even after accounting for graph reduction techniques such as those used in Pregel.

There are even more distributed computing algorithms that don't fit in to MapReduce terribly well (and really, it's not that algorithms don't work with MapReduce/Pregel, it's that they don't work well), but it is still quite useful.

Turns out, the reason it's the Holy Grail is that it is just flat out hard to do (provably so). While what Titan/Pregel do isn't nearly is difficult, it is surprisingly difficult to do at massive scale, so just doing the simple stuff they do is quite useful and game changing.

bendemott · on June 14, 2012

Agreed, this is why our solution was to use Node.JS and achieve parallelism through eventually consistent independent data-stores separated on each physical web-server application instance, I'll gladly switch to a distributed centrally managed system when it becomes available.

th0ma5 · on June 14, 2012

You mean CAP theorem? http://en.wikipedia.org/wiki/CAP_theorem I imagine it is either the A or the P that gets to be the victim, but I'm not sure which in this case.

jandrewrogers · on June 14, 2012

No, I am referring to the set of problems related to graph partitioning.

This is essentially the same underlying problem that is the source of why distributed NoSQL databases do not support join operations. In the case of NoSQL databases, they simply do not support joins because it is not a core operations. (Technically you can still do a join, it just has terrible scaling characteristics.)

The fundamental operation of graph databases are relational joins by another name, which means that graph databases have the same limitation on distribution that distributed NoSQL databases have on joins. However, unlike NoSQL databases it is their primary operation so they can't just not support it. Consequently, the only way to have a "graph database" that is massively distributable is to solve the same problem that prevents distributed databases from supporting joins.

philjohn · on June 14, 2012

There are several distributed triple stores (triple stores are a specialised graph store).

http://4store.org http://virtuoso.openlinksw.com/ http://www.systap.com/bigdata.htm

There's a paper on 4store here: http://4store.org/publications/harris-ssws09.pdf

ferrouswheel · on June 14, 2012

I've been working on the graph partitioning issues for distirbuted graphdbs for a while (my own pet project that my brain won't let me give up on) but it's been from the perspective of someone who's not been keeping up to date with the Comp Sci literature. Have you got any suggestions for canonical papers from the last decade or so, for the state of the art?

jandrewrogers · on June 15, 2012

There has not been much that is both new and interesting in the graph partitioning literature in a long time. What you already know is probably not too far off from what is in the current literature. The literature on this topic has been stagnant for years.

Solutions to the graph partitioning problem exist and among people doing high-end graph analytics this has been rumored for years now. It just is not published and people that know how it is done are slathered in NDAs. I know of two different (related) algorithms for parallelizing graph analysis. IBM Research currently has the most advanced algorithms for graph analysis and they disclose very little about how they work.

tptacek · on June 14, 2012

Worth a re-read: http://codahale.com/you-cant-sacrifice-partition-tolerance/

th0ma5 · on June 14, 2012

Thank you, this is really great to show AP or CP but never CA and call it distributed.

okram · on June 14, 2012

See https://speakerdeck.com/u/okram/p/titan-the-rise-of-big-grap...

th0ma5 · on June 14, 2012

Thank you, this shows that for the one you want, the choice is in the backend.