Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I need to add two factoids to this insightful discussion:

There is a big difference between space-filling curves and hyperspace hashing. Space-filling curves map a multidimensional space to a single path through that space that is then mapped to nodes. In the process, they do not retain locality. To our knowledge, hyperspace hashing is a direct intellectual descendant of consistent hashing and has not been done before. If you have pointers to work where data is mapped to nodes in a cluster using a multidimensional hash, please send them to us!

And one major reason why multidimensional databases failed to take off is a problem known as "the curse of dimensionality." If you implement a multi-dimensional representation naively, highly-dimensional data (say, an object with 10-20 attributes) will require a large number of nodes to be efficient. HyperDex solves this through something called space partitioning (I think the paper calls it "data partitioning," but we've changed the name to be a bit more descriptive). They're kind of analogous to materialized views, very loosely speaking.

Agreed completely that hyperspace hashing comes to its own when coupled with value-dependent chaining!



While there is a distinction between space-filling curves and what you are calling "hyperspace hashing", they are theoretical siblings and the conceptual gap between the two is very small, arguably differing interpretations of the same abstract construct. The "hash" interpretation was well-known when the space-filling curve indexing literature in the late 1980s was written though they were primarily interested in access methods for sequential storage.

With respect to "hyperspace hashing", it has been the standard way of interacting with large space decompositions for decades. The mathematics is from the 1960s (Morton et al) and the practical computer science was mature by the 1980s. When I was writing parallel codes for supercomputers, virtually all of our data structures were built this way. It is also the algorithm used for some large-scale geospatial systems; some of the Google Earth backends used it back in the day. While I've never personally worked on it, I've heard some graphics processing pipelines also use the algorithm. I first learned about it ten years ago because I ran into a couple systems using it.

I've never heard it called "hashing" until recently, largely because technically the algorithms are not a hash even if used as one in many implementations. This, and the fact that most of the literature on this idea significantly pre-dates electronic publishing, probably explains why you've seen so few references to hyper-dimensional hashing.

As you point out, the curse of dimensionality is a classic problem for these structures. The partitioning technique in the paper works for some use cases but is not used as much anymore. There have been some clever and fairly general attacks on the dimensionality problem over the last few years by a couple companies but I have not seen anything published.

All that said, this general approach is overdue for large-scale distributed systems. It has been proven out in niche markets like supercomputing for decades and is superior to distributed hash tables for many purposes. However, most of the recent research does not seem to be published.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: