
It's easy to get into situations where you're paying massive costs for serialization, deserialization, and network I/O, and I believe graph operations in Spark are one of those situations. I'd be curious whether running Spark in local mode with a single thread would actually improve the runtime, or whether it would reveal other issues with the Spark graph libraries.
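For a rough experiment along those lines, here's a minimal sketch, assuming GraphX and a toy edge list (the dataset and timing harness are just illustrative). Setting the master to "local[1]" pins Spark to one thread and takes network I/O out of the picture, which isolates the serialization and layout costs:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.graphx._

    object SingleThreadBaseline {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("single-thread-baseline")
          .master("local[1]") // one thread, no shuffle over the network
          .getOrCreate()
        val sc = spark.sparkContext

        // Toy edge list; swap in a real graph to get a meaningful number.
        val edges = sc.parallelize(Seq(
          Edge(1L, 2L, ()), Edge(2L, 3L, ()), Edge(3L, 1L, ())))
        val graph = Graph.fromEdges(edges, defaultValue = ())

        val t0 = System.nanoTime()
        val n = graph.pageRank(tol = 0.0001).vertices.count()
        println(s"pagerank over $n vertices: ${(System.nanoTime() - t0) / 1e9}s")

        spark.stop()
      }
    }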


Generally, memory layout is extremely important for graph problems, even on a single node. As I understand it, the Spark approach does not embrace a "flat" layout, but rather does lots of pointer chasing, which can really slow things down. Because Spark isn't very careful about memory usage and layout, you outgrow a single node quite fast, and then you're back to really bad distributed scaling characteristics.
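To make "flat" concrete: a compressed sparse row (CSR) layout packs every edge into contiguous primitive arrays, so traversal streams sequentially through memory, while a node-and-pointer representation takes a cache miss per hop. A minimal sketch of the contrast (the names are illustrative, not Spark internals):

    // Pointer-chasing layout: one heap object per vertex; traversal
    // follows a reference for every neighbor visited.
    final case class VertexNode(id: Long, neighbors: List[VertexNode])

    // Flat CSR layout: two primitive arrays, contiguous in memory.
    // targets(offsets(v) until offsets(v + 1)) are vertex v's neighbors.
    final class CsrGraph(offsets: Array[Int], targets: Array[Int]) {
      // Sum of out-degrees: a tight loop over sequential memory.
      def totalDegree: Long = {
        var sum = 0L
        var v = 0
        while (v < offsets.length - 1) {
          sum += offsets(v + 1) - offsets(v)
          v += 1
        }
        sum
      }
    }

The CSR variant also avoids boxed heap objects entirely, which is the kind of layout that fast single-node graph frameworks exploit.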



