cilk is definitely cool, but i don't think it makes designing parallel programs any easier, it just removes a lot of boilerplate. I have worked with similar infrastructures (not the language support) and found it to be extremely difficult. I looked into cilk and I don't think that the language support would have been a game changer. Does anyone have experience with cilk? What have been your experiences?
How do you know all of this if you haven't tried it? I think you'll find Cilk a lot more polished and refined then you might at first realize. Having language level support for fine-grained parallelism together with a provably efficient scheduler is a huge win.
But you'll have to be more specific with your complaints. What kind of parallel programs are you trying to design? What similar infrastructures have you worked with?
I was working with a small library that i implemented (with help) on a grad school research project. We were experimenting with various low level techniques/approaches to how one would implement an operating system for a 1000+ core chip. We were experimenting with memory management and scheduling designs. In order to test our designs we wrote a number of toy benchmark programs. Basically the typical set: merge_sort, fft, mat_mult, kmeans.... We ended up with a programming model that is not disimilar to the cilk model, though without the compiler support. So we were manually managing continuation scheduling with latches on atomic variables. This was a somewhat annoying thing to do, but we were able to make it work. I looked at cilk at the time and it definitely would have been nice to have compiler support but i don't think it would have fundamentally changed the way we implemented our algorithms.
If you're parallelizing algorithms with divide-and-conquer behavior (recursive algorithms, or anything that forms a tree of tasks and sub-tasks), I think it's a very natural form of parallelism.
I did something similar as a C++ library for my Master's: http://people.cs.vt.edu/~scschnei/factory/ I felt it was an intuitive way of expressing task parallelism, but if the dependent tasks don't operate on strict subsets of each other's data, I agree it's not much help. If that's the case, then you've successfully expressed the parallelism, but the larger problem remains: synchronized access to data structures.
Of course, transactional memory could help there. (Hey, full circle.)
in order to do work stealing in general you just need to organize your computations into tasks. Having a language that supports closures (or a similar thing) makes it easier, but it is not neccesary because you can just use function pointers to define tasks (and some kind of struct to pass args)
I've been using g++-4.6 lately and with the new lambdas I'm migrating more and more to this async task application model.
One point that seems frequently understated with the lamentations about the absence of a true double-pointer-sized DCAS is that on machines with a 64 bit CAS (e.g. x86 and x64) you could still implement something useful CAS on pairs of 32-bit handles.
2^32 concurrent work items "ought to be enough for anybody", right? :-) Well, it might be useful to those of us with less than 100 GB of RAM.
In the paper you linked they don't seem to be able to perform a priority queue op in under about 1 us. Even when there are only 15 threads running on an exotic 29-core machine, they can only handle 750e3 ops/second (with no workload on each op). (This is consistent with a benchmark I did on Boost.ASIO a while back.
So the tasks should be sized to take at least (thread cnt)* 9 us to complete to keep the overhead from the priority queue overhead under about 10%. The best cases for the lock free algorithm might let you bring that down to about 3 us in theory. I'm not sure that justifies the additional complexity.
This survey paper http://www.zurich.ibm.com/pdf/sys/adv_messaging/shortConcurr... says a lot of stuff and then at the end Figure 4 shows a simple coarse-locked single-threaded structure providing double the performance when there is actual load involved and significant preemption. Which is the opposite of what I'd expected, which would be the occasional preemption of the thread holding the coarse lock would cause a slowdown.
In my investigations i did find the overhead to be fairly high and in most of my benchmarks i found that using simple fixed size fifo queues was significantly faster. However my benchmarks were generally simple cpu bound tasks you would definitely want to explore more varied workloads. Also since you are using a priority based scheduling algorithm presumably the priorities are significant and therefore sacrificing some throughput would be worth it.