From my ray-tracing days, I recall that the majority of the time spent in the acceleration structure was due to cache misses when traversing nodes.
You might want to use binary partitioning (or something similar) for just the top few levels, and then have each leaf node hold N spans in a sorted list, where N is fairly large. A tight loop can then mow through the spans in a leaf.
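Here is a rough sketch of that layout in Python, just to make the idea concrete. It assumes the spans are non-overlapping half-open intervals [start, end) that are already sorted by start; the names (build, find, max_depth, leaf_size) and the default sizes are made up for illustration and would need tuning for a real workload.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Span = Tuple[float, float]  # half-open [start, end); assumed non-overlapping


@dataclass
class Leaf:
    spans: List[Span]  # contiguous, sorted by start


@dataclass
class Inner:
    split: float  # start of the first span stored under `right`
    left: "Node"
    right: "Node"


Node = Union[Leaf, Inner]


def build(spans: List[Span], max_depth: int = 3, leaf_size: int = 64,
          depth: int = 0) -> Node:
    """Build a shallow binary tree over spans sorted by start.

    Splitting stops at max_depth, so leaves can hold far more than
    leaf_size spans. That is intentional: few nodes, big leaves.
    """
    if depth >= max_depth or len(spans) <= leaf_size:
        return Leaf(spans)
    mid = len(spans) // 2
    return Inner(
        split=spans[mid][0],
        left=build(spans[:mid], max_depth, leaf_size, depth + 1),
        right=build(spans[mid:], max_depth, leaf_size, depth + 1),
    )


def find(node: Node, x: float) -> Optional[Span]:
    """Return the span containing x, or None if no span contains it."""
    # Only a handful of node hops, so few chances for a cache miss.
    while isinstance(node, Inner):
        node = node.left if x < node.split else node.right
    # Linear scan over the leaf's sorted spans.
    for start, end in node.spans:
        if start > x:
            break  # sorted by start, so nothing later can contain x
        if x < end:
            return (start, end)
    return None
```

In a language like C or C++ the leaf would be a flat array of structs so the scan walks contiguous memory. The Python version only shows the shape of the structure, not the cache behavior.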