BVH Acceleration Structure (Part 2)

Faster BVH traversal

Store more than one point per leaf

In CUDA, memory reads are expensive, so we always try to maximize the usage of each read. Either by making sure threads in a warp are doing coalesced reads, by reading the data that’s reused in shared memory, or by using the full size of memory lanes.

Depending on the compute capability, each memory access will read from 32B to 128B in one single transaction. This means that every time we read a point, and even though we are only interested in 12B (float3), each thread will still read at least 32B from global memory.

One way to improve this is to store more points per leaf, and processing them alltogether when reach the node. This means more computation but that’s okay if it’s going to reduce memory reads.

General idea is simple enough, we introduce the following constants that will allow us to easily experiment with different lane sizes:

const unsigned int lane_size_float = 32 / sizeof(float);
const unsigned int lane_size_spheres = lane_size_float / 3;
const unsigned int lane_padding_float = lane_size_float - lane_size_spheres * 3;

For example, for a lane size of 32 bytes we can store 8 floats, or 2 float3 + 8B padding. And for a lane size of 64 bytes we can store 16 floats, or 5 float3 + 4B padding.

Building the BVH remains the same, the only change is that we create the leaf BVH whenever we have lane_size_spheres points or less in a node.

Traversing the BVH is also mostly the same, but when we reach a leaf node, instead of intersecting the ray with two spheres (on both sides of the node) we intersect with lane_size_spheres:

__device__ bool hit_bvh(const scene& sc, const ray& r, float t_min, float t_max, hit_record &rec) {
    
    bool down = true;
    int idx = 1;
    bool found = false;
    float closest = t_max;
    bvh_node node = sc.bvh[1];

    while (true) {
        if (down) {
            if (hit_bbox(node, r, t_min, closest)) {
                if (idx >= sc.count) { // leaf node
                    int m = (idx - sc.count) * lane_size_float;
                    #pragma unroll
                    for (int i = 0; i < lane_size_spheres; i++) {
                        vec3 center(sc.spheres[m++], sc.spheres[m++], sc.spheres[m++]);
                        if (hit_sphere(center, r, t_min, closest, rec)) {
                            found = true;
                            closest = rec.t;
                        }
                    }
                    down = false;
                }
                else {
                    // keep going down
                    idx = idx * 2; // node = node.left
                    node = sc.bvh[idx];
                }
            }
            else {
                down = false;
            }
        }
        else if (idx == 1) {
            break;
        }
        else if ((idx % 2) == 0) { // node is left
            idx++; // node = node.sibling
            node = sc.bvh[idx];
            down = true;
        }
        else {
            idx = idx / 2; // node = node.parent
            node = sc.bvh[idx];
        }
    }

    return found;
}

After a few experiments, I found out that a lane_size of 64B gives the best results, even though my gpu’s lane size is actually 32B. One reason that could explain this is that there is less waste (padding) in the 64B size vs 32B thus more efficiency.

better BVH building logic

One way to improve BVH traversal is build better BVHs. Current implementation picks a random split axis for each node, but if we think about it splitting on the largest axis for each node may give better trees. This is a simple change, it only affects the build logic and it’s one lane change. Early testing shows that this simple change reduces total number of global memory transactions and actual memory reads.

smart traversal

Peter Shirley’s blog post describes a smarter way to travere the BVH that takes into account the fact that nodes in the tree are sorted according to the split axis and uses the ray direction to decide which node to intersect first. The idea is that if we find an intersection in the first child it will most likely allow the traversal to skip the second child sooner.

General idea is simple: in the original traversal code, when we are at node idx we go to the first child by moving to idx*2 and we go to its sibling by moving to idx*2+1. We can rewrite this to:

child(idx) = idx*2 + 0
sibling(idx) = idx + 1

At each node, we know that we sort the spheres according to the split axis, thus if ray.direction[split_axis] >= 0 the ray intersection will have a higher chance happening in the left child, otherwise we should explore the right child first.

Using this we can rewrite our previous traversal formulas as follow:

child(idx) = idx*2 + (ray.direction[split_axis] >= 0 ? 0 : 1)
sibling(idx) = idx + (ray.direction[split_axis] >= 0 ? 1 : -1)

Using this we can rewrite our BVH traversal code as follows:

__device__ bool hit_bvh(const scene& sc, const ray& r, float t_min, float t_max, hit_record &rec) {
    
    bool down = true;
    int idx = 1;
    bool found = false;
    float closest = t_max;
    bvh_node node = sc.bvh[1];

    // precompute move increments
    int move_left[3], move_right[3];
    for (unsigned int i = 0; i < 3; i++) {
        move_left[i] = (r.direction()[i] >= 0) ? 0 : 1;
        move_right[i] = (r.direction()[i] >= 0) ? 1 : -1;
    }

    while (true) {
        if (down) {
            if (hit_bbox(node, r, t_min, closest)) {
                if (idx >= sc.count) { // leaf node
                    int m = (idx - sc.count) * lane_size_float;
                    #pragma unroll
                    for (int i = 0; i < lane_size_spheres; i++) {
                        vec3 center(sc.spheres[m++], sc.spheres[m++], sc.spheres[m++]);
                        if (hit_sphere(center, r, t_min, closest, rec)) {
                            found = true;
                            closest = rec.t;
                        }
                    }
                    down = false;
                }
                else {
                    // keep going down
                    idx = idx * 2 + move_left[node.split_axis()]; // node = node.left
                    node = sc.bvh[idx];
                }
            }
            else {
                down = false;
            }
        }
        else if (idx == 1) {
            break;
        }
        else {
            // let's read the parent again
            node = sc.bvh[idx / 2];
            if ((idx % 2) == move_left[node.split_axis()]) { // go to sibling
                idx += move_right[node.split_axis()]; // node = node.sibling
                node = sc.bvh[idx];
                down = true;
            }
            else { // go up
                idx = idx / 2; // node = node.parent
                //node = sc.bvh[idx]; // we already read parent node
            }
        }
    }

    return found;
}

This code is nearly 8x faster than the previous one, but it still suffers from three limitations:

using arrays to store the precomputed values causes the kernel to rely on local memory, which can actually end up being stored in global memory (we don’t want that)
when moving from one child to its sibling we need to load the parent node again, from global memory, to identify the split axis.
we load the nodes as soon as we change the idx but we don’t really have to when we are moving back up the tree.

To solve this, we no longer precompute the move increments as we can simply compute them as follows:

move_left = signbit(ray.direction[split_axis])
move_right = -2*move_left + 1

As we can see, we no longer need the split_axis to compute the move_right increment, so we can just store 1bit per level when traversing the tree to be able to compute move_right without loading the parent node again. Tree traversal becomes:

__device__ bool hit_bvh(const scene& sc, const ray& r, float t_min, float t_max, hit_record &rec) {

    bool down = true;
    int idx = 1;
    bool found = false;
    float closest = t_max;

    unsigned int move_bit_stack = 0;
    int lvl = 0;

    while (true) {
        if (down) {
            bvh_node node = sc.bvh[idx]; // only load nodes when we are going down the tree
            if (hit_bbox(node, r, t_min, closest)) {
                if (idx >= sc.count) { // leaf node
                    int m = (idx - sc.count) * lane_size_float;
                    #pragma unroll
                    for (int i = 0; i < lane_size_spheres; i++) {
                        vec3 center(sc.spheres[m++], sc.spheres[m++], sc.spheres[m++]);
                        if (hit_sphere(center, r, t_min, closest, rec)) {
                            found = true;
                            closest = rec.t;
                        }
                    }
                    down = false;
                }
                else {
                    // current -> left
                    const int move_left = signbit(r.direction()[node.split_axis()]);
                    move_bit_stack &= ~(1 << lvl); // clear previous bit
                    move_bit_stack |= move_left << lvl;
                    idx = idx * 2 + move_left;
                    lvl++;
                }
            }
            else {
                down = false;
            }
        }
        else if (idx == 1) {
            break;
        }
        else {
            const int move_left = (move_bit_stack >> (lvl - 1)) & 1;
            const int left_idx = move_left;
            if ((idx % 2) == left_idx) { // left -> right
                idx += -2 * left_idx + 1; // node = node.sibling
                down = true;
            }
            else { // right -> parent
                lvl--;
                idx = idx / 2; // node = node.parent
            }
        }
    }

    return found;
}

Store the BVH tree in constant memory

Given that all threads are traversing the BVH tree most of the time, if we could cache the whole tree it would save a lot of memory reads. Storing the tree in constant memory is a sure way to do that, but unfortunately constant memory is limited to 64KB. Each BVH node is 2 x float3 = 2 x 3 x 4 = 24B At most we can store 2730 nodes, so we can fit the first 10 levels of the tree or a total of 2048 nodes.

Building the BVH remains the same with the difference that we copy the first 2048 nodes to constant memory and the remaining nodes to global memory. Traversing the tree is also the same except for the line of code that loads the node:

bvh_node node = (idx < 2048) ? d_nodes[idx] : sc.bvh[idx - 2048];

d_nodes is the array of nodes stored in constant memory, and sc.bvh is the array of nodes stored in global memory.

Results

In the previous post we were able to render 1M points in 2s. With all the optimizations in, we can now render 10M points in 0.7s.

Written on April 27, 2019

voxel-tracer.github.io