Single Kernel

Optimizing memory access by combining the kernels into a single one

In the previous post we were able to render 348M rays/s. As we’ve seen, accessing global memory is slow, and all our optimizations so far targeted those accesses. In this post we are going to rewrite the rendering logic to reduce those accesses to a minimum.

The general rendering logic is as follows:

render_frame {
  generate_rays_kernel {
    stores ray to global memory
  }
  
  for each depth iteration {
    hit_world_kernel {
      loads ray
      stores hit info
    }
    scatter_kernel {
      loads ray, hit info, color, and attenuation
      stores scattered ray, color, and attenuation
    }
  }
}
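
To make the global memory traffic concrete, here is a minimal CUDA sketch of this multi-kernel flow. It is not the project's actual code: the `Ray` struct, the toy scene (one mirror sphere under a white sky), the grayscale color, the use of an attenuation of 0 to mark finished paths, and the helper and host function names are all assumptions made for illustration. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical minimal ray type; the project's own structs will differ.
struct Ray { float3 origin, dir; };

__device__ float dot3(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Toy scene (assumption): a unit sphere centered at (0, 0, -3).
// Returns the hit distance along the ray, or -1 on a miss.
__device__ float hit_sphere(const Ray& r) {
    float3 oc = make_float3(r.origin.x, r.origin.y, r.origin.z + 3.0f);
    float a = dot3(r.dir, r.dir), b = dot3(oc, r.dir), c = dot3(oc, oc) - 1.0f;
    float disc = b * b - a * c;
    if (disc < 0.0f) return -1.0f;
    float t = (-b - sqrtf(disc)) / a;
    return t > 1e-3f ? t : -1.0f;
}

__global__ void generate_rays_kernel(Ray* rays, float* attenuation, int width, int height) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width * height) return;
    float u = ((i % width) + 0.5f) / width * 2.0f - 1.0f;
    float v = ((i / width) + 0.5f) / height * 2.0f - 1.0f;
    rays[i] = Ray{make_float3(0.0f, 0.0f, 0.0f), make_float3(u, v, -1.0f)};  // global store
    attenuation[i] = 1.0f;                                                   // global store
}

__global__ void hit_world_kernel(const Ray* rays, float* hits, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    hits[i] = hit_sphere(rays[i]);  // global load of the ray, global store of the hit
}

__global__ void scatter_kernel(Ray* rays, const float* hits, float* colors,
                               float* attenuation, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count || attenuation[i] == 0.0f) return;  // 0 marks a finished path
    Ray ray = rays[i];                                 // global loads
    float t = hits[i];
    if (t < 0.0f) {                   // miss: add the white sky, finish the path
        colors[i] += attenuation[i];
        attenuation[i] = 0.0f;
        return;
    }
    float3 p = make_float3(ray.origin.x + t * ray.dir.x, ray.origin.y + t * ray.dir.y,
                           ray.origin.z + t * ray.dir.z);
    float3 normal = make_float3(p.x, p.y, p.z + 3.0f);  // unit length: radius is 1
    float d = 2.0f * dot3(ray.dir, normal);
    rays[i] = Ray{p, make_float3(ray.dir.x - d * normal.x, ray.dir.y - d * normal.y,
                                 ray.dir.z - d * normal.z)};  // global store
    attenuation[i] *= 0.5f;                                   // global load and store
}

// Host side: every intermediate value lives in a global memory buffer.
std::vector<float> render_frame_multi_kernel(int width, int height, int max_depth) {
    int count = width * height, threads = 64, blocks = (count + threads - 1) / threads;
    Ray* rays; float *hits, *colors, *attenuation;
    cudaMalloc(&rays, count * sizeof(Ray));
    cudaMalloc(&hits, count * sizeof(float));
    cudaMalloc(&colors, count * sizeof(float));
    cudaMalloc(&attenuation, count * sizeof(float));
    cudaMemset(colors, 0, count * sizeof(float));
    generate_rays_kernel<<<blocks, threads>>>(rays, attenuation, width, height);
    for (int depth = 0; depth < max_depth; depth++) {
        hit_world_kernel<<<blocks, threads>>>(rays, hits, count);
        scatter_kernel<<<blocks, threads>>>(rays, hits, colors, attenuation, count);
    }
    std::vector<float> result(count);
    cudaMemcpy(result.data(), colors, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(rays); cudaFree(hits); cudaFree(colors); cudaFree(attenuation);
    return result;
}
```

Note that each pixel's ray, hit, color, and attenuation make a round trip through global memory on every depth iteration, even though only the final color is needed on the host.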

The only reason we rely so much on global memory is that it is the only way for different kernels to exchange information. The alternative is to write a single “uber” kernel that keeps the ray, color, and attenuation in local variables, thus reducing the usage of global memory to a minimum. The general algorithm of such a kernel is:

render_frame_kernel {
  generate ray
  load color
  for each depth iteration {
    intersect ray with scene
    update ray with scattered ray
    update local color and attenuation
  }
  store color
}
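
As a rough illustration of this structure, here is a minimal CUDA sketch of a single kernel. It is not the code from the branch: the `Ray` struct, the toy scene (one mirror sphere under a white sky), the grayscale color, and the helper and host function names are assumptions made for illustration. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical minimal ray type; the project's own structs will differ.
struct Ray { float3 origin, dir; };

__device__ float dot3(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Toy scene (assumption): a unit sphere centered at (0, 0, -3).
// Returns the hit distance along the ray, or -1 on a miss.
__device__ float hit_sphere(const Ray& r) {
    float3 oc = make_float3(r.origin.x, r.origin.y, r.origin.z + 3.0f);
    float a = dot3(r.dir, r.dir), b = dot3(oc, r.dir), c = dot3(oc, oc) - 1.0f;
    float disc = b * b - a * c;
    if (disc < 0.0f) return -1.0f;
    float t = (-b - sqrtf(disc)) / a;
    return t > 1e-3f ? t : -1.0f;
}

__global__ void render_frame_kernel(float* colors, int width, int height, int max_depth) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width * height) return;

    // Generate the ray into a local variable instead of a global buffer.
    float u = ((i % width) + 0.5f) / width * 2.0f - 1.0f;
    float v = ((i / width) + 0.5f) / height * 2.0f - 1.0f;
    Ray ray{make_float3(0.0f, 0.0f, 0.0f), make_float3(u, v, -1.0f)};

    float color = colors[i];   // the only global load
    float attenuation = 1.0f;  // local state, never written to global memory

    for (int depth = 0; depth < max_depth; depth++) {
        float t = hit_sphere(ray);  // intersect ray with scene
        if (t < 0.0f) {             // miss: add the white sky, finish the path
            color += attenuation;
            break;
        }
        // Update the ray with the scattered (here: mirror-reflected) ray.
        float3 p = make_float3(ray.origin.x + t * ray.dir.x, ray.origin.y + t * ray.dir.y,
                               ray.origin.z + t * ray.dir.z);
        float3 normal = make_float3(p.x, p.y, p.z + 3.0f);  // unit length: radius is 1
        float d = 2.0f * dot3(ray.dir, normal);
        ray = Ray{p, make_float3(ray.dir.x - d * normal.x, ray.dir.y - d * normal.y,
                                 ray.dir.z - d * normal.z)};
        attenuation *= 0.5f;  // update local attenuation
    }

    colors[i] = color;  // the only global store
}

// Host side: one buffer, one launch per frame.
std::vector<float> render_frame(int width, int height, int max_depth) {
    int count = width * height, threads = 64, blocks = (count + threads - 1) / threads;
    float* colors;
    cudaMalloc(&colors, count * sizeof(float));
    cudaMemset(colors, 0, count * sizeof(float));
    render_frame_kernel<<<blocks, threads>>>(colors, width, height, max_depth);
    std::vector<float> result(count);
    cudaMemcpy(result.data(), colors, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(colors);
    return result;
}
```

In this sketch each thread touches global memory exactly twice per frame, once to load the accumulated color and once to store it, regardless of how many depth iterations the path takes.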

The change was made through a series of commits to make it easier to follow. Here is a link to the branch.

After making this change, the median rendering performance went up to 716.1M rays/s, roughly twice the 348M rays/s from the previous post. This is ~90x the original single-threaded implementation!

In the next post we are going to investigate what, if anything, is limiting the performance of the kernel and how we can improve it even further.

Written on November 5, 2018

voxel-tracer.github.io
