Volt is a direct volume renderer for Nvidia's parallel CUDA hardware. Its goal is to achieve framerates suitable for interactive exploration of volume data while producing high-quality images, without relying on spatial data structures. Current optimization efforts concentrate on the architecture, the CUDA API, and hardware properties.
Real-time, high-quality direct volume rendering is a computationally intensive task. So far, clusters for parallel rendering, specialized volume rendering hardware, and the reconstruction of isosurfaces have been the main approaches to reaching interactive framerates. Earlier GPGPU techniques such as slicing cannot deliver the same quality. The CUDA architecture enables fine-grained access to massively parallel graphics hardware. Volt is an interactive direct volume renderer that exploits these devices' properties for high-quality interactive ray casting.
Emitted radiance is accumulated front to back by approximate integration along eye rays, while accumulated opacity enables early ray termination. Shading uses post-classification of samples reconstructed by trilinear interpolation. The transfer function applies opacity correction based on the integration step width, and colors are associated with opacities to avoid color bleeding. Approximated surface normals drive Blinn-Phong shading and the damping of homogeneous regions. Start and end points on the eye rays are computed with a Kay & Kajiya slab-method bounding box test.
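The traversal described above can be sketched on the CPU. The `Sample` struct, the `sampleAt` stand-in, and all parameter names are illustrative assumptions, not Volt's actual interfaces; the slab test and the front-to-back compositing with 95% early ray termination follow the scheme the text describes, with colors already associated (premultiplied) with opacity:

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sample: associated (premultiplied) color plus opacity.
struct Sample { float r, g, b, a; };

// Kay & Kajiya slab method: intersect a ray with an axis-aligned box.
// Returns false on a miss; otherwise tNear/tFar bound the traversal.
bool slabTest(const float origin[3], const float dir[3],
              const float boxMin[3], const float boxMax[3],
              float& tNear, float& tFar) {
    tNear = -INFINITY; tFar = INFINITY;
    for (int i = 0; i < 3; ++i) {
        float inv = 1.0f / dir[i];                // IEEE inf covers dir[i] == 0
        float t0 = (boxMin[i] - origin[i]) * inv;
        float t1 = (boxMax[i] - origin[i]) * inv;
        if (t0 > t1) std::swap(t0, t1);
        tNear = std::max(tNear, t0);
        tFar  = std::min(tFar, t1);
    }
    return tNear <= tFar && tFar >= 0.0f;
}

// Front-to-back compositing with early ray termination at 95% saturation.
// sampleAt is any post-classification lookup (here a caller-supplied functor).
template <typename F>
Sample integrateRay(F sampleAt, float tNear, float tFar, float dt) {
    Sample acc{0, 0, 0, 0};
    for (float t = tNear; t <= tFar; t += dt) {
        Sample s = sampleAt(t);
        float w = 1.0f - acc.a;                   // remaining transparency
        acc.r += w * s.r;                         // associated colors: no bleeding
        acc.g += w * s.g;
        acc.b += w * s.b;
        acc.a += w * s.a;
        if (acc.a > 0.95f) break;                 // early ray termination
    }
    return acc;
}
```

Because the colors are associated, the compositing step is a single multiply-add per channel; with non-associated colors each sample's color would first have to be weighted by its own opacity, which is the source of the bleeding artifacts the text mentions.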
Each CUDA thread processes one ray. High framerates are achieved by this fine-grained partitioning of data and computation. The graphical user interface, work package generation, and the rendering process run in parallel. Configuration data is partitioned according to its update frequency, so the transmission of unchanged data is reduced. The host supports each kernel call by precomputing all expressions whose results are identical for all rays, sparing the device various expensive divisions and exponentiations. The asynchronous kernel executes only computations that depend on the individual ray and contribute to a pixel's color.
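A minimal sketch of this host-side precomputation, under assumed names: the `FrameConstants` struct and its fields are illustrative, but the pattern is the one described, i.e. the host turns per-frame divisions into per-ray multiplications, and the opacity-correction exponent is computed once instead of per sample:

```cpp
#include <cmath>

// Hypothetical per-frame constants the host computes once per kernel call,
// so that every ray is spared the same divisions.
struct FrameConstants {
    float invSpacing[3];        // 1 / voxel edge length per axis
    float stepRatio;            // dt / reference step, opacity-correction exponent
    float invWidth, invHeight;  // 1 / image size, for ray setup
};

FrameConstants precompute(const float spacing[3], float dt, float refDt,
                          int width, int height) {
    FrameConstants c;
    for (int i = 0; i < 3; ++i) c.invSpacing[i] = 1.0f / spacing[i];
    c.stepRatio = dt / refDt;
    c.invWidth  = 1.0f / float(width);
    c.invHeight = 1.0f / float(height);
    return c;
}

// On the device, each ray then multiplies instead of dividing, and opacity
// correction alpha' = 1 - (1 - alpha)^(dt/refDt) reuses the ready exponent.
float correctedAlpha(float alpha, const FrameConstants& c) {
    return 1.0f - std::pow(1.0f - alpha, c.stepRatio);
}
```

In the real kernel the struct would live in constant memory, so the broadcast read costs each warp a single cached access.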
The special function units are used to interpolate the volume data and the transfer-function lookup table. Configuration parameters are stored in constant memory, since it is cached and broadcast access can be guaranteed. A quadratic arrangement of threads within warps increases performance because recent hardware imposes less strict requirements for coalesced operations on global memory; the improved locality results in more cache hits and fewer diverging execution paths. Rarely used thread-private variables of a warp are stored in successive banks of shared memory. Register usage is further reduced by prohibiting inlining for some functions and by abandoning CUDA vector structures and dynamic array indexing where possible. Loops are unrolled manually where conditional statements prevent the compiler from doing so. Float types and intrinsic functions are preferred in kernel code.
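The quadratic warp arrangement can be illustrated with an index mapping. The exact tile dimensions Volt uses are not stated, so the 8x4 tile and the `quadraticLayout` helper below are assumptions; the point is only that a warp's 32 threads cover a compact 2D pixel tile instead of a 32x1 row, so neighboring threads fetch neighboring voxels in both image dimensions:

```cpp
// Hypothetical mapping of a block's linear thread index onto 2D pixels so
// that each 32-thread warp covers an 8x4 tile. Compared to a 32x1 row this
// improves spatial locality (more cache hits) and makes rays within a warp
// more likely to take the same execution path.
struct Pixel { int x, y; };

Pixel quadraticLayout(int threadIdx, int warpW = 8, int warpH = 4) {
    int lane = threadIdx % (warpW * warpH);   // position within the warp
    int warp = threadIdx / (warpW * warpH);   // warp index within the block
    Pixel p;
    p.x = lane % warpW;                       // column inside the tile
    p.y = warp * warpH + lane / warpW;        // warp tiles stacked vertically
    return p;
}
```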
Each CUDA thread uses 32 registers and 44 bytes of local memory. A block width of 4 and a height of 64 yields 256 threads per block; the kernel allocates 5120 bytes of shared memory and 352 bytes of constant memory. Pipeline hazards and operations on off-chip memory are therefore both sufficiently hidden. Compiled without shading, only 14 registers are used. The results are based on renderings of the VisMale dataset with an integration step width of 0.49 times the shortest voxel edge. For all measurements, the function values [0,40] are assigned an alpha of 0; values [41,255] are assigned an alpha of 15 for low, 63 for medium, and 255 for high opacity. The integration of a ray is stopped by early ray termination once its saturation exceeds 95%, and it is globally limited by the bounding box. All datasets used are kindly made available by [Röt06].
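The measurement setup's alpha classification can be written out directly; the `Opacity` enum and `alphaFor` name are illustrative, but the value ranges and alpha levels are exactly those stated above:

```cpp
// Alpha assignment used for the measurements: scalar values [0,40] are fully
// transparent, values [41,255] get one of three fixed opacity levels.
enum class Opacity { Low, Medium, High };

int alphaFor(int value, Opacity level) {
    if (value <= 40) return 0;                // [0,40]: invisible
    switch (level) {                          // [41,255]:
        case Opacity::Low:    return 15;
        case Opacity::Medium: return 63;
        case Opacity::High:   return 255;
    }
    return 0;
}
```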