I got the inspiration for this topic from the new GPU Zen 3 book, which contains an article on two-pass HZB occlusion culling, and I've always been curious about how to implement occlusion culling. I did some surface-level research and kept track of resources that might be relevant (they can be found below).
Research
During this phase I read through all the resources I had collected during the preparation phase and looked for more along the way. As I read, I wrote down questions about things I was unsure of, and I started drafting a plan for how to implement this feature.
Questions
- How big should the HZB be for the algorithm?
- Answer: The same size as the largest dimension of the depth buffer. We need to down-sample it level by level, so we need to start at the largest resolution (see the sizing sketch after these questions).
- Does the HZB need to be square?
- Answer: Mip level selection becomes more complicated if the side lengths are not equal.
- Can we reuse the normal depth buffer as HZB?
- Answer: During the first pass we still render to the depth buffer, which we then use as the input for the HZB.
- During the first pass, do we render only to the depth buffer, or also perform the actual draws?
- Answer: You render the draws that were visible in the previous frame to the GBuffers.
- How do we down-sample? With a compute shader, or can we blit?
- Answer: Blit only offers the vk::Filter options for scaling the image, which cannot produce the single min/max value we need.
- Answer: Using the sampler filter minmax extension, we can use reduction modes to sample the HZB and depth buffer more easily.
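To make the sizing concrete, here is a minimal sketch of creating the HZB image, assuming a square HZB whose side matches the largest depth-buffer dimension; depthWidth and depthHeight are placeholder names, not our engine's actual variables.

uint32_t side = std::max(depthWidth, depthHeight);
// Full mip chain down to 1x1: floor(log2(side)) + 1 levels.
uint32_t mipCount = 1 + static_cast<uint32_t>(std::floor(std::log2(static_cast<float>(side))));

vk::ImageCreateInfo hzbCreateInfo {
    .imageType = vk::ImageType::e2D,
    .format = vk::Format::eR32Sfloat, // one float depth value per texel
    .extent = { side, side, 1 },
    .mipLevels = mipCount,
    .arrayLayers = 1,
    .samples = vk::SampleCountFlagBits::e1,
    .usage = vk::ImageUsageFlagBits::eSampled   // read during culling and when building the next mip
           | vk::ImageUsageFlagBits::eStorage,  // written by the build compute shader
};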
Plan
Initialization
- Create HZB image with a mip chain
- Create a visibility buffer covering all the draws, with one bool per draw
- Can be optimized by packing this into bits (see the sketch below)
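A quick sketch of what that bit packing could look like (a hypothetical helper, not engine code; needs <cstdint> and <vector>):

struct VisibilityBits {
    std::vector<uint32_t> words; // one bit per draw, 32 draws per word

    explicit VisibilityBits(size_t drawCount)
        : words((drawCount + 31) / 32, 0) {}

    void SetVisible(size_t draw) { words[draw / 32] |= 1u << (draw % 32); }
    void SetHidden(size_t draw) { words[draw / 32] &= ~(1u << (draw % 32)); }
    bool IsVisible(size_t draw) const { return (words[draw / 32] >> (draw % 32)) & 1u; }
};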
Render loop
- Generate draws
- Generate draws based on what is currently marked visible in the visibility buffer. Also perform frustum culling to remove what isn't visible.
- First pass
- Render out all the draws generated.
- Build HZB
- Build a mip chain for the HZB using the depth buffer.
- Generate draws again
- Ignore all draws that were already processed in the first pass.
- Frustum cull
- Occlusion cull (sketched after this plan)
- Get AABB of bounding sphere in screen space
- Determine mip index of HZB based on AABB
- Sample correct LOD from the HZB
- Compare sample against nearest depth of the object we're testing
- Second pass
- Render out the newly generated draws
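The occlusion-cull steps boil down to only a few operations. Below is a hedged C++ sketch of the mip selection and the final comparison; in the real implementation this logic lives in the draw-generation compute shader, and sampleHzb/BoxCenterUV stand in for the actual texture fetch. It assumes a min-reduced HZB with reversed-Z (1 = near), matching the eMin reduction mode discussed later in this post.

// Screen-space projection of the bounding sphere, in pixels at mip 0.
struct ScreenAABB { float minX, minY, maxX, maxY; };

int SelectHzbMip(const ScreenAABB& box)
{
    // Pick the mip where the box spans roughly one texel, so a single
    // min-reduced sample conservatively covers the whole box.
    float largestSide = std::max(box.maxX - box.minX, box.maxY - box.minY);
    return static_cast<int>(std::ceil(std::log2(std::max(largestSide, 1.0f))));
}

bool IsOccluded(const ScreenAABB& box, float nearestObjectDepth)
{
    float hzbDepth = sampleHzb(BoxCenterUV(box), SelectHzbMip(box)); // placeholder fetch
    // Reversed-Z: smaller = farther. The draw is hidden when even its
    // nearest point lies behind the farthest depth stored in the HZB.
    return nearestObjectDepth < hzbDepth;
}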
Profiling results
To evaluate the profiling results I want to compare three scenes, each of which has some special properties that I'll note down. To compare before and after, I measure the following: the before state is just the duration of the geometry pass, while the after state is the duration of the geometry pass plus the HZB build and the draw generation.
Each scene has 8000 draws in total, and there is no culling happening on the CPU.
The profiling will be performed on the Steam Deck, because it's lower-end than my PC/laptop, and is also the hardware target for our project.
Scene 1
Before
- Draw generation: 0.017ms
- GBuffer pass: 3.44ms
- Total: 3.457ms
After
- Pre pass draw generation: 0.016ms
- Pre GBuffer pass: 2.17ms
- HZB generation: 0.789ms
- Second pass draw generation: 0.021ms
- Second GBuffer pass: 0ms
- Total: 2.996ms
Difference: 0.461ms decrease
Scene 2
Before
- Draw generation: 0.013ms
- GBuffer pass: 5.07ms
- Total: 5.083ms
After
- Pre pass draw generation: 0.018ms
- Pre GBuffer pass: 2.94ms
- HZB generation: 0.757ms
- Second pass draw generation: 0.022ms
- Second GBuffer pass: 0ms
- Total: 3.737ms
Difference: 1.346ms decrease
Scene 3
Before
- Draw generation: 0.014ms
- GBuffer pass: 3.74ms
- Total: 3.754ms
After
- Pre pass draw generation: 0.016ms
- Pre GBuffer pass: 3.24ms
- HZB generation: 0.725ms
- Second pass draw generation: 0.02ms
- Second GBuffer pass: 0ms
- Total: 4.001ms
Difference: 0.247ms increase
What did I learn
HZB culling
Implementing this feature made me understand the following:
- The steps required to make use of a HZB image to cull out draws
- Project the bounding box on the screen, determine the bounds, determine the mip level, and sample from that mip level at the correct UVs.
- How to build an HZB image using compute shaders.
- Make use of sampler reduction modes to downscale progressively
- How to sample from an HZB image and test the result against a bounding sphere.
- Compare the sample from the HZB against the closest depth point from the draw being tested.
- How to properly manage two passes for rendering
- Make use of the visibility buffer across frames, so it orchestrates what should be drawn when (see the sketch below).
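As a recap, the whole per-frame flow could be summarized like this; the function names are placeholders for the steps in the plan above, not our engine's API.

void RenderFrame()
{
    GenerateDraws(lastFrameVisibility);  // frustum cull + last frame's visible set
    DrawGBuffer(firstPassDraws);         // also fills the depth buffer
    BuildHzb(depthBuffer);               // down-sample depth into the HZB mip chain
    GenerateRemainingDraws(hzb);         // skip what was drawn, occlusion cull the rest
    DrawGBuffer(secondPassDraws);        // only the newly visible draws
    // The visibility buffer now reflects this frame, ready for the next one.
}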
Push descriptors
While trying to build the HZB, I ran into one major flaw: our bindless model won't work for this, because we need to write to specific mips of the image, which our model doesn't support. So back to regular descriptors. However, since we update these multiple times (one mip is the input for the next), every update overwrites the previous one. Making that work would require a lot of descriptor sets, so this didn't feel like the correct solution either.
Enter push descriptors. This is an extension (supported on the Steam Deck, of course) that allows you to push descriptor data while recording a command buffer. Your descriptor update becomes a command instead, which makes it possible to properly order all these updates.
The idea is that you first create a vk::DescriptorUpdateTemplateCreateInfo that lists all the resource descriptions, so it knows what kind of data you will be pushing through.
// Note: designated initializers must follow the member declaration order.
std::array<vk::DescriptorUpdateTemplateEntry, 2> updateTemplateEntries {
    // Binding 0: the image we sample from (previous mip, or the depth buffer).
    vk::DescriptorUpdateTemplateEntry {
        .dstBinding = 0,
        .dstArrayElement = 0,
        .descriptorCount = 1,
        .descriptorType = vk::DescriptorType::eCombinedImageSampler,
        .offset = 0,
        .stride = sizeof(vk::DescriptorImageInfo),
    },
    // Binding 1: the mip we write to as a storage image.
    vk::DescriptorUpdateTemplateEntry {
        .dstBinding = 1,
        .dstArrayElement = 0,
        .descriptorCount = 1,
        .descriptorType = vk::DescriptorType::eStorageImage,
        .offset = sizeof(vk::DescriptorImageInfo),
        .stride = sizeof(vk::DescriptorImageInfo),
    }
};
vk::DescriptorUpdateTemplateCreateInfo updateTemplateInfo {
.descriptorUpdateEntryCount = static_cast<uint32_t>(updateTemplateEntries.size()),
.pDescriptorUpdateEntries = updateTemplateEntries.data(),
.templateType = vk::DescriptorUpdateTemplateType::ePushDescriptorsKHR,
.descriptorSetLayout = _hzbImageDSL,
.pipelineBindPoint = vk::PipelineBindPoint::eCompute,
.pipelineLayout = _buildHzbPipelineLayout,
.set = 0
};
_hzbUpdateTemplate = _context->VulkanContext()->Device().createDescriptorUpdateTemplate(updateTemplateInfo);
After that, you can list the descriptor infos you want to write, and push them into the command buffer using the matching update template!
// Input: the previous mip (or the depth buffer for the first dispatch).
vk::DescriptorImageInfo inputImageInfo {
    .imageView = inputTexture,
    .imageLayout = vk::ImageLayout::eShaderReadOnlyOptimal,
};
// Output: the mip being written by this dispatch.
vk::DescriptorImageInfo outputImageInfo {
    .imageView = outputTexture,
    .imageLayout = vk::ImageLayout::eGeneral,
};
commandBuffer.pushDescriptorSetWithTemplateKHR<std::array<vk::DescriptorImageInfo, 2>>(
    _hzbUpdateTemplate, _buildHzbPipelineLayout, 0,
    { inputImageInfo, outputImageInfo }, _context->VulkanContext()->Dldi());
The only things omitted here are some flags and extensions that need to be set to make this work properly.
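To tie the pieces together, the per-mip build loop could look roughly like this; hzbMipViews, hzbSize, and the 8×8 workgroup size are assumptions for this sketch, not necessarily the engine's actual values.

for (uint32_t mip = 0; mip < mipCount; ++mip)
{
    // Mip 0 reads the depth buffer; every later mip reads the previous mip.
    vk::DescriptorImageInfo inputInfo {
        .imageView = (mip == 0) ? depthImageView : hzbMipViews[mip - 1],
        .imageLayout = vk::ImageLayout::eShaderReadOnlyOptimal,
    };
    vk::DescriptorImageInfo outputInfo {
        .imageView = hzbMipViews[mip],
        .imageLayout = vk::ImageLayout::eGeneral,
    };
    commandBuffer.pushDescriptorSetWithTemplateKHR<std::array<vk::DescriptorImageInfo, 2>>(
        _hzbUpdateTemplate, _buildHzbPipelineLayout, 0,
        { inputInfo, outputInfo }, _context->VulkanContext()->Dldi());

    uint32_t mipSize = std::max(hzbSize >> mip, 1u);
    commandBuffer.dispatch((mipSize + 7) / 8, (mipSize + 7) / 8, 1);

    // An image barrier belongs here so the next iteration sees the finished
    // writes (omitted for brevity, like the flags mentioned above).
}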
Sampler reduction modes
While looking at examples of HZB culling algorithms, I found that the original author of the paper uses sampler reduction modes to sample the depth buffer and HZB. This is interesting because normally you would have to take multiple samples and determine which one is the lowest yourself. For example, when building your HZB you need to look at the previous mip (starting with the depth buffer) and keep the lowest value of each 2×2 block of pixels while down-sampling (since each dimension is halved).
However, using the sampler reduction mode extension, we can apply a reduction mode of vk::SamplerReductionMode::eMin to our vk::Sampler, and it will automatically return the lowest value when down-sampling. This can be seen as an alternative behaviour for bilinear filtering: instead of receiving a weighted average, you receive either the highest or lowest of the sampled values.
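Setting this up only requires chaining a vk::SamplerReductionModeCreateInfo into the sampler creation. A minimal sketch, assuming a device handle named device:

vk::SamplerReductionModeCreateInfo reductionInfo {
    .reductionMode = vk::SamplerReductionMode::eMin, // keep the lowest sample
};
vk::SamplerCreateInfo samplerInfo {
    .pNext = &reductionInfo,
    .magFilter = vk::Filter::eLinear,
    .minFilter = vk::Filter::eLinear,
    .mipmapMode = vk::SamplerMipmapMode::eNearest,
    .addressModeU = vk::SamplerAddressMode::eClampToEdge,
    .addressModeV = vk::SamplerAddressMode::eClampToEdge,
    .maxLod = vk::LodClampNone, // allow sampling every HZB mip
};
vk::Sampler hzbSampler = device.createSampler(samplerInfo);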
Conclusion
I've been able to successfully implement the two-pass HZB occlusion culling algorithm. Following my research and the reference implementation, I integrated the algorithm into our engine and can use it to cull draws as an optimization.
Looking at the profiling, it can be seen that there is an improvement in performance. This is not always the case, but it is in the common one. The new approach has a higher up-front cost because of the HZB build; the extra draw-generation compute passes are barely noticeable, even on the Steam Deck.
The cases in which it is less performant are mainly when there are no draws to cull; then building the HZB is just extra time. I also think this will be especially effective in scenes with more geometry and triangles, since that would save even more time than in these low-poly scenes.
Resources
- GPU Zen 3, chapter 4, Two-Pass HZB Occlusion Culling
- Describes the two-pass implementation well
- https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501
- https://www.rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/
- https://www.youtube.com/watch?v=gCPgpvF1rUA
- https://www.nickdarnell.com/hierarchical-z-buffer-occlusion-culling/
- Shows a complete compute workflow
- Outdated implementation
- https://github.com/jstefanelli/vkOcclusionTest/blob/master/shaders/query.comp
- A complete compute implementation, using draw commands
- https://github.com/milkru/vulkanizer
- Two-pass implementation
- https://gist.github.com/edecoux/8a44614f135104f20aa0babafbcdcf5d
- Confirms what to draw during first and second pass.
- https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_sampler_filter_minmax.html
- The extension for using reduction modes in Vulkan.