Creating Minecraft in One Week with C++ and Vulkan Week 2

Last week, I set up the basic renderer for the voxel game. This week, I’m going to write the terrain generator.

Day 8

I added texturing to the chunks. The textures are stored in a texture array instead of a texture atlas. I plan to add mipmapping later, so using a texture array means I won’t have to deal with textures bleeding into each other at higher mip levels.

This texture array comes with its own descriptor pool and sampler.

The rest of the day was spent refactoring the render graph to handle synchronization better. Before now, edges connected two nodes directly and all synchronization was defined using information for the whole node. But there needs to be a way to specify multiple kinds of resources used by a single node. For example, the ChunkRenderer node needs the swapchain image in VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL, but it also needs the block textures in VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL. The old render graph could only define one layout transition for all images used by a node.

To fix this, I added the concept of usages to render graph nodes. A node can define any number of image usages and buffer usages. An image usage defines its image layout, access mask, and pipeline stage. A buffer usage only defines an access mask and pipeline stage.

The ChunkRenderer defines its swapchain image usage with the values needed to write to a swapchain image. ChunkRenderer also defines an image usage for the block textures, with the values needed to sample from a texture.

Render graph edges now connect two usages, instead of two nodes. The edge can now specify one layout transition for AcquireNode.imageUsage to ChunkRenderer.swapchainImageUsage, and a different transition for TransferNode.imageUsage to ChunkRenderer.textureUsage.

The nodes need to call sync for a specific usage, instead of for the entire node. A subset of a buffer or image can be specified when calling sync. So you can sync only a range within a buffer or a specific image subresource in an image.
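The usage and edge idea might look roughly like the following. This is only a sketch: every name in it is hypothetical, the enum and integer masks are stand-ins for VkImageLayout, VkAccessFlags, and VkPipelineStageFlags so that it compiles without the Vulkan headers, and the real render graph classes are not shown in this post.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the Vulkan types (assumption, so the sketch is self-contained).
enum class ImageLayout { Undefined, TransferDstOptimal, ColorAttachmentOptimal, ShaderReadOnlyOptimal };
using AccessMask = std::uint32_t;
using StageMask = std::uint32_t;

// What a node declares about one image it touches.
struct ImageUsage {
    ImageLayout layout;
    AccessMask access;
    StageMask stage;
};

// Buffers have no layout, so a buffer usage is just access + stage.
struct BufferUsage {
    AccessMask access;
    StageMask stage;
};

// An edge connects a producer usage to a consumer usage, not node to node.
struct ImageEdge {
    const ImageUsage* source;
    const ImageUsage* dest;
};

// Everything a pipeline barrier needs can be derived from the two usages.
struct ImageBarrier {
    ImageLayout oldLayout;
    ImageLayout newLayout;
    AccessMask srcAccess;
    AccessMask dstAccess;
    StageMask srcStage;
    StageMask dstStage;
};

ImageBarrier makeBarrier(const ImageEdge& edge) {
    assert(edge.source != nullptr && edge.dest != nullptr);
    return ImageBarrier{
        edge.source->layout, edge.dest->layout,
        edge.source->access, edge.dest->access,
        edge.source->stage,  edge.dest->stage,
    };
}
```

Because each edge carries its own pair of usages, one node can receive a different layout transition on each incoming edge.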

I added ChunkManager to handle loading chunks. Right now it just loads the chunk the player is currently in and it never unloads chunks.

The texture files are copied into the final build by using a truly disgusting CMake file.

set(TEXTURES
    "resources/dirt.png"
    "resources/grass_side.png"
    "resources/grass_top.png"
    "resources/stone.png"
)
set(RESOURCES_TEXTURES)

foreach(TEXTURE ${TEXTURES})
    get_filename_component(FILE_NAME ${TEXTURE} NAME)
    set(TEXTURE_OUT "${PROJECT_BINARY_DIR}/resources/${FILE_NAME}")
    add_custom_command(
        OUTPUT ${TEXTURE_OUT}
        COMMAND ${CMAKE_COMMAND} -E copy "${PROJECT_SOURCE_DIR}/${TEXTURE}" ${TEXTURE_OUT}
        MAIN_DEPENDENCY "${PROJECT_SOURCE_DIR}/${TEXTURE}"
    )
    list(APPEND RESOURCES_TEXTURES ${TEXTURE_OUT})
endforeach(TEXTURE)

add_custom_target(
    CopyTextures
    DEPENDS ${RESOURCES_TEXTURES}
)

add_dependencies("Game" CopyTextures)

I used to not understand CMake. I still don’t, but I used to not, too.

Multiple chunks with multiple textures

Day 9

I wrote a new render graph node, the MipmapGenerator. The MipmapGenerator generates mipmaps.

MipmapGenerator defines two image usages, an input usage and an output usage. Other systems can submit an image to generate mipmaps for. This node will transition the image internally, but ensures that it still matches the declared usage when it finishes.

Now, there’s an edge from TransferNode.imageUsage to MipmapGenerator.inputUsage and another edge from MipmapGenerator.outputUsage to ChunkRenderer.textureUsage. The other edges going to ChunkRenderer were not affected.

The internal transitions occur while the mipmaps are being generated, according to this chapter of vulkan-tutorial.com. (By the way, I wrote this chapter of the tutorial myself 😎)

Mip level 0 is transitioned to TRANSFER_SRC_OPTIMAL by the render graph edge. Then, for each following mip level i, level i is transitioned to TRANSFER_DST_OPTIMAL, level i - 1 is blitted to level i, and level i is transitioned to TRANSFER_SRC_OPTIMAL. Afterwards, every mip level is in TRANSFER_SRC_OPTIMAL. Then another render graph edge transitions the whole texture to SHADER_READ_ONLY_OPTIMAL for use in the ChunkRenderer.

I also started fleshing out the ChunkManager. The implementation here ended up being too slow and was hard to reason about, but it was enough to load a small radius of chunks around the player. There’s still no terrain generator, so the chunks are just solid dirt.
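The order of transitions and blits can be modeled without any Vulkan calls. The sketch below is an illustration only: the Layout enum is a stand-in for VkImageLayout, and planMipBlits, mipExtent, and mipLevelCount are hypothetical helper names. It plans one blit per mip level and tracks the layout each level would be in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

enum class Layout { Undefined, TransferSrc, TransferDst };  // stand-in for VkImageLayout

struct MipExtent { std::uint32_t width; std::uint32_t height; };
struct Blit { std::uint32_t srcLevel; std::uint32_t dstLevel; MipExtent src; MipExtent dst; };

// Levels in a full mip chain: floor(log2(max(width, height))) + 1.
std::uint32_t mipLevelCount(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0);
    std::uint32_t levels = 1;
    for (std::uint32_t size = std::max(width, height); size > 1; size /= 2) ++levels;
    return levels;
}

// Each level halves the previous one, clamped to 1 texel.
MipExtent mipExtent(std::uint32_t width, std::uint32_t height, std::uint32_t level) {
    return MipExtent{std::max<std::uint32_t>(1, width >> level),
                     std::max<std::uint32_t>(1, height >> level)};
}

// Plans the blits and records the layout every level ends up in.
std::vector<Blit> planMipBlits(std::uint32_t width, std::uint32_t height,
                               std::vector<Layout>& layouts) {
    const std::uint32_t levels = mipLevelCount(width, height);
    layouts.assign(levels, Layout::Undefined);
    layouts[0] = Layout::TransferSrc;          // done by the incoming render graph edge

    std::vector<Blit> blits;
    for (std::uint32_t i = 1; i < levels; ++i) {
        layouts[i] = Layout::TransferDst;      // barrier before the blit
        assert(layouts[i - 1] == Layout::TransferSrc);
        blits.push_back({i - 1, i, mipExtent(width, height, i - 1), mipExtent(width, height, i)});
        layouts[i] = Layout::TransferSrc;      // barrier after the blit
    }
    return blits;                              // every level is now TransferSrc
}
```

In real code each planned blit would become a vkCmdBlitImage call with a pipeline barrier on either side.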

I optimized the chunk vertex buffer formats. Since the vertex positions can only be integers in the range [0, 16], I didn’t need to use a 32-bit format. So instead of using VK_FORMAT_R32G32B32_SINT (equivalent to glm::ivec3), I could use VK_FORMAT_R8G8B8A8_SINT (equivalent to glm::i8vec4). Note that there is now an unused alpha channel, since the R8G8B8_SINT format is not widely supported. This is still a win since this reduces memory usage from 12 bytes per vertex to 4 bytes. I did the same with the UV attribute.
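As a rough illustration of the size difference, the two layouts can be written as plain structs. The struct and function names here are hypothetical, and the glm types are replaced by fixed-width integers so the sketch has no dependencies.

```cpp
#include <cassert>
#include <cstdint>

// Matches VK_FORMAT_R32G32B32_SINT: three 32-bit components.
struct UnpackedPosition {
    std::int32_t x, y, z;
};

// Matches VK_FORMAT_R8G8B8A8_SINT: four 8-bit components, the last one unused.
struct PackedPosition {
    std::int8_t x, y, z, unused;
};

static_assert(sizeof(UnpackedPosition) == 12, "12 bytes per vertex position");
static_assert(sizeof(PackedPosition) == 4, "4 bytes per vertex position");

// Chunk-local coordinates fit in [0, 16], well inside the int8_t range.
PackedPosition packPosition(int x, int y, int z) {
    assert(x >= 0 && x <= 16 && y >= 0 && y <= 16 && z >= 0 && z <= 16);
    return PackedPosition{static_cast<std::int8_t>(x),
                          static_cast<std::int8_t>(y),
                          static_cast<std::int8_t>(z),
                          0};
}
```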

Another memory optimization is using the same index buffer for all chunks. All indices follow the same pattern to draw a quad with two triangles, so there’s no reason to have a separate index buffer for each chunk. I just need to allocate one index buffer that can handle the worst-case number of quads that a chunk could have. This saves a lot of memory when shared between a large number of chunks. I could switch the index format from uint32_t to uint16_t, but at this point the memory savings wouldn’t matter that much.

I also fixed a serious bug in the memory allocation for buffers. I’ve been using the Vulkan Memory Allocator library to handle suballocating buffers from the same VkDeviceMemory objects. However, I made a typo when requesting allocations from the library.
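The repeating index pattern is simple enough to generate once up front. A minimal sketch, assuming four vertices per quad and a 0-1-2, 2-3-0 winding (the actual winding order and the function name are assumptions, not taken from the game’s code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds indices for `quadCount` quads where quad q uses vertices 4q .. 4q+3.
std::vector<std::uint32_t> buildQuadIndices(std::uint32_t quadCount) {
    assert(quadCount <= 0xFFFFFFFFu / 4);  // vertex indices must fit in 32 bits

    static const std::uint32_t pattern[6] = {0, 1, 2, 2, 3, 0};  // two triangles per quad

    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(quadCount) * 6);
    for (std::uint32_t quad = 0; quad < quadCount; ++quad) {
        const std::uint32_t base = quad * 4;
        for (std::uint32_t offset : pattern) {
            indices.push_back(base + offset);
        }
    }
    return indices;
}
```

A chunk with N quads then draws the first 6 * N indices of the shared buffer against its own vertex buffer.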

Instead of writing what I meant, which was to tell VMA that the buffer should live on the GPU:

VmaAllocationCreateInfo allocInfo = {};
allocInfo.usage = VMA_MEMORY_USAGE_GPU_ONLY;

I wrote this:

VmaAllocationCreateInfo allocInfo = {};
allocInfo.flags = VMA_MEMORY_USAGE_GPU_ONLY;

I set the flags field instead of the usage field. VMA interpreted that value as a flag that requested a dedicated allocation (VMA_MEMORY_USAGE_GPU_ONLY and VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT both have the value 1). That meant every vertex buffer was getting its own VkDeviceMemory object. I didn’t notice this until today since there were only a handful of chunks before. When the view distance for ChunkManager was set to a large value, the game quickly hit the maximum number of VkDeviceMemory allocations (maxMemoryAllocationCount), which is 4096 on my machine, and crashed.

Day 10

I added the BlockManager class to hold data on how each block should be textured. Every block type is registered in BlockManager and says which array layer of the texture should be used for each side of a block.

I also refactored the ChunkManager to use a much simpler method to determine which chunks should be loaded and unloaded.

Now, the game can load a world that is just an infinite flat field. However, I got caught up trying to fix a bug with synchronizing the chunk meshes, so I didn’t get much done for the rest of the day.

Day 11

The synchronization bug ended up being extremely simple. It turns out I wasn’t synchronizing the chunk meshes at all.

I added multithreading today. I defined two different queue types to handle sending data between threads.

BlockingQueue is filled by the main thread and emptied by the worker thread. The queue has a fixed size (16 right now). The main thread simply skips adding to the queue when it is full. The worker thread blocks if the queue is empty.
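A bounded queue with this behavior can be built from a mutex and a condition variable. The following is a minimal sketch rather than the game’s actual class: the method names tryPush and pop are assumptions, and it leaves out any shutdown mechanism for waking a blocked worker.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

template <typename T>
class BlockingQueue {
public:
    explicit BlockingQueue(std::size_t capacity) : m_capacity(capacity) {
        assert(capacity > 0);
    }

    // Producer side (main thread): never blocks. Returns false when the queue is full.
    bool tryPush(T item) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_items.size() >= m_capacity) {
                return false;
            }
            m_items.push(std::move(item));
        }
        m_notEmpty.notify_one();
        return true;
    }

    // Consumer side (worker thread): blocks until an item is available.
    T pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notEmpty.wait(lock, [this] { return !m_items.empty(); });
        T item = std::move(m_items.front());
        m_items.pop();
        return item;
    }

private:
    std::size_t m_capacity;
    std::mutex m_mutex;
    std::condition_variable m_notEmpty;
    std::queue<T> m_items;
};
```

When tryPush returns false, the main thread can simply try the same chunk again on a later frame.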

BufferedQueue is for sending results back to the main thread. This class holds two queues, one of which is being filled by the worker thread at any given time. When the main thread is ready to start reading, BufferedQueue swaps the two queues and gives the filled queue to the main thread. This is like how the GPU double buffers swapchain images.

These classes aren’t terribly complicated, but I wanted to minimize mutex contention without pulling in a dependency like Boost.

The worker thread in this case is the chunk mesher, ChunkUpdater. It creates its own thread where the meshes are generated. The mesh transfer still needs to occur on the main thread, but a large amount of work has been moved to another thread.
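The double-buffering idea can be sketched with two vectors and one mutex. This is an assumed design, not the original class: the names push and swapBuffers are hypothetical, and it assumes exactly one thread reads.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

template <typename T>
class BufferedQueue {
public:
    // Worker thread: append a finished result to the back buffer.
    void push(T item) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_back.push_back(std::move(item));
    }

    // Main thread: discard the results it already consumed, then take everything
    // the worker has produced so far. The returned reference stays valid until
    // the next call to swapBuffers.
    std::vector<T>& swapBuffers() {
        m_front.clear();
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            std::swap(m_front, m_back);
        }
        return m_front;
    }

private:
    std::mutex m_mutex;
    std::vector<T> m_back;   // guarded by m_mutex, written by the worker
    std::vector<T> m_front;  // touched only by the main thread
};
```

The lock is held only for a push or a swap, so the main thread can iterate over the front buffer without blocking the worker.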

The fixed size BlockingQueue ensures that at most 16 chunks will have their meshes generated in one frame and that (on average) only 16 meshes need to be transferred in one frame.

Day 12

The game started becoming painfully slow around this time. With a view distance of 16, the game quickly slows down to about 30 FPS (on my machine), even before all the chunk meshes are generated.

The first optimization was something that I wanted to add anyways, frustum culling. I followed this article from Lighthouse3D.com. Frustum culling ended up being simple to add, since the chunks can be represented by axis aligned bounding boxes.
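One common way to test an axis aligned bounding box against a frustum is the "positive vertex" test: for each plane, check only the box corner that lies farthest along the plane normal. The sketch below shows that test under the assumption that the six planes are already extracted and their normals point into the frustum. The types and function names are hypothetical and stand in for whatever math library the game uses.

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };

// Plane satisfying dot(normal, p) + d = 0, with the normal pointing into the frustum.
struct Plane { Vec3 normal; float d; };

struct Aabb { Vec3 min; Vec3 max; };

// True if the whole box lies on the negative (outside) side of the plane.
bool isOutsidePlane(const Plane& plane, const Aabb& box) {
    // Pick the corner farthest along the normal; if even that corner is behind
    // the plane, every other corner is too.
    const Vec3 p{
        plane.normal.x >= 0.0f ? box.max.x : box.min.x,
        plane.normal.y >= 0.0f ? box.max.y : box.min.y,
        plane.normal.z >= 0.0f ? box.max.z : box.min.z,
    };
    return plane.normal.x * p.x + plane.normal.y * p.y + plane.normal.z * p.z + plane.d < 0.0f;
}

// Conservative test: may keep a few boxes near frustum corners, never culls a visible one.
bool isVisible(const std::array<Plane, 6>& frustum, const Aabb& box) {
    for (const Plane& plane : frustum) {
        if (isOutsidePlane(plane, box)) {
            return false;
        }
    }
    return true;
}
```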

Even though frustum culling reduced the number of draw calls, the game’s performance didn’t improve. Profiling revealed that over 50% of the game’s execution time was spent managing resource lifetimes.

The two functions that manage resource lifetime

Recall that the sync function copies the shared_ptr internally to extend a resource’s lifetime. The other function, clearSync, is what destroys the shared_ptr on later frames. These two functions increment and decrement the reference count of the shared resource.

With a view distance of 16, there could be about 10,000 chunks loaded at a time. Each chunk’s mesh has two vertex buffers and an index buffer. So there would be about 30,000 increments and 30,000 decrements per frame.

This was the bottleneck for the game. The best option would be to stop using shared_ptr here. However, I would need some other way to prevent resources from being destroyed while in use.

I ended up using a queue of vectors, std::queue<std::vector<T>>, to delay resource destruction. If a resource is destroyed, the resource destructor transfers its internal VkBuffer or VkImage into the vector at the back of the queue. The render graph destroys the vector at the front of the queue once per frame and adds an empty vector to the end of the queue. The queue is initialized with one empty vector per frame in flight.

This removes almost all shared_ptr operations from the game. The meshes still use shared_ptr since the index buffer is shared between all chunks. But that only changes the reference count when a chunk is created or destroyed, not thousands of times per frame. A side effect of this change is that sync no longer has to be called if a resource is used.
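A minimal sketch of such a delayed-destruction queue follows. The class name, method names, and the destroy callback are assumptions made for the example. In real code the callback would call something like vkDestroyBuffer, since raw Vulkan handles are not released just by dropping them from a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

template <typename T>
class DeferredDestroyQueue {
public:
    DeferredDestroyQueue(std::size_t framesInFlight, std::function<void(T&)> destroy)
        : m_destroy(std::move(destroy)) {
        assert(framesInFlight > 0);
        for (std::size_t i = 0; i < framesInFlight; ++i) {
            m_frames.emplace();  // one empty vector per frame in flight
        }
    }

    // Called from a resource destructor instead of destroying the handle right away.
    void retire(T handle) {
        m_frames.back().push_back(std::move(handle));
    }

    // Called once per frame: handles retired `framesInFlight` frames ago can no
    // longer be referenced by in-flight GPU work, so they are destroyed now.
    void nextFrame() {
        for (T& handle : m_frames.front()) {
            m_destroy(handle);
        }
        m_frames.pop();
        m_frames.emplace();
    }

private:
    std::function<void(T&)> m_destroy;
    std::queue<std::vector<T>> m_frames;
};
```

With two frames in flight, a handle retired during frame N is destroyed by the second nextFrame call after it.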

The game now maintains 100 FPS while chunk meshes are being generated and over 250 FPS after all meshes are generated.

There’s still an obvious problem though. The chunks are generated in the order they are added by the ChunkManager. I wrote a priority queue class to generate the closer chunks first.

C++ provides the std::priority_queue class, but it has two limitations that prevent me from using it. The first is that arbitrary items can’t be removed once they have been added; only the top item can be popped. I need to be able to remove chunks if the player moves far enough away. The second is that the priority of the items can’t be changed. Since the priority is based on distance, I need to change it every time the player moves to a new chunk.

My custom PriorityQueue class lets me handle both of these problems.
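The post doesn’t show the class itself. One simple way to get both features, sketched here as an assumption rather than as the actual implementation, is to pair an ordered set of (priority, key) entries with a map from key to its current priority. All names below are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>

// Lower priority values are popped first (e.g. distance from the player).
template <typename Key, typename Priority>
class UpdatablePriorityQueue {
public:
    // Inserts the key, or changes its priority if it is already queued.
    void pushOrUpdate(const Key& key, Priority priority) {
        remove(key);
        m_order.emplace(priority, key);
        m_priorities.emplace(key, priority);
    }

    // Removes the key if present. Returns whether anything was removed.
    bool remove(const Key& key) {
        auto it = m_priorities.find(key);
        if (it == m_priorities.end()) {
            return false;
        }
        m_order.erase(std::make_pair(it->second, key));
        m_priorities.erase(it);
        return true;
    }

    bool empty() const { return m_order.empty(); }

    // Removes and returns the key with the lowest priority value.
    Key pop() {
        assert(!m_order.empty());
        auto first = m_order.begin();
        Key key = first->second;
        m_priorities.erase(key);
        m_order.erase(first);
        return key;
    }

private:
    std::set<std::pair<Priority, Key>> m_order;  // sorted by priority, then key
    std::map<Key, Priority> m_priorities;        // current priority of each key
};
```

Every operation is O(log n). When the player crosses a chunk boundary, the caller would call pushOrUpdate again for each queued chunk with its new distance.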

Day 13

Now, with a world that can load chunks around the player, I can start writing the terrain generator.

I used the FastNoise library to handle the noise generation. Then I used fractal simplex noise to generate the terrain. This creates an infinite plain of gently rolling hills.

32 chunk view distance. 5X speed.

I thought the terrain was a little too boring, so I decided to add cave generation. I used the technique described at the end of this article. I used two noise generators with different seeds to produce two ridged multifractal noise values. The two values are multiplied together and compared to a threshold. This produces caves where the two noise generators “intersect”.
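The cave test itself reduces to one comparison per block. In this sketch the two inputs are assumed to be ridged multifractal samples taken at the same position from two differently seeded generators. The function name, the direction of the comparison, and any threshold value are assumptions, since the post doesn’t give them.

```cpp
#include <cassert>

// noiseA and noiseB: samples from two differently seeded ridged noise generators,
// taken at the same block position. Each is large only close to one of its ridges.
bool isCave(float noiseA, float noiseB, float threshold) {
    // The product is large only where both values are large at once, i.e. where a
    // ridge from one generator crosses a ridge from the other.
    return noiseA * noiseB > threshold;
}
```

Raising the threshold makes the caves narrower and rarer, and lowering it opens them up.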

The same terrain with caves
A view of the caves underground.

The other change I made this day was adding visibility checks that can cross chunk boundaries. Before, the underground shot wouldn’t have been possible, since chunks wouldn’t read neighbor chunks when checking visibility. So a completely solid underground would still have meshes at every chunk boundary.
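One way to make the visibility check cross chunk boundaries is to look blocks up by world coordinate and let the lookup find the right chunk. The sketch below assumes 16x16x16 chunks, block ID 0 for air, and that unloaded chunks count as air. All type and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

using BlockId = std::uint8_t;
constexpr BlockId kAir = 0;
constexpr int kChunkSize = 16;  // assumed cubic chunks

struct Chunk {
    std::array<BlockId, kChunkSize * kChunkSize * kChunkSize> blocks{};

    BlockId at(int x, int y, int z) const {
        assert(x >= 0 && x < kChunkSize && y >= 0 && y < kChunkSize && z >= 0 && z < kChunkSize);
        return blocks[(z * kChunkSize + y) * kChunkSize + x];
    }
};

using ChunkCoord = std::tuple<int, int, int>;
using World = std::map<ChunkCoord, Chunk>;

// Integer division that rounds toward negative infinity, so block -1 maps to chunk -1.
int floorDiv(int a, int b) {
    return a >= 0 ? a / b : -((-a + b - 1) / b);
}

// Looks a block up by world coordinate, crossing into whichever chunk owns it.
BlockId getBlock(const World& world, int x, int y, int z) {
    const int cx = floorDiv(x, kChunkSize);
    const int cy = floorDiv(y, kChunkSize);
    const int cz = floorDiv(z, kChunkSize);
    auto it = world.find(ChunkCoord{cx, cy, cz});
    if (it == world.end()) {
        return kAir;  // assumption: unloaded chunks are treated as empty
    }
    return it->second.at(x - cx * kChunkSize, y - cy * kChunkSize, z - cz * kChunkSize);
}

// A face needs a quad only if the block it faces is air, even across a chunk boundary.
bool isFaceVisible(const World& world, int x, int y, int z, int dx, int dy, int dz) {
    return getBlock(world, x + dx, y + dy, z + dz) == kAir;
}
```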

Day 14

I wanted something a little more interesting than a black background to look at. So I got a free skybox texture and added a skybox renderer.

The skybox is stored in a cubemap, which is really just an array texture and a special image view. The texture has 6 layers that contain the 6 skybox images. The image view uses VK_IMAGE_VIEW_TYPE_CUBE, which tells Vulkan to interpret the 6 layers as a cubemap. The shader samples from a uniform textureCube variable.

Instead of using a box shaped mesh to render the skybox, I used a fullscreen post processing pass. The skybox renderer draws a fullscreen triangle with the clip space Z set to 1. Anywhere the depth buffer is still set to 1 is colored in by the skybox.
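A fullscreen triangle is often generated from the vertex index alone, with no vertex buffer. The post doesn’t say how its triangle is built, so the following is only an illustration of that common trick, written in C++ instead of shader code. The function and struct names are hypothetical.

```cpp
#include <cassert>

struct ClipPosition { float x, y, z, w; };

// Maps vertex index 0, 1, 2 to (-1,-1), (3,-1), (-1,3): one oversized triangle
// that covers the whole [-1, 1] clip-space square once it is clipped.
ClipPosition fullscreenTriangleVertex(unsigned vertexIndex) {
    assert(vertexIndex < 3);
    const float u = static_cast<float>((vertexIndex << 1) & 2u);  // 0, 2, 0
    const float v = static_cast<float>(vertexIndex & 2u);         // 0, 0, 2
    // Z = 1 puts the triangle on the far plane, behind all chunk geometry.
    return ClipPosition{u * 2.0f - 1.0f, v * 2.0f - 1.0f, 1.0f, 1.0f};
}
```

For the skybox to fill pixels where the depth buffer still holds the cleared value of 1, the depth compare op would need to pass on equality, for example VK_COMPARE_OP_LESS_OR_EQUAL.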

The vertex shader needs only the rotation of the camera, so the translation components of the view matrix are zeroed out in the shader. Then it calculates the world space view direction of each vertex, so that the fragment shader can sample the skybox texture correctly. It took me most of the day to realize that I needed to use the inverse of the MVP matrix, instead of the MVP matrix.

Using this technique lets me avoid the arduous task of defining cube coordinates in code.

Conclusion

This week I set up the basic terrain renderer. Next week I’ll add more complicated terrain features and lighting.

Here’s a preview of something I worked on in Week 3:
