[FIXED] TransferTask semaphore wait stages #968

Merged
1 commit merged into vsg-dev:master on Sep 16, 2023

Conversation

siystar

@siystar commented Sep 16, 2023

Pull Request Template

Description

Fixed semaphore wait stages for transferCompletedSemaphore.

The Vulkan spec 7.4.2 "Semaphore waiting" states: "[...] the second synchronization scope is limited to operations on the pipeline stages determined by the destination stage mask specified by the corresponding element of pWaitDstStageMask."

Note that while waiting in a particular pipeline stage guarantees that any logically later pipeline stages also wait, the transfer, graphics and compute pipelines are independent, so a wait stage mask of VK_PIPELINE_STAGE_TRANSFER_BIT does not imply that any other pipeline stages wait.
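
As an illustration of the fix's intent, here is a minimal sketch in plain Vulkan (not the actual TransferTask code; the stage mask shown is an assumption about which stages consume the transferred data):

    #include <vulkan/vulkan.h>

    // Hypothetical helper, not the VSG API: submit the frame's graphics work so
    // that it waits on transferCompletedSemaphore in the stages that actually
    // consume the transferred data. Waiting only in VK_PIPELINE_STAGE_TRANSFER_BIT
    // would leave the vertex/fragment stages free to run ahead of the copy.
    void submitGraphicsWaitingOnTransfer(VkQueue graphicsQueue,
                                         VkCommandBuffer graphicsCommandBuffer,
                                         VkSemaphore transferCompletedSemaphore)
    {
        // Assumed consumer stages for this sketch.
        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT |
                                         VK_PIPELINE_STAGE_VERTEX_SHADER_BIT |
                                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;

        VkSubmitInfo submitInfo{};
        submitInfo.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submitInfo.waitSemaphoreCount = 1;
        submitInfo.pWaitSemaphores    = &transferCompletedSemaphore;
        submitInfo.pWaitDstStageMask  = &waitStage; // second synchronization scope = these stages only
        submitInfo.commandBufferCount = 1;
        submitInfo.pCommandBuffers    = &graphicsCommandBuffer;

        vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    }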

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

General testing

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@robertosfield merged commit a1f7d86 into vsg-dev:master on Sep 16, 2023
@timoore
Contributor

timoore commented Sep 16, 2023

I was hoping that this would fix a bug that I've been chasing down, but not quite. Basically, in vsgCs I'm modifying uniform buffers and seeing flashing in the output. This is on an Nvidia 1070, which has a separate graphics queue. The Vulkan synchronization layer produces a spew of errors like:

        [0] 0x220dc70, type: 4, name: NULL
SYNC-HAZARD-WRITE-RACING-READ(ERROR / SPEC): msgNum: -860391127 - Validation Error: [ SYNC-HAZARD-WRITE-RACING-READ ] Object 0: handle = 0x220dc70, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xccb77929 | vkQueueSubmit: Hazard WRITE_RACING_READ for entry 0, VkCommandBuffer 0x28c6a30[], Recorded access info (recorded_usage: SYNC_COPY_TRANSFER_WRITE, command: vkCmdCopyBuffer, seq_no: 5, reset_no: 1453). Access info (prior_usage: SYNC_FRAGMENT_SHADER_UNIFORM_READ, read_barriers: VK_PIPELINE_STAGE_2_BOTTOM_OF_PIPE_BIT, queue: VkQueue 0x2108000[], submit: 4483, batch: 0, batch_tag: 70343, command: vkCmdDrawIndexed, seq_no: 18, command_buffer: VkCommandBuffer 0x2729650[], reset_no: 1493).
    Objects: 1
        [0] 0x220dc70, type: 4, name: NULL

I know that this isn't a proper bug report, but I need to do more investigation, and I hope that @siystar will be inspired to look too 😎

@robertosfield
Collaborator

@timoore I think it would be best to create an Issue report for this with details of how to reproduce it. Info on how vsgCs is creating and updating the uniform data would be helpful too.

@siystar
Author

siystar commented Sep 16, 2023

@timoore The semaphore changed in this PR only enforces the order of transfer and graphics commands for a particular frame.

There is another semaphore that is used to enforce the ordering of frames relative to one another. That semaphore currently uses vsg::Semaphore's default pWaitDstStageMask of BOTTOM_OF_PIPE_BIT, which I think is incorrect. The Vulkan spec states that BOTTOM_OF_PIPE_BIT, when used as a dstStageMask, is equivalent to specifying no pipeline stages. Not specifying any stages to wait on renders the semaphore ineffective, so the order of frames is not currently enforced except by implicit ordering constraints like drawing order. The Vulkan spec also states that the TOP/BOTTOM pipeline stages are deprecated, so ideally we should cease using them.

To test, you could open vsg/vk/Semaphore.h and change the default stage mask to VK_PIPELINE_STAGE_ALL_COMMANDS_BIT.
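
For reference, the kind of one-line experiment being suggested might look roughly like this (the exact declaration in vsg/vk/Semaphore.h and the parameter name may differ; only the default value of the stage mask matters):

    // Sketch only; the surrounding declaration in vsg/vk/Semaphore.h is elided
    // and the parameter name is assumed.
    //
    // Current default - as a dstStageMask, BOTTOM_OF_PIPE includes no stages,
    // so the semaphore wait orders nothing:
    //     VkPipelineStageFlags pipelineStageFlags = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT;
    //
    // Suggested experiment - wait in all stages, the conservative catch-all:
    VkPipelineStageFlags pipelineStageFlags = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;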

@timoore
Contributor

timoore commented Sep 16, 2023

I think there are two problems relevant to doing transfers on a different queue from the graphics queue, and I'll open an issue about it. Basically the transfer task is not waiting for previous rendering submissions to finish before writing into buffers, and there are no queue ownership transfers.
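
For context, a queue family ownership transfer for an exclusively-shared buffer looks roughly like the following release barrier on the transfer queue, paired with a matching acquire barrier on the graphics queue (illustration only, not existing VSG code):

    #include <vulkan/vulkan.h>

    // Illustration of a queue-family ownership "release" for a buffer created
    // with VK_SHARING_MODE_EXCLUSIVE; a matching "acquire" barrier with the same
    // queue family indices must be recorded on the graphics queue before the
    // buffer is read there.
    void releaseBufferToGraphicsQueue(VkCommandBuffer transferCommandBuffer,
                                      VkBuffer buffer,
                                      uint32_t transferQueueFamily,
                                      uint32_t graphicsQueueFamily)
    {
        VkBufferMemoryBarrier release{};
        release.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
        release.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
        release.dstAccessMask       = 0; // ignored by the releasing queue
        release.srcQueueFamilyIndex = transferQueueFamily;
        release.dstQueueFamilyIndex = graphicsQueueFamily;
        release.buffer              = buffer;
        release.offset              = 0;
        release.size                = VK_WHOLE_SIZE;

        vkCmdPipelineBarrier(transferCommandBuffer,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                             0,
                             0, nullptr,
                             1, &release,
                             0, nullptr);
    }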

@robertosfield
Copy link
Collaborator

@timoore The TransferData class multi-buffers so that each frame's worth of updates does not overlap another frame's. This is done to prevent copies of different frames' worth of data from contending.

@siystar
Author

siystar commented Sep 17, 2023

@timoore queue family ownership transfers shouldn't be an issue for the TransferTask with PR #965 applied.

Basically the transfer task is not waiting for previous rendering submissions to finish before writing into buffers,

Good catch. Yes, I think we need to add a memory dependency between frames before transferring into buffers as there is no semaphore that connects them. The multi-buffering strategy of the TransferTask only applies to the staging buffers, not to the transfer itself, and the transferCompletedSemaphore only covers future frames, not past frames.
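
A minimal sketch of that missing dependency, in plain Vulkan rather than existing VSG code (renderFinishedSemaphore is a hypothetical semaphore signalled by the previous frame's rendering submission):

    #include <vulkan/vulkan.h>

    // Before the transfer commands for frame N overwrite a destination buffer,
    // wait on a semaphore signalled by frame N-1's rendering submission.
    void submitTransferWaitingOnPreviousFrame(VkQueue transferQueue,
                                              VkCommandBuffer transferCommandBuffer,
                                              VkSemaphore renderFinishedSemaphore,
                                              VkSemaphore transferCompletedSemaphore)
    {
        // The transfer writes are the operations that must wait, so the wait
        // stage is the transfer stage itself.
        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;

        VkSubmitInfo submitInfo{};
        submitInfo.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submitInfo.waitSemaphoreCount   = 1;
        submitInfo.pWaitSemaphores      = &renderFinishedSemaphore;
        submitInfo.pWaitDstStageMask    = &waitStage;
        submitInfo.commandBufferCount   = 1;
        submitInfo.pCommandBuffers      = &transferCommandBuffer;
        submitInfo.signalSemaphoreCount = 1;
        submitInfo.pSignalSemaphores    = &transferCompletedSemaphore;

        vkQueueSubmit(transferQueue, 1, &submitInfo, VK_NULL_HANDLE);
    }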

@timoore
Contributor

timoore commented Sep 17, 2023

As I understand it, the transfer task multi buffers its own staging buffers, but the destination buffers and images are not multibuffered. If the transfer queue and graphics queue are the same, then there might be enough synchronisation already to prevent data races; I would need to study it further to be sure. If the graphics and transfer queues are different, the transfer task does need to synchronise on the use of those buffers in the graphics queue with a semaphore wait or fence.

@siystar
Author

siystar commented Sep 17, 2023

If we need to do synchronization both before and after execution of the transfer commands, then what's the point of using a separate queue for the TransferTask? It certainly makes sense for the CompileTraversal where we only need to synchronize with operations that happen afterwards, but for the TransferTask I don't see how it could possibly help to use another queue.

@robertosfield
Collaborator

Presently the TransferTask uses a graphics queue and will use the same queue on non-NVidia systems as only one queue is present. The TransferTask currently has to use the graphics queue because the mipmap generation can't be done in a transfer queue.

Longer term I want to look at splitting out the transfer and the mipmap generation but this isn't on my priority list of TODO items.

@siystar
Author

siystar commented Sep 18, 2023

The point I was trying to make concerns efficiency. Correct me if I'm wrong, but isn't the whole point of submitting work to different queues the possibility of overlapping the execution of that work? The Vulkan spec describes queues as an "interface to the execution engines of a device" where each device "exposes one or more queues which may process work asynchronously to one another".

If, in order to fix the bug @timoore has discovered, we need to add an additional semaphore wait between the start of a TransferTask and the previous frame's rendering, then there is no longer any opportunity for work to overlap, so we could just as well submit the transfers to the main rendering queue and get the same end result with less code, with the required synchronization being done for us automatically by the queue submission. Going a step further, we could even include the transfer task's command buffer, with a pipeline barrier added to it, in the vkQueueSubmit(..) call made by the RecordAndSubmitTask.
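
A rough sketch of that single-queue alternative, in plain Vulkan and with hypothetical names rather than the actual RecordAndSubmitTask code:

    #include <vulkan/vulkan.h>

    // Recorded at the end of the transfer command buffer, before
    // vkEndCommandBuffer(): a barrier ordering the transfer writes before the
    // shader reads in the rendering command buffer that follows.
    void recordTransferToDrawBarrier(VkCommandBuffer transferCommandBuffer)
    {
        VkMemoryBarrier barrier{};
        barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
        barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_UNIFORM_READ_BIT | VK_ACCESS_SHADER_READ_BIT;

        vkCmdPipelineBarrier(transferCommandBuffer,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_VERTEX_SHADER_BIT |
                                 VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                             0,
                             1, &barrier,
                             0, nullptr,
                             0, nullptr);
    }

    // Both command buffers go into the one vkQueueSubmit; commands execute in
    // submission order, and the barrier above provides the memory dependency.
    void submitTransferAndRenderTogether(VkQueue graphicsQueue,
                                         VkCommandBuffer transferCommandBuffer,
                                         VkCommandBuffer renderCommandBuffer,
                                         VkFence frameFence)
    {
        VkCommandBuffer commandBuffers[] = { transferCommandBuffer, renderCommandBuffer };

        VkSubmitInfo submitInfo{};
        submitInfo.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submitInfo.commandBufferCount = 2;
        submitInfo.pCommandBuffers    = commandBuffers;

        vkQueueSubmit(graphicsQueue, 1, &submitInfo, frameFence);
    }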

We'd need to benchmark to be sure of the performance, but my gut feeling tells me that opting for the simplest implementation is the wisest strategy regardless.

@timoore
Contributor

timoore commented Sep 18, 2023

I think @robertosfield would agree that TransferTask is a work in progress. The problem isn't that the TransferTask needs to sync with the previous frame, it's that the dynamic data objects should be double buffered so that the TransferTask doesn't need to block.

A motivation for having a separate transfer queue is so that compile() in worker threads can make progress while the main graphics queue is blocked in vkQueueSubmit() waiting on a semaphore.

A wrinkle in all this is that various memory types are host visible and should not be updated via a CopyBuffers command... but that introduces yet another kind of synchronization.

@siystar
Author

siystar commented Sep 20, 2023

The first draft of a dynamic data updating mechanism, proposed in a PR by myself, did indeed use multi-buffered descriptor buffers on the GPU through the use of dynamic descriptors, so it would have supported the queuing of transfer tasks without synchronization. But Robert rejected the PR for good reasons.

The VSG is first and foremost a Vulkan abstraction layer, with some tools for developers built on top of that layer. One of the VSG's main strengths is the coherency of mapping VSG to Vulkan objects and vice versa. If a user creates a vsg::DescriptorBuffer object with a vsg::Data object attached to it, they'll expect a matching configuration of objects to be created on the Vulkan side, i.e. a single VkBuffer object containing exactly that data. To add multi-buffering support here would mean that some of the neat mapping of VSG to Vulkan objects is lost for the purpose of a feature that might theoretically lead to a performance gain but has a somewhat narrow scope.

In my opinion, a more sustainable way to improve the overlap of transfer work would be to make the pipeline stages used for the transferCompletedSemaphore configurable as I've described in #965.

@timoore
Contributor

timoore commented Sep 21, 2023

I agree that the VSG objects that correspond directly to Vulkan objects shouldn't contain magical double buffering. Also, I've come to realize that there is an advantage to not double-buffering memory that is written by the host: the latency between the update and eventual screen display is lower, assuming that the whole scene can be updated, recorded, and submitted (i.e., drawn) in one frame time.

So, I think we should "lean in" to the single buffer approach for the moment and make sure the synchronization is correct for that case, using @siystar's ideas and probably some more flexible use of fences. The RecordAndSubmitTask and TransferTask's use of command buffers is a bit muddled; if I understand correctly they use triple-buffering, probably in an effort to align with available presentation images, but they don't need to be more than double-buffered.

Double buffering of host memory writes is going to be needed for complex scenes, but I'm thinking that that should be handled in the scene graph, perhaps with a variation of StateSwitch that performs its action on alternate frames. Synchronization between updates, whether by direct write or buffer copy commands, and command buffer usage will have to be dealt with too, of course.

@robertosfield
Collaborator

One of the key reasons for buffering of data that is being transferred is that if you copy data to GPU memory as soon as you've updated it on the CPU, you can end up writing to GPU memory while it's still being used by a previous frame, so you end up with out-of-sync updates.

You can block on the CPU until all GPU rendering for a frame is complete so it's safe to update and copy, but this would kick the whole point of CPU and GPU parallelism into touch and performance would suck.

So these are solid reasons why I've implemented TransferTask the way I have, and this I'm not about to change. Refinements would be in the direction of separating out the transfer and mipmap operations and the synchronization side.

Now if there are specific problems we need to address, then we need to distill down an explanation of what they are and how to reproduce them, and then I'll sit down and think through a design that would work best.

@timoore
Contributor

timoore commented Sep 25, 2023

One of the key reasons for buffering of data that is being transferred is that if you copy data to GPU memory as soon as you've updated it on the CPU, you can end up writing to GPU memory while it's still being used by a previous frame, so you end up with out-of-sync updates.

You can block on the CPU until all GPU rendering for a frame is complete so it's safe to update and copy, but this would kick the whole point of CPU and GPU parallelism into touch and performance would suck.

This is all mostly true, but the buffers and images used for rendering are not double buffered. The submission of buffer copy commands assembled by the transfer task does not wait on a semaphore for the rendering commands using those buffers to complete. This is a bug: it produces real visual anomalies in vsgCs and error messages from the Vulkan synchronization layer. Since we're talking about submissions on different queues, there need to be semaphore signals and waits.

As to whether performance would "suck," sure, throughput would be worse, but the story for latency is pretty good in a single buffered approach. Implementing the required double buffering is a big job, which is why I'm not volunteering right at the moment.

As an aside, I'm skeptical that there's much GPU parallelism happening with the current constraint that the transfer queue support graphics operations. The hardware that most of us have simply doesn't support that. The major advantage of using a separate transfer queue, for us, is that submissions aren't blocked by the wait-for-fences operation on the main graphics queue. I'm happy to be proved wrong.

So these are solid reasons why I've implemented TransferTask the way I have, and this I'm not about to change. Refinements would be in the direction of separating out the transfer and mipmap operations and the synchronization side.

Now if there are specific problems we need to address, then we need to distill down an explanation of what they are and how to reproduce them, and then I'll sit down and think through a design that would work best.

I'll submit my vsgCs problems as an issue so that the discussion will become more concrete.

See #980.
