[FIXED] TransferTask semaphore wait stages #968

Merged
1 commit merged into vsg-dev:master on Sep 16, 2023

Conversation

siystar

@siystar commented Sep 16, 2023

Pull Request Template

Description

Fixed semaphore wait stages for transferCompletedSemaphore.

The Vulkan spec 7.4.2 "Semaphore waiting" states: "[...] the second synchronization scope is limited to operations on the pipeline stages determined by the destination stage mask specified by the corresponding element of pWaitDstStageMask."

Note that while waiting in a particular pipeline stage guarantees that any logically later pipeline stages also wait, the transfer, graphics and compute pipelines are independent, so a wait stage mask of VK_PIPELINE_STAGE_TRANSFER_BIT does not imply that any other pipeline stages wait.
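
As an illustration of the fix's intent, here is a minimal sketch in plain Vulkan (not the actual TransferTask code; the stage mask shown is an assumption about which stages consume the transferred data):

    #include <vulkan/vulkan.h>

    // Hypothetical helper, not the VSG API: submit the frame's graphics work so
    // that it waits on transferCompletedSemaphore in the stages that actually
    // consume the transferred data. Waiting only in VK_PIPELINE_STAGE_TRANSFER_BIT
    // would leave the vertex/fragment stages free to run ahead of the copy.
    void submitGraphicsWaitingOnTransfer(VkQueue graphicsQueue,
                                         VkCommandBuffer graphicsCommandBuffer,
                                         VkSemaphore transferCompletedSemaphore)
    {
        // Assumed consumer stages for this sketch.
        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT |
                                         VK_PIPELINE_STAGE_VERTEX_SHADER_BIT |
                                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;

        VkSubmitInfo submitInfo{};
        submitInfo.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submitInfo.waitSemaphoreCount = 1;
        submitInfo.pWaitSemaphores    = &transferCompletedSemaphore;
        submitInfo.pWaitDstStageMask  = &waitStage; // second synchronization scope = these stages only
        submitInfo.commandBufferCount = 1;
        submitInfo.pCommandBuffers    = &graphicsCommandBuffer;

        vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    }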

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

General testing

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@robertosfield merged commit a1f7d86 into vsg-dev:master on Sep 16, 2023
@timoore
Contributor

timoore commented Sep 16, 2023

I was hoping that this would fix a bug that I've been chasing down, but not quite. Basically, in vsgCs I'm modifying uniform buffers and seeing flashing in the output. This is on an Nvidia 1070, which has a separate graphics queue. The Vulkan synchronization layer produces a spew of errors like:

        [0] 0x220dc70, type: 4, name: NULL
SYNC-HAZARD-WRITE-RACING-READ(ERROR / SPEC): msgNum: -860391127 - Validation Error: [ SYNC-HAZARD-WRITE-RACING-READ ] Object 0: handle = 0x220dc70, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xccb77929 | vkQueueSubmit: Hazard WRITE_RACING_READ for entry 0, VkCommandBuffer 0x28c6a30[], Recorded access info (recorded_usage: SYNC_COPY_TRANSFER_WRITE, command: vkCmdCopyBuffer, seq_no: 5, reset_no: 1453). Access info (prior_usage: SYNC_FRAGMENT_SHADER_UNIFORM_READ, read_barriers: VK_PIPELINE_STAGE_2_BOTTOM_OF_PIPE_BIT, queue: VkQueue 0x2108000[], submit: 4483, batch: 0, batch_tag: 70343, command: vkCmdDrawIndexed, seq_no: 18, command_buffer: VkCommandBuffer 0x2729650[], reset_no: 1493).
    Objects: 1
        [0] 0x220dc70, type: 4, name: NULL

I know that this isn't a proper bug report, but I need to do more investigation, and I hope that @siystar will be inspired to look too 😎

@robertosfield
Collaborator

@timoore I think it would be best to create an Issue report for this with details of how to reproduce it. Info on how vsgCs is creating and updating the uniform data would be helpful too.

@siystar
Author

siystar commented Sep 16, 2023

@timoore The semaphore changed in this PR only enforces the order of transfer and graphics commands for a particular frame.

There is another semaphore that is used to enforce the ordering of frames relative to one another. That semaphore currently uses vsg::Semaphore's default pWaitDstStageMask of BOTTOM_OF_PIPE_BIT, which I think is incorrect. The Vulkan spec states that BOTTOM_OF_PIPE_BIT, when used as a dstStageMask, is equivalent to specifying no pipeline stages. Not specifying any stages to wait on renders the semaphore ineffective, so the order of frames is not currently enforced except by implicit ordering constraints like drawing order. The Vulkan spec also states that the TOP/BOTTOM pipeline stages are deprecated, so ideally we should cease using them.

To test, you could open vsg/vk/Semaphore.h and change the default stage mask to VK_PIPELINE_STAGE_ALL_COMMANDS_BIT.
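
For reference, the kind of one-line experiment being suggested might look roughly like this (the exact declaration in vsg/vk/Semaphore.h and the parameter name may differ; only the default value of the stage mask matters):

    // Sketch only; the surrounding declaration in vsg/vk/Semaphore.h is elided
    // and the parameter name is assumed.
    //
    // Current default - as a dstStageMask, BOTTOM_OF_PIPE includes no stages,
    // so the semaphore wait orders nothing:
    //     VkPipelineStageFlags pipelineStageFlags = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT;
    //
    // Suggested experiment - wait in all stages, the conservative catch-all:
    VkPipelineStageFlags pipelineStageFlags = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;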

@timoore
Contributor

timoore commented Sep 16, 2023

I think there are two problems relevant to doing transfers on a different queue from the graphics queue, and I'll open an issue about it. Basically the transfer task is not waiting for previous rendering submissions to finish before writing into buffers, and there are no queue ownership transfers.
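
For context, a queue family ownership transfer for an exclusively-shared buffer looks roughly like the following release barrier on the transfer queue, paired with a matching acquire barrier on the graphics queue (illustration only, not existing VSG code):

    #include <vulkan/vulkan.h>

    // Illustration of a queue-family ownership "release" for a buffer created
    // with VK_SHARING_MODE_EXCLUSIVE; a matching "acquire" barrier with the same
    // queue family indices must be recorded on the graphics queue before the
    // buffer is read there.
    void releaseBufferToGraphicsQueue(VkCommandBuffer transferCommandBuffer,
                                      VkBuffer buffer,
                                      uint32_t transferQueueFamily,
                                      uint32_t graphicsQueueFamily)
    {
        VkBufferMemoryBarrier release{};
        release.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
        release.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
        release.dstAccessMask       = 0; // ignored by the releasing queue
        release.srcQueueFamilyIndex = transferQueueFamily;
        release.dstQueueFamilyIndex = graphicsQueueFamily;
        release.buffer              = buffer;
        release.offset              = 0;
        release.size                = VK_WHOLE_SIZE;

        vkCmdPipelineBarrier(transferCommandBuffer,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                             0,
                             0, nullptr,
                             1, &release,
                             0, nullptr);
    }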

@robertosfield
Copy link
Collaborator

@timoore The TransferData class multi-buffers so that each frame's worth of updates does not overlap another frame's. This is done to prevent copies of different frames' worth of data from contending.

@siystar
Author

siystar commented Sep 17, 2023

@timoore queue family ownership transfers shouldn't be an issue for the TransferTask with PR #965 applied.

Basically the transfer task is not waiting for previous rendering submissions to finish before writing into buffers,

Good catch. Yes, I think we need to add a memory dependency between frames before transferring into buffers as there is no semaphore that connects them. The multi-buffering strategy of the TransferTask only applies to the staging buffers, not to the transfer itself, and the transferCompletedSemaphore only covers future frames, not past frames.
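
A minimal sketch of that missing dependency, in plain Vulkan rather than existing VSG code (renderFinishedSemaphore is a hypothetical semaphore signalled by the previous frame's rendering submission):

    #include <vulkan/vulkan.h>

    // Before the transfer commands for frame N overwrite a destination buffer,
    // wait on a semaphore signalled by frame N-1's rendering submission.
    void submitTransferWaitingOnPreviousFrame(VkQueue transferQueue,
                                              VkCommandBuffer transferCommandBuffer,
                                              VkSemaphore renderFinishedSemaphore,
                                              VkSemaphore transferCompletedSemaphore)
    {
        // The transfer writes are the operations that must wait, so the wait
        // stage is the transfer stage itself.
        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;

        VkSubmitInfo submitInfo{};
        submitInfo.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submitInfo.waitSemaphoreCount   = 1;
        submitInfo.pWaitSemaphores      = &renderFinishedSemaphore;
        submitInfo.pWaitDstStageMask    = &waitStage;
        submitInfo.commandBufferCount   = 1;
        submitInfo.pCommandBuffers      = &transferCommandBuffer;
        submitInfo.signalSemaphoreCount = 1;
        submitInfo.pSignalSemaphores    = &transferCompletedSemaphore;

        vkQueueSubmit(transferQueue, 1, &submitInfo, VK_NULL_HANDLE);
    }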

@timoore
Contributor

timoore commented Sep 17, 2023

As I understand it, the transfer task multi buffers its own staging buffers, but the destination buffers and images are not multibuffered. If the transfer queue and graphics queue are the same, then there might be enough synchronisation already to prevent data races; I would need to study it further to be sure. If the graphics and transfer queues are different, the transfer task does need to synchronise on the use of those buffers in the graphics queue with a semaphore wait or fence.

@siystar
Author

siystar commented Sep 17, 2023

If we need to do synchronization both before and after execution of the transfer commands, then what's the point of using a separate queue for the TransferTask? It certainly makes sense for the CompileTraversal where we only need to synchronize with operations that happen afterwards, but for the TransferTask I don't see how it could possibly help to use another queue.

@robertosfield
Collaborator

Presently the TransferTask uses a graphics queue and will use the same queue on non-NVidia systems as only one queue is present. The TransferTask currently has to use the graphics queue because the mipmap generation can't be done in a transfer queue.

Longer term I want to look at splitting out the transfer and the mipmap generation but this isn't on my priority list of TODO items.

@siystar
Author

siystar commented Sep 18, 2023

The point I was trying to make concerns efficiency. Correct me if I'm wrong, but isn't the whole point of submitting work to different queues the possibility of overlapping the execution of that work? The Vulkan spec describes queues as an "interface to the execution engines of a device" where each device "exposes one or more queues which may process work asynchronously to one another".

If, in order to fix the bug @timoore has discovered, we need to add an additional semaphore wait between the start of a TransferTask and the previous frame's rendering, then there is no longer any opportunity for work to overlap, so we could just as well submit the transfers to the main rendering queue and get the same end result with less code, with the required synchronization being done for us automatically by the queue submission. Going a step further, we could even include the transfer task's command buffer, with a pipeline barrier added to it, in the vkQueueSubmit(..) call made by the RecordAndSubmitTask.
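
A rough sketch of that single-queue alternative, in plain Vulkan and with hypothetical names rather than the actual RecordAndSubmitTask code:

    #include <vulkan/vulkan.h>

    // Recorded at the end of the transfer command buffer, before
    // vkEndCommandBuffer(): a barrier ordering the transfer writes before the
    // shader reads in the rendering command buffer that follows.
    void recordTransferToDrawBarrier(VkCommandBuffer transferCommandBuffer)
    {
        VkMemoryBarrier barrier{};
        barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
        barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_UNIFORM_READ_BIT | VK_ACCESS_SHADER_READ_BIT;

        vkCmdPipelineBarrier(transferCommandBuffer,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_VERTEX_SHADER_BIT |
                                 VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                             0,
                             1, &barrier,
                             0, nullptr,
                             0, nullptr);
    }

    // Both command buffers go into the one vkQueueSubmit; commands execute in
    // submission order, and the barrier above provides the memory dependency.
    void submitTransferAndRenderTogether(VkQueue graphicsQueue,
                                         VkCommandBuffer transferCommandBuffer,
                                         VkCommandBuffer renderCommandBuffer,
                                         VkFence frameFence)
    {
        VkCommandBuffer commandBuffers[] = { transferCommandBuffer, renderCommandBuffer };

        VkSubmitInfo submitInfo{};
        submitInfo.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submitInfo.commandBufferCount = 2;
        submitInfo.pCommandBuffers    = commandBuffers;

        vkQueueSubmit(graphicsQueue, 1, &submitInfo, frameFence);
    }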

We'd need to benchmark to be sure of the performance, but my gut feeling tells me that opting for the simplest implementation is the wisest strategy regardless.

@timoore
Contributor

timoore commented Sep 18, 2023

I think @robertosfield would agree that TransferTask is a work in progress. The problem isn't that the TransferTask needs to sync with the previous frame, it's that the dynamic data objects should be double buffered so that the TransferTask doesn't need to block.

A motivation for having a separate transfer queue is so that compile() in worker threads can make progress while the main graphics queue is blocked in vkQueueSubmit() waiting on a semaphore.

A wrinkle in all this is that various memory types are host visible and should not be updated via a CopyBuffers command... but that introduces yet another kind of synchronization.

@siystar
Author

siystar commented Sep 20, 2023

The first draft of a dynamic data updating mechanism, proposed in a PR by myself, did indeed use multi-buffered descriptor buffers on the GPU through the use of dynamic descriptors, so it would have supported the queuing of transfer tasks without synchronization. But Robert rejected the PR for good reasons.

The VSG is first and foremost a Vulkan abstraction layer, with some tools for developers built on top of that layer. One of the VSG's main strengths is the coherency of mapping VSG to Vulkan objects and vice versa. If a user creates a vsg::DescriptorBuffer object with a vsg::Data object attached to it, they'll expect a matching configuration of objects to be created on the Vulkan side, i.e. a single VkBuffer object containing exactly that data. To add multi-buffering support here would mean that some of the neat mapping of VSG to Vulkan objects is lost for the purpose of a feature that might theoretically lead to a performance gain but has a somewhat narrow scope.

In my opinion, a more sustainable way to improve the overlap of transfer work would be to make the pipeline stages used for the transferCompletedSemaphore configurable as I've described in #965.

@timoore
Contributor

timoore commented Sep 21, 2023

I agree that the VSG objects that correspond directly to Vulkan objects shouldn't contain magical double buffering. Also, I've come to realize that there is an advantage to not double-buffering memory that is written by the host: the latency between the update and eventual screen display is lower, assuming that the whole scene can be updated, recorded, and submitted (i.e., drawn) in one frame time.

So, I think we should "lean in" to the single buffer approach for the moment and make sure the synchronization is correct for that case, using @siystar's ideas and probably some more flexible use of fences. The RecordAndSubmitTask and TransferTask's use of command buffers is a bit muddled; if I understand correctly they use triple-buffering, probably in an effort to align with available presentation images, but they don't need to be more than double-buffered.

Double buffering of host memory writes is going to be needed for complex scenes, but I'm thinking that that should be handled in the scene graph, perhaps with a variation of StateSwitch that performs its action on alternate frames. Synchronization between updates, whether by direct write or buffer copy commands, and command buffer usage will have to be dealt with too, of course.

@robertosfield
Collaborator

One of the key reasons for buffering of data that is being transferred is that if you copy data to GPU memory as soon as you've updated it on the CPU, you can end up writing to GPU memory while it's still being used by a previous frame, so you end up with out-of-sync updates.

You can block on the CPU until all GPU rendering for a frame is complete so it's safe to update and copy, but this would kick the whole point of CPU and GPU parallelism into touch and performance would suck.

So these are solid reasons why I've implemented TransferTask the way I have, and this I'm not about to change. Refinements would be in the direction of separating out the transfer and mipmap operations and the synchronization side.

Now if there are specific problems we need to address, then we need to distill down an explanation of what they are and how to reproduce them, and then I'll sit down and think through a design that would work best.

@timoore
Contributor

timoore commented Sep 25, 2023

One of the key reasons for buffering of data that is being transferred is that if you copy data to GPU memory as soon as you've updated it on the CPU, you can end up writing to GPU memory while it's still being used by a previous frame, so you end up with out-of-sync updates.

You can block on the CPU until all GPU rendering for a frame is complete so it's safe to update and copy, but this would kick the whole point of CPU and GPU parallelism into touch and performance would suck.

This is all mostly true, but the buffers and images used for rendering are not double buffered. The submission of buffer copy commands assembled by the transfer task does not wait on a semaphore for the rendering commands using those buffers to complete. This is a bug: it produces real visual anomalies in vsgCs and error messages from the Vulkan synchronization layer. Since we're talking about submissions on different queues, there need to be semaphore signals and waits.

As to whether performance would "suck," sure, throughput would be worse, but the story for latency is pretty good in a single buffered approach. Implementing the required double buffering is a big job, which is why I'm not volunteering right at the moment.

As an aside, I'm skeptical that there's much GPU parallelism happening with the current constraint that the transfer queue support graphics operations. The hardware that most of us have simply doesn't support that. The major advantage of using a separate transfer queue, for us, is that submissions aren't blocked by the wait-for-fences operation on the main graphics queue. I'm happy to be proved wrong.

So these are solid reasons why I've implemented TransferTask the way I have, and this I'm not about to change. Refinements would be in the direction of separating out the transfer and mipmap operations and the synchronization side.

Now if there are specific problems we need to address, then we need to distill down an explanation of what they are and how to reproduce them, and then I'll sit down and think through a design that would work best.

I'll submit my vsgCs problems as an issue so that the discussion will become more concrete.

See #980.
