Redefine command-buffer simultaneous-use#1411
Conversation
As discussed in KhronosGroup#891 the current definition of [simultaneous use](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_command_buffers) is hard for users to reason about. Instead a better model is one simultaneous-use isn't a contraint of the command-buffer submission, but it's execution. That is simultaneous-use occurs when a command-buffer is enqueued for execution while any previous submission of the command-buffer is still in-flight, and there any no scheduling dependencies expressed to serialize execution. This means that pipelined submissions of a command-buffer, where there is an in-order queue, barrier, or cl_events to serialize execution, is always valid usage without this optional simultaneous-use device feature. To avoid the runtime having to incur overheads from monitoring when simultaneous-use occurs so it can throw an error, violating this valid usage for non simultaneous-use command-buffers is UB. Following from this related to KhronosGroup#1311 the pending state has also been removed from the command-buffer lifecycle, and there is only the binary states of recording and executable. This is because a user can use the existing OpenCL mechanisms of host waits and event queries to inspect the state of individual command-buffer enqueues to avoid simultaneous-use, and having this stored as a command-buffer state incurs the runtime overhead of tracking a previous submissions. The pending count concept also now makes less sense. The implications for updating a command-buffer is that the error behavior is removed from update, so simultaneous-use is a property of execution rather than enqueue. See CTS test ideas for how this could work. I think the CTS changes that are required for this as follows, but I could create a separate CTS issue to track work once/if this PR merge. Or we could try prototype the CTS changes to give confidence in the spec change. * Remove test for pending state query * Either rework existing simulataneous use tests so that it tests the new definition, or delete them and create new more suitable tests without tech-debt from old definition. * Add tests for pipelined submission of a command-buffer not created with simultaneous-use. Ideally stressing indriect dependencies as well as direct ones: ** in-order queue to express depdencies ** out-of-order queue with event dependencies to express depenencies ** barrier to express for dependencies * Add cl_khr_command_buffer_mutable_dispatch tests for updating and enqueueing pipelined submissions of a non-simultaneous use command-buffer with depdencies between each enqueue, and only do a blocking wait at the end. * Add cl_khr_command_buffer_mutable_dispatch tests for a simultaneous-use command-buffer, where two invocations are scheduled such that they can run concurrently, but the second invocation is updated such that it uses different inputs/outputs to avoid race conditions in the kernel.
16e2cd3 to
e45df2e
Compare
Actions test plan from KhronosGroup#2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411
Actions test plan from KhronosGroup#2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411 * Adds new test in`command_buffer_pipelined_enqueue.cpp` for multiple enqueues without blocking in-between, but serialized execution. * Removed test for `CL_COMMAND_BUFFER_STATE_PENDING_KHR` state query. * Remove negative test for `clEnqueueCommandBuffer` pending state error. * Simplify `cl_khr_command_buffer` tests that stress simultaneous-use by testing multiple serialized enqueues of the same command-buffer, which doesn't now require the device simultaneous-use capability
Actions test plan from KhronosGroup#2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411 * Adds new test in`command_buffer_pipelined_enqueue.cpp` for multiple enqueues without blocking in-between, but serialized execution. * Removed test for `CL_COMMAND_BUFFER_STATE_PENDING_KHR` state query. * Remove negative test for `clEnqueueCommandBuffer` pending state error. * Simplify `cl_khr_command_buffer` tests that stress simultaneous-use by testing multiple serialized enqueues of the same command-buffer, which doesn't now require the device simultaneous-use capability * Flip setting the simultaneous-use property on command-buffer creation in base class to off, and only enable if tests requests it. * Rewrite mutable dispatch simultaneous test to test updating between pipelined enqueues, and updating the new definition of simultaneous-use
Actions test plan from KhronosGroup#2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411 * Adds new test in`command_buffer_pipelined_enqueue.cpp` for multiple enqueues without blocking in-between, but serialized execution. * Removed test for `CL_COMMAND_BUFFER_STATE_PENDING_KHR` state query. * Remove negative test for `clEnqueueCommandBuffer` pending state error. * Simplify `cl_khr_command_buffer` tests that stress simultaneous-use by testing multiple serialized enqueues of the same command-buffer, which doesn't now require the device simultaneous-use capability * Flip setting the simultaneous-use property on command-buffer creation in base class to off, and only enable if tests requests it. * Rewrite mutable dispatch simultaneous test to test updating both pipelined enqueues, and updating the new definition of simultaneous-use
Actions test plan from KhronosGroup#2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411 * Adds new test in`command_buffer_pipelined_enqueue.cpp` for multiple enqueues without blocking in-between, but serialized execution. * Removed test for `CL_COMMAND_BUFFER_STATE_PENDING_KHR` state query. * Remove negative test for `clEnqueueCommandBuffer` pending state error. * Simplify `cl_khr_command_buffer` tests that stress simultaneous-use by testing multiple serialized enqueues of the same command-buffer, which doesn't now require the device imultaneous-use capability * Remove simultaneous-use command-buffer creation in base class to off, and require tests do it themselves if they require it. * Rewrite mutable dispatch simultaneous test to test updating both pipelined enqueues, and updating the new definition of simultaneous-use
|
@bashbaug I've requested you re-review the PR, as since you approved it I've pushed a commit that moves the simultaneous-use capability from the base extension to cl_khr_command_buffer_mutable_dispatch. While updating the CTS tests I realized I couldn't think of a scenario without update where it adds value, as any concurrent execution of the command-buffer would probably be race condition. Whereas with mutable-dispatch you can update your kernel to new inputs/outputs to use that would allow two concurrent executions to operate on different data. We haven't discussed this in the group yet, so I can revert it if we think it's undesirable. |
Interesting. My first reaction was that this goes a bit too far, but after thinking about it a bit more, I'm starting to agree. Without mutable dispatch, every execution of the command buffer is necessarily reading from and writing to the same locations in memory, which is why "any concurrent execution of the command-buffer would probably be race condition". Note that Vulkan has the command buffer simultaneous usage flag Let's discuss a bit more in our call on Tuesday before merging this change, just to be sure. |
|
One more thought: "simultaneous use" is going to apply to mutable dispatch, plus mutable memory commands (#1065), plus potentially any other future extensions that allow modifying a command buffer (e.g. #1279). So, practically speaking, we still may want to keep the "simultaneous use" concept and enums in the base-level extension and have the layered extensions refer to them vs. replicating everything into each of the layered extensions. Even if we do this, I think it's worth an informative note cautioning against concurrent execution of a command buffer without modification due to the possibility of data races, since this wasn't immediately obvious to me. |
|
Discussed in the August 12th teleconference:
|
Actions test plan from KhronosGroup#2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411 * Adds new test in`command_buffer_pipelined_enqueue.cpp` for multiple enqueues without blocking in-between, but serialized execution. * Removed test for `CL_COMMAND_BUFFER_STATE_PENDING_KHR` state query. * Remove negative test for `clEnqueueCommandBuffer` pending state error. * Simplify `cl_khr_command_buffer` tests that stress simultaneous-use by testing multiple serialized enqueues of the same command-buffer, which doesn't now require the device imultaneous-use capability * Remove simultaneous-use command-buffer creation in base class to off, and require tests do it themselves if they require it. * Rewrite mutable dispatch simultaneous test to test updating both pipelined enqueues, and updating the new definition of simultaneous-use
bashbaug
left a comment
There was a problem hiding this comment.
Merging as discussed in the August 26th teleconference.
Use script to regenerate the headers based on changes from OpenCL-Docs that updated cl.xml * KhronosGroup/OpenCL-Docs#1411 * KhronosGroup/OpenCL-Docs#1435 * KhronosGroup/OpenCL-Docs#1408
Use script to regenerate the headers based on changes from OpenCL-Docs that updated cl.xml * KhronosGroup/OpenCL-Docs#1411 * KhronosGroup/OpenCL-Docs#1435 * KhronosGroup/OpenCL-Docs#1408
Actions test plan from KhronosGroup#2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411 * Adds new test in`command_buffer_pipelined_enqueue.cpp` for multiple enqueues without blocking in-between, but serialized execution. * Removed test for `CL_COMMAND_BUFFER_STATE_PENDING_KHR` state query. * Remove negative test for `clEnqueueCommandBuffer` pending state error. * Simplify `cl_khr_command_buffer` tests that stress simultaneous-use by testing multiple serialized enqueues of the same command-buffer, which doesn't now require the device imultaneous-use capability * Remove simultaneous-use command-buffer creation in base class to off, and require tests do it themselves if they require it. * Rewrite mutable dispatch simultaneous test to test updating both pipelined enqueues, and updating the new definition of simultaneous-use
Actions test plan from KhronosGroup#2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411 * Adds new test in`command_buffer_pipelined_enqueue.cpp` for multiple enqueues without blocking in-between, but serialized execution. * Removed test for `CL_COMMAND_BUFFER_STATE_PENDING_KHR` state query. * Remove negative test for `clEnqueueCommandBuffer` pending state error. * Simplify `cl_khr_command_buffer` tests that stress simultaneous-use by testing multiple serialized enqueues of the same command-buffer, which doesn't now require the device imultaneous-use capability * Remove simultaneous-use command-buffer creation in base class to off, and require tests do it themselves if they require it. * Rewrite mutable dispatch simultaneous test to test updating both pipelined enqueues, and updating the new definition of simultaneous-use
Actions test plan from #2473 to update CTS tests to reflect changes from cl_khr_command_buffer PR KhronosGroup/OpenCL-Docs#1411 * Adds new test in`command_buffer_pipelined_enqueue.cpp` for multiple enqueues without blocking in-between, but serialized execution. * Removed test for `CL_COMMAND_BUFFER_STATE_PENDING_KHR` state query. * Remove negative test for `clEnqueueCommandBuffer` pending state error. * Simplify `cl_khr_command_buffer` tests that stress simultaneous-use by testing multiple serialized enqueues of the same command-buffer, which doesn't now require the device imultaneous-use capability * Remove simultaneous-use command-buffer creation in base class to off, and require tests do it themselves if they require it. * Rewrite mutable dispatch simultaneous test to test updating both pipelined enqueues, and updating the new definition of simultaneous-use --------- Co-authored-by: Ewan Crawford <ewan@codeplay.com>
As discussed in #891 the current definition of simultaneous use is hard for users to reason about.
Instead a better model is one where simultaneous-use isn't a constraint on when a command-buffer can be submitted for execution, but the execution itself. That is, simultaneous-use occurs when a command-buffer is enqueued for execution while any previous submission of the command-buffer is still in-flight, and there any no scheduling dependencies expressed to serialize execution on the in-flight submission.
This means that pipelined submissions of a command-buffer, where there is an in-order queue, barrier, or cl_events to serialize execution, is always valid usage without the optional simultaneous-use device feature.
To avoid the runtime having to incur overheads from monitoring when simultaneous-use occurs so it can throw an error, violating this valid usage for non simultaneous-use command-buffers is UB. Following from this and relation to #1311, the pending state has also been removed from the command-buffer lifecycle, and there is only left the binary states of recording and executable (note that this conflicts with the change proposed in #1382, but it I believe it is straightforward to resolve). This is because a user can use the existing OpenCL mechanisms of host waits and event queries to inspect the state of individual command-buffer enqueues to avoid simultaneous-use, and having this stored as a command-buffer state incurs the runtime overhead of tracking a previous submissions.
The implications for updating a command-buffer is that the error behavior is removed from update, so simultaneous-use is a property of execution rather than enqueue. In fact update is the only scenario where simultaneous use now adds value, as without it there will be race conditions during execution. But with command-buffer update the command-buffer operations can work on different data. As a result i've moved the simultaneous capability from the base extension to mutable-dispatch.
I've outlined the CTS changes that would be required for this change in KhronosGroup/OpenCL-CTS#2473 and drafted them in KhronosGroup/OpenCL-CTS#2477