Commit 7cf69ec

Add all intercepts

1 parent 9f91d75 commit 7cf69ec

23 files changed: +27026 / -86 lines changed

layer_gpu_timeline/CMakeLists.txt

Lines changed: 2 additions & 1 deletion

@@ -32,5 +32,6 @@ set(LGL_LOG_TAG, "VkLayerGPUTimeline")
 include(../source_common/compiler_helper.cmake)

 # Build steps
-add_subdirectory(source)
 add_subdirectory(../source_common/framework source_common/framework)
+add_subdirectory(../source_common/trackers source_common/trackers)
+add_subdirectory(source)

layer_gpu_timeline/README_LAYER.md

Lines changed: 164 additions & 0 deletions

# Layer: GPU Timeline

This layer is used with Arm GPUs to track submitted schedulable workloads and
emit semantic information about them. This data can be combined with the raw
workload execution timing information captured using the Android Perfetto
service, providing developers with a richer debug visualization.
## What devices?

The Arm GPU driver integration with the Perfetto render stages scheduler event
trace has been supported at production quality since the r47p0 driver version.
However, associating semantics from this layer relies on a further integration
with debug labels, which requires an r51p0 or later driver version.
## What workloads?

A schedulable workload is the smallest unit of work that the Arm GPU command
stream scheduler will issue to the GPU hardware work queues. This includes the
following workload types:

* Render passes, split into:
  * Vertex or Binning phase
  * Fragment or Main phase
* Compute dispatches
* Trace rays
* Transfers to a buffer
* Transfers to an image

Most workloads are dispatched using a single API call and are trivial to
manage in the layer. However, render passes are more complex and need extra
handling. In particular:

* Render passes are issued using multiple API calls.
* Useful render pass properties, such as draw count, are not known until the
  render pass recording has ended.
* Dynamic render passes using `vkCmdBeginRendering()` and `vkCmdEndRendering()`
  can be suspended and resumed across command buffer boundaries. Properties
  such as draw count are not defined by the scope of a single command buffer.
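The workload taxonomy above maps naturally onto a small enumeration. As a sketch only — the type and value names here are illustrative, not the layer's actual identifiers:

```cpp
#include <cassert>
#include <string>

// Illustrative workload taxonomy; names are hypothetical, not the layer's API.
enum class WorkloadType
{
    renderPassVertex,    // Vertex/Binning phase of a render pass
    renderPassFragment,  // Fragment/Main phase of a render pass
    computeDispatch,
    traceRays,
    bufferTransfer,
    imageTransfer,
};

// Human-readable name, e.g. for emitting into workload metadata.
std::string workloadName(WorkloadType type)
{
    switch (type)
    {
    case WorkloadType::renderPassVertex:   return "render_pass_vertex";
    case WorkloadType::renderPassFragment: return "render_pass_fragment";
    case WorkloadType::computeDispatch:    return "compute_dispatch";
    case WorkloadType::traceRays:          return "trace_rays";
    case WorkloadType::bufferTransfer:     return "buffer_transfer";
    case WorkloadType::imageTransfer:      return "image_transfer";
    }
    return "unknown";
}
```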
## Tracking workloads

This layer tracks workloads encoded in command buffers, and emits semantic
metadata for each workload via a communications side-channel. A host tool
combines the semantic data stream with the Perfetto data stream, using debug
label tags injected by the layer as a common cross-reference to link the two
streams.
### Workload labelling

Command stream labelling is implemented using `vkCmdDebugMarkerBeginEXT()`
and `vkCmdDebugMarkerEndEXT()`, wrapping one layer-owned `tagID` label around
each semantic workload. This `tagID` unambiguously refers to this workload
encoding, and metadata that we do not expect to change per submit will be
emitted using the matching `tagID` as the sole identifier.

_**TODO:** Dynamic `submitID` tracking is not yet implemented._

The `tagID` label is encoded into the recorded command buffer, which means
that, for reusable command buffers, it is not an unambiguous identifier of a
specific running workload. To allow us to disambiguate specific workload
instances, the layer can optionally add an outer wrapper of `submitID` labels
around each submitted command buffer. This wrapper is only generated if the
submit contains any command buffers that require the generation of a
per-submit annex (see the following sections for when this is needed).

The `submitID.tagID` pair of IDs uniquely identifies a specific running
workload, and can be used to attach an instance-specific metadata annex to a
specific submitted workload rather than to the shared recorded command buffer.
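As an illustration of this wrapping scheme, a layer could serialize the IDs into short label strings passed to `vkCmdDebugMarkerBeginEXT()`. The `t<id>`/`s<id>` formats below are assumptions made for this sketch, not the layer's actual encoding:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch of how a layer-owned tagID might be serialized into the debug
// marker string wrapped around each workload. Format is hypothetical.
std::string makeTagLabel(uint64_t tagID)
{
    return "t" + std::to_string(tagID);
}

// Outer wrapper label identifying a specific submit, used only when a
// per-submit metadata annex is required. Format is hypothetical.
std::string makeSubmitLabel(uint64_t submitID)
{
    return "s" + std::to_string(submitID);
}
```

At record time the layer would pass `makeTagLabel(id)` to `vkCmdDebugMarkerBeginEXT()` immediately before the workload and call `vkCmdDebugMarkerEndEXT()` immediately after it.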
### Workload metadata for split render passes

_**TODO:** Split render pass tracking is not yet implemented._

Dynamic render passes can be split across multiple Begin/End pairs, including
being split across command buffer boundaries. If these splits occur within a
single primary command buffer, or its secondaries, they are handled
transparently by the layer, and the render pass appears as a single message,
as if no splits occurred. If the splits occur across primary command buffer
boundaries, then some additional work is required.

In our design, a `tagID` debug marker is only started when the render pass
first starts (not on resume), and stopped at the end of the render pass (not
on suspend). The same `tagID` is used to refer to all parts of the render
pass, no matter how many times it was suspended and resumed.

If a render pass splits across command buffers, we cannot precompute metrics
based on `tagID` alone, even if the command buffers are one-time use. This is
because we do not know what combination of submitted command buffers will be
used, and so we cannot know what the render pass contains until submit time.
Split render passes will emit a `submitID.tagID` metadata annex containing
the parameters that can only be known at submit time.
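At submit time, the per-part metadata for a split render pass has to be merged once the submitted combination of command buffers is known. A minimal sketch of the draw-count part of that merge — the class and method names are hypothetical, not the layer's implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Sketch: accumulate draw counts for a split render pass at submit time.
// Every suspended/resumed part shares the same tagID, so totals can only
// be computed once we see which command buffers were submitted together.
class SplitRenderPassMerger
{
public:
    // Record the draw count contributed by one part of the render pass.
    void addPart(uint64_t tagID, uint32_t drawCount)
    {
        totals[tagID] += drawCount;
    }

    // Total draw count for the whole render pass, across all parts.
    uint32_t totalDrawCount(uint64_t tagID) const
    {
        auto it = totals.find(tagID);
        return it == totals.end() ? 0 : it->second;
    }

private:
    std::unordered_map<uint64_t, uint32_t> totals;
};
```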
### Workload metadata for compute dispatches

_**TODO:** Compute workgroup parsing from the SPIR-V is not yet implemented._

Compute workload dispatch is simple to track, but one of the metadata items we
want to export is the total size of the work space (`work_group_count *
work_group_size`).

The work group count is defined by the API call, but may be an indirect
parameter (see the section on indirect calls below).

The work group size is defined by the program pipeline, and is defined in the
SPIR-V via a literal or a build-time specialization constant. To support this
use case we will need to parse the SPIR-V when the pipeline is built, if
SPIR-V is available.
### Workload metadata for indirect calls

_**TODO:** Indirect parameter tracking is not yet implemented._

One of the valuable pieces of metadata that we want to present is the size of
each workload. For render passes this is captured at API call time, but for
other workloads the size can be an indirect parameter that is not known when
the triggering API call is made.

To capture indirect parameters, we insert a transfer that copies the indirect
parameters into a layer-owned buffer. To ensure exclusive use of the buffer
and avoid data corruption, each buffer region used is unique to a specific
`tagID`. Attempting to submit the same command buffer multiple times will
result in the workload being serialized to avoid racy access to the buffer.
Once the buffer has been retrieved by the layer, a metadata annex containing
the indirect parameters will be emitted using the `submitID.tagID` pair. This
may be some time later than the original submit.
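The per-`tagID` buffer regions could be managed by a very simple allocator. A sketch, under the assumption of fixed-size regions handed out on first use — the class name and allocation policy are illustrative, not the layer's design:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Sketch: assign each tagID its own fixed-size region in a layer-owned
// staging buffer used to capture indirect parameters.
class IndirectStagingAllocator
{
public:
    explicit IndirectStagingAllocator(uint64_t regionSize)
        : regionSize(regionSize) {}

    // Return the buffer offset reserved for this tagID, allocating a new
    // region on first use. Reuse of one region by one tagID is what forces
    // serialization of repeated submits of the same command buffer.
    uint64_t regionFor(uint64_t tagID)
    {
        auto it = offsets.find(tagID);
        if (it != offsets.end())
        {
            return it->second;
        }
        uint64_t offset = nextOffset;
        nextOffset += regionSize;
        offsets[tagID] = offset;
        return offset;
    }

private:
    uint64_t regionSize;
    uint64_t nextOffset = 0;
    std::unordered_map<uint64_t, uint64_t> offsets;
};
```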
### Workload metadata for user-defined labels

The workload metadata captures user-defined labels that the application
provides using `vkCmdDebugMarkerBeginEXT()` and `vkCmdDebugMarkerEndEXT()`.
These form a stack-based debug mechanism, where `Begin` pushes a new entry
onto the stack and `End` pops the most recent entry off the stack.

Workloads are labelled with the stack values that existed when the workload
was started. For render passes this is the value on the stack when, e.g.,
`vkCmdBeginRenderPass()` was called. We do not capture any labels that exist
inside the render pass.

The debug label stack belongs to the queue, not to the command buffer, so the
value of the label stack is not known until submit time. The debug information
for a specific `submitID.tagID` pair is therefore provided as an annex at
submit time, once the stack can be resolved.
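The queue-owned label stack and the workload-time snapshot could look like this minimal sketch (the class is hypothetical, not the layer's implementation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: per-queue debug label stack. Begin pushes, End pops, and each
// workload snapshots the stack as it existed when the workload started.
class DebugLabelStack
{
public:
    void begin(const std::string& label) { stack.push_back(label); }

    void end()
    {
        if (!stack.empty())
        {
            stack.pop_back();
        }
    }

    // Snapshot taken at, e.g., vkCmdBeginRenderPass() time; labels pushed
    // inside the render pass are deliberately not captured.
    std::vector<std::string> snapshot() const { return stack; }

private:
    std::vector<std::string> stack;
};
```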
## Message protocol

For each workload in a command buffer, or part-workload in the case of a
suspended render pass, we record a JSON metadata blob containing the payload
we want to send.

The low-level protocol message contains:

* Message type: `uint8_t`
* Sequence ID: `uint64_t` (optional, implied by message type)
* Tag ID: `uint64_t`
* JSON length: `uint32_t`
* JSON payload: `uint8_t[]`

Each workload will read whatever properties it can from the `tagID` metadata
and will then merge in all fields from any subsequent `sequenceID.tagID`
metadata that matches.
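The framing described above can be sketched as a simple packing routine. Field order and little-endian byte order are assumptions here, as is omitting the optional sequence ID; the README does not pin down the wire format:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the low-level message framing: type byte, tag ID, JSON
// length, then the JSON payload. The optional sequence ID is omitted
// for brevity. Endianness and field order are assumptions.
std::vector<uint8_t> packMessage(uint8_t type, uint64_t tagID,
                                 const std::string& json)
{
    std::vector<uint8_t> out;
    out.push_back(type);
    for (int i = 0; i < 8; i++)  // Tag ID, little-endian
    {
        out.push_back(uint8_t(tagID >> (8 * i)));
    }
    uint32_t length = uint32_t(json.size());
    for (int i = 0; i < 4; i++)  // JSON length, little-endian
    {
        out.push_back(uint8_t(length >> (8 * i)));
    }
    out.insert(out.end(), json.begin(), json.end());
    return out;
}
```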
- - -
_Copyright © 2024, Arm Limited and contributors._
Lines changed: 98 additions & 0 deletions

# Layer: GPU Timeline - Command Buffer Modelling

One of the main challenges of this layer is modelling behavior in queues and
command buffers that is not known until submit time, and then taking
appropriate actions based on the combination of both the head state of the
queue and the content of the pre-recorded command buffers.

Our design to solve this is a lightweight software command stream which is
recorded when a command buffer is recorded, and then executed when the
command buffer is submitted to the queue. Just like a real hardware command
stream, these commands can update state or trigger other actions that we
need performed.
## Layer commands

**MARKER_BEGIN(const std::string\*):**

* Push a new marker onto the queue debug label stack.

**MARKER_END():**

* Pop the latest marker from the queue debug label stack.

**RENDERPASS_BEGIN(const json\*):**

* Set the current workload to a new render pass with the passed metadata.

**RENDERPASS_CONTINUE(const json\*):**

* Update the current workload, which must be a render pass, with extra
  draw count metadata.

**COMPUTE_DISPATCH_BEGIN(const json\*):**

* Set the current workload to a new compute dispatch with the passed metadata.

**TRACE_RAYS_BEGIN(const json\*):**

* Set the current workload to a new trace rays workload with the passed
  metadata.

**BUFFER_TRANSFER_BEGIN(const json\*):**

* Set the current workload to a new buffer transfer.

**IMAGE_TRANSFER(const json\*):**

* Set the current workload to a new image transfer.

**WORKLOAD_END():**

* Mark the current workload as complete, and emit a built metadata entry
  for it.
## Layer command recording

Command buffer recording effectively builds two separate state structures
for the layer.

The first is a per-workload or per-restart JSON structure that contains the
metadata we need for that workload. For partial workloads - e.g. a dynamic
render pass begin that has been suspended - this metadata will be partial and
rely on later restart metadata to complete it.

The second is the layer "command stream" that contains the bytecode commands
to execute when the command buffer is submitted to the queue. These commands
are very simple, consisting of a list of command+pointer pairs, where the
pointer value may be unused by some commands. Commands are stored in a
`std::vector`, but we reserve enough memory to store 256 commands without
reallocating, which is enough for the majority of command buffers we see in
real applications.

The command stream for a secondary command buffer is inlined into the primary
command buffer during recording.
## Layer command playback

The persistent state for command playback belongs to the queues that the
command buffers are submitted to. The command stream bytecode is run by a
bytecode interpreter associated with the state of the current queue, giving
the interpreter access to the current `submitID` and queue debug label stack.
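A minimal sketch of such an interpreter, with queue-owned state and a reduced command set for brevity (all names here are illustrative, not the layer's implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Reduced command set for the sketch.
enum class LayerCommand : uint8_t { markerBegin, markerEnd, workloadEnd };

// Persistent playback state: lives with the queue, not the command buffer.
struct QueueState
{
    uint64_t submitID = 0;
    std::vector<std::string> labelStack;
    size_t workloadsEmitted = 0;
};

void runCommandStream(
    QueueState& queue,
    const std::vector<std::pair<LayerCommand, const void*>>& commands)
{
    queue.submitID++;  // New submit: advance the per-queue submit counter
    for (const auto& [command, payload] : commands)
    {
        switch (command)
        {
        case LayerCommand::markerBegin:
            queue.labelStack.push_back(
                *static_cast<const std::string*>(payload));
            break;
        case LayerCommand::markerEnd:
            if (!queue.labelStack.empty())
            {
                queue.labelStack.pop_back();
            }
            break;
        case LayerCommand::workloadEnd:
            // Real layer: emit the workload's metadata message here.
            queue.workloadsEmitted++;
            break;
        }
    }
}
```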
## Future: Async commands

One of our longer-term goals is to be able to capture indirect parameters,
which will be available after-the-fact once the GPU has processed the command
buffer. Once we have the data we can emit an annex message containing
parameters for each indirect `submitID.tagID` pair in the command buffer.

We need to be able to emit the metadata after the commands are complete,
and correctly synchronize use of the indirect capture staging buffer
if command buffers are reissued. My current thinking is that we would
implement this using additional layer commands that are processed on submit,
including support for async commands that run in a separate thread and
wait on the command buffer completion fence before running.
- - -
_Copyright © 2024, Arm Limited and contributors._

layer_gpu_timeline/source/CMakeLists.txt

Lines changed: 9 additions & 1 deletion

@@ -44,7 +44,14 @@ add_library(
     device.cpp
     entry.cpp
     instance.cpp
-    layer_device_functions.cpp)
+    layer_device_functions_command_buffer.cpp
+    layer_device_functions_command_pool.cpp
+    layer_device_functions_debug.cpp
+    layer_device_functions_dispatch.cpp
+    layer_device_functions_draw_call.cpp
+    layer_device_functions_queue.cpp
+    layer_device_functions_render_pass.cpp
+    layer_device_functions_trace_rays.cpp)

 target_include_directories(
     ${VK_LAYER} PRIVATE

@@ -66,6 +73,7 @@ target_compile_definitions(
 target_link_libraries(
     ${VK_LAYER}
     lib_layer_framework
+    lib_layer_trackers
     $<$<PLATFORM_ID:Android>:log>)

 if (CMAKE_BUILD_TYPE STREQUAL "Release")

layer_gpu_timeline/source/device.hpp

Lines changed: 17 additions & 2 deletions

@@ -57,6 +57,7 @@
 #include <vulkan/vk_layer.h>

 #include "framework/device_dispatch_table.hpp"
+#include "trackers/device.hpp"

 #include "instance.hpp"

@@ -127,7 +128,21 @@ class Device
      */
     ~Device();

+    /**
+     * @brief Get the cumulative stats for this device.
+     */
+    Tracker::Device& getStateTracker()
+    {
+        return stateTracker;
+    }
+
 public:
+    /**
+     * @brief The driver function dispatch table.
+     */
+    DeviceDispatchTable driver {};
+
+private:
     /**
      * @brief The instance this device is created with.
      */

@@ -144,7 +159,7 @@ class Device
     const VkDevice device;

     /**
-     * @brief The driver function dispatch table.
+     * @brief State tracking for this device;
      */
-    DeviceDispatchTable driver {};
+    Tracker::Device stateTracker;
 };
