|
| 1 | +# Layer: GPU Timeline |
| 2 | + |
| 3 | +This layer is used with Arm GPUs for tracking submitted schedulable workloads |
| 4 | +and emitting semantic information about them. This data can be combined with |
| 5 | +the raw workload execution timing information captured using the Android |
| 6 | +Perfetto service, providing developers with a richer debug visualization. |
| 7 | + |
| 8 | +## What devices? |
| 9 | + |
| 10 | +The Arm GPU driver integration with the Perfetto render stages scheduler event |
| 11 | +trace has been supported at production quality since the r47p0 driver version. |
| 12 | +However, associating semantics from this layer relies on a further integration |
| 13 | +with debug labels which requires an r51p0 or later driver version. |
| 14 | + |
| 15 | +## What workloads? |
| 16 | + |
| 17 | +A schedulable workload is the smallest workload that the Arm GPU command stream |
| 18 | +scheduler will issue to the GPU hardware work queues. This includes the |
| 19 | +following workload types: |
| 20 | + |
| 21 | +* Render passes, split into: |
| 22 | + * Vertex or Binning phase |
| 23 | + * Fragment or Main phase |
| 24 | +* Compute dispatches |
| 25 | +* Trace rays |
| 26 | +* Transfers to a buffer |
| 27 | +* Transfers to an image |
| 28 | + |
| 29 | +Most workloads are dispatched using a single API call, and are trivial to |
| 30 | +manage in the layer. However, render passes are more complex and need extra |
| 31 | +handling. In particular: |
| 32 | + |
| 33 | +* Render passes are issued using multiple API calls. |
| 34 | +* Useful render pass properties, such as draw count, are not known until the |
| 35 | + render pass recording has ended. |
| 36 | +* Dynamic render passes using `vkCmdBeginRendering()` and `vkCmdEndRendering()` |
| 37 | + can be suspended and resumed across command buffer boundaries. Properties |
| 38 | + such as draw count are not defined by the scope of a single command buffer. |
| 39 | + |
| 40 | +## Tracking workloads |
| 41 | + |
| 42 | +This layer tracks workloads encoded in command buffers, and emits semantic |
| 43 | +metadata for each workload via a communications side-channel. A host tool |
| 44 | +combines the semantic data stream with the Perfetto data stream, using debug |
| 45 | +label tags injected by the layer as a common cross-reference to link across |
| 46 | +the streams. |
| 47 | + |
| 48 | +### Workload labelling |
| 49 | + |
| 50 | +Command stream labelling is implemented using `vkCmdDebugMarkerBeginEXT()` |
| 51 | +and `vkCmdDebugMarkerEndEXT()`, wrapping one layer-owned `tagID` label around |
| 52 | +each semantic workload. This `tagID` can unambiguously refer to this workload |
| 53 | +encoding, and metadata that we do not expect to change per submit will be |
| 54 | +emitted using the matching `tagID` as the sole identifier. |
| 55 | + |
| 56 | +_**TODO:** Dynamic `submitID` tracking is not yet implemented._ |
| 57 | + |
| 58 | +The `tagID` label is encoded into the recorded command buffer which means, for |
| 59 | +reusable command buffers, it is not an unambiguous identifier of a specific |
| 60 | +running workload. To allow us to disambiguate specific workload instances, the |
| 61 | +layer can optionally add an outer wrapper of `submitID` labels around each |
| 62 | +submitted command buffer. This wrapper is only generated if the submit contains |
| 63 | +any command buffers that require the generation of a per-submit annex (see the |
| 64 | +following section for when this is needed). |
| 65 | + |
| 66 | +The `submitID.tagID` pair of IDs uniquely identifies a specific running |
| 67 | +workload, and can be used to attach an instance-specific metadata annex to a |
| 68 | +specific submitted workload rather than to the shared recorded command buffer. |
| 69 | + |
| 70 | +### Workload metadata for split render passes |
| 71 | + |
| 72 | +_**TODO:** Split render pass tracking is not yet implemented._ |
| 73 | + |
| 74 | +Dynamic render passes can be split across multiple Begin/End pairs, including |
| 75 | +being split across command buffer boundaries. If these splits occur within a |
| 76 | +single primary command buffer, or its secondaries, they are handled transparently |
| 77 | +by the layer and the render pass appears as a single message, as if no splits had occurred. If |
| 78 | +these splits occur across primary command buffer boundaries, then some |
| 79 | +additional work is required. |
| 80 | + |
| 81 | +In our design a `tagID` debug marker is only started when the render pass first |
| 82 | +starts (not on resume), and stopped at the end of the render pass (not on |
| 83 | +suspend). The same `tagID` is used to refer to all parts of the render pass, |
| 84 | +no matter how many times it was suspended and resumed. |
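
The marker rule above can be sketched as a small state tracker. This is a minimal illustration, not the layer's implementation: the class and method names are hypothetical, the `resuming` and `suspending` booleans stand in for `VK_RENDERING_RESUMING_BIT` and `VK_RENDERING_SUSPENDING_BIT`, and the sketch ignores the cross-command-buffer case that needs submit-time resolution.

```cpp
#include <cstdint>

// Hypothetical tracker: names are illustrative, not the layer's API.
class RenderPassTagger
{
public:
    // Returns true if a tagID debug marker should be opened for this begin
    // call. A resumed render pass reuses the tagID allocated when it started.
    bool begin(bool resuming)
    {
        if (resuming)
        {
            return false;
        }

        tagID = nextTagID++;
        return true;
    }

    // Returns true if the tagID debug marker should be closed for this end
    // call. A suspended render pass keeps its marker open.
    bool end(bool suspending) const
    {
        return !suspending;
    }

    // The tagID shared by all parts of the current render pass.
    uint64_t currentTagID() const
    {
        return tagID;
    }

private:
    uint64_t nextTagID = 1;
    uint64_t tagID = 0;
};
```

With this shape, a begin/suspend/resume/end sequence opens one marker and closes one marker, and every part of the render pass reports the same `tagID`.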
| 85 | + |
| 86 | +If a render pass splits across command buffers, we cannot precompute metrics |
| 87 | +based on `tagID` alone, even if the command buffers are one-time use. This is |
| 88 | +because we do not know what combination of submitted command buffers will be |
| 89 | +used, and so we cannot know what the render pass contains until submit time. |
| 90 | +Split render passes will emit a `submitID.tagID` metadata annex containing |
| 91 | +the parameters that can only be known at submit time. |
| 92 | + |
| 93 | +### Workload metadata for compute dispatches |
| 94 | + |
| 95 | +_**TODO:** Compute workgroup parsing from the SPIR-V is not yet implemented._ |
| 96 | + |
| 97 | +Compute workload dispatch is simple to track, but one of the metadata items we |
| 98 | +want to export is the total size of the work space (work_group_count * |
| 99 | +work_group_size). |
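
As a worked example of the work space calculation, the sketch below multiplies the per-axis group count by the per-axis group size. The `Dim3` alias and the `totalWorkItems()` helper are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper types: illustrative only, not the layer's API.
using Dim3 = std::array<uint64_t, 3>;

// Total invocations is the product of (group count * group size) per axis.
uint64_t totalWorkItems(const Dim3& groupCount, const Dim3& groupSize)
{
    uint64_t total = 1;
    for (size_t i = 0; i < 3; i++)
    {
        total *= groupCount[i] * groupSize[i];
    }

    return total;
}
```

For example, a dispatch of 4x2x1 work groups using an 8x8x1 work group size covers 32x16x1 invocations, which is 512 in total.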
| 100 | + |
| 101 | +The work group count is defined by the API call, but may be an indirect |
| 102 | +parameter (see indirect tracking below). |
| 103 | + |
| 104 | +The work group size is defined by the program pipeline, and is defined in the |
| 105 | +SPIR-V via a literal or a build-time specialization constant. To support this |
| 106 | +use case we will need to parse the SPIR-V when the pipeline is built, if |
| 107 | +SPIR-V is available. |
| 108 | + |
| 109 | +### Workload metadata for indirect calls |
| 110 | + |
| 111 | +_**TODO:** Indirect parameter tracking is not yet implemented._ |
| 112 | + |
| 113 | +One of the valuable pieces of metadata that we want to present is the size of |
| 114 | +each workload. For render passes this is captured at API call time, but for |
| 115 | +other workloads the size can be an indirect parameter that is not known when |
| 116 | +the triggering API call is made. |
| 117 | + |
| 118 | +To capture indirect parameters we insert a transfer that copies the indirect |
| 119 | +parameters into a layer-owned buffer. To ensure exclusive use of the buffer and |
| 120 | +avoid data corruption, each buffer region used is unique to a specific `tagID`. |
| 121 | +Attempting to submit the same command buffer multiple times will result in |
| 122 | +the workload being serialized to avoid racy access to the buffer. Once the |
| 123 | +buffer has been retrieved by the layer, a metadata annex containing the |
| 124 | +indirect parameters will be emitted using the `submitID.tagID` pair. This may |
| 125 | +be some time later than the original submit. |
| 126 | + |
| 127 | +### Workload metadata for user-defined labels |
| 128 | + |
| 129 | +The workload metadata captures user-defined labels that the application |
| 130 | +provides using `vkCmdDebugMarkerBeginEXT()` and `vkCmdDebugMarkerEndEXT()`. |
| 131 | +These are a stack-based debug mechanism where `Begin` pushes a new entry onto |
| 132 | +the stack, and `End` pops the most recent entry off the stack. |
| 133 | + |
| 134 | +Workloads are labelled with the stack values that existed when the workload |
| 135 | +was started. For render passes this is the value on the stack when, e.g., |
| 136 | +`vkCmdBeginRenderPass()` was called. We do not capture any labels that exist |
| 137 | +inside the render pass. |
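
The capture rule above can be sketched with a simple stack that is snapshotted when a workload starts. The `LabelStack` class is a hypothetical stand-in, and it deliberately ignores the queue-level resolution at submit time.

```cpp
#include <string>
#include <vector>

// Hypothetical label stack: illustrative only, not the layer's API.
class LabelStack
{
public:
    // Models a Begin call: push a new label onto the stack.
    void begin(const std::string& label)
    {
        stack.push_back(label);
    }

    // Models an End call: pop the most recent label, if there is one.
    void end()
    {
        if (!stack.empty())
        {
            stack.pop_back();
        }
    }

    // Copy of the labels active right now, outermost first. A workload
    // would take this snapshot when it starts.
    std::vector<std::string> snapshot() const
    {
        return stack;
    }

private:
    std::vector<std::string> stack;
};
```

Because the snapshot is a copy taken at workload start, labels pushed later inside the render pass do not appear in the workload's metadata.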
| 138 | + |
| 139 | +The debug label stack belongs to the queue, not to the command buffer, so the |
| 140 | +value of the label stack is not known until submit time. The debug information |
| 141 | +for a specific `submitID.tagID` pair is therefore provided as an annex at |
| 142 | +submit time once the stack can be resolved. |
| 143 | + |
| 144 | +## Message protocol |
| 145 | + |
| 146 | +For each workload in a command buffer, or part-workload in the case of a |
| 147 | +suspended render pass, we record a JSON metadata blob containing the payload |
| 148 | +we want to send. |
| 149 | + |
| 150 | +The low level protocol message contains: |
| 151 | + |
| 152 | +* Message type `uint8_t` |
| 153 | +* Sequence ID `uint64_t` (optional, implied by message type) |
| 154 | +* Tag ID `uint64_t` |
| 155 | +* JSON length `uint32_t` |
| 156 | +* JSON payload `uint8_t[]` |
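
One possible encoding of these fields is sketched below. The field order follows the list above, but the little-endian byte order, the tight packing with no padding, and the `Message`, `appendLE()`, and `encode()` names are all assumptions made for illustration. The optional sequence ID is modelled with `std::optional`; a real decoder would decide whether it is present from the message type.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory form of one protocol message.
struct Message
{
    uint8_t type;
    std::optional<uint64_t> sequenceID;  // Present only for some message types
    uint64_t tagID;
    std::string json;
};

// Append an unsigned integer as little-endian bytes (assumed byte order).
template <typename T>
void appendLE(std::vector<uint8_t>& out, T value)
{
    for (size_t i = 0; i < sizeof(T); i++)
    {
        out.push_back(static_cast<uint8_t>(value >> (8 * i)));
    }
}

// Serialize the fields in the listed order, with no padding between them.
std::vector<uint8_t> encode(const Message& msg)
{
    std::vector<uint8_t> out;
    appendLE(out, msg.type);
    if (msg.sequenceID)
    {
        appendLE(out, *msg.sequenceID);
    }

    appendLE(out, msg.tagID);
    appendLE(out, static_cast<uint32_t>(msg.json.size()));
    out.insert(out.end(), msg.json.begin(), msg.json.end());
    return out;
}
```

Under these assumptions a message without a sequence ID has a 13 byte header, and a message with one has a 21 byte header, followed in both cases by the JSON payload.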
| 157 | + |
| 158 | +Each workload will read whatever properties it can from the `tagID` metadata |
| 159 | +and will then merge in all fields from any subsequent `sequenceID.tagID` |
| 160 | +metadata that matches. |
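
The merge step might look like the sketch below, which uses a flat string map as a stand-in for a parsed JSON object. The `Fields` alias and `mergeMetadata()` name are hypothetical, and letting the annex value win when both sides define the same key is an assumption.

```cpp
#include <map>
#include <string>

// Stand-in for a parsed JSON object: a flat map of field name to value.
using Fields = std::map<std::string, std::string>;

// Merge annex fields into the base tagID metadata. Annex values replace
// base values on a key collision (assumed policy).
Fields mergeMetadata(const Fields& base, const Fields& annex)
{
    Fields merged = base;
    for (const auto& [key, value] : annex)
    {
        merged[key] = value;
    }

    return merged;
}
```

A host tool could apply this once per matching annex, in arrival order, so that annexes emitted some time after the original submit still end up on the right workload.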
| 161 | + |
| 162 | +- - - |
| 163 | + |
| 164 | +_Copyright © 2024, Arm Limited and contributors._ |