Skip to content

Conversation

William-An
Copy link
Contributor

@William-An William-An commented Sep 16, 2025

Some changes to help mitigate the spinlock issue with the accel-sim trace-based simulation, as the captured traces cannot reflect the true behavior of spinlocks and will skew the simulation result.

The fix has two components: detection and handling

  • Spinlock detection
    • Code in tracer_tool/others/spinlock_tool
    • Detection of spinlock is done by
      1. Running the target application multiple times and collecting the instruction histogram between runs.
      2. Diffing the two runs' histograms for instructions with different counts.
      3. These instructions are the non-deterministic code regions that are likely to be spinlocks.
      4. This methodology is based on Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator
    • There are two approaches for collecting the instruction histograms:
      1. Collect individual kernel launches from different contexts and diff based on kernel launch order.
        • This approach is simple, but the CUDA context launch order might not be consistent throughout different runs.
        • Also, for multiprocessing applications, the CUDA kernel launch order might also not be consistent.
      2. Summing up instruction histograms by kernel names and diff based on each kernel's
        collective instruction histogram.
        • This approach should work as long as the actual work (non-spinlock section) is deterministic between program runs.
        • With simple modulo hashing to make sure the histogram does not overflow.
  • Spinlock handling via fast-forwarding
    • During the host side recv_thread_fun of accel-sim's tracer, if ENABLE_SPINLOCK_FAST_FORWARD env is set to 1, the host receiving thread will keep only SPINLOCK_ITER_TO_KEEP iters of spinlock for each spinlock section.
    • Note that the fast-forwarding will only be done on the innermost spinlock loop with a nested spinlock loop with deterministic work between two inner spinlock loops.

@William-An William-An marked this pull request as ready for review September 23, 2025 20:16
@William-An William-An requested review from JRPan and Copilot September 25, 2025 14:18
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request introduces spinlock detection and fast-forwarding capabilities for accel-sim's trace-based simulation to mitigate timing inaccuracies caused by spinlocks in captured traces.

  • Implements a two-phase spinlock detection tool that identifies non-deterministic code regions by comparing instruction execution histograms across multiple runs
  • Adds fast-forwarding mechanism to tracer that limits the number of spinlock iterations recorded in traces
  • Integrates spinlock handling options into the trace generation pipeline

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
util/tracer_nvbit/tracer_tool/tracer_tool.cu Adds spinlock fast-forwarding logic and instruction index tracking
util/tracer_nvbit/tracer_tool/inject_funcs.cu Updates instrumentation to pass instruction indices
util/tracer_nvbit/tracer_tool/common.h Adds instruction index field to trace structure
util/tracer_nvbit/tracer_tool/Makefile Updates C++ standard to C++17 for new features
util/tracer_nvbit/run_hw_trace.py Integrates spinlock detection and handling into trace workflow
util/tracer_nvbit/others/spinlock_tool/* New spinlock detection tool implementation
util/tracer_nvbit/Makefile Includes spinlock tool in build process
util/tracer_nvbit/.gitignore Simplifies ignore patterns
util/job_launching/apps/define-all-apps.yml Adds spinlock test application
.github/workflows/main.yml Adds matrix testing for spinlock handling options

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

runs-on: tgrogers-gpu01
strategy:
matrix:
spinlock_handling: ["none", "fast_forward"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would run every thing twice, including the builds. Can we just run the tracer twice? And how is CI determining if fast forward is correct? Just by functionally passing the test?

Copy link
Contributor Author

@William-An William-An Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Okay, I will change the CI script to just run the tracer.
  • Right now, CI won't test if fast forward is correct; it will just trace it and run the generated trace, making sure fast-forward won't mess up non-spinlock kernels.

I can also change the CI script to run and trace the spinlock app, and then test to check if the total number of instructions is reduced or not.

- args: 16
accel-sim-mem: 1G

Spinlock:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a dedicated "suite" instead of part of uBench? Then can we only run fast forward on this one?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I saw that on gpu-app-collection Spinlock is part of the uBench. Can you move this under ubench suite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can, but the correlation of it will be terrible with just fast-forwarding when running with the ubench suite, given that this app is acquiring a highly contested lock.

I gave it a dedicated suite as all the atomic kernels also got a dedicated suite, even though they are under the ubench folder in the gpu-app-collection.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay valid point

/* Add Source code line number for current instr */
nvbit_add_call_arg_const_val32(instr, (int)line_num);
/* Add instruction index for current instr (spinlock detection) */
nvbit_add_call_arg_const_val32(instr, (uint32_t)instr->getIdx());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing wrong. Just curious, what is id? Why not just use PC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just index for each instruction in the kernel's instruction array. I am pretty sure it is just Offset/16.

+ " ; "
)

for path, content in [("run.sh", tracer_contents), ("run_spinlock_detection.sh", spinlock_contents)]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thi is okay. What about writing to the same script? Any benefit for seperating them? I just think it might be confusing for someone want to run the script manually. Just asking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spinlock detection is decoupled from the actual tracing. You just need to run it once to generate the instruction indices. I think that is why I separate those two.

@William-An William-An requested a review from JRPan October 10, 2025 18:53
@William-An
Copy link
Contributor Author

Force merge as the failure is unrelated to code changes.

@William-An William-An merged commit 96d7923 into accel-sim:dev Oct 10, 2025
12 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants