- 
                Notifications
    You must be signed in to change notification settings 
- Fork 41
NERSC GPU hackathon (Dec 2021)
This page summarises preparations for the hackathon on the 2nd/8th/9th/10th December 2021. We will use CoreNEURON+NMODL with a few sets of test model data.
NOTE If you see any issues that we need to be handled after hackathon should be created in https://github.com/neuronsimulator/gpuhackathon/issues
- 
Olli - Install Caliper, Ninja, NMODL and NMODL's Python dependencies on Ascent. 
- 
Pramod - Update Ascent instructions to enable NMODL (same as NERSC). 
- 
Olli - Channel benchmark failing on Ascent during load input dataset. Should we re-generate data on Ascent? 
- 
Ioannis - Generate & copy input dataset for Solver analysis and optimisation during Hackathon. Add info on this wiki page for the same. 
- 
Omar - List of all OpenACC directives and APIs used in the CoreNEURON + NMODL. See https://github.com/neuronsimulator/gpuhackathon/blob/main/overview.md 
- 
Olli - Common profiling/benchmarking script which could compare & plot OpenMP vs OpenACC runtimes 
- 
First Day Presentation Preparation - Pramod + Olli: 3 min Introduction
- Ioannis + Christos: Solver introduction to mentors
- Pramod + Omar + Olli: Compute loop + DSL code generation introduction to mentors
 
- Alex: Check the neuron-gpu-nightly execution on Perlmutter. Do they work with/without MPI?
-  Alex: NEURON + CoreNEURON - Interviews and NEURON + CoreNEURON + NMODL - Interviews build on Perlmutter & Ascent. Do they work?
- Alex: Should we automatically disable Interviews on the Cray systems and IBM Power systems by detection via CMake?
 
- Alex: Check if there is any improvement needed for building from source on Hackathon systems.
-  Nico + Alex: Make usage of OpenACC API's (acc_memcpy_to_deviceandacc_copyin) portable for OpenMP as well as OpenACC. In short https://github.com/BlueBrain/CoreNeuron/blob/master/coreneuron/gpu/nrn_acc_manager.cpp should become OpenACC as well as OpenMP.
- Nico + Alex: Test first ringtest and then channel-benchmark for the correctness of the migration
- Pramod: Review OpenACC/OpenMP interoperability via API calls, see e.g. https://forums.developer.nvidia.com/t/openacc-and-openmp-data-interoperability/181866.
- Omar: Add OpenMP offload example like https://bbpcode.epfl.ch/reps/user/kumbhar/cunrnran123/tree/mainacc.c (still using CUDA Random123) -- it moved to GitLab https://bbpgitlab.epfl.ch/hpc/user/cunrnran123
- Omar: Trying Random123 directly in OpenMP offload regions
- Omar: Anything to discuss / ask to mentors about Random123 portability on different gpus?
- Add an option in CoreNEURON that only executes Hines solver ?
-  Find out a way to dynamically set a good number for the nwarpvariable (used to distribute the cells)
- Ioannis + Christos: Profiling of a realistic model and analyse different performance metrics with mentors to understand the limitations.
- Ioannis + Christos: With current algorithm, investigate possible performance improvement opportunities
- Ioannis + Christos: Performance comparison of OpenACC vs OpenMP vs CUDA implementation
- Ioannis + Christos: Would benefit from special memory types usage?
- Ioannis + Christos: Look into algorithmic improvements to expose more parallelism (if suitable)
- Olli - Update codegen backend to emit OpenMP constructs with OpenACC (ongoing)
-  Olli + Omar - Review with the mentors performance aspects (e.g. presentvsmapaspects)
- Build system changes to enable OpenMP target offload as well as OpenACC
- Olli: Try simple OpenMP + OpenACC offload test (e.g. with inbuilt ring test)
-  Olli: Update Hines Solver in coreneuron with OpenMP (ongoing)
- - Christos + Ioannis to review: https://github.com/BlueBrain/CoreNeuron/pull/693
-  - Confirm if we need an OpenMP equivalent to #pragma acc loop seq
 
- Pramod: Measure performance difference between OpenMP and OpenACC runs
- Olli: Under coreneuron event communication part, start tackling OpenACC pragmas or API calls that are not converted to OpenMP
- Pramod: Discuss compiler bug / error with NVIDIA colleague (Brent): https://forums.developer.nvidia.com/t/enabling-openmp-offload-breaks-openacc-code/196643/6
- Pramod: Prepare a slide describing the goals - what currently we do and what we would like to do?
- Pramod: Organise / plan a discussion with a / few nvidia compiler engineers. Central question is - ABI compatibility across different CUDA + OpenMP + OpenACC runtimes/compilers.
- First, check/fix https://github.com/BlueBrain/CoreNeuron/issues/594
- Re-measure the slowdown from using unified memory (ringtest + channel-benchmark)
- Identify which data structures are causing the slowdown. How to identify this easily with the tools?
These are instructions to build + run on Ascent (login1.ascent.olcf.ornl.gov).
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATHNote that Boost is an optional dependency, but it has an apparently-significant impact on both initialisation time and profiling time when using NSight Compute.
# clone repository
cd $HOME
git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron # or: git clone --branch hackathon_main [email protected]:BlueBrain/CoreNeuron.git
cd CoreNeuron
git submodule update --recursive
mkdir -p build && cd build
cmake .. -G Ninja -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON -DCMAKE_INSTALL_PREFIX=../install  -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0"  -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ -DCMAKE_CXX_FLAGS="-gpu=cuda11.0" -DCORENRN_ENABLE_UNIT_TESTS=OFF
cmake --build . --parallel 8As $HOME is not writable, create your own directory into the project directory:
mkdir -p /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER
cd /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USERNow run a tiny, functional test on the GPU:
- Allocate a node
bsub -P GEN170 -J neuron -W 90 -nnodes 1 -alloc_flags "gpudefault" -Is $SHELLNote: gpumps here would enable MPS, but that is incompatible with the CUDA 11.0 version we are using here.
- Make necessary modules are loaded:
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost- Run simple function test on the GPU:
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi- How to generate NSight System report with simple ringtest:
module load nvhpc/21.9
module swap cuda cuda/11.0.2
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
module load nsight-systems/2021.5.1.117
OMP_NUM_THREADS=1 jsrun -E LD_PRELOAD=/opt/ibm/spectrum_mpi/lib/pami_471/libpami.so --gpu_per_rs 1 -n 1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0,CALI_CONFIG=nvtx,OMP_NUM_THREADS=$OMP_NUM_THREADS --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation -o ring_report $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi --cell-permute 2- How to generate full NSight Compute report with simple ringtest for a single kernel:
module load nvhpc/21.9
module swap cuda cuda/11.0.2
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
module load nsight-compute/2021.2.2
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 ncu -k <kernel_name_you_want_to_analyze> -f --profile-from-start=off --target-processes all --set full -o ring_report $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi --cell-permute 2For the purposes of the hackathon we have provided two datasets that cover the basic use cases for the coming days.
For running small tests you should use the following data set:
/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1440-cells-1-rank
This data set has a circuit of 1440 cells and is meant to be simulated with 1 MPI rank on 1 GPU.
To run CoreNEURON with this data set you may use the following command:
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 \
  jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
  --datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1440-cells-1-rank \
  --mpi --gpu --cell-permute=2 --tstop=5For running the simulator to gather profiling or other performance measurements you should use the following data set:
/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks
This data set resembles a production level simulation with a large enough circuit to fill up most of the GPU memory (~14GB). It is meant to be executed on 6 MPI ranks or 6 threads on 1 GPU and consists of 8800 cells. Note that MPS needs to be enabled if you are using pure MPI execution and MPS is not currently working on Ascent with CUDA 11.0, so we need to run this with 1 rank / 6 threads / 1 GPU.
To run CoreNEURON with this data set you may use the following command:
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=6 \
  jsrun -n 1 --cpu_per_rs 6 --tasks_per_rs 1 --gpu_per_rs 1 --bind rs $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
  --datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks \
  --mpi --gpu --cell-permute=2 --tstop=5Note: The datasets were generated with 1k synapses/cell and 100 stimuli/cell. Number of synapses and stimuli are fewer than the original model.
- The system-wide modules only go up to NVHPC 21.7, which has known issues with NEURON.
We have ourselves inserted an installation of NVHPC 21.9 (module nvidia/21.9) into the Cray Programming Environment setup.
- Note that NVHPC 21.9 is configured to use the system GCC 7.5 standard library, while we have built some other dependencies with GCC 9.2. This seems to be "close enough".
- 
-tp haswellor-DR123_USE_SSE=0is required becausenvc++defaults to-tp zenon the Perlmutter nodes, which defines__ABM__and causes Random123 to try and includeintrin.h, which fails.
# clone repository
git clone --branch hackathon_main [email protected]:BlueBrain/CoreNeuron.git # or git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron.git
cd CoreNeuron && mkdir -p build && cd build
# allocate node
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 4 --account=ntrain9_g
# Use our own hand-crafted modules for cuda 11.4 (from nvhpc/21.9) and PrgEnv-nvidia for nvhpc 21.9
module use /global/cfs/cdirs/ntrain9/neuron/modules
# Also, spack-generated modules for dependencies (caliper, ninja, py-*)
module use /global/cfs/cdirs/ntrain9/neuron/spack_modules/cray-sles15-zen2
# Load modules: prefer CUDA 11.4 from NVHPC/21.9
module swap cuda cuda/11.4.1
module load cmake nvidia/21.9 python caliper ninja py-pytest py-pyyaml py-jinja2 py-sympy
# Build CoreNEURON
cmake .. -G Ninja \
  -DCORENRN_ENABLE_CALIPER_PROFILING=ON  \
  -DCORENRN_ENABLE_GPU=ON \
  -DCORENRN_ENABLE_NMODL=ON \
  -DCORENRN_NMODL_FLAGS="sympy --analytic" \
  -DCORENRN_EXTERNAL_BENCHMARK_DATA=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021 \
  -DCMAKE_INSTALL_PREFIX=../install \
  -DCMAKE_CUDA_COMPILER=nvcc \
  -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" \
  -DCMAKE_CXX_COMPILER=CC \
  -DCMAKE_CUDA_ARCHITECTURES=80 \
  -DCORENRN_NMODL_DIR=/global/cfs/cdirs/ntrain9/neuron/spack/cray-sles15-zen2/gcc-9.3.0/nmodl-0.3.0.20111126-m2kos252sgvxkq7xltv5w35e4irae7gj
cmake --build . --parallel
ctest --output-on-failure -j 16If you are working on the NMODL code generation, avoid -DCORENRN_NMODL_DIR=.. option or provide your own install directory -DCORENRN_NMODL_DIR=...
If you haven't allocated a session
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 2 --account=ntrain9_gNote that we are setting NVCOMPILER_ACC_SYNCHRONOUS=1 below so that we get correct timings for individual kernels. Otherwise kernel timings are incorrect due to async launch.
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 CALI_CONFIG=runtime-report,calc.inclusive srun -n 2 benchmark/x86_64/special-core -e 1 --datpath=../tests/integration/ring --mpi --gpu --cell-permute=2
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 84a7f50 (2021-11-25 18:03:04 +0100)
 Additional mechanisms from files
 Ca.mod CaDynamics_DC0.mod CaDynamics_E2.mod Ca_HVA.mod Ca_HVA2.mod Ca_LVAst.mod CoreConfig.mod Ih.mod Im.mod K_Pst.mod K_Tst.mod KdShu2007.mod NaTa_t.mod NaTg.mod NaTs2_t.mod Nap_Et2.mod ProbAMPANMDA_EMS.mod ProbGABAAB_EMS.mod ProfileHelper.mod SK_E2.mod SKv3_1.mod TTXDynamicsSwitch.mod VecStim.mod cacumm.mod cacummb.mod cagk.mod cal2.mod can2.mod cat.mod exp2syn.mod expsyn.mod gap.mod h.mod halfgap.mod hh.mod kadist.mod kaprox.mod kca.mod kd.mod kd2.mod kdb.mod kdrbca1.mod kdrca1.mod km.mod kmb.mod na3n.mod naxn.mod netstim.mod netstim_inhpoisson.mod new_calcium_channels.mod passive.mod pattern.mod stim.mod svclmp.mod
 num_mpi=2
 num_omp_thread=1
 Info : 2 GPUs shared by 2 ranks per node
...
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 CALI_CONFIG=runtime-report,calc.inclusive \
  srun -n 2 benchmark/x86_64/special-core \
  --datpath=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1320-cells-2-ranks/ \
  --mpi --gpu --cell-permute=2 --tstop=100
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 de4e433 (2021-11-26 08:50:58 +0100)
 Additional mechanisms from files
 Ca.mod CaDynamics_DC0.mod CaDynamics_E2.mod Ca_HVA.mod Ca_HVA2.mod Ca_LVAst.mod CoreConfig.mod Ih.mod Im.mod K_Pst.mod K_Tst.mod KdShu2007.mod NaTa_t.mod NaTg.mod NaTs2_t.mod Nap_Et2.mod ProbAMPANMDA_EMS.mod ProbGABAAB_EMS.mod ProfileHelper.mod SK_E2.mod SKv3_1.mod TTXDynamicsSwitch.mod VecStim.mod cacumm.mod cacummb.mod cagk.mod cal2.mod can2.mod cat.mod exp2syn.mod expsyn.mod gap.mod h.mod halfgap.mod hh.mod kadist.mod kaprox.mod kca.mod kd.mod kd2.mod kdb.mod kdrbca1.mod kdrca1.mod km.mod kmb.mod na3n.mod naxn.mod netstim.mod netstim_inhpoisson.mod new_calcium_channels.mod passive.mod pattern.mod stim.mod svclmp.mod
 num_mpi=2
 num_omp_thread=1
 Info : 2 GPUs shared by 2 ranks per node
....
Solver Time : 16.2698
 Simulation Statistics
 Number of cells: 1320
 Number of compartments: 648360
 Number of presyns: 3961320
 Number of input presyns: 0
 Number of synapses: 3960002
 Number of point processes: 7921322
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 11969
 Number of spikes with non negative gid-s: 11969
Path                                     Min time/rank Max time/rank Avg time/rank Time %
main                                         40.376340     40.378758     40.377549 99.399072
  checkpoint                                  0.000001      0.000002      0.000002  0.000004
  output-spike                                0.004200      0.004211      0.004205  0.010353
  simulation                                 16.269795     16.269813     16.269804 40.052045
    spike-exchange                            0.019603      0.041487      0.030545  0.075194
      spike-exchange                          0.019578      0.041452      0.030515  0.075120
        communication                         0.000886      0.000906      0.000896  0.002206
        imbalance                             0.018654      0.040551      0.029603  0.072874
    timestep                                 16.225238     16.247191     16.236214 39.969356
      state-update                            4.167419      4.176959      4.172189 10.270849
        state-SKv3_1                          0.084673      0.086011      0.085342  0.210090
        state-SK_E2                           0.086975      0.087625      0.087300  0.214910
        state-ProbGABAAB_EMS                  0.176391      0.182205      0.179298  0.441385
        state-ProbAMPANMDA_EMS                0.512746      0.521091      0.516919  1.272520
        state-nax                             0.135361      0.136370      0.135866  0.334466
        state-NaTg                            0.121694      0.121713      0.121704  0.299603
        state-Nap_Et2                         0.091850      0.092020      0.091935  0.226320
        state-na3                             0.126686      0.127354      0.127020  0.312690
        state-K_Tst                           0.083087      0.083141      0.083114  0.204605
        state-K_Pst                           0.086288      0.086488      0.086388  0.212665
        state-kmb                             0.078860      0.079250      0.079055  0.194613
        state-KdShu2007                       0.077877      0.078125      0.078001  0.192018
        state-kdr                             0.094169      0.095289      0.094729  0.233198
        state-kdrb                            0.092489      0.092745      0.092617  0.227999
        state-kdb                             0.089317      0.089467      0.089392  0.220060
        state-kd2                             0.087586      0.092668      0.090127  0.221869
        state-kca                             0.363553      0.364731      0.364142  0.896423
        state-kap                             0.109070      0.109326      0.109198  0.268817
        state-kad                             0.112513      0.113974      0.113244  0.278776
        state-Ih                              0.098592      0.100036      0.099314  0.244485
        state-hd                              0.093146      0.094326      0.093736  0.230754
        state-cat                             0.121305      0.121554      0.121430  0.298928
        state-can                             0.121067      0.122852      0.121960  0.300233
        state-Ca_LVAst                        0.129896      0.130772      0.130334  0.320849
        state-cal                             0.108788      0.109597      0.109192  0.268804
        state-Ca_HVA2                         0.142881      0.143095      0.142988  0.351999
        state-cagk                            0.097391      0.098042      0.097716  0.240553
        state-cacum                           0.234768      0.235131      0.234950  0.578385
        state-cacumb                          0.175342      0.176663      0.176002  0.433273
        state-IClamp                          0.002348      0.002558      0.002453  0.006039
        state-CaDynamics_DC0                  0.113226      0.113724      0.113475  0.279346
        state-pas                             0.002896      0.003032      0.002964  0.007297
      update                                  0.116486      0.117232      0.116859  0.287677
      second-order-cur                        0.001943      0.002165      0.002054  0.005056
      matrix-solver                           6.071176      6.123533      6.097355 15.010108
      setup-tree-matrix                       4.745456      4.748895      4.747175 11.686317
        cur-SKv3_1                            0.104343      0.105509      0.104926  0.258301
        cur-SK_E2                             0.097890      0.098530      0.098210  0.241768
        cur-ProbGABAAB_EMS                    0.252629      0.255678      0.254154  0.625660
        cur-ProbAMPANMDA_EMS                  0.903469      0.903876      0.903672  2.224608
        cur-nax                               0.113687      0.115076      0.114382  0.281578
        cur-NaTg                              0.113117      0.113994      0.113556  0.279544
        cur-Nap_Et2                           0.084684      0.084767      0.084725  0.208572
        cur-na3                               0.095460      0.096301      0.095880  0.236033
        cur-K_Tst                             0.076803      0.076846      0.076825  0.189122
        cur-K_Pst                             0.083091      0.083296      0.083193  0.204801
        cur-kmb                               0.084998      0.085193      0.085095  0.209483
        cur-KdShu2007                         0.079360      0.079602      0.079481  0.195662
        cur-kdr                               0.084818      0.085221      0.085019  0.209296
        cur-kdrb                              0.083571      0.083837      0.083704  0.206058
        cur-kdb                               0.074649      0.075234      0.074942  0.184487
        cur-kd2                               0.074704      0.075905      0.075305  0.185380
        cur-kca                               0.102095      0.102144      0.102119  0.251392
        cur-kap                               0.093734      0.094259      0.093996  0.231395
        cur-kad                               0.096892      0.096943      0.096917  0.238586
        cur-Ih                                0.099069      0.099392      0.099230  0.244280
        cur-hd                                0.079998      0.080359      0.080179  0.197379
        cur-cat                               0.103624      0.103911      0.103767  0.255449
        cur-can                               0.118858      0.119161      0.119010  0.292971
        cur-Ca_LVAst                          0.123353      0.126097      0.124725  0.307041
        cur-cal                               0.122882      0.123123      0.123003  0.302800
        cur-Ca_HVA2                           0.142775      0.143094      0.142934  0.351868
        cur-cagk                              0.122285      0.123171      0.122728  0.302125
        cur-cacum                             0.081801      0.082285      0.082043  0.201969
        cur-cacumb                            0.074288      0.075040      0.074664  0.183803
        cur-IClamp                            0.085479      0.088503      0.086991  0.214149
        cur-CaDynamics_DC0                    0.064449      0.064505      0.064477  0.158726
        cur-ttx_ion                           0.063089      0.063417      0.063253  0.155713
        cur-ca_ion                            0.100195      0.100529      0.100362  0.247065
        cur-k_ion                             0.063816      0.064154      0.063985  0.157515
        cur-na_ion                            0.066107      0.066272      0.066190  0.162941
        cur-pas                               0.109251      0.109293      0.109272  0.268999
      deliver-events                          0.982300      1.005465      0.993883  2.446681
        net-receive-ProbGABAAB_EMS            0.000950      0.001073      0.001012  0.002490
        net-receive-ProbAMPANMDA_EMS          0.003695      0.003919      0.003807  0.009372
        net-buf-receive-ExpSyn                0.003570      0.003671      0.003621  0.008913
        net-buf-receive-Exp2Syn               0.003701      0.003835      0.003768  0.009276
        net-buf-receive-ProbGABAAB_EMS        0.068577      0.070341      0.069459  0.170990
        net-buf-receive-ProbAMPANMDA_EMS      0.090366      0.094045      0.092206  0.226986
        update-net-receive-buf                0.399891      0.406366      0.403129  0.992398
          net-receive-buf-cpu2gpu             0.379896      0.385496      0.382696  0.942098
          net-receive-buf-order               0.003375      0.003533      0.003454  0.008503
        check-threshold                       0.182114      0.184026      0.183070  0.450671
  finitialize                                 2.094235      2.094272      2.094254  5.155510
    spike-exchange                            0.000035      0.105358      0.052696  0.129725
      spike-exchange                          0.000031      0.105354      0.052693  0.129715
        communication                         0.000019      0.000022      0.000020  0.000050
        imbalance                             0.000006      0.105331      0.052669  0.129656
    cur-SKv3_1                                0.000031      0.000032      0.000031  0.000078
    cur-SK_E2                                 0.000029      0.000031      0.000030  0.000074
    cur-ProbGABAAB_EMS                        0.000072      0.000073      0.000073  0.000178
    cur-ProbAMPANMDA_EMS                      0.000265      0.000267      0.000266  0.000655
    cur-nax                                   0.000034      0.000034      0.000034  0.000084
    cur-NaTg                                  0.000032      0.000034      0.000033  0.000081
    cur-Nap_Et2                               0.000025      0.000026      0.000025  0.000063
    cur-na3                                   0.000029      0.000030      0.000029  0.000073
    cur-K_Tst                                 0.000023      0.000024      0.000024  0.000058
    cur-K_Pst                                 0.000025      0.000026      0.000025  0.000063
    cur-kmb                                   0.000026      0.000027      0.000027  0.000065
    cur-KdShu2007                             0.000024      0.000024      0.000024  0.000059
    cur-kdr                                   0.000026      0.000026      0.000026  0.000064
    cur-kdrb                                  0.000025      0.000025      0.000025  0.000062
    cur-kdb                                   0.000023      0.000023      0.000023  0.000057
    cur-kd2                                   0.000024      0.000024      0.000024  0.000059
    cur-kca                                   0.000031      0.000032      0.000031  0.000078
    cur-kap                                   0.000027      0.000028      0.000027  0.000068
    cur-kad                                   0.000028      0.000030      0.000029  0.000071
    cur-Ih                                    0.000029      0.000030      0.000029  0.000073
    cur-hd                                    0.000024      0.000025      0.000024  0.000060
    cur-cat                                   0.000030      0.000032      0.000031  0.000076
    cur-can                                   0.000035      0.000035      0.000035  0.000086
    cur-Ca_LVAst                              0.000035      0.000036      0.000035  0.000087
    cur-cal                                   0.000036      0.000037      0.000036  0.000090
    cur-Ca_HVA2                               0.000039      0.000041      0.000040  0.000098
    cur-cagk                                  0.000037      0.000038      0.000037  0.000092
    cur-cacum                                 0.000025      0.000026      0.000025  0.000063
    cur-cacumb                                0.000024      0.000025      0.000024  0.000060
    cur-IClamp                                0.000027      0.000028      0.000027  0.000068
    cur-CaDynamics_DC0                        0.000021      0.000022      0.000022  0.000053
    cur-ttx_ion                               0.000026      0.000026      0.000026  0.000064
    cur-ca_ion                                0.000028      0.000029      0.000029  0.000070
    cur-k_ion                                 0.000018      0.000019      0.000018  0.000046
    cur-na_ion                                0.000022      0.000023      0.000022  0.000055
    cur-pas                                   0.000034      0.000034      0.000034  0.000084
    update-net-receive-buf                    0.000021      0.000023      0.000022  0.000054
  load-model                                 21.857876     21.860913     21.859395 53.812170
To produce a useful profile with NSight Systems we need to configure Caliper to emit NVTX markers (CALI_CONFIG=nvtx) and tell NSight Systems to record regions with names that are not registered strings (NSYS_NVTX_PROFILER_REGISTER_ONLY=0).
To avoid profiling model initialisation and setup, you may want to only record the actual simulation (--capture-range=nvtx --nvtx-capture=simulation).
Additionally, NSight Systems seems to have trouble profiling multiple OpenMP host threads launching GPU kernels at once, so you may want to disable that (OMP_NUM_THREADS=1).
Taken together, an example prefix could be:
CALI_CONFIG=nvtx OMP_NUM_THREADS=1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0 --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation ./x86_64/special-core ...Some other notes:
- during development, just use simple ring test (for quick iteration)
- install nmodl master into project space $CFS/ntrain9/neuronso that mentors or other people can just use standard version for profiling or other non-codegen related tasks.
- Repository and branch which includes lots of tests of OpenACC and OpenMP interoperability https://github.com/iomaganaris/openMP_GPU/tree/magkanar/example. Also the code mentioned in https://forums.developer.nvidia.com/t/openacc-and-openmp-data-interoperability/181866/2