Conversation

@cjac
Contributor

@cjac cjac commented Dec 13, 2024

Resolves Issues

gpu/install_gpu_driver.sh

  • Driver version defaults to the version in the driver .run file, if specified

  • CUDA version defaults to the version in the CUDA .run file, if specified

  • Exclusively use the .run file installation method for CUDA and driver installation

    • Install the non-open driver from the .run file on rocky8
  • Build NCCL from source, since that is the only mechanism which supports all Dataproc OSes

  • Wrap expensive functions in completion checks to reduce re-run time when testing manually

  • Cache build results in GCS

  • Wait on the apt lock when it exists

  • Only install build dependencies if a build is necessary

  • Fix a problem with the Ops Agent not installing by using a venv

  • Install gcc-12 on ubuntu22 to fix a kernel driver FTBFS

  • Mark task completion by creating a file rather than setting a variable

  • Add functions to check and report secure-boot and OS version details

  • Install PyTorch when the correct metadata attributes are specified, but not by default
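
The completion-check pattern above might look roughly like this — a hypothetical sketch, where the marker directory, file naming, and step names are illustrative rather than the script's actual ones:

```shell
# Hypothetical sketch of the completion-check pattern: each expensive step
# creates a marker file on success and is skipped on subsequent runs.
# COMPLETE_DIR and the "driver" step name are illustrative.
COMPLETE_DIR="${COMPLETE_DIR:-/tmp/install-markers}"

is_complete()   { [ -f "${COMPLETE_DIR}/$1.complete" ]; }
mark_complete() { mkdir -p "${COMPLETE_DIR}" && touch "${COMPLETE_DIR}/$1.complete"; }

install_driver() {
  is_complete driver && return 0   # skip the expensive work on re-run
  echo "installing driver..."      # the expensive work would happen here
  mark_complete driver
}
```

Using a file rather than a shell variable means the state survives across separate invocations of the script while testing by hand.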

gpu/manual-test-runner.sh

  • Order commands correctly
  • Point to origin rather than the staging repo

gpu/test_gpu.py

  • Clearer test-skipping logic

@cjac cjac self-assigned this Dec 13, 2024

cjac commented Dec 13, 2024

/gcbrun

1 similar comment

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

1 similar comment


cjac commented Dec 14, 2024

Tests took 43 minutes to run this time. I think it's because the binary driver build cache was invalidated. Let's see if this one takes more like 14 minutes.


cjac commented Dec 14, 2024

The intention is for this PR to resolve issue #1268.

changed_dir="${changed_dir%%/*}/"
# Run all tests if common directories modified
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
continue
Contributor Author

must remove this before merging


cjac commented Dec 14, 2024

Slightly longer than the 14 minutes I predicted: this run completed in 16:29.


cjac commented Dec 14, 2024

/gcbrun

10 similar comments

cjac commented Dec 15, 2024

okay, taking a little break...

changed_dir="${changed_dir%%/*}/"
# Run all tests if common directories modified
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
continue # to be removed before merge
Collaborator

this can be removed

Contributor Author

Thanks, yes. I will remove it once bcheena or cnauroth have had an opportunity to suggest changes, and those changes, if any, are implemented and tested.


cjac commented Jan 28, 2025

I met with the customer to exercise the cluster, and they asked for one additional change. If the correct metadata key/value pair is passed, we now install PyTorch and register a new kernel for use with JupyterLab.
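
A minimal sketch of such a metadata gate might look like the following; the accepted truthy spellings and the `install-pytorch` attribute name in the comment are assumptions for illustration, not the actual keys used by the installer:

```shell
# Hypothetical sketch: normalize a metadata attribute value to a boolean
# before deciding whether to run an optional install step.
is_true() {
  local v
  v="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "${v}" in
    true|1|yes|y) return 0 ;;
    *)            return 1 ;;
  esac
}

# In the real script the value would come from instance metadata, e.g.
# (illustrative attribute name):
#   if is_true "$(get_metadata_value attributes/install-pytorch)"; then ...
```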


cjac commented Jan 28, 2025

/gcbrun

cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 29, 2025

cjac commented Jan 29, 2025

/gcbrun

Member

@cnauroth cnauroth left a comment

Thanks for this, @cjac! I entered a few comments. In general, can we also review the curl calls to make sure they have the necessary retries?
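
For reference, consistent retry behavior across all downloads can be sketched by centralizing the flags in one place; the specific flag values below are illustrative, not necessarily what the script settled on:

```shell
# Hypothetical sketch: one shared set of retry flags for every curl call,
# so no download path is left without retries. Values are illustrative.
CURL_RETRY_ARGS=( --fail --silent --show-error --location
                  --retry 5 --retry-delay 2 --connect-timeout 10 )

fetch() {
  local url="$1" dest="$2"
  curl "${CURL_RETRY_ARGS[@]}" --output "${dest}" "${url}"
}
```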

Contributor Author

@cjac cjac left a comment

Thank you for the review, Chris and Prince. I will make these changes and resolve each.



cjac commented Feb 2, 2025

/gcbrun


cjac commented Feb 3, 2025

@cnauroth - I believe that I addressed all of your concerns in the previous change, eacb99f.

Please let me know if you see anything that needs further attention.

# Minimum supported version for open kernel driver is 515.43.04
# https://github.com/NVIDIA/open-gpu-kernel-modules/tags
latest="$(curl -s https://us.download.nvidia.com/XFree86/Linux-x86_64/latest.txt | awk '{print $1}')"
#latest="$(curl -s https://us.download.nvidia.com/XFree86/Linux-x86_64/latest.txt | awk '{print $1}')"
Member

Remove this line if latest is no longer being used?

while gsutil ls "${gcs_tarball}.building" 2>&1 | grep -q "${gcs_tarball}.building" ; do
sleep 5m
done
if gsutil ls -j "${gcs_tarball}.building" > "${local_tarball}.building.json" ; then
Member

Is -j a recent new feature? I couldn't find this option in my gsutil.

Contributor Author

It dumps the object metadata in JSON format; I thought it would be --format json to match other gcloud commands, but I guess that it's slightly different because of gsutil. The argument is documented in gcloud storage ls --help, which I assumed used the same interface as gsutil, but I don't see the argument in the gsutil ls --help documentation.

Contributor Author

As for whether it's new, I don't think it is. I looked through the release history [1], and it seems that JSON support has been included from the beginning, but I couldn't find the argument-parsing code where the option is detected on a quick review of [2].

[1] https://github.com/GoogleCloudPlatform/gsutil/blob/master/CHANGES.md
[2] https://github.com/GoogleCloudPlatform/gsutil

Contributor Author

Oh, I see that I was mistaken! I'll change that to gcloud storage ls -j.

$ gsutil ls -j gs://${BUCKET}/dataproc-initialization-actions/gpu/install_gpu_driver.sh
CommandException: Incorrect option(s) specified. Usage:

  gsutil ls [-a] [-b] [-d] [-l] [-L] [-r] [-p proj_id] url...

For additional help run:
  gsutil help ls
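
The `.building` polling loop quoted earlier in this thread can be expressed generically. Here is a rough sketch with the probe command injected as a parameter — in the real script the probe is a `gsutil ls` against the `.building` marker, and the retry count and interval defaults below are illustrative:

```shell
# Hypothetical sketch: wait until a "still building" marker disappears,
# probing with an injected command. Defaults are illustrative.
wait_for_marker_release() {
  local probe_cmd="$1" max_tries="${2:-60}" interval="${3:-5}"
  local i
  for (( i = 0; i < max_tries; i++ )); do
    if ! eval "${probe_cmd}" >/dev/null 2>&1; then
      return 0                      # marker gone; safe to proceed
    fi
    sleep "${interval}"
  done
  return 1                          # gave up; another worker may be stuck
}
```

Bounding the number of probes avoids the unbounded `while ... sleep 5m` wait if a peer worker dies mid-build and never removes its marker.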


cjac commented Feb 3, 2025

/gcbrun

@cjac cjac force-pushed the gpu-20241212 branch 2 times, most recently from 8944d62 to 2bf97fc on February 4, 2025 22:09
gpu/install_gpu_driver.sh:
* use the same retry arguments in all calls to curl
* correct 12.3's driver and sub-version
* improve logic for pause as other workers perform build
* remove call to undefined clear_nvsmi_cache
* move closing "fi" to line of its own
* added comments for unclear logic
* removed commented code
* remove unused curl for latest driver version

gpu/test_gpu.py:
* removed excess test
* added comment about numa node selection
* removed skips of rocky9; the 2.2.44-rocky9 build succeeds

cjac commented Feb 4, 2025

/gcbrun


cjac commented Feb 4, 2025

2.2.44-rocky9 is able to build the kernel driver, so I've removed the test skip for rocky9.

Member

@cnauroth cnauroth left a comment

@cjac , that addresses all of my feedback. Thank you very much!

LGTM, pending a successful CI run on the latest changes.

@prince-cs and @kuldeepkk-dev , please let us know if you had any other feedback.

@cjac cjac merged commit 87e2f91 into GoogleCloudDataproc:master Feb 6, 2025
1 of 2 checks passed
cjac added a commit that referenced this pull request Feb 10, 2025
cjac added a commit that referenced this pull request Feb 10, 2025
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Feb 12, 2025
Roll forward GoogleCloudDataproc#1275

gpu/install_gpu_driver.sh
  * updated supported versions
  * moved all code into functions, which are called at the footer of
    the installer
  * install cuda and driver exclusively from run files
  * extract cuda and driver version from urls if supplied
  * support supplying cuda version as x.y.z instead of just x.y
  * build nccl from source
  * poll dpkg lock status for up to 60 seconds
  * cache build artifacts from kernel driver and nccl
  * use consistent arguments to curl
  * create is_complete and mark_complete functions to allow re-running
  * Tested more CUDA minor versions
  * Printing warnings when combination provided is known to fail
  * only install build dependencies on build cache miss
  * added optional pytorch install option
  * renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
  * verified that proprietary kernel drivers work with older dataproc images
  * clear dkms key immediately after use
  * cache .run files to GCS to reduce fetches from origin
  * Install nvidia container toolkit and select container runtime
  * tested installer on clusters without GPUs attached
  * fixed a problem with the ops agent not installing by using a venv
  * Older CapacityScheduler does not permit use of gpu resources ;
    switch to FairScheduler on 2.0 and below
  * caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
  * setting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
  * Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
  * Hold all NVIDIA-related packages from upgrading unintentionally
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
  * harden sshd config
  * install spark rapids acceleration libraries

gpu/manual-test-runner.sh
  * order commands correctly

gpu/run-bazel-tests.sh
  * do not retry flaky tests

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark
  * remove skip of rocky9 tests
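
The "poll dpkg lock status" item in the commit message above could be sketched with flock; the lock path in the comment and the timeout handling are illustrative, not the script's exact implementation:

```shell
# Hypothetical sketch: wait up to a deadline for an exclusive lock to be
# free, analogous to polling /var/lib/dpkg/lock before running apt/dpkg.
wait_for_lock() {
  local lock_file="$1" timeout="${2:-60}" waited=0
  until flock --nonblock --exclusive "${lock_file}" true 2>/dev/null; do
    if [ "${waited}" -ge "${timeout}" ]; then
      return 1                      # still held after the deadline
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}
```

Probing with a non-blocking flock of a trivial command (`true`) tests whether the lock is free without holding it any longer than necessary.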
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Feb 12, 2025
cjac added a commit that referenced this pull request Feb 13, 2025
```
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

* [gpu] strict driver and cuda version assignment

Roll forward #1275

gpu/install_gpu_driver.sh
  * updated supported versions
  * moved all code into functions, which are called at the footer of
    the installer
  * install cuda and driver exclusively from run files
  * extract cuda and driver version from urls if supplied
  * support supplying cuda version as x.y.z instead of just x.y
  * build nccl from source
  * poll dpkg lock status for up to 60 seconds
  * cache build artifacts from kernel driver and nccl
  * use consistent arguments to curl
  * create is_complete and mark_complete functions to allow re-running
  * Tested more CUDA minor versions
  * Printing warnings when combination provided is known to fail
  * only install build dependencies on build cache miss
  * added optional pytorch install option
  * renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
  * verified that proprietary kernel drivers work with older dataproc images
  * clear dkms key immediately after use
  * cache .run files to GCS to reduce fetches from origin
  * Install nvidia container toolkit and select container runtime
  * tested installer on clusters without GPUs attached
  * fixed a problem with the ops agent not installing by using a venv
  * Older CapacityScheduler does not permit use of gpu resources ;
    switch to FairScheduler on 2.0 and below
  * caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
  * setting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
  * Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
  * Hold all NVIDIA-related packages from upgrading unintentionally
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
  * harden sshd config
  * install spark rapids acceleration libraries

gpu/manual-test-runner.sh
  * order commands correctly

gpu/run-bazel-tests.sh
  * do not retry flaky tests

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark
  * remove skip of rocky9 tests

* Correct test failures on 2.0-debian10

gpu/install_gpu_driver.sh

* Do not use fair scheduler for 2.0 clusters
* comment out spark-defaults.conf config options as guidance for tuning

gpu/test_gpu.py

* There are now three tests run from the verify_instance_spark function
  * Run the SparkPi example with no parameters specified
  * Run the JavaIndexToStringExample with many parameters specified
  * Run the JavaIndexToStringExample with few parameters specified

-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEEWBh4gudL5t7O9mieFuBp2E4LHyAFAmeuX5wACgkQFuBp2E4L
HyBnxQv/fWnbrBx0NuZQJGJt8qfuja5zSbmZL2XdgqLEkzv+y78jrIWX4wQYVDni
Hy5aN8HIRttslitNj+f4et0XSpxRFSvwJ/JZ362RMCUVUrNG/W6p+haIzPkzJz2+
0SgAaAE8JL8NOjPgCqLD7ZnaHBsA8ZPq9lXJkktkzdxzo6+jCoPY8GHELg5Cfm2e
x8mzKMwgRWIOPiW3kzvxIEJCdkQ+oM+18TyWfdal/QKDNvNTepVHeSCwzgrUEq6y
lXv+DsAfI8s8zLp1WQQt5fV+eLO66ey98RIpGedLlKhuOaTAVOyq+6ZrPva2RQEd
2QTEYWRyRynST+Cy/fLST/rZhRKoA4U0WLEru2XtIXuGU6UIdZT4ob2VEk25hxaH
FHpi3zoHzK28sx6v7qM7DuGYgyUwhL+mVddWXdwIvPvDXbJsf2ATFCCqGbCOjLYA
WXKeGg69BrERvjbQqVcpppyy2mw+CMBEPLGix7VwmVnJdU1zIqp0vvmqUD66D3gY
McpSvQyd
=llel
-----END PGP SIGNATURE-----
```


Development

Successfully merging this pull request may close these issues.

[gpu] versions installed by gpu/install_gpu_driver.sh do not match requested versions

6 participants