
Conversation

@SurajAralihalli (Contributor) commented Sep 24, 2025

Resolves #1358

Updated the installation process for the NVIDIA GPU driver on Rocky Linux. The script now installs kernel development packages directly and downloads the CUDA installer run file, executing it in silent mode. The installer file is removed post-installation to clean up. This change simplifies the installation steps and ensures the correct driver version is used.
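
In outline, the new Rocky Linux path looks roughly like the following; the CUDA version and installer URL are illustrative placeholders, not the exact values used in spark-rapids.sh:

```bash
# Sketch of the refactored flow described above; version and URL are
# illustrative placeholders, not the exact values from spark-rapids.sh.
dnf install -y "kernel-devel-$(uname -r)" "kernel-headers-$(uname -r)"

# Download the CUDA installer run file.
curl -fsSL -o driver.run \
  "https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run"

# Execute the installer in silent (non-interactive) mode.
bash driver.run --silent

# Remove the installer file post-installation to clean up.
rm driver.run
```
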
@gemini-code-assist (Contributor)

Summary of Changes

Hello @SurajAralihalli, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the NVIDIA GPU driver installation logic for Rocky Linux within the spark-rapids.sh script. The change aims to provide a more robust and direct installation experience by utilizing the official CUDA installer run file, thereby simplifying the dependency management and ensuring a consistent driver setup.

Highlights

  • NVIDIA Driver Installation Refactor: The method for installing NVIDIA GPU drivers on Rocky Linux has been completely revamped to use a direct approach.
  • Direct CUDA Installer Usage: The script now directly downloads and executes the official CUDA installer run file in silent mode, moving away from dnf module installations for the driver.
  • Targeted Kernel Package Installation: Kernel development packages are now installed specifically for the running kernel version using dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r).
  • Cleanup of Installer File: The downloaded CUDA installer run file is automatically removed after successful installation to maintain a clean system.
  • Simplified Dependency Management: Removed several dnf commands related to repository configuration, DKMS, and general package updates, streamlining the installation process.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the NVIDIA driver installation for Rocky Linux to use a .run file instead of the dnf package manager. This is a good simplification. My review includes suggestions to improve the robustness of the download and installation steps by adding error handling, retries, and more explicit installer options to prevent unintended side effects. I've also suggested a minor improvement for file removal to make it safer.

```bash
bash driver.run --silent

# Remove the installer file after installation to clean up
rm driver.run
```
Severity: medium

Using rm driver.run could cause the script to fail if driver.run was not downloaded successfully. It's safer to use rm -f driver.run to prevent an error if the file does not exist.

Suggested change:

```diff
- rm driver.run
+ rm -f driver.run
```

@cjac (Contributor) commented Sep 24, 2025

gcbrun

@cjac (Contributor) commented Sep 24, 2025

/gcbrun

1 similar comment
@cjac (Contributor) commented Sep 25, 2025

/gcbrun

@SurajAralihalli (Contributor, Author)

Thanks @cjac, is this PR ready to be merged?

@cjac (Contributor) commented Oct 1, 2025

It appears that the rocky 8 tests were disabled. I'm going to re-enable them and test your change in my dev environment.

@cjac (Contributor) commented Oct 1, 2025

/gcbrun

@cjac (Contributor) commented Oct 1, 2025

The good news is that 2.1-rocky8 tests pass. The 2.2-rocky9 tests, however, are failing and causing the check to fail.

@cjac (Contributor) commented Oct 1, 2025

I found the problem with rocky9. It was the one I thought might be to blame. I'm exercising the change on rocky 9 and if it resolves the issue in my repro environment, I'll commit and push.

@cjac (Contributor) commented Oct 1, 2025

In addition to the move of the kernel packages to the vault, 2.4.1 will not compile against the kernel in Rocky 9. I'm not sure whether you want to:

  • disable support on the 2.0 images
  • disable support on the 2.2-rocky9 images
  • select the CUDA and driver versions based on the dataproc version

My recommendation would be to use gpu/install_gpu_driver.sh instead, since it already handles all of these issues, and it would reduce the maintenance load.
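
For the third option, a rough illustration of what version-based selection could look like; the metadata helper is the one available on Dataproc images, but the version parsing and the CUDA/driver pairings below are assumptions, not values from this repository:

```bash
# Illustrative only: derive the Dataproc image version from instance
# metadata (image names look like ".../dataproc-2-1-...") and pick a
# CUDA/driver pair. The mapping below is an assumption, not verified.
IMAGE="$(/usr/share/google/get_metadata_value image)"
DATAPROC_IMAGE_VERSION="$(echo "${IMAGE}" | sed -n 's/.*dataproc-\([0-9]\+\)-\([0-9]\+\).*/\1.\2/p')"

case "${DATAPROC_IMAGE_VERSION}" in
  2.0) CUDA_VERSION=11.8 ; DRIVER_VERSION=520.61.05 ;;
  2.1) CUDA_VERSION=12.4 ; DRIVER_VERSION=550.54.15 ;;
  2.2) CUDA_VERSION=12.6 ; DRIVER_VERSION=560.35.03 ;;
  *)   echo "Unsupported Dataproc image version: ${DATAPROC_IMAGE_VERSION}" >&2 ; exit 1 ;;
esac
```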

@SurajAralihalli (Contributor, Author) commented Oct 1, 2025

> • select the CUDA and driver versions based on the dataproc version

Can you please explain how to choose the appropriate CUDA and driver versions depending on the Dataproc version being used?

> My recommendation would be to use gpu/install_gpu_driver.sh instead, since it already handles all of these issues, and it would reduce the maintenance load.

I agree, but the MIG issue is blocking us from doing that: #1269 (comment)

Update:
If MIG support is disabled, can we invoke gpu/install_gpu_driver.sh directly from spark-rapids.sh so users only need a single init script, instead of two separate ones?
I also want to avoid duplicating the logic from gpu/install_gpu_driver.sh into spark-rapids.sh, which could otherwise lead to inconsistencies.

@cjac (Contributor) commented Oct 2, 2025

>> • select the CUDA and driver versions based on the dataproc version
>
> Can you please explain how to choose the appropriate CUDA and driver versions depending on the Dataproc version being used?
>
>> My recommendation would be to use gpu/install_gpu_driver.sh instead, since it already handles all of these issues, and it would reduce the maintenance load.
>
> I agree, but the MIG issue is blocking us from doing that: #1269 (comment)
>
> Update: If MIG support is disabled, can we invoke gpu/install_gpu_driver.sh directly from spark-rapids.sh so users only need a single init script, instead of two separate ones? I also want to avoid duplicating the logic from gpu/install_gpu_driver.sh into spark-rapids.sh, which could otherwise lead to inconsistencies.

You're right. I need to make it reboot-safe. What do you think about changing the test suite so that, instead of a fire-and-forget approach to the initialization action, we start a screen session which executes the main script and installs a "nanny" systemd service to poll every N seconds? If it finds that the success criteria have not been met, but there is not presently a screen session running the installer, it runs the script in a screen session. That way, the process becomes immune to some environmental factors which might terminate its connection to the controlling terminal of the parent process.

It would also mean that the script log would not always be printed to /var/log/dataproc-initialization-script-0.log; this might break the ABI of some use cases, for example when clusters which have not completed initialization re-run their startup script on boot. I will investigate and let you know as I learn more.
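
For illustration, the nanny's poll logic might look something like the following; the marker file, session name, and script path are hypothetical placeholders, not names from this repository:

```bash
#!/usr/bin/env bash
# Hypothetical nanny poll script, run from a systemd timer every N
# seconds. All paths and names below are illustrative placeholders.
set -euo pipefail

MARKER=/var/run/init-action.done       # assumed success criterion
SESSION=init-action                    # assumed screen session name
SCRIPT=/usr/local/bin/init-action.sh   # assumed installer entry point

# Success criteria already met: nothing to do.
[[ -f "${MARKER}" ]] && exit 0

# An installer screen session is already running: leave it alone.
screen -ls | grep -q "[.]${SESSION}[[:space:]]" && exit 0

# Otherwise, (re)start the installer in a detached screen session.
screen -dmS "${SESSION}" "${SCRIPT}"
```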

This commit integrates changes to enable the spark-rapids initialization action on Dataproc 2.1-rocky8 images.

- Updates the NVIDIA driver installation process in `spark-rapids.sh` for Rocky Linux:
  - Uses `curl` with retry and fail-fast options for downloading the CUDA installer.
  - Executes the NVIDIA installer with `--silent --driver --toolkit --no-opengl-libs` flags and wraps it in `execute_with_retries`.

- Modifies `test_spark_rapids.py` to enable tests for Rocky Linux on Dataproc 2.1 and below, while keeping them skipped for 2.2+ (Rocky 9).

This resolves the installation issues on Rocky 8. Further work is required to support Rocky 9 (Dataproc 2.2).
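
Concretely, the hardened steps described in the commit message might look roughly like this; the URL is a placeholder, and while execute_with_retries is named in the commit, its exact invocation form here is an assumption:

```bash
# Fail fast (-f) and retry transient failures while downloading the CUDA
# installer run file; the URL below is a placeholder, not the real one.
curl -fsSL --retry 10 --retry-delay 10 -o driver.run \
  "https://developer.download.nvidia.com/compute/cuda/<version>/local_installers/<installer>_linux.run"

# Install driver and toolkit non-interactively; --no-opengl-libs avoids
# clobbering the system's OpenGL stack. Wrapped in the retry helper.
execute_with_retries bash driver.run --silent --driver --toolkit --no-opengl-libs

# Safe removal even if the download failed (per review feedback).
rm -f driver.run
```
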
@cjac (Contributor) commented Oct 2, 2025

/gcbrun

@cjac (Contributor) commented Oct 2, 2025

This change makes 2.1-rocky8 functional. 2.2-rocky9 will take quite a bit more effort, and I think it would be better to resolve the issues with install_gpu_driver.sh instead of putting that effort into a script we intend to replace moving forward.

cjac self-requested a review on October 2, 2025, 21:59
@cjac (Contributor) left a comment

This is sufficient to get 2.1-rocky8 up and running.

For 2.2-rocky9, we should build from GitHub.

cjac merged commit 17b1f6e into GoogleCloudDataproc:main on Oct 2, 2025. 2 checks passed.

cjac changed the title from "[spark-rapids.sh] Refactor NVIDIA driver installation for Rocky Linux to use run file" to "[spark-rapids.sh] Refactor NVIDIA driver installation for Rocky Linux 8 to use run file" on Oct 2, 2025.
@SurajAralihalli (Contributor, Author)

> This change makes 2.1-rocky8 functional. 2.2-rocky9 will take quite a bit more effort, and I think it would be better to resolve the issues with install_gpu_driver.sh instead of putting that effort into a script we intend to replace moving forward.

Thanks @cjac for working on this! I'm open to switching spark-rapids.sh to the install_gpu_driver.sh approach. Is it possible to read and invoke the existing install_gpu_driver.sh from within spark-rapids.sh? I'd like to avoid duplicating the CUDA/driver installation logic in spark-rapids.sh, since that creates maintenance challenges. It would be easier for the user to work with a single init script rather than having separate ones for Spark and for CUDA/driver installation.

I'm happy to create an issue/ticket for us to track.



Development

Successfully merging this pull request may close these issues.

Error installing Nvidia Drivers during spark-rapids.sh initialization on Dataproc 2.1-rocky8 cluster
