-
Notifications
You must be signed in to change notification settings - Fork 517
[spark-rapids.sh] Refactor NVIDIA driver installation for Rocky Linux 8 to use run file #1359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Updated the installation process for the NVIDIA GPU driver on Rocky Linux. The script now installs kernel development packages directly and downloads the CUDA installer run file, executing it in silent mode. The installer file is removed post-installation to clean up. This change simplifies the installation steps and ensures the correct driver version is used.
Summary of ChangesHello @SurajAralihalli, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the NVIDIA GPU driver installation logic for Rocky Linux within the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the NVIDIA driver installation for Rocky Linux to use a .run file instead of the dnf package manager. This is a good simplification. My review includes suggestions to improve the robustness of the download and installation steps by adding error handling, retries, and more explicit installer options to prevent unintended side effects. I've also suggested a minor improvement for file removal to make it safer.
| bash driver.run --silent | ||
|
|
||
| # Remove the installer file after installation to clean up | ||
| rm driver.run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
gcbrun |
|
/gcbrun |
1 similar comment
|
/gcbrun |
|
Thanks @cjac, is this PR ready to be merged? |
|
It appears that the rocky 8 tests were disabled. I'm going to re-enable them and test your change in my dev environment. |
|
/gcbrun |
|
The good news is that 2.1-rocky8 tests pass. The 2.2-rocky9 tests, however, are failing and causing the check to fail. |
|
I found the problem with rocky9. It was the one I thought might be to blame. I'm exercising the change on rocky 9 and if it resolves the issue in my repro environment, I'll commit and push. |
|
in addition to the move of the kernel packages to the vault, 2.4.1 will not compile against the kernel in Rocky 9. I'm not sure whether you want to:
My recommendation would be to use gpu/install_gpu_driver.sh instead, since it already handles all of these issues, and it would reduce the maintenance load. |
Can you please explain how to choose the appropriate CUDA and driver versions depending on the Dataproc version being used?
I agree, but the MIG issue is blocking us from doing that #1269 (comment) Update: |
You're right. I need to make it reboot safe. What do you think about changing the test suite such that instead of a fire-and-forget approach to the initialization action, we start a screen session which executes the main script and installs a "nanny" systemd service to poll every N seconds. If it finds that the success criteria have not been met, but there is not presently a screen session running the installer, it runs the script in a screen session. That way, the process becomes immune from some environmental factors which might terminate its connection to the controlling terminal of the parent process. It would also mean that the script log would not always be printing to /var/log/dataproc-initialization-script-0.log ; this might break the ABI of some use cases. Perhaps when clusters which have not completed initialization re-run their startup script on boot. I will investigate and let you know as I know more. |
This commit integrates changes to enable the spark-rapids initialization action on Dataproc 2.1-rocky8 images. - Updates the NVIDIA driver installation process in `spark-rapids.sh` for Rocky Linux: - Uses `curl` with retry and fail-fast options for downloading the CUDA installer. - Executes the NVIDIA installer with `--silent --driver --toolkit --no-opengl-libs` flags and wraps it in `execute_with_retries`. - Modifies `test_spark_rapids.py` to enable tests for Rocky Linux on Dataproc 2.1 and below, while keeping them skipped for 2.2+ (Rocky 9). This resolves the installation issues on Rocky 8. Further work is required to support Rocky 9 (Dataproc 2.2).
|
/gcbrun |
|
This change makes 2.1-rocky8 functional. 2.2-rocky9 will take quite a bit more effort, and I think it would be better to resolve the issues with install_gpu_driver.sh instead of putting that effort into a script we intend to replace moving forward. |
cjac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is sufficient to get 2.1-rocky8 up and running.
For 2.2-rocky9, we should build from github.
Thanks @cjac for working on this! I’m open to switching I'm happy to create an issue/ticket for us to track. |
Resolves #1358
Updated the installation process for the NVIDIA GPU driver on Rocky Linux. The script now installs kernel development packages directly and downloads the CUDA installer run file, executing it in silent mode. The installer file is removed post-installation to clean up.