Commit 29a0df9

[gpu] strict driver and cuda version assignment
Roll forward GoogleCloudDataproc#1275

gpu/install_gpu_driver.sh
* updated supported versions
* moved all code into functions, which are called at the footer of the installer
* install cuda and driver exclusively from run files
* extract cuda and driver version from urls if supplied
* support supplying cuda version as x.y.z instead of just x.y
* build nccl from source
* poll dpkg lock status for up to 60 seconds
* cache build artifacts from kernel driver and nccl
* use consistent arguments to curl
* create is_complete and mark_complete functions to allow re-running
* tested more CUDA minor versions
* print warnings when the combination provided is known to fail
* only install build dependencies on build cache miss
* added optional pytorch install option
* renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
* verified that proprietary kernel drivers work with older dataproc images
* clear dkms key immediately after use
* cache .run files to GCS to reduce fetches from origin
* install nvidia container toolkit and select container runtime
* tested installer on clusters without GPUs attached
* fixed a problem with ops agent not installing; using venv
* older CapacityScheduler does not permit use of gpu resources; switch to FairScheduler on 2.0 and below
* cache result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
* set some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
* install gcc-12 on ubuntu22 to fix kernel driver FTBFS
* hold all NVIDIA-related packages from upgrading unintentionally
* skip proxy setup if http-proxy metadata not set
* added function to check secure-boot and os version compatibility
* harden sshd config
* install spark rapids acceleration libraries

gpu/manual-test-runner.sh
* order commands correctly

gpu/run-bazel-tests.sh
* do not retry flaky tests

gpu/test_gpu.py
* clearer test skipping logic
* added instructions on how to test pyspark
* remove skip of rocky9 tests
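
The is_complete and mark_complete functions listed for gpu/install_gpu_driver.sh are not shown on this page. As a rough illustration of how such helpers can make an installer safe to re-run, here is a minimal sketch using sentinel files. The STATE_DIR path, the `.complete` suffix, and the install_example_phase function are assumptions made up for this example, not names taken from the commit.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of re-run guards; NOT the code from install_gpu_driver.sh.
set -euo pipefail

# Assumed location for sentinel files (overridable through the environment).
STATE_DIR="${STATE_DIR:-/tmp/gpu-installer-state}"

# Return 0 if the named phase has already been marked complete.
function is_complete() {
  local phase="$1"
  [[ -f "${STATE_DIR}/${phase}.complete" ]]
}

# Record that the named phase finished successfully.
function mark_complete() {
  local phase="$1"
  mkdir -p "${STATE_DIR}"
  touch "${STATE_DIR}/${phase}.complete"
}

# Example phase: the real work runs only when no sentinel file exists.
function install_example_phase() {
  if is_complete example-phase; then
    echo "example-phase: already complete, skipping"
    return 0
  fi
  echo "example-phase: running"
  mark_complete example-phase
}

# Called twice to show that the second call is a no-op.
install_example_phase
install_example_phase
```

With an empty state directory, the first call does the work and writes the sentinel, and the second call skips. Marking a phase complete only after its last command succeeds means a phase that fails partway is retried on the next run.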
1 parent 292d67f commit 29a0df9

6 files changed, +1527 -511 lines changed

gpu/Dockerfile

Lines changed: 10 additions & 4 deletions
@@ -15,19 +15,25 @@ RUN apt-get -qq update \
   curl jq less screen > /dev/null 2>&1 && apt-get clean

 # Install bazel signing key, repo and package
-ENV bazel_kr_path=/usr/share/keyrings/bazel-release.pub.gpg
-ENV bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8"
+ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg \
+    bazel_version=7.4.0 \
+    bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8" \
+    DEBIAN_FRONTEND=noninteractive

 RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg \
   | gpg --dearmor -o "${bazel_kr_path}" \
   && echo "deb [arch=amd64 signed-by=${bazel_kr_path}] ${bazel_repo_data}" \
   | dd of=/etc/apt/sources.list.d/bazel.list status=none \
   && apt-get update -qq

-RUN apt-get autoremove -y -qq && \
-    apt-get install -y -qq default-jdk python3-setuptools bazel > /dev/null 2>&1 && \
+RUN apt-get autoremove -y -qq > /dev/null 2>&1 && \
+    apt-get install -y -qq default-jdk python3-setuptools bazel-${bazel_version} > /dev/null 2>&1 && \
     apt-get clean

+# Set bazel-${bazel_version} as the default bazel alternative in this container
+RUN update-alternatives --install /usr/bin/bazel bazel /usr/bin/bazel-${bazel_version} 1 && \
+    update-alternatives --set bazel /usr/bin/bazel-${bazel_version}
+
 # Install here any utilities you find useful when troubleshooting
 RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-get clean
