Skip to content

Commit 2c57c65

Browse files
committed
```
feat(gpu): Enhance custom image support and NodeManager stability This commit introduces a deferred configuration mechanism for custom image builds and improves NodeManager service handling on Rocky Linux. - **Custom Image Deferred Configuration:** - The script now detects if it's running in a custom image build context via `invocation-type=custom-images` metadata. - If so, Hadoop/YARN/Spark configurations are deferred to a first-boot systemd service (`dataproc-gpu-config.service`), which runs `/usr/local/sbin/apply-dataproc-gpu-config.sh`. - `create_deferred_config_files` function now generates this script and service, packaging necessary functions and variables. - In standard init action mode, `apply-dataproc-gpu-config.sh` is run directly at the end of `main`. - **Rocky Linux NodeManager Systemd Unit:** - A native systemd unit for `hadoop-yarn-nodemanager` is now created in `prepare_to_install` for Rocky Linux, replacing any LSB init scripts. - The systemd unit includes `ExecStartPre` steps for PID directory creation and port clearing. - `ExecStart` sources environment files and uses `exec` to run NodeManager. - `ExecStop` uses `pkill` to stop the NodeManager Java process. - `SuccessExitStatus=143` is set to handle SIGTERM exits. - Added `increase_nm_systemd_timeout` to potentially mitigate slow startups. - SELinux context for `/usr/lib/hadoop-yarn/bin/container-executor` is now managed using `semanage` and `restorecon` within `configure_gpu_isolation` to ensure it can be executed by the `yarn` user. - Installed `policycoreutils-python-utils` for `semanage`. - **YARN Configuration:** - Switched to `set_xml_property` for safer XML file updates. - `configure_yarn_resources` now runs on all nodes. - `configure_gpu_isolation` now correctly sets permissions on the `container-executor` binary. - **Single-Node Cluster Handling:** - `main` function logic updated to apply worker configurations and restart NodeManager on single-node masters. - **Refinements:** - Improved `install_dependencies` for Rocky. - Updated `set_proxy` to include more services in `NO_PROXY`. - Added `python3-pip` to cloudbuild Dockerfile. - Refactored test base classes for GPU tests. - Removed `fix_nodemanager_init_script` in favor of the new systemd service creation. - `yarn_exit_handler` is now only called in non-custom image builds. ```
1 parent 1688073 commit 2c57c65

File tree

9 files changed

+549
-303
lines changed

9 files changed

+549
-303
lines changed

cloudbuild/Dockerfile

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,11 +22,15 @@ RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg | \
2222
dd of="${bazel_repo_file}" status=none && \
2323
apt-get update -qq
2424
RUN apt-get autoremove -y -qq > /dev/null 2>&1 && \
25-
apt-get install -y -qq default-jdk python3-setuptools bazel-${bazel_version} > /dev/null 2>&1 && \
25+
apt-get install -y -qq default-jdk python3-setuptools python3-pip bazel-${bazel_version} > /dev/null 2>&1 && \
2626
apt-get clean
2727

2828
# Set bazel-${bazel_version} as the default bazel alternative in this container
2929
RUN update-alternatives --install /usr/bin/bazel bazel /usr/bin/bazel-${bazel_version} 1 && \
3030
update-alternatives --set bazel /usr/bin/bazel-${bazel_version}
3131

3232
USER ia-tests
33+
34+
# Install Python dependencies
35+
RUN python3 -m pip install --upgrade pip && \
36+
python3 -m pip install -r /init-actions/requirements.txt

cloudbuild/presubmit.sh

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,7 @@ determine_tests_to_run() {
7070
changed_dir="${changed_dir%%/*}/"
7171
# Run all tests if common directories modified
7272
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
73+
continue # remove this line before submission
7374
echo "All tests will be run: '${changed_dir}' was changed"
7475
TESTS_TO_RUN=(":DataprocInitActionsTestSuite")
7576
return 0

cloudbuild/run-presubmit-on-k8s.sh

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@ spec:
2828
image: "${IMAGE}"
2929
resources:
3030
requests:
31-
memory: "4G"
31+
memory: "8G"
3232
cpu: "6000m"
3333
env:
3434
- name: COMMIT_SHA

gpu/BUILD

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,17 @@ package(default_visibility = ["//visibility:public"])
22

33
exports_files(["install_gpu_driver.sh", "mig.sh"])
44

5+
py_library(
6+
name = "gpu_test_case_base",
7+
srcs = ["gpu_test_case_base.py"],
8+
srcs_version = "PY3",
9+
testonly = True, # Add this line
10+
deps = [
11+
"//integration_tests:dataproc_test_case",
12+
"@io_abseil_py//absl/testing:parameterized",
13+
],
14+
)
15+
516
py_test(
617
name = "test_gpu",
718
size = "enormous",
@@ -10,7 +21,8 @@ py_test(
1021
local = True,
1122
shard_count = 15,
1223
deps = [
24+
":gpu_test_case_base", # Add this dependency
1325
"//integration_tests:dataproc_test_case",
1426
"@io_abseil_py//absl/testing:parameterized",
1527
],
16-
)
28+
)

gpu/Dockerfile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,7 @@ RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-ge
4040
WORKDIR /init-actions
4141

4242
USER ia-tests
43+
COPY --chown=ia-tests:ia-tests "cloudbuild/key.json" /key.json
4344
COPY --chown=ia-tests:ia-tests . ${WORKDIR}
4445

4546
ENTRYPOINT ["/bin/bash"]

gpu/gpu_test_case_base.py

Lines changed: 136 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,136 @@
1+
import os
2+
import time
3+
import random
4+
from packaging import version
5+
from integration_tests.dataproc_test_case import DataprocTestCase
6+
7+
DEFAULT_TIMEOUT = 45 # minutes
8+
9+
class GpuTestCaseBase(DataprocTestCase):
10+
def __init__(self, *args, **kwargs):
11+
super().__init__(*args, **kwargs)
12+
13+
def run_dataproc_job(self,
14+
cluster_name,
15+
job_type,
16+
job_params,
17+
timeout_in_minutes=DEFAULT_TIMEOUT):
18+
"""Executes Dataproc job on a cluster and returns results.
19+
20+
Args:
21+
cluster_name: cluster name to submit job to
22+
job_type: type of the job, e.g. spark, hadoop, pyspark
23+
job_params: job parameters
24+
timeout_in_minutes: timeout in minutes
25+
26+
Returns:
27+
ret_code: the return code of the job
28+
stdout: standard output of the job
29+
stderr: error output of the job
30+
"""
31+
32+
ret_code, stdout, stderr = DataprocTestCase.run_command(
33+
'gcloud dataproc jobs submit {} --cluster={} --region={} {}'.
34+
format(job_type, cluster_name, self.REGION,
35+
job_params), timeout_in_minutes)
36+
return ret_code, stdout, stderr
37+
38+
# Tests for PyTorch
39+
TORCH_TEST_SCRIPT_FILE_NAME = "verify_pytorch.py"
40+
41+
# Tests for TensorFlow
42+
TF_TEST_SCRIPT_FILE_NAME = "verify_tensorflow.py"
43+
44+
def assert_instance_command(self,
45+
instance,
46+
cmd,
47+
timeout_in_minutes=DEFAULT_TIMEOUT):
48+
retry_count = 5
49+
ssh_cmd = 'gcloud compute ssh -q {} --zone={} --command="{}" -- -o ConnectTimeout=60 -o StrictHostKeyChecking=no'.format(
50+
instance, self.cluster_zone, cmd.replace('"', '\"'))
51+
52+
while retry_count > 0:
53+
try:
54+
# Use self.assert_command from DataprocTestCase
55+
ret_code, stdout, stderr = self.assert_command(ssh_cmd, timeout_in_minutes)
56+
return ret_code, stdout, stderr
57+
except Exception as e:
58+
print(f"An error occurred in assert_instance_command: {e}")
59+
retry_count -= 1
60+
if retry_count > 0:
61+
print(f"Retrying in 10 seconds...")
62+
time.sleep(10)
63+
continue
64+
else:
65+
print("Max retries reached.")
66+
raise
67+
68+
def verify_instance(self, name):
69+
# Verify that nvidia-smi works
70+
self.assert_instance_command(name, "nvidia-smi", 1)
71+
print(f"OK: nvidia-smi on {name}")
72+
73+
def verify_instance_gpu_agent(self, name):
74+
print(f"--- Verifying GPU Agent on {name} ---")
75+
self.assert_instance_command(
76+
name, "systemctl is-active gpu-utilization-agent.service")
77+
print(f"OK: GPU Agent on {name}")
78+
79+
def get_dataproc_image_version(self, instance):
80+
_, stdout, _ = self.assert_instance_command(instance, "grep DATAPROC_IMAGE_VERSION /etc/environment | cut -d= -f2")
81+
return stdout.strip()
82+
83+
def version_lt(self, v1, v2):
84+
return version.parse(v1) < version.parse(v2)
85+
86+
def verify_pytorch(self, name):
87+
print(f"--- Verifying PyTorch on {name} ---")
88+
test_filename = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "gpu",
89+
self.TORCH_TEST_SCRIPT_FILE_NAME)
90+
self.upload_test_file(test_filename, name)
91+
92+
image_version = self.get_dataproc_image_version(name)
93+
conda_root_path = "/opt/conda/miniconda3"
94+
if not self.version_lt(image_version, "2.3"):
95+
conda_root_path = "/opt/conda"
96+
97+
conda_env = "dpgce"
98+
env_path = f"{conda_root_path}/envs/{conda_env}"
99+
python_bin = f"{env_path}/bin/python3"
100+
101+
verify_cmd = (
102+
f"for f in /sys/module/nvidia/drivers/pci:nvidia/*/numa_node; do "
103+
f" if [[ -e \\\"$f\\\" ]]; then echo 0 > \\\"$f\\\"; fi; "
104+
f"done; "
105+
f"if /usr/share/google/get_metadata_value attributes/include-pytorch; then"
106+
f" {python_bin} {self.TORCH_TEST_SCRIPT_FILE_NAME}; "
107+
f"else echo 'PyTorch test skipped as include-pytorch is not set'; fi"
108+
)
109+
_, stdout, _ = self.assert_instance_command(name, verify_cmd)
110+
if "PyTorch test skipped" not in stdout:
111+
self.assertTrue("True" in stdout, f"PyTorch CUDA not available or python not found in {env_path}")
112+
print(f"OK: PyTorch on {name}")
113+
self.remove_test_script(self.TORCH_TEST_SCRIPT_FILE_NAME, name)
114+
115+
def verify_tensorflow(self, name):
116+
print(f"--- Verifying TensorFlow on {name} ---")
117+
test_filename=os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "gpu",
118+
self.TF_TEST_SCRIPT_FILE_NAME)
119+
self.upload_test_file(test_filename, name)
120+
121+
image_version = self.get_dataproc_image_version(name)
122+
conda_root_path = "/opt/conda/miniconda3"
123+
if not self.version_lt(image_version, "2.3"):
124+
conda_root_path = "/opt/conda"
125+
126+
conda_env="dpgce"
127+
env_path = f"{conda_root_path}/envs/{conda_env}"
128+
python_bin = f"{env_path}/bin/python3"
129+
130+
verify_cmd = (
131+
f"for f in $(ls /sys/module/nvidia/drivers/pci:nvidia/*/numa_node) ; do echo 0 > ${{f}} ; done ;"
132+
f"{python_bin} {self.TF_TEST_SCRIPT_FILE_NAME}"
133+
)
134+
self.assert_instance_command(name, verify_cmd)
135+
print(f"OK: TensorFlow on {name}")
136+
self.remove_test_script(self.TF_TEST_SCRIPT_FILE_NAME, name)

0 commit comments

Comments
 (0)