
Conversation

zhangfengcdt
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject. Closes #<issue_number>

What changes were proposed in this PR?

Add retry logic with exponential backoff for Spark downloads to make the workflow resilient to temporary network issues.
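For reference, here is a minimal sketch of that retry pattern; the URL, attempt count, and backoff base are illustrative, not the exact values in the workflow:

    # Sketch of a download retry loop with exponential backoff.
    # SPARK_URL and MAX_ATTEMPTS are illustrative, not the exact values in r.yml.
    SPARK_URL="https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz"
    MAX_ATTEMPTS=5
    for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
      curl --fail --location --output spark.tgz "$SPARK_URL" && break
      echo "Download failed (attempt $attempt), retrying..."
      sleep $((2 ** attempt))  # back off: 2s, 4s, 8s, 16s, ...
    done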

How was this patch tested?

Tested on a CI run.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

@zhangfengcdt
Member Author

Downloading from the Apache Archive is currently very slow.

Tested locally:

  • Speed: ~44 KB/s (43,968 bytes/sec)
  • 10MB would take: ~4 minutes
  • 400MB Spark would take: ~2.5 hours!

Also, the Spark package cache cannot be found on GitHub:

    gh api repos/apache/sedona/actions/caches --jq '.actions_caches[].key'

@jiayuasu
Member

Maybe we should see if we can download the full distribution via PySpark. See how we do it in the Python workflow to avoid touching archive.apache.org: https://github.com/apache/sedona/blob/master/.github/workflows/python.yml#L157

@zhangfengcdt zhangfengcdt changed the title [GH-2351] [CI] Fix R CI flakiness with Spark download retry logic and timeout [GH-2351] [CI] Fix R CI flakiness with Spark download from PySpark Sep 16, 2025
@zhangfengcdt
Member Author

Maybe we should see if we can download the full distribution via PySpark. See how we do it in the Python workflow to avoid touching archive.apache.org: https://github.com/apache/sedona/blob/master/.github/workflows/python.yml#L157

Great suggestion! I implemented it the same way as python.yml and it works well without timeouts. The R tests now finish in around 10 minutes.
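For context, a minimal sketch of the PySpark-based setup, assuming it mirrors the pattern in python.yml (the version pin is illustrative):

    # Install Spark via the PySpark wheel instead of downloading from archive.apache.org.
    pip install pyspark==3.5.3
    # The pip package ships a full Spark distribution; point SPARK_HOME at it
    # so the R tests (via sparklyr) can pick it up.
    export SPARK_HOME=$(python -c "import pyspark; print(pyspark.__path__[0])")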

@Copilot
Contributor


Pull Request Overview

This PR addresses CI flakiness by replacing the cached Spark installation with a Spark setup provided by PySpark.

  • Replaces Spark caching mechanism with PySpark installation and direct JAR management
  • Adds support for using SPARK_HOME environment variable in R test connections
  • Implements retry logic with exponential backoff for downloading JAI libraries (sketched after this list)
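A rough sketch of that JAI download step, assuming the same backoff pattern; JAI_BASE_URL and the jar list are placeholders, not the exact values in r.yml:

    # Download JAI jars into the Spark jars directory, retrying with exponential backoff.
    # JAI_BASE_URL and the jar names are assumptions for illustration.
    JAI_BASE_URL="https://example.org/jai"
    for jar in jai_core-1.1.3.jar jai_codec-1.1.3.jar; do
      for attempt in 1 2 3; do
        curl --fail --location --output "$SPARK_HOME/jars/$jar" "$JAI_BASE_URL/$jar" && break
        sleep $((2 ** attempt))
      done
    done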

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

  • .github/workflows/r.yml: Replaces the cached Spark installation with a PySpark setup, adds JAI library downloads with retry logic, and sets SPARK_HOME
  • R/tests/testthat/helper-initialize.R: Adds SPARK_HOME support for test connections and conditional logic for local vs. CI environments


@jiayuasu jiayuasu merged commit 6c8f66c into apache:master Sep 17, 2025
13 checks passed