This repository was archived by the owner on Feb 3, 2021. It is now read-only.

Description
When provisioning a new cluster with more than 50 nodes, we start to see a large proportion of start task failures (25+%) when pulling the Docker image from an ACR.
error pulling image configuration: received unexpected HTTP status: 503 The server is busy
I raised a support issue with the ACR team. They said we were being throttled and recommended that we retry pulling the image when we get a 503.
Note that our image is quite large (~7 GB), which may be why we experience this issue whilst others do not.
Here are some of the ways I think we could mitigate this:
- Retry pulling the image when a 503 is returned (as recommended by the ACR team)
- Retry the entire start-task when any failure occurs (it looks like there is support for this in the batch SDK?)
- Configure the Docker daemon to pull fewer layers in parallel using the max-concurrent-downloads option. I looked at whether this could be done using a plugin, but I think plugins would be run too late?
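To make the first option concrete, here is a rough sketch of what a retry around the pull might look like. It is not how the project currently pulls images: `pull_with_retry`, its parameters, and the delay values are all hypothetical names and numbers chosen for illustration. It assumes the 503 text shows up in the stderr of `docker pull`, as it does in the error above.

```python
import random
import subprocess
import time


def pull_with_retry(image, max_attempts=5, base_delay=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Pull `image`, retrying with exponential backoff on a registry 503.

    Hypothetical helper: `run` and `sleep` are injectable only so the
    logic can be exercised without Docker. Returns the attempt number
    that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(["docker", "pull", image],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        # Assumption: throttling is recognisable by "503" in stderr.
        # Anything else (auth failure, missing tag) fails immediately.
        throttled = "503" in result.stderr
        if not throttled or attempt == max_attempts:
            raise RuntimeError(
                f"docker pull failed after {attempt} attempt(s): "
                f"{result.stderr.strip()}"
            )
        # Exponential backoff plus random jitter, so that many nodes
        # starting together do not all retry at the same moment.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay))
```

Something like `pull_with_retry("myregistry.azurecr.io/myimage:latest")` (a placeholder image name) would then replace the bare `docker pull` in the start task. The jitter matters here because the throttling appears when 50+ nodes pull at once; a fixed delay would just move the spike.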
What do you think the best approach would be? Can you recommend any others?