Start task failure: error pulling image: 503 The server is busy

When provisioning a new cluster with more than 50 nodes, we see a large proportion of start task failures (25% or more) when pulling the Docker image from an Azure Container Registry (ACR).

```
error pulling image configuration: received unexpected HTTP status: 503 The server is busy
```

I raised a support issue with the ACR team. They said we were being throttled and recommended that we retry pulling the image when we get a 503.

Note that our image is quite large (~7 GB), which may be why we experience this issue whilst others do not.

Here are some of the ways I think we could mitigate this:

1) Retry pulling the image when 503 returned (as recommended by the ACR team)

2) Retry the entire start-task when any failure occurs (it looks like there is support for this in the [batch SDK](https://github.com/Azure/azure-sdk-for-python/blob/41e37c8a10876db40697a63e828ed7cafc19c7d6/azure-mgmt-batch/azure/mgmt/batch/models/start_task.py#L70)?)

3) Configure the Docker daemon to pull fewer layers in parallel using the [max-concurrent-downloads](https://docs.docker.com/engine/reference/commandline/pull/#concurrent-downloads) option. I looked at whether this could be done using a plugin, but I think plugins would run too late?
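
To make option 1 concrete, here is a rough sketch of the kind of retry wrapper I have in mind. It assumes the pull is done by shelling out to `docker pull`, and that the 503 shows up in stderr in the same form as the error above. The function names (`is_throttled`, `backoff_delays`, `pull_with_retry`), the attempt count, and the delay values are all made up for illustration; they are not taken from Batch, ACR, or any existing tool.

```python
import random
import subprocess
import time


def is_throttled(stderr):
    """Return True if docker's error output looks like a registry 503.

    Assumption: the message matches the one in this issue,
    e.g. "received unexpected HTTP status: 503 The server is busy".
    """
    return "HTTP status: 503" in stderr


def backoff_delays(attempts, base=5.0, cap=120.0, jitter=random.random):
    """Exponential backoff with jitter: one delay per gap between attempts.

    Jitter spreads the retries out so that many nodes do not all hit the
    registry again at the same moment.
    """
    return [
        min(cap, base * 2 ** i) * (0.5 + 0.5 * jitter())
        for i in range(attempts - 1)
    ]


def pull_with_retry(image, attempts=5, run=subprocess.run, sleep=time.sleep):
    """Pull an image, retrying only when the failure looks like a 503.

    Returns the number of attempts used. Raises RuntimeError on any other
    failure, or when the 503 persists after the last attempt.
    """
    delays = backoff_delays(attempts)
    for attempt in range(attempts):
        result = run(["docker", "pull", image], capture_output=True, text=True)
        if result.returncode == 0:
            return attempt + 1
        last_attempt = attempt == attempts - 1
        if last_attempt or not is_throttled(result.stderr):
            raise RuntimeError(
                "docker pull failed after %d attempt(s): %s"
                % (attempt + 1, result.stderr.strip())
            )
        sleep(delays[attempt])
```

The `run` and `sleep` parameters are only there so the logic can be exercised without a Docker daemon. With a 7 GB image, capping the delay and adding jitter seems more important than the exact number of attempts.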

What do you think the best approach would be? Can you recommend any others?
