This repository was archived by the owner on Feb 3, 2021. It is now read-only.

Start task failure: error pulling image: 503 The server is busy #659

@lachiemurray

When provisioning a new cluster with more than 50 nodes, we start to see a large proportion of start task failures (25% or more) when pulling the Docker image from an Azure Container Registry (ACR).

error pulling image configuration: received unexpected HTTP status: 503 The server is busy

I raised a support issue with the ACR team. They said we were being throttled and recommended that we retry pulling the image when we get a 503.

Note that our image is quite large (~7 GB), which may be why we experience this issue whilst others do not.

Here are some of the ways I think we could mitigate this:

  1. Retry pulling the image when a 503 is returned (as recommended by the ACR team).

  2. Retry the entire start task when any failure occurs (it looks like there is support for this in the Batch SDK?).

  3. Configure the Docker daemon to pull fewer layers in parallel using the max-concurrent-downloads option. I looked at whether this could be done using a plugin, but I think plugins would be run too late?
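To make option 1 concrete, here is a minimal sketch of retrying the pull with exponential backoff when the registry reports a 503. It assumes the pull is done by shelling out to `docker pull`; the function name `pull_with_retry`, its parameters, and the check for "503" in stderr are all hypothetical choices for illustration, not how the start task currently works.

```python
import random
import subprocess
import time


def pull_with_retry(image, max_attempts=5, base_delay=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Run `docker pull`, retrying with exponential backoff on a 503.

    Hypothetical helper: `run` and `sleep` are injectable so the logic
    can be exercised without Docker. Returns the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(["docker", "pull", image],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return attempt

        # Assumption: throttling shows up as "503" in the CLI's stderr,
        # as in the error message quoted above.
        throttled = "503" in result.stderr
        if not throttled or attempt == max_attempts:
            raise RuntimeError(
                f"docker pull failed after {attempt} attempt(s): "
                f"{result.stderr.strip()}"
            )

        # Exponential backoff plus random jitter, so that many nodes
        # starting at once do not all retry at the same moment.
        delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
        sleep(delay)
```

The jitter matters here because the throttling is triggered by 50+ nodes pulling simultaneously; retrying in lockstep would likely reproduce the same burst.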

What do you think the best approach would be? Can you recommend any others?
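For reference on option 3, the sketch below shows one way the daemon setting could be written. It assumes a Linux node where the Docker daemon reads `/etc/docker/daemon.json`; the helper name `set_max_concurrent_downloads` is hypothetical. The daemon would still need to be restarted before the image is pulled, and whether there is a hook early enough to do that is the open question above.

```python
import json
from pathlib import Path


def set_max_concurrent_downloads(limit, path="/etc/docker/daemon.json"):
    """Set "max-concurrent-downloads" in daemon.json, keeping other keys.

    Hypothetical helper; the default path is an assumption about the node image.
    """
    config_path = Path(path)
    text = config_path.read_text().strip() if config_path.exists() else ""
    config = json.loads(text) if text else {}

    # Lower values mean fewer layers are downloaded in parallel per pull.
    config["max-concurrent-downloads"] = limit

    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `set_max_concurrent_downloads(1)` would serialize layer downloads on a node, trading pull speed for fewer simultaneous requests to the registry.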
