
Conversation

inf3rnus
Contributor

@inf3rnus inf3rnus commented Mar 21, 2025

What does this PR do?

Enables parallel downloading of models with sharded weight files. This can decrease the time to load large models significantly, often producing speedups greater than 50%.

While HF_HUB_ENABLE_HF_TRANSFER already parallelizes the download of each individual file, HF_ENABLE_PARALLEL_DOWNLOADING allows multiple files to be downloaded concurrently, which can greatly speed up downloads if the machine you're using has the network and I/O bandwidth to handle it.

e.g. With only HF_HUB_ENABLE_HF_TRANSFER you can hit a peak network throughput of ~1.5GB/s, which is roughly in line with S3's limit for a single file of around 5GB in size. With this change you can hit a peak network throughput of ~6.5GB/s.

e.g. here's a comparison for facebook/opt-30b on an AWS EC2 g4dn.metal:

  • HF_HUB_ENABLE_HF_TRANSFER enabled, HF_ENABLE_PARALLEL_DOWNLOADING disabled

    • ~45s download
  • HF_HUB_ENABLE_HF_TRANSFER enabled, HF_ENABLE_PARALLEL_DOWNLOADING enabled

    • ~12s download

That's roughly a 73% speed up!

To fully saturate a machine with massive network bandwidth, set HF_ENABLE_PARALLEL_DOWNLOADING="true" and HF_HUB_ENABLE_HF_TRANSFER="1".
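For example, a minimal sketch of enabling both flags from Python before loading a sharded model (HF_ENABLE_PARALLEL_DOWNLOADING is the variable proposed in this PR, not an existing library setting):

```python
import os

# Set before importing transformers so the flags are picked up when constants are read.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"          # existing hf_transfer switch
os.environ["HF_ENABLE_PARALLEL_DOWNLOADING"] = "true"  # proposed in this PR

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b")
```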

I've included an additional env var, HF_PARALLEL_DOWNLOADING_WORKERS.

It determines how many threads are used when parallel downloading of model weight shards is enabled.

The default is 8, although fewer may run if the number of shard files is less than the number of workers; i.e. it takes the min() of HF_PARALLEL_DOWNLOADING_WORKERS and the number of shard files to download.

e.g. if there are 2 shard files but 8 workers are specified, only two workers will spawn.
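A tiny sketch of that rule (names here are illustrative, not the PR's actual code):

```python
import os

# Hypothetical shard list, e.g. from a checkpoint index file.
shard_files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]

requested = int(os.environ.get("HF_PARALLEL_DOWNLOADING_WORKERS", "8"))
num_workers = min(requested, len(shard_files))  # 2 shard files, 8 requested -> 2 workers spawn
```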

Like the PR that optimizes sharded model loading, this should collectively save thousands of dollars a month, if not more, globally for anyone using HF on the cloud.

Before submitting

@ArthurZucker
@Rocketknight1

…NABLE_HF_TRANSFER is enabled. Create documentation for it.
@github-actions github-actions bot marked this pull request as draft March 21, 2025 01:57

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@inf3rnus inf3rnus marked this pull request as ready for review March 21, 2025 02:18
Collaborator

@ArthurZucker ArthurZucker left a comment


In principle not against this! Mostly want to hear @Wauplin's opinion on the PR as it's hub stuff!


# Environment Variables

Collaborator


We should add a warning as this should not really be run on login nodes or nodes that share bandwidth, at the risk of a lot of anger from your colleagues! 🤣

Suggested change
- Debugging issues involving hf_transfer can be extremely difficult, making maintenance and issue resolution frustrating.
- User experience may be slightly degraded, with a more fragmented progress bar, corner cases when using Ctrl+C, and the overhead of launching a subprocess.
- hf_transfer only provides a speed boost if the machine has sufficient bandwidth—otherwise, performance remains the same or may even regress.
- Spawning a subprocess on each CPU core can heavily impact system performance, potentially freezing or significantly slowing down laptops.
- Some features are not supported, including resumable downloads, proxies, and others.

Contributor Author


True!

The intention behind having this available is primarily for anyone working with HF on the cloud.

Contributor Author

@inf3rnus inf3rnus Mar 21, 2025


One other note: we're using threads to avoid the painful overhead of Python multiprocessing 😄 (processes take an additional 4-6s to instantiate vs. threads), but a laptop being totally overwhelmed by the processing load will indeed happen on smaller machines!
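For illustration, a minimal thread-based sketch of the idea (not the PR's exact code; it assumes you already have the shard filenames from the checkpoint index):

```python
from concurrent.futures import ThreadPoolExecutor

from huggingface_hub import hf_hub_download


def download_shards(repo_id, shard_filenames, num_workers=8):
    # Downloads are I/O-bound, so threads give enough concurrency without the
    # multi-second startup cost of spawning processes.
    workers = min(num_workers, len(shard_filenames))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: hf_hub_download(repo_id=repo_id, filename=f), shard_filenames))
```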


Determines how many threads should be used when the parallel downloading of model weight shards is enabled.

Default is `8`, although fewer may run if the number of shard files is less than the number of workers; i.e. it takes the min() of HF_PARALLEL_DOWNLOADING_WORKERS and the number of shard files to download.
Collaborator


that can be automated!

Contributor Author

@inf3rnus inf3rnus Mar 21, 2025


This is absolutely true! With that said, I feel like if this PR were to be merged, that should be an iteration 2 enhancement.

The argument being that this is something people are likely going to need to tune anyway to squeeze every last drop out of it, although there's probably a simple, sane heuristic we could use 🤔
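For what it's worth, one possible heuristic (just a sketch, not part of this PR) would be to bound the worker count by the CPU count and the number of shards:

```python
import os


def auto_num_workers(num_shards, cap=16):
    # Never more workers than shards, never more than a sane cap, and roughly
    # bounded by the CPU count so small machines aren't overwhelmed.
    return max(1, min(os.cpu_count() or 1, num_shards, cap))
```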

@ArthurZucker ArthurZucker requested a review from Wauplin March 21, 2025 09:29
@Wauplin
Contributor

Wauplin commented Mar 21, 2025

Hey there! Thanks for the ping :) It's very interesting to see that even hf_transfer can be too limited on very large bandwidth (>1.5GB/s). I'm however a bit wary about the implementation itself and the timeline. To be honest, I'd prefer to see this logic live in huggingface_hub directly. For context, there are a few changes going on in the download workflow and it would be preferable to get these settled before adding other environment variables for transformers to comply with. Namely:

  • the huggingface_hub library now has serialization methods meant to work with sharded weights (see package reference). The goal is to unify the logic between transformers/diffusers/accelerate/etc. on how to save/load sharded models. There isn't a public-facing equivalent for get_checkpoint_shard_files yet, but @hanouticelina will soon be working on integrating these serialization methods into transformers (to be fair, they were taken from transformers in the first place so shouldn't pose too many problems)
  • there is an ongoing company-wide project to refactor our storage backend to efficiently store large files (see Xet blog). This project will be impactful for transformers users for a few reasons:
    • huggingface_hub will soon use hf_xet as download client instead of hf_transfer. It is a new rust-based client optimized to download Xet files from the Hub. Any optimization (e.g. parallelization) done in the future should be performed at this level instead. The main difference of hf_xet is that it works at a chunk-level instead of a file-level and can therefore heavily parallelize downloads across files (while hf_transfer is working file by file - which your PR tries to circumvent).
    • for Xet-enabled repos (i.e. understand "all repos" mid-term), there won't be any file size limit anymore. This means people will be able to upload model weights as a single file even if it's a 150GB one. No more shards to work with :)

These 2 topics make me think that it'd be preferable not to move forward right now on this PR. Sorry if it means slower performance short-term, but it is for the greater good (and better maintainability) long-term. If download speed is critical for your use case, I would suggest downloading the weights first and only then running your transformers script. Since the cache is shared, previously downloaded weights will be available right away. The benefit of downloading the weights first is that you can build your own script on top of huggingface_hub download methods to maximize throughput.
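As a sketch of that suggestion (assuming the default shared cache location), pre-fetching with huggingface_hub and then loading with transformers might look like:

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

# Fetch the whole repo first; max_workers controls snapshot_download's own thread pool
# (which is only used when HF_HUB_ENABLE_HF_TRANSFER is off).
snapshot_download("facebook/opt-30b", max_workers=8)

# Loads straight from the shared cache, no re-download.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b")
```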

@ArthurZucker
Collaborator

Thanks a lot for the great explanation!

@inf3rnus
Contributor Author

inf3rnus commented Mar 21, 2025

@ArthurZucker I think @Wauplin is spot on, this change should live in huggingface_hub. Thanks for looking at this 🙏, may circle back in a bit...

@Wauplin Everything you said resonates completely with me, all downloading functionality should ideally be located in a single place.

Any idea how long until we'll see a release for chunk-based downloading with Xet?

I ask because if it's two months or more away, I'd really like to have this functionality owned by huggingface instead of having my team manage it as a fork with each release.

Would this change be sufficient to get this fast-tracked for release with the huggingface_hub repo, provided docs and tests are updated?

This is in snapshot_download, end of the function...

The proposal is this, change this guy:

    if constants.HF_HUB_ENABLE_HF_TRANSFER:
        # when using hf_transfer we don't want extra parallelism
        # from the one hf_transfer provides
        for file in filtered_repo_files:
            _inner_hf_hub_download(file)
    else:
        thread_map(
            _inner_hf_hub_download,
            filtered_repo_files,
            desc=f"Fetching {len(filtered_repo_files)} files",
            max_workers=max_workers,
            # User can use its own tqdm class or the default one from `huggingface_hub.utils`
            tqdm_class=tqdm_class or hf_tqdm,
        )

to this:

    # Second condition allows it to skip serial file downloads with HF_HUB_ENABLE_HF_TRANSFER and instead also use the thread pool
    if constants.HF_HUB_ENABLE_HF_TRANSFER and not constants.HF_ENABLE_PARALLEL_DOWNLOADING:
        for file in filtered_repo_files:
            _inner_hf_hub_download(file)
    else:
        thread_map(
            _inner_hf_hub_download,
            filtered_repo_files,
            desc=f"Fetching {len(filtered_repo_files)} files",
            max_workers=max_workers,
            # User can use its own tqdm class or the default one from `huggingface_hub.utils`
            tqdm_class=tqdm_class or hf_tqdm,
        )

Also, thank you for pointing this out to me, because if my team does have to maintain this separately, this is a very minor change!

Best,
Aaron
