
Conversation

inf3rnus
Contributor

@inf3rnus inf3rnus commented Mar 21, 2025

What does this PR do?

Enables parallel downloading of models with sharded weight files. This can decrease the time to load large models significantly, often producing speedups greater than 50%.

While HF_HUB_ENABLE_HF_TRANSFER already parallelizes the download of each individual file, HF_ENABLE_PARALLEL_DOWNLOADING allows multiple files to be downloaded concurrently, which can greatly speed up downloads if the machine you're using has the network and I/O bandwidth to handle it.

e.g. With only HF_HUB_ENABLE_HF_TRANSFER you can hit a peak network throughput of ~1.5GB/s, which is roughly in line with S3's limit for a single file of around 5GB in size. With this change you can hit a peak network throughput of ~6.5GB/s.

e.g. here's a comparison for facebook/opt-30b on an AWS EC2 g4dn.metal:

  • HF_HUB_ENABLE_HF_TRANSFER enabled, HF_ENABLE_PARALLEL_DOWNLOADING disabled

    • ~45s download
  • HF_HUB_ENABLE_HF_TRANSFER enabled, HF_ENABLE_PARALLEL_DOWNLOADING enabled

    • ~12s download

That's roughly a 73% speed up!

To fully saturate a machine with massive network bandwidth, set HF_ENABLE_PARALLEL_DOWNLOADING="true" and HF_HUB_ENABLE_HF_TRANSFER="1".
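For example, a minimal sketch of enabling both flags from Python before loading a sharded model (HF_ENABLE_PARALLEL_DOWNLOADING is the variable proposed in this PR, not an existing library setting):

```python
import os

# Set before importing transformers so the flags are picked up when constants are read.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"          # existing hf_transfer switch
os.environ["HF_ENABLE_PARALLEL_DOWNLOADING"] = "true"  # proposed in this PR

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b")
```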

I've included an additional env var, HF_PARALLEL_DOWNLOADING_WORKERS.

It determines how many threads are used when parallel downloading of model weight shards is enabled.

The default is 8, although fewer may run if the number of shard files is less than the number of workers; i.e. it takes the min() of HF_PARALLEL_DOWNLOADING_WORKERS and the number of shard files to download.

e.g. if there are 2 shard files but 8 workers are specified, only two workers will spawn.
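A tiny sketch of that rule (names here are illustrative, not the PR's actual code):

```python
import os

# Hypothetical shard list, e.g. from a checkpoint index file.
shard_files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]

requested = int(os.environ.get("HF_PARALLEL_DOWNLOADING_WORKERS", "8"))
num_workers = min(requested, len(shard_files))  # 2 shard files, 8 requested -> 2 workers spawn
```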

Like the PR that optimizes sharded model loading, this should collectively save thousands of dollars a month, if not more, globally for anyone using HF on the cloud.

Before submitting

@ArthurZucker
@Rocketknight1

…NABLE_HF_TRANSFER is enabled. Create documentation for it.
@github-actions github-actions bot marked this pull request as draft March 21, 2025 01:57

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@inf3rnus inf3rnus marked this pull request as ready for review March 21, 2025 02:18
Collaborator

@ArthurZucker ArthurZucker left a comment


In principle not against this! Mostly want to hear @Wauplin's opinion on the PR as it's hub stuff!


# Environment Variables

Collaborator


We should add a warning as this should not really be run on login nodes or nodes that share bandwidth, at the risk of a lot of anger from your colleagues! 🤣

Suggested change
- Debugging issues involving hf_transfer can be extremely difficult, making maintenance and issue resolution frustrating.
- User experience may be slightly degraded, with a more fragmented progress bar, corner cases when using Ctrl+C, and the overhead of launching a subprocess.
- hf_transfer only provides a speed boost if the machine has sufficient bandwidth—otherwise, performance remains the same or may even regress.
- Spawning a subprocess on each CPU core can heavily impact system performance, potentially freezing or significantly slowing down laptops.
- Some features are not supported, including resumable downloads, proxies, and others.

Contributor Author


True!

The intention behind having this available is primarily for anyone working with HF on the cloud.

Contributor Author

@inf3rnus inf3rnus Mar 21, 2025


One other note: we're using threads to avoid the painful overhead of Python multiprocessing 😄 (processes take an additional 4-6s to instantiate vs. threads), but a laptop being totally overwhelmed by the processing load will indeed happen on smaller machines!
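For illustration, a minimal thread-based sketch of the idea (not the PR's exact code; it assumes you already have the shard filenames from the checkpoint index):

```python
from concurrent.futures import ThreadPoolExecutor

from huggingface_hub import hf_hub_download


def download_shards(repo_id, shard_filenames, num_workers=8):
    # Downloads are I/O-bound, so threads give enough concurrency without the
    # multi-second startup cost of spawning processes.
    workers = min(num_workers, len(shard_filenames))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: hf_hub_download(repo_id=repo_id, filename=f), shard_filenames))
```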


Determines how many threads should be used when the parallel downloading of model weight shards is enabled.

Default is `8`, although fewer may run if the number of shard files is less than the number of workers; i.e. it takes the min() of HF_PARALLEL_DOWNLOADING_WORKERS and the number of shard files to download.
Collaborator


that can be automated!

Contributor Author

@inf3rnus inf3rnus Mar 21, 2025


This is absolutely true! With that said, I feel like if this PR were to be merged, that should be an iteration 2 enhancement.

The argument being that this is something people are likely going to need to tune anyway to squeeze every last drop out of it, although there's probably a simple, sane heuristic we could use 🤔
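For what it's worth, one possible heuristic (just a sketch, not part of this PR) would be to bound the worker count by the CPU count and the number of shards:

```python
import os


def auto_num_workers(num_shards, cap=16):
    # Never more workers than shards, never more than a sane cap, and roughly
    # bounded by the CPU count so small machines aren't overwhelmed.
    return max(1, min(os.cpu_count() or 1, num_shards, cap))
```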

@ArthurZucker ArthurZucker requested a review from Wauplin March 21, 2025 09:29
@Wauplin
Contributor

Wauplin commented Mar 21, 2025

Hey there! Thanks for the ping :) It's very interesting to see that even hf_transfer can be too limited on very large bandwidth (>1.5GB/s). I'm however a bit wary about the implementation itself and the timeline. To be honest, I'd prefer to see this logic live in huggingface_hub directly. For context, there are a few changes going on in the download workflow and it would be preferable to get these settled before adding other environment variables for transformers to comply with. Namely:

  • the huggingface_hub library now has serialization methods meant to work with sharded weights (see package reference). The goal is to unify the logic between transformers/diffusers/accelerate/etc. on how to save/load sharded models. There isn't a public-facing equivalent for get_checkpoint_shard_files yet, but @hanouticelina will soon be working on integrating these serialization methods into transformers (to be fair, they were taken from transformers in the first place so shouldn't pose too many problems)
  • there is an ongoing company-wide project to refactor our storage backend to efficiently store large files (see Xet blog). This project will be impactful for transformers users for a few reasons:
    • huggingface_hub will soon use hf_xet as download client instead of hf_transfer. It is a new rust-based client optimized to download Xet files from the Hub. Any optimization (e.g. parallelization) done in the future should be performed at this level instead. The main difference of hf_xet is that it works at a chunk-level instead of a file-level and can therefore heavily parallelize downloads across files (while hf_transfer is working file by file - which your PR tries to circumvent).
    • for Xet-enabled repos (i.e. understand "all repos" mid-term), there won't be any file size limit anymore. This means people will be able to upload model weights as a single file even if it's a 150GB one. No more shards to work with :)

These 2 topics make me think that it'd be preferable not to move forward right now on this PR. Sorry if it means slower performance short-term, but it is for the greater good (and better maintainability) long-term. If download speed is critical for your use case, I would suggest downloading the weights first and only then running your transformers script. Since the cache is shared, previously downloaded weights will be available right away. The benefit of downloading the weights first is that you can build your own script on top of huggingface_hub download methods to maximize throughput.
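As a sketch of that suggestion (assuming the default shared cache location), pre-fetching with huggingface_hub and then loading with transformers might look like:

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

# Fetch the whole repo first; max_workers controls snapshot_download's own thread pool
# (which is only used when HF_HUB_ENABLE_HF_TRANSFER is off).
snapshot_download("facebook/opt-30b", max_workers=8)

# Loads straight from the shared cache, no re-download.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b")
```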

@ArthurZucker
Collaborator

Thanks a lot for the great explanation!

@inf3rnus
Contributor Author

inf3rnus commented Mar 21, 2025

@ArthurZucker I think @Wauplin is spot on, this change should live in huggingface_hub. Thanks for looking at this 🙏, may circle back in a bit...

@Wauplin Everything you said resonates completely with me, all downloading functionality should ideally be located in a single place.

Any idea how long until we'll see a release for chunk-based downloading with Xet?

I ask because if it's two months or more away, I'd really like to have this functionality owned by huggingface instead of having my team manage it as a fork with each release.

Would this change be sufficient to get this fast-tracked for release with the huggingface_hub repo, provided docs and tests are updated?

This is in snapshot_download, end of the function...

The proposal is this, change this guy:

    if constants.HF_HUB_ENABLE_HF_TRANSFER:
        # when using hf_transfer we don't want extra parallelism
        # from the one hf_transfer provides
        for file in filtered_repo_files:
            _inner_hf_hub_download(file)
    else:
        thread_map(
            _inner_hf_hub_download,
            filtered_repo_files,
            desc=f"Fetching {len(filtered_repo_files)} files",
            max_workers=max_workers,
            # User can use its own tqdm class or the default one from `huggingface_hub.utils`
            tqdm_class=tqdm_class or hf_tqdm,
        )

to this:

    # Second condition allows it to skip serial file downloads with HF_HUB_ENABLE_HF_TRANSFER and instead also use the thread pool
    if constants.HF_HUB_ENABLE_HF_TRANSFER and not constants.HF_ENABLE_PARALLEL_DOWNLOADING:
        for file in filtered_repo_files:
            _inner_hf_hub_download(file)
    else:
        thread_map(
            _inner_hf_hub_download,
            filtered_repo_files,
            desc=f"Fetching {len(filtered_repo_files)} files",
            max_workers=max_workers,
            # User can use its own tqdm class or the default one from `huggingface_hub.utils`
            tqdm_class=tqdm_class or hf_tqdm,
        )

Also, thank you for pointing this out to me, because if my team does have to maintain this separately, this is a very minor change!

Best,
Aaron
