@inf3rnus commented Apr 8, 2025

Per the discussion in huggingface/transformers#36870, I've added this change to the huggingface_hub package.

Repost from the original PR:

What does this PR do?
Enables parallel downloading of models with sharded weight files. This can significantly decrease the time to load large models, often producing speedups of greater than 50%.

While downloading is already parallelized at the chunk level within each file when HF_HUB_ENABLE_HF_TRANSFER is enabled, HF_ENABLE_PARALLEL_DOWNLOADING additionally parallelizes across files, so multiple files download concurrently. This can greatly speed up downloads if the machine you're using has the network and I/O bandwidth to handle it.

For example, with only HF_HUB_ENABLE_HF_TRANSFER you can hit a peak network throughput of ~1.5 GB/s, which is roughly in line with S3's limits for a single file of around 5 GB in size. With this change you can hit a peak network throughput of ~6.5 GB/s.

For example, here's a comparison for facebook/opt-30b on an AWS EC2 g4dn.metal:

  • HF_HUB_ENABLE_HF_TRANSFER enabled, HF_ENABLE_PARALLEL_DOWNLOADING disabled: ~45s download
  • HF_HUB_ENABLE_HF_TRANSFER enabled, HF_ENABLE_PARALLEL_DOWNLOADING enabled: ~12s download

That's roughly a 73% speedup!

To fully saturate a machine capable of massive network bandwidth, set HF_ENABLE_PARALLEL_DOWNLOADING="true" and HF_HUB_ENABLE_HF_TRANSFER="1", as in the sketch below.
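A minimal sketch of using both flags together (assuming this PR's HF_ENABLE_PARALLEL_DOWNLOADING is available in your build; the repo id is just an example):

```python
import os

# set both flags before importing huggingface_hub, which reads its
# environment variables at import time
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"          # chunk-level parallelism within each file
os.environ["HF_ENABLE_PARALLEL_DOWNLOADING"] = "true"  # file-level parallelism (this PR)

from huggingface_hub import snapshot_download

# with both flags set, shards download concurrently and each shard is
# itself fetched in parallel chunks (requires `pip install hf_transfer`)
snapshot_download("facebook/opt-30b")
```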

Like https://github.com/huggingface/transformers/pull/36835, which optimizes sharded model loading, this should collectively save thousands of dollars a month, if not more, for anyone using HF on the cloud.

@Wauplin I have preemptively submitted this to get your eyes on it. I still need to confirm this produces a speedup with Xet, but I don't see why it wouldn't, unless the Xet downloader manages its chunks across multiple files in aggregate. Any clarity on this would be greatly appreciated. If such data is required to merge this PR, I will provide it.

Best,
Aaron

@inf3rnus (Author) commented Apr 9, 2025

This is confirmed to be working.

It looks like there is no speedup when used with Xet. It appears the Xet client always runs if the file is stored with Xet.

I was seeing a max throughput of ~1 GB/s even with concurrency enabled when using Xet.

This was with meta-llama/Llama-4-Scout-17B-16E on a g6.48xlarge instance.

I'll need to dig deeper, but it would be desirable to bypass Xet if your concern is cold boot performance.

@hanouticelina (Contributor) commented
Hi @inf3rnus,
thanks a lot for opening this PR and for your detailed explanation!
Following up on the previous discussion in huggingface/transformers#36870, we're still not convinced about adding this specific approach directly into huggingface_hub at this time, for two main reasons:

  • As mentioned by @Wauplin, our download system is currently undergoing significant updates as we work on integrating our new backend, Xet. This means the code for downloading files is likely to change quite a bit in the coming weeks, and the core logic of snapshot_download itself is likely to be revisited and potentially refactored as part of this integration.

  • Also, hf_transfer already makes single-file downloads faster by fetching chunks simultaneously. Adding another layer to download multiple files in parallel gets complicated quickly, and for users with typical internet speeds, this extra complexity could actually make downloads slower overall. We definitely don't want to degrade the UX for the average user.

In the meantime, if achieving this maximum download speed is critical for your specific use case, we would suggest implementing this logic as a standalone script on your end. You could simply replicate the logic of snapshot_download:

  1. Get the list of files and the commit sha using HfApi().repo_info().
  2. Use a concurrent executor, like tqdm.contrib.concurrent.thread_map(), to call hf_hub_download for each filename in your list, with HF_HUB_ENABLE_HF_TRANSFER=1 (see the sketch below).
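A minimal sketch of that standalone script, assuming HF_HUB_ENABLE_HF_TRANSFER=1 is already set in the environment (the repo id and worker count are illustrative):

```python
from huggingface_hub import HfApi, hf_hub_download
from tqdm.contrib.concurrent import thread_map

repo_id = "facebook/opt-30b"  # example repo

# 1. get the list of files and the commit sha
info = HfApi().repo_info(repo_id)
filenames = [sibling.rfilename for sibling in info.siblings]

# 2. download the files in parallel, pinned to the resolved revision
thread_map(
    lambda filename: hf_hub_download(repo_id, filename, revision=info.sha),
    filenames,
    max_workers=8,  # tune for your network and disk bandwidth
)
```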

This script is basically a custom wrapper around our existing download functions and should provide the immediate performance you need while we focus on improving the default download experience for all users in the coming weeks.

We hope this makes sense and helps clarify our position. Let us know if you have any questions about implementing the suggested script. Thank you for your understanding!

@hanouticelina (Contributor) commented

Closing this PR, but feel free to comment if you have any questions or feedback related to my previous comment.

@inf3rnus (Author) commented Apr 15, 2025

Yeah, I still think this is a minimal enhancement that yields a lot of value, but given that it may become legacy soon, I understand that y'all don't want anything exploding as everything is switched over to Xet, and that interference from any PRs touching the meat and potatoes of this code is undesirable.

I'm going to be critical for a second: I'm wary of Xet doing anything in particular to speed up cold downloads. It largely looks useful for reducing storage costs and making subsequent model downloads faster.

Some feedback:

  • In order for Xet to beat the performance of hf_transfer plus downloading multiple files simultaneously on cold downloads, it needs to provide byte-level parallelism within each file as well as parallel file downloads. While Xet will save HF a lot on costs and save time on subsequent loads, based on the cursory testing I did with Llama 4, it does not provide any speedup for cold downloads. I recognize this may change.

  • hf_transfer is only one piece of the puzzle in making downloads as fast as possible: it provides byte-level parallelism within a single file. To saturate the network bandwidth of powerful cloud machines, you need to couple it with downloading the files themselves in parallel, which is a one-line change. So I'm aware that hf_transfer provides a speedup, but by itself it does not touch the upper limit of what's possible in terms of throughput.

  • If you are trying to test arbitrary models, downloading the entire HF repo is not satisfactory, as many repos contain multiple versions of the same weights; if the objective is the fastest startup time possible, that is not a viable approach. pipeline() abstracts this by inferring the framework. It would be nice if HF provided a tool that spits out exactly which files to download for a given repo. It may already exist, but I'd have to dig through the code; last time I hunted for this I could not find an easy way to do it (see the sketch after this list for one workaround).
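As a stopgap, a hedged sketch of one way to avoid duplicate weight formats today: restrict snapshot_download with allow_patterns so only the safetensors shards and config files are fetched (the pattern list is an assumption about the repo's layout, not a general solution):

```python
from huggingface_hub import snapshot_download

# only fetch safetensors weights plus JSON config/index and tokenizer
# files, skipping duplicate .bin/.pt copies of the same weights;
# the patterns are illustrative and repo-dependent
snapshot_download(
    "facebook/opt-30b",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
)
```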

So, if there were one thing I'd ask HF for, it would be to test Xet with a larger model on a powerful cloud instance and make sure it's hitting the upper limits of the machine's bandwidth on a cold download; otherwise it will still be some factor slower than it could be.

Thanks,
Aaron

