Improve Model Download Speeds By ~3x For Large Models #36870
Conversation
…NABLE_HF_TRANSFER is enabled. Create documentation for it.
In principle not against this! Mostly want to hear @Wauplin's opinion on the PR as it's hub stuff!
# Environment Variables
We should add a warning, as this should not really be run on login nodes or nodes that share bandwidth, at the risk of having a lot of anger from your colleagues! 🤣
- Debugging issues involving hf_transfer can be extremely difficult, making maintenance and issue resolution frustrating.
- User experience may be slightly degraded, with a more fragmented progress bar, corner cases when using Ctrl+C, and the overhead of launching a subprocess.
- hf_transfer only provides a speed boost if the machine has sufficient bandwidth; otherwise, performance remains the same or may even regress.
- Spawning a subprocess on each CPU core can heavily impact system performance, potentially freezing or significantly slowing down laptops.
- Some features are not supported, including resumable downloads, proxies, and others.
True!
The intention behind having this available is primarily for anyone working with HF on the cloud.
One other note: we're using threads to avoid the painful overhead of Python multiprocessing 😄 (it takes an additional 4-6s to instantiate processes vs. threads), but a laptop being totally overwhelmed by the processing load will indeed happen for smaller machines!
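To make the thread-based design concrete, here's a minimal sketch of the idea; it is not the actual implementation in this PR, and `download_one_shard` is a hypothetical placeholder for whatever fetches a single shard file:

```python
from concurrent.futures import ThreadPoolExecutor

def download_shards_in_parallel(shard_filenames, download_one_shard, max_workers=8):
    # Downloads are I/O-bound, so threads work well and avoid the multi-second
    # startup cost of spawning one Python process per worker.
    workers = max(1, min(max_workers, len(shard_filenames)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Wrap in list() so we block until every shard is fetched and any error surfaces.
        list(pool.map(download_one_shard, shard_filenames))
```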
Determines how many threads should be used when the parallel downloading of model weight shards is enabled.

Default is `8`, although fewer workers may run if the number of shard files is less than the number of workers, i.e. it takes the min() of `HF_PARALLEL_DOWNLOADING_WORKERS` and the number of shard files to download.
That can be automated!
This is absolutely true! With that said, I feel like if this PR were to be merged, that should be an iteration 2 enhancement.

The argument being that this is something people are likely going to need to tune anyway to squeeze every last drop out of it, although there's probably a simple, sane heuristic we could use 🤔 (something along the lines of the sketch below).
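Purely as an illustration of what such a heuristic could look like (not part of this PR), one option is to clamp the default by both the shard count and the CPU count, so small machines aren't overwhelmed:

```python
import os

def suggest_parallel_download_workers(num_shards: int, default: int = 8) -> int:
    # Hypothetical heuristic: never use more workers than shard files or CPUs.
    cpu_count = os.cpu_count() or 1
    return max(1, min(default, num_shards, cpu_count))
```

Bandwidth is the real limiting factor though, so an env var override would probably still be wanted.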
Hey there! Thanks for the ping :) It's very interesting to see that even
These 2 topics make me think that it'd be preferable not to move forward right now on this PR. Sorry if it means slower performance short-term, but it is for the greater good (and better maintainability) long-term. If download speed is critical for your use case, I would suggest downloading the weights first and only then running your
Thanks a lot for the great explanation!
@ArthurZucker I think @Wauplin is spot on, this change should live in `huggingface_hub`.

@Wauplin Everything you said resonates completely with me; all downloading functionality should ideally be located in a single place. Any idea on how long until we'll see a release for chunk-based downloading with xet? I ask because if it's two months or more away, I'd really like to have this functionality owned by Hugging Face instead of having my team manage it as a fork with each release.

Would this change be sufficient to get this fast-tracked for a release? This is in `huggingface_hub`. The proposal is this, change this guy:

```python
if constants.HF_HUB_ENABLE_HF_TRANSFER:
    # when using hf_transfer we don't want extra parallelism
    # from the one hf_transfer provides
    for file in filtered_repo_files:
        _inner_hf_hub_download(file)
else:
    thread_map(
        _inner_hf_hub_download,
        filtered_repo_files,
        desc=f"Fetching {len(filtered_repo_files)} files",
        max_workers=max_workers,
        # User can use its own tqdm class or the default one from `huggingface_hub.utils`
        tqdm_class=tqdm_class or hf_tqdm,
    )
```

to this:

```python
# Second condition allows it to skip serial file downloads with HF_HUB_ENABLE_HF_TRANSFER
# and instead also use the thread pool
if constants.HF_HUB_ENABLE_HF_TRANSFER and not constants.HF_ENABLE_PARALLEL_DOWNLOADING:
    for file in filtered_repo_files:
        _inner_hf_hub_download(file)
else:
    thread_map(
        _inner_hf_hub_download,
        filtered_repo_files,
        desc=f"Fetching {len(filtered_repo_files)} files",
        max_workers=max_workers,
        # User can use its own tqdm class or the default one from `huggingface_hub.utils`
        tqdm_class=tqdm_class or hf_tqdm,
    )
```

Also, thank you for pointing this out to me, because if my team does have to maintain this separately, this is a very minor change!

Best,
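For completeness, here's a rough sketch of how the `constants.HF_ENABLE_PARALLEL_DOWNLOADING` flag referenced in the diff above might be defined; the helper and its parsing are assumptions for illustration, not the actual `huggingface_hub` API:

```python
import os

def _env_flag(name: str, default: str = "false") -> bool:
    # Treat the usual truthy strings as enabling the feature.
    return os.environ.get(name, default).lower() in ("1", "true", "yes", "on")

# Hypothetical constant, mirroring how HF_HUB_ENABLE_HF_TRANSFER is a boolean
# derived from an environment variable at import time.
HF_ENABLE_PARALLEL_DOWNLOADING = _env_flag("HF_ENABLE_PARALLEL_DOWNLOADING")
```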
What does this PR do?
Enables the parallel downloading of models with sharded weight files. This can decrease the time to load large models significantly, oftentimes producing speedups of greater than 50%.
While downloading is already parallelized at the file level when `HF_HUB_ENABLE_HF_TRANSFER` is enabled, `HF_ENABLE_PARALLEL_DOWNLOADING` parallelizes the number of files that can be downloaded concurrently, which can greatly speed up downloads if the machine you're using can handle it in terms of network and IO bandwidth.

e.g. with only `HF_HUB_ENABLE_HF_TRANSFER` you can hit a peak network throughput of ~1.5 GB/s, which is roughly in line with S3's limits for a single file that's around 5 GB in size. With this change you can hit a peak network throughput of ~6.5 GB/s.

e.g. here's a comparison for `facebook/opt-30b` on an AWS EC2 `g4dn.metal`:

- `HF_HUB_ENABLE_HF_TRANSFER` enabled, `HF_ENABLE_PARALLEL_DOWNLOADING` disabled
- `HF_HUB_ENABLE_HF_TRANSFER` enabled, `HF_ENABLE_PARALLEL_DOWNLOADING` enabled

That's roughly a 73% speed up!
To fully saturate a machine capable of massive network bandwidth, set `HF_ENABLE_PARALLEL_DOWNLOADING="true"` and `HF_HUB_ENABLE_HF_TRANSFER="1"`.
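As a usage sketch (assuming the env vars land as described in this PR), the flags combine like this; they must be set before `transformers` is imported so they're picked up when the download starts:

```python
import os

# Chunk-parallel downloads within each file (requires `pip install hf_transfer`).
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
# Download several shard files concurrently (the flag added by this PR).
os.environ["HF_ENABLE_PARALLEL_DOWNLOADING"] = "true"
# Optional: cap the number of shard files downloaded at once.
os.environ["HF_PARALLEL_DOWNLOADING_WORKERS"] = "8"

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b")
```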
I've included an additional env var, `HF_PARALLEL_DOWNLOADING_WORKERS`. It determines how many threads should be used when the parallel downloading of model weight shards is enabled.

Default is `8`, although fewer workers may run if the number of shard files is less than the number of workers, i.e. it takes the min() of `HF_PARALLEL_DOWNLOADING_WORKERS` and the number of shard files to download. e.g. if there are 2 shard files but 8 workers are specified, only two workers will spawn.
Like the PR that optimizes sharded model loading, this should also collectively save thousands of dollars monthly, if not more, globally for anyone using HF on the cloud.
@ArthurZucker
@Rocketknight1