
Conversation

@Qubitium
Contributor

@Qubitium Qubitium commented Nov 7, 2025

The num_workers (threads) spawned by hub appears to be excessive on large-core-count systems, e.g. 96 cores. Spawning 94 threads doesn't help upload speed or hashing; it only hurts, and it assumes no other tasks are running on the system.

There is also a strange default minimum of 2 threads on a micro-VM with only 1 core. It's not clear why 2 is used for a system with a single CPU core.

This PR changes the default num_workers to cpu_cores // 2, with a minimum of 1.
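The proposed rule can be sketched as follows. This is an illustrative sketch of the default described above, not the actual huggingface_hub code, and `default_num_workers` is a hypothetical helper name:

```python
def default_num_workers(cores: int) -> int:
    """Proposed default: half the available cores, but never fewer than 1."""
    return max(cores // 2, 1)

# On a 96-core machine this yields 48 workers; on a 1-core
# micro-VM it yields 1 instead of the previous minimum of 2.
```

In practice `cores` would come from something like `os.cpu_count() or 1`, and an explicit `--num-workers` would still override the default.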

@Wauplin
Contributor

Wauplin commented Nov 7, 2025

Hey @Qubitium, thanks for raising the question. So far we've never seen any complaints regarding (almost) maxing out the CPU cores. We consider that someone using hf upload-large-folder wants to upload their data as fast as possible, hence trying to use as many cores as possible when that's possible. I admit it's not perfect, but a rule that works correctly for everyone is close to impossible. That's why the recommended approach is to use the --num-workers parameter, since only the user knows what their data looks like, what machine and bandwidth they have, etc.

I am therefore not convinced we should change the current rule. Happy to reconsider if we get more reports about that.

@Qubitium
Contributor Author

Qubitium commented Nov 7, 2025

Hey @Qubitium, thanks for raising the question. So far we've never seen any complaints regarding (almost) maxing out the CPU cores. We consider that someone using hf upload-large-folder wants to upload their data as fast as possible, hence trying to use as many cores as possible when that's possible. I admit it's not perfect, but a rule that works correctly for everyone is close to impossible. That's why the recommended approach is to use the --num-workers parameter, since only the user knows what their data looks like, what machine and bandwidth they have, etc.

I am therefore not convinced we should change the current rule. Happy to reconsider if we get more reports about that.

A single thread with a large enough file can saturate a 10Gb/s connection on its own, which shows that more threads != more upload speed. Threads only help if you have lots of very small files.

The code I am changing is the default case, not the per-user optimal case, which users can already tune via num_workers.

From my point of view, threads are not free nor are they useful in large numbers. You only want to launch as many as you need and no more. It is actually better to under-thread than over-thread.

So far we've never seen any complaints regarding (almost) maxxing the CPU cores.

I am that person.

@Wauplin
Contributor

Wauplin commented Nov 7, 2025

Threads only help if you have lots of very small files.

This is actually very true, ... and happening :) We do have users uploading thousands of files to a dataset.

A single thread with large enough file can sufficiently saturate a 10Gb/s connection

The threads are doing 2 things: hashing files and uploading them. Hashing requires disk I/O + compute time, while upload requires disk I/O and "internet" I/O. So even if you have one thread saturating a 10Gb/s connection, you would still want to compute the hash of the next file in the background. I know the GIL limits this a lot, but without knowing what the limiting factor is between disk I/O, outbound I/O, and compute, you can't make many assumptions.
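The overlap being described — hashing the next file while another is uploading — can be illustrated with a minimal sketch. This is toy code, not hub's actual implementation; `fake_upload` is a placeholder for the real network step:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def hash_file(path: str) -> str:
    # Hashing is disk I/O + compute. hashlib releases the GIL while
    # hashing large buffers, so it can overlap with an in-flight upload.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def fake_upload(path: str, digest: str) -> str:
    # Stand-in for network I/O; a real worker would PUT the bytes.
    return f"{os.path.basename(path)}:{digest[:8]}"

def process(paths: list[str], num_workers: int = 2) -> list[str]:
    # With more than one worker, one thread can hash the next file
    # while another is blocked on (simulated) upload I/O.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        digests = list(pool.map(hash_file, paths))
        return list(pool.map(fake_upload, paths, digests))
```

The sketch only shows why a second thread can be useful even when one thread saturates the link; it says nothing about whether 94 threads are.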

From my point of view, threads are not free nor are they useful in large numbers. You only want to launch as many as you need and no more. It is actually better to under-thread than over-thread.

Kinda agree with that, though not convinced we should change to half the cores by default.

I am that person.

A few more upvotes and happy to reconsider

@Qubitium
Contributor Author

Qubitium commented Nov 7, 2025

Threads only help if you have lots of very small files.

This is actually very true, ... and happening :) We do have users uploading thousands of files to a dataset.

So the default is optimized for this case? I don't consider it the default case. This PR is titled "better default for num_workers" for a reason. What about models?

A single thread with large enough file can sufficiently saturate a 10Gb/s connection

The threads are doing 2 things: hashing files and uploading them. Hashing requires disk I/O + compute time while upload requires disk I/O and "internet" I/O. So even if you have a thread saturating a 10GB/s connection, you would still want to compute the hash of the next file in the background. I know the GIL is limiting this a lot but without knowing what is the limiting factor between disk I/O, outbound I/O and compute, you can't make much assumptions.

Ah. Now you are finally talking about merit instead of blasting about upload speed. Having 100 threads reading from disk does not speed up disk I/O either. It does speed up hashing once the data is loaded, but check any NVMe or Linux disk I/O stack: no I/O path is optimized for 100 random I/O threads in a normal/default setup, unless you have tons of NVMe drives or clustered NFS/IO. Again, we are talking about the default. I am not trying to optimize anything. Your default is.

It is likely that the imaginary speedup your users are seeing is, in reality, a slowdown.

Benchmark it; you will see that launching so many threads will actually hurt your disk I/O.

If threads were so good, your database would spawn threads == CPU cores, but databases don't, for a reason — and databases are exactly the thousands-of-small-I/Os workload you are talking about.

So lots of threads:

  1. do not help disk I/O: they can only hurt
  2. do not help network I/O: they can only hurt
  3. may help hashing, if the system is not doing something else. Hashing is CPU-only and scales roughly linearly, so this part is perfect for many threads, except Python has the GIL. You may need to benchmark this too; it may not be that much faster with so many threads.
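Whether extra threads actually speed up hashing is easy to measure. Below is a minimal benchmark sketch; `bench_hash` is a hypothetical helper, it hashes in-memory blobs so it deliberately excludes the disk I/O being argued about, and note that `hashlib` releases the GIL while hashing large buffers, so some scaling is possible despite the GIL:

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

def bench_hash(blobs: list[bytes], num_workers: int) -> float:
    """Return the seconds taken to sha256 every blob with num_workers threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # hashlib drops the GIL for large buffers, so the threads can
        # hash concurrently on real cores; the speedup (if any) is what
        # this measures.
        list(pool.map(lambda b: hashlib.sha256(b).hexdigest(), blobs))
    return time.perf_counter() - start
```

Comparing `bench_hash(blobs, 1)` against `bench_hash(blobs, os.cpu_count())` on representative data answers the question empirically for a given machine.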

From my point of view, threads are not free nor are they useful in large numbers. You only want to launch as many as you need and no more. It is actually better to under-thread than over-thread.

Kinda agree with that, though not convinced we should change to half the cores by default.

I am that person.

A few more upvotes and happy to reconsider

By the way, why is the minimum value 2 threads? That's the other point in this PR. What if I have a VM with 1 core?

This PR optimizes nothing. It moves the thread count to a better mean value. The current default is the "optimized" version, based on assumptions.

Contributor

@Wauplin Wauplin left a comment


I'm not a fan of the tone of the conversation, but we've discussed it a bit internally and we're fine with this change. Thanks for suggesting and arguing for it.

@Wauplin Wauplin merged commit 2a9efce into huggingface:main Nov 12, 2025
17 of 18 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Qubitium Qubitium deleted the default-workers branch November 12, 2025 15:29