
Conversation

@d4l3k (Member) commented Feb 11, 2025

This makes HTTP recovery much much faster with a few key changes:

  1. Use an RWLock so multiple readers can fetch the state_dict at the same time.
  2. Use _streaming_save/_streaming_load when available, which is ~2x faster on average and avoids the 2x memory overhead (requires a PyTorch nightly).
  3. Support parallel transfers by using pytree to divvy the leaf values up into chunks that can be sent via parallel HTTP requests (see the sketch after this description).

This doesn't change the default behavior of Manager, since using parallel chunks with torch.save/load can actually increase time significantly: PyTorchStreamReader holds the GIL during deserialization.

The optimal config is with both streaming and chunking enabled.
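For illustration, here is a minimal sketch of the pytree-based chunking idea under a simple round-robin split; the helper names (split_leaves, join_leaves) are hypothetical, and the actual implementation lives in torchft/checkpointing/http_transport.py. Each chunk can then be serialized independently and fetched over its own HTTP request.

```python
# Hypothetical sketch (not the torchft implementation): split a state_dict's
# pytree leaves into num_chunks round-robin groups for parallel transfer,
# then reassemble them into the original structure on the receiving side.
from typing import Any, Dict, List, Tuple

import torch
from torch.utils._pytree import tree_flatten, tree_unflatten


def split_leaves(state_dict: Dict[str, Any], num_chunks: int) -> Tuple[List[List[Any]], Any]:
    """Flatten the state_dict and deal its leaves round-robin into num_chunks groups."""
    leaves, spec = tree_flatten(state_dict)
    chunks: List[List[Any]] = [[] for _ in range(num_chunks)]
    for i, leaf in enumerate(leaves):
        chunks[i % num_chunks].append(leaf)
    return chunks, spec


def join_leaves(chunks: List[List[Any]], spec: Any) -> Dict[str, Any]:
    """Interleave the chunks back into leaf order and rebuild the original structure."""
    num_chunks = len(chunks)
    leaves: List[Any] = [None] * sum(len(c) for c in chunks)
    for c, chunk in enumerate(chunks):
        for j, leaf in enumerate(chunk):
            leaves[j * num_chunks + c] = leaf
    return tree_unflatten(leaves, spec)


if __name__ == "__main__":
    sd = {"layer.weight": torch.randn(4, 4), "step": 3}
    chunks, spec = split_leaves(sd, num_chunks=2)
    restored = join_leaves(chunks, spec)
    assert torch.equal(restored["layer.weight"], sd["layer.weight"])
```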

Test plan:

pytest

Testing with 12GB total and 1MB tensors

pytorch nightly

$ python torchft/checkpointing/http_transport.py --num-chunks 0
INFO:__main__:fetching checkpoint took 6.626614563167095s
$ python torchft/checkpointing/http_transport.py --num-chunks 10
INFO:__main__:fetching checkpoint took 3.0460726767778397s
$ python torchft/checkpointing/http_transport.py --num-chunks 0 --device cuda
INFO:__main__:fetching checkpoint took 6.147395346313715s
$ python torchft/checkpointing/http_transport.py --num-chunks 10 --device cuda
INFO:__main__:fetching checkpoint took 2.9234009198844433s

pytorch 2.6.0

$ python torchft/checkpointing/http_transport.py --num-chunks 0
INFO:__main__:fetching checkpoint took 17.019980508834124s
$ python torchft/checkpointing/http_transport.py --num-chunks 10
INFO:__main__:fetching checkpoint took 40.383272521197796s

@d4l3k requested review from H-Huang and fegin on February 11, 2025 02:19
@facebook-github-bot added the CLA Signed label on Feb 11, 2025
@H-Huang (Contributor) left a comment


Looks good, thanks for the change!

python torchft/checkpointing/http_transport.py --num-chunks 0 --device cuda
INFO:__main__:fetching checkpoint took 12.673462141305208s

How come --num-chunks 0 with cuda is slower than CPU?

return output_list


def bench_main() -> None:
Contributor

nit: maybe we could move the benchmarking code to its own folder

@d4l3k (Member Author)

Moved to a _bench.py file and added a small test to make sure benchmark doesn't regress

return tree_unflatten(values, spec)


def _to_cpu(values: List[T], pin_memory: bool) -> List[T]:
Contributor

nit: could you use tree_map here?

@d4l3k (Member Author)

tree_map does the same flatten+unflatten internally, so I think I'll just keep it like this to avoid a duplicate mapping pass.
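A small illustration of the point (hypothetical, not torchft code): tree_map is just tree_flatten + map + tree_unflatten, so when the caller already holds the flat leaf list and spec, mapping over the leaves directly skips a second flatten/unflatten pass.

```python
# Hypothetical example: tree_map flattens and unflattens internally, so the two
# forms below are equivalent; the second avoids re-flattening when leaves/spec
# are already available (as they are around _to_cpu).
import torch
from torch.utils._pytree import tree_flatten, tree_map, tree_unflatten

state = {"w": torch.ones(2), "b": [torch.zeros(3), 1.0]}

doubled_via_tree_map = tree_map(lambda v: v * 2, state)

leaves, spec = tree_flatten(state)
doubled_via_leaves = tree_unflatten([v * 2 for v in leaves], spec)

assert torch.equal(doubled_via_tree_map["w"], doubled_via_leaves["w"])
```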

@d4l3k
Copy link
Member Author

d4l3k commented Feb 11, 2025

Fixed slowness w/ CUDA due to duplicate transfers to CPU
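A hypothetical sketch of the class of fix being described (the actual change is in the merged commit): move each CUDA tensor to the CPU at most once and reuse that staged copy, instead of issuing another device-to-host transfer at serialization time.

```python
# Hypothetical illustration (not the actual patch): each CUDA tensor is copied
# to the CPU exactly once; tensors already on the CPU pass through untouched.
from typing import Any, List

import torch


def stage_on_cpu_once(leaves: List[Any]) -> List[Any]:
    staged = []
    for v in leaves:
        if isinstance(v, torch.Tensor) and v.device.type != "cpu":
            v = v.to("cpu", non_blocking=True)  # the single transfer for this tensor
        staged.append(v)
    return staged
```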

@d4l3k merged commit f44aaa5 into main on Feb 11, 2025
6 checks passed
@d4l3k deleted the d4l3k/fast_http branch on February 11, 2025 21:23