
Commit 6918f6d

sayakpaul and a-r-r-o-w authored

[docs] tip for group offloading + quantization (#11576)

* tip for group offloading + quantization

Co-authored-by: Aryan VS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Aryan <[email protected]>

---------

Co-authored-by: Aryan VS <[email protected]>
Co-authored-by: Aryan <[email protected]>
1 parent 915c537 commit 6918f6d


docs/source/en/optimization/memory.md

Lines changed: 7 additions & 0 deletions
@@ -295,6 +295,13 @@ pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_d

The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading. It is best for `leaf_level` offloading and when CPU memory is bottlenecked. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time.
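For concreteness, here is a minimal sketch of this configuration (the checkpoint and devices are illustrative; the call mirrors the `enable_group_offload` usage shown earlier in this section):

```py
import torch
from diffusers import CogVideoXPipeline

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Stream-based group offloading at the leaf level; low_cpu_mem_usage pins
# tensors on the fly instead of pre-pinning them, trading speed for CPU memory.
pipeline.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    low_cpu_mem_usage=True,
)
```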

<Tip>
The offloading strategies can be combined with [quantization](../quantization/overview.md) for further memory savings. For image generation, combining [quantization and model offloading](#model-offloading) often gives the best trade-off between quality, speed, and memory. For video generation, however, the models are more compute-bound, so [group offloading](#group-offloading) tends to work better. Group offloading provides considerable benefits when weight transfers can be overlapped with computation, which requires streams to be enabled. When group offloading is combined with quantization on image generation models at typical resolutions (1024x1024, for example), the compute kernels often finish before the weight transfers complete, so the transfers cannot be *fully* overlapped and the workload becomes communication-bound between the CPU and GPU (due to device synchronizations).
</Tip>
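As an illustration of the image-generation path in the tip, here is a minimal sketch that pairs 4-bit quantization with model offloading (the checkpoint, quantization settings, and dtypes are assumptions for illustration, not part of the commit):

```py
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Quantize the memory-heavy transformer to 4-bit (requires the bitsandbytes backend).
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Model offloading moves each component to the GPU only while it is needed.
pipeline.enable_model_cpu_offload()
```

For a compute-bound video model, the same quantized components could instead be paired with group offloading and streams, as the tip suggests.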
## Layerwise casting

Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
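A minimal sketch of layerwise casting under those assumptions (the checkpoint is illustrative):

```py
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Store transformer weights in float8 and upcast to bfloat16 for computation;
# normalization/modulation layers are skipped to preserve generation quality.
pipeline.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
```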
