
Commit 6918f6d

sayakpaul and a-r-r-o-w authored

[docs] tip for group offloading + quantization (#11576)

* tip for group offloading + quantization

Co-authored-by: Aryan VS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Aryan <[email protected]>

---------

Co-authored-by: Aryan VS <[email protected]>
Co-authored-by: Aryan <[email protected]>
1 parent 915c537 commit 6918f6d


docs/source/en/optimization/memory.md

Lines changed: 7 additions & 0 deletions
@@ -295,6 +295,13 @@ pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_d

The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading. It is best for `leaf_level` offloading and when CPU memory is bottlenecked. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time.
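For concreteness, here is a minimal sketch of this configuration (the checkpoint and devices are illustrative; the call mirrors the `enable_group_offload` usage shown earlier in this section):

```py
import torch
from diffusers import CogVideoXPipeline

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Stream-based group offloading at the leaf level; low_cpu_mem_usage pins
# tensors on the fly instead of pre-pinning them, trading speed for CPU memory.
pipeline.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    low_cpu_mem_usage=True,
)
```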

<Tip>
The offloading strategies can be combined with [quantization](../quantization/overview.md) for further memory savings. For image generation, combining [quantization and model offloading](#model-offloading) often gives the best trade-off between quality, speed, and memory. For video generation, however, the models are more compute-bound, so [group offloading](#group-offloading) tends to work better. Group offloading provides considerable benefits when weight transfers can be overlapped with computation, which requires streams to be enabled. When group offloading is combined with quantization on image generation models at typical resolutions (1024x1024, for example), the compute kernels often finish before the weight transfers complete, so the transfers cannot be *fully* overlapped and the workload becomes communication-bound between the CPU and GPU (due to device synchronizations).
</Tip>
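As an illustration of the image-generation path in the tip, here is a minimal sketch that pairs 4-bit quantization with model offloading (the checkpoint, quantization settings, and dtypes are assumptions for illustration, not part of the commit):

```py
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Quantize the memory-heavy transformer to 4-bit (requires the bitsandbytes backend).
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Model offloading moves each component to the GPU only while it is needed.
pipeline.enable_model_cpu_offload()
```

For a compute-bound video model, the same quantized components could instead be paired with group offloading and streams, as the tip suggests.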
## Layerwise casting

Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
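A minimal sketch of layerwise casting under those assumptions (the checkpoint is illustrative):

```py
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Store transformer weights in float8 and upcast to bfloat16 for computation;
# normalization/modulation layers are skipped to preserve generation quality.
pipeline.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
```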
