Compare microbatch forward outputs and gradients #246
base: main
Conversation
stack-info: PR: #246, branch: xmfan/stack/20
Granted the RNG affects the grads, why does the diff show 'none' rather than a different hash?
if rng_seed is not None:
    numerics_logger = NumericsLogger(logs_dir)
with AutoParallel(
    model, input_fn, mesh, dynamic=True, numerics_logger=None
should this be numerics_logger = numerics_logger?
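For reference, a sketch of the change that comment seems to be suggesting, reusing the names from the quoted diff (the `autop` binding is purely illustrative, not the PR's actual variable):

# Sketch: pass the freshly constructed logger through instead of None,
# so AutoParallel actually records numerics when rng_seed is set.
numerics_logger = None
if rng_seed is not None:
    numerics_logger = NumericsLogger(logs_dir)
with AutoParallel(
    model, input_fn, mesh, dynamic=True, numerics_logger=numerics_logger
) as autop:  # illustrative binding
    ...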
return

rank = torch.distributed.get_rank()
if rank == 4:
Can you somehow not hardcode this?
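One possible way to avoid the hardcoded rank, as a rough sketch; it assumes `mesh` is a torch DeviceMesh with a "pp" dimension, and the names are illustrative rather than the PR's actual code:

import torch

# Sketch: compute the global rank that owns the last pipeline stage instead of
# hardcoding rank 4. Assumes `mesh` is a DeviceMesh with a "pp" dimension.
pp_mesh = mesh["pp"]
last_stage_rank = int(pp_mesh.mesh.flatten()[-1])
if torch.distributed.get_rank() == last_stage_rank:
    ...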
action: _Action,
ctx: _PipelineContext,
numerics_logs: Optional[list[str]] = None,
forward_hook: Callable | None = None,
nit: Optional[Callable]
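A sketch of the spelling the nit is asking for, matching the Optional[...] style already used by numerics_logs (the wrapper function name here is made up; only the parameter annotation is the point):

from typing import Callable, Optional

def _example(  # hypothetical wrapper; _Action/_PipelineContext come from the quoted diff
    numerics_logs: Optional[list[str]] = None,
    forward_hook: Optional[Callable] = None,
):
    ...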
if self.rank == 0:
    print(f"Weight hashes written to {path}")

def log_pp_grads(self, orig_mod, stage_mods, num_world_stages, ranks):
What is num_world_stages?
rank = torch.distributed.get_rank()
if rank == 4:
    numerics_logger.log_diff(
        output, rank=4, prefix=f"mb{action.microbatch_index} fwd out"
Yeah, very confusing. Also, do we care about the pp rank or the global rank? Finally, V-style schedules will have the last stage on rank 0, won't they?
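A hedged sketch of the gating that comment points toward: key the logging off "this rank owns the last pipeline stage" instead of a fixed global rank, which would also cover V-style schedules. The `stages` list and `is_last` attribute are assumptions, not the PR's actual API:

# Sketch: log from whichever rank holds the last stage, rather than rank 4.
if any(stage.is_last for stage in stages):  # `stages`/`is_last` assumed available here
    numerics_logger.log_diff(
        output,
        rank=torch.distributed.get_rank(),
        prefix=f"mb{action.microbatch_index} fwd out",
    )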
If we land #250 first, it fixes the grad issue.
There was a bug in gradient accumulation that is fixed by #250.
Stacked PRs:
Currently, the forward matches per microbatch (no batch invariance).
But for the backward, all grads are None.
Intended usage:
Currently, the forward inputs are the same, but the forward is being run with different RNG state between the two setups, so there are some numerical differences.
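To make that last point concrete, a minimal sketch of pinning the RNG state so both setups run the forward with identical randomness (the helper name is made up for illustration):

import torch

def run_forward_with_fixed_rng(model, inputs, rng_seed):
    # Seed all RNGs before the forward so any remaining differences between the
    # two setups come from numerics, not from divergent random state.
    torch.manual_seed(rng_seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(rng_seed)
    return model(*inputs)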