All the Transformer Math You Need to Know | How To Scale Your Model #6
14 comments · 22 replies
- A scant few minutes ago, the MathJax for this chapter's introduction was not rendering, but it appears it was fixed live 😅
- Two remarks:
- In the transformer decoder architecture figure, the term K is overloaded (K = XW_k and also K = the number of KV heads).
- In the transformer decoder architecture plot, G is query heads per KV head. I wonder if it should instead be reversed, i.e. KV heads per query head? Given grouped-query attention, we are grouping the queries, so shouldn't query heads always be smaller than KV heads?
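For reference, here is the shape convention I'm reading off the figure, as a tiny sketch with hypothetical sizes (in this convention each of the K KV heads is shared by a group of G query heads):

```python
import jax.numpy as jnp

B, T, H = 2, 16, 64            # hypothetical batch size, sequence length, head dim
n_q_heads, n_kv_heads = 8, 2   # hypothetical head counts
G = n_q_heads // n_kv_heads    # the figure's G: query heads per KV head (here 4)

# One query slot per (KV head, group member); one K/V slot per KV head.
q = jnp.zeros((B, T, n_kv_heads, G, H))  # [B, T, K, G, H]
k = jnp.zeros((B, T, n_kv_heads, H))     # [B, T, K, H]
v = jnp.zeros((B, T, n_kv_heads, H))     # [B, T, K, H]
```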
- I might be incorrect here, but I was under the impression that Flash Attention increases arithmetic intensity by reducing the number of memory accesses?
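A minimal sketch of the intensity argument, assuming the usual definition measured against HBM traffic:

$$\text{Arithmetic intensity} \;=\; \frac{\text{total FLOPs}}{\text{total bytes moved to/from HBM}}$$

The FLOP count stays roughly the same, but not materializing the $T \times S$ score matrix in HBM cuts the bytes moved, which is how the intensity goes up.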
- Typo in the appendix?
- Great resource, thank you. The second reference on the Transformer Math page looks like it should have the author "Shazeer, N." rather than "Noam, S.".
- Thanks for sharing! A great resource for learning about LLMs.
- In the section "What Should You Take Away from this Section?", the training FLOPs per layer for the vocab is shown as 12BTDV. I think it should be 6BTDV, and it should be a total rather than per layer.
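For what it's worth, a back-of-the-envelope count under the usual convention (2 FLOPs per multiply-add, backward pass ≈ 2× the forward), assuming the unembedding is a single matmul of the $[B, T, D]$ activations with the $[D, V]$ vocabulary matrix:

$$\underbrace{2BTDV}_{\text{forward}} \;+\; \underbrace{4BTDV}_{\text{backward}} \;=\; 6BTDV \quad \text{(a total, since the vocab matrix is not per-layer).}$$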
- In the "Global FLOPs and Params Calculation" section for "Attention", the QK matmul training FLOPs are shown as 6BTSKGH. Isn't this operation parameter-free, a matmul of [B, T, K, G, H] with [B, S, K, H], which should result in 2BTSKGH?
- Typo: the definition of the running denominator L in the Flash Attention section seems to be missing an exp.
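For reference, the standard online-softmax recurrence that the definition presumably intends, with running max $m$ and running denominator $L$ accumulated over blocks of scores $s_j$:

$$m_{\text{new}} = \max\!\Big(m_{\text{old}},\; \max_j s_j\Big), \qquad L_{\text{new}} = e^{\,m_{\text{old}} - m_{\text{new}}}\, L_{\text{old}} \;+\; \sum_j e^{\,s_j - m_{\text{new}}}$$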
- Do we need to mention that the 12BTSNH number for theoretical attention FLOPs should be divided by 2 for causal attention? Theoretically, you can choose not to compute the upper triangle for both QK^T and (QK^T)V by carefully tiling your attention computation (and IIUC, FA actually exploits this when you pass is_causal=True, which can theoretically lead to >1 MFU if it is optimized well enough for a given hardware). I've seen some debates around this, especially when reporting MFUs, and I believe that people generally use the factor of 12 whether or not attention is causal, to be "consistent" with their peers' numbers.
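Roughly, the halving argument (assuming $T = S$ and ignoring the diagonal): a causal mask only needs the lower triangle of the $T \times S$ score matrix, about half of its entries, for both the $QK^\top$ and $(QK^\top)V$ matmuls, so

$$\text{causal attention FLOPs} \;\approx\; \tfrac{1}{2} \cdot 12\,BTSNH \;=\; 6\,BTSNH.$$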
- I'm trying to do an exercise accounting for total inference FLOPs for a given ISL / OSL. In particular, I am trying to account for the extra FLOPs that come from the dot-product attention operations. My intuition is to split up the prefill and decode steps and then add them together.

  For prefill, my understanding is that T = S, and that we can ignore the batch dimension B since we are just talking about requests of the form ISL / OSL (is this reasonable?). Therefore, total prefill FLOPs = 4 · ISL · ISL · N · H = 4 · ISL² · N · H.

  For decode, my understanding is that T = 1 and we have OSL decode steps, with S ∈ {ISL + 1, ISL + 2, ..., ISL + OSL}. To simplify, we can derive the summation:

  Decode FLOPs = 4 · N · H · Σ_{i=1..OSL} (ISL + i)
  Total FLOPs = Prefill FLOPs + Decode FLOPs

  Not sure if I am dead wrong, overcomplicating, or both. Would appreciate any help!
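A small sketch that simply mechanizes the sums in the previous comment, counting only the dot-product attention FLOPs (QK^T and (QK^T)V, 2 FLOPs per multiply-add) for a single request, with hypothetical names isl, osl, n_heads, head_dim; it does not settle whether this is the right accounting:

```python
def attention_flops(isl: int, osl: int, n_heads: int, head_dim: int) -> int:
    """Dot-product-attention FLOPs for one request, prefill + decode.

    Counts only QK^T and (QK^T)V: each costs 2*T*S*N*H FLOPs, so 4*T*S*N*H per step.
    """
    # Prefill: one pass with T = S = isl.
    prefill = 4 * isl * isl * n_heads * head_dim

    # Decode: one query token per step (T = 1), attending to isl + i tokens at step i.
    decode = sum(4 * 1 * (isl + i) * n_heads * head_dim for i in range(1, osl + 1))

    return prefill + decode


# Example: 1k-token prompt, 256 generated tokens, 32 heads of size 128.
print(attention_flops(isl=1024, osl=256, n_heads=32, head_dim=128))
```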
- Why is the KV an array of shape
- Discussing the Transformer architecture!