All the Transformer Math You Need to Know | How To Scale Your Model #6
14 comments · 22 replies
- A scant few minutes ago, the MathJax for this chapter's introduction was not rendering, but it appears it was fixed live 😅
- Two remarks:
- In the transformer decoder architecture figure, the term K is overloaded (K = XW_k and also K = the number of KV heads).
- In the transformer decoder architecture plot, G is query heads per KV head. I wonder if it should instead be reversed, i.e. KV heads per query head? Given grouped-query attention, we are grouping the queries, so shouldn't query heads always be smaller than KV heads?
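For reference, here is the shape convention I'm reading off the figure, as a tiny sketch with hypothetical sizes (in this convention each of the K KV heads is shared by a group of G query heads):

```python
import jax.numpy as jnp

B, T, H = 2, 16, 64            # hypothetical batch size, sequence length, head dim
n_q_heads, n_kv_heads = 8, 2   # hypothetical head counts
G = n_q_heads // n_kv_heads    # the figure's G: query heads per KV head (here 4)

# One query slot per (KV head, group member); one K/V slot per KV head.
q = jnp.zeros((B, T, n_kv_heads, G, H))  # [B, T, K, G, H]
k = jnp.zeros((B, T, n_kv_heads, H))     # [B, T, K, H]
v = jnp.zeros((B, T, n_kv_heads, H))     # [B, T, K, H]
```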
- I might be incorrect here, but I was under the impression that Flash Attention increases arithmetic intensity by reducing the number of memory accesses?
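A minimal sketch of the intensity argument, assuming the usual definition measured against HBM traffic:

$$\text{Arithmetic intensity} \;=\; \frac{\text{total FLOPs}}{\text{total bytes moved to/from HBM}}$$

The FLOP count stays roughly the same, but not materializing the $T \times S$ score matrix in HBM cuts the bytes moved, which is how the intensity goes up.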
- Typo in the appendix?
- Great resource, thank you. The second reference on the Transformer Math page looks like it should have the author "Shazeer, N." rather than "Noam, S.".
- Thanks for sharing! A great resource for learning about LLMs.
- In the section "What Should You Take Away from this Section?", the training FLOPs per layer for the vocab is shown as 12BTDV. I think it should be 6BTDV, and it should be a total rather than per layer.
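For what it's worth, a back-of-the-envelope count under the usual convention (2 FLOPs per multiply-add, backward pass ≈ 2× the forward), assuming the unembedding is a single matmul of the $[B, T, D]$ activations with the $[D, V]$ vocabulary matrix:

$$\underbrace{2BTDV}_{\text{forward}} \;+\; \underbrace{4BTDV}_{\text{backward}} \;=\; 6BTDV \quad \text{(a total, since the vocab matrix is not per-layer).}$$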
- In the "Global FLOPs and Params Calculation" section for "Attention", the QK matmul training FLOPs are shown as 6BTSKGH. Isn't this operation parameter-free, a matmul of [B, T, K, G, H] with [B, S, K, H], which should result in 2BTSKGH?
- Typo: the definition of the running denominator L in the Flash Attention section seems to be missing an exp.
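For reference, the standard online-softmax recurrence that the definition presumably intends, with running max $m$ and running denominator $L$ accumulated over blocks of scores $s_j$:

$$m_{\text{new}} = \max\!\Big(m_{\text{old}},\; \max_j s_j\Big), \qquad L_{\text{new}} = e^{\,m_{\text{old}} - m_{\text{new}}}\, L_{\text{old}} \;+\; \sum_j e^{\,s_j - m_{\text{new}}}$$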
- Do we need to mention that the 12BTSNH number for theoretical attention FLOPs should be divided by 2 for causal attention? Theoretically, you can choose not to compute the upper triangle for both QK^T and (QK^T)V by carefully tiling your attention computation (and IIUC, FA actually exploits this when you pass is_causal=True, which can theoretically lead to >1 MFU if it is optimized well enough for a given hardware). I've seen some debates around this, especially when reporting MFUs, and I believe that people generally use the factor of 12 whether or not attention is causal, to be "consistent" with their peers' numbers.
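Roughly, the halving argument (assuming $T = S$ and ignoring the diagonal): a causal mask only needs the lower triangle of the $T \times S$ score matrix, about half of its entries, for both the $QK^\top$ and $(QK^\top)V$ matmuls, so

$$\text{causal attention FLOPs} \;\approx\; \tfrac{1}{2} \cdot 12\,BTSNH \;=\; 6\,BTSNH.$$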
- I'm trying to do an exercise accounting for total inference FLOPs for a given ISL / OSL. In particular, I am trying to account for the extra FLOPs that come from the dot-product attention operations. My intuition is to split up the prefill and decode steps and then add them together.

  For prefill, my understanding is that T = S, and that we can ignore the batch dimension B since we are just talking about requests of the form ISL / OSL (is this reasonable?). Therefore, total prefill FLOPs = 4 · ISL · ISL · N · H = 4 · ISL² · N · H.

  For decode, my understanding is that T = 1 and we have OSL decode steps, with S ∈ {ISL + 1, ISL + 2, ..., ISL + OSL}. To simplify, we can derive the summation:

  Decode FLOPs = 4 · N · H · Σ_{i=1..OSL} (ISL + i)
  Total FLOPs = Prefill FLOPs + Decode FLOPs

  Not sure if I am dead wrong, overcomplicating, or both. Would appreciate any help!
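A small sketch that simply mechanizes the sums in the previous comment, counting only the dot-product attention FLOPs (QK^T and (QK^T)V, 2 FLOPs per multiply-add) for a single request, with hypothetical names isl, osl, n_heads, head_dim; it does not settle whether this is the right accounting:

```python
def attention_flops(isl: int, osl: int, n_heads: int, head_dim: int) -> int:
    """Dot-product-attention FLOPs for one request, prefill + decode.

    Counts only QK^T and (QK^T)V: each costs 2*T*S*N*H FLOPs, so 4*T*S*N*H per step.
    """
    # Prefill: one pass with T = S = isl.
    prefill = 4 * isl * isl * n_heads * head_dim

    # Decode: one query token per step (T = 1), attending to isl + i tokens at step i.
    decode = sum(4 * 1 * (isl + i) * n_heads * head_dim for i in range(1, osl + 1))

    return prefill + decode


# Example: 1k-token prompt, 256 generated tokens, 32 heads of size 128.
print(attention_flops(isl=1024, osl=256, n_heads=32, head_dim=128))
```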
- Why is the KV an array of shape
- Discussing the Transformer architecture!