Conversation

@rka97 (Contributor) commented Apr 3, 2025:

This is for the LM workload.


github-actions bot commented Apr 3, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@priyakasimbeg changed the base branch from main to dev on October 2, 2025, 00:40
@fsschneider (Contributor) commented:

I might be wrong here, but I just randomly saw that we are using a default MLP expansion factor of $4$ (i.e., `hidden_dim = 4 * dim`).

However, it seems to me that we are also using a gated linear unit, whose initial expansion is twice as large (since its output is split into the gate and the value), i.e.,

```python
self.fc1 = nn.Linear(dim, 2 * hidden_dim, bias=False)  # projects to gate and value halves
self.fc2 = nn.Linear(hidden_dim, dim, bias=False)       # projects back to the model dimension
self.glu = nn.GLU(dim=2)                                 # first half * sigmoid(second half)
```
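
For concreteness, here is a minimal sketch of the resulting forward pass (the shapes and sizes are assumptions for illustration, not taken from the workload code):

```python
import torch
import torch.nn as nn

dim, hidden_dim = 768, 4 * 768   # assumed sizes, expansion factor 4

fc1 = nn.Linear(dim, 2 * hidden_dim, bias=False)
fc2 = nn.Linear(hidden_dim, dim, bias=False)
glu = nn.GLU(dim=2)

x = torch.randn(8, 128, dim)     # (batch, seq, dim)
h = fc1(x)                       # (batch, seq, 2 * hidden_dim): gate and value stacked
h = glu(h)                       # (batch, seq, hidden_dim): first half * sigmoid(second half)
y = fc2(h)                       # (batch, seq, dim)
```

So with the default factor of $4$, the first projection is $8\times$ the model dimension wide, not $4\times$.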

I believe that the common expansion factor of $4$ is for models that use non-gated activation functions (e.g., the original Transformer with ReLU or GPT-2 with GELU). To keep the number of parameters constant, people commonly multiply the expansion factor by $2/3$ for gated versions. This keeps the total number of parameters in the MLP the same (up to a small rounding error from the non-integer multiplication).
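
To illustrate the parameter counting, a quick sketch assuming the `fc1`/`fc2` layout above, no biases, and a hypothetical `dim = 768`:

```python
def mlp_params(dim, hidden_dim, gated):
    """Parameter count of the two-layer MLP block (no biases)."""
    if gated:
        # fc1: dim -> 2 * hidden_dim (gate + value), fc2: hidden_dim -> dim
        return dim * 2 * hidden_dim + hidden_dim * dim
    # non-gated: fc1: dim -> hidden_dim, fc2: hidden_dim -> dim
    return 2 * dim * hidden_dim

dim = 768
print(mlp_params(dim, 4 * dim, gated=False))          # 4718592: non-gated baseline, factor 4
print(mlp_params(dim, 4 * dim, gated=True))           # 7077888: gated with factor 4, 1.5x too many
print(mlp_params(dim, int(8 * dim / 3), gated=True))  # 4718592: gated with factor 8/3 matches
```

With $\text{dim} = 768$, $8/3 \cdot \text{dim} = 2048$, so the counts match exactly; for other widths there is only a small rounding error.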

If you agree, we should either change the default to $8/3$ (i.e., $4 \cdot 2/3 \approx 2.67$), or adjust the computation of `hidden_dim` in the MLP to account for this scaling.

@rka97 @priyakasimbeg @Niccolo-Ajroldi

@Niccolo-Ajroldi (Member) commented:

@fsschneider Agreed, we should definitely adjust the expansion factor as you suggested. This was probably an oversight; in plainLM we rescale when using a GLU.
