The normalization layer should be applied after the residual connection, not before. At least this is how it is done in the original Transformer, which computes LayerNorm(x + Sublayer(x)) for each sub-layer.
So the code:
.
.
x = layers.LayerNormalization(epsilon=1e-6)(x)
res = x + inputs
.
.
x = layers.LayerNormalization(epsilon=1e-6)(x)
return x + res
should be replaced by:
.
.
x = x + inputs
res = layers.LayerNormalization(epsilon=1e-6)(x)
.
.
x = x + res
return layers.LayerNormalization(epsilon=1e-6)(x)
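For illustration, here is a minimal sketch of what a complete encoder block with this ordering could look like. The elided parts are not shown above, so the attention and feed-forward sub-layers below, the function name `transformer_encoder_post_ln`, its parameters, and the input shape are all my assumptions rather than the actual code:

```python
from tensorflow import keras
from tensorflow.keras import layers


def transformer_encoder_post_ln(inputs, head_size, num_heads, ff_dim, dropout=0.0):
    """Encoder block with normalization applied AFTER each residual add.

    The sub-layers are assumed stand-ins for the parts elided in the snippets.
    """
    # Attention sub-layer (assumed): self-attention over the inputs.
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(inputs, inputs)
    x = layers.Dropout(dropout)(x)
    # Residual connection first, then layer normalization.
    x = x + inputs
    res = layers.LayerNormalization(epsilon=1e-6)(x)

    # Feed-forward sub-layer (assumed): two position-wise 1x1 convolutions.
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    # Again: residual connection first, then layer normalization.
    x = x + res
    return layers.LayerNormalization(epsilon=1e-6)(x)


# Hypothetical input: sequences of 16 time steps with 8 features each.
inputs = keras.Input(shape=(16, 8))
outputs = transformer_encoder_post_ln(inputs, head_size=4, num_heads=2, ff_dim=4)
model = keras.Model(inputs, outputs)
```

With this ordering the block output is itself normalized, so stacked blocks each receive a normalized input without needing an extra normalization layer in front.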
Kind regards