
Imagine a Transformer model without normalization layers. That's exactly what a new paper from Meta, NYU, MIT, and Princeton proposes. The authors show that normalization layers can be replaced with a simple element-wise operation called Dynamic Tanh (DyT): DyT(x) = γ * tanh(αx) + β, where α is a learnable scalar and γ, β are the usual affine scale and shift parameters. https://t.co/J5O2gQoBLc
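
A minimal PyTorch sketch of what a DyT layer could look like, based only on the formula above. The per-channel shapes of gamma/beta and the init_alpha value are assumptions for illustration, not details taken from this post:

import torch
import torch.nn as nn

class DyT(nn.Module):
    # Dynamic Tanh: DyT(x) = gamma * tanh(alpha * x) + beta.
    # alpha is a learnable scalar; gamma and beta are assumed to be
    # per-channel parameters, analogous to LayerNorm's scale and shift.
    def __init__(self, dim, init_alpha=0.5):  # init_alpha=0.5 is an assumed default
        super().__init__()
        self.alpha = nn.Parameter(init_alpha * torch.ones(1))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Element-wise squashing followed by an affine transform;
        # unlike LayerNorm, no statistics over any dimension are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

Because it needs no reductions (no mean/variance over tokens or channels), it's a purely pointwise op, which is what makes it a drop-in candidate where a normalization layer used to sit.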