Summary of Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers, by Gavia Gray, Aman Tiwari, Shane Bergsma, and Joel Hestness
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers, by Gavia Gray,…