Massive Activations Are Architecturally Robust: A Controlled Scratch/Commitment Residual Stream Test
A controlled test suggests that massive activations in transformers serve a functional role rather than being removable artifacts of the residual stream.
Trained transformers often develop massive activations: outlier values concentrated on sequence-start tokens. A controlled scratch/commitment residual stream test asked whether these outliers are removable artifacts of the residual stream's overloaded role. The findings indicate that the activations are architecturally robust, which suggests they serve a functional purpose in model performance.
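To make the notion of a massive activation concrete, here is a minimal sketch of how such outliers might be flagged in a snapshot of hidden states. It is not the procedure used in the test described above. The function name `find_massive_activations`, the `ratio` and `floor` thresholds, and the toy data are all assumptions chosen for illustration.

```python
from statistics import median


def find_massive_activations(hidden, ratio=1000.0, floor=100.0):
    """Return (token_index, dim_index, value) for outlier entries.

    hidden: per-token activation vectors, as a list of lists of floats.
    ratio, floor: illustrative thresholds (assumptions, not values
    taken from the study). An entry is flagged when its magnitude
    exceeds `floor` and also exceeds `ratio` times the median magnitude.
    """
    magnitudes = [abs(v) for row in hidden for v in row]
    med = median(magnitudes)
    hits = []
    for t, row in enumerate(hidden):
        for d, v in enumerate(row):
            if abs(v) > floor and abs(v) > ratio * med:
                hits.append((t, d, v))
    return hits


# Toy residual-stream snapshot: 4 tokens, 6 hidden dimensions,
# with ordinary values between 0.1 and 0.5.
toy_hidden = [[0.1 * ((t + d) % 5 + 1) for d in range(6)] for t in range(4)]

# Plant one outlier on the sequence-start token (hypothetical value).
toy_hidden[0][3] = 850.0

massive = find_massive_activations(toy_hidden)
print(massive)
```

In this toy case the only flagged entry is the planted value at token 0. With a real model, `hidden` would instead come from a layer's residual-stream output, and the thresholds would need tuning to that model's activation scale.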