Comparing Transformers and Hybrid Models at the Token Level
Researchers analyze the performance differences between pure Transformers and hybrid models using Olmo 3 and Olmo Hybrid weights.
This study investigates why hybrid language models—which combine attention and recurrent layers—often outperform pure Transformers. By examining open-weight models, the authors aim to isolate the specific data and architectural capabilities that drive these performance gains, testing whether empirical improvements align with theoretical advantages in state tracking.