Weight-Space Geometry of Offline Reasoning Training
Researchers analyzed the weight updates produced by six common offline reasoning training methods to determine whether the methods are mechanistically distinct.
The study trained all six methods, including SFT, RFT, and DPO, on identical math rollouts with a Qwen3-4B base model, then compared the resulting weight deltas (fine-tuned weights minus base weights) using cosine similarity and principal-angle subspace analysis. The findings suggest that although these methods are often treated as distinct, they may converge toward similar weight updates.
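The two comparisons named above can be illustrated with a minimal NumPy sketch. This is not the study's code: the function names (`weight_delta`, `flat_cosine`, `principal_angles`), the dictionary-of-arrays checkpoint layout, and the choice to span each subspace with the top-k left singular vectors of a single layer's delta are all assumptions made for illustration.

```python
import numpy as np

def weight_delta(base, tuned):
    """Per-parameter update: fine-tuned weights minus base weights.
    Assumes both checkpoints are dicts of same-shaped arrays (hypothetical layout)."""
    return {name: tuned[name] - base[name] for name in base}

def flat_cosine(delta_a, delta_b):
    """Cosine similarity between two updates, flattened over all parameters."""
    a = np.concatenate([delta_a[k].ravel() for k in sorted(delta_a)])
    b = np.concatenate([delta_b[k].ravel() for k in sorted(delta_b)])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def principal_angles(mat_a, mat_b, k):
    """Principal angles (radians) between the subspaces spanned by the
    top-k left singular vectors of two 2-D weight deltas."""
    ua = np.linalg.svd(mat_a, full_matrices=False)[0][:, :k]
    ub = np.linalg.svd(mat_b, full_matrices=False)[0][:, :k]
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    cosines = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))
```

A cosine near 1 and principal angles near 0 would indicate that two methods moved the weights in nearly the same direction and within nearly the same low-rank subspace; angles near pi/2 would indicate unrelated subspaces.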