Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention
Dual-stance evaluation reveals that activation steering can reduce sycophancy without necessarily suppressing factual agreement in Llama-3-8B-Instruct.
Researchers introduced a dual-stance evaluation method to test whether sycophancy-reduction techniques inadvertently harm factual accuracy. Applying centroid-difference steering, they found that the model represents sycophantic agreement and factual agreement differently, which suggests that interventions can be tuned to target sycophancy while preserving factual integrity.
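
As a rough illustration of centroid-difference steering in general, the sketch below builds a steering direction as the difference between the mean activation of two example sets and then shifts a hidden state against that direction. It is not the researchers' implementation: the function names, the two-dimensional toy activations, the subtractive sign convention, and the strength parameter `alpha` are all assumptions chosen for illustration. In practice the activations would be taken from a chosen layer of the model.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(
    sycophantic: Sequence[Sequence[float]],
    neutral: Sequence[Sequence[float]],
) -> Vector:
    """Centroid difference: mean(sycophantic) - mean(neutral)."""
    mu_s = centroid(sycophantic)
    mu_n = centroid(neutral)
    return [a - b for a, b in zip(mu_s, mu_n)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift a hidden state against the direction: h - alpha * d."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical values, two dimensions for readability).
sycophantic_acts = [[1.0, 2.0], [3.0, 4.0]]
neutral_acts = [[0.0, 1.0], [2.0, 1.0]]

direction = steering_vector(sycophantic_acts, neutral_acts)
steered = steer([5.0, 5.0], direction, alpha=2.0)
print(direction, steered)
```

The strength `alpha` is the tuning knob in this sketch: larger values push harder against the estimated sycophancy direction, and a dual-stance check would then look at whether factual agreement survives at a given setting.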