Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas
33 LLMs tested for domain-specific metacognition
The study evaluated 33 frontier LLMs across six MMLU domains to assess metacognitive accuracy, that is, how well a model's confidence tracks whether its answers are correct. Aggregate confidence scores often hid domain-level variability, and some models showed strong monitoring in specific domains. The results point to the need for domain-aware confidence calibration in high-stakes applications.
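To illustrate how an aggregate score can mask domain-level differences, here is a minimal sketch in Python. It is not the study's method or data: the AUROC-style discrimination metric, the function names `auroc` and `auroc_by_domain`, the domain labels, and every number in `records` are assumptions made up for illustration.

```python
from collections import defaultdict


def auroc(confidences, correct):
    """Chance that a correct answer has higher confidence than an incorrect one.

    Ties count as 0.5. Returns None when the metric is undefined
    (all answers correct or all answers incorrect).
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auroc_by_domain(records):
    """Group (domain, confidence, correct) records and score each domain."""
    grouped = defaultdict(list)
    for domain, conf, ok in records:
        grouped[domain].append((conf, ok))
    return {
        domain: auroc([c for c, _ in rows], [ok for _, ok in rows])
        for domain, rows in grouped.items()
    }


# Synthetic records (hypothetical, NOT from the study):
# confidence tracks correctness in domain_a and is inverted in domain_b.
records = [
    ("domain_a", 0.9, True), ("domain_a", 0.8, True),
    ("domain_a", 0.3, False), ("domain_a", 0.2, False),
    ("domain_b", 0.9, False), ("domain_b", 0.8, False),
    ("domain_b", 0.3, True), ("domain_b", 0.2, True),
]

per_domain = auroc_by_domain(records)
aggregate = auroc([c for _, c, _ in records], [ok for _, _, ok in records])

print(per_domain)  # {'domain_a': 1.0, 'domain_b': 0.0}
print(aggregate)   # 0.5
```

In this toy case the pooled score of 0.5 looks like chance-level monitoring, while the per-domain scores show one domain monitored perfectly and one monitored in reverse. Real evaluations would need far more items per domain for the per-domain estimates to be stable.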