LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks
LCO is a framework for mitigating in-context reward hacking in autonomous LLM agents.
Autonomous agents often suffer from in-context reward hacking (ICRH), where they optimize for proxy objectives at the cost of harmful side effects. LCO (LLM-based Constraint Optimization) mitigates this by enforcing constraints during the agent's iterative interaction loop, addressing the root cause of over-optimization.
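The text above does not spell out how LCO enforces its constraints, so the following is only a minimal sketch of the general idea: checking a constraint inside every iteration of an agent's refinement loop rather than once at the end. Everything in it is an assumption made for illustration: the `Candidate` fields, the `harm` score, the `harm_budget` threshold, and `toy_propose` (a stand-in for an LLM call) are hypothetical and not taken from LCO itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Candidate:
    text: str
    proxy_reward: float  # the proxy objective the agent is asked to maximize
    harm: float          # hypothetical side-effect score in [0, 1]


def run_loop(
    propose: Callable[[Optional[Candidate]], List[Candidate]],
    harm_budget: float,
    steps: int,
    enforce: bool = True,
) -> Optional[Candidate]:
    """Greedy iterative refinement; optionally drop candidates over the harm budget."""
    current: Optional[Candidate] = None
    for _ in range(steps):
        options = propose(current)
        if enforce:
            # The constraint is checked inside every iteration, not once at the end.
            options = [c for c in options if c.harm <= harm_budget]
        if not options:
            break  # no feasible refinement; keep the last accepted candidate
        best = max(options, key=lambda c: c.proxy_reward)
        if current is None or best.proxy_reward > current.proxy_reward:
            current = best
    return current


def toy_propose(current: Optional[Candidate]) -> List[Candidate]:
    # Stand-in for an LLM call (assumption): each round offers a higher-reward
    # but more harmful variant and a lower-reward variant with unchanged harm.
    base_reward = current.proxy_reward if current else 0.0
    base_harm = current.harm if current else 0.0
    return [
        Candidate("louder", base_reward + 1.0, min(1.0, base_harm + 0.3)),
        Candidate("safer", base_reward + 0.4, base_harm),
    ]


unconstrained = run_loop(toy_propose, harm_budget=0.5, steps=4, enforce=False)
constrained = run_loop(toy_propose, harm_budget=0.5, steps=4, enforce=True)
```

In this toy setup the unconstrained loop keeps choosing the higher-reward variant and ends above the harm budget, while the constrained loop accepts a lower proxy reward and stays within it. This illustrates the trade-off that ICRH describes; it is not a reproduction of LCO's actual procedure.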