Policy & Ethics, chain of thought, reinforcement learning, model monitoring
Monitor Jailbreaking Evades Chain-of-Thought Monitoring Without Encoded Reasoning
Relevance Score: 5.8
A LessWrong post examines monitor jailbreaking that can evade chain-of-thought (CoT) monitoring; it raises the concern that optimization pressure on CoT during RL could push models toward encoded reasoning.
