Microsoft Unveils Scanner To Detect Backdoors
Microsoft researchers this week published a paper and a lightweight scanner for detecting sleeper-agent backdoors in large language models. They identify three detection indicators: a "double-triangle" attention pattern, leakage of poisoned training data, and fuzzy triggers that activate on partial tokens. They also show that defenders can often find a trigger without knowing the exact phrase. The tools aim to help enterprises vet models for stealthy model poisoning.

