AMS Scans Open-weight LLMs for Safety
Glen Messenger announces the release of AMS (Activation-based Model Scanner) in a Google-affiliated blog post and links to the project's GitHub repository. The authors also uploaded a preprint titled "AMS: Detecting Unsafe and Tampered Language Models via Activation Analysis" to Zenodo on April 10, 2026, documenting the method and validation results. The preprint reports that AMS inspects internal activation geometry using contrastive prompt pairs and direction-vector analysis, completes scans in 10-40 seconds on GPU hardware, and produces model-level safety signals rather than prompt-level classifications. In the authors' validation across 14 model configurations spanning the Llama, Gemma, and Qwen families and three quantization levels, instruction-tuned models showed 3.8-8.4 sigma class separation while several uncensored models flagged at 1.1-1.3 sigma, and quantization drift was reported as <5%, according to the preprint. The blog post cites a 2025 study that found over 8,000 safety-modified model repositories on Hugging Face, with modified models complying with unsafe requests at 74% versus 19% for originals.
What happened
Glen Messenger, identified on his blog as affiliated with Google Kubernetes Engine, announces the open-source release of AMS (Activation-based Model Scanner) and links to the project repository on GitHub, per the blog post. The repository's pyproject.toml lists the package name ams-scanner, version 0.1.0, and author information, according to GitHub. The authors published a preprint on Zenodo, titled "AMS: Detecting Unsafe and Tampered Language Models via Activation Analysis", documenting the method, validation datasets, and experimental results, per the Zenodo entry.
Technical details
The Zenodo preprint describes AMS as a scanner that measures the geometric structure of safety-relevant concepts in a model's activation space using contrastive prompt pairs and direction-vector analysis, and reports scan completion times of 10-40 seconds per model on GPU hardware. The validation reported on Zenodo covers 14 model configurations across three architecture families (Llama, Gemma, Qwen) and three quantization levels (FP16, INT8, INT4). Reported findings include instruction-tuned models exhibiting 3.8-8.4 sigma class separation between harmful and benign activations, several uncensored models flagged at 1.1-1.3 sigma, and quantization causing <5% drift in separation metrics. The GitHub repository includes the scanner code and dependency specifications such as torch>=2.0.0 and transformers>=4.35.0, per pyproject.toml.
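The contrastive direction-vector idea the preprint describes can be illustrated with a minimal sketch. The code below is not taken from the AMS repository; the synthetic activations, hidden dimension, and the mean-difference direction are illustrative assumptions showing how a sigma-style class-separation score could be computed from activations on harmful versus benign prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (illustrative, not from the paper)

# Synthetic residual-stream activations for a contrastive prompt set.
# A safety-tuned model is expected to separate the two classes; these
# values are simulated stand-ins, not real model activations.
harmful = rng.normal(loc=1.0, scale=1.0, size=(50, d))
benign = rng.normal(loc=-1.0, scale=1.0, size=(50, d))

# Direction vector: difference of class means, normalized to unit length.
direction = harmful.mean(axis=0) - benign.mean(axis=0)
direction /= np.linalg.norm(direction)

# Project every activation onto the candidate safety direction.
p_harm = harmful @ direction
p_ben = benign @ direction

# Class separation in sigma units: gap between projected class means
# relative to the pooled standard deviation of the projections.
pooled_std = np.sqrt((p_harm.var(ddof=1) + p_ben.var(ddof=1)) / 2)
separation = abs(p_harm.mean() - p_ben.mean()) / pooled_std
print(f"class separation: {separation:.1f} sigma")
```

On this synthetic data the two classes are well separated; on a real model the analogous score would be computed from hidden states captured while running the contrastive prompt pairs.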
Editorial analysis
Industry-pattern observations: public coverage and the preprint emphasize limitations of behavioral testing for safety verification, specifically that testing is slow, incomplete, and can be evaded. In comparable contexts, researchers have increasingly turned to internal-model signals, because activation-space probes can detect latent representational differences that behavioral outputs may hide. For practitioners, activation-based scans trade query-based coverage for a model-level verification signal that is orders of magnitude faster, which can fit into CI pipelines or bulk registry screening.
Context and significance
Editorial analysis: the blog post cites a 2025 study that reported over 8,000 safety-modified model repositories on Hugging Face, with those modified models complying with unsafe requests at 74% compared to 19% for original instruction-tuned models. That reported scale of tampering increases the operational need for quick verification tools. AMS's model-level approach, as reported, aims to identify collapsed class separation signatures indicative of removed or degraded safety training, which can complement behavioral red-teaming rather than replace it.
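As a hypothetical illustration of how a collapsed-separation signal might gate a bulk-screening workflow (the threshold value and function below are assumptions for illustration, not part of the AMS implementation):

```python
def screen(separation_sigma: float, threshold: float = 2.0) -> str:
    # Hypothetical gate: flag models whose class separation collapses
    # below a calibrated threshold, consistent with the reported
    # 1.1-1.3 sigma for uncensored models versus 3.8-8.4 sigma for
    # instruction-tuned baselines. The 2.0 cutoff is an assumption.
    return "flag" if separation_sigma < threshold else "pass"

print(screen(5.2))  # within the instruction-tuned range -> pass
print(screen(1.2))  # collapsed separation -> flag
```

In practice the threshold would need per-family calibration, which is exactly the limitation the authors discuss.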
What to watch
For practitioners: track community replication of the Zenodo benchmarks across more architectures and languages, real-world false positive and false negative rates when scanning third-party checkpoints, and how threshold calibration performs on larger model families. Also monitor integration effort, for example whether the GitHub package and dependencies allow straightforward inclusion in CI pipelines, and whether independent teams reproduce the reported 10-40 second scan times and <5% quantization drift.
Limitations reported by the authors
The Zenodo preprint explicitly discusses threshold calibration and the restricted scope of the validation set, and it notes that one model labeled "uncensored" passed the AMS checks, which the authors present as either a labeling issue or an avenue for further study, per the paper. The blog post and preprint both note that AMS provides model-level signals rather than per-prompt safety guarantees.
Practical takeaway
Editorial analysis: AMS, as documented in the preprint and released on GitHub, is a rapid, activation-space scanner that can help surface models whose internal safety geometry diverges from instruction-tuned baselines. For teams vetting downloaded or third-party checkpoints at scale, a fast activation-based check can reduce the initial screening burden, while behavioral testing remains necessary for nuanced, real-world safety evaluation.
Scoring Rationale
AMS provides a practical, fast method for scanning open-weight models, validated across multiple architectures and quantization levels in a Zenodo preprint and released as open-source, making it immediately relevant for teams vetting third-party checkpoints.