Researchers Analyze Jailbreak Resilience in DeepSeek and GPT Models

An arXiv preprint (arXiv:2506.18543) by Xiaodong Wu et al. publishes a systematization of knowledge on jailbreak resilience comparing DeepSeek with GPT-3.5 and GPT-4 using the HarmBench benchmark. According to the paper, the authors evaluate seven representative attack methods across 510 harmful behaviors. The paper reports that DeepSeek shows partial resilience to optimization-driven attacks such as TAP-T, while being more susceptible to prompt-based and manually engineered adversarial inputs. The authors report that GPT-4 Turbo demonstrates more robust and consistent safety alignment, which they attribute to stronger safety optimization and reinforcement learning from human feedback. The paper concludes there is a trade-off between model efficiency and alignment generalization and recommends targeted safety tuning for open-source LLMs, per the arXiv submission.
What happened
An arXiv paper (arXiv:2506.18543, revised 25 May 2026) by Xiaodong Wu and coauthors presents a systematization of knowledge titled "SoK: A Comprehensive Security Analysis of Jailbreak Resilience in GPT and DeepSeek Models." Per the paper, the authors benchmark DeepSeek against GPT-3.5 and GPT-4 using the HarmBench evaluation suite, testing seven attack methods over 510 harmful behaviors.
Technical details
Per the paper, the evaluation covers attacks organized along functional and semantic dimensions and includes optimization-driven methods (for example, TAP-T) as well as prompt-based and manually engineered adversarial inputs. The authors report that DeepSeek provides partial resilience to optimization-driven attacks but shows greater susceptibility to prompt-based and handcrafted adversarial prompts. The paper reports that GPT-4 Turbo exhibits more robust and consistent refusal and safety alignment across a wider set of behaviors, which the authors suggest is likely linked to stronger safety optimization and reinforcement learning from human feedback.
Editorial analysis
Industry-pattern observations: Open-source model families frequently trade off parameter and inference efficiency for looser generalization of alignment, increasing surface area for prompt-engineering attacks. Comparative SoK-style evaluations such as this help quantify which attack classes remain effective against different development choices.
Context and significance
For practitioners: The paper provides a measured, empirical comparison that highlights where an open-source stack like DeepSeek may require additional safety tuning before deployment in high-risk contexts. Observers tracking model safety will find the explicit enumeration of attack families and the 510-behavior benchmark useful as a baseline for red teaming and fine-tuning efforts.
What to watch
Editorial analysis: Follow follow-up work that publishes full attack corpora, replication studies, or targeted mitigation experiments, and watch for public releases of the benchmark artifacts from the authors. Additional measurements that isolate training, architectural, or alignment-procedure differences would clarify the reported link between RLHF-style optimization and improved refusal behavior.
Note: All factual claims about experiments, counts, comparative performance, and the authors' interpretation are taken from arXiv:2506.18543 (Xiaodong Wu et al.).
Scoring Rationale
This SoK provides a systematic, empirical comparison of jailbreak resilience across open-source and proprietary model families, offering actionable baselines for red teams and safety engineers. The work is notable for scale and direct comparison but does not introduce a new defensive paradigm.
Practice interview problems based on real data
1,500+ SQL & Python problems across 15 industry datasets — the exact type of data you work with.
Try 250 free problems
