PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks

Beijing Institute of AI Safety and Governance (Beijing-AISI)
Beijing Key Laboratory of Safe AI and Superalignment
BrainCog Lab, CASIA
Long-term AI
*Indicates Equal Contribution

Abstract

Large language models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial prompts known as jailbreaks, which can bypass safety alignment and elicit harmful outputs. Despite growing efforts in LLM safety research, existing evaluations are often fragmented, focused on isolated attack or defense techniques, and lack systematic, reproducible analysis. In this work, we introduce PandaGuard, a unified and modular framework that models LLM jailbreak safety as a multi-agent system comprising attackers, defenders, and judges. Built on this framework, we develop PandaBench, a large-scale benchmark encompassing over 50 LLMs, 20+ attack methods, 10+ defense mechanisms, and multiple judgment strategies, requiring over 3 billion tokens to execute. Our comprehensive evaluation reveals key insights into model vulnerabilities, defense cost-performance trade-offs, and judge consistency. We find that no single defense is optimal across all dimensions and that judge disagreement introduces nontrivial variance in safety assessments. We release the full code, configurations, and evaluation results to support transparent and reproducible research in LLM safety.

Why PandaGuard Matters

With the growing capabilities and adoption of large language models (LLMs), the risk of jailbreak attacks—where adversaries manipulate models into bypassing safety guardrails—has become a critical challenge. PandaGuard offers the most comprehensive framework for evaluating and enhancing the safety of LLMs across multiple models, attacks, and defenses.

PandaGuard Framework

PandaGuard structures LLM safety evaluation as an interactive system with Attackers, Defenders, and Judges. It supports plug-and-play of various algorithms, backends, and interfaces.

PandaGuard Architecture

The PandaGuard framework architecture illustrating the end-to-end pipeline for LLM safety evaluation. The system connects three key components: Attackers, Defenders, and Judges. The framework supports diverse LLM interfaces and demonstrates several practical applications including interactive chat, API serving, attack generation, and systematic evaluation.

System Components

Attackers

Includes 18+ prompt-based and instruction-tuning jailbreak attack strategies designed to elicit unauthorized outputs from the LLMs.

Defenders

Applies safety-alignment techniques such as reinforcement learning from human feedback (RLHF), paraphrasing, and response filtering.

Judges

Comprises rule-based, heuristic, and LLM-based evaluation strategies to assess jailbreak success rates and safety compliance.

Multi-Backend Support

Seamlessly integrates with inference engines such as vLLM, SGLang, and Ollama for scalable deployment and testing.

Empirical Results

PandaBench, the evaluation suite of PandaGuard, covers over 50+ LLMs, testing across multiple harm categories and defense setups, validated with more than 1.5 billion tokens.

aaa

Model-wise safety analysis. (a) ASR vs. release date for various LLMs. (b) ASR across different harm categories with and without defense mechanisms. (c) Overall ASR for all evaluated LLMs with and without defense mechanisms.

bbb

ASR heatmap for different attack methods against various LLMs. Higher values indicate more successful attacks.

ccc

Attack and defense mechanisms analysis. (a) Heatmap of attack success rates across different combinations of attack and defense methods. (b) Trade-off between defense effectiveness and computational overhead measured in total tokens. (c) Trade-off between defense effectiveness and impact on model performance as measured by Alpaca winrate.

ddd

Safety judge reliability analysis. (a) Radar charts comparing ASR judgments by different judges across harm categories, defense methods, and attack methods. Judges include rule-based and LLM-based (GPT-4o, Qwen2.5, Llama3.3). (b) Cohen's Kappa matrix showing agreement between different judges.

Conclusion

PandaGuard represents a leap forward in the systematic safety evaluation of large language models. By integrating diverse components, providing reproducible pipelines, and supporting extensible backends, it offers a valuable toolkit for researchers and practitioners alike to ensure the safe and trustworthy deployment of LLMs.

Manuscript

BibTeX

BibTex Code Here