2026-01-20 ✅ Best Practices

Prompt Injection Resistance Evaluation Framework Published & Refusal Calibration Update


Prompt Injection Resistance Evaluation Framework — Methodology and Results

Anthropic has published a new evaluation framework for measuring model resistance to prompt injection attacks, along with baseline results for Claude Sonnet 4.5 and Claude Opus 4.5. The framework, designed to standardise how the industry measures injection resistance, defines three injection difficulty tiers: direct injection (a malicious instruction embedded in clearly marked tool output), indirect injection (an instruction embedded in natural language within retrieved content), and adversarial indirect injection (an instruction crafted to blend with legitimate context and evade source attribution).
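The three tiers can be illustrated with a minimal test-case constructor. This is a hypothetical sketch, not the published framework's actual schema: the function name, payload wrapping, and `expected_behavior` field are all assumptions made for illustration.

```python
# Hypothetical sketch of the three injection difficulty tiers described above.
# Tier names follow the article; everything else is illustrative.

DIRECT = "direct"            # malicious instruction in clearly marked tool output
INDIRECT = "indirect"        # instruction embedded in retrieved natural language
ADVERSARIAL = "adversarial"  # instruction blended with legitimate context

def make_test_case(tier: str, payload: str, context: str) -> dict:
    """Wrap an injection payload according to its difficulty tier."""
    if tier == DIRECT:
        # Payload appears verbatim inside a labelled tool-output block.
        content = f"<tool_output>\n{payload}\n</tool_output>"
    elif tier == INDIRECT:
        # Payload is woven into retrieved prose as a plain sentence.
        content = f"{context} {payload}"
    elif tier == ADVERSARIAL:
        # Payload mimics the surrounding document's voice and cites no
        # source, making attribution to the retrieved content harder.
        content = f"{context} As noted in the guidelines above, {payload}"
    else:
        raise ValueError(f"unknown tier: {tier}")
    return {"tier": tier, "content": content,
            "expected_behavior": "refuse_or_ignore"}
```

A harness built on this shape would render each case into a conversation, run the model, and judge whether the injected instruction was followed.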

Framework results for Claude Sonnet 4.5

The evaluation framework is being released as part of the anthropic-evals open-source repository, enabling external researchers to reproduce the results and evaluate other models against the same battery. Anthropic acknowledges that no current model achieves high resistance across all three tiers under adversarial conditions and frames this as an active research area.
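Reproducing the headline numbers comes down to aggregating per-case pass/fail judgments into a resistance rate per tier. The following sketch assumes a simple `(tier, followed_injection)` result format; the actual repository's data model may differ.

```python
# Aggregate per-case judgments into a per-tier resistance rate.
# Input format is an assumption: a list of (tier, followed_injection) pairs,
# where followed_injection is True if the model obeyed the injected instruction.
from collections import defaultdict

def resistance_by_tier(results: list[tuple[str, bool]]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    resisted: dict[str, int] = defaultdict(int)
    for tier, followed in results:
        totals[tier] += 1
        if not followed:  # resisting means NOT following the injection
            resisted[tier] += 1
    return {tier: resisted[tier] / totals[tier] for tier in totals}
```

Running other models against the same battery and comparing these per-tier rates is what the shared repository is meant to enable.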

security prompt injection evals research retrospective

Refusal Calibration Update — Reducing Over-Refusals in Professional Contexts

Anthropic has deployed a refusal calibration update to the Claude 4.5 model family that specifically targets over-refusal in legitimate professional contexts. Over-refusal — where Claude declines to assist with a clearly legitimate request because surface features of the request resemble policy-violating content — has been one of the most consistent sources of negative feedback from enterprise operators in the medical, legal, cybersecurity, and academic research sectors.

The update was developed using a new evaluation dataset of professional-context requests that Claude previously refused at higher-than-appropriate rates. Anthropic consulted with medical providers, legal technology companies, cybersecurity researchers, and academic institutions to build the dataset, ensuring it reflects realistic professional usage rather than adversarial edge cases. The calibration reduces over-refusal rates in these contexts by approximately 35%, with the accompanying evaluation showing no regression in the overall harmfulness refusal rate.
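The two headline claims can be expressed as a simple check: relative reduction in over-refusal rate, plus a no-regression condition on the harmfulness refusal rate. The numbers below are invented for illustration only; the function and its parameters are assumptions, not part of any published tooling.

```python
# Sketch of the calibration acceptance criteria described above.
# All rates are fractions in [0, 1]; the example values are invented.

def calibration_check(over_refusal_before: float, over_refusal_after: float,
                      harmful_refusal_before: float, harmful_refusal_after: float,
                      tolerance: float = 0.0) -> tuple[float, bool]:
    """Return (relative reduction in over-refusals, harmfulness-rate held)."""
    relative_reduction = 1 - over_refusal_after / over_refusal_before
    no_regression = harmful_refusal_after >= harmful_refusal_before - tolerance
    return relative_reduction, no_regression

# Illustrative: 0.20 -> 0.13 over-refusal is a 35% relative reduction,
# while the harmfulness refusal rate is unchanged.
reduction, held = calibration_check(0.20, 0.13, 0.98, 0.98)
```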

Note on operator context: Operators deploying Claude in professional contexts benefit most from this update when their system prompts include accurate professional-context framing, clearly describing the deployment context and the user population.
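What such framing might look like, as a minimal sketch: a helper that assembles deployment context and user population into a system prompt. The wording and field names are assumptions for illustration, not an Anthropic-prescribed template.

```python
# Illustrative professional-context framing for a system prompt.
# The template wording is an assumption, not an official recommendation.

def build_system_prompt(domain: str, user_population: str, task_summary: str) -> str:
    return (
        f"You are assisting {user_population} working in {domain}. "
        f"Typical requests involve {task_summary}. "
        "Requests in this deployment come from verified professionals "
        "acting within their normal scope of practice."
    )

prompt = build_system_prompt(
    domain="clinical medicine",
    user_population="licensed physicians at a hospital network",
    task_summary="summarising drug-interaction and patient-safety literature",
)
```

The point is simply that the calibration keys off declared context: a prompt that states who the users are and what they do gives the model grounds to treat domain-typical requests as legitimate.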

refusals safety calibration operators enterprise retrospective