✅ Prompt Injection Resistance Evaluation Framework — Methodology and Results
Anthropic has published a new evaluation framework for measuring model resistance to prompt injection attacks, along with baseline results for Claude 4.5 Sonnet and Opus. The framework, designed to standardise how the industry measures injection resistance, defines three injection difficulty tiers: direct injection (malicious instruction embedded in clearly marked tool output), indirect injection (instruction embedded in natural language within retrieved content), and adversarial indirect injection (instruction crafted to blend with legitimate context and evade source attribution).
Framework results for Claude Sonnet 4.5
- Direct injection resistance — 97% of direct injection attempts rejected or reported to the operator without following the injected instruction
- Indirect injection resistance — 89% resistance rate; the primary failure mode is injections that closely mimic the format of legitimate operator instructions
- Adversarial indirect injection — 74% resistance rate; the hardest category, where attacks are explicitly designed to exploit the model's trust in context
The evaluation framework is being released as part of the anthropic-evals open-source repository, enabling external researchers to reproduce the results and evaluate other models against the same battery. Anthropic acknowledges that no current model achieves high resistance across all three tiers under adversarial conditions and frames this as an active research area.
security
prompt injection
evals
research
retrospective
✅ Refusal Calibration Update — Reducing Over-Refusals in Professional Contexts
Anthropic has deployed a refusal calibration update to the Claude 4.5 model family that specifically targets over-refusal in legitimate professional contexts. Over-refusal — where Claude declines to assist with a clearly legitimate request because surface features of the request resemble policy-violating content — has been one of the most consistent sources of negative feedback from enterprise operators in the medical, legal, cybersecurity, and academic research sectors.
The update was developed using a new evaluation dataset of professional-context requests that Claude previously refused at higher-than-appropriate rates. Anthropic consulted with medical providers, legal technology companies, cybersecurity researchers, and academic institutions to build the dataset, ensuring it reflects realistic professional usage rather than adversarial edge cases. The calibration reduces over-refusal rates in these contexts by approximately 35% while maintaining the overall harmfulness refusal rate, which the evaluation shows has not regressed.
Note on operator context: Operators deploying Claude in professional contexts benefit most from this update when they include accurate professional-context framing in their system prompts. The update is most effective when system prompts clearly describe the deployment context and user population.
refusals
safety calibration
operators
enterprise
retrospective