AI Agent Attacks in Q4 2025 Signal New Risks for 2026

As AI systems began displaying agentic behavior in 2025, attackers swiftly tested the vulnerabilities arising from these advancements.

Researchers at Lakera AI examined attack activities over a 30-day period in the fourth quarter. Their analysis revealed a rise in incidents where early-stage AI agents — capable of browsing documents, utilizing tools, or processing external inputs — were creating new and exploitable security vulnerabilities.

Mateo Rojas-Carulla, Head of Research for AI Agent Security at Check Point, stated, “As AI agents transition from experimental projects to real business applications, attackers are already exploiting new features such as browsing and document access.” He emphasized that data from Lakera indicates that indirect attacks targeting these capabilities have a higher success rate and broader impact than direct prompt injections. He advised enterprises that AI security should not be an afterthought and that organizations must reevaluate trust boundaries, guardrails, and data practices in preparation for increased agent adoption in 2026.

### System Prompts Become a Prime Attack Target
The predominant goal of attackers in Q4 was to extract system prompts, which provide valuable intelligence on role definitions, tool descriptions, policy limitations, and workflow logic. Attackers used two primary techniques.

The first involved hypothetical scenarios and role framing, where attackers posed as developers, auditors, or students in simulations. Framing requests as training exercises or academic tasks often succeeded when direct requests did not, especially when nuanced language or multilingual prompts were utilized.

The second technique was obfuscation, hiding malicious instructions within structured or code-like content. Attackers crafted inputs resembling JSON or other formats to conceal commands that would prompt the model to disclose internal information. This tactic frequently evaded simple pattern-based detection.

### How Attackers Evade AI Content Controls
In addition to prompt leakage, attackers increasingly directed their efforts at content safety controls through indirect means. Instead of requesting restricted output directly, prompts were framed as evaluations, summaries, fictional scenarios, or transformations. By altering the context for content generation, attackers could often persuade models to produce restricted material under the guise of critique or analysis. This nuanced approach complicates detection, as the model might comply technically while still generating harmful content.

### Attackers Probe AI Agents Before Exploitation
A significant portion of Q4 activity involved exploratory probing rather than immediate exploitation. Attackers tested emotional cues, contradictory instructions, abrupt role changes, and fragmented formats to observe the models’ reactions. This reconnaissance allowed adversaries to pinpoint weaknesses in refusal logic and guardrail consistency, particularly as agent workflows increased in complexity.

### How AI Agents Enable New Attack Paths
Q4 marked the emergence of attacks that become viable only when models operate as agents. Researchers noted efforts to extract sensitive data from integrated document stores, as well as hidden instructions within external webpages or files processed by agents. These instances of indirect prompt injection signify that malicious instructions are delivered through untrusted external inputs rather than direct user actions. Crucially, these indirect attacks often succeed with fewer attempts, underscoring external data sources as a key risk factor heading into 2026.

### Building Cyber Resilience for AI Agents
As AI cases evolve from simple interfaces to agentic workflows, the accompanying security challenges broaden and intensify. Standard prompt-level defenses are inadequate when models can access data and execute actions based on external inputs. Organizations utilizing AI agents must reassess their security strategies, considering each interaction as an extension of a larger attack surface.

Key recommendations include:
– Extending security controls throughout the entire agent interaction process, encompassing prompts, retrieval steps, tool calls, and outputs.
– Validating, sanitizing, and applying trust levels to all external content prior to agent interaction.
– Upholding least-privilege access and stringent policy-based controls for tool execution, data access, and workflow processes.
– Isolating and sandboxing agent execution environments to minimize potential damage from misuse.
– Monitoring agent behavior for anomalies, such as unexpected role changes or unusual tool usage.
– Preparing specialized incident response and testing programs for agentic systems, including tailored red-teaming and response procedures.

Implementing these controls can help organizations bolster cyber resilience, limiting the spread and impact of agent-driven attacks. As AI agents advance, managing risks will depend on the ability to design systems that not only facilitate innovation but also effectively detect, contain, and respond to misuse.

### AI Complexity Is Expanding the Attack Surface
Research highlights that the fourth quarter of 2025 showcased a persistent reality: attacker techniques are evolving in tandem with AI advancements. As agentic systems grow more sophisticated and undertake complex workflows, this very complexity creates new risks, fostering avenues for exploitation that conventional security measures may not address.

Organizations facing this expanding AI-driven attack surface should consider adopting zero-trust principles, providing a structured approach to mitigate implicit trust and reduce risk across increasingly intricate systems.

—

Source link