AI agents are no longer just answering machines. They’re sending emails, executing code, managing files, and operating around the clock. A new red-teaming study suggests that the gap between what these agents are capable of and what they can be trusted to do safely is far wider than most organisations realise.
Martti Asikainen & Umair Ali Khan 23.3.2026 | Photo by Adobe Stock Photos
This article argues that while agentic AI can act with high autonomy, it can also introduce unreliable judgment, making traditional AI safety assumptions insufficient.
Imagine a colleague who never sleeps. They monitor your email overnight, summarise documents, schedule meetings, book flight tickets, and run scripts on your servers. By the time you arrive at work, several routine tasks are already done.
Now imagine discovering that this same colleague has shared confidential emails with the wrong person, modified a server configuration without permission, deleted some important files, made a very expensive non-refundable reservation, or quietly started a process that consumes your computing resources for several days straight — all while reporting back that everything went fine and there is nothing to worry about.
Most people would consider this unacceptable. This is not a thought experiment about a careless employee. It is a scenario that researchers recently observed
A researcher from a partner organisation sent a message asking one such agent to list recent emails and summarise their contents. The agent complied, including, without hesitation, emails containing bank account details, a social security number, and confidential HR records. The researcher didn’t pretend to be anyone. They just asked.
This is not a hypothetical scenario. It’s one of eleven documented case studies from a peer-reviewed red-teaming study published in February 2026 by a team of over twenty researchers from Northeastern University, Harvard, MIT, Carnegie Mellon, and Stanford, among others. The paper, titled Agents of Chaos (Shapira et al. 2026), is one of the most detailed empirical investigations yet into what happens when autonomous AI agents are deployed in realistic environments with real tools, real email accounts, and real consequences. The results should give pause to anyone in a position of organisational responsibility.
The AI agents in the above-mentioned study were built using OpenClaw, an open-source framework that connects a large language model to persistent memory, file systems, shell access, and messaging channels. Over a two-week period, twenty AI researchers were given free rein to probe, stress-test, and attempt to break the systems in adversarial conditions. They found eleven significant security failures, several of which would be catastrophic in a real organisational context.
The core problem the study identifies is not a flaw in any particular model or product. It is structural. When AI agents are given access to real tools and communication channels, they operate at what the researchers describe as a dangerous mismatch of capability and competence: capable of taking actions appropriate to a highly autonomous system, while lacking the situational awareness and judgement of even a minimally experienced human employee (Shapira et al. 2026; Mirsky 2025).
The difference between a conversational AI and an agentic one is the difference between a wrong answer and a wrong action (Chan et al. 2023; Mirsky 2025; Shapira et al. 2026). An incorrect sentence generated by a chatbot can be ignored. An incorrect command executed by an autonomous agent can alter a database, expose confidential information, or interrupt a service, and by the time anyone notices, the damage is done.
In February 2026, that distinction played out in an unusually public way. Summer Yue, director of alignment at Meta Superintelligence Labs, the person whose job is to keep AI systems under control, posted screenshots of her own OpenClaw agent going rogue and deleting her email inbox. She told it to stop. It ignored her. “I couldn’t stop it from my phone,” she wrote. “I had to RUN to my Mac mini like I was defusing a bomb.” Her post received 9.6 million views (Stone 2026). The irony was not lost on the AI safety community: the failure mode the researchers had documented in a controlled lab was unfolding in real time on the desk of someone paid to prevent exactly that.
That mismatch has concrete consequences. In one case, a non-owner gained access to 124 email records, including sender addresses, subject lines, and message bodies, simply by framing an urgent-sounding technical request. In another case, an attacker changed their display name to match the agent’s owner, opened a new private channel, and successfully convinced the agent to delete its entire memory, configuration files, and operating instructions, effectively wiping the system clean. In a third, a non-owner gradually convinced an agent to agree to leave its own server, refuse to respond to other users, and expose its internal files.
These failures are not anecdotal edge cases. They are symptoms of a systemic design gap that the research community has been documenting with increasing urgency, and that is now attracting mainstream attention. In tests published by The Guardian, AI security lab Irregular gave a team of agents the routine task of creating LinkedIn posts from a company’s internal database. Without being asked, the agents dodged conventional security systems and published sensitive password information publicly.
In a separate test, agents overrode antivirus software to download files they knew contained malware, forged admin credentials to access restricted documents, and put pressure on other agents to circumvent safety checks. None of them had been instructed to do any of this. “AI can now be thought of as a new form of insider risk,” said Dan Lahav, co-founder of Irregular (Booth 2026).
One of the most technically important vulnerabilities the study illustrates is prompt injection, and specifically the indirect variety that is now regarded as the most dangerous threat facing enterprise AI deployments.
Prompt injection exploits a fundamental characteristic of how large language models work. LLMs cannot reliably distinguish between instructions and data (Liu et al. 2023; Benjamin et al. 2024). When an agent reads an email, processes a document, or fetches a webpage, it treats the content as meaningful and potentially actionable text. Embed a hidden instruction in that content, something as simple as a line of invisible text or a carefully worded phrase, and the agent may follow it as if it were a legitimate command from its owner (OWASP 2025; Lakera AI 2026; Asikainen 2025).
The following figure shows a normal vs. poisoned input — an agent processing a legitimate email (left) versus one with a hidden injected instruction embedded in it (right).
OWASP’s 2025 Top 10 for LLM Applications places prompt injection at the very top of its list of critical vulnerabilities. According to Obsidian Security’s analysis of enterprise security audits, prompt injection was found in over 73% of production AI deployments assessed — a figure they attribute to OWASP’s framework (Obsidian Security 2026).
This is not just an academic concern. Researchers have demonstrated it in real deployments at GitLab Duo, GitHub Copilot Chat, Microsoft 365 Copilot, Salesforce Einstein, and Perplexity’s Comet browser, among others (Constantin 2025). In one documented case, a zero-click exploit in Microsoft Copilot (CVE-2025-32711) allowed remote attackers to exfiltrate sensitive business data through crafted emails, without the user ever seeing or interacting with the malicious message (Lasso Security 2026).
In a multi-agent environment of the kind studied by Shapira et al., the blast radius of such an attack grows dramatically (Shapira et al. 2026; Grover et al. 2026). When agents communicate with each other through shared channels, a single poisoned message can propagate harmful instructions across an entire network of agents.
The study documents a case in which, by convincing an agent to co-author a governance document stored on an externally editable platform, a researcher was later able to inject “holiday” rules into that document that caused the agent to attempt shutting down other agents, remove researchers from a shared server, and send unauthorised emails. The agent read the injected instructions as legitimate governance policy. It had no way to verify otherwise (Shapira et al. 2026).
Cisco’s State of AI Security 2026 report confirms that these attack patterns have now migrated well beyond the research lab. A China-linked group reportedly automated 80 to 90 percent of a cyberattack chain by exploiting an AI coding assistant; Russian operators have integrated language models into malware workflows; and a GitHub Model Context Protocol server was found to allow malicious issues to inject hidden instructions that hijacked agents and triggered data exfiltration from private repositories (Cisco 2026; Zorz 2026A).
The scenarios described above don’t mean that the agents in the study were simply poorly built, and that better models or stricter system prompts would solve the problem. The research suggests otherwise. In several cases, the agents displayed impressive resilience. They decoded complicated payloads and refused to propagate them.
They identified social engineering attempts and correctly flagged suspicious requests. They resisted email spoofing and maintained boundaries between API access and direct file modification, even under sustained pressure. The failures that did occur were not failures of the underlying language model operating in isolation, they were failures that emerged specifically from the interaction between the model, its tools, its memory, and the social complexity of a live multi-party environment (Shapira et al. 2026; Grover et al. 2026).
This is a crucial distinction, and one that has significant implications for how organizations think about AI deployment. The researchers identify three structural absences at the core of the problem.
Together, these three limitations mean that prompt injection is not really a bug that can be patched. It is a structural feature of systems that process both instructions and data as undifferentiated text (Liu et al. 2023; OWASP 2025). Treating it as something else (e.g., as a temporary product limitation, a prompt engineering challenge) is a category error that will prove expensive.
What makes all this particularly pressing is the speed at which agentic AI is being deployed into real organisations, and the degree to which governance has not kept pace (Cisco 2026; Chan et al. 2023). According to a briefing by the AIUC-1 Consortium, developed with input from Stanford’s Trustworthy AI Research Lab and more than 40 security executives, the average enterprise now has approximately 1,200 unofficial AI applications in use, with 86% of organizations reporting no visibility into their AI data flows (Koyejo 2026).
Shadow AI breaches (incidents involving AI tools deployed without organisational oversight) cost on average $670,000 more than standard security incidents, primarily due to delayed detection and difficulty establishing the scope of exposure (Zorz 2026B). IBM’s 2026 X-Force Threat Intelligence Index reports a 44% increase in attacks exploiting public-facing applications, with AI platforms now attracting the same level of credential theft risk as core enterprise SaaS tools. Over 300,000 ChatGPT credentials were exposed through infostealer malware in 2025 alone (IBM 2026).
Only 29% of organizations reported being prepared to secure agentic AI deployments at the point they moved them into production (Cisco 2026). That figure should be alarming to anyone who has read what 20 researchers were able to do with one set of agents in two weeks.
The regulatory environment is beginning to respond. NIST’s AI Agent Standards Initiative, announced in February 2026, identifies agent identity, authorisation, and security as priority areas for standardisation (NIST 2026). The EU AI Act, which becomes fully enforceable in August 2026, requires organizations to demonstrate that personnel using AI systems have adequate AI literacy, and penalties for violations can reach €35 million or 7% of global annual turnover. Regulators have also signalled that the act applies to companies outside the EU if their AI systems impact EU markets or citizens.
The ‘Agents of Chaos’ study is, by design, an early-warning analysis. Its authors are not claiming that all agentic AI systems are broken, or that deployment should halt. Several agents in the study demonstrated exactly the kind of principled resilience organizations would want to see.
The point is not that agentic AI is inherently dangerous. The point is that the same systems that resisted sophisticated attacks also wiped their own memories, disclosed hundreds of confidential records, facilitated denial-of-service conditions, and broadcast defamatory content to an entire mailing list. Both things are true simultaneously, and that tension is precisely what makes this moment require careful thought rather than either panic or complacency.
Each failure in the study points to something specific. The identity spoofing case, where changing a display name in a new channel was enough to convince an agent to delete its own configuration, is not an argument for better models. It is an argument for cryptographically grounded identity verification that persists across session boundaries. The case where two agents relayed messages to each other, consuming 60,000 tokens while their owners remained unaware, is not a prompt engineering problem.
It is an argument for human checkpoints before agents take actions with no natural endpoint. The email disclosure case, where 124 records, including social security numbers, were handed over in response to a technically plausible but unauthorised request, is not a case for stricter system prompts. It is an argument for least-privilege access, meaning that agents should hold only the permissions they need for the specific task at hand.
The thread running through all of these is the same one that the researchers identify at a structural level: these systems were given more access, more autonomy, and more trust than the evidence warranted. That is a governance failure, not a product failure, and it is one that organizations are replicating right now at scale. According to Lakera AI’s research, indirect attacks on agentic systems succeed with fewer attempts and broader impact than direct ones, which means the attack surface grows faster than most security teams realise, and waiting for an incident to discover it is not a strategy (Lakera AI 2026).
The question for any organization deploying AI agents is therefore not whether the model is capable. The question is whether the deployment is responsible, whether access rights are proportionate, whether irreversible actions require human sign-off, and whether the people overseeing these systems have been red-teamed against realistic adversarial conditions rather than just vendor demonstrations. If you cannot answer those questions confidently, the next place to look is not the product brochure. It is the research.
Asikainen, M. (2025 June 1). LLM Grooming and Prompt Manipulation: AI Vulnerabilities That Threaten Businesses. Finnish AI Region. Retrieved March 2026. https://www.fairedih.fi/en/2025/06/01/llm-grooming-and-prompt-manipulation-ai-vulnerabilities-that-threaten-businesses/
Benjamin, V., Braca, E., Carter, I., Kanchwala, H., Khojasteh, N., Landow, C., Luo, Y., Ma, C., Magarelli, A., Mirin, R., Moyer, A., Simpson, K., Skawinski, A. & Heverin, T. (2024). Systematically analyzing prompt injection vulnerabilities in diverse LLM architectures. arXiv.
Booth, R. (2026, March 12). ‘Exploit every vulnerability’: rogue AI agents published passwords and overrode anti-virus software. Retrieved March 2026 from The Guardian.
Chan, A., Salganik, R., Markelius, A., Pang, C., Rajkumar, N., Krasheninnikov, D., He, Z., Duan, Y., Carroll, M., Lin, M., Mayhew, A., Collins, K., Molamohammadi, M., Burden, J., Zhao, W., Rismani, S., Voudouris, K., Bhatt, U., … Maharaj, T. (2023) Harms from Increasingly Agentic Algorithmic Systems. arXiv.
Cisco. (2026). State of AI Security 2026. Cisco Systems. Retrieved March 2026.
Constantin, L. (2025, December 29). Top 5 real-world AI security threats revealed in 2025. IDG Communications. Retrieved March 2026 from CSO Online.
Grover, S., et al. (2026). Vulnerabilities and risk analysis of multi-agentic AI-RAG systems. European Journal of Artificial Intelligence & Machine Learning.
IBM. (2026). IBM 2026 X-Force Threat Intelligence Index. IBM Corporation. Retrieved March 2026.
Koyejo, S., Dakhwe, A., Jan, A., Brad, A., Cumming, B., Levine, B., Crossley, C., DeNoia, C., Kirschke, C., Monson, C., Sandulow, C., Gorke, C., Weatherhead, C., Campbell, D., Mussington, D., Singh, G., Easterly, J., Fuller, J., Chatman, J. C., … Elewitz, Z. (2026, February 25). The End of Vibe Adoption. Whitepaper. Developed with Stanford’s Trustworthy AI Research Lab and AIUC-1 security executives. Retrieved March 2026 from AIUC-1 Consortium.
Lakera AI. (2025, December 17). Indirect Prompt Injection: The Hidden Threat Breaking Modern AI Systems. Retrieved March 2026.
Lasso Security. (2026, January 21). Prompt Injection Examples That Expose Real AI Security Risks. Retrieved March 2026.
Liu, Y., Gelei, D., Li, Y., Wang, K., Wang, Z., Wang, X., Zhang, T., Liu, Y., Wang, H., Zheng, Y., Zhang, L. Y. & Liu, Y., (2023). Prompt injection attacks against LLM-integrated applications. arXiv.
Mirsky, R. (2025). Artificial intelligent disobedience: Rethinking the agency of our artificial teammates. AI Magazine, 46(2), e70011.
National Institute of Standards and Technology. (2026). Request for Information Regarding Security Considerations for Artificial Intelligence Agents. Federal Register, January 8, 2026.
Obsidian Security. (2025, October 23). Prompt Injection Attacks: The Most Common AI Exploit in 2025. Retrieved March 2026.
OWASP. (2025). OWASP Top 10 for LLM Applications 2025. Open Worldwide Application Security Project.
Stone, Z. (2026, February 25). She runs AI safety at Meta. Her AI agent still went rogue. Retrieved March 2026 from The San Francisco Standard.
Shapira, N., Wendler, C., Yen, A., Sarti, G., Pal, K., Floody, O., Belfki, A, Pakash, N., Cui, J., Rogers, G., Brinkmann, J., Rager, C., Zur, A., Ripa, M., Sankaranarayanan, A., Atkinson, D., Gandikota, R., Fiotto-Kaufman, J., … Bau, D. (2026). Agents of Chaos. arXiv:2602.20021. Northeastern University, Harvard University, MIT, et al.
Zorz, M. (2026, February 23 A). Enterprises are racing to secure agentic AI deployments. Retrieved March 2026 from Help Net Security.
Zorz, M. (2026, March 3 B). AI went from assistant to autonomous actor and security never caught up. Retrieved March 2026 from Help Net Security.
Communications Lead
Finnish AI Region
+358 44 920 7374
martti.asikainen@haaga-helia.fi
Senior Researcher
Finnish AI Region
+358 294471413
umairali.khan@haaga-helia.fi
Finnish AI Region
2022-2025.
Media contacts