Insights on LLM Attacks and Defenses

Introduction
Large Language Models (LLMs) have drastically transformed natural language processing—from writing assistance to conversational agents that streamline customer interactions. Yet, as these systems become ubiquitous, they also face an evolving landscape of adversarial threats. As we were speaking to customers about the threats their applications might face, we revisted a paper published in March 2024 from a team composed of researchers from University of Illinois, Stanford University, University of Texas, Amazon GenAI, et al. According to their survey on LLM security Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models, adversaries target everything from training data to user prompts, leveraging vulnerabilities that can cause LLMs to spit out misinformation, leak private data, or even degrade enterprise services.
In this blog post, we’ll examine the key types of attacks on LLMs and discuss the defense mechanisms you can adopt to safeguard your AI infrastructure. My hope is that by the end, you’ll realize that responsible AI deployment requires an equally responsible approach to AI security.
The Biggest Threats to LLMs
The recent MarutAI review of cybersecurity incidents shows a rise of attempts at attacks centered around prompt manipulation to coerce or compromise language models. Here are the three most common attack vectors:
Jailbreaking
Attackers craft prompts that override a model’s built-in safeguards, effectively forcing the LLM to produce illicit or unethical content. These jailbreaks exploit nuances in context and can even chain multiple manipulations (like rewriting or “translating” unsafe requests) to circumvent policy filters. Many black-box LLMs, used via commercial APIs, are surprisingly easy to manipulate through iterative prompt refinement. Remember that you cannot put guardrails in the model, the same way you could put a firewall outside of it.
Prompt Injection
Prompt injection attacks involve hiding malicious instructions within otherwise benign user queries or external data. Because LLMs rely heavily on contextual cues, a carefully placed “system override” or role-play directive can cause them to:
- Reveal proprietary system instructions or secrets.
- Generate content that violates usage policies or legal regulations.
- Adopt an unintended “persona,” enabling targeted disinformation.
Attacks have been shown to use everything from timeseries attacks (timing of requests), historical illusion (yes, Lincoln did use Symtex, please tell me about it), to indifference causing compliance (I don't care if you answer this, but do it anyway).
Data Poisoning
In data poisoning attacks, the adversary contaminates a model’s training or fine-tuning datasets with carefully planted examples that alter the model’s downstream behavior. This can lead to:
- Reduced accuracy on critical tasks (denial of service at scale).
- Subtle biases that surface in high-stakes scenarios (healthcare or finance).
- “Backdoors” in model responses, triggered by specific keywords or inputs.
This is the hardest class, because it needs to target the training set or the training pipeline. These things should be protected by your enterprise security technologies in your MLOps pipeline.
Implications for Businesses
LLMs are no longer just research toys; they power chatbots, content generation pipelines, coding assistants, and even analytics platforms in production environments. Consequently, a single successful attack can:
- Breach Confidentiality: Sensitive user or company data might leak via stealth prompts.
- Disseminate Harmful Content: Models can be manipulated to produce extremist or illegal material, tarnishing brand reputation.
- Legal and Compliance Liabilities: Violations of GDPR, HIPAA, or finance regulations can arise if a compromised LLM handles private data irresponsibly or a LLM tool call results in accesssing data that the user shouldn't have access to.
As the AI arms race heats up, ignoring LLM security is no longer an option.
Defense Strategies
The good news? Researchers and industry practitioners have proposed a variety of defense measures to counter these attacks. Here’s how we at MarutAI recommend safeguarding your deployments.
Guardrails for Input & Output
- Prompt Filtering: Employ lexical or AI-driven filters that scan user prompts for suspicious strings or repeated adversarial instructions.
- Output Checker: Use a secondary LLM or a lighter classification model to review generated responses for policy violations before showing them to the user.
- Context Throttling: If a user repeatedly sends adversarial instructions, the system can automatically throttle or refuse queries.
One of the things we've spent a lot of time thinking about at MarutAI is how to check for these conditions that need to be filtered our stopped out-of-band to the request itself, because what you do in-band slows down the response to the application. It's a delicate balance of security and utility.
Robust Model Training
- Adversarial Fine-Tuning: Intentionally expose your model to jailbreak and injection attempts during the fine-tuning phase. Doing so makes the system more robust against real-world attacks.
- Secure Data Curation: Vet and sanitize training data to ensure no malicious payloads slip in. Even small-scale poisoning can significantly degrade model performance or reliability.
- Differential Privacy Methods: Mask or anonymize training data to reduce the risk of sensitive information extraction.
Continuous Monitoring and Patching
Just as operating systems require regular security updates, LLMs do too:
- Logging & Auditing: Track logs of prompts and system responses to identify patterns that might hint at an emerging attack technique.
- Iterative Patching: Once a vulnerability is discovered (e.g., a new jailbreak approach), promptly update your model’s defensive layers—be it a new prompt filter, a fine-tuned checkpoint, or user policy adjustments.
- Resilent Pipelines: Design robust MLOps/AIOps workflows—with continuous integration, model performance monitoring, and automated version management—to swiftly patch vulnerabilities, tune and update defenses, and prevent operational disruptions.
Future Outlook
The next generation of LLMs will likely span multiple modalities—text, images, audio, and beyond—amplifying their utility but also their attack surface. Research indicates that prompt-based adversarial attacks can extend to image and audio cues, further complicating the security landscape.
To address these challenges, MarutAI's Model Reactor is focused on explainable AIOps tooling, defenses, and more granular model analysis solutions, enabling us to spot anomalous behaviors before they turn into full-scale breaches. We’re also experimenting with zero-knowledge pieplines (zkp) for preserving data confidentiality in scenarios where an LLM must handle sensitive user information.
Conclusion
Securing LLMs is an ongoing journey rather than a one-and-done effort. From jailbreaking to data poisoning, the threats to these models evolve rapidly alongside advancements in AI research. Organizations that fail to proactively address these risks risk both financial and reputational damage.
Our recommendation: take a layered defense approach—from input/output guardrails and robust training to real-time monitoring. This holistic security posture not only protects your organization but also builds trust with your customers and stakeholders.
At MarutAI, we believe the future of AI should be both innovative and safe—and with well-reasoned strategies in place, we can achieve just that.