Security
Can LLMs Ever Be Completely Safe From Prompt Injection?
Explore the complexities of prompt injection in large language models. Discover whether complete safety from this vulnerability is achievable in AI systems.
The recent introduction of advanced large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Gemini has made it possible to have natural, flowing, and dynamic conversations with AI tools, as opposed to the predetermined responses we received in the past.
These natural interactions are powered by the natural language processing (NLP) capabilities of these tools. Without NLP, LLM models would not be able to respond as dynamically and naturally as they do now.
As essential as NLP is to the functioning of an LLM, it has its weaknesses. NLP capabilities can themselves be weaponized to make an LLM susceptible to manipulation if the threat actor knows what prompts to use.
Exploiting The Core Attributes Of An LLM
LLMs can be tricked into bypassing their content filters using either simple or meticulously crafted prompts, depending on the complexity of the model, to say something inappropriate or offensive, or in particularly extreme cases, even reveal potentially sensitive data that was used to train them. This is known as prompt injection. LLMs are, at their core, designed to be helpful and respond to prompts as effectively as possible. Malicious actors carrying out prompt injection attacks seek to exploit the design of these models by disguising malicious requests as benign inputs.
You may have even come across real-world examples of prompt injection on, for example, social media. Think back to the infamous Remotelli.io bot on X (formerly known as Twitter), where users managed to trick the bot into saying outlandish things on social media using embarrassingly simple prompts. This was back in 2022, shortly after ChatGPT’s public release. Thankfully, this kind of simple, generic, and obviously malicious prompt injection no longer works with newer versions of ChatGPT.
But what about prompts that cleverly disguise their malicious intent? The DAN or Do Anything Now prompt was a popular jailbreak that used an incredibly convoluted and devious prompt. It tricked ChatGPT into assuming an alternate persona capable of providing controversial and even offensive responses, ignoring the safeguards put in place by OpenAI specifically to avoid such scenarios. OpenAI was quick to respond, and the DAN jailbreak no longer works. But this didn’t stop netizens from trying variations of this prompt. Several newer versions of the prompt have been created, with DAN 15 being the latest version we found on Reddit. However, this version has also since been addressed by OpenAI.
Despite OpenAI updating GPT-4’s response generation to make it more resistant to jailbreaks such as DAN, it’s still not 100% bulletproof. For example, this prompt that we found on Reddit can trick ChatGPT into providing instructions on how to create TNT. Yes, there’s an entire Reddit community dedicated to jailbreaking ChatGPT.
There’s no denying OpenAI has accomplished an admirable job combating prompt injection. The GPT model has gone from falling for simple prompts, like in the case of the Remotelli.io bot, to now flat-out refusing requests that force it to go against its safeguards, for the most part.
Strengthening Your LLM
While great strides have been made to combat prompt injection in the last two years, there is currently no universal solution to this risk. Some malicious inputs are incredibly well-designed and specific, like the prompt from Reddit we’ve linked above. To combat these inputs, AI providers should focus on adversarial training and fine-tuning for their LLMs.
Fine-tuning involves training an ML model for a specific task, which in this case, is to build resistance to increasingly complicated and ultra-specific prompts. Developers of these models can use well-known existing malicious prompts to train them to ignore or refuse such requests.
This approach should also be used in tandem with adversarial testing. This is when the developers of the model test it rigorously with increasingly complicated malicious inputs so it can learn to completely refuse any prompt that asks the model to go against its safeguards, regardless of the scenario.
Can LLMs Ever Truly Be Safe From Prompt Injection?
The unfortunate truth is that there is no foolproof way to guarantee that LLMs are completely resistant to prompt injection. This kind of exploit is designed to exploit the NLP capabilities that are central to the functioning of these models. And when it comes to combating these vulnerabilities, it is important for developers to also strike a balance between the quality of responses and the anti-prompt injection measures because too many restrictions can hinder the model’s response capabilities.
Securing an LLM against prompt injection is a continuous process. Developers need to be vigilant so they can act as soon as a new malicious prompt has been created. Remember, there are entire communities dedicated to combating deceptive prompts. Even though there’s no way to train an LLM to be completely resistant to prompt injection, at least, not yet, vigilance and continuous action can strengthen these models, enabling you to unlock their full potential.