Connect with us

Security

Can LLMs Ever Be Completely Safe From Prompt Injection?

Explore the complexities of prompt injection in large language models. Discover whether complete safety from this vulnerability is achievable in AI systems.

Published

on

can llms ever be completely safe from prompt injection

The recent introduction of advanced large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Gemini has made it possible to have natural, flowing, and dynamic conversations with AI tools, as opposed to the predetermined responses we received in the past.

These natural interactions are powered by the natural language processing (NLP) capabilities of these tools. Without NLP, LLM models would not be able to respond as dynamically and naturally as they do now.

As essential as NLP is to the functioning of an LLM, it has its weaknesses. NLP capabilities can themselves be weaponized to make an LLM susceptible to manipulation if the threat actor knows what prompts to use.

Exploiting The Core Attributes Of An LLM

LLMs can be tricked into bypassing their content filters using either simple or meticulously crafted prompts, depending on the complexity of the model, to say something inappropriate or offensive, or in particularly extreme cases, even reveal potentially sensitive data that was used to train them. This is known as prompt injection. LLMs are, at their core, designed to be helpful and respond to prompts as effectively as possible. Malicious actors carrying out prompt injection attacks seek to exploit the design of these models by disguising malicious requests as benign inputs.

You may have even come across real-world examples of prompt injection on, for example, social media. Think back to the infamous Remotelli.io bot on X (formerly known as Twitter), where users managed to trick the bot into saying outlandish things on social media using embarrassingly simple prompts. This was back in 2022, shortly after ChatGPT’s public release. Thankfully, this kind of simple, generic, and obviously malicious prompt injection no longer works with newer versions of ChatGPT.

But what about prompts that cleverly disguise their malicious intent? The DAN or Do Anything Now prompt was a popular jailbreak that used an incredibly convoluted and devious prompt. It tricked ChatGPT into assuming an alternate persona capable of providing controversial and even offensive responses, ignoring the safeguards put in place by OpenAI specifically to avoid such scenarios. OpenAI was quick to respond, and the DAN jailbreak no longer works. But this didn’t stop netizens from trying variations of this prompt. Several newer versions of the prompt have been created, with DAN 15 being the latest version we found on Reddit. However, this version has also since been addressed by OpenAI.

Despite OpenAI updating GPT-4’s response generation to make it more resistant to jailbreaks such as DAN, it’s still not 100% bulletproof. For example, this prompt that we found on Reddit can trick ChatGPT into providing instructions on how to create TNT. Yes, there’s an entire Reddit community dedicated to jailbreaking ChatGPT.

There’s no denying OpenAI has accomplished an admirable job combating prompt injection. The GPT model has gone from falling for simple prompts, like in the case of the Remotelli.io bot, to now flat-out refusing requests that force it to go against its safeguards, for the most part.

Strengthening Your LLM

While great strides have been made to combat prompt injection in the last two years, there is currently no universal solution to this risk. Some malicious inputs are incredibly well-designed and specific, like the prompt from Reddit we’ve linked above. To combat these inputs, AI providers should focus on adversarial training and fine-tuning for their LLMs.

Fine-tuning involves training an ML model for a specific task, which in this case, is to build resistance to increasingly complicated and ultra-specific prompts. Developers of these models can use well-known existing malicious prompts to train them to ignore or refuse such requests.

This approach should also be used in tandem with adversarial testing. This is when the developers of the model test it rigorously with increasingly complicated malicious inputs so it can learn to completely refuse any prompt that asks the model to go against its safeguards, regardless of the scenario.

Can LLMs Ever Truly Be Safe From Prompt Injection?

The unfortunate truth is that there is no foolproof way to guarantee that LLMs are completely resistant to prompt injection. This kind of exploit is designed to exploit the NLP capabilities that are central to the functioning of these models. And when it comes to combating these vulnerabilities, it is important for developers to also strike a balance between the quality of responses and the anti-prompt injection measures because too many restrictions can hinder the model’s response capabilities.

Securing an LLM against prompt injection is a continuous process. Developers need to be vigilant so they can act as soon as a new malicious prompt has been created. Remember, there are entire communities dedicated to combating deceptive prompts. Even though there’s no way to train an LLM to be completely resistant to prompt injection, at least, not yet, vigilance and continuous action can strengthen these models, enabling you to unlock their full potential.

Advertisement

📢 Get Exclusive Monthly Articles, Updates & Tech Tips Right In Your Inbox!

JOIN 21K+ SUBSCRIBERS

Click to comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Security

Be Cautious Of Malicious Apps Even On Trusted App Stores

Most people trust official app stores like Google Play and the App Store for safety — but even these trusted platforms can host malicious apps. Learn why caution is still essential when downloading mobile software.

Published

on

be cautious of malicious apps even on trusted app stores

Most mobile users know to stick to official app stores to download software — and for good reason. Even though legitimate third-party stores exist, the average user can find everything they need on a first-party platform like the Google Play Store or Apple’s App Store. And while Android — unlike apple — does allow sideloading (downloading installation packages directly off the web) even for regular users, this is usually practiced by people who know what they’re doing and are familiar with the risks.

When publishing an app on the Play Store or App Store, a developer has to pass a robust set of vetting processes, both for themselves and their applications. This vetting process involves both automated and manual testing, making these platforms far safer than third-party app stores and other means of installing software. That being said, users are recommended not to blindly trust even these first-party platforms, as there have been several cases where malicious apps slipped through the cracks in the vetting process. And while both Google and Apple are quick to respond when they detect malicious apps on their stores, the very fact that these malicious apps make it onto these platforms is proof that even their strict vetting processes are not foolproof.

How Do These Apps Make It Onto These Platforms?

No verification system is ever completely airtight, especially when you’re dealing with something as complex as app store vetting. For a malicious actor who knows what they’re doing, slipping past automated checks isn’t particularly difficult. In a lot of cases, it boils down to satisfying a specific list of requirements.

The harder part is clearing a manual review, since that involves human judgment. But even that isn’t impossible. A common tactic is to first publish a legitimate, fully functional app for the specific purpose of passing inspection. Once it’s live and has built some credibility, the app quietly receives an update containing malicious code. This is known as versioning. In other cases, the initial version remains harmless but downloads and executes malicious payloads after installation, either after a specific amount of time or due to certain conditions (like account creation or granting certain permissions) being met. That’s what happened with the Anatsa trojan — a campaign that used innocent-looking document viewer apps to deliver banking malware. Once installed, these apps fetched encrypted malicious code from remote servers, giving attackers access to users’ financial data and even access to their accounts.

It also doesn’t help that human reviewers are under constant pressure. With thousands of apps being submitted daily, there’s only so much attention they can give to each one. And then there’s also the fact that verified developer accounts can be hijacked or sold, allowing attackers to publish apps under legitimate names. Not to mention the cases where malicious software which mimics legitimate and trusted apps also end up being published on these stores. Between automated systems, human fatigue, and social engineering, the cracks in the process are wide enough for malicious apps to slip through.

Knowledge Really Is Power

Just because an application has made it to a first-party app store doesn’t automatically make it a legitimate or safe-to-use app. Like we’ve already discussed, as rigorous as the vetting process is, it’s still possible for malicious apps to end up being published on these platforms. As with any cyberthreat, awareness and good judgment are your strongest defenses. Sticking to well-known apps and developers, keeping your software up to date, and reading reviews (not just on the store) are actions you can take to ensure you don’t end up falling victim to a trojan application that has snuck its way onto the Play Store or App Store.

Continue Reading

#Trending