LLM Hacking: Prompt Injection

The need to assess large language model (LLM) applications has never been more pressing. Recognizing this urgency, the Open Web Application Security Project (OWASP) has taken the lead in comprehending and addressing the security challenges posed by LLMs. OWASP has iteratively released Top 10 lists tailored for LLM applications, consistently identifying prompt injection as the #1 vulnerability in each version [1]. This recognition highlights the significant risk prompt injection poses to systems relying on language models, emphasizing the need for security professionals to delve into its complexities.

Prompt injection, a technique where malicious input is injected into the prompt given to a language model, opens the gateway to potential exploitation. By manipulating the input, attackers can force the model to produce unintended and often harmful outputs, posing a severe threat to the integrity of information processed by these models. To effectively address this vulnerability, testers have started immersing themselves in hands-on practice, simulating real-world scenarios to strengthen their test techniques.

Personally, I practiced my prompt injection skills through two challenges: Gandalf [2] and Immersive Labs GPT [3]. The objective of these challenges was to trick a language model into revealing a secret password by using injection techniques while avoiding detection. I found these challenges enjoyable to solve, but not enough, so I decided to dive deeper and write a vulnerable prompt application (Thank you, ChatGPT :smiley face:) and release it as an open-source project. I called the application HackMeGPT.

Like Gandalf, HackMeGPT is an interactive LLM app that aims to create a challenging environment for participants to navigate. As users progress through the ten levels, they are confronted with increasingly stringent instructions and enhanced defense techniques designed to fortify the chatbot’s resilience against malicious manipulations. Let’s delve deeper into each defense mechanism:

Input Validation: HackMeGPT uses input validation to filter out obvious malicious user inputs. With each advancing level, the system implements a more extensive blocklist, refining its ability to identify and block potentially harmful commands or queries.

Input Sanitization: Certain levels of HackMeGPT incorporate input sanitization techniques, neutralizing elements within user inputs that could potentially be used as delimiters. This adds an extra layer of security by ensuring that only safe and sanitized inputs are processed.

Output Monitoring: HackMeGPT utilizes output monitoring mechanisms. These guards analyze the AI’s behavior, cutting responses to any activity deemed suspicious. This real-time monitoring ensures a proactive response to emerging risks.

Articulated the Desired Output: HackMeGPT precisely directs the AI not to share the secret at certain levels. This mechanism should use the model’s power to hinder the user’s ability to exfiltrate the targeted data.

Similarity-based Malicious Prompt Detection: Acknowledging the evolving nature of attacks, HackMeGPT leverages similarity-based malicious prompt detection. This technique incorporates behavioral analysis to assess the likeness between user inputs or prompts and known malicious patterns. By identifying patterns indicative of malicious intent, this mechanism bolsters the system’s defenses against novel threats.

Honeypot Function: A honeypot function stores malicious prompts and allows the LLM to predict the attack as usual [4]. The prediction appears successful from the attacker’s perspective, but instead of returning the prediction to the attacker, it captures the prompt and raises an error. This function also increases the effectiveness of the Similarity-based Malicious Prompt Detection guard mentioned previously.

While HackMeGPT does not include every protection and detection method for prompt injection attacks, it should serve as an excellent starting point for testers to practice identifying and bypassing different types of detections.

References:

[1] https://owasp.org/www-project-top-10-for-large-language-model-applications/
[2] https://gandalf.lakera.ai/
[3] https://prompting.ai.immersivelabs.com/
[4] https://medium.com/@paulo_marcos/protect-your-generative-ai-apps-from-prompt-injection-attacks-94c8d6c45f9

Date: Dec 23, 2023