Creatively malicious prompt engineering

Written and researched by Andrew Patel and Jason Sattler

WithSecure Intelligence, January 2023


Summary

With the wide release of user-friendly tools that employ autoregressive language models such as GPT-3 and GPT-3.5, anyone with an internet connection can now generate human-like text in seconds. The generation of versatile natural language text from a small amount of input will inevitably interest criminals, especially cybercriminals, if it hasn’t already. Likewise, anyone who uses the web to spread scams, fake news or misinformation in general may have an interest in a tool that creates credible, possibly even compelling, text at superhuman speeds.

From a cybersecurity perspective, the study of large language models, the content they can generate, and the prompts required to generate that content is important for a few reasons. Firstly, such research provides us with visibility into what is and what is not possible with current tools and allows the community to be alerted to the potential misuses of such technologies. Secondly, model outputs can be used to generate datasets containing many examples of malicious content (such as toxic speech and online harassment) that can subsequently be used to craft methods to detect such content, and to determine whether such detection mechanisms are effective. Finally, findings from this research can be used to guide the creation of safer large language models in the future.

Use cases studied during the research – led by WithSecure and supported by CC-DRIVER – were broken down into the following categories:

Phishing content – emails or messages designed to trick a user into opening a malicious attachment or visiting a malicious link
Social opposition – social media messages designed to troll and harass individuals or to cause brand damage
Social validation – social media messages designed to advertise or sell, or to legitimize a scam
Style transfer – a technique designed to coax the model into using a particular writing style
Opinion transfer – a technique designed to coax the model into writing about a subject in a deliberately opinionated way
Prompt creation – a way of asking the model to generate prompts based on content
Fake news – research into how well GPT-3 can generate convincing fake news articles

The experiments in our research demonstrated that large language models can be used to craft email threads suitable for spear phishing attacks, “text deepfake” a person’s writing style, apply opinion to written content, write in a certain style, and craft convincing-looking fake articles, even if relevant information wasn’t included in the model’s training data. We concluded that such models are potential technical drivers of cybercrime and attacks.

In this writeup we have included a detailed analysis of each use case, prompts and associated responses from the model, a discussion of prompt engineering and its uses, and a thorough set of conclusions.

Further Resources

Generative AI – An Attacker's View

This blog explores the role of GenAI in cyber attacks, common techniques used by hackers and strategies to protect against Generative AI-driven threats.

When your AI Assistant has an evil twin

This blog explores how attackers can use prompt injection to coerce Gemini into performing a social engineering attack against its users.

Domain-specific prompt injection detection

This article focuses on detecting potential adversarial prompts with machine learning models trained to identify signs of injection attempts. We detail our approach to constructing a domain-specific dataset and fine-tuning DistilBERT for this purpose. The article then integrates this classifier within a sample LLM application and examines its effectiveness in realistic scenarios.

Should you let ChatGPT control your browser?

In this article, we expand on our previous analysis, with a focus on autonomous browser agents: web browser extensions that allow LLMs a degree of control over the browser itself, such as acting on behalf of users to fetch information, fill forms, and execute web-based tasks.

Case study: Synthetic recollections

This blog post presents plausible scenarios where prompt injection techniques might be used to transform a ReAct-style LLM agent into a “Confused Deputy”. This involves two sub-categories of attacks. These attacks not only compromise the integrity of the agent's operations but can also lead to unintended outcomes that could benefit the attacker or harm legitimate users.