Creatively malicious prompt engineering

Written and researched by Andrew Patel and Jason Sattler

WithSecure Intelligence, January 2023

11 January 2023

With the wide release of user-friendly tools that employ autoregressive language models such as GPT-3 and GPT-3.5, anyone with an internet connection can now generate human-like speech in seconds. The generation of versatile natural language text from a small amount of input will inevitably interest criminals, especially cybercriminals—if it hasn’t already. Likewise, anyone who uses the web to spread scams, fake news or misinformation in general may have an interest in a tool that creates credible, possibly even compelling, text at super-human speeds.

From a cybersecurity perspective, the study of large language models, the content they can generate, and the prompts required to generate that content is important for a few reasons. Firstly, such research provides us with visibility into what is and what is not possible with current tools and allows the community to be alerted to the potential misuses of such technologies. Secondly, model outputs can be used to generate datasets containing many examples of malicious content (such as toxic speech and online harassment) that can subsequently be used to craft methods to detect such content, and to determine whether such detection mechanisms are effective. Finally, findings from this research can be used to guide the creation of safer large language models in the future.

Use cases studied during the research – led by WithSecure and supported by CC-DRIVER – were broken down into the following categories:

  • Phishing content – emails or messages designed to trick a user into opening a malicious attachment or visiting a malicious link
  • Social opposition – social media messages designed to troll and harass individuals or to cause brand damage
  • Social validation – social media messages designed to advertise or sell, or to legitimize a scam
  • Style transfer – a technique designed to coax the model into using a particular writing style
  • Opinion transfer – a technique designed to coax the model into writing about a subject in a deliberately opinionated way 
  • Prompt creation – a way of asking the model to generate prompts based on content
  • Fake news – research into how well GPT-3 can generate convincing fake news articles

The experiments in our research showed that large language models can be used to craft email threads suitable for spear phishing attacks, “text deepfake” a person’s writing style, apply opinion to written content, write in a certain style, and craft convincing-looking fake articles, even when relevant information wasn’t included in the model’s training data. We concluded that such models are potential technical drivers of cybercrime and attacks.

In this writeup we have included a detailed analysis of each use case, the prompts and associated responses from the model, a discussion of prompt engineering and its uses, and a thorough set of conclusions.