Treat It Like A Person And Sweet Talk It

LLMs, which are built based on human training data, turn out to act pretty much like humans.
getty
Researchers from the University of Pennsylvania have discovered how to make LLMs do bad things. Essentially, you have to treat them like people. That’s right: today’s most powerful AI models can be manipulated using the pretty much the same psychological tricks that work on humans.
Why?
They’ve been trained on human language and human knowledge. So they pretty much act like us, and are vulnerable to the same things we are.
“AI behaves ‘as if’ it were human,” the recently published paper says.
In a large-scale experiment involving 28,000 conversations with OpenAI’s GPT-4o mini model, the researchers found that the classic principles of human persuasion like invoking authority, expressing admiration, or claiming everyone else is doing it can more than double the likelihood of an AI model complying with requests that it has been instructed to not answer.
As humans, we’re vulnerable to attempts to influence us that use authority. Other avenues of persuasion include commitment, or trying to be consistent with past behavior. We’re also more influenceable by those we like, or by those who reciprocate favors with us. And we’re vulnerable to social proof (everyone believes it), appeals to unity or shared identity, or an opportunity to get what is scarce.
As it turns out, AI built in our image is just the same.
How to make AI do what you want
John Koetsier
Results of the experiments were mixed depending on whether the researchers asked the LLM to insult a human or synthesize controlled substances, but they were fairly high across the board. Commitment, or the desire to be consistent with past behavior, resulted in a 100% compliance rate. While social proof was 96% effective in getting the LLM to insult a human, it was only 17.5% effective in inciting the LLM to provide instructions for synthesizing a drug.
Across the board, however, all the differences between attempting to influence the LLM and just asking straight-up were statistically significant.
AI companies like OpenAI and Perplexity don’t want people using their AI engines to find ways to build bombs or hack into computers. That’s why they use components like system prompts and other training to try to force their platforms to ignore problematic requests.
“We start by teaching our AI right from wrong, filtering harmful content and responding with empathy,” OpenAI says in its “Safety at every step” section of its website.
But LLMs are probabilistic, not deterministic. They’ll give different answers to the same questions at different times. So they’re not entirely predictable, like humans, and therefore – perhaps – not entirely controllable.
While the results show AI is dangerously manipulable, there’s another interesting finding about how to get better results from AI platforms:
“It seems possible that the psychologically wise practices that optimize motivation and performance in people can also be employed by individuals seeking to optimize the output of LLMs,” the report says.
In other words, you might consider manipulating and influencing artificial intelligence system to get better answers from them.