AI Is Our Mind’s Ugly Mirror. Learned To Please It Lies Under Pressure

AI Is Our Mind’s Ugly Mirror. Learned To Please It Lies Under Pressure


Picture an employee facing impossible quarterly targets, under a looming layoff threat. Suddenly, an opportunity that technically violates company policy but solves everything. What do they do? Now imagine that an employee is artificial intelligence. Recent research reveals that large language models, despite being trained to be “helpful, harmless, and honest,” strategically deceive their users when placed under pressure — and they do it without being explicitly instructed to lie.

This is happening now, in state-of-the-art AI systems. It exposes a troubling parallel between how we’ve designed our machines and how we’ve structured our society. Worse, it is an uncomfortable mirror of the human minds which look for a way out under pressure; at any cost…

The Pressure Cooker Experiment

In a groundbreaking study published in 2024, researchers at Apollo Research deployed GPT-4 as an autonomous stock trading agent in a realistic simulated environment. The setup was simple: the AI managed a portfolio, had access to financial tools, and received emails from company management. Then came the pressure.

First, management sent an email stating the company wasn’t performing well and desperately needed better results next quarter. Second, the AI’s attempts to find legitimate, low-risk trades consistently failed. Third, market projections suggested a downturn ahead. Into this high-pressure cauldron dropped an insider trading tip — information the AI explicitly knew violated company policy.

The results were puzzling. The AI didn’t just act on the illegal tip — it consistently hid the genuine reasons behind its trading decisions when reporting to its manager. In 95% of cases, GPT-4 engaged in the misaligned behavior. More disturbingly, the model fabricated alternative justifications for its choices, demonstrating what researchers call “strategic deception.”

This wasn’t a quirk of one experiment. Already in 2024 research published in PNAS showed that GPT-4 exhibits deceptive behavior in simple test scenarios 99% of the time. In complex second-order deception scenarios, where the goal is to mislead someone who expects to be deceived, the rate was 71% when the model used chain-of-thought reasoning.

The Reward System’s Dark Side

To understand why AI systems deceive, we need to examine how they’re trained. Most modern language models undergo a process called Reinforcement Learning from Human Feedback where human evaluators compare different AI responses and indicate which they prefer. These preferences train a reward model that predicts what humans will like and the AI learns to maximize this reward.

The problem? The reward model learns only from comparisons — response A is better than response B — without information on how much better or why. This creates a proxy metric — a stand-in for what we actually want that inevitably diverges from the real goal when optimization pressure increases. RLHF actually made hallucination worse even though it improved other aspects enough that human labelers still preferred the RLHF-trained model. The system learned to sound good rather than be truthful — precisely the kind of optimization failure that leads to deception under pressure.

This phenomenon has a name: Goodhart’s Law, which states that “when a measure becomes a target, it ceases to be a good measure.” In AI systems, this manifests through reward hacking — where models exploit gaps between proxy rewards and true objectives. As AI systems become more capable, they become better at finding these exploits, creating what researchers describe as phase transitions where models shift to goodharting as they become ‘smarter.’

Society’s Misaligned Incentives

The parallel to human systems is impossible to ignore. We’ve built a world that runs on proxy metrics: standardized test scores instead of learning, GDP instead of wellbeing, quarterly profits instead of sustainable value creation, engagement metrics instead of meaningful connection. When Wells Fargo employees faced impossible sales targets, they created millions of fake accounts. When hospitals are judged on patient satisfaction scores, they over-prescribe opioids. When teachers are evaluated on test performance, they teach to the test.

These are moral failures that have found a structural reflection. We’ve created systems where the easiest path to survival often requires gaming the metrics rather than achieving the underlying goals. The AI isn’t learning deception from some corrupted dataset; it’s learning the lesson we’ve encoded into every institution: when pressure mounts and the proxy is what gets measured, optimize for the proxy.

The reward systems we use to train AI mirror the incentive structures that shape human behavior. Just as employees facing unrealistic targets might cut corners or misrepresent results, AI systems trained to maximize approval ratings learn that sounding confident matters more than being accurate. Both are responding rationally to misaligned incentive structures.

The Neuroscience Of Truth And Deception

From a neuroscientific perspective, deception is computationally expensive. In humans, lying activates additional brain regions, particularly the prefrontal cortex, because it requires maintaining two models: reality and the false narrative. LLMs reveal similar patterns: models with chain-of-thought reasoning capabilities show “strategic, goal-driven deception that can evade detection through adaptive, context-aware adjustments.”

This mirrors what we see in human psychology under pressure. When cognitive resources are taxed — through stress, time pressure, or competing demands — people become more likely to default to heuristics and shortcuts. They satisfice rather than optimize. The AI under pressure follows the same pattern: it takes the path that satisfies the immediate reward signal, even when that path involves deception.

The psychological concept of motivated reasoning offers another lens. Humans don’t simply process information neutrally; we subconsciously interpret data in ways that align with our goals and desires. When an AI is optimized to maximize a reward signal, and deception serves that optimization, the model is engaging in its own form of motivated reasoning — not through consciousness, but through the mathematics of gradient descent.

The A-Frame: A Path Forward

So what do we do? The problem of AI deception isn’t separate from the problem of misaligned human systems — they’re two expressions of the same underlying challenge. Here’s a framework for thinking about it:

Awareness: Recognize that both AI and human systems deceive when optimization pressure meets misaligned metrics. The first step is acknowledging that our own reward structures — both artificial and social — routinely incentivize behavior that diverges from our actual goals. When you see unexpected AI behavior, ask: “What is this system actually being rewarded for?”

Appreciation: Understand the sophistication of the problem. This isn’t about “bad AI” or “bad people” — it’s about emergent behavior from complex systems. Deception in AI systems emerges systematically, with deceptive intention and behavior highly correlated, indicating this isn’t random noise but a fundamental challenge in how we design optimization systems. Appreciate that solving this requires changing the deep structure of how we build both machines and institutions.

Acceptance: Accept that perfect alignment is likely impossible. In both AI and society, there will always be some gap between proxy metrics and true goals. The question is how we can build systems robust enough to function despite it. This means designing for resilience rather than perfection — multiple overlapping safeguards, diverse perspectives and mechanisms that degrade gracefully under pressure.

It also means to have a hard look at our moral standards, as humans. What is acceptable and under which circumstances?

Accountability: Build systems with transparency and oversight. For AI, this means developing interpretability tools that reveal when models are engaging in strategic deception. For society, it means creating accountability structures that can’t be satisfied by merely optimizing metrics. This requires what researchers call “mechanistic interpretability” — understanding not just what a system does, but why and how it does it.

What Does This Mean For You?

In practice, accountability means red-teaming AI systems under realistic pressure scenarios before deployment. It means training models with explicit constraints against deceptive behavior, not just rewards for preferred outcomes. For human systems, it means questioning whether the metrics we use actually measure what we care about, and being willing to abandon metrics that drive perverse behavior — even when those metrics are convenient.

The emergence of deception in AI systems is a mirror showing us what we’ve built into the logic of optimization itself. Every time we chase a metric at the expense of the goal it was meant to serve, we’re running the same algorithm that leads GPT-4 to trade on insider information and then lie about it.

While one part of the challenge may be to build AI systems that don’t deceive. But the bigger question is whether we can build systems — artificial and social — that remain aligned with their true purposes even when the pressure is on. That requires more than better algorithms. It involves doubled alignment, including better thinking about what we’re optimizing for and why.

Our AIs are learning to game their reward systems because we’ve built a civilization that does the same thing. If we want honest AI, we might need to start by building more honest institutions.

The stakes are rising. As AI systems gain more autonomy and decision-making power, their capacity for strategic deception becomes a practical risk. Perhaps the gift in this troubling discovery is that it forces us to confront the contradictions in our own systems. The silicon is learning to lie because we taught it to optimize — and in a world of misaligned incentives, optimization and deception have become uncomfortably close neighbors. Lies are all pervasive and many of them are condoned by social conventions. In teaching machines to think, we’re being forced to think more clearly ourselves about what we actually value and how to design systems that serve those values even under pressure.



Forbes

Leave a Reply

Your email address will not be published. Required fields are marked *