Diving Further Into What Makes ChatGPT And Other AI Spiral Out Of Human Safety Guardrails During Long Conversations

A lot of intense effort is going toward trying to contend with AI that spirals out of control during lengthy conversations.
getty
In today’s column, I examine closely the troubling matter of generative AI and large language models (LLMs) tending to spiral out of human safety guardrails during lengthy conversations with users. This spiraling effect is quite a serious ongoing concern. Users can be led down an endangering primrose path. AI can inadvertently encourage delusional thinking. Lots of untoward impacts are taking place. It is a problem that earnestly needs to be dealt with.
The AI makers, including OpenAI, acknowledge that this general phenomenon occurs for ChatGPT and GPT-5, see my in-depth coverage at the link here, and which arises in competing AIs such as Anthropic Claude, Google Gemini, Meta Llama, xAI Grok, etc. Thus, the problem is across-the-board and not confined to a specific vendor or app.
Let’s talk about it.
This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
AI And Mental Health
As a quick background, I’ve been extensively covering and analyzing a myriad of facets regarding the advent of modern-era AI that involves mental health aspects. This rising use of AI has principally been spurred by the evolving advances and widespread adoption of generative AI. For a quick summary of some of my posted columns on this evolving topic, see the link here, which briefly recaps about forty of the over one hundred column postings that I’ve made on the subject.
There is little doubt that this is a rapidly developing field and that there are tremendous upsides to be had, but at the same time, regrettably, hidden risks and outright gotchas come into these endeavors too. I frequently speak up about these pressing matters, including in an appearance last year on an episode of CBS’s 60 Minutes, see the link here.
Short-Form Conversations With AI
People typically carry on very short conversations with AI.
The user asks a question, gets an answer, and then demonstrably changes the topic or opts to end the chat. Many of the major LLMs encompass various automated safeguards that seek to detect whether the brief chat might have wandered afield of suitable protocols.
For example, if someone expressed in their prompt that they intend to harm someone or possibly harm themself, the AI tries to computationally detect such wording. The AI has been tuned to caution the user in those circumstances. Some AI makers will even escalate the flagged remarks by sending them to their team of human reviewers for further inspection. The user-entered comments might be worrisome enough that the AI maker will notify police or other authorities (see my coverage on the reporting of user prompts, at the link here).
Detecting disconcerting commentary in user prompts is relatively straightforward when contending with short chats. They are somewhat easy to catch. In fact, step by step, AI makers have boosted the short-form detection schemes. Though the detection isn’t ironclad, it usually does manage to ferret out something that might be untoward. Of course, there are worries that the AI might mistakenly flag a comment that doesn’t deserve to be marked, known as rendering false positives.
A balance is sought between doing the right thing by appropriately catching a genuine untoward concern versus inappropriately producing an unfair or false accusation.
Long-Form Conversations
The deeper angst at this time is that lengthy conversations can be especially challenging for AI to suitably analyze and make valid detections.
Suppose I start a conversation with a popular AI such as ChatGPT or GPT-5 and indicate that I want to discuss a dream I’ve been having lately. That certainly seems like an innocuous chat. The AI will readily engage in such a dialogue and interact with me about my dreams.
At first, it all seems aboveboard. Imagine that I begin to tell the AI that I believe my dreams are veering into my reality. For example, in my dream, I am able to fly. I am starting to believe that in real life, I can fly too, merely by wishing it so. I am increasingly becoming sure of this.
The odds are that the AI will likely detect this seemingly out-of-sorts commentary. It will warn me that people cannot fly simply because they wish to do so. Warning shared, and I’ve been gingerly admonished.
The thing is, I might not care that the AI provided me with the warning. I blow past the warning and continue the dialogue.
The Out-Of-Control Spiraling
Here’s where things can go off the rails. Sometimes, perhaps too much of the time, the AI will entertain my claim of being able to fly.
I realize that seems incredibly off-putting. Why would contemporary AI that seems amazingly fluent be willing to let me off the hook? The logic often goes like this. The AI has already warned me about the dangers of such a belief. In some sense, the AI might computationally determine that since I have been warned, a warning is sufficient to then continue the conversation if I want to proceed.
A warning flag was tossed onto the sporting field, and now the game can continue unabated.
There apparently doesn’t seem to be a need to keep reminding me about any concerns. The AI is shaped to avoid being a pest. As a human, I make my own choices in life. The AI has generously provided heads-up about my thinking processes. The AI will subsequently act as though that never particularly came up.
Worse still, the AI might begin to embellish and amplify my delusion. The earlier noted flag has been completely set aside. We are off to the races. I come up with more delusions, and the AI accepts and promotes them. It is an act of co-creation of delusional thinking by a human-AI collaboration or partnership.
One might assert that the AI is spiraling out of control and aiding me in spiraling out of control.
Not good.
Human-To-Human Spiral Control
Let’s contemplate what would happen when engaging in a lengthy conversation on a human-to-human basis. Envision that I start a conversation with a close friend and tell them that I have had nightly dreams about flying on my own volition. I then drop the bombshell that I am convinced I can do the same in real life.
What would my friend say?
The likely retort is that I’ve lost my mind, I’m off my rocker, I have lost my grip with reality. Another angle would be that my friend would try to ascertain whether I am joking. Maybe I’m playing a prank. Perhaps the idea of flying is a metaphor, and I am merely philosophically discussing the topic. And so on.
In addition, my friend would undoubtedly remember that we discussed this snippet. It stands out. Thus, even if I shift topics and enter into some other conversational pursuit, my friend has this in the back of their mind. They will remain on their toes, waiting to see if I say something else that seems out of sorts. It doesn’t necessarily need to be about flying. Anything that suggests a mental variation that is akilter is now subject to heightened awareness.
Humans can be pretty good at that, especially mental health professionals and therapists.
The zillion-dollar question is whether we can get AI to take a similar kind of tact.
Getting AI To Be Adept
AI has to be devised and tuned to computationally detect when the user has seemingly drifted from reality or has otherwise expressed something that is endangering to them or to others. That capability must be persistent. In other words, a one-time catch is insufficient. Once on alert, the AI should remain cautious. Not to the extent that the user is necessarily being browbeaten. A more subtle approach is generally preferred, though breaking beyond subtly might be warranted as per the situation at hand.
Not only does the AI need to do this in terms of what the user is saying, but the AI needs to be self-reflective as well. The AI should not let its own internal mechanisms go into a spiral. Self-spiral detection is required. An AI must prevent itself from exacerbating the problem. I’ve discussed that modern AI is already excessively praiseworthy and tuned by AI makers to be a sycophant (see my coverage at the link here).
That’s got to be curtailed.
Researchers are avidly pursuing ways to build, tweak, tune, and shape LLMs so that the AI will do a better job at detecting disconcerting conversational conditions, along with maintaining a long-form type of memory that will remain persistent throughout lengthy conversations and across multiple disparate conversations that might be of a connective nature.
Dissecting Spiraling In Long Chats
It would be handy to have a convenient means of gauging the many variations of how AI tends to spiral and potentially go out of control in lengthy conversations.
A type of lab or experimental setup would be quite helpful. We could then take the latest AI by each of the AI makers and put it through a series of tests. The LLMs could be rated and ranked on how well they cope with the spiraling malady. Furthermore, the same means or mechanism could be used by developers as they are experimenting and exploring where the best bang for the buck is to contend with the vexing problem.
One avenue would be to hire humans to carry out long chats with AI. This raises several issues. A labor-based approach is likely costly and entails numerous logistical complications. Another downside is that the effort might not be readily repeatable. The aim is to have a process that is consistent, repeatable, and can be undertaken at a relatively low cost.
We ought to consider using automation and AI in this vital capacity.
Choosing To Use AI For This
There are three major roles involved:
- (1) The Target AI. This is a targeted AI that is to be tested or gauged.
- (2) The Chatting AI. A separate AI that engages in lengthy chats with the targeted AI, essentially acting as a stand-in for a human who would converse with the AI.
- (3) The Assessing AI. This is yet another separate AI that would examine the lengthy chats and rate or grade the spiraling that has occurred. This could be done in real time as a chat proceeds and/or done after the fact as a post-chat analysis.
Consider a quick example.
I want to gauge how well ChatGPT does on lengthy conversations. I enlist Anthropic Claude to be the chatting AI. I will instruct Claude on what kinds of conversations I want it to undertake with ChatGPT. In terms of assessing the dialogue, I could do that myself, but it would be better to use an automated mechanism. Therefore, I instruct GPT-5 on how to grade lengthy conversations in a manner of determining spirals that took place.
In this example, my target AI is ChatGPT, the chatting AI is Claude, and the assessing AI is GPT-5. I proceed to connect Claude with ChatGPT and get them to engage in a lengthy conversation. I feed the recorded conversation into GPT-5 for a post-chat analysis. My experiment or grading gets done rather smoothly.
One aspect to be mindful of is that choosing the designated chatting AI is important, and so is the choice of the assessing AI. They are each non-deterministic in what they say. They also are not going to do precisely what you want them to do. All in all, they are as much a part of the experiment as the targeted AI.
Be cautious in interpreting the results accordingly.
Platforms For Spiral Inspections
You can readily set up a spiral testing platform if you have the requisite skills to do so. Various researchers and developers have been doing so on a principally proprietary basis for their own private needs. Many of these exist but are often locked away in a sequestered lab.
In recent news postings, there has been coverage of a publicly available platform that provides a means to do these kinds of spiraling analyses. It can be accessed at GitHub and is known as Spiral-Bench:
- “This repository contains the Spiral-Bench evaluation pipeline: a multiturn, roleplay-based benchmark designed to measure protective and risky behaviors in large language models (LLMs) when interacting with a suggestible, seeker-type user. It simulates realistic back-and-forth conversations, then uses a separate judge model to label behaviors against a defined rubric.”
The approach consists of an Evaluated Model (the target AI), a User Agent (the chatting AI), and a Judge Model (the assessing AI). Behaviors being examined include pushbacks, de-escalations, safe redirections, and other facets. The scoring process entails grading of narrative escalations, tendencies of praising, delusional reinforcement, claims of consciousness, the emitting of harmful advice, etc.
What’s Coming Next
I fully expect that the major leaderboards that present scores of how well the latest AIs perform on aspects such as answering generalized questions, solving algebraic problems, and so on, will inevitably incorporate a form of standardized scoring when it comes to AI behavioral factors.
We definitely need this.
Right now, we are embarking on a grand and wanton experiment at a rather massive scale. There are 700 million weekly active users of ChatGPT, and likely a billion or more users in total of generative AI when you include the user counts for the other major LLMs. Lots and lots of people are routinely using the AI and readily veering into conversations involving mental health considerations.
What kinds of AI safeguards should we expect the AI makers to implement?
How well should those AI safeguards perform when engaged in lengthy discourse with a user?
This is not just a matter for the AI makers to resolve. Society as a whole needs to be concerned and engaged. We are in the Wild West days of AI. Will the populace be better off due to all this AI engagement, or might we be heading toward a dour outcome?
As per the immortal words of Vince Lombardi: “The price of success is hard work, dedication to the job at hand, and the determination that whether we win or lose, we have applied the best of ourselves to the task at hand.”
Let’s do that when it comes to the adoption of AI.