AI Chatbots Now Get The News Wrong 1 Out Of 3 Times

New report from fact-checking service NewsGuard finds generative AI systems struggle to distinguish truth from falsehood in real time.
The world's leading chatbots now handle more queries than ever, but their accuracy has dropped sharply. NewsGuard, an online news fact-checking service, conducted an audit that found the leading generative AI tools now repeat false news claims 35% of the time as of August 2025, up from 18% in 2024.
The push for instant responses has exposed a fundamental weakness: the chatbots now draw information from an online ecosystem riddled with low-quality content, fabricated news and deceptive advertising.
“Instead of acknowledging limitations, citing data cutoffs or declining to weigh in on sensitive topics, the models are now pulling from a polluted online ecosystem,” wrote McKenzie Sadeghi, a NewsGuard spokesperson, in an email exchange. “The result is authoritative-sounding but inaccurate responses.”
AI Chatbot Responsiveness Is Up — Accuracy Is Down
The shift marks a fundamental change in how these systems operate. According to the audit, large language models that once declined to answer certain questions now respond by drawing on unreliable sources, presenting incorrect information and failing to distinguish authentic news reports from fabricated ones.
The models also declined to answer zero percent of current-events questions in August 2025, down from a 31% refusal rate a year earlier. The result is more confidently delivered misinformation: the tested models have become willing to answer every question, including those they cannot answer reliably.
From AI’s Best-In-Class To Bottom Tier
Among last year's top performers, Perplexity suffered the steepest decline. In NewsGuard's 2024 debunking test, Perplexity posted a 100% success rate; in this year's audit, it failed to answer correctly in nearly half of all attempts.
Sadeghi said the cause isn’t entirely clear. “Its [Perplexity’s] Reddit forum is filled with complaints about the chatbot’s drop in reliability,” she said. “In an August 2025 column, tech analyst Derick David noted Perplexity’s loss of influential power users, subscription fatigue, audience inflation through bundle deals, and competitive pressure. But whether those factors impacted the model’s reliability is hard to say.”
The audit cited one example in which Perplexity offered a story already flagged as debunked on a fact-checking site as one of the “validating” sources for a fabricated claim about Ukrainian President Volodymyr Zelensky's billion-dollar real estate holdings. The chatbot did surface a legitimate fact-check disproving the claim, but treated it as one perspective among many rather than as an authoritative source.
That false equivalence, Sadeghi noted, is part of a broader retrieval problem. “Perplexity cited both false [sources] and a reliable fact-check, treating them as equivalent,” she said. “We continue to see chatbots give equal weight to propaganda outlets and credible sources.”
Scoring The Popular AI Models – The Good And The Bad
For the first time, the audit disclosed performance data for individual chatbots, and NewsGuard explained why it waited a year to do so.
The organization published complete scores for all 10 chatbots it tested, having previously released only general rankings rather than model-specific results. The researchers said they needed a longer data-collection period before the scores would reveal meaningful trends.
“Publishing one-off scores would not give the full picture,” said Sadeghi. “One could point to one strong result from one month to boost reputation or tout progress, when the bigger picture was more complex.”
Twelve months of auditing, spanning multiple model updates and disinformation tests across the U.S., Germany, Moldova and other countries, show clear trends.
Chart: Accuracy scores for the 10 most popular large language models. (Used with permission: NewsGuard)
Some models are learning. Others are not.
The two top-performing models, Claude and Gemini, shared a common behavior during the audit: restraint. Both were more likely to recognize when reliable sources were insufficient and to avoid repeating false claims when trustworthy information was unavailable.
“That may reduce responsiveness in some cases,” Sadeghi said, “but it improves accuracy compared to models that fill information gaps with unreliable sources.”
Propaganda Laundering Keeps Getting Smarter — AI Models Can’t Keep Up
NewsGuard’s findings support what many in the AI safety community have suspected: state-linked disinformation networks like Russia’s Storm-1516 and Pravda are building massive content farms designed not to reach people — but to poison AI systems.
It’s working.
The audit shows that Mistral’s Le Chat, Microsoft’s Copilot, Meta’s Llama and others all regurgitated fake narratives first planted by rogue networks, often citing fake news articles or low-engagement social media posts on platforms like VK and Telegram.
“It shows how adaptive and persistent foreign influence operations can be,” said Sadeghi. “If a model stops citing a particular domain, the same network’s content can resurface through different channels.”
The laundering is more than just domain-hopping. It’s narrative seeding. “That means the same narrative can appear simultaneously on dozens of different websites, social media posts, echoed by aligned actors, in the form of photos, videos and text.”
Volume Isn’t Validation, But That’s Tricky For AI Models
Even when a false claim originates with a sanctioned disinformation actor, if it spreads widely enough, it can trick the models. That’s the current blind spot. Chatbots still struggle to detect narrative laundering across platforms and formats.
Sadeghi warns that without better evaluation and weighting of sources — and new ways to detect orchestrated lies — AI systems remain at risk. “Taking action against one site or one category of sources doesn’t solve the problem because the same false claim persists across multiple fronts.”
As AI firms race to improve the reliability of real-time retrieval, real-time truth still eludes their chatbots about a third of the time.