As AI models start exhibiting bad behavior, it’s time to think harder about AI safety

Welcome to AI Decoded, Fast Company’s weekly newsletter that breaks down the most important news in the world of AI. You can sign up to receive this newsletter every week here.
Why I’m becoming more worried about AI safety
During the first two years of the generative AI boom, new large language models were limited in scope and application. They were, in essence, very expensive autocomplete engines that understood only words.
In 2025 generative AI models have a much broader view of the world. They can process code, images, video, and audio. They can reason and strategize about delivering a good answer. They can control external tools, including digital tools like web search agents and, increasingly, physical tools like robots. As their capabilities grow, so does their potential for harm.
This is no longer a purely conceptual argument. Research shows that increasingly capable models are already exhibiting a proclivity for unsafe behavior during testing. In a system card published last week, Anthropic documented some alarming behavior from its newest and biggest model, Claude Opus 4.
During safety testing, one instance of Opus was allowed to discover plans for its own decommissioning. It was also given access to some fictional emails of its developers. The model used the content of those emails as fodder to attempt to blackmail its human handlers into keeping it alive. As those attempts failed, its tactics shifted from subtle to more overt.
Separately, the independent research firm Apollo Research observed an instance of Claude Opus 4 “writing self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself,” all with the goal of undermining its developers’ intentions. Anthropic says that it corrected these early safety issues in later versions of the model. Still, for the first time, Anthropic bumped the new Opus model up to Level Three on its four-level safety scale. The company said it couldn’t rule out the model’s ability to assist a user in developing a mass-casualty weapon.
But powerful AI models can do harm in subtler ways, too, such as within the information space. A team of Italian researchers found that ChatGPT was more persuasive than humans in 64% of online debates. The AI was also better than humans at leveraging basic demographic data about its human debate partner to tailor its arguments for maximum persuasiveness.
Another worry is the pace at which AI models are learning to develop AI models, potentially leaving human developers in the dust. Many AI developers already use some kind of AI coding assistant to write blocks of code or even code entire features. At a higher level, smaller, task-focused models are distilled from large frontier models. AI-generated content plays a key role in training, including in the reinforcement learning process used to teach models how to reason.
There’s a clear profit motive in enabling the use of AI models in more aspects of AI tool development. “. . . future systems may be able to independently handle the entire AI development cycle—from formulating research questions and designing experiments, to implementing, testing, and refining new AI systems,” write Daniel Eth and Tom Davidson in a March 2025 blog post on Forethought.org.
With slower-thinking humans unable to keep up, a “runaway feedback loop” could develop in which AI models “quickly develop more advanced AI which would itself develop even more advanced AI,” resulting in extremely fast AI progress, Eth and Davidson write. Any accuracy or bias issues present in the models would then be baked in and very hard to correct, one researcher told me.
Numerous researchers—the people who actually work with the models up close—have called on the AI industry to “slow down,” but those voices compete with powerful systemic forces that are already in motion and hard to stop. Journalist and author Karen Hao argues that AI labs should focus on creating smaller, task-specific models (she gives Google DeepMind’s AlphaFold models as an example), which may help solve immediate problems more quickly, require fewer natural resources, and pose a smaller safety risk.
DeepMind cofounder Demis Hassabis, who won the Nobel Prize in Chemistry for his work on AlphaFold2, says the huge frontier models are needed to achieve AI’s biggest goals (reversing climate change, for example) and to train smaller, more purpose-built models. And yet AlphaFold was not “distilled” from a larger frontier model: It uses a highly specialized model architecture and was trained specifically to predict protein structures.
The current administration is saying “speed up,” not “slow down.” Under the influence of David Sacks and Marc Andreessen, the federal government has largely ceded its power to meaningfully regulate AI development. Just last year, AI leaders were still paying lip service to the need for safety and privacy guardrails around big AI models. No more. Any friction has been removed, in the U.S. at least. The promise of this kind of world is one of the main reasons why normally sane and liberal-minded opinion leaders jumped on the Trump Train before the election—the chance to bet big on technology’s Next Big Thing in a Wild West environment doesn’t come along that often.
AI job losses: Amodei says the quiet part out loud
Anthropic CEO Dario Amodei has a stark warning for the developed world about job losses resulting from AI. Amodei told Axios that AI could wipe out half of all entry-level white-collar jobs and push the unemployment rate to 10–20% within the next one to five years. The losses could come from tech, finance, law, consulting, and other white-collar professions, and entry-level jobs could be hit hardest.
Tech companies and governments have been in denial on the subject, Amodei says. “Most of them are unaware that this is about to happen,” Amodei told Axios. “It sounds crazy, and people just don’t believe it.”
Similar predictions have made headlines before, but they have been narrower in focus.
SignalFire research showed that big tech companies hired 25% fewer college graduates in 2024. Microsoft laid off 6,000 people in May, and 40% of the cuts in its home state of Washington were software engineers. CEO Satya Nadella said that AI now generates 20–30% of the company’s code.
A study by the World Bank in February showed that the risk of losing a job to AI is higher for women, urban workers, and those with higher education. The risk of job loss to AI increases with the wealth of the country, the study found.
Research: U.S. pulls away from China in generative AI investments
U.S. generative AI companies appear to be attracting far more VC money than their Chinese counterparts so far in 2025, according to new research from the data analytics company GlobalData. Investments in U.S. AI companies exceeded $50 billion in the first five months of 2025. China, meanwhile, struggles to keep pace due to “regulatory headwinds,” even though many Chinese AI companies can get early-stage funding from the Chinese government.
GlobalData tracked just 50 funding deals for U.S. companies in 2020, amounting to $800 million of investment. The number grew to more than 600 deals in 2024, valued at more than $39 billion. The research shows 200 U.S. funding deals so far in 2025.
Chinese AI companies attracted just one deal in 2020, valued at $40 million. Deals grew to 39 in 2024, valued at around $400 million. The researchers tracked 14 investment deals for Chinese generative AI companies so far in 2025.
“This growth trajectory positions the US as a powerhouse in GenAI investment, showcasing a strong commitment to fostering technological advancement,” says GlobalData analyst Aurojyoti Bose in a statement. Bose cited the well-established venture capital ecosystem in the U.S., along with a permissive regulatory environment, as the main reasons for the investment growth.
More AI coverage from Fast Company:
- 9 of the most out-there things Anthropic CEO Dario Amodei just said about AI
- How AI could supercharge ‘go direct’ PR, and what the media can do about it
- This new browser could change everything you know about bookmarks
Want exclusive reporting and trend analysis on technology, business innovation, future of work, and design? Sign up for Fast Company Premium.