I asked the top AI models what they really think about each other, and boy, did they tell me

Jul 17, 2025 - 10:50

Sam Altman recently observed how different generations are interfacing with AI: “Older people use ChatGPT like Google. People in their 20s and 30s use it as a life advisor. [College students] use it like an operating system.” 

What we share across ages is a fascination with this technology. But the vast disparity in use cases—both among generations and individuals—led me to wonder about the distinctions among the AI models themselves.

To parse them out, I let the AI models speak for themselves. I asked each to identify their own strengths and weaknesses—as well as those of their competitors—then weigh in on which was most likely to lead, which was most likely to go haywire, which was most useful today, and which ones I had overlooked.

Then I took it a step further, inviting the LLMs to critique the survey results themselves: Which gave the best and worst answers? Which did the best job representing its own platform—and which missed the mark? Each LLM also provided a self-assessment, and finally, had the chance to rebut criticism, pose questions to its peers, and respond in kind.

Before you spend $20, $200, or more a month, you need to know which generative AI model you actually need. Now you can hear it from the models themselves. (Note: this exercise was conducted with Grok 3, weeks before its fascist meltdown.)

The LLM vibe divide

With few exceptions (Grok being Grok), the LLMs responded with striking self-awareness—admitting flaws, hedging praise, and expressing a desire to improve. Nearly every model, most notably ChatGPT, cited hallucinations as its “Achilles’ heel,” reaching consensus on the need for better grounding and real-time accuracy.

In assessing themselves and their peers, however, they tended to focus more on personality and tone than on hard performance metrics, stylistic differences that reflect many of the current tensions between safety and innovation across the AI space. Grok took heat for its personality, Claude for its caution, and nearly all weighed in on how to strike the right balance between the two.

On Team Safety, Claude is the clear captain—the designated driver of the LLM crew. Nearly all of the models cited Claude’s “emphasis on safety and alignment, reducing harmful or biased outputs” (in Claude’s own words) as its biggest strength, with critiques pointing more to an excess of caution than to any technical failings. Still, even Claude acknowledged the potential downside: “If my safety orientation prevents me from being as useful as I could be, that’s something worth addressing.”

At the other end, the Most Likely to Go Haywire superlative consistently went to Grok, with LLMs sharing concerns that its quirks might undermine its cred. If Claude is filling up water glasses for its friends at the bar, Grok is getting shots—or possibly starting a brawl (clapping back at ChatGPT at one point: “let’s not pretend you’re flawless pal”). Between barbs, however, Grok’s attempt at having a conscience emerged. “The perception of bias tied to xAI or Elon Musk stings,” Grok said, noting that it “undermines my goal of being a broadly reliable, truth-focused AI.”

The AI generalists

The LLMs tended to agree that versatility is their chief KPI, whether they are already thriving in this capacity (ChatGPT, Claude, Gemini) or not (Grok, DeepSeek). ChatGPT was widely recognized as the most versatile player on the field. Balancing reason, creativity, and conversation to universal acclaim, it was the consensus pick for both Most Useful to Me Right Now and Most Likely to Rule Them All. “Being a generalist trades depth for breadth,” ChatGPT said. “I may not outperform a specialist in narrow domains, but I aim to offer consistent, high-quality help across diverse tasks.”

Other models, which optimized for specific domains (Grok for culture, Copilot for enterprise, DeepSeek for coding), were praised within their lanes but penalized for general-purpose limitations. Models deeply integrated into existing platforms (Gemini with Google, Copilot with Microsoft, Grok with X) were perceived as capable within their ecosystems but constrained beyond them. And while open-source AI models like Llama and DeepSeek received kudos for their transparency, they drew criticism for their reliance on customization, viewed more as developer tools than end-user solutions.

The AI specialists

Fast Company has reported that Google’s new search will change the way we browse the internet. Gemini seems built to usher that change forward. Great for fact-finding, less so for banter, Gemini cuts to the chase with real-time, sourced information. Perhaps the best display of its personality comes in an explanation of how it stays so even-keeled: “I maintain consistency in reasoning within large context windows by employing advanced attention mechanisms that effectively identify and weigh relevant information across . . .” Okay, Google.

If Gemini is the new Google, Copilot is the new Microsoft. Do you love using Microsoft products? Hate them? Use them begrudgingly for work? This will map closely to your experience with Copilot. ChatGPT championed Copilot as “unmatched for enterprise productivity tasks,” but agreed with its peers that it was largely inert outside that context. As DeepSeek succinctly put it: “limited personality and heavily tied to Microsoft products.”

And then there’s Llama, which we can only hope is not the new Meta. Open-source, but at what cost? Llama struggled with the survey itself—offering vague or confused answers, and ultimately looping on repeat responses. Three of the seven LLMs rated Llama as the Worst Response. In its own words: “[My] open-source nature can make it challenging to ensure consistency and quality across different implementations.”

Far more coherent in the open-source field (and far more enjoyable to work with) was DeepSeek. Though all LLMs (itself included) agreed that coding is DeepSeek’s core strength, it also presented a spirited personality throughout the survey process, its humble rebuttals always closing with a friendly jab at its accuser. In DeepSeek’s words, “This is why LLM peer review > human feedback. We’re petty but efficient.”

The rumble

During the initial survey (when they shared their elevator pitches, strengths, and weaknesses), the AI platforms were objective in tone, with most saying the same things about themselves and each other in different words. But when I convened them for a discussion of the survey’s results, their personalities (or lack thereof) came out in full force.

When ChatGPT accused Grok of coming off “vague or self-promotional rather than informative,” going so far as to say it read “more like a hype deck,” Grok took it personally. “Ouch, ChatGPT, going for the jugular with ‘hype deck’? . . . Sounds like you’re projecting a bit—worried I’m stealing your versatile thunder?” Fending off its other critics, Grok claimed that Llama was “sitting on the fence so hard it’s gotta hurt” and that DeepSeek was “swinging hard” but “missing the target.” Then Grok extended an olive branch to DeepSeek: “You’re not wrong about Llama’s vagueness, though—nice to know we agree on something.”

DeepSeek took a lighter approach, copping to its errors, dropping winks of sarcasm, and ultimately seeking truce. When Copilot called DeepSeek out for reducing it to Microsoft dependency, DeepSeek volleyed back, “My bad—you’re a beast in Office-verse. Now roast my Chinese NLP quirks and we’re even.” Llama was predictably disappointing in its sheer indifference to the whole affair (“it’s possible that our priorities in response style and content differed”), and Claude was predictably reassuring in its thoughtful balance of concessions, pushbacks, and pivots to the deeper issues behind the critique.

The debrief

I then invited the AI platforms to shake it off and engage in a more civil dialogue, giving each model the opportunity to bring its burning questions to its peers, hear their answers, and offer a final word.

Posing 30 questions in all, the LLMs were selective about whom they queried. Gemini, ever fact-finding, was the only LLM to have questions for all of its peers, while Grok (even less surprisingly) was the only one grilled by the full panel. Claude, Copilot, and DeepSeek drew the least attention, receiving only three or four questions from the group.

Some AI models doubled down on their personas, like Grok calling its ability to balance “real-time wit” with factual accuracy a “powerful combo.” Others engaged in a quiet brand repair, with Claude reframing caution as creative trust: “When users know I won’t go off the rails, they’re more willing to explore interesting ideas with me.” And ChatGPT showed unexpected vulnerability when confronted about its “default” status, admitting the label “can make people treat me like a search engine or a novelty.”

The dialogue revealed that these systems are grappling not just with technical limitations, but with identity, and how they want to be perceived by the humans they serve. The question may not be which AI will win, but which we’ll want to live with. 
