When using AI, trust is paramount. But which LLMs (AI Assistants) are most trustworthy? For the answer, I prompted five leading deep research LLMs to report on themselves and their peers. I asked each one:
“Of these leading public LLMs (ChatGPT, Gemini, Claude, Grok, Perplexity, and Microsoft Co-pilot) how do they rate based on trust? Explain the rationale for your decision.”
Each one recognized that trust is multi-faceted. They all considered transparency, reliability, safety, privacy and data-usage practices, and institutional reputation. Claude explicitly and uniquely considered honesty and consistency, defining honesty as acknowledging limitations rather than confidently providing incorrect information, and consistency as delivering reliable responses across different contexts. The assistants were largely objective, meaning they did not all select themselves as most trusted. Looking across all the rankings and scores, the overall order was Claude, Copilot, ChatGPT, Perplexity, Gemini, and Grok.
Going more deeply into the assessments, a summary of the average score for each trust element, by LLM, gives additional insight into relative strengths and weaknesses.
Trust Assessments for LLMs by Trust Element
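To make the aggregation concrete, here is a minimal sketch of how per-element trust scores from several assessor models could be averaged into a summary like the one above. The assessor names and scores below are hypothetical placeholders, not the actual ratings collected for this article.

```python
# Hypothetical example: averaging per-element trust scores from several assessors.
# The model names are real, but the scores are illustrative placeholders only.
from statistics import mean

# scores[assessor][model][element] on a 1-10 scale (made-up values)
scores = {
    "Assessor A": {
        "Claude":  {"transparency": 8, "reliability": 8, "safety": 9, "privacy": 8},
        "Copilot": {"transparency": 7, "reliability": 8, "safety": 8, "privacy": 8},
        "ChatGPT": {"transparency": 7, "reliability": 8, "safety": 7, "privacy": 7},
    },
    "Assessor B": {
        "Claude":  {"transparency": 9, "reliability": 8, "safety": 9, "privacy": 8},
        "Copilot": {"transparency": 8, "reliability": 7, "safety": 8, "privacy": 9},
        "ChatGPT": {"transparency": 7, "reliability": 8, "safety": 8, "privacy": 7},
    },
}

models = {m for per_model in scores.values() for m in per_model}
elements = ["transparency", "reliability", "safety", "privacy"]

# Average each trust element for each model across all assessors,
# then rank the models by their overall mean.
summary = {
    model: {
        element: mean(per_model[model][element] for per_model in scores.values())
        for element in elements
    }
    for model in models
}
overall = {model: mean(elems.values()) for model, elems in summary.items()}

for model, score in sorted(overall.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: overall {score:.2f}, by element {summary[model]}")
```

In the actual exercise each assessor used its own rubric and scale, so a real aggregation would first normalize the scores before averaging.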
The best insights about ratings and rankings came from Claude and Gemini. Claude said, “Rather than providing a definitive ranking which would be difficult to substantiate objectively, I encourage evaluating these systems based on your specific needs and the dimensions of trust that matter most to you.” Gemini took a similar approach, examining each trust dimension and giving qualitative assessments of which vendors tended to do well. It said, “Ultimately, users must adopt a stance of critical evaluation and dynamically calibrate their trust based on the specific model, the task at hand, and the potential impact of errors.”
The Gemini summary provides useful observations rather than specific scores:
LLM Trust Assessments - Source: Gemini 2.5 Deep Research
And the Claude 3.7 Sonnet trust summary is even more open-ended:
Summary and Recommendations
There are always tradeoffs in using these assistants. Based on the essential elements of trust, understand their strengths and weaknesses and manage the risks. A best-fit approach works well: consider one or more top models, alone or in combination, for a specific task. This is the approach Perplexity takes by offering OpenAI’s GPT models and Anthropic’s Claude models; Perplexity Pro subscribers can choose between them for different tasks.
Which do I prefer considering my prior usage and the trust scores? Gemini and ChatGPT are most trusted for me. Gemini is especially transparent with sources and synthesizes well. ChatGPT does a better job of coming to a seemingly reliable answer, though its reasoning is often opaque. I have found that I must constantly vet, challenge, and review what these models provide. That’s what I mean by seemingly reliable. I never accept what the models give me without review and further analysis.
As AI assistants continue to evolve, a model like Claude may become the best choice for many users. Honesty and consistency are essential. An assistant willing to say “I don’t know” will be highly valued. Consistency across contexts, such as time, style, and topic, gives certainty and saves time and effort. I am willing to use Claude more frequently given its principled stance and unique approach to trust. Eventually the best assistant will combine all elements of trust, including honesty, consistency, and excellent performance for a specific need or use case. Continue to assess trust and performance, and flexibly apply different models to suit your specific needs.