
On June 6, 2025, Olhar Digital hosted one of the most anticipated events of the year in the world of artificial intelligence: a direct confrontation between the market's most advanced AIs. Contrary to what many expected — a clash based on charisma or popularity — the decisive point of the challenge was the accuracy of the responses. In this scenario, discovering how the test was structured and understanding the evaluation criteria reveals much more than simply pointing out a winner: it allows us to glimpse future trends in the sector, the limitations we still need to overcome, and the practical application opportunities for the general public.
The Structure of the Challenge: More Than Just a Battle of Promises
To ensure impartiality and objective comparisons, the challenge adopted a methodology based on three main pillars: recognized benchmarks, extended context evaluation, and multimodal tasks. Each of these elements directly impacted the final performance of the models and how we interpret their results.
Standardized Benchmarks and Their Importance
The foundation of any accuracy competition lies in consolidated metrics. In the case of the Olhar Digital test, two significant benchmarks were used:
- MMLU (Massive Multitask Language Understanding): evaluates everything from reading comprehension to logical reasoning and solving complex problems.
- HumanEval: focuses exclusively on code generation, measuring syntactic accuracy, functionality, and adherence to specifications.
By adopting these benchmarks, the challenge ensured that all AIs were subjected to the same set of questions and scenarios, eliminating task selection biases and allowing fair comparisons.
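To make the scoring behind these benchmarks concrete, the sketch below shows how an MMLU-style multiple-choice score and a HumanEval-style functional check are commonly computed. It is not the harness Olhar Digital used; the function names and toy data are purely illustrative.

```python
# Illustrative sketch only: NOT the Olhar Digital harness, just a minimal
# example of how MMLU-style accuracy and HumanEval-style functional checks
# are typically scored. All names and data here are hypothetical.

def mmlu_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice questions answered correctly (A/B/C/D)."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

def humaneval_pass(generated_code: str, test_code: str) -> bool:
    """Run the benchmark's unit tests against the generated function.
    Returns True if every assertion passes (a pass@1-style check).
    Real harnesses sandbox this step instead of calling exec() directly."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the hidden tests (asserts)
        return True
    except Exception:
        return False

# Example usage with toy data
preds = ["B", "C", "A"]
gold  = ["B", "D", "A"]
print(f"MMLU-style accuracy: {mmlu_accuracy(preds, gold):.2f}")  # 0.67

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print("HumanEval-style check passed:", humaneval_pass(candidate, tests))
```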
Extended Context: The Great Differential
A short, direct response is not always enough to evaluate the true coherence of an AI. Therefore, the test included long text passages and multiple interactions that required maintaining the narrative thread and remembering details previously mentioned. This dynamic is especially relevant for applications such as customer support, legal document analysis, or report writing, where virtual “memory” makes all the difference.
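One common way to probe this kind of virtual memory is to bury a verifiable detail inside a long prompt and check whether the model can retrieve it later. The sketch below illustrates that idea; the `ask` function, the filler text, and the scoring are hypothetical placeholders, not the challenge's actual protocol.

```python
# A minimal long-context recall probe, assuming a generic ask(prompt) wrapper
# around whichever model is under test. Purely illustrative; not the
# methodology used in the Olhar Digital challenge.

import random

def build_long_prompt(fact: str, filler_paragraphs: list[str]) -> str:
    """Bury a single verifiable fact at a random position inside a long text."""
    position = random.randint(0, len(filler_paragraphs))
    paragraphs = filler_paragraphs[:position] + [fact] + filler_paragraphs[position:]
    return "\n\n".join(paragraphs)

def recall_score(ask, fact: str, question: str, expected: str,
                 filler_paragraphs: list[str], trials: int = 5) -> float:
    """Fraction of trials in which the model recovers the buried detail."""
    hits = 0
    for _ in range(trials):
        prompt = build_long_prompt(fact, filler_paragraphs) + "\n\n" + question
        answer = ask(prompt)
        if expected.lower() in answer.lower():
            hits += 1
    return hits / trials

# Toy usage with a stub "model" that simply echoes the prompt (always recalls).
filler = [f"Filler paragraph number {i}." for i in range(200)]
stub = lambda prompt: prompt
print(recall_score(stub, "The contract number is 7741.",
                   "What is the contract number?", "7741", filler))
```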
Gemini 2.5 Pro shone precisely at this point, showing great consistency even when dealing with text blocks over five thousand words. Meanwhile, ChatGPT o3, although quick in summarizing information, showed some decline in performance over prolonged interactions, revealing a tendency for partial “forgetting” of the initial content.
Multimodal Tasks and Blind Evaluations
To increase complexity, the challenge included tasks that went beyond pure text: identifying aspects in images, interpreting charts embedded in PDF documents, and analyzing large code sections with refactoring requests. Blind evaluations were also adopted, where human evaluators did not know which model produced each response, reducing subjective interferences in the final scoring.
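As a rough illustration of how a blind evaluation can be organized, the snippet below anonymizes responses before human scoring and only re-attributes the scores afterwards. The labels, scores, and model names are invented; the challenge's actual protocol was not published at this level of detail.

```python
# Simplified blind-evaluation sketch: responses are shuffled and stripped of
# model identity before being shown to raters; scores are mapped back only
# after scoring is complete. All data below is invented for illustration.

import random
from statistics import mean

responses = {
    "model_a": "Answer produced by the first model...",
    "model_b": "Answer produced by the second model...",
}

# Assign anonymous labels so raters cannot tell which model wrote what.
items = list(responses.items())
random.shuffle(items)
blind_key = {f"response_{i+1}": model for i, (model, _) in enumerate(items)}

# Raters score each anonymous response (here, hard-coded toy scores from 1-5).
rater_scores = {
    "response_1": [4, 5, 4],
    "response_2": [3, 4, 4],
}

# Only after scoring are the labels mapped back to the real models.
for label, model in blind_key.items():
    print(f"{model}: mean score {mean(rater_scores[label]):.2f}")
```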
Performance in Focus: Accuracy as the Supreme Criterion
Analyzing only who “responded the fastest” does not do justice to the real potential of an AI. Accuracy — understood as the ability to provide correct, contextualized, and complete answers — was the determining criterion. Below is a simplified comparison of key results:
| Model | HumanEval Performance | MMLU Accuracy | Highlight |
| --- | --- | --- | --- |
| Gemini 2.5 Pro | ≥ 90% | Superior to Gemini 1.5 and most competitors | Coherence in extended and multimodal contexts |
| ChatGPT o3 | ~87–90%* | Competitive, but below human-level accuracy on code | Agility in web search |
* Approximate values based on multiple public reports from May 2025.
Gemini 2.5 Pro: The Coherence Champion
Google's model stood out for its remarkable consistency in complex logic and code generation tasks. Its multimodal approach allowed it to seamlessly transition between text, images, and code structures without losing accuracy. Additionally, its fine-tuning structure based on updated and diverse data explains part of its superior performance in long memory tests.
ChatGPT o3: Speed and Naturalness, with Some Caution
While Gemini prioritized coherence, ChatGPT o3 remained strong in web search queries and in providing easily understandable natural language responses. However, the pressure to maintain logic and details in long contexts proved to be its weak point, raising discussions on how to balance speed with memory robustness.
Potentials and Limitations in the General Public Context
For the average user, understanding these nuances is essential when choosing an AI tool for everyday tasks, whether in content creation, problem-solving, or information search.
Practical Use Potentials
By offering high accuracy in extensive context and multimodal tasks, Gemini 2.5 Pro emerges as an alternative for professionals dealing with large volumes of technical texts, such as lawyers, journalists, and researchers. Meanwhile, ChatGPT o3, with its agility in searches and construction of easy-to-read dialogues, remains attractive for educators, content creators, and users seeking quick and well-formulated answers.
Limitations to Be Aware Of
Despite advances, both AIs still face significant challenges:
- Training Biases: responses may reflect imbalances present in the original data.
- Hallucinations: reduced accuracy on very specific or recent topics without real-time access to external databases.
- Integration with Legacy Systems: adapting to the workflow of companies using proprietary software is still complex.
Recognizing these limitations helps the user define response validation strategies and adopt cross-checking practices before using critical information.
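One simple cross-checking practice is to put the same question to more than one model and flag any disagreement for manual verification before the answer is used. The sketch below assumes two hypothetical client functions standing in for real model calls.

```python
# Minimal cross-checking sketch. ask_model_a and ask_model_b are hypothetical
# stand-ins for real API clients; the idea is only to flag answers that
# disagree so a human verifies them before relying on the information.

def cross_check(question: str, ask_model_a, ask_model_b) -> dict:
    """Ask the same question to two models and flag disagreements for review."""
    answer_a = ask_model_a(question).strip().lower()
    answer_b = ask_model_b(question).strip().lower()
    return {
        "question": question,
        "agree": answer_a == answer_b,
        "answers": (answer_a, answer_b),
    }

# Toy usage with stub models standing in for real API calls.
result = cross_check(
    "In which year was the Olhar Digital challenge held?",
    lambda q: "2025",
    lambda q: "2025",
)
print(result)  # disagreeing answers would show "agree": False
```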
Implications for the Search and Adoption Ecosystem
The impact of the challenge goes beyond simply choosing an “accuracy champion”. Data from Pew Research indicates that 47% of people already prefer AI tools over traditional search engines, reflecting a paradigm shift in obtaining information. Meanwhile, reports from Sparktoro reveal that 60% of Google searches do not generate clicks, highlighting the demand for direct and complete answers.
Technology companies, attentive to this movement, have been redirecting investments towards multimodal capabilities and “clickless search” solutions capable of serving users who require speed and reliability in a single interaction.
Final Considerations and Prospects for 2025–2026
At the end of this accuracy confrontation, it is clear that winning a challenge does not mean having the perfect solution for all scenarios. Gemini 2.5 Pro impressed with its coherence and multimodal versatility, while ChatGPT o3 maintained its value proposition of naturalness and quick access to information. For the general public, the choice between one and the other should consider the type of task, the need to maintain context history, and the preference for a more conversational interface.
Looking ahead, the expectation is that future versions will expand continuous learning capabilities, reduce biases, and further improve integration with real-time sources. The race for technical accuracy stimulates the development of ensemble learning methods and evaluations that consider not only “right” or “wrong”, but also ethical and safety aspects.
Sources
- https://www.entrepreneur.com/es/tecnologia/gemini-o-chatgpt-tu-decides/491649
- https://www.iatransformers.academy/blog/esta-cambiando-la-busqueda-para-siempre-el-desafio-de-la-ia-a-google
- https://www.forbesargentina.com/innovacion/la-ia-o3-chatgpt-supera-competidores-investigacion-web-hasta-donde-llegan-sus-capacidades-n72176
- https://www.hostingtg.com/blog/gemini-2-5-previenueva-ia/
- https://vegaconsultores.es/gemini-vs-chatgpt-que-ia-se-adapta-mejor-a-tu-empresa/
- https://www.clarin.com/tecnologia/ranking-inteligencia-artificial-chatgpt-grok-mejores-conviene-usar_0_vy1UUCgViS.html