Google has once again climbed to the top of key AI benchmarks with its new Gemini 3.1 Pro. According to TechCrunch, the model promises significantly improved performance on complex work tasks – and the reported figures are striking.

Dramatic Leap in Abstract Reasoning

The most sensational result is Gemini 3.1 Pro's score on ARC-AGI-2, one of the most demanding tests of abstract and general reasoning. Here, the model achieves 77.1 percent – more than double Gemini 3 Pro's 31.1 percent, and the highest result recorded on this test as of February 2026, according to available benchmark data.

The model also tops GPQA Diamond, a benchmark of graduate-level science questions, surpassing both GPT-5.2 and the latest Claude models in these measurements.

Strong on Agent Tasks

Gemini 3.1 Pro also performs strongly on what are known as agent-based tasks – situations where the model must plan and execute a series of actions over time, such as autonomous web research or complex coding. On the APEX-Agents benchmark, the score is 33.5 percent, compared to 18.4 percent for its predecessor.

This is an area growing in practical relevance as companies increasingly deploy AI agents in production.

Record numbers on benchmarks do not necessarily mean the model wins in all real-world use cases.

Weaknesses Exist – Particularly in Secure Coding and Clinical Tasks

However, the picture is not uniformly positive. On SecRepoBench, a benchmark for secure coding, the best Gemini model scores 27.7 percent, while GPT-5 achieves 39.3 percent and OpenAI o3 reaches 32.4 percent – a clear area where the Google model lags behind.

In clinical assessments – specifically within anesthesiology – Claude 3.5 Sonnet ranks highest, while Google's older Gemini 2.0 received a score of 5.2 out of 10. Gemini 3.1 Pro's clinical performance is not yet well documented in available research.

How Does It Fit Into the Big Picture?

On Chatbot Arena – where human users vote on model responses – Gemini 3.1 Pro preview ranks sixth with an Elo score of 1461. This places the model behind Claude Opus 4.5 (1468) and GPT-5.2-high (1471), but ahead of Gemini 3 Pro (1443).

This is a useful reminder that human preferences and benchmark results do not always point in the same direction.

Benchmarks vs. Reality

Industry research from February 2026 shows that 78 percent of global companies use AI in at least one business function, but only 27 percent have scaled it across the entire enterprise. In practice, many companies choose hybrid setups in which different models handle different tasks – rather than blindly betting on whichever one scores highest on a given benchmark.

Comments from the tech community, including on Hacker News, point out that Gemini's benchmark gains do not always translate directly into practical advantages over Claude or Codex in real development scenarios.

In a market where records are set and broken every month, it is production results – not benchmark figures – that ultimately determine the winner.

Gemini 3.1 Pro is nonetheless a clear signal that Google has no intention of relinquishing its position at the forefront of large language models. The model is available with a context window of one million tokens and is priced competitively against Claude Opus 4.6, among others, making it attractive to cost-conscious enterprises working with large-scale multimodal reasoning.