AI systems can now compete with gold medalists at the International Mathematical Olympiad, but when problems are truly new and unpublished, the gap to human experts turns out to be far larger than benchmark figures suggest.
First Proof: Ten problems no one has seen before
The First Proof challenge was launched on February 5, 2026, and consists of ten unpublished research-level mathematics problems drawn from fields such as algebraic combinatorics and representation theory. The solutions were released eight days later, on February 13, according to the OpenAI blog.
The purpose is explicitly to test AI systems on problems that could not have leaked into their training data, addressing a known weakness of existing benchmarks: models can effectively recognize solution patterns from previously seen material.
AI models can imitate known results, but they twist and distort them when the problems are genuinely new — Lauren Williams, mathematician

OpenAI's claims — with caveats
OpenAI presented its submissions on February 14, 2026, the day after the solutions were published. The company claims its model produced proofs it judges correct with high confidence for five of the ten problems: numbers 4, 5, 6, 9, and 10. Problems 9 and 10 were reportedly solved at an early stage, while 4 and 6 were added after further training.
However, the remaining submissions are still under expert review, and as of February 21, 2026, there is no complete public list of who solved what. The project invites submissions via the hashtag #1stProof and a Zulip channel hosted by ICARM; to be considered credible, results must be documented from before the solutions were released.

Major difference between competition and research mathematics
The results from First Proof should be seen against the broader picture of AI and mathematics in 2026. On standardized competition problems, performance is impressive: DeepMind's Aletheia scores 95.1 percent on IMO-ProofBench Advanced, and GPT-5.2 reportedly solved all 15 problems in AIME 2025 correctly, according to available benchmark data.
But when problems are genuinely unknown, performance drops dramatically. On Humanity's Last Exam, a benchmark designed to counteract memorization, top models score below 40 percent. Experts cited by Phys.org point out that AI systems lack creative depth and the ability to make the original leaps that characterize real mathematical research. A particular concern is so-called "proof by intimidation": models can produce long, technically impressive proofs that nonetheless contain errors.
A lighter pipeline solved problem 4 — and more
Notably, a separate, lightweight AI pipeline reports having fully solved problem 4 from First Proof, with verification by mathematics experts, in addition to all problems in the related ICCM sets 1 and 2. This suggests that the race is not limited to the large laboratories, and that considerable uncertainty remains about which systems actually perform best.
What do the mathematicians say?
Harvard mathematician Lauren Williams, who has followed how AI systems perform on research problems, reportedly believes the models are good at reproducing known results but fail systematically when the problems are genuinely new. That observation is consistent with the fact that most models in First Proof did not come close to solving all ten problems.
Going forward, First Proof is planned as an ongoing benchmark: new rounds will continue to test whether AI systems are actually advancing in the kind of mathematics done at the research frontier, not just on problems that are already understood and categorized.
