Turns out, AI can actually build competent Minesweeper clones — Four AI coding agents put to the test reveal OpenAI’s Codex as the best and Google’s Gemini CLI as the worst


As corporations increasingly pursue AI advancements, a recent test conducted by Ars Technica evaluated four popular AI coding agents. The task was to create a web version of Minesweeper, incorporating sound effects, mobile touch support, and an engaging gameplay twist.

Minesweeper revolves around logic and offers a distinct challenge, making it a worthy test of AI capability. Getting its underlying mechanics right, however, usually requires a level of creativity that is typically human-driven.

The test featured paid versions of four coding agents: Claude Code from Anthropic, Gemini CLI from Google, Mistral Vibe, and OpenAI’s Codex, built on GPT-5. Each agent received identical instructions and produced its result in a single original run, without any human input.

OpenAI Codex – 9/10

Codex emerged as the top performer, delivering solid visuals and including “chording,” a feature that lets players click a revealed number to clear the surrounding tiles once the matching number of flags has been placed. This makes the clone feel more polished and enjoyable. Codex’s version featured functional buttons, including a sound toggle with authentic sound effects, and an on-screen guide for players. It even included a “Lucky Sweep” button that revealed one safe tile when triggered. The interface was user-friendly, although Codex took longer to generate its code than the other agents. Ars Technica rated it 9 out of 10.
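Chording is simple to state but easy for a generated clone to omit. As a minimal illustrative sketch (not Codex’s actual code), assuming a hypothetical `Cell` grid and a `reveal` helper, the rule might look like this in TypeScript:

```typescript
// Illustrative chording sketch; Cell, Grid, and reveal() are assumed names,
// not taken from any of the generated games.
interface Cell {
  mine: boolean;
  flagged: boolean;
  revealed: boolean;
  adjacentMines: number; // the number shown on a revealed tile
}

type Grid = Cell[][];

// Clicking a revealed number "chords" when the count of flags around it
// matches that number: every unflagged, unrevealed neighbor is opened at once.
function chord(
  grid: Grid,
  row: number,
  col: number,
  reveal: (r: number, c: number) => void
): void {
  const cell = grid[row][col];
  if (!cell.revealed || cell.adjacentMines === 0) return;

  const neighbors: [number, number][] = [];
  for (let dr = -1; dr <= 1; dr++) {
    for (let dc = -1; dc <= 1; dc++) {
      if (dr === 0 && dc === 0) continue;
      const r = row + dr, c = col + dc;
      if (grid[r]?.[c]) neighbors.push([r, c]);
    }
  }

  const flags = neighbors.filter(([r, c]) => grid[r][c].flagged).length;
  if (flags !== cell.adjacentMines) return; // flag count must match the number

  for (const [r, c] of neighbors) {
    const n = grid[r][c];
    // A misplaced flag means reveal() can still hit a mine, which is the
    // risk that makes chording a skill rather than a shortcut.
    if (!n.flagged && !n.revealed) reveal(r, c);
  }
}
```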

Claude Code – 7/10

Coming in second was Claude Code, which performed well visually and generated its code in roughly half the time Codex needed. It boasted custom graphics and enjoyable sound effects, but the missing chording support was a notable omission for testers. It did offer a “Power Mode” with simple power-ups and a “Flag Mode” button for mobile play (sketched below), which improved usability. The overall experience was solid, earning a 7 out of 10, though the score could have been higher with chording.
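A flag-mode button addresses a real constraint: touch screens have no right-click, so mobile Minesweeper clones typically toggle what a tap means. A hedged sketch of that idea, using hypothetical element IDs and helper names rather than Claude Code’s actual output, could be as simple as:

```typescript
// Illustrative mobile flag-mode toggle; "#flag-mode", toggleFlag, and
// revealTile are assumed names for the sketch.
let flagMode = false; // when true, a tap plants or removes a flag instead of revealing

document.querySelector<HTMLButtonElement>("#flag-mode")?.addEventListener("click", (e) => {
  flagMode = !flagMode;
  (e.currentTarget as HTMLButtonElement).classList.toggle("active", flagMode);
});

// Single tap handler for every tile; the toggle decides the action.
function onTileTap(row: number, col: number): void {
  if (flagMode) {
    toggleFlag(row, col);
  } else {
    revealTile(row, col);
  }
}

// Assumed game helpers so the sketch stands alone.
declare function toggleFlag(row: number, col: number): void;
declare function revealTile(row: number, col: number): void;
```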

Mistral Vibe – 4/10

Mistral’s Vibe placed third, lacking both chording and sound effects. It included a non-functional “Custom” button and offered no gameplay twist. Although the game worked adequately and the agent’s coding interface was pleasant to use, the result failed to impress: testers found the all-black emoji tile design unappealing, and one of its gameplay modes visually malfunctioned on mobile. Vibe earned a 4 out of 10, a score some felt was harsher than its performance deserved.

Google Gemini CLI – 0/10

In last place was Google’s Gemini CLI, which failed to produce a playable game at all. Its output had buttons but never rendered the tile grid, leaving nothing to play. Despite visual similarities to Claude Code’s output, Gemini’s coding process was considerably slower, with frequent requests to pull in external dependencies. Even after adjustments to keep the output as plain HTML5, it still couldn’t produce working code. Notably, Gemini CLI lacked access to the recent Gemini 3 models, relying instead on older ones. The disappointing performance resulted in a score of 0.

Overall, Codex led the test, followed by Claude Code, with Mistral Vibe trailing and Google failing to produce a functional game at all. The evaluation raises questions about the current state of AI coding agents and how much tangible benefit they actually deliver.
