I Tested Grok 3 with a Reasoning & Coding Request - xAI isn't the Best!

Is Grok 3 Really the Best? The Surprising Showdown Among 4 AI Giants

From tech enthusiasts to everyday curious minds, everyone has been buzzing about xAI’s Grok 3. Promoted as a revolutionary leap in large language models, Grok 3 arrived with a massive wave of hype—impressive demos, glowing testimonials, and the promise to blow competitors out of the water. But if there’s one thing we know about breakthroughs in artificial intelligence, it’s that real potential only emerges when you put them to the test.

In today’s ever-evolving world of AI, claims of “most powerful” or “game-changer” pop up by the hour. However, what actually matters is verifying those claims against real problem-solving scenarios. Armed with a single prompt, one YouTuber (and AI enthusiast) decided to rigorously measure how Grok 3 handles a notoriously challenging task: producing a visual, code-based explainer of the Transformer architecture—the backbone technology behind many modern LLMs. To make the experiment robust, three additional heavyweight models entered the ring: DeepSeek R1, O1, and Gemini 2.0.

Why Transformers? They’re at the heart of modern AI success stories and known for being both powerful and complex. For an AI model to visually explain how Transformers work, it needs to reason with significant depth, break down abstract concepts, and then encode everything into a user-friendly piece of code. In short, it’s a tall order for any system, no matter how advanced its claims.

So, how do these four LLMs stack up? Which one will deliver the most informative, intuitive, and coherent explanation? Buckle up as we delve into the results and walk through the outcome of this showdown between top-tier AI contenders. You’ll see how some soared, how others flopped, and how the hype around Grok 3 actually stood up to rigorous testing.

Consider this your backstage pass into the next generation of AI. By the end, you’ll hold a unique vantage point—insider knowledge about where these models truly excel and where they stumble. If you’ve ever wanted to witness a fiery, real-world AI confrontation, you’ve come to the right place!

1. Laying Out the Challenge & Setting Up the Experiments

Before discussing the performance of each model, let’s break down the initial challenge. The goal was to test each AI’s ability to generate a code-driven, interactive web page (using HTML, CSS, and optionally JavaScript) that visually explains the Transformer architecture. This is no standard request for a textual summary:

  • Step One: The model must thoroughly understand how Transformers work (e.g., self-attention, the encoder/decoder structure, and feed-forward networks).
  • Step Two: It must translate that understanding into a visual concept (diagrams, arrows, shapes, color-coded blocks, etc.).
  • Step Three: Finally, the model must deliver runnable code that can display this visual explanation in a web browser.

Why This Prompt Is So Challenging

Transformers aren’t trivial: they involve multi-headed self-attention, encoder-decoder stacks, positional embeddings, and a host of subtleties that can leave even experienced AI practitioners scratching their heads. Asking an AI model to simultaneously reason about conceptual clarity and visual design means bridging left-brain and right-brain thinking.
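
For readers who want the concrete math behind the “self-attention” mentioned above, the scaled dot-product attention from the original Transformer paper (“Attention Is All You Need”) is summarized below. This is standard background included for reference, not something produced by any of the tested models:

```latex
% Scaled dot-product attention: Q (queries), K (keys), and V (values) are
% projections of the input embeddings, and d_k is the key dimensionality.
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]

% Multi-head attention runs h of these in parallel and concatenates the results:
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})
\]
```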

Most language models excel at generating text-based explanations. Yet, writing code for a dynamic, aesthetically pleasing, and structurally accurate diagram is in a whole different league. As soon as you demand that the output be visually interactive, the margin for error skyrockets. For instance, typical LLM answers about Transformers might produce bullet-point lists or ASCII art, but can they produce fully functional HTML and CSS with correct positioning and logic?

Setting Up the Testing Environment

To keep it fair and consistent, the YouTuber used the same prompt across all models. After receiving the code response, he ran it on Repl.it (a popular online coding environment) to ensure the code would launch a simple web page. Any errors or missing dependencies were noted. If the code required debugging, that was considered strike one against the model’s ability to produce a well-rounded solution.

By standardizing the approach—same prompt, same environment, same user interactions—bias toward any single platform is reduced. The next step was to check each AI’s ability to create a visually intelligible, step-by-step breakdown of the Transformer flow. That included looking for features like the following (a hypothetical sketch of this baseline structure appears after the list):

  • Labels for the Encoder and Decoder modules.
  • Arrows that visually represent the flow of data from the input sequence to the output sequence.
  • Hover effects or interactive tooltips that describe each step (self-attention, feedforward, etc.).
  • Explanatory text with minimal jargon, so non-experts could follow along.
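
To make that checklist concrete, here is a minimal, hypothetical sketch of the baseline structure the testers were looking for: labeled blocks and a left-to-right data flow. It is illustrative only and not the output of any of the four models:

```html
<!-- Hypothetical skeleton of a Transformer explainer page; not any model's actual output. -->
<!DOCTYPE html>
<html>
<head>
  <style>
    .pipeline { display: flex; align-items: center; gap: 12px; font-family: sans-serif; }
    .block    { padding: 16px 24px; border-radius: 8px; color: #fff; text-align: center; }
    .input    { background: #607d8b; }
    .encoder  { background: #3f51b5; }
    .decoder  { background: #009688; }
    .output   { background: #795548; }
    .arrow    { font-size: 24px; }
  </style>
</head>
<body>
  <h2>How a Transformer Processes a Sequence</h2>
  <!-- Left-to-right data flow: input tokens -> encoder stack -> decoder stack -> output tokens -->
  <div class="pipeline">
    <div class="block input">Input Sequence</div>
    <span class="arrow">&#8594;</span>
    <div class="block encoder">Encoder<br><small>Self-Attention + Feed-Forward</small></div>
    <span class="arrow">&#8594;</span>
    <div class="block decoder">Decoder<br><small>Masked Self-Attention + Cross-Attention</small></div>
    <span class="arrow">&#8594;</span>
    <div class="block output">Output Sequence</div>
  </div>
  <p>Each block would normally carry further annotation, e.g. tooltips explaining attention.</p>
</body>
</html>
```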

Initial Reactions & User Experience

Once the code was loaded, the immediate reaction was a sniff test: “Does this look like it might teach a newcomer how Transformers work?” This is often the best sanity check for such a visually oriented request—a newcomer can instantly see if the message is accessible or if they’re confronted with a baffling collage of unlabeled arrows or blocks.

While some might argue that coding up a visual explainer is purely front-end web development, it’s the reasoning layer that decides how each piece in the Transformer puzzle is rearranged visually. So, in short:

  • Clarity of Explanation: Is the text helpful? Does it avoid overwhelming jargon?
  • Visual Flow: Are the arrows and boxes meaningful? Do you see a natural left-to-right or top-to-bottom logic that clarifies the process?
  • Interactivity: Is there color coding? Are hover elements present and helpful?

Think of it as the ultimate test of AI creativity meets engineering meets pedagogy. It’s an opportunity to see if the model can step into the shoes of a teacher and deliver an engaging lesson—something that hits that perfect trifecta of readability, correctness, and style.

2. The Four AI Contenders: Grok 3, DeepSeek R1, O1, and Gemini 2.0

With the framework established, it’s time to see how each model fared. Although Grok 3 sparked the biggest hype wave, the other three models—DeepSeek R1, O1, and Gemini 2.0—are no pushovers. Some come from seasoned institutions with years of AI research under their belts; others brag about unique architectural improvements that supposedly give them the upper hand in reasoning tasks. Let’s analyze them in the order the user tested them: xAI’s newly released Grok 3, DeepSeek R1, O1, and then Gemini 2.0.

Grok 3: First Impressions

When we talk about social currency in the AI realm, xAI’s Grok 3 carries a golden aura. It’s fresh, it’s rumored to handle chain-of-thought reasoning like a champion, and it has a sleek interface. That alone makes people feel “in the know,” fueling excitement. The test with Grok 3 yielded a functional piece of HTML code, complete with an encoder-decoder design and a textual explanation. It was a relief to see that the code actually ran without major errors—definitely a plus for any first-day test.

Still, the reaction was a mix of mild satisfaction and mild disappointment. Yes, Grok 3 produced a coherent diagram with arrows connecting blocks like Input Sequence → Encoder → Self-Attention → Decoder → Output. However, the aesthetics felt simplistic—mostly grayscale shapes. The text explanation was present, but not integrated into any interactive elements. There was no hover effect to reveal usage tips or in-depth insights about how attention heads operate, for example.

Given how people were raving on social media, some testers might have expected fireworks—something more dynamic or visually appealing. In the end, Grok 3’s output was serviceable but not revolutionary. It answered the project requirements but did not blow away the competition.

DeepSeek R1: A Disappointing Showing

Next was DeepSeek R1. Dubbed “the free, big-thinking model,” it brought high hopes. The user wanted to see if “deep” meant “profound reasoning.” The result? A minimalistic design with tiny mouse-hover popups that felt unpolished and disjointed. The code ran, but the final page lacked coherence.

One might argue that “less is more,” but in a domain as complex as Transformers, minimal design can become an immediate barrier if it doesn’t convey enough detail. This attempt from DeepSeek R1 offered snippets of explanatory text but fell short in visual architecture clarity. Boxes were not well-labeled, arrows didn’t line up in ways that novices would find intuitive, and the entire user experience was a step behind Grok 3 in completeness.

Emotions ranged from mild frustration to disappointment among testers, especially if they arrived with the hope of a “dark horse surprise.” While it’s possible to tweak the code for better results, that’s not the point of the test. The prompt aimed to evaluate “out-of-the-box” model performance. DeepSeek R1 simply did not deliver that “aha” moment needed to truly learn the Transformer mechanism at a glance.

O1 Model: A Surprising Comeback

Third in line came O1. Originally tested some time ago, O1 had previously underwhelmed. However, in this demonstration, it seemed the model (or the service behind it) might have improved significantly. Like Grok 3, O1 delivered workable code right away—no immediate bugs or 404 errors. On top of that, the visual layout was more polished than expected, featuring color-coded boxes, some arrows for data flow, and hover-based tooltips that explained key concepts like self-attention, encoder-decoder attention, and feed-forward layers.

This level of interactivity pushed O1 ahead of DeepSeek R1 in terms of user-friendliness. Testers found it refreshing to hover over an element labeled “Self-Attention” and see a short, tooltipped paragraph explaining how queries, keys, and values link together. Such features add a new dimension of practical value for a learner who craves quick clarifications.
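
To picture what that interaction feels like, below is a rough, CSS-only approximation of a hover tooltip on a “Self-Attention” block. It is a sketch of the behavior described, not a reproduction of O1’s actual code:

```html
<!-- Illustrative hover tooltip; an approximation of the behavior described, not O1's real output. -->
<style>
  .concept { position: relative; display: inline-block; padding: 12px 20px;
             background: #3f51b5; color: #fff; border-radius: 6px;
             font-family: sans-serif; cursor: help; }
  .concept .tip { visibility: hidden; position: absolute; top: 110%; left: 0; width: 280px;
                  background: #263238; color: #eceff1; padding: 10px; border-radius: 6px;
                  font-size: 14px; }
  .concept:hover .tip { visibility: visible; }  /* reveal the explanation on hover */
</style>

<div class="concept">
  Self-Attention
  <div class="tip">
    Each token is projected into a query, a key, and a value. Queries are compared
    against keys to produce attention weights, which are then used to mix the values,
    letting every position attend to every other position in the sequence.
  </div>
</div>
```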

Of course, not all was perfect. Certain design choices, like color contrast, font sizing, and sparse arrow connections, still left room for improvement. But for a “mid-tier” model not widely hyped, O1’s performance left many watchers thinking, “Wait, maybe we should pay more attention here.” The final verdict positioned O1 as a well-rounded competitor—arguably better than Grok 3 in the interactive department, if not in raw textual clarity.

3. Gemini 2.0 Takes Center Stage—And the Emerging Rankings

No showdown is complete without a rumored champion or a silent killer. Gemini 2.0, Google’s widely anticipated foray, has had its ups and downs with “next-gen thinking” hype. In the extended test, Gemini 2.0 was handed the exact same prompt. Right away, the user noted that the code included:

  • A well-structured HTML layout with color-coded blocks.
  • Clear top-to-bottom labeling of each major step in the Transformer pipeline.
  • Tooltip overlays or “pop-ups” that precisely define complex terms (like masked self-attention).

The comment heard the most was: “This is neat, visually appealing, and easy to grasp.” If that’s not a trifecta for an educational web page, then it’s certainly close. While some might say the color scheme or arrow logic in Gemini’s design could be improved, testers found it to be the most polished, interactive, and user-friendly code among the four attempts.

The Element of Surprise

What’s truly surprising here, from an emotional standpoint, is how Gemini 2.0—which many discounted or teased for lagging behind faster, more “trendy” models—seemed to shine. The obvious question is: “Why did this happen?” Possibly, Google’s extensive R&D around user interfaces, combined with large swathes of training data, gave Gemini 2.0 a nuanced approach to structuring tutorials. Or maybe it has more thorough code generation checks. Regardless, the result was enough to stun watchers who had higher expectations for Grok 3’s brand-new iteration.

Final Ranking & Takeaways

At the end of the day, it’s not about saying one model is absolutely better than all others. This was a single test. But in this context, the crowned winner was Gemini 2.0, followed closely by O1. Grok 3 trailed behind them primarily due to the lack of interactive visuals (even though it functioned just fine for a static-coded explanation). Meanwhile, DeepSeek R1 turned in the weakest performance.

This is an important reminder for AI enthusiasts, developers, and the general public: Don’t jump on any hype train without real testing. Social proof (people shouting about an LLM’s power online) carries weight, but it is only as credible as the actual performance behind it. In a domain where new releases happen weekly, the best approach to stay ahead—and to enjoy that sense of social currency for being in the know—is to watch real demonstrations, run your own tests, and compare results with your own eyes.

Conclusion: Are We on the Brink of a Game-Changer, or Is the Hype Overblown?

On the surface, Grok 3 is a robust model—no denying that. It provided stable code, a coherent diagram, and a decent textual rundown of Transformer mechanics. But if the question is, “Is it the undisputed champion of LLMs right now?” this user’s demonstration reveals a more nuanced story. Both O1 and Gemini 2.0 arguably outperformed Grok 3 under the constraints of a highly specialized prompt, one that demands coding skill plus instructional design sense.

Of course, that doesn’t mean Grok 3 will underperform in all areas. AI models are multi-faceted; they can excel in language comprehension but lag in code-based tasks, or vice versa. This deeper exploration encourages each of us to adopt a “trust, but verify” attitude. If you ask for straightforward text summaries, Grok 3 might outshine them all. If you want a complex code-based demonstration, perhaps O1 or Gemini has a slight edge. Meanwhile, if you’re simply curious to tinker with a free platform, DeepSeek R1 might suffice for basic tasks, even if it lacks polish.

The bottom line? Keep your eyes wide open in this dynamic AI race. Don’t automatically assume the flashiest promotional video equates to the absolute best performance across the board. The more you test and compare, the more you’ll refine your understanding of each model’s strengths and weaknesses.

Now, we want to hear your thoughts! Did you expect Grok 3 to top the list? Has Gemini 2.0 caught you off guard with its well-crafted code demo? Every new piece of information helps shape the public’s understanding of how these AI systems measure up.

If you found this exploration helpful—or simply entertaining—please consider subscribing to my YouTube channel to stay ahead of the curve on further AI showdowns and technology breakdowns. You can find it right here: My YouTube Channel.

And if you’d like even more updates, insider AI commentary, and exclusive project releases, sign up for the official newsletter at Join the Newsletter. You’ll be among the first to see brand-new tests and behind-the-scenes analysis.

Finally, help us spread the word by sharing this blog with a friend or colleague who’s curious about LLMs and the inner workings of Transformers. Nothing builds our AI community faster than word-of-mouth from enthusiastic readers.

So, keep exploring, keep testing, and keep discovering the next big thing in AI—because the revolution is just getting started!
