February 22, 2025

I Tested Grok-3—Here’s What Surprised Me

The AI arms race just got more interesting.

You’ve probably seen the headlines: Grok-3 is here. Elon Musk’s xAI has officially dropped what he boldly calls “the smartest AI on Earth.”

Big claims? Definitely. But is it really the best? That’s where things get interesting.

I’ve spent time testing Grok-3, pitting it against some of the best models out there: OpenAI’s o1 and o3-mini, and DeepSeek R1. I ran them through real-world tasks, from complex problem-solving to deep-dive research. The results? Let’s just say it’s not all hype, but it’s not a clear knockout either.

Here’s what I found—and why it matters for you.

What’s Grok-3 and Why Should You Care?

After weeks of rumors, xAI officially launched Grok-3 along with three other models:

Grok-3 mini (a smaller, faster version)
Grok-3 Reasoning
Grok-3 mini Reasoning

The names aren’t groundbreaking (AI labs clearly don’t hire branding experts), but the technology behind them is worth your attention.

So, what’s the deal with reasoning models?

Think of AI in two flavors:

Non-reasoning models → Fast thinkers, great for creative tasks and spitting out quick answers. (Think GPT-4o, the regular free ChatGPT.)
Reasoning models → The deep thinkers. They take time to “think aloud,” generating a chain of thoughts, reflecting on each step, and self-correcting if necessary. (Examples: OpenAI’s o1 and o3 models.)

How an LLM responds (left) and how a reasoning model does (right).

The definition of 'reasoning' varies between AI labs, but most agree it means the model takes extra time to review and refine its answers before responding.

Reasoning models are significantly more accurate and powerful than standard non-reasoning models. They're so effective that I now use them for almost everything, except simple writing refinements, basic tasks, or sessions with assistants (GPTs) in ChatGPT.

Grok-3 is a non-reasoning model that performs well against its peers. In contrast, “Grok-3 Reasoning” takes a more methodical approach.

You can access and try Grok-3 here. Switching between Reasoning and non-reasoning modes is as straightforward as flipping a switch. The deep search feature can be toggled just as easily. I'm genuinely impressed with Grok's interface—it's sleek and intuitive.

On Paper, Grok-3 Is a Beast. But Does That Matter?

Benchmark results suggest Grok-3 and its mini version are insanely powerful. Both appear on par with, or even ahead of, OpenAI’s o3-mini reasoning model.

But here’s the catch: benchmarks don’t always translate to real-world performance. So, I ditched the theoretical metrics and put these models to the test on tasks that matter.

The Face-Off: Grok-3 vs. ChatGPT o3-mini vs. DeepSeek R1

Let me break down the current AI landscape for you. Right now, we've got three major players in the reasoning AI game:

DeepSeek (the free option - yes, really!)
ChatGPT by OpenAI (comes in two flavors: $20/month with some limits, or go all-in at $200/month)
Grok (the new kid on the block at $30/month for SuperGrok)

Each one brings something special to the table, though your wallet might have different feelings about each.

Here's what I'm going to do: I'll take these AI powerhouses for a spin, putting them through their paces on 4 real-world problems. Nothing too academic or fancy - just practical, everyday stuff you might actually use. By the end of this letter, I'll give you my honest take on whether Grok-3 lives up to the hype.

We'll look at three key areas: how well they handle complex problem-solving, their ability to plan and organize (you know, the stuff that keeps us awake at night), and how deep they can dive when researching topics. Let’s dive in.

Challenge #1: Solving Complex Business Case Studies

Problem #1: "You're working for a bank to enhance the performance of a corporate credit options product. The bank only makes money when a contract is activated. Total signed contracts consist of active contracts (which generate profit) and inactive contracts. Given that you have access to both the average revenue per activated contract and the average cost per contract, what is the formula for profit per signed contract? Assuming the average revenue per activated contract is $2,500 and we increased the number of activated contracts by 25% while the total number of signed contracts remained the same, what would be the increase in profit per contract?"

Here's some context that makes this problem fascinating: I used to pose this exact case study to junior strategy consultants during interviews. Between 2019 and 2021, I tested more than 150 candidates from the world's top universities - we're talking brilliant minds here. Want to know something interesting? Even with 20-40 minutes and the ability to ask questions, only about 25% could crack it with a solid explanation. Now let's see how our "AI friends" handle it...

The Results:

✅ Grok-3 + Reasoning (86s): Nailed the equation, result, and provided a crystal-clear explanation.
✅ o3-mini (46s): Same flawless output but faster and more concise.
✅ o1 (45s): Delivered the cleanest explanation of all—short, sharp, and easy to follow.
❌ DeepSeek R1 (486s): Got tangled in its own logic, overthought the problem, and missed the solution entirely.

Solution example:

Example of a perfect and concise answer by o3-mini.

Takeaway: Grok-3 matches the best but doesn’t outshine them. DeepSeek, meanwhile, seems to get lost in its own thoughts.
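
For readers who want the formula behind Problem #1 spelled out, here is a minimal Python sketch. It is my own reconstruction rather than any model's output, and it assumes the average cost applies to every signed contract. The contract counts and the $300 cost are hypothetical placeholders; only the $2,500 revenue and the 25% uplift come from the problem.

```python
def profit_per_signed_contract(activated, signed, revenue_per_activated, cost_per_signed):
    """Profit per signed contract = activation rate * revenue per activated contract - cost."""
    activation_rate = activated / signed
    return activation_rate * revenue_per_activated - cost_per_signed


# Hypothetical inputs (not given in the problem): contract counts and cost.
signed = 1000
activated = 400
cost = 300.0

# Given in the problem: revenue per activated contract and the 25% uplift.
revenue = 2500.0
uplift = 0.25

before = profit_per_signed_contract(activated, signed, revenue, cost)
after = profit_per_signed_contract(activated * (1 + uplift), signed, revenue, cost)
increase = after - before

print(f"Increase in profit per signed contract: ${increase:.2f}")
```

Because the cost term cancels out, the increase is 0.25 × activation rate × $2,500, or $625 times the original activation rate. With the hypothetical 40% activation rate above, that comes to $250 per signed contract.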

Let's push this test further by adding an optimization challenge. We'll ask each AI to help us allocate our efforts more efficiently.

Problem #2: “You're working for a bank to enhance the performance of a corporate credit options product. The bank only makes money when a contract is activated. Total signed contracts consist of active contracts (which generate profit) and inactive contracts. Given that you have access to both the average revenue per activated contract and the average cost per contract, what is the formula for profit per signed contract? Assume we have 4 contract categories. The average revenue per activated contract is $2,500, $1,000, $200, and $200 respectively. One unit of effort can enhance the number of activated contracts by 5%, 10%, 2%, and 40% respectively across the categories, while the total number of signed contracts remains the same. We have only 4 units of effort to spend, and we can’t use all the units on one category. What's the best allocation of effort across categories to maximize the profit per contract?”

The Results:

✅ o1 (36s): PERFECT
✅ Grok-3 + Reasoning (84s): Correct equation + accurate result + thorough but overly complex explanation
❌ o3-mini (28s): Correct equation but wrong results due to incorrect reasoning and misinterpretation of the request.
❌ DeepSeek R1 (321s): Simply WRONG.

Solution example:

Here's an example of o1's perfect answer. The problem required complex reasoning since the model needed to evaluate multiple scenarios and determine the optimal solution while considering all constraints.

Takeaway: While Grok-3 takes longer to process, it's the only model that matches OpenAI's o1 in performance. The o3-mini struggles with moderately complex problems, and DeepSeek consistently loses track of its reasoning process.
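
To make the "evaluate multiple scenarios" part of Problem #2 concrete, here is a rough brute-force sketch in Python. It rests on assumptions the problem does not state: every category starts with the same number of activated contracts (the hypothetical `BASE_ACTIVATED`), each unit of effort adds its percentage linearly, and effort is spent in whole units. Since signed contracts and costs stay fixed, maximizing extra revenue is the same as maximizing profit per signed contract. This is not o1's answer, and different baseline counts would change the result.

```python
from itertools import product

# Given in the problem.
REVENUE = [2500, 1000, 200, 200]       # average revenue per activated contract
UPLIFT = [0.05, 0.10, 0.02, 0.40]      # activation gain per unit of effort
TOTAL_EFFORT = 4

# Hypothetical: equal baseline of activated contracts in each category.
BASE_ACTIVATED = [100, 100, 100, 100]


def extra_revenue(allocation):
    """Extra revenue from an allocation, assuming each unit adds its uplift linearly."""
    return sum(
        units * gain * base * revenue
        for units, gain, base, revenue in zip(allocation, UPLIFT, BASE_ACTIVATED, REVENUE)
    )


def best_allocation():
    """Enumerate every whole-unit split of the effort and keep the most valuable one."""
    candidates = [
        alloc
        for alloc in product(range(TOTAL_EFFORT + 1), repeat=len(REVENUE))
        if sum(alloc) == TOTAL_EFFORT      # spend exactly 4 units
        and max(alloc) < TOTAL_EFFORT      # not all units on one category
    ]
    return max(candidates, key=extra_revenue)


best = best_allocation()
print("Best allocation (units per category):", best)
```

Under these assumptions the search puts three units on the first category and one on the second, because one unit is worth 5% × $2,500 = $125 per activated contract in category 1, against $100, $4, and $80 in the others.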

Now, let's tackle a different kind of challenge.

Challenge #2: Planning

Problem #3: “Create a comprehensive process for managing a project portfolio in a large international IT department. Detail the following: idea collection and tagging, inclusion/exclusion criteria, responsibility assignments, task scheduling, resource allocation, security considerations, penetration testing, and prioritization methods. Specify required data collection, success metrics, communication channels, and decision-making processes. Provide a step-by-step implementation plan where each project cycle must complete within 6 months from ideation to delivery. Include team sizes and process details. Address how to handle a maximum number of concurrent and interdependent projects, with a specific example managing 20 overlapping projects, of which 10 are interdependent. Create an example and show me how the portfolio should be managed using a table. I can’t hire more than 5 people per team. I have only the necessary teams. Your plan must include this constraint.”

The Results:

✅ o1 (36s): Actionable, clear, and handled project dependencies like a pro.
✅ o3-mini (59s): Solid output, though a bit surface-level.
🟠 Grok-3 + Reasoning (92s): Good effort, but not as deep as the others.
🟠 DeepSeek R1 (81s): Similar depth as Grok-3.

Solution example:

Here's an example of o1's perfect answer. My main criterion for judging is how quickly the solution can be implemented in a real-world situation without unnecessary fluff. I really like o1's answer. The example is very close to what we actually see in the real world.

Takeaway: If depth matters, o1 dominates. Grok-3 performs decently but lacks the nuance needed here; o1 is clearly superior at planning and handling complex processes.

Challenge #3: Deep Dive into a Niche Topic

Now, let's explore something really interesting - deep search capabilities. This is where AI can scan and analyze information across the internet to give you comprehensive overviews of any topic.

Several companies like Google, OpenAI, and xAI (with Grok) now offer this feature (not for free), which is super helpful for anyone needing to do thorough research for their work.

To test this out, I picked a topic I know inside and out from my researcher days. It's pretty niche, which makes it perfect for testing how deep these AIs can really go when gathering and analyzing information.

For this challenge, I'll be comparing the deep search features of Grok-3, o3-mini, and DeepSeek R1. Just a heads up - I won't be including Google or Perplexity in this comparison.

Here's the problem:

Problem #4: "Provide a comprehensive overview of Mean Field Game theory for a general audience. Include all major branches and paradigms: game types, approaches, and methodologies. The explanation should deliver thorough, in-depth understanding while remaining accessible."

(I know that most of you don't know what the heck Mean Field Game Theory is, but don't worry—it doesn't matter.)

The Results:

✅ Grok-3 + Deep Search (21s - 20 sources): Nailed it with depth and clarity.
✅ o3-mini + Deep Search (14s - 18 sources): Very good, but missed a few key angles.
🟠 DeepSeek R1 (86s): Passable, but nowhere near as thorough.

Takeaway: Grok-3 shines in niche deep dives, arguably outperforming the competition in depth. On speed, o3-mini was slightly faster (14s vs. 21s), and both were far quicker than DeepSeek R1 (86s).
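
If you want the four sets of results side by side, a quick tally like the Python sketch below does the job. The verdicts and timings are copied from the results above; the scoring (1 for ✅, 0.5 for 🟠, 0 for ❌) is my own hypothetical weighting, and I group the Reasoning and Deep Search modes under a single model name. Note that o1 was only run on three of the four problems.

```python
# Verdicts and response times (seconds) as reported in the four problems above.
RESULTS = {
    "Problem 1": {"Grok-3": ("pass", 86), "o3-mini": ("pass", 46),
                  "o1": ("pass", 45), "DeepSeek R1": ("fail", 486)},
    "Problem 2": {"Grok-3": ("pass", 84), "o3-mini": ("fail", 28),
                  "o1": ("pass", 36), "DeepSeek R1": ("fail", 321)},
    "Problem 3": {"Grok-3": ("partial", 92), "o3-mini": ("pass", 59),
                  "o1": ("pass", 36), "DeepSeek R1": ("partial", 81)},
    "Problem 4": {"Grok-3": ("pass", 21), "o3-mini": ("pass", 14),
                  "DeepSeek R1": ("partial", 86)},
}

# Hypothetical weighting: full credit, half credit, no credit.
SCORE = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def summarize(results):
    """Add up score, number of problems attempted, and total seconds per model."""
    summary = {}
    for outcomes in results.values():
        for model, (verdict, seconds) in outcomes.items():
            entry = summary.setdefault(model, {"score": 0.0, "tests": 0, "seconds": 0})
            entry["score"] += SCORE[verdict]
            entry["tests"] += 1
            entry["seconds"] += seconds
    return summary


summary = summarize(RESULTS)
for model, entry in sorted(summary.items(), key=lambda item: -item[1]["score"]):
    print(f"{model}: {entry['score']} / {entry['tests']} in {entry['seconds']}s total")
```

With this weighting, Grok-3 scores 3.5 out of 4, o3-mini 3 out of 4, o1 3 out of 3, and DeepSeek R1 1 out of 4. Four problems is a small sample, so treat the tally as a summary of this experiment and nothing more.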

So, Should You Switch to Grok-3?

Here’s the real question: Should you ditch your current AI tool for Grok-3?

It depends.

Pricing: Grok-3 costs $30/month for full access, while unlimited access to OpenAI’s top-tier models (o1/o3) will set you back $200/month.
Performance: From my tests (those included here and beyond), Grok-3 doesn’t outclass ChatGPT’s best models but comes close enough (and sometimes is better) to justify the price difference for most users.
Advanced Features: If you rely on GPTs (custom AI assistants), Grok-3 falls short. No assistants, no canvas, no advanced workflows… yet.

My verdict?

Let me be crystal clear: Based on what I've demonstrated here and my experience, even in cases where o1 performs better, paying $200 for unlimited OpenAI access isn't justified. At $30, SuperGrok offers similar or better capabilities. OpenAI's premium subscription price simply doesn't make sense anymore.

If your work heavily relies on internet search rather than AI assistants (like GPTs), Grok-3 is your best bet. It offers the most compelling quality-to-price ratio.

For those starting fresh without needing custom GPTs (personalized assistants), Grok-3 is also an excellent choice. It combines smooth user experience, solid performance, and an unbeatable price point.

However, if your workflow depends on custom GPTs and requires substantial “brain power,” stick with ChatGPT for now—at least until Grok releases its own assistant features.

Honestly, I would've made the switch myself if Grok-3 included something like the personalized assistants (GPT) functionality.

The Bigger Picture: Intelligence Is Getting Cheaper

This isn’t just about Grok-3 vs. ChatGPT.

It’s about where AI is headed in 2025 and beyond. The cost of intelligence is dropping fast. Soon, it’ll be nearly free. If you’re not ready for that shift, it’ll hit you like a train.

Start learning AI right NOW. Open any free assistant and begin experimenting. Check out prompting guides (you'll find plenty on my social media pages @heyCharafeddine on X and Charafeddine Mouzouni on LinkedIn) and start building. Don't get discouraged if AI doesn't immediately solve your problems. That's completely normal—you'll need some practice and patience before things click and you start seeing real results.

Trust me, if you’re not using AI on a daily basis, you need to start now.

Wrap Up

This wasn’t a hyper-scientific, meticulously controlled study. It was a hands-on experiment using real tasks to see what these models can actually do in practice.

If Grok-3 launches custom assistants soon, I might fully switch. Until then, I’ll keep straddling both platforms.

That said, if you're learning AI right now, don't obsess too much over picking the "right" model or tool—focus instead on understanding the tools, adapting quickly, and using them to build something meaningful.

Until next time,

—Charafeddine
