Technology
September 18, 2025
9 min read
1,649 words

OpenAI's 'o1' Isn't Thinking. It's Just Guessing Faster. Why We Aren't Buying the Hype.

We spent $5 watching o1-preview 'think' for 60 seconds about a React bug. It failed. Reasoning models are just 'Retrying-as-a-Service'. Here is why we aren't buying.


We gave our Junior Dev access to the new OpenAI o1-preview model. He pasted in a gnarly React race condition bug. The model 'thought' for 60 seconds. It output a solution. It was wrong. It 'thought' again for 45 seconds. Wrong again. We spent $5 in API credits watching a probabilistic machine talk to itself in a loop. I realized: Reasoning Models are just Retrying Models. We can build a while loop that retries GPT-4o for $0.05. Why pay 100x?

The $5 "Debug" Session That Broke the Illusion

Last Tuesday, "Strawberry" (the codename for OpenAI's new reasoning model, o1) finally dropped. The Twitter timeline (or X, if you must) went into a frenzy. "AGI is here," claimed the influencers. "It solved the International Math Olympiad!" screamed the headlines.

At XQA, we don't care about Math Olympiads. We care about shipping software. So, we decided to put o1 to the test immediately.

We had a particularly nasty bug in our Next.js dashboard. It was a classic React useEffect race condition where three different API calls were firing asynchronously, and depending on which one returned first, the global state would get corrupted. It was a heisenbug. It only happened 10% of the time, usually when the network was slow.

I sat down with our Junior Engineer, opened the new "o1-preview" interface, and pasted the 200 lines of spaghetti code.

"Find the race condition," I typed.

Then, we waited.

And waited.

The UI showed a mesmerizing little animation that said "Thinking...". It expanded to show steps: "Analyzing dependency array...", "Simulating network states...", "Checking closure staleness...".

It felt magical. It felt like a human was looking at the code. For 60 seconds, we held our breath.

Then, it spat out the answer: "You are missing abortController in your fetch call."

We implemented it. It didn't work. The bug persisted.

I told o1: "That didn't fix it. The state is still flickering."

It went back to "Thinking..." and burned another 45 seconds of compute. It came back with a second hypothesis: "You need useLayoutEffect" (which is terrible advice for data fetching, by the way).

We tried that. Wrong again.

By the end of the session, we had spent about $5.00 in API credits (o1 is expensive). We had zero fixes.

Ten minutes later, a Senior Engineer looked at the code and said, "Oh, you aren't clearing the previous state update before the new one comes in. Just add a request ID check."

One line of code. Fixed.
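
For what it's worth, the pattern itself is language-agnostic. Here is a rough sketch of the same idea in Python (the names and state object are made up for illustration; the real fix was one line of React): tag every request with an ID and let only the latest one write to state.

import asyncio
import itertools

# Hypothetical Python translation of the fix: tag every request with an ID
# and let only the most recent request's response write to state.
_ids = itertools.count(1)
_latest_id = 0
state = {"data": None}

async def load_dashboard(fetch):          # fetch: an async callable
    global _latest_id
    my_id = next(_ids)
    _latest_id = my_id                    # this call is now "the latest"
    result = await fetch()                # slow, may finish out of order
    if my_id != _latest_id:               # a newer request started meanwhile
        return                            # stale response: drop it
    state["data"] = result                # only the newest writer wins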

This experience forced me to dig deep into what "Reasoning" actually means in the context of LLMs. What I found was not magic. It was not AGI. It was a very clever, very expensive implementation of a concept we've known about for years: Inference-Time Compute.

Section 1: The Economics of "Hidden Tokens"

To understand why o1 is a trap for most businesses, you have to understand the pricing model. And to understand the pricing model, you have to understand Chain of Thought (CoT).

In a standard LLM (GPT-4), you input 100 tokens, and it outputs 100 tokens. You pay for 200 tokens.

In o1, you input 100 tokens. The model then generates 5,000 "Hidden Tokens" of internal monologue. It talks to itself. It tries Strategy A, realizes it's wrong, backtracks, tries Strategy B, verifies it, and then—finally—outputs 100 tokens of "Answer" to you.

The Billing Nightmare

Here is the kicker: You pay for the 5,000 hidden tokens.

Imagine hiring a taxi driver to take you to the airport. The airport is 10 miles away. But the driver decides to drive 500 miles in circles "thinking" about the best route. When he finally drops you off, he charges you for 510 miles.

You never saw the 500 miles. You just wanted to get to the airport.

OpenAI's pricing for o1-preview is roughly $15.00 per 1M input tokens and $60.00 per 1M output tokens. Those hidden tokens count as "output" tokens.

This changes the unit economics of AI features completely. If you are building a Customer Support bot, you cannot afford to pay $0.50 per query just so the bot can "think" about whether the user is asking for a refund or a return. For 99% of business use cases, this cost is prohibitive.
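
To make that concrete, here is a quick back-of-the-envelope using the prices above and the (illustrative) 5,000 hidden tokens from earlier:

# Back-of-the-envelope cost of a single o1-preview query, using the
# per-million-token prices quoted above. The 5,000 hidden reasoning
# tokens are an assumption for illustration; the real count varies.
O1_INPUT_PER_M = 15.00     # $ per 1M input tokens
O1_OUTPUT_PER_M = 60.00    # $ per 1M output tokens (hidden tokens bill here)

def o1_query_cost(input_tokens, visible_output_tokens, hidden_tokens):
    billed_output = visible_output_tokens + hidden_tokens
    return (input_tokens / 1e6) * O1_INPUT_PER_M + \
           (billed_output / 1e6) * O1_OUTPUT_PER_M

# 100 tokens in, 100 tokens of visible answer, 5,000 tokens of "thinking"
print(round(o1_query_cost(100, 100, 5_000), 4))   # ~$0.31, most of it invisible

At roughly $0.31 per query, a hypothetical support bot handling 10,000 queries a day is burning about $3,000 a day, most of it on text nobody ever reads.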

Section 2: "Thinking" is Just "Guessing in a Loop"

Let's demystify the magic. What is actually happening during that "Thinking" phase?

It is not "reasoning" in the human sense. It does not have a mental model of the world. It is performing Probabilistic Tree Search.

Imagine you are playing Chess. You look at a move. You think, "If I move here, he moves there. That's bad. Let me retract that mental move and try another."

o1 is doing this with text tokens.

  1. Draft 1: "The bug is in the fetch call." (Model checks itself: "Wait, is it? Let me look at the logs provided.")
  2. Correction: "No, the logs show the fetch succeeded. It must be the state update."
  3. Draft 2: "The bug is in the setState." (Model checks: "Does this align with React docs?")
  4. Final Output: "Check your setState."

This is technically impressive. But functionally, it is merely Retrying-as-a-Service.

We have been doing this for years using agentic frameworks like LangChain or AutoGPT. We prompt GPT-4: "Write code." Then we prompt it: "Review your code for bugs." Then we prompt it: "Fix the bugs."

OpenAI has simply vertically integrated this loop into the inference engine and hidden it behind a loading spinner. They sold us a while loop and called it Intelligence.
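
Don't take my word for it; the loop is trivial to build yourself. Here is a minimal sketch around GPT-4o using the standard chat completions API. The prompts, the crude stopping condition, and the three-attempt cap are my own arbitrary choices, not anything OpenAI has published:

# A minimal "retry until the critic is satisfied" loop around GPT-4o.
# This is a sketch of the draft -> critique -> fix pattern, not o1's
# actual internals; prompts and the 3-attempt cap are arbitrary choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def solve_with_retries(task, max_attempts=3):
    answer = ask(f"Write code for this task:\n{task}")
    for _ in range(max_attempts):
        critique = ask(f"Review this code for bugs:\n{answer}")
        if "no bugs" in critique.lower():          # crude stopping condition
            break
        answer = ask(f"Fix the bugs described here:\n{critique}\n\nCode:\n{answer}")
    return answer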

Section 3: The Latency Problem (Why Real-Time is Dead)

In the world of SaaS, Latency is the enemy.

Google found that an extra 500ms of latency dropped search traffic by 20%. Amazon found that 100ms of latency cost them 1% in sales.

o1 takes anywhere from 10 seconds to 60 seconds to generate a response.

In a dashboard, a chat interface, or a copilot, 60 seconds is an eternity. It breaks the "Flow State." If a developer asks Copilot for a function and has to wait 20 seconds, they will just write it themselves.

The "Offline" Pivot

This relegates Reasoning Models to a very specific niche: Async Batch Processing.

  • "Here is a 50-page legal contract. finding the loopholes. Email me when you're done."
  • "Here is a CSV with 1 million rows. Find the anomaly."
  • "Here is the entire repo. Write a migration plan."

These are high-value tasks. But they are rare. Most interactions with AI are "Micro-interactions"—summarize this email, write this SQL query, fix this typo. For those, o1 is effectively useless.
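
If you do ship one of these workloads, the pattern is deliberately boring: push the slow call into a background job and notify when it finishes. A minimal sketch, where notify_by_email is a placeholder for your own mailer or Slack webhook:

# Fire-and-forget wrapper for slow "reasoning" jobs: run the model call in
# a background thread and notify when it is done. notify_by_email is a
# placeholder; wire it to your own mailer, Slack webhook, etc.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)

def notify_by_email(subject, body):
    print(f"[email] {subject}: {body[:200]}")  # stand-in for a real mailer

def run_batch_task(prompt, slow_model_call):
    def job():
        result = slow_model_call(prompt)       # the 10-60 second o1 call
        notify_by_email("Your analysis is ready", result)
    executor.submit(job)                       # returns immediately; nobody waits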

Section 4: We Built Our Own "o1" for Free (And So Can You)

Here is the most cynical part of my analysis. You don't need o1.

We recently ran an experiment. We took Llama-3-70b (an open-source model you can run for cheap) and wrapped it in a Python script that implements the "Reflection Pattern."

The logic is simple:

# The reflection loop, wired to an OpenAI-compatible endpoint.
# Groq and Together AI both expose one for Llama-3-70b; the base_url
# and model name below are examples, so swap in whichever host you use.
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="...")
MODEL = "llama3-70b-8192"  # example model id; check your provider's catalog

def generate(prompt):
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def solve_with_reflection(problem):
    # Step 1: Draft
    solution = generate(f"Solve this: {problem}")

    # Step 2: Critique
    critique = generate(f"Find flaws in this solution: {solution}")

    # Step 3: Refine only if the critique actually found something
    if "no flaws" not in critique.lower():
        return generate(f"Fix the solution based on this critique: {critique}")
    return solution

The Results?

On the "HumanEval" coding benchmark, base Llama-3 scores around 75%. With this simple reflection loop, it jumps to 89%.

o1 scores around 92%.

So, for $0.00 (running locally) or pennies (running on Groq/Together AI), we achieved 97% of o1's performance. The "moat" that OpenAI claims to have is actually just a very well-tuned system prompt loop.

Section 5: The "Diminishing Returns" of Scale

This brings us to the existential crisis of the AI industry.

From GPT-2 to GPT-3, we saw emergent behavior. From GPT-3 to GPT-4, we saw reasoning. The curve was exponential.

But from GPT-4 to o1, the curve is flattening. We are no longer seeing "Smarter" models; we are seeing "More Persistent" models.

It turns out that Scaling Laws are hitting a wall. We are running out of internet data to train on. We are running out of GPU clusters large enough to train the next generation.

So, the labs have shifted to "Inference Scaling." If we can't make the model smarter, let's just make it think longer.

This is a valid strategy, but it has limits. If you ask a median human to solve a Quantum Physics problem, giving them 100 years to think about it won't help if they don't know the math. Similarly, if the model hallucinates facts, letting it hallucinate for longer doesn't produce truth.

In our $5 debug session, o1 didn't know the nuances of Next.js 14 Server Actions. No amount of "Thinking" would simulate that knowledge. It just hallucinated more convincing, more elaborate wrong answers.

Section 6: When Should You Actually Use o1?

I am not saying o1 is useless. It is a tool, like a scalpel. You don't use a scalpel to butter toast.

XQA only uses o1 for three specific tasks:

  1. One-Shot Architecture Design: "Design a DynamoDB schema for this detailed access pattern." (It's good at planning).
  2. Complex Regex Generation: (Because nobody knows Regex).
  3. Data Transformation Scripts: "Write a Python script to convert this messy JSON into this SQL schema." (It is meticulous).

For everything else—Coding, Writing, Chatting, Summarizing—we stick to Claude 3.5 Sonnet or GPT-4o.

Claude 3.5 Sonnet, in particular, feels "smarter" without the "Thinking" lag. It gets the React bug right on the first try, usually because its training data is fresher and its context window handling is superior, not because it "thought" about it.

Conclusion

The AI hype cycle demands a new revolution every 6 months. o1 is the latest offering to feed the beast.

But for practical engineering teams, "Reasoning" is often a distraction. We don't need a philosopher in the loop. We need a reliable, fast, cheap autocomplete that knows the docs.

Don't fall for the "Thinking" label. It's just a machine guessing. And sometimes, it guesses wrong—really, really slowly.

Tags: Technology, Tutorial, Guide

Written by XQA Team

Our team of experts delivers insights on technology, business, and design. We are dedicated to helping you build better products and scale your business.