Anthropic and OpenAI recently announced “fast mode” for their flagship coding models. Both promise dramatically faster inference, but they’re achieving it through radically different technical approaches—and the differences reveal a lot about the tradeoffs in modern AI infrastructure.
The Speed Comparison: 6x Difference
Anthropic’s Fast Mode (Opus 4.6):
- Speed: ~170 tokens/second (2.5x faster than standard 65 tokens/sec)
- Model: Real Opus 4.6 (full capability, no compromises)
- Cost: 6x more expensive
OpenAI’s Fast Mode (GPT-5.3-Codex-Spark):
- Speed: 1000+ tokens/second (15x faster than standard 65 tokens/sec)
- Model: Spark (distilled version, notably less capable)
- Cost: Standard pricing
Bottom line: OpenAI’s fast mode is 6x faster than Anthropic’s, but uses a different, less capable model. Anthropic serves their actual flagship model, just faster.
How Anthropic’s Fast Mode Works: Low-Batch-Size Inference
The Core Tradeoff: Batching vs. Speed
The heart of AI inference economics is batching, because the main bottleneck is memory, not compute.
The Problem:
- GPUs are extremely fast at computation
- But moving data onto the GPU is slow
- Every request’s prompt tokens must be copied onto the GPU before processing can start, and every decoding step streams the model’s weights through the compute units
- That memory traffic, not the math, is the expensive part
The Standard Solution: Batching
- Wait for multiple users’ requests
- Copy all their prompts onto GPU at once
- Process them together in a single batch
- Result: Higher overall throughput (the fixed memory cost is shared across requests), but slower individual requests
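A back-of-the-envelope model makes the tradeoff concrete. The constants below are invented (chosen so the per-user speeds roughly match the ~65 and ~170 tokens/sec figures above), not Anthropic’s real serving numbers:

```python
# Toy model of decode-time batching: each step pays a fixed memory cost plus a
# small per-sequence compute cost. All constants are illustrative assumptions.

WEIGHT_STREAM_MS = 5.7     # fixed cost per decode step (streaming weights from HBM)
PER_SEQ_COMPUTE_MS = 0.15  # incremental compute cost per sequence in the batch
ARRIVALS_PER_SEC = 40      # how quickly new user requests show up

def step_time_ms(batch_size: int) -> float:
    return WEIGHT_STREAM_MS + PER_SEQ_COMPUTE_MS * batch_size

def per_user_tok_s(batch_size: int) -> float:
    """Each user gets one token per step, so their speed is 1 / step time."""
    return 1000.0 / step_time_ms(batch_size)

def system_tok_s(batch_size: int) -> float:
    """The GPU as a whole emits batch_size tokens per step."""
    return batch_size * per_user_tok_s(batch_size)

def avg_queue_wait_ms(batch_size: int) -> float:
    """Average time spent waiting for the 'bus' to fill before work starts."""
    return (batch_size - 1) / ARRIVALS_PER_SEC * 1000.0 / 2

for bs in (1, 4, 16, 64):
    print(f"batch={bs:2d}  per-user={per_user_tok_s(bs):6.1f} tok/s  "
          f"system={system_tok_s(bs):7.1f} tok/s  wait={avg_queue_wait_ms(bs):6.1f} ms")
```

With these made-up numbers, batch 64 gives the GPU roughly 25x the total throughput of batch 1, which is exactly why providers normally run full buses.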
The Bus Analogy
Think of it like a bus system:
Standard Mode (High Batching):
- Bus waits until it’s full before departing
- Great for overall throughput (more people transported per hour)
- Bad for individual wait times (you wait for the bus to fill up)
Anthropic’s Fast Mode (Low Batching):
- Bus departs immediately when you get on
- Fast for you (zero wait time)
- Expensive (you’re paying for all the empty seats)
- Lower overall system throughput
Why This Makes Sense
Cost Analysis:
- 6x more expensive for 2.5x faster speed
- This ratio is exactly what you’d expect from low-batch-size inference
- You’re effectively paying for the other users who could have shared the GPU with you
No Model Changes Required:
- Same Opus 4.6 model
- Same quality outputs
- Just different scheduling/batching configuration
Important Caveat: The “waiting for the bus” cost is really only paid for the first token. For streaming responses, the main benefit is that smaller batches require fewer FLOPs per decoding step, so each step executes more quickly. Think “lighter buses drive faster.”
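The pricing side follows from the same picture: the GPU costs the same per hour whether the bus is full or nearly empty, so cost per token scales with how many tokens the whole batch produces. A rough sketch, assuming fast mode still batches a handful of requests (the batch size of 4 is a guess, not a disclosed number):

```python
# Rough cost-per-token comparison. Prices are normalized; the fast-mode batch
# size is an assumption, only the 65 and 170 tok/s figures come from the article.
GPU_HOUR_COST = 1.0                              # normalized price of one GPU-hour

standard = {"batch": 64, "user_tok_s": 65}
fast     = {"batch": 4,  "user_tok_s": 170}      # assumed much smaller batch

def cost_per_million_tokens(cfg: dict) -> float:
    system_tok_s = cfg["batch"] * cfg["user_tok_s"]   # tokens the GPU sells per second
    return GPU_HOUR_COST / (system_tok_s * 3600) * 1e6

ratio = cost_per_million_tokens(fast) / cost_per_million_tokens(standard)
print(f"fast mode cost per token: ~{ratio:.1f}x standard")   # ~6.1x with these numbers
```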
How OpenAI’s Fast Mode Works: Cerebras Giant Chips
The Hardware Solution
OpenAI’s approach is fundamentally different—they’re using special hardware from their Cerebras partnership announced in January 2025.
Standard GPU Architecture:
- H100 chip: ~1 square inch
- SRAM (fast on-chip memory): Tens of megabytes
- Most model weights stored in HBM (slower off-chip memory)
- Inference time spent streaming weights from HBM → SRAM → compute
Cerebras Chip Architecture:
- Size: 70 square inches (entire silicon wafer!)
- SRAM: 44GB (versus tens of megabytes on standard GPUs)
- Entire model fits in fast on-chip memory
- Zero time spent streaming weights from slow external memory
Why This Requires a New Model
The 44GB Constraint:
- 44GB can fit ~20B parameters at fp16 precision
- Or ~40B parameters at int8 quantization
- GPT-5.3-Codex is much larger (exact size unknown)
- Solution: Distill a smaller “Spark” model that fits
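The capacity arithmetic behind those parameter counts is simple division:

```python
# How many parameters fit in 44 GB of on-chip SRAM?
SRAM_BYTES = 44e9

for precision, bytes_per_param in [("fp16/bf16", 2), ("int8", 1)]:
    print(f"{precision}: ~{SRAM_BYTES / bytes_per_param / 1e9:.0f}B parameters")
# fp16/bf16: ~22B parameters
# int8:      ~44B parameters
# The ~20B / ~40B figures above leave headroom for KV cache and activations.
```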
Knowledge Distillation Process:
- Take large GPT-5.3-Codex (“teacher” model)
- Query it extensively across many tasks
- Train smaller “Spark” model (“student”) to mimic outputs
- Result: 20-40B param model that captures ~80-90% of capabilities
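OpenAI hasn’t published how Spark was trained, so the snippet below is only a schematic of the generic distillation recipe: tiny linear layers stand in for the real teacher and student, and the loss pulls the student’s output distribution toward the teacher’s.

```python
# Schematic knowledge-distillation step (stand-in models, not OpenAI's pipeline).
import torch
import torch.nn.functional as F

VOCAB, TEACHER_DIM, STUDENT_DIM = 1000, 512, 128
teacher = torch.nn.Linear(TEACHER_DIM, VOCAB)                    # frozen "large" model
student = torch.nn.Linear(STUDENT_DIM, VOCAB)                    # small model being trained
shrink = torch.nn.Linear(TEACHER_DIM, STUDENT_DIM, bias=False)   # adapter for shared inputs
opt = torch.optim.AdamW(list(student.parameters()) + list(shrink.parameters()), lr=1e-4)

for step in range(200):
    x = torch.randn(32, TEACHER_DIM)                   # stand-in for prompt representations
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x), dim=-1)  # soft targets from the teacher
    student_log_probs = F.log_softmax(student(shrink(x)), dim=-1)
    # KL divergence pulls the student's next-token distribution toward the teacher's
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```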
The 15x Speedup Explained
Why So Fast:
- Entire model lives in ultrafast SRAM (21 petabytes/sec bandwidth)
- Zero latency from memory transfers
- Compute can run at maximum throughput continuously
- No waiting for weights to stream from external memory
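A rough bandwidth calculation shows why. If each new token requires reading every weight once, single-stream decode speed is capped by memory bandwidth divided by model size (the ~20B-parameter fp16 model and the bandwidth figures below are round-number assumptions):

```python
# Memory-bandwidth ceiling on single-stream decoding, assuming ~20B params at fp16.
weight_bytes = 20e9 * 2        # ~20B parameters x 2 bytes each

hbm_bw  = 3.35e12              # H100-class HBM bandwidth, ~3.35 TB/s
sram_bw = 21e15                # Cerebras on-wafer SRAM bandwidth, ~21 PB/s

print(f"HBM-bound ceiling:  {hbm_bw / weight_bytes:,.0f} tokens/sec")   # ~84
print(f"SRAM-bound ceiling: {sram_bw / weight_bytes:,.0f} tokens/sec")  # ~525,000
# With weights in SRAM, the memory ceiling is so high that the practical limit
# becomes compute and everything around the model, not weight streaming.
```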
The Tradeoff:
- Model is genuinely less capable
- Gets confused on complex tasks
- Messes up tool calls that vanilla GPT-5.3-Codex handles perfectly
- “Small model smell” - feels like a distilled version
Key Differences: Side-by-Side Comparison
Technical Approach
| Aspect | Anthropic | OpenAI |
|---|---|---|
| Mechanism | Batch size optimization | Special hardware (Cerebras) |
| Model | Identical (Opus 4.6) | Different (distilled Spark) |
| Infrastructure | Standard GPU stack | Giant wafer-scale chips |
| Complexity | Simple (config change) | Complex (hardware + distillation) |
| Speed gain | 2.5x | 15x |
| Quality | 100% (same model) | ~80-90% (distilled) |
Performance Characteristics
Anthropic Fast Mode:
- ✅ Full model capability preserved
- ✅ Predictable quality (same as regular Opus)
- ✅ No new failure modes
- ❌ Only 2.5x faster
- ❌ 6x more expensive
- ⚠️ First-token latency may still be slow
OpenAI Fast Mode:
- ✅ Dramatically faster (15x)
- ✅ Standard pricing
- ✅ Low latency (fast enough for persistent WebSocket)
- ❌ Different model with reduced capabilities
- ❌ New failure modes (tool call confusion)
- ❌ “Small model smell”
Strategic Positioning
Anthropic’s Play:
- “We give you the real model, just faster”
- Premium pricing for premium speed
- No capability compromises
- Appeal to users who need reliability > speed
OpenAI’s Play:
- “We give you blazing speed for most tasks”
- Good enough for many use cases
- Demonstrates Cerebras partnership value
- Appeal to users who need speed > marginal capability
The Competitive Timeline (Author’s Theory)
January 2025:
- OpenAI announces Cerebras partnership
- Begins work on fitting a model onto Cerebras chips
Early February 2025:
- Anthropic learns OpenAI will announce fast inference soon
- Realizes they have no comparable hardware play
- Quickly implements low-batch-size inference (simple config change)
Mid-February 2025:
- Anthropic announces first (a few days before OpenAI is ready)
- OpenAI follows with Cerebras-backed Spark
- To non-technical observers, looks like OpenAI copied Anthropic
Author’s Take:
“I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI’s mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens.”
Impact Analysis: Is Fast Inference the Future?
The Author’s Skepticism
Core Argument:
“The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model.”
Why Speed Alone Doesn’t Win:
1. Mistake Handling Dominates User Time
- A 15x faster model that makes 20% more mistakes
- The user spends more time fixing errors than the speed saved
- Net productivity: negative (see the time-budget sketch after this list)
2. Real-World Testing: Cursor’s Experience
- Cursor released a fast, less-capable agent model
- Hype dropped significantly
- Fast models didn’t improve the actual user experience
- Claude Code (full capability) won despite being slower
3. The Quality Bar Is High
- AI coding tools need to be nearly perfect to be useful
- A 90% solution that requires 10% manual fixes is often worse than doing it manually
- Speed amplifies both capabilities AND mistakes
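A toy time budget illustrates the argument (every number below is an assumption, not a measurement):

```python
# Toy time budget: generation time saved vs. human time spent fixing mistakes.
TASKS = 100
GEN_MIN_FULL = 2.0                 # minutes of generation per task on the full model
GEN_MIN_FAST = GEN_MIN_FULL / 15   # 15x faster generation
FIX_MIN = 15.0                     # human minutes to notice and repair one bad result

def total_hours(gen_minutes: float, mistake_rate: float) -> float:
    return (TASKS * gen_minutes + TASKS * mistake_rate * FIX_MIN) / 60

print(f"full model, 10% mistakes: {total_hours(GEN_MIN_FULL, 0.10):.1f} h")  # ~5.8 h
print(f"fast model, 30% mistakes: {total_hours(GEN_MIN_FAST, 0.30):.1f} h")  # ~7.7 h
# The 15x speedup saves ~3 hours of waiting but adds ~5 hours of fixing.
```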
Potential Future Use Cases
1. Lower-Level Primitives
- Fast models for routine operations
- Full models for critical decisions
- Example: Claude Code already uses Haiku for some operations
- Spark could become OpenAI’s “Haiku equivalent”
2. Tiered Inference Architecture (a routing sketch follows at the end of this list)
- Fast model (Spark): simple tool calls, data formatting, routine code
- Full model (GPT-5.3-Codex): complex logic, architecture decisions, debugging
3. Latency-Critical Applications
- Real-time conversational AI
- Interactive coding assistants
- Live code completion
- Where 50-200ms matters (hence OpenAI’s WebSocket switch)
4. Cost-Optimized Workflows
- Use fast model for 80% of tasks
- Route hard tasks to full model
- Significant cost savings at marginal quality loss
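A minimal sketch of what that routing could look like, with the blended-cost arithmetic attached. The model identifiers, the routing heuristic, and the per-task prices are all made up for illustration:

```python
# Hypothetical tiered router plus blended-cost arithmetic (all values illustrative).
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    kind: str                                   # e.g. "format", "tool_call", "debug"

SIMPLE_KINDS = {"format", "tool_call", "rename"}
PRICE = {"fast-distilled": 0.2, "full-model": 1.0}   # normalized cost per task

def route(task: Task) -> str:
    """Send routine work to the fast distilled model, hard work to the full model."""
    if task.kind in SIMPLE_KINDS and len(task.prompt) < 4000:
        return "fast-distilled"
    return "full-model"

tasks = [Task("reformat this JSON blob", "format")] * 80 + \
        [Task("find the race condition in the scheduler", "debug")] * 20
blended = sum(PRICE[route(t)] for t in tasks) / len(tasks)
print(f"blended cost: {blended:.2f}x per task vs 1.00x all-full")
# -> 0.36x (~64% cheaper), if 80% of tasks really are routable to the fast tier
```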
Industry Impact
For Developers:
- New tradeoff space: speed vs capability vs cost
- Need to evaluate if tasks are “Spark-appropriate”
- Opportunity to build smarter routing systems
For AI Labs:
- Validates multiple approaches (no single winner yet)
- Anthropic: Software optimization path
- OpenAI: Hardware specialization path
- Both can coexist serving different use cases
For Infrastructure:
- Cerebras demonstrates viability of giant chips
- But the economics are still unclear (how much do those chips cost to run?)
- Model distillation becoming critical skill
- Batch size optimization now a competitive lever
The Bigger Picture: What This Reveals
Different Philosophies
Anthropic’s Approach:
- Preserve model quality at all costs
- Optimize infrastructure around existing models
- Premium pricing for premium experience
- Conservative, user-trust-focused
OpenAI’s Approach:
- Explore new hardware partnerships aggressively
- Willing to trade some quality for speed
- Bet on “good enough” being actually good enough
- Experimental, market-exploration-focused
Technical Innovation Paths
Software Optimization (Anthropic):
- Advantages: Fast to implement, works with existing hardware
- Disadvantages: Limited headroom (can’t go 15x faster)
- Best for: Incremental improvements, preserving quality
Hardware Specialization (OpenAI):
- Advantages: Dramatic performance gains possible
- Disadvantages: Requires new hardware, model changes, complexity
- Best for: Breakthrough performance, new use cases
Unanswered Questions
1. Economics:
- How much do Cerebras chips cost to run?
- Is 6x premium for Anthropic sustainable?
- Will distilled models cannibalize full model revenue?
2. Model Fit:
- What’s the largest model that fits on 44GB SRAM?
- Can future Cerebras chips handle 100B+ models?
- How much quality loss is acceptable for distillation?
3. Market Adoption:
- Do users actually prefer Spark over GPT-5.3-Codex?
- Is Anthropic’s premium pricing justified?
- What percentage of workloads are “fast-model appropriate”?
Key Takeaways
1. Two Valid Approaches
- Anthropic: Optimize scheduling (batch size)
- OpenAI: Optimize hardware (giant chips)
- Both deliver a “fast mode” through completely different methods
2. Speed vs Quality Tradeoff
- Anthropic: 2.5x faster, same quality, 6x cost
- OpenAI: 15x faster, reduced quality, standard cost
- No clear winner—depends on use case
3. Technical Complexity Varies
- Batch size tuning: Simple, fast to implement
- Cerebras integration: Complex, requires distillation
- Anthropic’s fast mode was likely a competitive response
4. Fast ≠ Better (Usually)
- Mistakes cost more time than speed saves
- Quality bar for AI coding tools is very high
- Fast models work as primitives, not replacements
5. Infrastructure Matters
- Hardware innovation (Cerebras) enables new capabilities
- Software optimization (batching) provides incremental gains
- Both will continue to evolve
6. The Real Innovation
- Not just “fast inference”
- But exploration of different tradeoff spaces
- Anthropic: premium quality at premium price
- OpenAI: good-enough quality at breakthrough speed
Recommendations
For Developers Choosing a Fast Mode:
✅ Use Anthropic Fast Mode when:
- Quality is critical (production systems)
- Budget allows premium pricing
- You need predictable behavior (same model)
- First-token latency isn’t critical
✅ Use OpenAI Fast Mode when:
- Speed is paramount (interactive tools)
- Tasks are relatively simple (within Spark’s capabilities)
- You can tolerate occasional mistakes
- Cost efficiency matters
For AI Infrastructure Teams:
- Monitor batch size impact on your inference stack
- Evaluate if giant chips (Cerebras) make economic sense
- Build model routing: fast models for simple tasks, full models for hard ones
- Measure mistake rates, not just speed
For the Industry:
- Fast inference is important but not transformational (yet)
- Quality > speed for most high-value applications
- Hardware specialization will continue (more Cerebras-like solutions)
- Distillation quality will determine viability of speed-optimized models
The Bottom Line
Anthropic’s and OpenAI’s “fast modes” represent two philosophically different approaches to the same problem:
- Anthropic: “How can we make our best model faster?” → Optimize infrastructure
- OpenAI: “How can we make a fast model?” → New hardware + distillation
OpenAI’s achievement is more technically impressive (Cerebras integration + distillation), but Anthropic’s has a cleaner value proposition (same quality, just faster).
The winner? Neither. Both approaches will coexist, serving different use cases. Fast distilled models will become lower-level primitives in AI systems, while full models remain the gold standard for complex tasks.
The real question isn’t “which is better?” but “when should I use which?”—and that answer depends entirely on your specific use case, quality requirements, and budget.
Speed is a feature. Quality is a requirement. The labs that figure out how to deliver both will win.
Source: Two different tricks for fast LLM inference by Sean Goedecke