
The Speed Wars: Two Radically Different Approaches to Fast LLM Inference

Cui · Feb 15, 2026 · 11 min read

Anthropic and OpenAI recently announced “fast mode” for their flagship coding models. Both promise dramatically faster inference, but they’re achieving it through radically different technical approaches—and the differences reveal a lot about the tradeoffs in modern AI infrastructure.

The Speed Comparison: 6x Difference

Anthropic’s Fast Mode (Opus 4.6):

  • Speed: ~170 tokens/second (2.5x faster than standard 65 tokens/sec)
  • Model: Real Opus 4.6 (full capability, no compromises)
  • Cost: 6x more expensive

OpenAI’s Fast Mode (GPT-5.3-Codex-Spark):

  • Speed: 1000+ tokens/second (15x faster than standard 65 tokens/sec)
  • Model: Spark (distilled version, notably less capable)
  • Cost: Standard pricing

Bottom line: OpenAI’s fast mode is 6x faster than Anthropic’s, but uses a different, less capable model. Anthropic serves their actual flagship model, just faster.


How Anthropic’s Fast Mode Works: Low-Batch-Size Inference

The Core Tradeoff: Batching vs. Speed

The heart of AI inference economics is batching, because the main bottleneck is memory, not compute.

The Problem:

  • GPUs are extremely fast at computation
  • But moving data into the GPU’s compute units is slow
  • Each decoding step has to stream the model’s weights out of GPU memory before processing can start
  • This memory transfer is expensive

The Standard Solution: Batching

  • Wait for multiple users’ requests
  • Copy all their prompts onto the GPU at once
  • Process them together in a single batch, so one pass over the weights serves every request
  • Result: Higher overall throughput, slower individual requests
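
To make the tradeoff concrete, here is a back-of-envelope model of a decode step that is bound by weight streaming. The weight size, bandwidth, and per-request compute figures are illustrative assumptions, not vendor numbers; the point is only the shape of the curve: small batches give each user more tokens per second but burn far more GPU time per token.

```python
# Back-of-envelope model of batched decoding, assuming each decode step must
# stream every model weight from HBM once and that this dominates the step time.
# All constants are illustrative assumptions, not vendor figures.

WEIGHT_BYTES = 140e9       # assume ~70B params at fp16 (~140 GB of weights)
HBM_BANDWIDTH = 3.35e12    # assume H100-class HBM, ~3.35 TB/s
PER_REQ_COMPUTE = 2e-3     # assume 2 ms of compute per request per decode step

def decode_step_seconds(batch_size: int) -> float:
    """One decode step: the weight stream is paid once and shared by the whole
    batch; compute grows with batch size and eventually dominates."""
    weight_stream = WEIGHT_BYTES / HBM_BANDWIDTH
    compute = batch_size * PER_REQ_COMPUTE
    return max(weight_stream, compute)

for batch in (1, 8, 32, 128):
    step = decode_step_seconds(batch)
    per_user_tps = 1 / step          # tokens/sec as seen by one user
    aggregate_tps = batch / step     # tokens/sec across the whole GPU
    gpu_s_per_token = step / batch   # GPU-time each user's token costs
    print(f"batch={batch:4d}  per-user={per_user_tps:6.1f} tok/s  "
          f"total={aggregate_tps:7.1f} tok/s  GPU-s/token={gpu_s_per_token:.4f}")
```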

The Bus Analogy

Think of it like a bus system:

Standard Mode (High Batching):

  • Bus waits until it’s full before departing
  • Great for overall throughput (more people transported per hour)
  • Bad for individual wait times (you wait for the bus to fill up)

Anthropic’s Fast Mode (Low Batching):

  • Bus departs immediately when you get on
  • Fast for you (zero wait time)
  • Expensive (you’re paying for all the empty seats)
  • Lower overall system throughput

Why This Makes Sense

Cost Analysis:

  • 6x more expensive for 2.5x faster speed
  • This ratio is exactly what you’d expect from low-batch-size inference
  • You’re effectively paying for the other users who could have shared the GPU with you
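
A quick consistency check on that ratio, assuming the per-token GPU cost is simply the decode-step time divided among the batch (the batch sizes here are hypothetical, chosen only to show how a ~6x premium can fall out of a much smaller batch):

```python
# Hypothetical batch sizes; cost per token is assumed to scale as 1/batch_size
# because each decode step's GPU time is split across the batch.
standard_batch = 24   # assumed typical shared-serving batch size
fast_batch = 4        # assumed low-batch "fast mode" batch size

premium = standard_batch / fast_batch
print(f"Expected price premium: ~{premium:.0f}x")   # -> ~6x
```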

No Model Changes Required:

  • Same Opus 4.6 model
  • Same quality outputs
  • Just different scheduling/batching configuration

Important Caveat: The “waiting for the bus” cost is really only paid for the first token. For streaming responses, the main performance impact is that smaller batches require fewer FLOPs per forward pass, so each decode step executes more quickly. Think “lighter buses drive faster.”


How OpenAI’s Fast Mode Works: Cerebras Giant Chips

The Hardware Solution

OpenAI’s approach is fundamentally different—they’re using special hardware from their Cerebras partnership announced in January 2025.

Standard GPU Architecture:

  • H100 chip: ~1 square inch
  • SRAM (fast on-chip memory): Tens of megabytes
  • Most model weights stored in HBM (slower off-chip memory)
  • Inference time spent streaming weights from HBM → SRAM → compute

Cerebras Chip Architecture:

  • Size: 70 square inches (entire silicon wafer!)
  • SRAM: 44GB (versus MB for standard GPUs)
  • Entire model fits in fast on-chip memory
  • Zero time spent streaming weights from slow external memory

Why This Requires a New Model

The 44GB Constraint:

  • 44GB can fit ~20B parameters at fp16 precision
  • Or ~40B parameters at int8 quantization
  • GPT-5.3-Codex is much larger (exact size unknown)
  • Solution: Distill a smaller “Spark” model that fits
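
The arithmetic behind those limits is just bytes per parameter against the 44GB of SRAM, ignoring the KV cache and activations (which is why the practical numbers above are a bit lower):

```python
# How many parameters fit in 44 GB of on-chip SRAM at different precisions?
SRAM_BYTES = 44e9

for precision, bytes_per_param in [("fp16", 2), ("int8", 1)]:
    max_params_billion = SRAM_BYTES / bytes_per_param / 1e9
    print(f"{precision}: ~{max_params_billion:.0f}B parameters")
# fp16: ~22B, int8: ~44B -- before reserving room for the KV cache and activations
```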

Knowledge Distillation Process:

  1. Take large GPT-5.3-Codex (“teacher” model)
  2. Query it extensively across many tasks
  3. Train smaller “Spark” model (“student”) to mimic outputs
  4. Result: 20-40B param model that captures ~80-90% of capabilities
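
A minimal sketch of the core training step behind logit-level distillation is below. The tiny stand-in models, vocabulary, and random “queries” are placeholders to keep it runnable; nothing here is OpenAI’s actual recipe, and production pipelines typically distill on real teacher-generated completions rather than random token ids.

```python
# Toy logit-level knowledge distillation: a small "student" is trained to match
# the output distribution of a larger "teacher". Models and data are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM_TEACHER, DIM_STUDENT, T = 1000, 512, 128, 2.0   # T = softmax temperature

class ToyLM(nn.Module):
    """Stand-in 'language model': context token ids -> next-token logits."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)
    def forward(self, ids):                              # ids: (batch, seq_len)
        return self.head(self.embed(ids).mean(dim=1))    # (batch, VOCAB)

teacher = ToyLM(DIM_TEACHER).eval()   # pretend this is the big, already-trained model
student = ToyLM(DIM_STUDENT)          # the much smaller model we actually want to serve
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    ids = torch.randint(0, VOCAB, (32, 16))              # step 2: query the teacher
    with torch.no_grad():
        teacher_logits = teacher(ids)
    student_logits = student(ids)
    # step 3: make the student mimic the teacher's (temperature-softened) outputs
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    opt.zero_grad()
    loss.backward()
    opt.step()
```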

The 15x Speedup Explained

Why So Fast:

  • Entire model lives in ultrafast SRAM (21 petabytes/sec bandwidth)
  • Zero latency from memory transfers
  • Compute can run at maximum throughput continuously
  • No waiting for weights to stream from external memory
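
For a sense of scale, here is the rough bandwidth arithmetic, assuming a ~30B-parameter int8 Spark (a guess within the 20-40B range above) and that each generated token needs one full pass over the weights:

```python
# Rough single-stream decode ceilings when memory bandwidth is the bottleneck
# and every generated token reads all weights once. Model size is a guess.
MODEL_BYTES = 30e9     # ~30B params at int8 (assumed)
HBM_BW = 3.35e12       # H100-class HBM, ~3.35 TB/s (assumed)
SRAM_BW = 21e15        # Cerebras on-chip SRAM, ~21 PB/s (figure from the post)

print(f"HBM-bound ceiling:  ~{HBM_BW / MODEL_BYTES:,.0f} tokens/sec")   # ~112
print(f"SRAM-bound ceiling: ~{SRAM_BW / MODEL_BYTES:,.0f} tokens/sec")  # ~700,000
# Compute, interconnect, and sampling overheads bind long before the SRAM ceiling,
# which is why the observed speed is ~1000 tok/s rather than hundreds of thousands.
```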

The Tradeoff:

  • Model is genuinely less capable
  • Gets confused on complex tasks
  • Messes up tool calls that vanilla GPT-5.3-Codex handles perfectly
  • “Small model smell” - feels like a distilled version

Key Differences: Side-by-Side Comparison

Technical Approach

| Aspect | Anthropic | OpenAI |
| --- | --- | --- |
| Mechanism | Batch size optimization | Special hardware (Cerebras) |
| Model | Identical (Opus 4.6) | Different (distilled Spark) |
| Infrastructure | Standard GPU stack | Giant wafer-scale chips |
| Complexity | Simple (config change) | Complex (hardware + distillation) |
| Speed gain | 2.5x | 15x |
| Quality | 100% (same model) | ~80-90% (distilled) |

Performance Characteristics

Anthropic Fast Mode:

  • ✅ Full model capability preserved
  • ✅ Predictable quality (same as regular Opus)
  • ✅ No new failure modes
  • ❌ Only 2.5x faster
  • ❌ 6x more expensive
  • ⚠️ First-token latency may still be slow

OpenAI Fast Mode:

  • ✅ Dramatically faster (15x)
  • ✅ Standard pricing
  • ✅ Low latency (fast enough for persistent WebSocket)
  • ❌ Different model with reduced capabilities
  • ❌ New failure modes (tool call confusion)
  • ❌ “Small model smell”

Strategic Positioning

Anthropic’s Play:

  • “We give you the real model, just faster”
  • Premium pricing for premium speed
  • No capability compromises
  • Appeal to users who need reliability > speed

OpenAI’s Play:

  • “We give you blazing speed for most tasks”
  • Good enough for many use cases
  • Demonstrates Cerebras partnership value
  • Appeal to users who need speed > marginal capability

The Competitive Timeline (Author’s Theory)

January 2025:

  • OpenAI announces Cerebras partnership
  • Begins work on fitting a model onto Cerebras chips

Early February 2025:

  • Anthropic learns OpenAI will announce fast inference soon
  • Realizes they have no comparable hardware play
  • Quickly implements low-batch-size inference (simple config change)

Mid-February 2025:

  • Anthropic announces first (a few days before OpenAI is ready)
  • OpenAI follows with Cerebras-backed Spark
  • To non-technical observers, looks like OpenAI copied Anthropic

Author’s Take:

“I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI’s mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens.”


Impact Analysis: Is Fast Inference the Future?

The Author’s Skepticism

Core Argument:

“The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model.”

Why Speed Alone Doesn’t Win:

  1. Mistake Handling Dominates User Time
    • 15x faster model that makes 20% more mistakes
    • User spends more time fixing errors than they saved from speed
    • Net productivity: negative
  2. Real-World Testing: Cursor’s Experience
    • Cursor released a fast, less-capable agent model
    • Hype dropped significantly
    • Fast models didn’t improve actual user experience
    • Claude Code (full capability) won despite being slower
  3. The Quality Bar Is High
    • AI coding tools need to be nearly perfect to be useful
    • A 90% solution that requires 10% manual fixes is often worse than doing it manually
    • Speed amplifies both capabilities AND mistakes
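
The argument is easy to put in numbers. A minimal sketch, with assumed generation times, error rates, and fix costs (none of these figures come from the post):

```python
# Worked version of the argument above: does a 15x faster model that makes
# more mistakes actually save the user time? All numbers are illustrative.

def time_per_completed_task(gen_minutes, error_rate, fix_minutes=10.0):
    """Expected user-facing minutes per task: wait for generation, plus the
    expected cost of cleaning up a mistake when one happens."""
    return gen_minutes + error_rate * fix_minutes

slow_full  = time_per_completed_task(gen_minutes=3.0, error_rate=0.10)  # 3.0 + 1.0 = 4.0
fast_small = time_per_completed_task(gen_minutes=0.2, error_rate=0.30)  # 0.2 + 3.0 = 3.2

print(f"full model: {slow_full:.1f} min/task")
print(f"fast model: {fast_small:.1f} min/task")
# With these numbers the fast model still wins; push fix_minutes to 30 (a subtle
# bug that reaches review) and the ordering flips -- the point is that fix time,
# not generation time, dominates the comparison.
```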

Potential Future Use Cases

1. Lower-Level Primitives

  • Fast models for routine operations
  • Full models for critical decisions
  • Example: Claude Code already uses Haiku for some operations
  • Spark could become OpenAI’s “Haiku equivalent”

2. Tiered Inference Architecture

  • Fast model (Spark): Simple tool calls, data formatting, routine code
  • Full model (GPT-5.3-Codex): Complex logic, architecture decisions, debugging
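
A minimal sketch of what that routing might look like in practice is below; the model names are placeholders and the keyword heuristic is only a stand-in (a real router might use a small classifier, or escalate when the fast model reports low confidence).

```python
# Minimal sketch of tiered routing: cheap/fast model for routine requests,
# full model for anything complex. Model names and heuristic are placeholders.
from dataclasses import dataclass

FAST_MODEL = "spark-fast"   # hypothetical fast tier
FULL_MODEL = "codex-full"   # hypothetical full-capability tier

HARD_SIGNALS = ("refactor", "architecture", "race condition", "debug", "design")

@dataclass
class Route:
    model: str
    reason: str

def route(prompt: str, n_files_touched: int = 1) -> Route:
    """Very rough complexity heuristic; production routers would be smarter."""
    if n_files_touched > 3 or any(s in prompt.lower() for s in HARD_SIGNALS):
        return Route(FULL_MODEL, "complex: multi-file or hard keywords")
    return Route(FAST_MODEL, "routine: single-file, simple request")

print(route("rename this variable and fix the imports"))
print(route("debug the race condition in the job scheduler", n_files_touched=5))
```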

3. Latency-Critical Applications

  • Real-time conversational AI
  • Interactive coding assistants
  • Live code completion
  • Where 50-200ms matters (hence OpenAI’s WebSocket switch)

4. Cost-Optimized Workflows

  • Use fast model for 80% of tasks
  • Route hard tasks to full model
  • Significant cost savings at marginal quality loss
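
Quick arithmetic on that split, with made-up relative prices just to show the shape of the savings:

```python
# The 80/20 routing idea in numbers. Prices are made-up placeholders, not list prices.
fast_price, full_price = 1.0, 8.0   # relative cost per task
fast_share = 0.80                   # fraction of tasks the fast model handles

blended = fast_share * fast_price + (1 - fast_share) * full_price
print(f"blended cost: {blended:.1f} vs {full_price:.1f} full-model-only "
      f"(~{(1 - blended / full_price) * 100:.0f}% cheaper)")   # ~70% cheaper
```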

Industry Impact

For Developers:

  • New tradeoff space: speed vs capability vs cost
  • Need to evaluate if tasks are “Spark-appropriate”
  • Opportunity to build smarter routing systems

For AI Labs:

  • Validates multiple approaches (no single winner yet)
  • Anthropic: Software optimization path
  • OpenAI: Hardware specialization path
  • Both can coexist serving different use cases

For Infrastructure:

  • Cerebras demonstrates viability of giant chips
  • But economics still unclear (how much do those chips cost?)
  • Model distillation becoming critical skill
  • Batch size optimization now a competitive lever

The Bigger Picture: What This Reveals

Different Philosophies

Anthropic’s Approach:

  • Preserve model quality at all costs
  • Optimize infrastructure around existing models
  • Premium pricing for premium experience
  • Conservative, user-trust-focused

OpenAI’s Approach:

  • Explore new hardware partnerships aggressively
  • Willing to trade some quality for speed
  • Bet on “good enough” being actually good enough
  • Experimental, market-exploration-focused

Technical Innovation Paths

Software Optimization (Anthropic):

  • Advantages: Fast to implement, works with existing hardware
  • Disadvantages: Limited headroom (can’t go 15x faster)
  • Best for: Incremental improvements, preserving quality

Hardware Specialization (OpenAI):

  • Advantages: Dramatic performance gains possible
  • Disadvantages: Requires new hardware, model changes, complexity
  • Best for: Breakthrough performance, new use cases

Unanswered Questions

1. Economics:

  • How much do Cerebras chips cost to run?
  • Is 6x premium for Anthropic sustainable?
  • Will distilled models cannibalize full model revenue?

2. Model Fit:

  • What’s the largest model that fits on 44GB SRAM?
  • Can future Cerebras chips handle 100B+ models?
  • How much quality loss is acceptable for distillation?

3. Market Adoption:

  • Do users actually prefer Spark over GPT-5.3-Codex?
  • Is Anthropic’s premium pricing justified?
  • What percentage of workloads are “fast-model appropriate”?

Key Takeaways

1. Two Valid Approaches

  • Anthropic: Optimize scheduling (batch size)
  • OpenAI: Optimize hardware (giant chips)
  • Both deliver “fast mode” through completely different methods

2. Speed vs Quality Tradeoff

  • Anthropic: 2.5x faster, same quality, 6x cost
  • OpenAI: 15x faster, reduced quality, standard cost
  • No clear winner—depends on use case

3. Technical Complexity Varies

  • Batch size tuning: Simple, fast to implement
  • Cerebras integration: Complex, requires distillation
  • Anthropic’s fast mode was likely a competitive response

4. Fast ≠ Better (Usually)

  • Mistakes cost more time than speed saves
  • Quality bar for AI coding tools is very high
  • Fast models work as primitives, not replacements

5. Infrastructure Matters

  • Hardware innovation (Cerebras) enables new capabilities
  • Software optimization (batching) provides incremental gains
  • Both will continue to evolve

6. The Real Innovation

  • Not just “fast inference”
  • But exploration of different tradeoff spaces
  • Anthropic: premium quality at premium price
  • OpenAI: good-enough quality at breakthrough speed

Recommendations

For Developers Choosing a Fast Mode:

Use Anthropic Fast Mode when:

  • Quality is critical (production systems)
  • Budget allows premium pricing
  • You need predictable behavior (same model)
  • First-token latency isn’t critical

Use OpenAI Fast Mode when:

  • Speed is paramount (interactive tools)
  • Tasks are relatively simple (Spark can handle)
  • You can tolerate occasional mistakes
  • Cost efficiency matters

For AI Infrastructure Teams:

  • Monitor batch size impact on your inference stack
  • Evaluate if giant chips (Cerebras) make economic sense
  • Build model routing: fast models for simple, full models for hard
  • Measure mistake rates, not just speed

For the Industry:

  • Fast inference is important but not transformational (yet)
  • Quality > speed for most high-value applications
  • Hardware specialization will continue (more Cerebras-like solutions)
  • Distillation quality will determine viability of speed-optimized models

The Bottom Line

Anthropic’s and OpenAI’s “fast modes” represent two philosophically different approaches to the same problem:

  • Anthropic: “How can we make our best model faster?” → Optimize infrastructure
  • OpenAI: “How can we make a fast model?” → New hardware + distillation

OpenAI’s achievement is more technically impressive (Cerebras integration + distillation), but Anthropic’s has a cleaner value proposition (same quality, just faster).

The winner? Neither. Both approaches will coexist, serving different use cases. Fast distilled models will become lower-level primitives in AI systems, while full models remain the gold standard for complex tasks.

The real question isn’t “which is better?” but “when should I use which?”—and that answer depends entirely on your specific use case, quality requirements, and budget.

Speed is a feature. Quality is a requirement. The labs that figure out how to deliver both will win.


Source: Two different tricks for fast LLM inference by Sean Goedecke
