Anthropic and OpenAI recently announced “fast mode” for their flagship coding models. Both promise dramatically faster inference, but they’re achieving it through radically different technical approaches—and the differences reveal a lot about the tradeoffs in modern AI infrastructure.
The Speed Comparison: 6x Difference
Anthropic’s Fast Mode (Opus 4.6):
- Speed: ~170 tokens/second (2.5x faster than standard 65 tokens/sec)
- Model: Real Opus 4.6 (full capability, no compromises)
- Cost: 6x more expensive
OpenAI’s Fast Mode (GPT-5.3-Codex-Spark):
- Speed: 1000+ tokens/second (15x faster than standard 65 tokens/sec)
- Model: Spark (distilled version, notably less capable)
- Cost: Standard pricing
Bottom line: OpenAI’s fast mode is 6x faster than Anthropic’s, but uses a different, less capable model. Anthropic serves their actual flagship model, just faster.
How Anthropic’s Fast Mode Works: Low-Batch-Size Inference
The Core Tradeoff: Batching vs. Speed
The heart of AI inference economics is batching, because the main bottleneck is memory, not compute.
The Problem:
- GPUs are extremely fast at computation
- But moving data onto the GPU is slow
- Every request’s prompt tokens must be copied onto the GPU before processing can start, and every decoding step streams the model’s weights through the compute units
- That memory traffic, not the math, is the expensive part
The Standard Solution: Batching
- Wait for multiple users’ requests
- Copy all their prompts onto GPU at once
- Process them together in a single batch
- Result: Higher overall throughput (the fixed memory cost is shared across requests), but slower individual requests
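A back-of-the-envelope model makes the tradeoff concrete. The constants below are invented (chosen so the per-user speeds roughly match the ~65 and ~170 tokens/sec figures above), not Anthropic’s real serving numbers:

```python
# Toy model of decode-time batching: each step pays a fixed memory cost plus a
# small per-sequence compute cost. All constants are illustrative assumptions.

WEIGHT_STREAM_MS = 5.7     # fixed cost per decode step (streaming weights from HBM)
PER_SEQ_COMPUTE_MS = 0.15  # incremental compute cost per sequence in the batch
ARRIVALS_PER_SEC = 40      # how quickly new user requests show up

def step_time_ms(batch_size: int) -> float:
    return WEIGHT_STREAM_MS + PER_SEQ_COMPUTE_MS * batch_size

def per_user_tok_s(batch_size: int) -> float:
    """Each user gets one token per step, so their speed is 1 / step time."""
    return 1000.0 / step_time_ms(batch_size)

def system_tok_s(batch_size: int) -> float:
    """The GPU as a whole emits batch_size tokens per step."""
    return batch_size * per_user_tok_s(batch_size)

def avg_queue_wait_ms(batch_size: int) -> float:
    """Average time spent waiting for the 'bus' to fill before work starts."""
    return (batch_size - 1) / ARRIVALS_PER_SEC * 1000.0 / 2

for bs in (1, 4, 16, 64):
    print(f"batch={bs:2d}  per-user={per_user_tok_s(bs):6.1f} tok/s  "
          f"system={system_tok_s(bs):7.1f} tok/s  wait={avg_queue_wait_ms(bs):6.1f} ms")
```

With these made-up numbers, batch 64 gives the GPU roughly 25x the total throughput of batch 1, which is exactly why providers normally run full buses.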
The Bus Analogy
Think of it like a bus system:
Standard Mode (High Batching):
- Bus waits until it’s full before departing
- Great for overall throughput (more people transported per hour)
- Bad for individual wait times (you wait for the bus to fill up)
Anthropic’s Fast Mode (Low Batching):
- Bus departs immediately when you get on
- Fast for you (zero wait time)
- Expensive (you’re paying for all the empty seats)
- Lower overall system throughput
Why This Makes Sense
Cost Analysis:
- 6x more expensive for 2.5x faster speed
- This ratio is exactly what you’d expect from low-batch-size inference
- You’re effectively paying for the other users who could have shared the GPU with you
No Model Changes Required:
- Same Opus 4.6 model
- Same quality outputs
- Just different scheduling/batching configuration
Important Caveat: The “waiting for the bus” cost is really only paid for the first token. For streaming responses, the main benefit is that smaller batches require fewer FLOPs per decoding step, so each step executes more quickly. Think “lighter buses drive faster.”
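The pricing side follows from the same picture: the GPU costs the same per hour whether the bus is full or nearly empty, so cost per token scales with how many tokens the whole batch produces. A rough sketch, assuming fast mode still batches a handful of requests (the batch size of 4 is a guess, not a disclosed number):

```python
# Rough cost-per-token comparison. Prices are normalized; the fast-mode batch
# size is an assumption, only the 65 and 170 tok/s figures come from the article.
GPU_HOUR_COST = 1.0                              # normalized price of one GPU-hour

standard = {"batch": 64, "user_tok_s": 65}
fast     = {"batch": 4,  "user_tok_s": 170}      # assumed much smaller batch

def cost_per_million_tokens(cfg: dict) -> float:
    system_tok_s = cfg["batch"] * cfg["user_tok_s"]   # tokens the GPU sells per second
    return GPU_HOUR_COST / (system_tok_s * 3600) * 1e6

ratio = cost_per_million_tokens(fast) / cost_per_million_tokens(standard)
print(f"fast mode cost per token: ~{ratio:.1f}x standard")   # ~6.1x with these numbers
```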
How OpenAI’s Fast Mode Works: Cerebras Giant Chips
The Hardware Solution
OpenAI’s approach is fundamentally different—they’re using special hardware from their Cerebras partnership announced in January 2025.
Standard GPU Architecture:
- H100 chip: ~1 square inch
- SRAM (fast on-chip memory): Tens of megabytes
- Most model weights stored in HBM (slower off-chip memory)
- Inference time spent streaming weights from HBM → SRAM → compute
Cerebras Chip Architecture:
- Size: 70 square inches (entire silicon wafer!)
- SRAM: 44GB (versus tens of megabytes on standard GPUs)
- Entire model fits in fast on-chip memory
- Zero time spent streaming weights from slow external memory
Why This Requires a New Model
The 44GB Constraint:
- 44GB can fit ~20B parameters at fp16 precision
- Or ~40B parameters at int8 quantization
- GPT-5.3-Codex is much larger (exact size unknown)
- Solution: Distill a smaller “Spark” model that fits
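The capacity arithmetic behind those parameter counts is simple division:

```python
# How many parameters fit in 44 GB of on-chip SRAM?
SRAM_BYTES = 44e9

for precision, bytes_per_param in [("fp16/bf16", 2), ("int8", 1)]:
    print(f"{precision}: ~{SRAM_BYTES / bytes_per_param / 1e9:.0f}B parameters")
# fp16/bf16: ~22B parameters
# int8:      ~44B parameters
# The ~20B / ~40B figures above leave headroom for KV cache and activations.
```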
Knowledge Distillation Process:
- Take large GPT-5.3-Codex (“teacher” model)
- Query it extensively across many tasks
- Train smaller “Spark” model (“student”) to mimic outputs
- Result: 20-40B param model that captures ~80-90% of capabilities
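OpenAI hasn’t published how Spark was trained, so the snippet below is only a schematic of the generic distillation recipe: tiny linear layers stand in for the real teacher and student, and the loss pulls the student’s output distribution toward the teacher’s.

```python
# Schematic knowledge-distillation step (stand-in models, not OpenAI's pipeline).
import torch
import torch.nn.functional as F

VOCAB, TEACHER_DIM, STUDENT_DIM = 1000, 512, 128
teacher = torch.nn.Linear(TEACHER_DIM, VOCAB)                    # frozen "large" model
student = torch.nn.Linear(STUDENT_DIM, VOCAB)                    # small model being trained
shrink = torch.nn.Linear(TEACHER_DIM, STUDENT_DIM, bias=False)   # adapter for shared inputs
opt = torch.optim.AdamW(list(student.parameters()) + list(shrink.parameters()), lr=1e-4)

for step in range(200):
    x = torch.randn(32, TEACHER_DIM)                   # stand-in for prompt representations
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x), dim=-1)  # soft targets from the teacher
    student_log_probs = F.log_softmax(student(shrink(x)), dim=-1)
    # KL divergence pulls the student's next-token distribution toward the teacher's
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```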
The 15x Speedup Explained
Why So Fast:
- Entire model lives in ultrafast SRAM (21 petabytes/sec bandwidth)
- Zero latency from memory transfers
- Compute can run at maximum throughput continuously
- No waiting for weights to stream from external memory
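A rough bandwidth calculation shows why. If each new token requires reading every weight once, single-stream decode speed is capped by memory bandwidth divided by model size (the ~20B-parameter fp16 model and the bandwidth figures below are round-number assumptions):

```python
# Memory-bandwidth ceiling on single-stream decoding, assuming ~20B params at fp16.
weight_bytes = 20e9 * 2        # ~20B parameters x 2 bytes each

hbm_bw  = 3.35e12              # H100-class HBM bandwidth, ~3.35 TB/s
sram_bw = 21e15                # Cerebras on-wafer SRAM bandwidth, ~21 PB/s

print(f"HBM-bound ceiling:  {hbm_bw / weight_bytes:,.0f} tokens/sec")   # ~84
print(f"SRAM-bound ceiling: {sram_bw / weight_bytes:,.0f} tokens/sec")  # ~525,000
# With weights in SRAM, the memory ceiling is so high that the practical limit
# becomes compute and everything around the model, not weight streaming.
```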
The Tradeoff:
- Model is genuinely less capable
- Gets confused on complex tasks
- Messes up tool calls that vanilla GPT-5.3-Codex handles perfectly
- “Small model smell” - feels like a distilled version
Key Differences: Side-by-Side Comparison
Technical Approach
| Aspect | Anthropic | OpenAI |
|---|---|---|
| Mechanism | Batch size optimization | Special hardware (Cerebras) |
| Model | Identical (Opus 4.6) | Different (distilled Spark) |
| Infrastructure | Standard GPU stack | Giant wafer-scale chips |
| Complexity | Simple (config change) | Complex (hardware + distillation) |
| Speed gain | 2.5x | 15x |
| Quality | 100% (same model) | ~80-90% (distilled) |
Performance Characteristics
Anthropic Fast Mode:
- ✅ Full model capability preserved
- ✅ Predictable quality (same as regular Opus)
- ✅ No new failure modes
- ❌ Only 2.5x faster
- ❌ 6x more expensive
- ⚠️ First-token latency may still be slow
OpenAI Fast Mode:
- ✅ Dramatically faster (15x)
- ✅ Standard pricing
- ✅ Low latency (fast enough for persistent WebSocket)
- ❌ Different model with reduced capabilities
- ❌ New failure modes (tool call confusion)
- ❌ “Small model smell”
Strategic Positioning
Anthropic’s Play:
- “We give you the real model, just faster”
- Premium pricing for premium speed
- No capability compromises
- Appeal to users who need reliability > speed
OpenAI’s Play:
- “We give you blazing speed for most tasks”
- Good enough for many use cases
- Demonstrates Cerebras partnership value
- Appeal to users who need speed > marginal capability
The Competitive Timeline (Author’s Theory)
January 2025:
- OpenAI announces Cerebras partnership
- Begins work on fitting a model onto Cerebras chips
Early February 2025:
- Anthropic learns OpenAI will announce fast inference soon
- Realizes they have no comparable hardware play
- Quickly implements low-batch-size inference (simple config change)
Mid-February 2025:
- Anthropic announces first (a few days before OpenAI is ready)
- OpenAI follows with Cerebras-backed Spark
- To non-technical observers, looks like OpenAI copied Anthropic
Author’s Take:
“I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI’s mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens.”
Impact Analysis: Is Fast Inference the Future?
The Author’s Skepticism
Core Argument:
“The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model.”
Why Speed Alone Doesn’t Win:
1. Mistake Handling Dominates User Time
- A 15x faster model that makes 20% more mistakes
- The user spends more time fixing errors than the speed saved
- Net productivity: negative (see the time-budget sketch after this list)
2. Real-World Testing: Cursor’s Experience
- Cursor released a fast, less-capable agent model
- Hype dropped significantly
- Fast models didn’t improve the actual user experience
- Claude Code (full capability) won despite being slower
3. The Quality Bar Is High
- AI coding tools need to be nearly perfect to be useful
- A 90% solution that requires 10% manual fixes is often worse than doing it manually
- Speed amplifies both capabilities AND mistakes
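A toy time budget illustrates the argument (every number below is an assumption, not a measurement):

```python
# Toy time budget: generation time saved vs. human time spent fixing mistakes.
TASKS = 100
GEN_MIN_FULL = 2.0                 # minutes of generation per task on the full model
GEN_MIN_FAST = GEN_MIN_FULL / 15   # 15x faster generation
FIX_MIN = 15.0                     # human minutes to notice and repair one bad result

def total_hours(gen_minutes: float, mistake_rate: float) -> float:
    return (TASKS * gen_minutes + TASKS * mistake_rate * FIX_MIN) / 60

print(f"full model, 10% mistakes: {total_hours(GEN_MIN_FULL, 0.10):.1f} h")  # ~5.8 h
print(f"fast model, 30% mistakes: {total_hours(GEN_MIN_FAST, 0.30):.1f} h")  # ~7.7 h
# The 15x speedup saves ~3 hours of waiting but adds ~5 hours of fixing.
```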
Potential Future Use Cases
1. Lower-Level Primitives
- Fast models for routine operations
- Full models for critical decisions
- Example: Claude Code already uses Haiku for some operations
- Spark could become OpenAI’s “Haiku equivalent”
2. Tiered Inference Architecture (a routing sketch follows at the end of this list)
- Fast model (Spark): simple tool calls, data formatting, routine code
- Full model (GPT-5.3-Codex): complex logic, architecture decisions, debugging
3. Latency-Critical Applications
- Real-time conversational AI
- Interactive coding assistants
- Live code completion
- Where 50-200ms matters (hence OpenAI’s WebSocket switch)
4. Cost-Optimized Workflows
- Use fast model for 80% of tasks
- Route hard tasks to full model
- Significant cost savings at marginal quality loss
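A minimal sketch of what that routing could look like, with the blended-cost arithmetic attached. The model identifiers, the routing heuristic, and the per-task prices are all made up for illustration:

```python
# Hypothetical tiered router plus blended-cost arithmetic (all values illustrative).
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    kind: str                                   # e.g. "format", "tool_call", "debug"

SIMPLE_KINDS = {"format", "tool_call", "rename"}
PRICE = {"fast-distilled": 0.2, "full-model": 1.0}   # normalized cost per task

def route(task: Task) -> str:
    """Send routine work to the fast distilled model, hard work to the full model."""
    if task.kind in SIMPLE_KINDS and len(task.prompt) < 4000:
        return "fast-distilled"
    return "full-model"

tasks = [Task("reformat this JSON blob", "format")] * 80 + \
        [Task("find the race condition in the scheduler", "debug")] * 20
blended = sum(PRICE[route(t)] for t in tasks) / len(tasks)
print(f"blended cost: {blended:.2f}x per task vs 1.00x all-full")
# -> 0.36x (~64% cheaper), if 80% of tasks really are routable to the fast tier
```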
Industry Impact
For Developers:
- New tradeoff space: speed vs capability vs cost
- Need to evaluate if tasks are “Spark-appropriate”
- Opportunity to build smarter routing systems
For AI Labs:
- Validates multiple approaches (no single winner yet)
- Anthropic: Software optimization path
- OpenAI: Hardware specialization path
- Both can coexist serving different use cases
For Infrastructure:
- Cerebras demonstrates viability of giant chips
- But the economics are still unclear (how much do those chips cost to run?)
- Model distillation becoming critical skill
- Batch size optimization now a competitive lever
The Bigger Picture: What This Reveals
Different Philosophies
Anthropic’s Approach:
- Preserve model quality at all costs
- Optimize infrastructure around existing models
- Premium pricing for premium experience
- Conservative, user-trust-focused
OpenAI’s Approach:
- Explore new hardware partnerships aggressively
- Willing to trade some quality for speed
- Bet on “good enough” being actually good enough
- Experimental, market-exploration-focused
Technical Innovation Paths
Software Optimization (Anthropic):
- Advantages: Fast to implement, works with existing hardware
- Disadvantages: Limited headroom (can’t go 15x faster)
- Best for: Incremental improvements, preserving quality
Hardware Specialization (OpenAI):
- Advantages: Dramatic performance gains possible
- Disadvantages: Requires new hardware, model changes, complexity
- Best for: Breakthrough performance, new use cases
Unanswered Questions
1. Economics:
- How much do Cerebras chips cost to run?
- Is 6x premium for Anthropic sustainable?
- Will distilled models cannibalize full model revenue?
2. Model Fit:
- What’s the largest model that fits on 44GB SRAM?
- Can future Cerebras chips handle 100B+ models?
- How much quality loss is acceptable for distillation?
3. Market Adoption:
- Do users actually prefer Spark over GPT-5.3-Codex?
- Is Anthropic’s premium pricing justified?
- What percentage of workloads are “fast-model appropriate”?
Key Takeaways
1. Two Valid Approaches
- Anthropic: Optimize scheduling (batch size)
- OpenAI: Optimize hardware (giant chips)
- Both deliver a “fast mode” through completely different methods
2. Speed vs Quality Tradeoff
- Anthropic: 2.5x faster, same quality, 6x cost
- OpenAI: 15x faster, reduced quality, standard cost
- No clear winner—depends on use case
3. Technical Complexity Varies
- Batch size tuning: Simple, fast to implement
- Cerebras integration: Complex, requires distillation
- Anthropic’s fast mode was likely a competitive response
4. Fast ≠ Better (Usually)
- Mistakes cost more time than speed saves
- Quality bar for AI coding tools is very high
- Fast models work as primitives, not replacements
5. Infrastructure Matters
- Hardware innovation (Cerebras) enables new capabilities
- Software optimization (batching) provides incremental gains
- Both will continue to evolve
6. The Real Innovation
- Not just “fast inference”
- But exploration of different tradeoff spaces
- Anthropic: premium quality at premium price
- OpenAI: good-enough quality at breakthrough speed
Recommendations
For Developers Choosing a Fast Mode:
✅ Use Anthropic Fast Mode when:
- Quality is critical (production systems)
- Budget allows premium pricing
- You need predictable behavior (same model)
- First-token latency isn’t critical
✅ Use OpenAI Fast Mode when:
- Speed is paramount (interactive tools)
- Tasks are relatively simple (within Spark’s capabilities)
- You can tolerate occasional mistakes
- Cost efficiency matters
For AI Infrastructure Teams:
- Monitor batch size impact on your inference stack
- Evaluate if giant chips (Cerebras) make economic sense
- Build model routing: fast models for simple tasks, full models for hard ones
- Measure mistake rates, not just speed
For the Industry:
- Fast inference is important but not transformational (yet)
- Quality > speed for most high-value applications
- Hardware specialization will continue (more Cerebras-like solutions)
- Distillation quality will determine viability of speed-optimized models
The Bottom Line
Anthropic’s and OpenAI’s “fast modes” represent two philosophically different approaches to the same problem:
- Anthropic: “How can we make our best model faster?” → Optimize infrastructure
- OpenAI: “How can we make a fast model?” → New hardware + distillation
OpenAI’s achievement is more technically impressive (Cerebras integration + distillation), but Anthropic’s has a cleaner value proposition (same quality, just faster).
The winner? Neither. Both approaches will coexist, serving different use cases. Fast distilled models will become lower-level primitives in AI systems, while full models remain the gold standard for complex tasks.
The real question isn’t “which is better?” but “when should I use which?”—and that answer depends entirely on your specific use case, quality requirements, and budget.
Speed is a feature. Quality is a requirement. The labs that figure out how to deliver both will win.
Source: Two different tricks for fast LLM inference by Sean Goedecke