
ByteDance Interviewer: Your SFT Accuracy is 94%? Then Why Can't Your Agent Complete 10 Steps?

Cui · May 14, 2026 · 23 min read

In 2026, if you still put “Proficient in LangChain/CrewAI” on your resume, the interviewer might not even blink.

What big tech companies are fighting over now are people who understand Agent Loop control and Harness architecture.

Yesterday I interviewed a candidate with 2 years of experience. His resume read: “Improved tool-calling accuracy to 94% via SFT.”

I asked him: “Is that 94% useful in production?”

He froze.

Because a 94% single-call accuracy rate cannot solve the 1% runaway risk in an Agent.


The Industry’s Biggest Illusion: Agent = LLM + Tool

Many teams have fallen into this trap: thinking that once you give an LLM access to an API, it becomes an Agent.

In reality, 90% of Agents on the market are essentially “advanced scripts”: user input → plan steps → call tools → return results → done.

That’s not an Agent. That’s a “disposable lighter.” No feedback, no iteration, no self-correction. If any step goes wrong, the entire chain collapses.

📊 Real Production Data Comparison:

We ran the standard DeepResearch benchmark internally using Qwen3-30B:

| Metric | Fake Agent (no Loop, no Harness) | Real Agent (with Loop and Harness) |
| --- | --- | --- |
| Single tool-call accuracy | 94% (very high) | 88% (slightly lower) |
| 10+ step task completion rate | 47% (dreadful) | 92% (production-ready) |
| Runaway rate (infinite loops / hallucinations) | 22% | <0.5% |

The conclusion is brutal: SFT solves the “steady hands” problem (Hand), while Agent architecture solves the “clear path” problem (Brain).

Bumping single-call accuracy from 94% to 97% might yield a 3% improvement in task completion. But adding a complete Loop and Harness can directly double the task completion rate.

That’s why big tech companies today would rather hire an ordinary engineer who understands architecture than an algorithm specialist who only knows how to game SFT metrics.


Part I — Tao: Agent Is Not a Feature, It’s a Whole New Software Paradigm

Many people’s understanding of Agent is still stuck at “LLM plus tools.”

That’s a fundamental cognitive mistake.

Agent is not a feature of an LLM, nor is it a new API. It is the third entirely new software paradigm, following procedural programming and object-oriented programming.

Three Revolutions in Software Paradigms

  • Procedural: Tell the computer how to do it (How) — the programmer hardcodes every step of logic
  • Object-oriented: Tell the computer what to do (What) — the programmer defines objects and methods
  • Agent-oriented: Tell the computer what you want (Goal) — the programmer only defines goals and boundaries

This is a fundamental leap.

In traditional software, all logic is hardcoded upfront. The program only executes the code you wrote and won’t do anything unexpected.

In Agent software, the programmer no longer writes concrete execution logic. Instead, they define goals, rules, and boundaries. How those goals are achieved is up to the Agent itself.

This is why all traditional software engineering methods fail when applied to Agents:

  • You can’t test an Agent the way you test traditional software, because its execution path is different every time
  • You can’t debug an Agent the way you debug traditional software, because its errors are probabilistic and non-reproducible
  • You can’t operate an Agent the way you operate traditional software, because it generates its own “dynamic noise”

Four Stages of Industry Development

Over the past three years, the entire Agent industry has moved through four clear stages:

| Stage | Year | Core Keyword | Industry Collective Illusion | Final Truth |
| --- | --- | --- | --- | --- |
| Stage 1 | 2023 | Prompt Engineering | Writing good prompts solves everything | Prompts can't handle complex tasks |
| Stage 2 | 2024 | Tool Calling | High enough tool-calling accuracy makes an Agent usable | High accuracy still can't solve long-chain problems |
| Stage 3 | 2025 | Agent Loop | A closed loop makes a real Agent | An uncontrolled loop is just an infinite loop |
| Stage 4 | 2026 | Agent Harness | A strong enough model is enough to deploy | Uncontrolled Agents are all time bombs |

We are now at the beginning of Stage 4.

In 2026, the competition in Agents has completely shifted from the model layer to the system layer.

Whoever can build the most stable, most controllable, and most reliable Agent runtime system will win in this revolution.


Part II — Method: The Three-Layer Golden Model for Agent Optimization

After two years of production practice, we’ve distilled a general Agent optimization methodology — we call it the “Three-Layer Golden Model.”

Any production-grade Agent must include all three layers. None can be missing.

┌─────────────────────────────────┐
│ Control Layer: Agent Harness    │
│ Goal: Prevent runaway, ensure   │
│ safety, enable governance       │
├─────────────────────────────────┤
│ Architecture Layer: Cognitive   │
│ Loop                            │
│ Goal: Raise ceiling, enable     │
│ correction, sustain operation   │
├─────────────────────────────────┤
│ Model Layer: SFT / RL / RAG     │
│ Goal: Raise floor, improve      │
│ precision, reduce hallucinations│
└─────────────────────────────────┘

Layer 1: Model Layer (Hand) — Raise the Floor

The model layer is the foundation of the entire Agent. It handles the “steady hands” problem.

Its core goal: push single tool-call accuracy above the industry baseline.

Current industry consensus:

  • Single tool-call accuracy ≥ 94%: minimum threshold for production use
  • Single tool-call accuracy ≥ 97%: noticeably better experience
  • Single tool-call accuracy ≥ 99%: approaching human-level performance

But there’s a crucial insight here: model-layer optimization has a ceiling, and the marginal cost is extremely high.

No matter how much effort you pour into SFT and RL, you’ll never push single tool-call accuracy to 100%. And the cost of going from 94% to 97% is more than 10× the cost of going from 80% to 94%.

More importantly, model-layer optimization cannot solve the task completion rate problem in long-chain tasks.

That’s the brutal truth revealed by our test data:

A 94% single tool-call accuracy rate only yields a 47% completion rate for 10-step tasks.
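
The arithmetic behind that gap is simple compounding: with per-step success probability p and no correction mechanism, an n-step chain succeeds with probability p^n. A quick back-of-the-envelope check (an idealized independence model, not our benchmark harness):

# Naive failure compounding: no reflection, no retries
p_step = 0.94  # single tool-call accuracy
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps -> {p_step ** n:.0%} completion")
# Output:
#  1 steps -> 94% completion
#  5 steps -> 73% completion
# 10 steps -> 54% completion
# 20 steps -> 29% completion

The independence model predicts roughly 54% at 10 steps; the observed 47% is a bit worse because real errors correlate: one bad step pollutes the context for every step after it.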

Layer 2: Architecture Layer (Brain) — Raise the Ceiling

The architecture layer is the core of the entire Agent. It handles the “clear path” problem.

Its core goal: even when the model occasionally makes a mistake, the overall system can still ultimately achieve its goal.

A qualified architecture layer must include three core mechanisms:

  • ReAct Loop: Reasoning and action alternate — take one step, see what happens
  • Reflexion: Evaluate results after each action, promptly correct errors
  • Memory: Store all historical information, avoid repeating mistakes

The power of the architecture layer lies in using the determinism of the system to offset the probabilistic nature of the model.

Even if the model has a 12% error rate at every step, through constant reflection and correction, the system’s final completion rate can still reach 90% or higher.
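
The same back-of-the-envelope model shows why. Give each step up to k retries driven by reflection, and assume (idealistically) that the reflection check itself is reliable:

# Idealized reflection model: a step succeeds if any of its k+1 attempts does
p_fail = 0.12
for k in (0, 1, 2):
    p_step = 1 - p_fail ** (k + 1)
    print(f"k={k} retries: step {p_step:.1%}, 10-step task {p_step ** 10:.1%}")
# Output:
# k=0 retries: step 88.0%, 10-step task 27.9%
# k=1 retries: step 98.6%, 10-step task 86.5%
# k=2 retries: step 99.8%, 10-step task 98.3%

Two retries per step take a model that would finish barely a quarter of 10-step tasks to near-perfect completion. That is the mechanism behind the "88% accuracy beats 94% accuracy" result in the table at the top of this article.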

Layer 3: Control Layer (Guardian) — Prevent Disasters

The control layer is the last line of defense for the entire Agent. It handles the “will things go wrong?” problem.

Its core goal: even if the model goes completely off the rails, the overall system will not cause irreversible damage.

Many people ask: if we already have the architecture layer’s reflection mechanism, why do we need a control layer?

Because the reflection mechanism is itself executed by the model — and the model can make mistakes too.

When a model falls into severe hallucinations or an infinite loop, it doesn’t even realize it’s making errors. At that point, you need a deterministic control system that operates independently of the model — one that can forcibly terminate erroneous behavior.

The model layer solves whether the system is “good to use.” The architecture layer solves whether the system “can be used.” The control layer solves whether you “dare to use” it.

That’s why we say:

Use SFT to raise the floor. Use Agent Loop to raise the ceiling. Use Harness to prevent disasters.


Part III — Technique: Engineering Guide for Production-Grade Agents

With the methodology covered, let’s look at concrete engineering implementation.

The code below reflects patterns validated in our production environment. A few project-specific helpers (cost estimation, embeddings) are elided, but you can adapt it directly into your projects.

1. Architecture Comparison: Disposable Lighter vs. Precision Machine Tool

First, let’s use code to clarify the fundamental difference between the two:

Fake Agent Architecture (Linear Pipeline)

# ============================================
# Fake Agent: Disposable lighter
# Traits: Linear execution, no feedback, no control
# Problem: Any step fails = entire flow crashes
# ============================================
def fake_agent(user_input):
    # One-shot planning — wrong once, wrong entirely
    plan = llm.make_plan(user_input)

    # Execute with no feedback — never looks back
    result = execute_plan(plan)

    # Return directly, regardless of correctness
    return result

Real Agent Architecture (Cognitive Loop)

# ============================================
# Real Agent: Precision machine tool
# Traits: Closed-loop execution, reflection, control
# Advantages: Can correct, recover, prevent runaway
# ============================================
def real_agent(user_input):
    context = AgentContext(user_input)

    while not context.is_done:
        # 1. Perceive & compress: prevent token explosion
        compressed_ctx = compress_context(context)

        # 2. Reason & plan: ReAct paradigm
        thought = llm.reason(compressed_ctx)

        # 3. Act & validate: parse structured output
        action = parse_action(thought)

        # 4. Harness interception: last line of defense in production
        if not harness.approve(action):
            action = harness.correct(action)

        # 5. Execute & reflect: update state, evaluate result
        observation = execute(action)
        context.reflect(observation)

    return context.final_answer

Core differences:

  • The fake Agent uses return — done after execution
  • The real Agent uses a while loop — runs until task is complete
  • The real Agent has two extra core components: harness and reflect

2. Full Production Agent Executor Implementation

This is the core code for the Agent executor currently in use in our production environment:

from typing import Any, Dict, List, Optional, Tuple
from dataclasses import dataclass, field
import json
import re

import numpy as np

@dataclass
class AgentStep:
    action: Dict[str, Any]
    observation: str
    thought: str
    confidence: float = 0.8  # Reasonable default; avoid null value issues

@dataclass
class AgentContext:
    user_query: str
    steps: List[AgentStep] = field(default_factory=list)
    is_done: bool = False
    final_answer: Optional[str] = None
    estimated_cost: float = 0.0

    def add_step(self, step: AgentStep):
        self.steps.append(step)
        # calculate_cost: project-specific estimator (token usage + tool fees), not shown here
        self.estimated_cost += calculate_cost(step)

    def reflect(self, observation: str):
        # Basic reflection logic: check if goal is reached, whether to adjust direction
        if "task complete" in observation or "answer found" in observation:
            self.is_done = True
            self.final_answer = observation
        elif "no relevant information found" in observation and len(self.steps) > 3:
            self.is_done = True
            self.final_answer = "Could not find enough information to answer your question. Please try adjusting your query."

class ProductionAgentExecutor:
    def __init__(self, model, tools, harness, max_steps: int = 20):
        self.llm = model
        self.tools = tools
        self.harness = harness
        self.MAX_STEPS = max_steps

    def run(self, user_query: str) -> str:
        context = AgentContext(user_query)

        while not context.is_done:
            # 1. Context compression: 90% of long-chain failures stem from context rot
            compressed_ctx = self._compress_context(context)

            # 2. Generate thought and action: ReAct prompt template
            prompt = self._build_react_prompt(compressed_ctx)
            response = self.llm.generate(prompt, temperature=0.1)
            thought, action = self._parse_response(response)

            # 3. Safely obtain confidence: compatible with all model APIs
            # Addresses: inconsistent logprobs format across OpenAI/Anthropic/open-source models
            confidence = self._safe_get_confidence(response)

            # 4. Harness safety check: don't go to production without this
            harness_result = self.harness.check(action, context, confidence)
            if harness_result.blocked:
                if harness_result.terminate:
                    return harness_result.termination_message
                action = harness_result.corrected_action

            # 5. Execute tool call
            observation = self.tools.execute(action)

            # 6. Record step and reflect
            step = AgentStep(
                action=action,
                observation=observation,
                thought=thought,
                confidence=confidence
            )
            context.add_step(step)
            context.reflect(observation)

            # 7. Circuit breaker: the last safety net
            if len(context.steps) > self.MAX_STEPS:
                return f"Task failed: exceeded max steps ({self.MAX_STEPS}). Completed {len(context.steps)} steps without reaching the goal."

        return context.final_answer

    def _safe_get_confidence(self, response: Any) -> float:
        """
        Safely retrieve model output confidence, compatible with all mainstream model APIs.
        Handles:
        1. Model doesn't return logprobs
        2. Different logprobs field names (logprobs/token_logprobs)
        3. Empty logprobs arrays
        4. Any exception
        """
        try:
            # Compatible with OpenAI format
            logprobs = getattr(response, 'logprobs', None)
            if logprobs is None:
                return 0.8

            # Completions-style APIs expose a flat token_logprobs list
            token_logprobs = getattr(logprobs, 'token_logprobs', [])
            if not token_logprobs:
                # Chat-style APIs nest per-token entries under `content`
                token_logprobs = getattr(logprobs, 'content', [])

            if token_logprobs:
                # Entries may be raw floats or objects carrying a .logprob attribute
                values = [getattr(lp, 'logprob', lp) for lp in token_logprobs if lp is not None]
                if values:
                    # Average log-probability, mapped back to a 0-1 confidence
                    return float(np.exp(np.mean(values)))

            return 0.8
        except Exception:
            # Any exception returns default confidence — ensures code doesn't crash
            return 0.8

    def _compress_context(self, context: AgentContext) -> str:
        # Keep the last 5 steps verbatim; summarize earlier ones
        # (the _format_context / _summarize_old_steps / _format_steps helpers are project-specific and omitted)
        if len(context.steps) <= 5:
            return self._format_context(context)

        recent_steps = context.steps[-5:]
        old_steps = context.steps[:-5]
        summary = self._summarize_old_steps(old_steps)

        return f"""Historical summary: {summary}
Recent execution steps:
{self._format_steps(recent_steps)}
Original user query: {context.user_query}
"""

    def _build_react_prompt(self, compressed_ctx: str) -> str:
        return f"""You are a professional AI assistant. Please think and act strictly in the following format:
Thought: [Your analysis of the current situation and the reason for your next action]
Action: [The tool and parameters you want to call, must be strict JSON format]
Available tools:
{self.tools.get_descriptions()}
Current context:
{compressed_ctx}
Begin:"""

    def _parse_response(self, response: str) -> Tuple[str, Dict[str, Any]]:
        # Robust response parsing; tolerates common model formatting errors

        thought_match = re.search(r"Thought:(.*?)(?=\nAction:|$)", response, re.DOTALL)
        action_match = re.search(r"Action:(.*?)$", response, re.DOTALL)

        thought = thought_match.group(1).strip() if thought_match else "Thinking about the next step"
        action_json = action_match.group(1).strip() if action_match else "{}"

        try:
            action = json.loads(action_json)
            # Validate tool exists
            if action.get("name") not in self.tools.get_names():
                raise ValueError(f"Unknown tool: {action.get('name')}")
            return thought, action
        except Exception:
            return thought, {
                "name": "respond",
                "arguments": {"content": "I encountered some issues and cannot continue. Please restate your request."}
            }
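
Wiring the executor up looks something like the sketch below. QwenClient, ToolRegistry, Harness, and the three tool objects are placeholders for whatever your stack provides, not a specific library's API:

# Hypothetical wiring: substitute your own model client, tool registry, and harness
llm = QwenClient(model="qwen3-30b", endpoint="http://localhost:8000/v1")
tools = ToolRegistry([search_tool, read_file_tool, respond_tool])
harness = Harness(max_budget=1.0, confidence_threshold=0.7)

agent = ProductionAgentExecutor(model=llm, tools=tools, harness=harness, max_steps=20)
print(agent.run("Find the three most recent papers on agent harnesses and summarize each"))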

3. Three Core Firewalls of Agent Harness

The Harness is the most important component of the entire system — and the one 90% of teams overlook.
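
All three firewalls in this section, as well as the executor's harness.check call above, pass around a small result object whose definition the post leaves implicit. A minimal sketch of it:

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class HarnessResult:
    blocked: bool                                       # intercept this action?
    terminate: bool = False                             # hard-stop the entire task?
    corrected_action: Optional[Dict[str, Any]] = None   # replacement action, if blocked
    termination_message: str = ""                       # message returned on a hard stop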

A qualified Harness must include at least the following three firewalls:

Firewall 1: Semantic Loop Detection

def detect_semantic_loop(steps: List[AgentStep], threshold: float = 0.92) -> bool:
    """Detect whether the Agent has fallen into a semantic loop (doing the same thing repeatedly)"""
    if len(steps) < 3:
        return False

    # Cosine similarity of the last 3 actions
    # (get_embedding: any sentence-embedding function your stack provides, not shown here)
    embeddings = [get_embedding(str(step.action)) for step in steps[-3:]]

    sim1 = np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
    sim2 = np.dot(embeddings[1], embeddings[2]) / (np.linalg.norm(embeddings[1]) * np.linalg.norm(embeddings[2]))

    return sim1 > threshold and sim2 > threshold

Firewall 2: Tool Budget Circuit Breaker

def check_budget(context: AgentContext, max_budget: float = 1.0) -> HarnessResult:
    """Check if the per-task budget limit has been exceeded; prevent abusive calls"""
    if context.estimated_cost > max_budget:
        return HarnessResult(
            blocked=True,
            terminate=True,
            termination_message=f"Task terminated: estimated cost ${context.estimated_cost:.2f} exceeds the per-task limit of ${max_budget}. You can split the task and try again."
        )
    return HarnessResult(blocked=False)

Firewall 3: Confidence Threshold

def check_confidence(step: AgentStep, threshold: float = 0.7) -> HarnessResult:
    """Check the model's confidence in the current action; route to human confirmation if uncertain"""
    if step.confidence < threshold:
        return HarnessResult(
            blocked=True,
            corrected_action={
                "name": "respond",
                "arguments": {
                    "content": "I'm not sure what to do next. Would you like me to:\n1. Search for more information\n2. Try a different approach\n3. End the task and return the current results"
                }
            }
        )
    return HarnessResult(blocked=False)
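
How do the three firewalls combine into the harness.check(action, context, confidence) call the executor makes? One possible composition (a sketch under the assumptions above, not our exact production class) runs the cheap deterministic checks first and the embedding-based one last, matching the constructor used in the wiring sketch earlier:

class Harness:
    """Runs each firewall in order; the first one that blocks wins."""

    def __init__(self, max_budget: float = 1.0, confidence_threshold: float = 0.7):
        self.max_budget = max_budget
        self.confidence_threshold = confidence_threshold

    def check(self, action, context, confidence) -> HarnessResult:
        # 1. Budget breaker: deterministic and free, so it goes first
        result = check_budget(context, self.max_budget)
        if result.blocked:
            return result

        # 2. Semantic loop detection: embedding calls cost latency and money
        if detect_semantic_loop(context.steps):
            return HarnessResult(
                blocked=True,
                terminate=True,
                termination_message="Task terminated: the Agent repeated a semantically identical action 3 times in a row."
            )

        # 3. Confidence gate on the action the model just proposed
        candidate = AgentStep(action=action, observation="", thought="", confidence=confidence)
        return check_confidence(candidate, self.confidence_threshold)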

4. Common Fault Troubleshooting Guide

Finally, here’s the Agent fault troubleshooting guide we use internally:

| Symptom | Root Cause | Priority | Solution |
| --- | --- | --- | --- |
| Infinite loop | Model cannot recover from an error | P0 | Add semantic loop detection; force a change of action direction |
| Context rot | Irrelevant info fills the context window | P0 | Implement smart context compression; retain only key info |
| Goal drift | Model forgets the original objective | P1 | Re-inject the original goal every 3 steps |
| Tool abuse | Model calls the same tool repeatedly | P1 | Add per-tool call frequency limits |
| Hallucinated output | Model fabricates non-existent information | P2 | Add output verification; require the model to provide evidence |
| Silent failure | Tool returns an error but the model pretends success | P2 | Validate tool return values in the Harness layer |
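
Most of these fixes are only a few lines each. Goal drift, for instance, can be handled by post-processing the output of _build_react_prompt before each LLM call. A sketch, reusing the AgentContext from section 2 and the every-3-steps cadence from the table (reinject_goal is an illustrative helper, not from the original code):

def reinject_goal(prompt: str, context: AgentContext) -> str:
    """Anti-goal-drift: restate the original objective every 3 steps."""
    if context.steps and len(context.steps) % 3 == 0:
        prompt += f'\nReminder: your original goal is "{context.user_query}". Do not drift from it.\n'
    return prompt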

I want to share my take on where the Agent industry is heading in 2026.

Trend 1: SFT Will Gradually Become Infrastructure

Over the next 6 months, tool-calling SFT will become increasingly standardized. More and more open-source models will natively support tool-calling accuracy above 94%.

Spending a lot of effort building SFT in-house will become less and less worthwhile.

Trend 2: Competition Will Completely Shift to Architecture and Control Layers

When all models’ tool-calling accuracy reaches 94% or higher, the model itself becomes a commodity.

Real competition will shift to the systems-engineering concerns: architecture design, runtime control, observability, and governance.

Trend 3: Enterprise-Grade Agents Will Become Mainstream

In 2026, more and more enterprises will start deploying Agents into production environments — to replace repetitive cognitive labor.

And what enterprises care about most is never how smart the Agent is. It’s how reliable, how safe, and how controllable it is.

That’s exactly why Agent Harness will become the hottest technical topic this year.


Closing Thoughts

Many people say 2026 is the year Agent deployment takes off.

But I’d say: 2026 is actually the year Agent engineering comes of age.

Over the past three years, we’ve been exploring “what can Agents do?”

Starting this year, the question we need to solve is: “How can Agents do things stably, safely, and reliably?”

This is a leap from zero to one. A leap from toy to tool.

In this journey, the most important skill is not the ability to write prompts, or to fine-tune models — it’s systems engineering.

You need to learn how to use the determinism of a system to offset the probabilistic nature of a model. How to use the power of architecture to compensate for the model's limitations. How to use control to guard against unknown risks.

A closing thought for every Agent engineer:

The model is the brain. The architecture is the skeleton. The Harness is the immune system. Without a brain, a system cannot think. Without a skeleton, it cannot stand. Without an immune system, it is a liability waiting to happen.


Originally published in Chinese by 大模型从业者 on WeChat: 字节面试官:你的 SFT 准确率 94%?那你的 Agent 为什么跑不通 10 步
