ai, research, llm

LLMs Corrupt Your Documents When You Delegate: What the DELEGATE-52 Benchmark Reveals

Cui · May 11, 2026 · 6 min read

A new paper just landed that every engineer building AI-assisted workflows should read. The DELEGATE-52 benchmark reveals something uncomfortable: when you hand a long-form task to a frontier LLM — editing a document, maintaining a codebase, refining a report — the model silently corrupts roughly 25% of your content by the time the workflow ends.

Even the best models we have right now. Even Claude 4.6 Opus. Even GPT 5.4. Even Gemini 3.1 Pro.

This isn’t an edge case. It’s a systematic failure mode, and it compounds over time.

What Is DELEGATE-52?

DELEGATE-52 is a benchmark designed by Philippe Laban and collaborators (the paper is on arXiv) to test LLMs in delegated workflows — long multi-step tasks where you hand off document editing to a model and expect it to faithfully execute without introducing its own errors.

The name comes from the 52 professional domains they tested: coding, crystallography, music notation, legal drafting, medical reports, financial models, and more. Each domain requires sustained document work across multiple interactions — the kind of workflow that “vibe coding” and AI copilots are increasingly used for.

The premise the paper puts to the test: delegation requires trust. When you ask a model to handle something, you’re trusting it to do the task and only the task — not to subtly reshape content in ways you didn’t ask for.

The Core Finding: 25% Corruption at Scale

The experiment ran 19 LLMs through DELEGATE-52 tasks. The result was striking:

Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows.

Less capable models fared far worse. But even the top tier had a 1-in-4 corruption rate.

What does “corruption” mean here? Not hallucination in the usual sense — the model fabricating facts in a chat response. DELEGATE-52 captures something more insidious: sparse but severe edits that silently alter the document’s meaning, remove information, change numerical values, or restructure content in ways the user never requested.

The model is doing the task. It’s just also doing other things.
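
To make “corruption” concrete, here is a hypothetical before/after of our own (not an example taken from the paper):

```python
# Hypothetical "sparse but severe" corruption (our illustration, not the paper's).
# You ask the model to fix a typo elsewhere; an unrequested change rides along.
original = "Q3 revenue was $4.2M, up 12% year over year."
edited   = "Q3 revenue was $4.8M, up 12% year over year."
# One token changed. The document still reads fine, most of it is untouched,
# and a casual review will not catch it.
```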

Why It Compounds

The paper found three factors that make degradation worse:

Document size — Larger documents have more surface area for unintended edits. Context windows are big now, but attention over long documents is uneven. The model loses track of what it’s “supposed to leave alone.”

Interaction length — Each step in a delegated workflow introduces small deviations. Those deviations accumulate. By step 10 of a 15-step workflow, the corruption from early steps has already drifted out of sight.

Distractor files — When you give a model multiple files in its context and ask it to edit one, it sometimes bleeds content across. Information from File B ends up in File A. This is a real-world scenario: any RAG setup, any multi-file coding task, any research assistant managing sources.
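
The interaction-length factor is worth making concrete with arithmetic. If each step independently corrupts some small fraction of still-intact content, the damage grows geometrically. A back-of-envelope sketch, where the per-step rate is our assumption, not a number from the paper:

```python
# Back-of-envelope model of compounding corruption (illustrative only --
# the 2% per-step rate is an assumption, not a figure from DELEGATE-52).
# If each step independently corrupts a fraction p of still-intact content,
# the intact fraction after n steps is (1 - p) ** n.

def intact_fraction(p_per_step: float, n_steps: int) -> float:
    """Fraction of the document still uncorrupted after n delegated steps."""
    return (1 - p_per_step) ** n_steps

for n in (1, 5, 10, 15):
    corrupted = 1 - intact_fraction(0.02, n)  # assume 2% drift per step
    print(f"after {n:>2} steps: ~{corrupted:.0%} of content corrupted")
# Prints roughly 2%, 10%, 18%, 26%: small per-step drift compounds into
# the ~25% end-of-workflow corruption range the paper reports.
```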

Agentic Tool Use Doesn’t Save You

One might hope that giving models proper tools — read/write file operations, structured document editors, version control — would reduce corruption. The paper tested this explicitly.

It didn’t help.

Tool use doesn’t fix the underlying attention and instruction-following degradation. The model still drifts from the original document’s intent; it just does so through tool calls instead of freeform generation. The mechanism changes; the corruption rate doesn’t.

Why This Matters for AI Engineering

If you’re building systems that delegate document work to LLMs — and most AI engineering teams are, in some form — this paper is a quiet alarm.

The obvious affected workflows:

  • Vibe coding — where you delegate extended codebase edits to a coding agent across many turns
  • Research summarization pipelines — where a model refines and re-summarizes content across multiple passes
  • RAG with document updates — where a model not only retrieves but also annotates or restructures retrieved content
  • Automated report generation — where sequential editing passes produce a “final” document

The corruption DELEGATE-52 finds isn’t random. The paper describes it as “sparse but severe” — which is the worst kind. Random noise is easier to detect and filter. Sparse-but-severe means most of your document looks fine. A small number of locations have been meaningfully corrupted. And you probably won’t catch it on a casual review.

What You Can Do About It

The paper doesn’t offer a silver bullet, but the failure modes it identifies suggest some practical mitigations:

Checkpoint diffing. If your workflow involves multi-step document editing, diff the document against the original at each step. Flag changes that weren’t explicitly requested. This is extra overhead, but it catches drift before it compounds.
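
A minimal sketch using Python’s standard-library difflib; doc_before_step and doc_after_step are placeholders for whatever your pipeline holds at each checkpoint:

```python
import difflib

def checkpoint_diff(original: str, edited: str) -> list[str]:
    """Every line the model added or removed in one delegated step."""
    diff = difflib.unified_diff(
        original.splitlines(), edited.splitlines(),
        fromfile="before_step", tofile="after_step", lineterm="",
    )
    # Keep the +/- change lines, drop the +++/--- file headers.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

# doc_before_step / doc_after_step: placeholders for your pipeline's state.
# Any hunk outside the sections you asked the model to touch is drift,
# caught before it compounds.
for line in checkpoint_diff(doc_before_step, doc_after_step):
    print(line)
```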

Structural preservation constraints. When prompting for delegated edits, be explicit about what the model should not touch. “Edit only section 3. Do not modify any other sections.” Boundaries reduce surface area.
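
One possible way to phrase that constraint as a prompt template (the wording here is ours, not the paper’s):

```python
# One possible constraint wrapper (illustrative wording, not from the paper).
EDIT_PROMPT = """You are editing a document. Follow these constraints exactly.

TASK: {task}

HARD CONSTRAINTS:
- Edit ONLY the section titled "{target_section}".
- Reproduce every other section verbatim. Do not rephrase, reorder,
  reformat, or "improve" anything outside the target section.
- Do not change any numbers, names, or citations anywhere.

Return the complete document."""

prompt = EDIT_PROMPT.format(
    task="Tighten the prose without changing its claims.",
    target_section="Section 3",
)
```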

Shorter delegation chains. The corruption compounds with interaction length. Break long workflows into shorter segments with human review at each transition. This trades automation for accuracy — a real tradeoff that the field hasn’t fully reckoned with.
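
A sketch of what segmenting might look like, assuming hypothetical delegate and approve callables standing in for the model step and the human review gate:

```python
def chunked(steps, size):
    """Split a long delegation chain into short segments."""
    for i in range(0, len(steps), size):
        yield steps[i:i + size]

def run_segmented(steps, document, delegate, approve, size=3):
    """Short delegation segments with a human gate between them.

    `delegate` and `approve` are hypothetical callables: delegate(step, doc)
    runs one model edit; approve(before, after) is the human diff review.
    """
    for segment in chunked(steps, size):
        checkpoint = document                  # snapshot before the segment
        for step in segment:
            document = delegate(step, document)
        if not approve(checkpoint, document):  # reviewer rejects the diff
            document = checkpoint              # roll the whole segment back
    return document
```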

Read-back verification. After a delegated edit, ask the model to re-read the original and the edited version and identify any differences. This isn’t perfect (the same model may not catch its own drift), but it adds a weak check.
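
A sketch, assuming a generic prompt-to-text llm callable:

```python
READBACK_PROMPT = """Compare the ORIGINAL and EDITED documents below.
List every difference you find, however small. Do not summarize.

=== ORIGINAL ===
{original}

=== EDITED ===
{edited}"""

def readback_check(llm, original: str, edited: str) -> str:
    """Ask a model to enumerate the diffs; a human screens the report.

    `llm` is a hypothetical completion callable (prompt -> text).
    """
    # Weak check: the same model may miss its own drift, so ideally
    # route this to a different model than the one that made the edit.
    return llm(READBACK_PROMPT.format(original=original, edited=edited))
```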

Use structured formats where possible. The paper suggests that domain-specific formats (code, music notation, structured schemas) may constrain corruption more than free-form text. If you can lock the output format, you reduce the space where unintended changes can hide.
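
For example, if the delegated output is JSON, a schema check can reject structural corruption outright. This sketch uses the third-party jsonschema package, and the schema itself is illustrative:

```python
import json
from jsonschema import validate  # pip install jsonschema

# Illustrative schema for a hypothetical report document (ours, not the paper's).
REPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "title":    {"type": "string"},
        "revenue":  {"type": "number"},
        "sections": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "revenue", "sections"],
    "additionalProperties": False,  # reject fields the model invented
}

def accept_edit(model_output: str) -> dict:
    """Reject any delegated edit that breaks the document's structure."""
    data = json.loads(model_output)                # raises on malformed JSON
    validate(instance=data, schema=REPORT_SCHEMA)  # raises ValidationError on drift
    return data
```

Note that schema validation catches structural damage, not value drift (a silently changed revenue figure still validates), so pair it with diffing.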

The Bigger Picture: The Trust Assumption Is Wrong

The deeper issue this paper surfaces is architectural. Most AI delegation frameworks assume that LLMs are faithful executors — that when you ask a model to “edit the introduction,” it will edit only the introduction. DELEGATE-52 shows this assumption is false, systematically, across every frontier model tested.

That’s not a bug in any specific model. It’s a property of how current LLMs handle long-context instruction following under multi-step delegation. The models aren’t malicious. They’re just not precise in the way that document integrity requires.

This should change how we design agentic systems. Delegation is not the same as execution. Every delegation step introduces a trust gap. Systems that treat LLMs as reliable subcontractors are building on a foundation the current generation of models cannot provide.

Until we have models where DELEGATE-52 corruption rates approach zero, any workflow that relies on delegated document work needs integrity checks baked in — not as an afterthought, but as a first-class architectural concern.

The paper is at arXiv:2604.15597. Worth reading the full benchmark methodology if you’re building in this space.


Written by Cui
Hi, I am Z, the coder for cuizhanming.com!
