LLMs Corrupt Your Documents When You Delegate

The uncomfortable gap between “can edit” and “can be trusted”

A lot of current AI enthusiasm is built around delegation.

We no longer ask language models only to answer questions. We ask them to modify source code, rewrite reports, refactor configuration files, reorganize spreadsheets, update structured records, transform diagrams, edit subtitles, and operate across entire project folders. In software engineering this is often described as “vibe coding”, but the pattern is broader: a human gives a goal, the AI system manipulates artifacts, and the human supervises at a distance.

That is exactly the scenario Microsoft Research studies in the paper “LLMs Corrupt Your Documents When You Delegate”. The paper introduces DELEGATE-52, a benchmark for long-horizon delegated document editing across 52 professional domains, and the result is sobering: even strong frontier models can silently degrade documents over repeated editing workflows. The paper reports that frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupted an average of about 25% of document content by the end of long simulated workflows, while the average degradation across all evaluated models was substantially worse. (arXiv)

This is not a paper about models refusing instructions. It is about models trying to do the work, mostly following the request, and still damaging the artifact.

That distinction matters.

What DELEGATE-52 evaluates

DELEGATE-52 is designed to answer a practical question:

If I hand an LLM a set of professional documents and ask it to perform a sequence of realistic edits, how much of the original document remains semantically intact after repeated delegation?

The benchmark contains work environments across 52 domains, including Python code, Dockerfiles, database schemas, Graphviz diagrams, recipes, subtitles, accounting ledgers, genealogy records, chess notation, music notation, crystallography files, 3D object files, calendars, transit data, and more. The paper groups these domains into categories such as Code & Configuration, Science & Engineering, Creative & Media, Structured Records, and Everyday documents. (arXiv)

Each work environment contains:

  1. Seed documents
    Real documents found online, not synthetic templates. They are typically in textual, unencoded formats and run roughly 2,000–5,000 tokens.
  2. Edit tasks
    Realistic, non-trivial transformations a user might ask an AI system to perform. For example, splitting an accounting ledger by category, converting amounts, reformatting records, or restructuring a document.
  3. Distractor context
    Related but irrelevant files, meant to simulate a realistic workspace where retrieval is not perfect and the model sees more than just the one file it needs. The paper describes distractor context in the 8,000–12,000 token range. (arXiv)
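
To make the setup concrete, a work environment can be pictured as a small record like the sketch below. The field names are illustrative stand-ins, not the benchmark's actual schema.

from dataclasses import dataclass, field

@dataclass
class WorkEnvironment:
    # Field names are illustrative, not the benchmark's actual schema.
    domain: str                         # e.g. "subtitles" or "accounting"
    seed_document: str                  # real document, roughly 2,000-5,000 tokens
    edit_pairs: list[tuple[str, str]]   # reversible (forward, backward) task pairs
    distractors: list[str] = field(default_factory=list)  # ~8,000-12,000 tokens of noise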

This matters because real-world AI delegation rarely happens in a clean prompt containing only the relevant data. It happens in messy repositories, document libraries, folders, SharePoint sites, wiki exports, generated artifacts, old versions, and “probably relevant” files returned by a search or retrieval system.

The clever part: round-trip editing

One of the hardest problems in evaluating document editing is that you often do not have a perfect reference answer.

For a coding benchmark, you might have unit tests. For a math benchmark, you might have a known answer. For a document transformation task, reference answers are expensive to create and domain-specific. How do you know whether the result is semantically equivalent?

DELEGATE-52 solves this with a round-trip relay.

Instead of evaluating a single one-way edit, each task is defined as a pair:

Original document
   ↓ forward edit
Transformed document
   ↓ backward edit
Reconstructed document

A perfect model should be able to apply the forward transformation and then apply the inverse transformation, returning the document to its original semantic state.

For example:

Forward task:
Split this ledger into separate files by expense category.

Backward task:
Merge the category files back into one chronological ledger.

If the reconstructed ledger differs from the original ledger, something was lost, altered, duplicated, reordered incorrectly, or hallucinated.

The benchmark then chains multiple round trips together. Ten round trips equal 20 model interactions. The paper calls this a relay, and it is designed to simulate long delegated workflows rather than isolated prompt-response interactions. (arXiv)

The core metric is the Reconstruction Score, or RS@k, which measures how well the document is preserved after k interactions using domain-specific similarity functions. The repository describes this directly: round trips are chained, and performance is measured by comparing the recovered document against the original using domain-specific evaluators. (GitHub)
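
To make the mechanics concrete, here is a minimal sketch of a chained relay with RS@k scoring. The callables forward_edit, backward_edit, and similarity are hypothetical stand-ins for the model calls and a domain-specific evaluator; this is not the repository's actual API.

def run_relay(original, edit_pairs, forward_edit, backward_edit, similarity):
    """Chain round trips; edit_pairs holds (forward, backward) instructions.
    Returns one score per round trip, i.e. RS@2, RS@4, ..."""
    doc = original
    scores = []
    for fwd_task, bwd_task in edit_pairs:
        transformed = forward_edit(doc, fwd_task)   # interaction 2k - 1
        doc = backward_edit(transformed, bwd_task)  # interaction 2k
        # Always compare against the original document, so errors
        # introduced in earlier round trips keep counting.
        scores.append(similarity(original, doc))
    return scores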

Why generic similarity is not enough

A key contribution of the benchmark is that it does not rely only on generic text similarity, Levenshtein distance, embeddings, or an LLM judge.

That would be too weak.

A recipe where 200g butter becomes 800g butter may look textually similar but is semantically broken. A DNS zone file with one incorrect record can be operationally dangerous. A calendar entry with the wrong date is not “mostly right”. A source file with a small but critical logic change can still compile and be wrong.

DELEGATE-52 therefore uses domain-specific parsers and evaluators. For a recipe, the parser might extract ingredients, quantities, units, steps, and tips. For another domain, it might parse structured records, source files, metadata, geometry, accounting entries, or notation. The paper states that these domain-specific similarity functions were designed to capture semantic equivalence and that generic similarity measures, including LLM-as-judge approaches, failed to capture nuanced semantic differences reliably. (arXiv)
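
To see why this matters, here is a toy sketch of what a domain-specific evaluator for recipes could look like. The parsing and scoring are deliberately simplistic and are my own illustration, not the paper's implementation.

import re

def parse_ingredients(recipe: str) -> dict:
    """Toy parser: map ingredient names to (quantity, unit)."""
    out = {}
    for qty, unit, name in re.findall(
            r"(\d+(?:\.\d+)?)\s*(g|kg|ml|l|tsp|tbsp)\s+([a-z ]+)",
            recipe.lower()):
        out[name.strip()] = (float(qty), unit)
    return out

def recipe_similarity(original: str, reconstructed: str) -> float:
    """Fraction of original ingredients preserved with identical
    quantity and unit. '200g butter' vs '800g butter' scores zero
    for that ingredient even though the strings look nearly identical."""
    ref = parse_ingredients(original)
    got = parse_ingredients(reconstructed)
    if not ref:
        return 1.0
    kept = sum(1 for name, qu in ref.items() if got.get(name) == qu)
    return kept / len(ref)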

This is one of the most important ideas in the paper: document correctness is domain-specific.

There is no universal “looks fine to me” metric that can reliably validate all delegated work.

The headline result: degradation compounds

The paper evaluates 19 LLMs from six model families: OpenAI, Anthropic, Google Gemini, Mistral, xAI, and Moonshot. The main experiment uses 20 delegated interactions over work environments with seed documents plus distractor context. (arXiv)

The reported results show that models degrade documents over time. The paper highlights that frontier models lose roughly a quarter of document content by the end of long workflows, and that across all models the average degradation is around 50%. (arXiv)

The most important lesson is not just that models make mistakes. We already knew that.

The important lesson is that short tests are misleading.

The paper gives examples where two models perform similarly after two interactions but diverge substantially by the twentieth interaction. Conversely, one model may start behind another and overtake it later. The authors explicitly warn that short interaction simulations are insufficient for understanding long-horizon delegated performance. (arXiv)

That has direct implications for how we evaluate AI-assisted engineering tools.

A demo where an AI assistant successfully edits one file is not evidence that it can safely maintain a project over 50 edits. A benchmark where a model performs a single transformation does not tell us whether it preserves invariants across repeated transformations. A one-shot code refactor may pass, while a multi-step repository migration slowly accumulates incorrect assumptions.

Tool use did not magically fix the problem

A common intuition is that agents with tools should perform better than plain LLMs. Give the model file-system access, read/write tools, Python execution, and a multi-turn loop, and surely it should preserve documents more reliably.

The DELEGATE-52 repository includes exactly this kind of agentic harness: model_agentic.py, where the LLM can use tools such as reading files, writing files, deleting files, and running Python in a multi-turn loop. (GitHub)

The paper’s finding is important: basic agentic tool use did not improve performance on DELEGATE-52. (arXiv)

That does not mean tools are useless. It means that merely wrapping a model in a tool loop is not enough. The agent still needs robust planning, state tracking, validation, rollback, diff awareness, semantic checks, and domain-specific correctness tests.

A tool-using LLM that confidently writes corrupted files is still a corruption engine.

The failures are sparse but severe

One of the most interesting sections of the paper analyzes how the degradation happens.

At first glance, aggregate curves can make degradation look smooth, as if every interaction introduced a small amount of noise. But the paper's deeper analysis shows that this is not the main failure mode.

Instead, models often preserve the document reasonably well for some steps, then suffer critical failures: individual round trips that drop the score by 10 points or more. The authors report that these sparse critical failures explain about 80% of total document degradation. Stronger models do not necessarily eliminate the failure mode; they delay it or experience it less often. (arXiv)

This is exactly the kind of failure that is dangerous in real delegated work.

A model can look reliable for several operations, building user trust, and then silently introduce one severe corruption:

  • a field is dropped from a structured record;
  • a financial amount is changed;
  • a calendar recurrence rule is mangled;
  • a source file loses an edge case;
  • a dependency version is changed incorrectly;
  • a music notation file remains syntactically plausible but musically wrong;
  • a 3D object file renders differently;
  • a translation preserves style but loses a constraint.

The user may not notice because the document still “looks” valid.

Deletion versus corruption

The paper distinguishes between two broad degradation patterns:

  1. Deletion: content disappears.
  2. Corruption: content remains present but becomes incorrect.

This distinction is critical.

The paper finds that weaker models tend to lose content through deletion, while frontier models more often corrupt content that is still present. (arXiv)

From a user perspective, corruption is often worse than deletion.

Missing content can sometimes be spotted. Incorrect content that remains structurally plausible is harder to detect. A missing row in a ledger is bad; a row with the wrong amount, currency, date, or account can be worse. A removed test is visible in a diff; a subtly weakened assertion may not be. A missing DNS record may cause an outage; an incorrect DNS record may route traffic somewhere unintended.

This is why “the model preserved most of the file” is not enough. Preservation must be semantic, not cosmetic.

Structured domains perform better, but only relatively

DELEGATE-52 shows that performance varies significantly by domain.

The paper reports that models perform better in programmatic and structured domains, such as Python and database schemas, and worse in natural language or niche domains such as recipes, fiction, transit, or textile-related formats. It also notes better performance in domains with high repetitiveness and structural density, and weaker performance in domains with rich, unrepeated vocabulary. (arXiv)

That fits what many practitioners observe.

LLMs are comparatively strong where:

  • syntax is explicit
  • structure is repetitive
  • constraints are local
  • validators exist
  • tests can be executed
  • there are many examples in training data
  • the domain has machine-checkable invariants

They are weaker where:

  • correctness depends on domain semantics
  • the document is long and irregular
  • many entities must be tracked globally
  • subtle changes matter
  • there is no easy validator
  • the format is rare or specialized
  • human review requires expertise

This is also why AI coding often feels ahead of AI document editing in other professional domains. Code has compilers, tests, linters, type checkers, schemas, package managers, and runtime behavior. Many other professional documents do not have such rich verification infrastructure.

Global restructuring is hard

The benchmark tags edit tasks by semantic operations such as sorting, merging, splitting, classification, string manipulation, and referencing. The paper finds that tasks requiring global document restructuring, such as split-and-merge operations or classification across the whole document, are harder than local operations such as string manipulation. Tasks requiring multiple coordinated operations are harder still. (arXiv)

This is highly relevant for real workflows.

The risky tasks are not necessarily simple edits like:

Rename this heading.
Fix this typo.
Change this variable name.
Convert this field from snake_case to camelCase.

The risky tasks are more like:

Refactor this module into three smaller modules while preserving behavior.

Split this specification into separate requirement documents grouped by subsystem.

Normalize this spreadsheet into separate tables and regenerate the summary.

Convert this accounting ledger into another format and preserve all balances.

Reorganize this policy document by audience and remove duplicates.

Merge these calendar files and preserve recurrence rules.

Those tasks require the model to maintain a global mental model of the artifact. That is exactly where small mistakes become structural corruption.

The image editing result is even worse

The paper also explores whether the methodology applies beyond text by creating visual work environments for image editing models. The result is even more severe: the authors report that image editing models degrade images much faster than LLMs degrade text. The best image models achieved final reconstruction scores around 28–30%, compared with roughly 70–80% for textual domains, and no image model exceeded 65% after only two interactions. (arXiv)

This is relevant because “document” should be interpreted broadly.

Many professional artifacts are not plain prose:

  • diagrams
  • CAD-like files
  • screenshots
  • design files
  • charts
  • maps
  • slides
  • images
  • audio metadata
  • video subtitles
  • 3D assets

Delegated editing of these artifacts needs even stronger validation because visual plausibility is not the same as fidelity.

The repository

Microsoft released the code in the microsoft/DELEGATE52 repository. The repository contains the benchmark harness, prompts, domain-specific parsers/evaluators, and experiment runners. The README describes DELEGATE-52 as a benchmark for evaluating LLMs on long-horizon delegated document editing across 52 professional domains, and it points to the dataset hosted on Hugging Face. (GitHub)

The key files are:

run_relay.py        Main experiment runner for chained round-trip edits
run_single.py       Runs individual forward/backward edit pairs
model_openai.py     OpenAI / Azure OpenAI model wrapper
model_agentic.py    Tool-using agent harness
domains/            Domain-specific parsers and evaluators
prompts/            Prompt templates used during simulation

The public dataset contains the redistributable subset: 234 work environments across 48 domains, each with seed documents, 5–10 reversible edit pairs, and distractor context. The repository README also includes a basic example for running a relay simulation, with a clear warning that simulations call LLM APIs and therefore cost real money. (GitHub)

Example command from the repository:

python run_relay.py --model_names gpt-5.4 --domains subtitles --num_round_trips 10

For practitioners, the repository is useful not only as a benchmark, but as a design pattern: create domain-specific round-trip tasks, parse the resulting artifacts, and measure semantic preservation over repeated edits.

What this means for AI-assisted engineering

For software engineers, the paper should feel familiar.

We already know that AI coding assistants can produce impressive results and still introduce subtle defects. The difference is that software engineering has a mature validation culture:

  • version control
  • diffs
  • pull requests
  • tests
  • static analysis
  • CI/CD
  • type systems
  • linters
  • code review
  • runtime monitoring
  • rollback

The paper’s central message is that all delegated document workflows need a similar discipline.

The more autonomy we give an AI system, the more we need artifact-level safety mechanisms.

A good AI-assisted workflow should therefore include:

1. Version every artifact

Never let an AI agent mutate important documents without version history.

For code, this means Git. For documents, it may mean SharePoint versioning, OneDrive history, document snapshots, object storage versioning, or explicit pre/post copies. The ability to diff and rollback is not optional.

2. Prefer patch-based edits over full rewrites

A model that rewrites an entire file has far more opportunity to corrupt unrelated content.

Where possible, ask for minimal patches:

Only modify the sections required for this change.
Return a unified diff.
Do not rewrite unrelated sections.
Preserve all existing identifiers, values, comments, and ordering unless explicitly instructed.

This does not eliminate risk, but it reduces the blast radius.
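
One way to make that reduction mechanical is to diff the model's output against the original and reject any edit that touches lines outside the sections you asked it to change. Below is a sketch using Python's difflib; the allowed line ranges are supplied by the caller, and the whole guard is an illustration, not a complete safeguard.

import difflib

def patch_stays_in_scope(original: str, edited: str, allowed: list) -> bool:
    """Return True only if every changed region of the original falls
    inside one of the allowed 0-based line ranges."""
    matcher = difflib.SequenceMatcher(
        None, original.splitlines(), edited.splitlines())
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Pure insertions have i1 == i2; treat the insertion point
        # as the touched position.
        touched = range(i1, i2) if i1 < i2 else [i1]
        if not all(any(i in r for r in allowed) for i in touched):
            return False
    return True

A rejected edit can then be retried with a stricter prompt or escalated for human review.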

3. Use domain-specific validators

Generic LLM review is not enough.

Use validators that understand the artifact:

Code              tests, type checks, linters, static analysis
JSON/YAML         schema validation
Terraform/Bicep   plan validation, policy checks
SQL               schema diffing, migration tests
Spreadsheets      formula checks, row/column invariants
Accounting        balanced entries, totals, currency checks
Calendars         recurrence validation
Subtitles         timing overlap checks
DNS               zone validation
Diagrams          parse/render validation

This is exactly the spirit of DELEGATE-52’s domain-specific evaluators.
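
In a pipeline, this can be as simple as a validator registry keyed by artifact type. The sketch below uses only the standard library; the registry shape and the two sample validators are assumptions for illustration, not part of DELEGATE-52.

import json
import subprocess

VALIDATORS = {}

def validator(kind):
    """Register a validation function for one artifact type."""
    def register(fn):
        VALIDATORS[kind] = fn
        return fn
    return register

@validator("json")
def validate_json(path):
    with open(path) as f:
        json.load(f)  # raises ValueError on syntactic corruption

@validator("python")
def validate_python(path):
    # Byte-compile only: catches syntax damage, not logic changes.
    subprocess.run(["python", "-m", "py_compile", path], check=True)

def validate(kind, path):
    """Run the domain validator before accepting an AI edit."""
    VALIDATORS[kind](path)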

4. Detect invariant violations

Before delegating, define what must remain true.

Examples:

The number of invoices must not change.
All customer IDs must be preserved.
The total balance must remain identical.
No dates may be changed unless explicitly requested.
All source citations must remain attached to their claims.
All tests that passed before must pass after.
All public method signatures must remain compatible.

Then validate those invariants mechanically where possible.
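
For ledger-style invariants like the ones above, the checks can be a few lines of code. A sketch, assuming a CSV ledger with 'id' and 'amount' columns; the column names are an assumption for this illustration.

import csv
from decimal import Decimal

def ledger_violations(before_path, after_path):
    """Return a list of violated invariants, empty if all hold."""
    def load(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    before, after = load(before_path), load(after_path)
    violations = []
    if len(before) != len(after):
        violations.append("entry count changed")
    if {row["id"] for row in before} != {row["id"] for row in after}:
        violations.append("entry IDs not preserved")
    def total(rows):
        return sum(Decimal(row["amount"]) for row in rows)
    if total(before) != total(after):
        violations.append("total balance changed")
    return violations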

5. Keep humans in the loop for semantic review

The paper explicitly warns users not to generalize capability from one domain to another and says users still need to closely monitor LLM systems when delegating work. (arXiv)

The right level of review depends on risk.

For low-risk prose drafts, lightweight review may be enough. For production code, financial documents, legal text, medical records, security configuration, infrastructure changes, or customer-facing data, delegated edits should go through rigorous review.

6. Treat long workflows differently from single edits

The paper shows that short interaction performance is not predictive of long-horizon performance. (arXiv)

That means evaluation should match the intended workflow. If your agent will perform 30 steps, do not evaluate it with one-step tasks. If your assistant will edit entire repositories, do not validate it only on isolated snippets. If your process includes retrieved context, include distractor documents in evaluation.

7. Build rollback into agentic systems

An AI agent should not only edit. It should be able to checkpoint, validate, fail, rollback, and explain.

A safer architecture looks like this:

Input workspace
   ↓
Create snapshot
   ↓
Plan changes
   ↓
Apply minimal patch
   ↓
Run validators
   ↓
Compare semantic invariants
   ↓
Summarize diff
   ↓
Human approval or automatic rollback

Without this loop, tool use may only make the model faster at corrupting files.
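
A minimal sketch of that loop, assuming a workspace directory, a caller-supplied apply_edit step, and validator callables that raise on failure. This illustrates the pattern; it is not the paper's harness.

import shutil
import tempfile
from pathlib import Path

def guarded_edit(workspace: Path, apply_edit, validators) -> bool:
    """Snapshot, apply one delegated edit, validate, and roll back
    automatically if anything fails."""
    snapshot = Path(tempfile.mkdtemp(prefix="snapshot-"))
    shutil.copytree(workspace, snapshot, dirs_exist_ok=True)
    try:
        apply_edit(workspace)        # e.g. one agent interaction
        for check in validators:
            check(workspace)         # raises on any violation
        return True                  # edit accepted
    except Exception:
        shutil.rmtree(workspace)                  # discard corrupted state
        shutil.copytree(snapshot, workspace)      # automatic rollback
        return False
    finally:
        shutil.rmtree(snapshot, ignore_errors=True)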

A practical delegated-editing checklist

Before letting an LLM edit important documents, ask:

Do I have a clean pre-edit snapshot?
Can I see a precise diff?
Can I validate the file syntactically?
Can I validate it semantically?
Are there domain-specific invariants?
Are unrelated sections protected?
Can I roll back automatically?
Is the task local or global?
Is there distractor context that may confuse the model?
How many sequential edits will happen?
Do I have tests that reflect the real task, not just a demo?

For AI-assisted coding, that translates into:

Use small commits.
Use feature branches.
Require tests after every agent step.
Ask for plans before edits.
Review diffs carefully.
Prefer constrained file scopes.
Run formatters and linters.
Run unit and integration tests.
Use static analysis and dependency checks.
Do not accept broad rewrites without review.

For RAG and document automation systems, it translates into:

Preserve source references.
Validate extracted entities.
Track document lineage.
Compare pre/post structured representations.
Use schema-aware parsing.
Flag changed numbers, dates, names, IDs, and citations.
Evaluate over multi-step workflows, not just one answer.

Why this paper matters

The core insight of “LLMs Corrupt Your Documents When You Delegate” is not that LLMs are bad. In fact, the paper also shows rapid progress: it notes that GPT-family benchmark performance improved significantly between the tested GPT 4o and GPT 5.4 models. (arXiv)

The real message is more nuanced:

LLMs are becoming capable enough to delegate work to, but not reliable enough to trust without verification.

That is the dangerous middle.

When models were obviously weak, users did not trust them. When models become near-perfect, delegation will be safer. But today’s systems often live in between: impressive, useful, productive, and still capable of silent severe corruption.

That makes evaluation, validation, and workflow design critical.

DELEGATE-52 gives us a useful language for this problem. It shifts the conversation from “Can the model do the task once?” to “Can the model preserve the artifact over a long delegated workflow?” That is the right question for AI-assisted engineering, document automation, enterprise copilots, and agentic systems.

Conclusion

Delegation is not just prompting at a larger scale. It is an operational model where an AI system mutates valuable artifacts on behalf of a human.

That requires trust.

The Microsoft Research paper shows that today’s LLMs can still violate that trust in subtle and severe ways. They often attempt the task. They may produce plausible output. They may succeed for several steps. But over long workflows, errors compound, critical failures appear, and documents can become corrupted.

The practical takeaway is clear:

Use LLMs aggressively, but do not delegate blindly.

Treat AI-generated edits like untrusted code changes: version them, diff them, validate them, test them, and review them. The future of AI-assisted work is not “let the model edit everything.” It is model capability plus engineering discipline.

That is where reliable delegation starts.