From scanners to reasoning: how LLMs and agent harnesses can improve code security

Better models matter, but better harnesses may matter more. The future of AI-assisted security is evidence, validation, and human-guided judgment.

A year ago, a team at Microsoft explored an idea that felt promising but still a little early: using an AI agent to go beyond vulnerability scanning and perform deeper CVE analysis, including generating VEX documents. The goal was not just to detect that a vulnerable package existed somewhere in a dependency tree, but to reason about whether a given vulnerability actually mattered in a specific deployment.

The results can be found in this GitHub repo.

At the time, the conclusion was measured and practical. The direction looked right, but the models were not yet reliable enough for high-accuracy security work without heavy expert tuning and close human oversight.

That assessment made sense then. It looks different now.

What changed

The shift is not explained by model quality alone. The real change is that both the foundation models and the harnesses around them have improved materially.

A strong signal came from the recent collaboration between Mozilla and Anthropic. In that work, Claude Opus 4.6 reportedly discovered 22 Firefox vulnerabilities over a two-week period, 14 of them classified as high severity, and Mozilla addressed them in Firefox 148. Just as important as the count was the quality of the output: the reports included reproducible test cases, which made them actionable for engineers rather than merely interesting.

That matters because Firefox is not an easy codebase to analyze. It is one of the most heavily scrutinized and security-hardened open-source projects on the web, backed by years of fuzzing, static analysis, secure engineering practices, and review. Finding meaningful vulnerabilities there is a higher bar than generating plausible-looking bug reports against toy repositories or lightly maintained projects.

The implication is significant: AI-assisted security analysis is starting to show value not only on under-defended code, but on mature and deeply examined systems as well.

Why agent harnesses matter

The most important lesson here is that this is not simply a story of “the model got smarter.”

The harness is doing real work.

A strong agent harness gives the model a structured environment in which it can search, test hypotheses, validate findings, and iterate. Instead of producing a one-shot answer, the system operates in a loop:

  1. Form a hypothesis about a vulnerability or unsafe behavior.
  2. Inspect code, runtime behavior, and surrounding context.
  3. Run checks or test cases to validate the hypothesis.
  4. Propose a fix or mitigation.
  5. Verify that the fix removes the issue without breaking intended functionality.
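The loop above can be sketched as a small driver function. Everything here is hypothetical scaffolding (the `Hypothesis` class and the callback names are illustrative, not any real harness's API); in practice the callbacks would be wired to an LLM and to project tooling such as test runners and static analyzers.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A candidate vulnerability the agent wants to confirm or discard."""
    description: str
    evidence: list = field(default_factory=list)
    validated: bool = False

def run_agent_loop(hypotheses, validate, propose_fix, verify_fix):
    """Drive each hypothesis through validate -> fix -> verify."""
    confirmed = []
    for h in hypotheses:
        if not validate(h):          # step 3: run checks or test cases
            continue                 # unvalidated hypotheses are discarded
        h.validated = True
        fix = propose_fix(h)         # step 4: propose a fix or mitigation
        if verify_fix(fix):          # step 5: fix removes issue, nothing breaks
            confirmed.append((h, fix))
    return confirmed

# Usage with stub callbacks standing in for real tools:
hyps = [Hypothesis("unchecked length in parser"),
        Hypothesis("benign lint warning")]
result = run_agent_loop(
    hyps,
    validate=lambda h: "unchecked" in h.description,
    propose_fix=lambda h: f"bounds check for: {h.description}",
    verify_fix=lambda fix: True,
)
print(len(result))  # prints 1: only the validated finding survives
```

The key design point is that nothing reaches `confirmed` without passing both a validation check and a post-fix verification, which is exactly what separates this pattern from one-shot generation.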

This is a very different pattern from autocomplete or chat-based code assistance. The critical capability is not only generation, but verification.

One especially interesting idea highlighted in recent work is the use of task verifiers. These are trusted checks that help confirm whether the agent truly found a real bug, whether a proposed patch actually resolves it, and whether the resulting system still behaves correctly. That moves the workflow away from ungrounded suggestion and toward evidence-based engineering.
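A minimal sketch of such a verifier, under the assumption that a finding only counts when a reproducer demonstrates the bug before the patch and fails to demonstrate it afterward (all function names here are hypothetical):

```python
def verify_finding(trigger_bug, apply_patch, run_tests):
    """Trusted check: the bug must reproduce pre-patch, vanish post-patch,
    and the test suite must still pass. Returns True only with evidence."""
    if not trigger_bug():            # cannot reproduce -> not a real finding
        return False
    apply_patch()
    return (not trigger_bug()) and run_tests()

# Simulated system state: the "bug" triggers until the patch is applied.
state = {"patched": False}
ok = verify_finding(
    trigger_bug=lambda: not state["patched"],
    apply_patch=lambda: state.update(patched=True),
    run_tests=lambda: True,          # stand-in for the real regression suite
)
print(ok)  # prints True
```

Because the verifier is trusted code rather than model output, a persuasive but wrong report cannot pass it, which is what grounds the workflow in evidence.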

For security teams, that distinction is everything.

From vulnerability scanning to exploitability reasoning

This also aligns closely with what many teams have been trying to do with VEX and deeper CVE workflows.

A good security workflow cannot stop at “scanner found package X with CVE-Y.”

That is useful as input, but not as a final decision.

To assess the real impact of a CVE, teams need answers to questions like these:

  • Is the vulnerable component actually present in the deployed environment?
  • Is the vulnerable code path reachable in this application?
  • Are relevant mitigations already in place?
  • Are the exploit preconditions realistic in this operating context?
  • Is the component used in a way that makes the vulnerability relevant?
  • Can we justify a disposition such as affected, not affected, or fixed with defensible evidence?
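The answers to those questions map almost directly onto a disposition. A toy sketch of that mapping, using the common VEX status vocabulary (the rule ordering is illustrative, not a normative decision procedure):

```python
def vex_status(present, reachable, mitigated, fixed):
    """Collapse the exploitability questions above into a VEX status."""
    if fixed:
        return "fixed"
    if not present or not reachable or mitigated:
        return "not_affected"
    return "affected"

# A vulnerable package that exists in the tree but whose code path
# is unreachable in this deployment is not_affected:
print(vex_status(present=True, reachable=False,
                 mitigated=False, fixed=False))  # prints not_affected
```

The hard part, of course, is not this final mapping but producing defensible answers to each input, and that is the work the agent harness is meant to support.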

This is exactly where LLMs paired with strong agent harnesses become interesting.

A well-designed harness can help gather dependency data, inspect code paths, review configuration, compare runtime assumptions, look for mitigations, and assemble evidence that supports a VEX decision. Instead of merely reporting a CVE, the system can assist with reasoning about exploitability.

That is a much higher-value outcome than raw scanner output.

The emerging workflow

What is starting to emerge is a new operating model for secure software engineering.

Agents can help with:

  • wide search across large codebases
  • triage support for vulnerability findings
  • hypothesis generation for exploitability
  • evidence gathering across source, configuration, and build artifacts
  • patch proposal and remediation suggestions
  • regression-aware validation of those patches
  • structured output for downstream security processes such as VEX

Humans still own the trust boundaries.

That part does not change. Security engineers remain responsible for final judgment, risk acceptance, production decisions, and the governance around what evidence is sufficient. The human role becomes less about manually doing every first-pass investigation and more about supervising, validating, and making high-consequence decisions.

That is not replacement. It is amplification.

Product direction is following the same pattern

We are also seeing product momentum in this direction. New security-focused coding agents are being framed not as scanner replacements, but as systems that find, validate, and propose fixes while using project context and reducing false-positive noise.

That framing is important.

Security teams do not need more unactionable findings. They already have too many. What they need are workflows that improve signal quality:

  • fewer false positives
  • stronger reproducibility
  • better contextual reasoning
  • clearer remediation guidance
  • more defensible decisions

The value of these systems is not in dumping more alerts on engineers. The value is in helping turn raw findings into validated and prioritized engineering work.

The caution still matters

None of this means “trust the bot.”

That would be the wrong lesson.

Security teams have good reasons to be skeptical. AI-generated bug reports have had a mixed track record, and false positives impose real costs on maintainers and responders. Every low-quality report consumes time, attention, and trust. In security, those costs add up quickly.

So the correct pattern is not autonomous trust. It is bounded autonomy.

Use agents for exploration, synthesis, and validation support. Require evidence. Keep human review in the loop. Build workflows where the system earns trust through reproducibility, verification, and traceability rather than persuasive language.

That is the difference between a demo and an operational capability.

Why this matters for VEX

VEX is a particularly strong candidate for this model.

Generating a high-quality VEX statement is not just a documentation task. It is a reasoning task. It requires the system to connect vulnerability metadata with the reality of a specific codebase and deployment model.

A useful AI-assisted VEX pipeline would need to:

  • ingest CVE and advisory information
  • map affected components to the actual software bill of materials
  • inspect source and configuration for feature usage and reachability
  • identify environment-specific mitigations
  • evaluate exploit preconditions
  • collect evidence for the final status
  • produce a human-reviewable explanation for why the status is justified
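The pipeline above can be sketched end to end. Every name here is hypothetical (the `VexRecord` shape, the SBOM represented as a dict, the `reachable` and `mitigations` callbacks); real inputs would come from advisories, an SBOM document, and harness tooling. The point of the sketch is that each early exit carries its own evidence and rationale for human review.

```python
from dataclasses import dataclass

@dataclass
class VexRecord:
    """A human-reviewable VEX outcome: status plus supporting evidence."""
    cve_id: str
    component: str
    status: str
    evidence: list
    rationale: str

def assess(cve_id, sbom, reachable, mitigations):
    evidence = []
    component = sbom.get(cve_id)                  # map CVE to the actual SBOM
    if component is None:
        return VexRecord(cve_id, "-", "not_affected", evidence,
                         "component absent from SBOM")
    evidence.append(f"{component} present in SBOM")
    if not reachable(component):                  # source/config reachability
        evidence.append("vulnerable code path not reachable")
        return VexRecord(cve_id, component, "not_affected", evidence,
                         "code path unreachable in this deployment")
    if mitigations(component):                    # environment mitigations
        evidence.append("mitigation in place")
        return VexRecord(cve_id, component, "not_affected", evidence,
                         "mitigated by environment configuration")
    return VexRecord(cve_id, component, "affected", evidence,
                     "present, reachable, and unmitigated")

# Usage with stub inputs:
rec = assess("CVE-2026-0001", {"CVE-2026-0001": "libexample"},
             reachable=lambda c: True, mitigations=lambda c: False)
print(rec.status)  # prints affected
```

Note that the record never states a bare status: the `evidence` list and `rationale` string are what make the disposition defensible downstream.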

That is exactly the kind of multi-step workflow where agent harnesses shine.

The model provides reasoning and synthesis. The harness provides structure, tools, validation, and repeatability. Together, they can reduce manual effort while improving the quality of the final security assessment.

The real inflection point

Twelve months ago, using agents for deep CVE reasoning and VEX generation felt early but directionally correct.

In 2026, it looks much more like an emerging operating model.

The pattern is becoming clearer:

  • let agents perform broad search across code and security context
  • let them generate and test hypotheses
  • let them gather evidence and propose patches
  • let them support exploitability reasoning and VEX authoring
  • let humans make the final call

That is the real story. The interesting development is not just that language models have improved. It is that we are getting better at surrounding them with the right harnesses: verifiers, tools, evidence loops, and guardrails.

That combination is what makes the output useful for serious security work.

Final thoughts

For engineering organizations, this opens up a practical and high-value set of opportunities. The most promising areas are the ones where the cost of manual analysis is high, the amount of contextual reasoning is substantial, and the final decision still benefits from expert human review.

CVE triage, VEX generation, exploitability assessment, and patch validation all fit that model extremely well.

The opportunity is not to remove security engineers from the loop. It is to give them better leverage.

And for teams building internal platforms and developer tooling, this may be one of the most interesting places to invest next: security workflows where LLMs provide breadth and speed, agent harnesses provide discipline and validation, and humans provide judgment.

That is a much stronger foundation than either models or scanners alone.