Multi-agent workflows - Claude skills for journalism

The one idea

A workflow is an assignment desk, not a single reporter. Instead of asking one AI to do an entire job in one long conversation, you split the work across many narrow agents — each with one beat — then run an editor's pass over what they file. The shape is older than the technology: the most credible AI-assisted journalism, from the Panama Papers to The Markup's algorithm audits, already works this way. A machine does the high-volume first pass at scale, and a separate verification step decides what to trust before anything gets published.

That second step is the difference. “The AI found it” is not a finding. “A separate agent tried to knock it down and couldn't” is a method you can defend to an editor.

Who this is for: AI-curious journalists, media professionals, faculty, and researchers. No code required to understand it. The examples are real and sourced; the cautions are honest.

One chat vs. a workflow

A single chat

One model, one context window, one pass.
Loses track somewhere around document 20.
Grades its own homework.
Cheap and fast — right for most questions.

A workflow

Many agents, each with its own focus, run in parallel.
Holds hundreds of items in flight; returns a structured table.
A separate agent checks the work before you see it.
Expensive — earns its keep at volume and when verification matters.

Four patterns, in plain language

Almost every workflow is built from four moves. You don't need the names to use them, but they make the rest of this page easier to read.

Fan-out

Split a big job into independent pieces and run them at the same time, then merge. Like handing 50 documents to 50 researchers at once. Fast; uses the most compute. Best when the pieces don't depend on each other.
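
For readers who want to see the move, here is a minimal sketch of fan-out in JavaScript. The names are assumptions made up for illustration: `summarizeDocument` is a hypothetical stand-in for a real agent call (it only counts words) and `fanOut` is not part of any library. `Promise.allSettled` lets every piece start at once and reports failures without sinking the rest of the batch.

```javascript
// Hypothetical stand-in for a real agent call; it only counts words.
async function summarizeDocument(doc) {
  const words = doc.text.split(/\s+/).filter(Boolean);
  return { id: doc.id, wordCount: words.length };
}

// Fan-out: start every piece at the same time, then merge the results.
// Failures are collected rather than thrown, so one bad item cannot sink the batch.
async function fanOut(items, worker) {
  const settled = await Promise.allSettled(items.map(async (item) => worker(item)));
  return {
    results: settled.filter((s) => s.status === "fulfilled").map((s) => s.value),
    failures: settled.filter((s) => s.status === "rejected").map((s) => s.reason),
  };
}

// Made-up sample documents for illustration.
const sampleDocuments = [
  { id: "memo-1", text: "Budget approved for the new bridge" },
  { id: "memo-2", text: "Contract awarded without bids" },
];

const fanOutRun = fanOut(sampleDocuments, summarizeDocument);
fanOutRun.then(({ results, failures }) => {
  console.log(results, `failures: ${failures.length}`);
});
```

The merge step here is just collecting results into one array; in a real workflow it is where deduplication and cross-referencing would happen.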

Pipeline

A fixed sequence where each step feeds the next: draft, then translate, then proofread. Linear and easy to trace, so you can see exactly where something went wrong. Slower than running in parallel.
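
A minimal sketch of the same idea in JavaScript, assuming each step is a small function standing in for an agent. `runInSequence` and the three steps are hypothetical names invented for this example. The trace is what makes a pipeline easy to debug: it keeps every intermediate output.

```javascript
// Run steps in a fixed order; each step receives the previous step's output.
// The trace keeps every intermediate value so a bad step is easy to locate.
async function runInSequence(input, steps) {
  const trace = [];
  let value = input;
  for (const step of steps) {
    value = await step.run(value);
    trace.push({ step: step.name, output: value });
  }
  return { output: value, trace };
}

// Hypothetical stand-ins for real agent calls (draft, tidy, proofread).
const editingSteps = [
  { name: "draft", run: async (notes) => `Council votes ${notes}` },
  { name: "tidy", run: async (text) => text.replace(/\s+/g, " ").trim() },
  { name: "proofread", run: async (text) => (text.endsWith(".") ? text : `${text}.`) },
];

const sequenceRun = runInSequence("to  fund the bridge", editingSteps);
sequenceRun.then(({ output, trace }) => {
  console.log(output); // Council votes to fund the bridge.
  console.log(trace.map((t) => t.step).join(" -> ")); // draft -> tidy -> proofread
});
```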

Adversarial-verify

One agent produces an answer; a different agent (or a human) tries to poke holes in it against explicit criteria before it's trusted. It works only if the checker is genuinely independent and pulls its own fresh evidence — a refutation that adds nothing new proves nothing.
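
One way to encode that rule, as a hedged sketch: the checker must hand back evidence of its own, and a review with no evidence counts as "unverified" rather than as a pass. The function name, the status labels, and the stub producer and checkers below are all assumptions made for illustration.

```javascript
// A finding is trusted only when an independent checker, working from
// evidence it gathered itself, tries to refute it and fails.
async function adversarialVerify(question, producer, checker) {
  const finding = await producer(question);
  const review = await checker(finding);
  if (!Array.isArray(review.evidence) || review.evidence.length === 0) {
    // No fresh evidence means the check proved nothing either way.
    return { finding, status: "unverified" };
  }
  return { finding, status: review.refuted ? "rejected" : "survived" };
}

// Hypothetical stand-ins: a real producer and checker would be separate
// agents (or a human), each doing its own searches.
const exampleProducer = async (q) => ({ claim: `Answer to: ${q}` });
const checkerWithEvidence = async () => ({ refuted: false, evidence: ["placeholder source"] });
const checkerWithoutEvidence = async () => ({ refuted: false, evidence: [] });

const verifyRuns = Promise.all([
  adversarialVerify("When was the vote?", exampleProducer, checkerWithEvidence),
  adversarialVerify("When was the vote?", exampleProducer, checkerWithoutEvidence),
]);
verifyRuns.then((runs) => {
  console.log(runs.map((r) => r.status)); // "survived", then "unverified"
});
```

The code can require that evidence exists; it cannot judge whether the evidence is any good. That judgment stays with the checker, and ultimately with an editor.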

Loop-until-dry

Keep trying until a condition is met or there's nothing left to find: try the live URL, fall back to an archive, confirm a usable copy exists, then stop. Set a stopping rule so it doesn't loop forever or chase diminishing returns.
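
The link-recovery example can be sketched like this. It is a minimal illustration under stated assumptions: `recoverPage` and the three stub sources are hypothetical, and real sources would make network requests instead of returning canned values. The stopping rule is explicit: stop at the first usable copy, or after `maxAttempts` sources.

```javascript
// Try sources in order until one yields a usable copy or the list runs dry.
// maxAttempts is the stopping rule that prevents chasing diminishing returns.
async function recoverPage(url, sources, maxAttempts = 3) {
  const attempts = [];
  for (const source of sources.slice(0, maxAttempts)) {
    let copy = null;
    try {
      copy = await source.fetchCopy(url);
    } catch (err) {
      copy = null; // treat an error the same as "nothing found here"
    }
    const usable = typeof copy === "string" && copy.trim().length > 0;
    attempts.push({ source: source.name, usable });
    if (usable) return { found: true, source: source.name, copy, attempts };
  }
  return { found: false, source: null, copy: null, attempts };
}

// Hypothetical stand-ins; real sources would make network requests.
const stubSources = [
  { name: "live", fetchCopy: async () => { throw new Error("404"); } },
  { name: "archive-a", fetchCopy: async () => "" },
  { name: "archive-b", fetchCopy: async () => "<html>saved copy</html>" },
];

const recoveryRun = recoverPage("https://example.org/story", stubSources);
recoveryRun.then((r) => {
  console.log(r.found, r.source, r.attempts.length); // true archive-b 3
});
```

Keeping the list of attempts matters as much as the result: it is the record of where the workflow looked before it gave up.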

How it works in real life

These are documented, real efforts — not hypotheticals. Each one already follows the assignment-desk shape. The sources are linked; treat any vendor accuracy figures as starting points to verify, not settled facts.

Investigation at scale

Panama & Pandora Papers + Datashare

Fan-out + verify

ICIJ · 2016–2021

ICIJ ran OCR and named-entity recognition over roughly 11.5M (Panama) and 11.9M (Pandora) leaked files with its open-source Datashare tool to auto-detect people, organizations, and places, then mapped the relationships in a graph database.

The shape: millions of documents split for parallel extraction; reporters in dozens of countries confirm entities against public records before publishing; the graph is the merge step that links per-document findings into one deduplicated network.

ICIJ on Datashare →

Hidden Spy Planes

Fan-out + verify

BuzzFeed News (Peter Aldhous) · 2017

A machine-learning classifier trained on the flight patterns of known FBI and DHS planes scored about 20,000 aircraft from four months of flight data, flagging roughly 69 candidate surveillance planes. The code and data were open-sourced.

The shape: the model scores every plane independently; reporters then cross-check FAA registration, shell-company ownership, and flight behavior to confirm each hit before naming it.

BuzzFeed News →

Algorithm audits (Citizen Browser, Allstate)

Fan-out + verify

The Markup (Allstate work with Consumer Reports) · 2020–2021

The Markup analyzed feed data from a paid panel of 1,000+ Facebook users to see what different demographics were shown, and separately analyzed pricing for about 93,000 Allstate policyholders in Maryland to expose a discriminatory pricing scheme.

The shape: automated analysis across many sampled units, then a rigorous verify stage — published methodology, independent re-runs by data scientists, and outside-expert review before publication.

The Markup →

Production at scale

Automated earnings and sports coverage

Fan-out

Associated Press (with Automated Insights) · 2014–2016

AP used natural-language generation to turn structured earnings data into stories, scaling quarterly coverage roughly tenfold — from about 300 to several thousand stories per quarter — and later extended it to minor-league baseball recaps.

The shape: one generation step produces a draft per item (each filing, each box score) in parallel; verification is built in as templates plus source-data validation, with editors spot-checking and a disclosure line as the audit trail. The pattern works even when the per-item agent is template generation, not a chatbot.

Poynter →

Verification & fact-checking

AI claim detection and matching

Fan-out + human verify

Full Fact (UK) · 2016–present

A classifier spots check-worthy claims across news, TV subtitles, and social media at firehose scale, and a separate model flags when an already-debunked claim resurfaces, even when reworded. The AI surfaces candidates; human fact-checkers write the verdict. Used by 45+ organizations across 30 countries.

The shape: detection parallelizes across hundreds of thousands of sentences, but nothing publishes until a skeptical human confirms or refutes each flag. The mandatory human challenge is the guardrail.

Full Fact →

Squash and Tech & Check

Pipeline + verify

Duke Reporters' Lab · 2016–present

Tech & Check runs a claim-spotting model over transcripts and emails journalists a daily list of check-worthy claims. Squash transcribes live political speeches and instantly matches spoken claims against a database of published human fact-checks, displaying matches on screen.

The shape: detection is cleanly separated from verification — a model proposes a claim, then a distinct layer must ground it in an independent corpus of human-vetted checks. It refuses to surface a verdict it cannot anchor.

Duke Reporters' Lab →

InVID / WeVerify Verification Plugin

Adversarial-verify

InVID & WeVerify consortia, maintained by AFP Medialab · 2017–present

A browser plugin used by 50,000+ journalists and fact-checkers weekly: keyframe extraction, reverse image search across multiple engines, metadata analysis, forensic filters, and synthetic-media detection.

The shape: built for refutation. A journalist starts with a claim (“this video shows event X today”) and the tools hunt for evidence that disproves it — earlier copies, mismatched metadata, manipulation artifacts. Each check is a fresh challenge to the original claim.

Verification Plugin (GitHub) →

Multi-agent debate fact-checking

Adversarial-verify

Academic research groups (arXiv) · 2025–2026

Research frameworks where two agents take opposing stances (one affirming a claim, one refuting it) over several rounds while a moderator renders a verdict. Some add per-agent evidence retrieval. Reported results suggest the debate setup can correct some single-model errors.

The shape: the most literal version of adversarial-verify. One agent argues a claim is true, a second is tasked specifically with refuting it, and a third adjudicates. Early evidence that a separate refuting agent reduces error, especially when it retrieves its own evidence. (Preprint; not yet peer-reviewed.)

arXiv preprint →

Research & academia

Systematic literature review

Fan-out + verify

Elicit · 2023–2026

An AI research assistant that runs the systematic-review pipeline end to end: takes a research question, screens large paper sets, and extracts structured data into a table, with a supporting quote and citation attached to every answer. The company reports high screening recall benchmarked against hundreds of Cochrane reviews.

The shape: one query expands into thousands of parallel per-paper screening calls; each judgment ships with a verbatim quote and citation so a human can audit it. The quote-per-claim is the structural verification gate. (Accuracy figures are the vendor's own.)

Elicit evaluation →

AI Co-Scientist

Fan-out + adversarial-verify

Google DeepMind · 2025–2026

A multi-agent system where specialized agents generate, debate, rank, and refine research hypotheses, spending most of the compute on verifying hypotheses rather than generating them. Validated on drug repurposing (30 candidates narrowed to 5 lab-tested) and published in Nature.

The shape: a generation agent proposes many hypotheses in parallel; a debate agent and a ranking tournament adversarially score and eliminate weak ones; an evolution loop refines the survivors. Verification is a dedicated agent role, not an afterthought.

Google DeepMind →

Archiving & link rot

Dead-link repair at scale

Loop-until-dry

InternetArchiveBot, Internet Archive + Wikimedia · 2016–present

A bot scans outbound links across Wikipedia, confirms a URL is actually dead before acting, then rewrites the citation to an archived snapshot. It has fixed roughly 6 million dead references; 9M+ Wikipedia URLs now point to archived copies.

The shape: probe each live URL; on confirmed failure, query archive providers for a usable snapshot; rewrite the citation only when a valid replacement is verified. The dead-link confirmation is a false-positive guard before the archive fallback.

How the bot works →

Link rot and content drift

Verify at scale

Harvard Library Innovation Lab, in Columbia Journalism Review · 2021

Analyzed 553,693 New York Times articles (1996–2019) containing over 2.2M outbound links, finding widespread link rot (dead pages) and content drift (pages that still load but silently changed from what was cited).

The shape: a large automated verify pass over millions of URLs that compared current content against the original, not just HTTP status. The lesson: real verification checks whether the content still supports the claim, not just whether the link loads.

Columbia Journalism Review →

What you could do with it

The same patterns map onto everyday newsroom and research jobs. These are drawn from tools and sites built at the Center for Cooperative Media and elsewhere — sorted by how much you need to know to try them.

Job	Pattern	Real anchor
Find what matters in a FOIA dump or document set	Pipeline	ICIJ Datashare; DocumentCloud
Research a topic, then prove or kill each finding	Sweep + adversarial-verify	Anthropic research system
Recover dead links and rebuild an archive	Loop-until-dry	InternetArchiveBot; the Jay Rosen archive
Audit a whole site for accessibility and broken metadata	Dimensions + verify	CCM tool sites
Publish civic information in 10+ languages	Fan-out + per-language check	ReRoute NJ; NJ News Commons translation API
Turn a backlog of meetings into action items	Pipeline	Transcript pipelines
Watch many feeds and surface only what needs a human	Loop-until-dry	NJ News Wire aggregation
Map who's connected to whom (funders, sources)	Sweep + dedupe	NJCIC grantees map; CCM stakeholder map
Keep a course or skill library from going stale	Fan-out + completeness critic	The Knight Center MOOC

Case study: the run that built this page

The real-world examples above weren't pulled from memory. They came from an actual workflow — the research-and-synthesis pattern — run while this page was being written. Here's exactly what happened, because the run is a cleaner teaching example than any diagram.

11 agents spawned
~13.5 minutes, start to finish
~801K tokens used
13 verified examples kept

The three phases

Research (5 agents, in parallel). One agent per domain — investigative documents, fact-checking, multi-agent practice, academic research, archiving. Each was told to use web search, confirm facts, and return findings in a fixed structure with a real source URL for every example.
Verify (5 skeptic agents, one per domain). As soon as a domain's research finished, a separate skeptic agent re-checked each example against fresh web searches and dropped or corrected anything it couldn't confirm. This ran as a pipeline, not a barrier — the fact-checking domain could be under verification while the archiving domain was still researching, so no agent sat idle.
Synthesize (1 agent). A final agent merged all the verified findings, removed duplicates, and produced the intro, the example set, the patterns, and the cautions — as structured data, not prose to be re-parsed.

What the verify pass caught

The skeptic agents weren't decoration. Forcing each example to survive an independent second search is why the figures on this page are sourced and hedged (“the vendor's own number,” “preprint, not peer-reviewed”) rather than confidently wrong. The run practiced the exact adversarial-verify pattern the page recommends.

What it looks like in code

You don't write this by hand to use it — you describe the job and the script is generated. But seeing the shape demystifies it. This is the core of the real script, trimmed:

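// agent(), pipeline(), DOMAINS, and the FINDINGS / PAGE schemas come from the
// workflow runtime and are not defined in this excerpt

// one research agent per domain, each verified as soon as it finishes
const results = await pipeline(
  DOMAINS,
  (d) => agent(d.researchPrompt, { schema: FINDINGS }),        // research
  (findings) => agent(
    "Skeptically re-check each example via web search; " +
    "drop what you can't confirm:\n" + JSON.stringify(findings),
    { schema: FINDINGS }
  )                                                            // verify
)

// merge everything, dedupe, write the structured brief
// (structured results are serialized so they don't arrive as "[object Object]")
const brief = await agent(`Synthesize: ${JSON.stringify(results)}`, { schema: PAGE })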

Three ideas do all the work: agent() spawns one worker; pipeline() pushes each item through research-then-verify without waiting for the slowest; schema forces clean, structured data back instead of free text. That's the assignment desk, in code.
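
The runtime's pipeline() helper is not shown on this page, so the following is only a guess at its behavior: a hypothetical re-implementation called `pipelineSketch`, with stand-in stages that sleep instead of calling a model. It illustrates the "pipeline, not a barrier" point. The fast domain finishes verification before the slow domain has finished research.

```javascript
// Hypothetical re-implementation: each item runs through every stage on its
// own track, so a fast item can be verified while a slow one is still in research.
function pipelineSketch(items, ...stages) {
  return Promise.all(
    items.map(async (item) => {
      let value = item;
      for (const stage of stages) value = await stage(value);
      return value;
    })
  );
}

const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const eventLog = [];

// Stand-in stages that only sleep and record when they finish.
const fakeResearch = async (domain) => {
  await wait(domain.researchMs);
  eventLog.push(`research:${domain.name}`);
  return domain;
};
const fakeVerify = async (domain) => {
  await wait(5);
  eventLog.push(`verify:${domain.name}`);
  return domain.name;
};

const pipelineRun = pipelineSketch(
  [{ name: "fact-checking", researchMs: 5 }, { name: "archiving", researchMs: 60 }],
  fakeResearch,
  fakeVerify
);
pipelineRun.then((names) => {
  console.log(names);    // results come back in input order
  console.log(eventLog); // verify:fact-checking appears before research:archiving
});
```

A barrier version would wait for every research call to finish before starting any verification, which leaves the fast agents idle. Running each item on its own track is the design choice that avoids that.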

Two more runs, on newsroom-shaped tasks

To show the pattern doing work an editor would recognize, two more workflows ran live while this page was being written — together as one job: 12 agents, under four minutes. Both deliberately turn the tool on its own work. One fact-checks claims about the examples on this very page; the other audits live pages on this very site.

Run 2 — a fact-check fan-out

Four claims, each researched and rated on a verdict scale, then handed to a separate skeptic agent told to refute the rating using its own fresh searches. All four ratings held up — and the pass correctly failed the two false claims rather than rubber-stamping them.

Claim	Verdict	Held up?
First Panama Papers stories published in 2016	True	Yes
AP automated earnings via Wordsmith around 2014	True	Yes
The Markup is a for-profit advertising company	False	Yes
Perma.cc is operated by the Internet Archive	False	Yes

The skeptic correctly established that The Markup is a nonprofit newsroom (acquired by CalMatters in 2024) and that Perma.cc is run by Harvard's Library Innovation Lab, with the Internet Archive only a preservation partner. The four answers aren't the point — the method is. A second, independent agent had to fail to break each one before it was trusted.

Run 3 — a live site audit

Four published pages on this site, each checked on six criteria — Open Graph tags, Twitter card, favicon, image alt text, heading order, and external-link rel="noopener". One agent per page, in parallel.

Page	Result
fact-check-workflow	6 / 6 pass
web-archiving	6 / 6 pass
source-verification	2 warnings — external links missing rel="noopener" (none open in new tabs)
data-journalism	1 flag — 3 external links missing rel="noopener" (low risk)

This is the honest part: the audit found something. Two pages carry external links without rel="noopener". The risk is low, since none open in a new tab, but it is a real inconsistency worth a follow-up. A parallel run of under four minutes surfaced it across the site without anyone reading four pages by hand. That's the case for the pattern in one example: it doesn't replace the editor, it hands the editor a shorter, sharper list.
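
To make one of the six criteria concrete, here is a rough sketch of a rel="noopener" check. It is not the audit the agents ran: `findLinksMissingNoopener` is a hypothetical helper, the sample HTML is made up, and regex matching is a shortcut that a production audit should replace with a real HTML parser.

```javascript
// Return the external <a> tags that lack rel="noopener".
// Regex matching is a rough sketch; a production audit should use an HTML parser.
function findLinksMissingNoopener(html, siteHost) {
  const anchorTags = html.match(/<a\b[^>]*>/gi) || [];
  return anchorTags.filter((tag) => {
    const href = (tag.match(/\bhref\s*=\s*["']([^"']*)["']/i) || [])[1] || "";
    const rel = (tag.match(/\brel\s*=\s*["']([^"']*)["']/i) || [])[1] || "";
    let target;
    try {
      target = new URL(href); // throws on relative links, which are internal
    } catch {
      return false;
    }
    const isExternal = /^https?:$/.test(target.protocol) && target.host !== siteHost;
    const relTokens = rel.toLowerCase().split(/\s+/);
    return isExternal && !relTokens.includes("noopener");
  });
}

// Made-up page fragment for illustration.
const samplePage = `
  <a href="/about">About</a>
  <a href="https://example.org/contact">Contact</a>
  <a href="https://archive.example.net/x" rel="noopener noreferrer">Archive</a>
  <a href="https://other.example.com/report">Report</a>
`;

const flaggedLinks = findLinksMissingNoopener(samplePage, "example.org");
console.log(flaggedLinks.length); // 1
console.log(flaggedLinks[0]);     // the other.example.com link
```

Running one such check per page, in parallel, is the fan-out part; deciding whether a flag matters is still the editor's call.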

Honest cautions

This pattern is as oversold as it is useful. Read these before you pitch it to a newsroom or a dean.

The AI step is triage, not the final word. In every credible example here, the machine narrows a haystack and a human (or a separate verification layer) decides what to trust and publish. None of these auto-publish unchecked AI output. Remove the verify stage and you have data science, not journalism.

Fan-out is expensive. Anthropic reports that its own multi-agent research system uses roughly 15× the tokens of a normal chat, and that token usage alone explained about 80% of the variance in performance on a web-browsing benchmark. It only pays off when the task genuinely parallelizes and quality matters more than cost. Start with the simplest approach; add agents only when the task needs it.

A checker can share the first agent's blind spots. An AI grading another AI can rubber-stamp the same mistake if both reason the same way. Real independence means a different model or a human, and a verifier that retrieves its own fresh evidence. A refutation pass with no new evidence is theater.

Verify content, not just that something technically worked. The NYT link-rot study found pages that load fine but silently changed from what was cited. A checker that only confirms a link returns 200-OK, or that an answer is well-formatted, will pass content that no longer supports the claim.
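
A minimal sketch of the difference, assuming the checker already has the page's status code and text in hand. `checkCitation`, its status labels, and the sample responses are made up for illustration, and exact-phrase matching is a crude proxy: a serious content-drift check would compare against an archived snapshot of the page as it was cited.

```javascript
// Collapse case and whitespace so line breaks do not cause false mismatches.
function normalizeText(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Classify a cited link by what the page says now, not only by HTTP status.
function checkCitation(response, citedPassage) {
  if (response.status !== 200) return "dead";
  const passage = normalizeText(citedPassage);
  if (passage.length === 0) return "unchecked"; // nothing to compare against
  return normalizeText(response.body).includes(passage) ? "supported" : "drifted";
}

// Made-up responses standing in for real fetches.
const citedPassage = "The council approved the budget";
const livePage = { status: 200, body: "The council\napproved the budget on Monday." };
const movedPage = { status: 200, body: "This page has moved." };

console.log(checkCitation({ status: 404, body: "" }, citedPassage)); // dead
console.log(checkCitation(livePage, citedPassage));                  // supported
console.log(checkCitation(movedPage, citedPassage));                 // drifted
```

A status-only checker would have passed the third page, which loads fine and no longer says what the story claims it says.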

Don't use it for simple, ordered tasks. If a single prompt or a short pipeline does the job, adding agents adds cost, latency, and more places to break. The pattern earns its complexity on high-volume parallel work, or where an independent check measurably reduces error.

Benchmark figures usually come from the vendor. Several headline accuracy numbers here originate as a company's own claims, and several research results are preprints that haven't been peer-reviewed. Treat impressive percentages as starting points to verify, not settled facts.

The infrastructure can be fragile and centralized. Recovery and archiving workflows lean heavily on the Internet Archive as a single point of failure. Build in redundancy — query multiple sources — rather than assuming one provider is always up.

The skills that compose into workflows

If a workflow is the assignment desk, these skills are the individual beats it dispatches. Each one is a focused instruction set; a workflow is what runs many of them across a corpus, with a verify pass on top.

Fact-check workflow Source verification Web archiving Digital archive Data journalism FOIA requests Interview transcription Editorial workflow Page monitoring

Machines narrow the haystack. People decide what's true.