Deep Research Protocol

Overview

The Deep Research Protocol (DRP) is the framework that orchestrates multi-step research producing structured, cited reports on open-ended knowledge gaps. Where a single-pass research query degrades into either a confident-sounding hallucination or a partial answer that doesn’t notice what’s missing, DRP decomposes the input query into 3-7 sub-queries, consults the vault first, fans out Level 1 subagents in parallel to fill residual gaps from external sources (web, browser-driven commercial AIs, API-driven commercial AIs), iterates on under-covered sub-queries using confidence-triggered re-retrieval, and synthesizes findings into a markdown report with source-class citations. It is the research substrate for the Terrain Mapping Framework (TMF) and any other Level 2+ caller that requires authoritative coverage of a topic area.

The framework runs eight layers:

Layer 1 (intake and clarification gate) routes vague user-direct queries through up to three focused clarification questions; detailed queries and TMF-dispatched queries skip clarification.
Layer 2 (research plan construction) decomposes the normalized query into 3-7 sub-queries with per-sub-query coverage criteria, source hints (vault first-ranked), and stopping criteria; vague user-direct queries get plan review before fan-out.
Layer 3 (vault reconnaissance) consults the vault first for every sub-query and produces the gap map of dimensions still needing external retrieval.
Layer 4 (external retrieval fan-out) spawns Level 1 subagents in parallel up to the cap, one per gap-map entry, with explicit instructions to apply the Search-o1 Reason-in-Documents condensation pattern, prefer authoritative sources over content farms, and tag every claim with source class.
Layer 5 (evidence integration and iteration decision) aggregates vault and subagent evidence, evaluates each sub-query against its coverage criterion, and dispatches one of four iteration verdicts (CONVERGED / CONVERGED_WITH_GAPS / ITERATE / BLOCKED).
Layer 6 (synthesis with citation grounding) produces the structured report with executive summary, per-sub-query sections with inline source-class tags, cross-query synthesis, named caveats, and bibliography grouped by source class.
Layer 7 (self-evaluation) scores the report against the nine evaluation criteria with calibration-conservative scoring.
Layer 8 (error correction and output formatting) applies final consistency checks, documents corrections, declares missing information and unresolved deficiencies, and persists the report to the vault if the persist flag is true.
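
The Layer 1 routing rule can be sketched as a small function. This is a minimal illustration, not the framework's implementation: the QuerySignals flags and the "REQUIRED" label are hypothetical, and how the flags are detected is left out. The DETAILED test (domain, concrete knowledge gap, at least one anchor concept) follows the usage notes in this document, and "SKIPPED" is the clarification_status value that appears in the worked example.

```python
from dataclasses import dataclass

MAX_CLARIFYING_QUESTIONS = 3  # "up to three focused clarification questions"


@dataclass
class QuerySignals:
    # Hypothetical flags; detecting them is outside this sketch.
    names_domain: bool
    states_concrete_gap: bool
    anchor_concepts: int


def clarification_status(caller_context: str, signals: QuerySignals) -> str:
    """Return "SKIPPED" or "REQUIRED" ("REQUIRED" is an assumed label)."""
    # TMF-dispatched queries skip clarification outright.
    if caller_context == "TMF":
        return "SKIPPED"
    # Other callers skip only when the query is detailed.
    detailed = (
        signals.names_domain
        and signals.states_concrete_gap
        and signals.anchor_concepts >= 1
    )
    return "SKIPPED" if detailed else "REQUIRED"
```

A vague query from any non-TMF caller would then receive at most MAX_CLARIFYING_QUESTIONS questions before planning starts.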

The framework’s load-bearing intellectual content is the vault-first discipline, the source-class citation requirement, the anti-confabulation defenses, and the bounded-iteration loop. The vault-first discipline counters the Vault Blindness Trap — skipping vault reconnaissance and over-fetching from external sources, which loses high-provenance vault evidence and inflates token cost unnecessarily. Vault is consulted first for every sub-query unless explicitly excluded; vault-sourced claims carry the highest provenance weight in synthesis. The source-class citation requirement counters the Confabulation Trap and the Citation Hallucination Trap — every substantive claim carries an inline tag ([VAULT], [WEB], [BROWSER_AI], [API_AI], or [INFERRED]) and every external URL appears verbatim in the integrated evidence map (no invented URLs; no paraphrased source paths).
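
As a rough illustration of the source-class citation requirement, the check below flags sentences that carry none of the five tags. It assumes, for simplicity, that every sentence is a substantive claim and that sentences end in ".", "!" or "?"; a real check would need a better notion of "claim". The function name is hypothetical.

```python
import re

SOURCE_TAGS = ("VAULT", "WEB", "BROWSER_AI", "API_AI", "INFERRED")
TAG_PATTERN = re.compile(r"\[(" + "|".join(SOURCE_TAGS) + r")\]")


def untagged_sentences(text: str) -> list[str]:
    """Return the sentences in text that carry no source-class tag."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if not TAG_PATTERN.search(s)]
```

For example, given "Benchmarks overstate reliability [WEB]. Labs disagree on definitions." the second sentence is returned as untagged.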

The bounded-iteration loop counters two opposing failure modes. The Endless Search Trap (the iteration loop spawns follow-up queries indefinitely, scouring for evidence on dimensions that aren’t well-covered by available sources) is countered by hard-capping iteration at depth_cap; uncovered dimensions move to the Caveats section rather than spawning unbounded follow-up. The Premature Convergence Trap (declaring convergence while one or more sub-queries remain UNCOVERED within iteration budget) is countered by requiring at least one iteration attempt on every UNCOVERED sub-query before declaring CONVERGED_WITH_GAPS, and reserving CONVERGED for the all-COMPLETE case.

A third failure mode is handled outside the loop. The Topical Sprawl Trap (stitching loosely related retrievals into sections that drift beyond the sub-queries declared in the plan) is countered by Layer 6’s discipline that every paragraph in Sub-Query Sections must address its own sub-query: content belonging to a different sub-query moves to that sub-query’s section; content belonging to no sub-query is deleted.
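
The two iteration rules (the depth_cap hard stop and the one-attempt minimum for UNCOVERED sub-queries) can be sketched as a verdict function. This is an illustrative sketch under stated assumptions: the worth_retrying flag is a hypothetical stand-in for whatever judgment decides whether another pass is likely to help, and BLOCKED is left out because its trigger is not described here.

```python
def iteration_verdict(
    statuses: dict[str, str],   # sub-query id -> COMPLETE / PARTIAL / UNCOVERED
    attempts: dict[str, int],   # sub-query id -> iteration attempts so far
    depth: int,
    depth_cap: int,
    worth_retrying: bool = False,  # hypothetical judgment call
) -> str:
    incomplete = [sq for sq, status in statuses.items() if status != "COMPLETE"]
    # CONVERGED is reserved for the all-COMPLETE case.
    if not incomplete:
        return "CONVERGED"
    # Endless Search defense: the cap is hard; leftovers go to Caveats.
    if depth >= depth_cap:
        return "CONVERGED_WITH_GAPS"
    # Premature Convergence defense: every UNCOVERED sub-query gets
    # at least one iteration attempt while budget remains.
    untried = any(
        statuses[sq] == "UNCOVERED" and attempts.get(sq, 0) == 0
        for sq in incomplete
    )
    if untried or worth_retrying:
        return "ITERATE"
    return "CONVERGED_WITH_GAPS"
```

Note the ordering: the cap is checked before the one-attempt rule, so an untried UNCOVERED sub-query at the cap still ends in CONVERGED_WITH_GAPS rather than extending the search.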

The framework distinguishes itself from a single-pass research query by virtue of the milestone structure — Approved Research Plan, Evidence Integrated and Iteration Resolved, Final Research Report — each a coherent intermediate deliverable downstream layers consume, each a checkpoint where adversarial review and drift detection fire. The token cost scales 4-15× a single-shot query (per Anthropic’s published figures); the framework is for cases where the cost is justified by the requirement for defensible coverage of a topic area, not for quick lookups.

The framework answers questions like:

I need a defensible literature review on a topic I’m not yet expert in; produce a report with citations I can verify.
What’s the current state of debate on X across academic, industry, and practitioner sources?
I’m preparing to make a decision and need the territory mapped first; give me the cited research substrate.
TMF needs the deep-dive on a knowledge gap surfaced during terrain mapping; produce the report TMF will integrate.

Systemic context

The Deep Research Protocol is the Level 2 research substrate in Ora’s framework catalog — invoked by the Terrain Mapping Framework (TMF) when terrain mapping surfaces a knowledge gap requiring authoritative coverage; by user-direct invocation when the user wants a defensible cited report rather than a single-pass answer; and by project-agent delegation when a project’s research-substrate need is registered. The framework’s three-source-class fan-out (web via web_fetch; browser AI via browser_evaluate; API AI via api_evaluate) lets it operate across the full retrieval surface available to Ora; the sources_allowed input filters retrieval to a subset when the caller has reason to constrain. The framework’s output is a markdown research report saved to the vault root as Reference — [Topic].md with inherited nexus when persist=true, or returned in-memory to the caller when persist=false. The framework draws on published research orchestration practice — Anthropic’s documentation of subagent fan-out failure modes (over-spawning, SEO farm capture, citation hallucination), the STORM literature on topical sprawl in generated reports, the Search-o1 work on Reason-in-Documents condensation, the MemSearch-o1 work on context dilution. The framework is currently at v1.0 and the registry entry notes “low confidence” pending application — the framework is built but the proven applications are still accumulating.

Ora’s output

Example question: What is the current state of debate on whether large language models exhibit something properly called “reasoning,” across the academic, industry, and practitioner literatures from 2023 to early 2026?

Ora’s response (running DRP at depth_cap=3, subagent_cap=5, persist=true):

The query is detailed (specifies domain, knowledge gap, anchor concepts, and time window); Layer 1 routes with clarification_status = SKIPPED. Layer 2 decomposes into 5 sub-queries:

Sub-Query 1. What is the academic literature’s framing of “reasoning” in LLMs — specifically, what definitional moves do major papers make, and where do they disagree?

Sub-Query 2. What does the empirical evidence say about LLM performance on reasoning benchmarks (math, logic, multi-step problems), including known critiques of benchmark validity?

Sub-Query 3. How do industry labs (Anthropic, OpenAI, Google DeepMind) frame their models’ reasoning capabilities in technical reports and product communications?

Sub-Query 4. What do practitioners (working ML engineers, researchers using LLMs in applied work) report about reasoning in deployment, including the gap between benchmark performance and real-world reliability?

Sub-Query 5. What are the active debates and unresolved questions, including the chain-of-thought-as-explanation question and the System-1-vs-System-2 framings?

Each sub-query carries a coverage criterion (e.g., “at least two independent academic sources describe the major definitional moves; at least one source surfacing disagreement”). Source hints: VAULT_CONTENT first-ranked across all sub-queries; CURRENT_WEB and SPECIALIZED_DOMAIN for academic; BROWSER_AI_CONSULTATION and API_AI_CONSULTATION for cross-validation. Plan review status: NOT_REQUIRED (detailed user-direct query).
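
The plan described above maps naturally onto a small data structure. The sketch below is an assumption about shape only: the class and field names are hypothetical, while the 3-7 sub-query bound, the vault-first ranking of source hints, and the NOT_REQUIRED review status come from this document.

```python
from dataclasses import dataclass, field


@dataclass
class SubQuery:
    id: str
    question: str
    coverage_criterion: str
    # VAULT_CONTENT is ranked first by default (vault-first discipline).
    source_hints: list[str] = field(default_factory=lambda: ["VAULT_CONTENT"])
    stopping_criterion: str = ""


@dataclass
class ResearchPlan:
    normalized_query: str
    sub_queries: list[SubQuery]
    plan_review_status: str = "NOT_REQUIRED"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the plan passes."""
        problems = []
        if not 3 <= len(self.sub_queries) <= 7:
            problems.append("plan must have 3-7 sub-queries")
        for sq in self.sub_queries:
            if not sq.source_hints or sq.source_hints[0] != "VAULT_CONTENT":
                problems.append(f"{sq.id}: VAULT_CONTENT must be ranked first")
            if not sq.coverage_criterion:
                problems.append(f"{sq.id}: missing coverage criterion")
        return problems
```

Returning a list of problems rather than raising on the first one lets a plan-review step show every issue at once.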

Layer 3 (vault reconnaissance). Vault search runs per sub-query. Sub-query 2 returns 7 chunks (the user’s vault has prior research on reasoning benchmarks and Apple’s “Illusion of Thinking” paper); sub-query 1 returns 3 chunks (some prior reading on definitional debates); sub-queries 3, 4, 5 return 0-1 chunks each (gaps for external fan-out). Gap map: SQ1 partial-gap on minority-tradition definitional moves; SQ3 full gap on industry labs’ framing; SQ4 full gap on practitioner reports; SQ5 partial gap on the chain-of-thought-as-explanation debate.
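
A deliberately crude sketch of how a gap map could be derived from vault results follows. It is an assumption, not the framework's rule: Layer 3 would judge coverage against each sub-query's criterion, whereas this toy version uses raw chunk counts with a hypothetical threshold, and the FULL_GAP and PARTIAL_GAP labels are invented for the sketch.

```python
def gap_map(chunk_counts: dict[str, int], sufficient: int = 5) -> dict[str, str]:
    """Map each under-covered sub-query to FULL_GAP or PARTIAL_GAP.

    `sufficient` is a hypothetical threshold: at or above it, the vault
    is treated as covering the sub-query and no gap entry is produced.
    """
    gaps = {}
    for sub_query, count in chunk_counts.items():
        if count == 0:
            gaps[sub_query] = "FULL_GAP"
        elif count < sufficient:
            gaps[sub_query] = "PARTIAL_GAP"
    return gaps
```

Fed counts consistent with the example (SQ1: 3, SQ2: 7, SQ3: 0, SQ4: 0, SQ5: 1), it yields gap entries for SQ1, SQ3, SQ4, and SQ5 and none for SQ2.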

Layer 4 (external retrieval fan-out). Four subagents spawned in parallel (within subagent_cap=5), one per gap-map entry (SQ1, SQ3, SQ4, SQ5). Each subagent receives the specific sub-query, the vault context for sub-queries with partial coverage, the gap dimensions to address, and instructions on source-quality preference (academic, official, first-party, established-media over content farms). Each subagent applies the Search-o1 Reason-in-Documents condensation pattern (extract and condense rather than paste full pages) and tags every claim with its source class.
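
The capped parallel fan-out can be sketched with asyncio and a semaphore. This is a minimal sketch under assumptions: run_subagent is a hypothetical placeholder for a real subagent call, and the dictionary it returns is an invented report shape.

```python
import asyncio


async def run_subagent(sub_query_id: str, gap: str) -> dict:
    # Hypothetical placeholder: a real subagent would retrieve, condense,
    # and tag claims here.
    await asyncio.sleep(0)
    return {"sub_query": sub_query_id, "gap": gap, "claims": []}


async def fan_out(gaps: dict[str, str], subagent_cap: int = 5, worker=run_subagent) -> list[dict]:
    """Run one subagent per gap-map entry, at most subagent_cap at a time."""
    limit = asyncio.Semaphore(subagent_cap)

    async def bounded(sub_query_id: str, gap: str) -> dict:
        async with limit:  # blocks while subagent_cap workers are active
            return await worker(sub_query_id, gap)

    # gather preserves input order, so reports line up with the gap map.
    return await asyncio.gather(*(bounded(sq, gap) for sq, gap in gaps.items()))
```

A caller would run it with asyncio.run(fan_out(gaps, subagent_cap=5)). The semaphore only matters when the gap map has more entries than the cap; otherwise every subagent starts immediately.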

Layer 5 (integration and iteration). Subagent reports return; evidence integrated per sub-query. Coverage assessment: SQ1 COMPLETE; SQ2 COMPLETE (vault was already strong); SQ3 COMPLETE; SQ4 PARTIAL (subagent surfaced the practitioner debate but coverage of non-Western practitioner perspectives is thin); SQ5 COMPLETE. Iteration decision: ITERATE on SQ4 with a follow-up sub-query targeting practitioner reports from non-Western contexts. Iteration depth 1/3. Second iteration runs; coverage improves to PARTIAL-improved (some non-Western perspectives surfaced via specialized sources). Iteration depth 2/3. Coverage criterion still PARTIAL on this dimension; iteration decision: CONVERGED_WITH_GAPS, with the residual gap recorded for the Caveats section.

Layer 6 (synthesis). Report drafted with executive summary (180 words answering the normalized query and surfacing the biggest caveats), five sub-query sections each with inline source-class tags and citations, cross-query synthesis identifying patterns spanning sub-queries (the disconnect between benchmark performance and practitioner-reported reliability is a pattern across SQ2 and SQ4; the definitional disagreement in SQ1 maps onto the framing disagreement in SQ3; the chain-of-thought-as-explanation debate in SQ5 is the active research front), named caveats (non-Western practitioner perspectives under-represented per Layer 5; some industry labs’ technical reports are not publicly available and were inferred from product communications, flagged with [INFERRED] tag), bibliography grouped by source class.

Layer 7 (self-evaluation). Each of the nine evaluation criteria scored conservatively. Coverage Completeness: 4 (every sub-query has multiple cited claims; cross-query synthesis covers most of them). Source Provenance Integrity: 4 (98% of claims tagged; 2% needed correction at Layer 8). Vault-First Adherence: 5 (consulted first for every sub-query; vault findings prominent in synthesis). Contradiction Handling: 4 (major contradictions named; one minor unsourced contradiction surfaced and corrected). Anti-Confabulation Compliance: 4 (no unsourced claims remain after Layer 8 correction; one inference needed an explicit reasoning statement added). Structural Fidelity: 5. Scope Discipline: 4 (one section had drift toward an adjacent topic, pruned at Layer 8). Epistemic Calibration: 4 (confidence levels stated; one over-confident claim softened). Actionability for Caller: 5 (executive summary directly answers normalized query). No UNRESOLVED DEFICIENCY.
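
To make the scoring arithmetic concrete, the sketch below tabulates the nine self-scores from this example and applies a flat one-point discount. The discount is an assumption drawn from the calibration note in the Citations section (a self-score of 4/5 likely corresponds to 3/5 externally); the framework does not state that it applies such an adjustment mechanically, and the helper names are hypothetical.

```python
SELF_SCORES = {
    "Coverage Completeness": 4,
    "Source Provenance Integrity": 4,
    "Vault-First Adherence": 5,
    "Contradiction Handling": 4,
    "Anti-Confabulation Compliance": 4,
    "Structural Fidelity": 5,
    "Scope Discipline": 4,
    "Epistemic Calibration": 4,
    "Actionability for Caller": 5,
}


def discounted(scores: dict[str, int], discount: int = 1) -> dict[str, int]:
    """Apply a flat calibration discount, floored at 1 (assumed rule)."""
    return {name: max(1, score - discount) for name, score in scores.items()}


def mean(scores: dict[str, int]) -> float:
    return sum(scores.values()) / len(scores)
```

The self-scores sum to 39 over nine criteria, and to 30 after the discount, so the external-equivalent reading sits a full point lower on average.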

Layer 8 (error correction and persistence). Corrections logged: two unsourced claims tagged or removed; one inference reasoning statement added; one drifting paragraph pruned. Missing information declaration: non-Western practitioner perspectives under-represented; some industry technical reports not publicly available. Recovery declaration: not applicable. File persisted to ~/Documents/vault/Reference — LLM Reasoning Debate 2023-2026.md with YAML frontmatter (type: resource, nexus: empty, tags: [LLM, reasoning, AI-research], dates auto-populated).

That is what DRP produces on a real research query. The report carries 47 citations across 5 sub-queries and 4 source classes; vault contributed substantively to 2 of 5 sub-queries; iteration ran twice on the under-covered dimension before declaring CONVERGED_WITH_GAPS; the residual gap is named in the Caveats section rather than papered over; the report is saved to the vault for future reuse and inheritance into projects that need it.

Commercial AI comparison

Comparison content auto-populates when the comparison-refresh framework runs against this question. Drafters do not author this section.

Brief comparison commentary

Auto-populates with the comparison content above.

How to use this framework

You can run the Deep Research Protocol pattern with any AI of your choice that supports tool-use for retrieval. The composition is a single sequential pass through the eight layers; the only loop is the bounded iteration within Layers 3-5.

The prompt:

[Paste the framework specification]

Run DRP on this research query.

Research query: [The knowledge gap or research goal in plain language.]

Caller context: USER_DIRECT [or TMF / PROJECT_AGENT if invoked from a parent framework]

Optional inputs: [nexus, sources_allowed, depth_cap, subagent_cap, persist]

The AI runs Layers 1-8 sequentially. Layer 1 may pause for clarification on vague user-direct queries. Layer 2’s plan goes to user review for vague user-direct queries; otherwise plan review is skipped. Layers 3-5 run the retrieval and integration loop with bounded iteration. Layer 6 produces the structured report. Layer 7 scores it against the nine criteria. Layer 8 finalizes and persists.
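
If you assemble the prompt programmatically, a helper along these lines keeps the template consistent. It is a hypothetical convenience sketch: the function name and the key=value rendering of optional inputs are assumptions, while the five optional input names and the prompt wording come from the template above.

```python
ALLOWED_OPTIONS = {"nexus", "sources_allowed", "depth_cap", "subagent_cap", "persist"}


def build_prompt(spec: str, query: str, caller_context: str = "USER_DIRECT", **optional) -> str:
    """Assemble the DRP prompt from the specification text and inputs."""
    unknown = set(optional) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown optional inputs: {sorted(unknown)}")
    parts = [
        spec,
        "Run DRP on this research query.",
        f"Research query: {query}",
        f"Caller context: {caller_context}",
    ]
    if optional:
        rendered = ", ".join(f"{key}={value}" for key, value in optional.items())
        parts.append(f"Optional inputs: {rendered}")
    return "\n\n".join(parts)
```

For instance, build_prompt(spec_text, "State of the LLM reasoning debate", depth_cap=3, persist=True) ends with the line "Optional inputs: depth_cap=3, persist=True".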

For best results:

Be detailed in the query if you want to skip clarification. A query that specifies the domain, the knowledge gap in concrete terms, and at least one anchor concept routes as DETAILED and skips Layer 1’s clarification step. Vague queries get the up-to-three-questions clarification pass, which is useful but adds latency.
Confirm the plan in the review gate (vague user-direct only). Layer 2’s plan review for vague user-direct queries is the user’s chance to redirect before retrieval costs are incurred. If the proposed sub-queries miss what you actually wanted, edit the plan; the framework will incorporate edits and re-present.
Don’t ask the framework to invent citations. If the report cites a URL you find suspicious, verify it yourself: every external URL should appear in the integrated evidence map, but checking that the source says what the report claims is the user’s job. Layer 8’s citation integrity check verifies internal consistency (every URL in the body appears in the bibliography); it does not verify that the URL points to what the report says it points to.
Take the Caveats section seriously. Uncovered dimensions, sparse-evidence zones, inferred claims, and source-corpus biases are all named explicitly in the Caveats section. The report is honest about what it doesn’t cover; the user should be honest with themselves about the same.
Use TMF as caller when terrain mapping is the real need. DRP serves TMF as research substrate. If your underlying need is “I’m new to a domain and want to know its structure,” route to TMF, which will dispatch DRP for the cited research and then integrate the report with the terrain-mapping output. Calling DRP directly for a domain-mapping need produces a research report rather than a navigable terrain map.
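
The internal-consistency half of the citation check described in these notes (every URL in the body appears in the bibliography) is simple enough to sketch. The function names and the URL regex are assumptions; as the notes say, a check like this says nothing about whether a URL resolves or supports the claim attached to it.

```python
import re

# Rough URL matcher: stops at whitespace and common closing delimiters.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def urls_in(text: str) -> set[str]:
    """Extract URLs, trimming trailing sentence punctuation."""
    return {url.rstrip(".,;:") for url in URL_PATTERN.findall(text)}


def missing_from_bibliography(body: str, bibliography: str) -> list[str]:
    """Return body URLs that do not appear in the bibliography."""
    return sorted(urls_in(body) - urls_in(bibliography))
```

An empty result means the report is internally consistent on URLs; checking that each source says what the report claims remains a manual step.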

The framework is deliberately tool-aware (subagent fan-out requires tool-use capability) but otherwise tool-agnostic. The vault-first discipline, the source-class citation requirement, the anti-confabulation defenses, and the bounded-iteration loop are conceptual disciplines that survive the lift to any environment with comparable retrieval primitives.

Other examples

DRP as TMF research substrate. A user invokes TMF on “post-Keynesian macroeconomics” — they want the territory mapped before deciding how deep to go. TMF Stage 2 produces a research prompt: “what are the major schools, central debates, key figures, and active research fronts in post-Keynesian macroeconomics, with attention to where this tradition diverges from mainstream Keynesian and neoclassical work?” DRP runs with caller_context=TMF; clarification skipped (TMF is expected to emit detailed prompts); plan decomposes into 5 sub-queries; vault has minimal coverage; subagent fan-out fills the gap; report produced and returned to TMF; TMF integrates the report with the terrain-mapping work and produces the final navigable terrain map. Demonstrates DRP as research substrate for a Level 2+ caller.
DRP with sources_allowed restricting to vault. A user invokes DRP with sources_allowed=[vault] to produce a synthesis report drawing only on their vault content (no external retrieval). Layer 3 runs vault reconnaissance for all sub-queries; Layer 4 is skipped (no external fan-out permitted); Layer 5 integrates only vault evidence; Layer 6 produces the report with [VAULT] tags throughout. The Caveats section flags that the report’s coverage reflects vault content only and external sources may carry material the report doesn’t surface. Useful for synthesis of internal research before any external comparison. Demonstrates the sources_allowed filter constraining DRP to a subset of its retrieval surface.
DRP invoked by a project agent with nexus inheritance. A project agent working on a long-running research project dispatches DRP for a specific knowledge gap surfaced during the project’s work. The call sets caller_context=PROJECT_AGENT, inherits nexus from the project, and sets persist=true. Layer 1 runs the DETAILED-vs-VAGUE check and, if the query is VAGUE, runs clarification (project agents can be vague when the project’s emergent need is itself underspecified). Layers 2-8 run as standard. The persisted report carries the project’s nexus, so it surfaces in vault searches scoped to the project automatically. Demonstrates DRP’s role in long-running project agents where research substrate is needed at intervals.

Citations

The Deep Research Protocol draws on published research-orchestration practice. The vault-first discipline operationalizes the provenance-tier hierarchy from the Ora vault YAML schema (engram > working > resource > raw); the source-class tagging convention draws on the Ora claim taxonomy that distinguishes [VAULT], [WEB], [BROWSER_AI], [API_AI], and [INFERRED] sources. The bounded-iteration loop draws on Anthropic’s published documentation of common subagent failure modes — over-spawning (“50 subagents for simple queries”), endless search (“scouring the web endlessly for nonexistent sources”), and SEO farm capture (ranking content farms above authoritative sources).

The Search-o1 Reason-in-Documents condensation pattern (Layer 4 instruction to extract and condense rather than paste full pages) draws on the Search-o1 work on agentic search with in-document reasoning. The Context Dilution Trap (Layer 6 instruction 5 to work from condensed evidence map rather than re-expanded subagent reports) draws on MemSearch-o1’s documentation of “memory dilution” where evidence accumulates faster than it is condensed. The Topical Sprawl Trap (Layer 6 instruction 4) draws on the STORM literature’s documentation of report-generation drift where loosely related retrievals are stitched into plausible-sounding paragraphs that don’t address the declared sub-queries.

The Premature Accommodation Trap (Layer 1 instruction 5 to handle user impatience with explicit assumption declaration rather than silent skip) is an Ora-internal discipline against the failure mode where the framework abandons clarification when the user signals impatience, producing research on a poorly-specified query. The calibration-conservative scoring discipline at Layer 7 draws on the LLM self-evaluation calibration literature, which finds that self-evaluation scores are systematically inflated (LLMs overconfident in 84.3% of scenarios per published research), so that a self-score of 4/5 likely corresponds to 3/5 by external evaluation standards.

The framework is single-author and originated 2026-04-23 at v1.0; the registry entry notes low confidence pending proven applications, which the build-out is currently accumulating. The 9-criterion evaluation rubric and the 11 named failure modes (Confabulation, Citation Hallucination, Topical Sprawl, SEO Farm, Vault Blindness, Endless Search, Premature Convergence, Over-Spawning, Context Dilution, Source-Bias Transfer, Premature Accommodation) operationalize the Process Formalization Framework’s evaluation-and-failure-mode discipline for a research-orchestration framework specifically.

Downloads

Framework specification (PDF) — link to ora-ai.org canonical artifact when published
Framework specification (plain text) — link to ora-ai.org canonical artifact when published
Full white paper (PDF) — link when published