In research, the world model is one shared body of evidence about a single question: competing hypotheses held side by side with their posteriors, every claim traced to its source, and disagreement between agents recorded and worked through instead of averaged away. A research agent that only retrieves drifts toward the answer it already favored, the top results echoing a single press release reworded five ways until the prompt mistakes that chorus for confirmation. Run several such agents and it gets worse: each builds its own private notebook and you stitch the conflict together by hand at the end. A belief state that lives outside the model fixes that, holding the disagreement as a property of the world the agents share rather than something each one reconciles in its own head.
The world
The environment is one question and the evolving body of evidence about it. Entities are claims, the sources behind them, and the sub-questions the question breaks into. Relations are supports, contradicts, and derives-from: a paper supports a claim, a primary document contradicts another, a sub-question derives from the one above it. Ground truth lives in the sources themselves, ranked by how close each one sits to the thing being studied. The world is not settled when the agents start; it is what they are converging on.
What streams in
The agents' observations are everything they pull in while exploring the question:
- search results and the snippets that rank for a query
- papers, preprints, and the primary documents they cite
- primary sources: filings, datasets, transcripts, raw records
- tool outputs: a calculation, an extraction, a scrape
- the findings of other agents working the same question
That last channel is the point. One agent's recorded finding is another agent's next observation. When an agent writes a result, it becomes evidence the rest of the world reads, so the model learns how the question responds to each line of inquiry and where the team is pulling apart.
The belief state
The state holds competing hypotheses at once, each with its own posterior, rather than collapsing early to one storyline. Contradictions are kept as first-class objects: when sources disagree, uncertainty widens and the conflict is surfaced, never averaged into a comfortable middle. Gaps are the open sub-questions still load-bearing on the answer.
1┌──────────────────────────────────────────────────────────────┐
2│ QUESTION: HOW LARGE IS THE AI-TOOLS TAM? │
3│ │
4│ H1 "TAM > $50B by 2027" 58% │ 4 sources │
5│ H2 "TAM in the $20-30B band" 46% │ 3 sources │
6│ └─ ⚠ contradicts H1 - kept open, not averaged │
7│ │
8│ ● "Top-down sizing assumes 80% seat conversion" 39% │
9│ └─ agent-A flags as the weakest load-bearing claim │
10│ │
11│ Gap: "No primary spend data, only vendor estimates" │
12│ Gap: "Seat-to-revenue ratio unverified outside one report" │
13│ │
14│ Two analysts disagree on H1 vs H2. The world holds both, │
15│ widens the band, and ranks the move that splits them. │
16└──────────────────────────────────────────────────────────────┘Many agents write to one fused world under a shared namespace. When analyst A's reading of a vendor report lifts H1 and analyst B's reading of a primary filing lifts H2, the engine does not pick a winner or blend the two into a single dull number. It keeps both hypotheses live, widens the uncertainty to reflect the real disagreement, and marks the contradiction for resolution by evidence. A source restated five times across five secondary outlets still counts as one source, so a loud consensus cannot masquerade as independent confirmation.
What we're after
The intent is to converge on a defensible answer to the question. Success has a concrete shape: the key contradictions are resolved or explicitly scoped, the load-bearing sub-questions are closed, and every surviving claim is tied to the evidence that earns it. The limiting factor is rarely raw search volume; it is the few sub-questions whose answers actually move the conclusion. Moves are ranked against that intent, not against raw uncertainty. A wide-open sub-question that does not change the answer ranks below a narrow one that decides between two live hypotheses.
| Goal | Converge on a defensible answer to the question. |
| Success | Key contradictions resolved or scoped, the load-bearing sub-questions closed, every surviving claim tied to its evidence. |
| Limiting factor | The few sub-questions whose answers actually move the conclusion. |
The policy
A research effort runs under rules that outrank any single attractive result. You encode them as the structure the agents work within:
| Policy bucket | In a research effort |
|---|---|
| Invariants | Every claim cites the evidence behind it. No bare assertions enter the record. |
| Source hierarchy | Primary source over peer-reviewed; peer-reviewed over preprint; preprint over secondary; secondary over rumor. |
| Conventions | How a finding gets recorded, when a hypothesis goes "contested," how sub-questions are split. |
| Avoid | Confirmation bias toward a favored answer. Counting one restated source as many. |
The actions
Group what an agent may do by what it costs to be wrong:
| Action | Safety class | Effect |
|---|---|---|
search, read, synthesize | info | Gathers and digests evidence. Changes nothing. |
record-finding | mutates | Writes a result into the shared world. |
commission-experiment, escalate-to-human | needs-approval | Spends real effort or a person's time; gated. |
Today, policy and actions are a modeling lens you encode, not a contract the engine runs for you. The source hierarchy, the cite-or-it-does-not-count invariant, and the safety classes live in how your agents observe and act. Read the tables as the contract your agents uphold. A declarative config surface to register policy directly is on the roadmap. What the engine guarantees is the rest of the loop: it fuses the observations, holds the disagreement, surfaces the trail, and ranks what to chase next.
Plan & act
The next move is the sub-question whose answer most reduces uncertainty toward the goal, already ranked and in hand. Each agent orients on the fused world, takes the top move, runs it, and folds the result back in, where it becomes the next observation for every other agent on the question:
1import Beliefs from 'beliefs'
2
3const beliefs = new Beliefs({
4 apiKey: process.env.BELIEFS_KEY,
5 agent: 'researcher',
6 namespace: 'topic:tam',
7 writeScope: 'space',
8})
9
10// 1. Orient: read the fused world; the ranked next moves ride back on it
11const context = await beliefs.before('Where is the AI-tools TAM estimate weakest?')
12const [move] = context.moves // top-ranked, already in hand; no extra call
13// → { action: 'gather_evidence', subType: 'verify-claim', valueOfInformation: 0.82, ... }
14
15// 2. Act: your agent dispatches the tool the move points to
16const finding = await runTool(move.subType) // 'verify-claim' → your search+read tool
17
18// 3. Fold the result back in; it becomes the next observation
19await beliefs.after(finding.text, { source: 'BLS primary spend table' })The payoff is a verdict, not a pile of quotes. Analyst A and analyst B both write to topic:tam, and where they disagree on the TAM band the engine holds H1 and H2 open with a widened uncertainty instead of letting the last writer overwrite the first. When analyst B's primary spend table outranks the vendor estimates behind H1, fusion moves the posterior, the contradiction resolves on the evidence rather than on who wrote last, and the next ranked move is the sub-question that still separates the two views. A recall layer would hand every agent the same bag of snippets and leave the debate unsettled in each one's prompt. Here the debate is a property of the shared world, and every step of its resolution traces back to the exact source that moved it.
Emerging: plan several moves ahead
As a question accumulates resolved sub-questions, the agents can project which line of inquiry sharpens the answer most before spending the effort: beliefs.forecast.predict(['verify-claim', 'pull-primary-data', 'cross-check-vendor']). It returns low confidence until enough findings have actually resolved, so a forecast is never mistaken for a settled result. See Moves for how ranking and forecasting work.