Narrative Regime Engine

Turn central bank language, earnings commentary, and market narratives into structured regime signals

What This Is and Is Not

This is not a chatbot making discretionary market calls. This is a process for converting messy language into stable, versioned labels that can feed a systematic portfolio. The label engine is useful only if the downstream mapping, turnover controls, and validation framework are sound.

Core Idea

Markets do not only react to data. They react to how decision-makers frame the data. GPT-style models are useful because they can normalize changing wording into stable labels such as hawkish disinflation, growth scare, margin pressure, inventory rebuild, or policy easing impulse.

Why Narrative Matters

Classical macro models wait for hard data to print. Markets often move earlier because participants update their beliefs from language. When the Fed shifts from "data dependent" to "inflation risks remain elevated," or when large-cap management teams stop saying "temporary normalization" and start saying "customer caution," the price reaction often begins before the lagging data series confirm it.

The edge is not predicting GDP with a language model. The edge is measuring qualitative delta faster and more consistently than human note-taking.

  • FOMC statements: minor wording changes can reshape rate path expectations
  • Press conferences: Q&A often contains the real policy drift
  • Earnings calls: management tone reveals demand, pricing power, capex, and inventory before the numbers are revised
  • Sector commentary: semis, transports, banks, industrials, and retailers often turn before the broad indices

Institutional Lens

A top-tier macro desk does not ask, "Was the statement dovish?" It asks, "What changed relative to the prior event, how large was the change, what assets are most exposed, and what does the market still misunderstand?" Your narrative engine should do the same.

Document Universe

Keep the input set tight. Most failures come from noisy ingest rather than weak modeling.

Macro Inputs

  • FOMC statements, minutes, and press conferences
  • CPI, PPI, NFP, ISM, retail sales, and major release summaries
  • Fed speaker transcripts for key governors and regional presidents
  • Treasury refunding statements and major policy speeches

Corporate Inputs

  • Earnings call transcripts for sector bellwethers
  • 8-K guidance releases
  • Investor day transcripts
  • Industry conference commentary

Why You Should Not Over-Ingest

Adding every news headline usually degrades the signal. Institutional systems separate core narrative documents from high-noise commentary streams. Start with canonical documents where wording changes matter most and where the event calendar is stable enough to backtest.

Schema and Labels

The model should not output essays. It should output a fixed schema that you can replay, compare, and aggregate.

{
  "macro_regime": "growth_scare | reflation | soft_landing | stagflation_risk",
  "policy_bias": "hawkish | neutral | dovish",
  "earnings_breadth": -2 to 2,
  "margin_pressure": -2 to 2,
  "consumer_health": -2 to 2,
  "capex_intensity": -2 to 2,
  "confidence": 0 to 1,
  "evidence_spans": ["quoted phrase 1", "quoted phrase 2"]
}
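
To make the schema enforceable in code, here is a minimal sketch as a Python dataclass; the field names mirror the JSON above, and the validation rules simply restate the stated ranges.

from dataclasses import dataclass, field
from typing import List

MACRO_REGIMES = {"growth_scare", "reflation", "soft_landing", "stagflation_risk"}
POLICY_BIASES = {"hawkish", "neutral", "dovish"}

@dataclass
class NarrativeLabel:
    macro_regime: str
    policy_bias: str
    earnings_breadth: int        # -2 to 2
    margin_pressure: int         # -2 to 2
    consumer_health: int         # -2 to 2
    capex_intensity: int         # -2 to 2
    confidence: float            # 0 to 1
    evidence_spans: List[str] = field(default_factory=list)

    def validate(self) -> None:
        # Reject anything outside the fixed taxonomy or numeric ranges.
        assert self.macro_regime in MACRO_REGIMES
        assert self.policy_bias in POLICY_BIASES
        for score in (self.earnings_breadth, self.margin_pressure,
                      self.consumer_health, self.capex_intensity):
            assert -2 <= score <= 2
        assert 0.0 <= self.confidence <= 1.0
        assert self.evidence_spans, "auditable outputs need at least one evidence span"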

Good Labeling Practice

  • Use small, stable taxonomies instead of endlessly expanding labels
  • Separate direction from confidence
  • Require evidence spans so outputs are auditable
  • Keep fields numeric where possible to reduce comparison ambiguity

What to Measure

For macro, track policy bias, inflation confidence, labor market resilience, and growth concerns. For earnings, track pricing power, margin compression, inventory commentary, customer urgency, and capex appetite. The goal is to measure the language change in economically meaningful categories, not raw positivity.

Pipeline Design

  1. Ingest the latest document set for a given event window.
  2. Chunk by speaker, section, and timestamp.
  3. Run extraction prompts that force fixed output schema.
  4. Aggregate chunk-level outputs into document-level scores.
  5. Compare the new event score to the prior comparable event.
  6. Convert the delta into a regime vector for assets or baskets.
  7. Gate the signal with market confirmation and liquidity filters.
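
A minimal sketch of that loop in Python. The extraction and confirmation calls are passed in as functions because they depend on your model provider and market data; the scoring key (policy_bias_score) and the basket names are assumptions for illustration.

from typing import Callable, Dict, List

def run_event(
    documents: List[dict],                    # latest document set, each pre-chunked by speaker/section
    prior_score: float,                       # document-level score from the prior comparable event
    extract: Callable[[str], dict],           # fixed-schema extraction call (model + versioned prompt)
    confirm: Callable[[float], bool],         # market confirmation / liquidity gate
) -> Dict[str, float]:
    # Steps 1-3: gather chunks and run the extraction prompt on each one.
    chunks = [c for doc in documents for c in doc["chunks"]]
    labels = [extract(c) for c in chunks]

    # Step 4: aggregate chunk-level outputs into a document-level score.
    current_score = sum(l["policy_bias_score"] for l in labels) / max(len(labels), 1)

    # Step 5: compare to the prior comparable event.
    delta = current_score - prior_score

    # Steps 6-7: convert the delta into a regime vector, then gate it.
    vector = {"defensives_vs_cyclicals": delta, "long_duration_growth": -delta}
    if not confirm(delta):
        vector = {k: 0.0 for k in vector}
    return vector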

Why Chunking Matters

One of the common failures in long-document prompting is that the model overweights the opening paragraphs and underweights Q&A. For macro documents, the Q&A often contains the most actionable nuance. For earnings, prepared remarks may be polished while analyst questions reveal what management is trying not to say. Chunking by speaker and section, and weighting the pieces explicitly, avoids that failure.
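
As a sketch, here is one way to split a transcript into speaker-tagged chunks; the speaker-line regex and the "Question-and-Answer" marker are assumptions about the transcript format, not a standard.

import re

def chunk_transcript(text: str) -> list[dict]:
    # Assumes speaker turns start a line as "Name Surname:" and the Q&A session
    # is introduced by a line containing "Question-and-Answer".
    section = "prepared_remarks"
    chunks = []
    for block in re.split(r"\n(?=[A-Z][A-Za-z .'-]+:)", text):
        if "Question-and-Answer" in block:
            section = "qa"
        speaker = block.split(":", 1)[0].strip()
        chunks.append({"section": section, "speaker": speaker, "text": block.strip()})
    return chunks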

Chunk Aggregation Example

Document Score =
    0.40 * prepared_remarks_score
  + 0.40 * Q&A_score
  + 0.20 * title_and_release_score

Event Delta =
    current_document_score - previous_document_score
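
The same aggregation as a small Python function, a minimal sketch assuming each chunk score carries the section label from the chunking step:

SECTION_WEIGHTS = {"prepared_remarks": 0.40, "qa": 0.40, "title_and_release": 0.20}

def document_score(chunk_scores: list[dict]) -> float:
    # Average chunk scores within each section, then apply the fixed section weights.
    total = 0.0
    for section, weight in SECTION_WEIGHTS.items():
        scores = [c["score"] for c in chunk_scores if c["section"] == section]
        if scores:
            total += weight * sum(scores) / len(scores)
    return total

def event_delta(current_chunks: list[dict], previous_chunks: list[dict]) -> float:
    return document_score(current_chunks) - document_score(previous_chunks)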

Model Output Discipline

You want deterministic, low-temperature extraction for the base schema. Creative generation is a liability here. Keep prompts explicit, ask for missing-data flags, and reject outputs that do not conform to schema.
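
A minimal sketch of that rejection loop using the standard-library json module; call_model stands in for whatever low-temperature extraction call you use, and the retry budget and missing_data flag are illustrative conventions, not a fixed API.

import json
from typing import Callable, Optional

REQUIRED_FIELDS = {"macro_regime", "policy_bias", "confidence", "evidence_spans"}

def extract_strict(prompt: str, call_model: Callable[[str], str], max_tries: int = 3) -> Optional[dict]:
    for _ in range(max_tries):
        raw = call_model(prompt)
        try:
            out = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # non-JSON output is rejected outright
        if not REQUIRED_FIELDS <= out.keys():
            continue                      # missing fields are rejected, not patched
        if out.get("missing_data"):
            return None                   # explicit missing-data flag: no signal from this chunk
        return out
    return None                           # retry budget exhausted; log and skip the chunk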

From Label to Trade

The cleanest implementation is not single-name prediction. It is basket trading. Narrative signals are broad and probabilistic; baskets absorb idiosyncratic noise.

Simple Mapping Layer

If policy_bias = hawkish and growth_scare is rising:
    underweight long-duration growth
    overweight defensives, quality, short-duration assets

If soft_landing confidence rises and earnings_breadth improves:
    overweight cyclicals, semis, small caps, credit beta

If stagflation_risk rises:
    reduce index beta
    add commodity-sensitive expressions
    tighten gross and net limits
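
The same mapping layer as code, a minimal sketch; the basket names and weight sizes are placeholders for whatever expressions you actually trade.

def map_regime_to_baskets(labels: dict, delta: dict) -> dict:
    weights = {"long_duration_growth": 0.0, "defensives_quality_short_duration": 0.0,
               "cyclicals_semis_small_caps_credit": 0.0, "commodity_sensitive": 0.0,
               "index_beta": 1.0}

    if labels["policy_bias"] == "hawkish" and delta.get("growth_scare", 0) > 0:
        weights["long_duration_growth"] -= 1.0
        weights["defensives_quality_short_duration"] += 1.0

    if delta.get("soft_landing_confidence", 0) > 0 and delta.get("earnings_breadth", 0) > 0:
        weights["cyclicals_semis_small_caps_credit"] += 1.0

    if delta.get("stagflation_risk", 0) > 0:
        weights["index_beta"] *= 0.5                  # reduce index beta, tighten gross and net limits
        weights["commodity_sensitive"] += 0.5

    return weights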

Practical Trade Formats

  • Sector rotation: XLK vs XLU, XLI vs XLV, XLF vs TLT
  • Factor spreads: quality vs junk, defensives vs cyclicals, low vol vs high beta
  • Index overlays: scale exposure in SPY, QQQ, and IWM using the regime score as a gate
  • Macro hedges: if policy or inflation narrative shifts sharply, use rates or commodity hedges rather than forcing equity-only expression

A good rule is to require both a narrative shift and some market confirmation, such as relative strength, breadth expansion, or credit spread behavior. Text alone is usually too early. Price alone is often too late. The combination is where the process becomes usable.
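
A sketch of that confirmation gate, assuming a signed narrative delta and a simple relative-strength input; the threshold is a placeholder to calibrate in backtest.

def gated_signal(narrative_delta: float,
                 relative_strength: float,      # e.g. trailing return of the target basket vs the index
                 delta_threshold: float = 0.5) -> float:
    # Text alone is usually too early, price alone is often too late: require both to agree.
    narrative_on = abs(narrative_delta) >= delta_threshold
    price_confirms = relative_strength * narrative_delta > 0     # same sign counts as confirmation
    return narrative_delta if (narrative_on and price_confirms) else 0.0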

Validation Framework

This strategy should be tested event-by-event, not bar-by-bar in isolation.

What to Backtest

  • Event delta versus subsequent 1-day, 5-day, and 20-day sector/factor returns
  • Hit rate of major regime classifications
  • Turnover and implementation slippage
  • Prompt version stability across the same historical documents
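
A minimal event-study sketch with pandas, assuming a Series of event deltas indexed by date and a daily return series for the target sector or factor spread:

import pandas as pd

def event_study(deltas: pd.Series, daily_returns: pd.Series,
                horizons=(1, 5, 20)) -> pd.DataFrame:
    # For each event, compute forward spread returns over each horizon and whether
    # the sign of the delta matched the sign of the subsequent move (the hit).
    rows = []
    for date, delta in deltas.items():
        row = {"date": date, "delta": delta}
        future = daily_returns.loc[daily_returns.index > date]
        for h in horizons:
            fwd = (1 + future.iloc[:h]).prod() - 1 if len(future) >= h else float("nan")
            row[f"fwd_{h}d"] = fwd
            row[f"hit_{h}d"] = bool(delta * fwd > 0) if fwd == fwd else None
        rows.append(row)
    return pd.DataFrame(rows).set_index("date")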

Two Tests Matter Most

  • Replay stability: rerun old documents with the same model and prompt to confirm consistent output
  • Prompt sensitivity: change wording slightly and see whether the signal survives
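
Both tests reduce to rerunning the same historical documents and diffing stored schema outputs. A minimal sketch, assuming replay logs keyed by document id; the drift fields are illustrative.

def replay_comparison(baseline: dict, rerun: dict,
                      numeric_fields=("earnings_breadth", "margin_pressure")) -> dict:
    # baseline and rerun map document id -> schema output from two runs:
    # the same prompt for replay stability, a reworded prompt for prompt sensitivity.
    flips, drift, n = 0, 0.0, 0
    for doc_id, old in baseline.items():
        new = rerun.get(doc_id)
        if new is None:
            continue
        n += 1
        flips += old["policy_bias"] != new["policy_bias"]
        drift += sum(abs(old[f] - new[f]) for f in numeric_fields)
    return {"documents": n,
            "policy_bias_flip_rate": flips / max(n, 1),
            "mean_numeric_drift": drift / max(n * len(numeric_fields), 1)}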

If your regime score flips materially because you rephrased a prompt, you do not have an alpha model. You have an unstable interface demo.

Failure Modes

  • Hallucinated interpretation: require evidence spans and manual spot checks
  • Prompt drift: version every prompt and keep replay logs
  • Macro whipsaw: do not let one document override price, breadth, and volatility regime filters
  • Overreaction: use baskets and capped turnover instead of one-shot directional bets
  • Narrative crowding: the more obvious the story, the more you should prefer relative-value expressions over outright beta

Quant Reality

There is no free lunch in "market narrative." Once a story becomes obvious on financial television, the edge is gone. The useful part of the process is measuring the narrative shift early, consistently, and at scale.

Retail Implementation Plan

  1. Start with one document family, such as FOMC statements and press conferences.
  2. Define 6-10 stable labels only.
  3. Build replay logs and reject non-schema outputs.
  4. Map signal to two or three liquid sector spreads.
  5. Backtest event windows over 5-10 years.
  6. Add earnings bellwether transcripts only after macro labels stabilize.

Quant Take

The real edge is not the language model. The real edge is a disciplined loop: text → schema → delta → basket → risk budget. If that loop is versioned and replayable, you have something testable. If it is not, you only have a story.