
Understanding
Gotham RL

A technical deep-dive for backend developers. No ML background required. We'll walk through every piece of this algorithmic trading system - from raw market data to a reinforcement learning agent that learns to trade futures.

~30 min read
9 Chapters
Code Snippets Included

What Are We Building?

Imagine a video game where the player is an artificial intelligence, the game world is a financial market, and the score is measured in profit and loss. That is Gotham RL in one sentence: an offline trading simulator where an AI agent learns, through millions of simulated trades, how to enter and exit positions on futures contracts.

The word "offline" is key. We are not plugging this into a live brokerage and letting it loose on real money. Instead, we replay historical market data, let the agent make decisions, simulate fills with realistic slippage, and measure whether the agent's strategy improves over time. Think of it like a flight simulator for a trading bot - all the physics of real markets, none of the financial risk during training.

The Three Pillars

Every part of the system maps to one of three responsibilities:

Data

Ingest, validate, and store historical market data. Raw 1-minute CSVs become clean 5-minute OHLCV bars in TimescaleDB.

Intelligence

Extract features from raw data - trend direction, price gaps, session context - and ask Claude for a qualitative market assessment.

Decisions

A reinforcement learning agent observes 36 numerical features and picks one of 9 possible actions: enter long, enter short, or skip - with configurable position size. Stop-loss and take-profit are derived from market structure: IFVG zones for stops, liquidity pools for targets.

A 30-Second Primer on Futures

Backend Dev Note

A futures contract is a legally binding agreement to buy or sell an asset at a specific price on a specific future date. You don't actually own the asset - you're trading the contract. Think of it like a ticket that says "I will buy 500 units of the Nikkei 225 index at price X on date Y." Futures trade nearly 24 hours, 5 days a week. They expire quarterly (March, June, September, December), so you periodically "roll" to the next contract.

We trade two specific futures:

  • NIY (Nikkei 225): Japan's flagship stock index. Tick size: 5 points; with a point value of 500 JPY, one tick is worth 2,500 JPY. Traded on CME. Our trading window is the Tokyo session: 00:00–06:00 UTC, with a prime window at 00:30–01:30 UTC.
  • NQ (Nasdaq 100): America's tech-heavy stock index. Tick size: 0.25 points; with a point value of $20, one tick is worth $5. Also on CME. Our trading window is the US session: 13:30–20:00 UTC, with a prime window at 15:00–16:00 UTC.

The High-Level Flow

Raw CSVs ──> 5m Bars ──> TimescaleDB ──> Feature Pipeline
                                               │
                                    ┌──────────┴──────────┐
                                    ▼                     ▼
                             IFVG, Trend,            Claude LLM
                             Session, RTR            Assessment
                                    │                     │
                                    └──────────┬──────────┘
                                               ▼
                                   36-Number Observation
                                               │
                                               ▼
                                  RL Agent (PPO) ──> Action
                                               │
                                               ▼
                               Position Simulation ──> Reward
                                               │
                                               ▼
                             Policy Update (repeat 1M times)
Key Insight

The entire system is designed to answer one question per 5-minute bar: "Should I enter a trade right now, and if so, with what parameters?" Every module exists to give the agent better information for that decision.

The Foundation

The common/ package is imported by everything else. It provides configuration, logging, and domain constants - the bedrock that every other module stands on.

Configuration: Layered YAML + Pydantic

If you've worked with Spring Boot's property resolution or Rails' environment configs, the pattern here is familiar. We layer multiple configuration sources, where higher-priority sources override lower ones:

  1. Init kwargs (highest priority) - passed directly in code, mainly used in tests.
  2. Environment variables: prefixed with GOTHAM_, nested with __. For example, GOTHAM_DATABASE__HOST=10.0.0.5 sets database.host.
  3. Environment-specific YAML: config/{GOTHAM_ENV}.yaml. Set GOTHAM_ENV=prod to load config/prod.yaml.
  4. Base YAML (lowest priority) - config/default.yaml, always loaded.
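The override order above behaves like a chain of nested dict merges. A minimal stdlib sketch (illustrative only, not the actual Pydantic Settings machinery; the example values are hypothetical):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge `override` into `base`, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Lowest to highest priority, mirroring the list above.
default_yaml = {"database": {"host": "localhost", "port": 5432}}  # config/default.yaml
env_yaml     = {"database": {"host": "db.prod.internal"}}         # config/prod.yaml
env_vars     = {"database": {"host": "10.0.0.5"}}                 # GOTHAM_DATABASE__HOST
init_kwargs  = {}                                                 # none in normal runs

settings = deep_merge(deep_merge(deep_merge(default_yaml, env_yaml), env_vars), init_kwargs)
# The env var wins for database.host; database.port falls through from default.yaml.
```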

All of this is implemented with Pydantic Settings, which gives us runtime type validation for free. If someone sets GOTHAM_DATABASE__PORT=banana, the app fails immediately at startup with a clear validation error, not deep in a database connection two hours later.

src/gotham/common/config.py
@lru_cache(maxsize=1)
def get_settings(**overrides: Any) -> GothamSettings:
    """Singleton access to settings. Cached after first call."""
    return GothamSettings(**overrides)

The @lru_cache means the first call creates the settings object, and every subsequent call returns the same instance. Tests clear this cache in conftest.py so each test gets a fresh config.

Domain Constants: The Language of the System

Two frozen dataclasses define the financial instruments and their trading sessions:

src/gotham/common/constants.py
@dataclass(frozen=True)
class InstrumentSpec:
    symbol: str        # "NIY" or "NQ"
    exchange: str      # "CME"
    currency: str      # "JPY" or "USD"
    tick_size: float   # 5.0 or 0.25
    point_value: float # 500.0 or 20.0
    session: SessionName

NIKKEI = InstrumentSpec("NIY", "CME", "JPY", 5.0, 500.0, SessionName.TOKYO)
NASDAQ = InstrumentSpec("NQ",  "CME", "USD", 0.25, 20.0,  SessionName.US)
Backend Dev Note

Tick size is the minimum price movement. For the Nikkei, price moves in steps of 5 (e.g., 38,000 → 38,005 → 38,010). For Nasdaq, it moves in steps of 0.25 (e.g., 18,500.00 → 18,500.25). Point value is how much money each point of price movement is worth per contract. A 5-point move in Nikkei = 5 × 500 = 2,500 JPY per contract.
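The note above is just arithmetic. A hypothetical helper (not from the codebase) makes the tick/point distinction concrete:

```python
def move_value(price_move_points: float, point_value: float) -> float:
    """Monetary value of a price move, per contract."""
    return price_move_points * point_value

# Nikkei (NIY): one tick is 5 points at 500 JPY/point.
assert move_value(5.0, 500.0) == 2500.0   # 2,500 JPY per tick
# Nasdaq (NQ): one tick is 0.25 points at $20/point.
assert move_value(0.25, 20.0) == 5.0      # $5 per tick
# A 40-point NQ move with 2 contracts:
pnl = 2 * move_value(40.0, 20.0)          # $1,600
```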

Trading Sessions and Time Zones

Everything in the system uses UTC internally. The session windows define when each market is "active":

Session       UTC Window      Prime Window    Prime Duration
Tokyo (NIY)   00:00 – 06:00   00:30 – 01:30   60 min
US (NQ)       13:30 – 20:00   15:00 – 16:00   60 min

The prime trading window is where the agent actually makes decisions. It's a narrow 60-minute slot within each session that we've identified as having the best liquidity and trend behavior. Outside this window, the agent is forced to skip. This is an intentional constraint - most professional day traders focus on a specific window rather than trading all day.

Structured Logging

The logging setup uses structlog with two rendering modes: JSON for production (machine-parseable, feeds into log aggregation) and colored console output for development. Both share the same pipeline of processors - timestamp injection, log level annotation, exception formatting - only the final renderer differs. Log files rotate at 50 MB with 5 backups via RotatingFileHandler.

Timeframes

The Timeframe enum defines four granularities: M5 (5 minutes), M15 (15 minutes), H4 (4 hours), and D1 (1 day). Raw data enters as 5-minute bars. We aggregate up to higher timeframes for trend analysis - the idea being that a trend visible on a daily chart is more significant than one visible only on a 5-minute chart. The constant TIMEFRAME_MINUTES maps each to its duration: 5, 15, 240, 1440.

Getting Data In

The data/ package handles everything from raw CSV files to validated, stored candles in TimescaleDB. This is a classic ETL pipeline with financial-data-specific validation.

OHLCV Candles: The Universal Language of Price

Backend Dev Note

A candle (or "bar") summarizes price action over a fixed time period. Each candle has five values: Open (first price), High (highest price), Low (lowest price), Close (last price), and Volume (number of contracts traded). A 5-minute candle at 10:00 tells you: "Between 10:00 and 10:05, the price opened at X, went as high as Y, as low as Z, and closed at W, with V contracts changing hands."

CandleSource: A Pluggable Protocol

Different data sources implement the same Protocol:

src/gotham/data/sources.py
@runtime_checkable
class CandleSource(Protocol):
    def fetch(self, instrument: str, start: date, end: date) -> pl.DataFrame:
        """Return 5m OHLCV DataFrame with columns:
        timestamp, instrument, open, high, low, close, volume, contract_month"""
        ...

We currently have two implementations: HistDataSource (reads 1-minute CSVs from the HistData website and aggregates to 5-minute bars) and CSVSource (reads pre-formatted 5-minute CSVs). The Protocol pattern means adding a new source - say, a direct Interactive Brokers feed - requires zero changes to the rest of the pipeline.

1-Minute to 5-Minute Aggregation

The HistDataSource reads raw 1-minute data and aggregates it using Polars' group_by_dynamic:

# 1m → 5m: group_by_dynamic("timestamp", every="5m", label="left")
# Aggregation rules:
#   open  = first()
#   high  = max()
#   low   = min()
#   close = last()
#   volume = sum()  (or 0 if no data)
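The same aggregation rules can be sketched in plain Python (stdlib only; the real pipeline uses Polars' group_by_dynamic, this just spells out the semantics):

```python
from datetime import datetime, timedelta

def aggregate_5m(bars_1m: list[dict]) -> list[dict]:
    """Group 1-minute OHLCV bars into left-labelled 5-minute bars."""
    buckets: dict[datetime, list[dict]] = {}
    for bar in bars_1m:
        ts = bar["timestamp"]
        # Snap each timestamp down to its 5-minute bucket (label="left").
        bucket = ts - timedelta(minutes=ts.minute % 5, seconds=ts.second)
        buckets.setdefault(bucket, []).append(bar)
    out = []
    for start in sorted(buckets):
        group = sorted(buckets[start], key=lambda b: b["timestamp"])
        out.append({
            "timestamp": start,
            "open":   group[0]["open"],    # first
            "high":   max(b["high"] for b in group),
            "low":    min(b["low"] for b in group),
            "close":  group[-1]["close"],  # last
            "volume": sum(b["volume"] for b in group),
        })
    return out

# Five 1-minute bars starting at 00:00 collapse into one 5-minute bar.
bars = [
    {"timestamp": datetime(2024, 1, 15, 0, m), "open": 100 + m, "high": 102 + m,
     "low": 99 + m, "close": 101 + m, "volume": 10}
    for m in range(5)
]
agg = aggregate_5m(bars)
```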

Why 5-minute bars and not 1-minute or 15-minute? It's a balance. One-minute bars are noisy - too much random fluctuation for the agent to learn from. Fifteen-minute bars are too coarse - you miss important price movements. Five minutes is the sweet spot used by many institutional quant systems.

Data Validation

The CandleValidator catches two categories of problems:

OHLC Relationship Checks

The low must be ≤ everything (open, high, close). The high must be ≥ everything. No nulls, no NaN, no negative prices. Violations are severity ERROR: the row is rejected.
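A minimal sketch of the relationship check (a hypothetical helper mirroring the rules above, not the actual CandleValidator):

```python
import math

def ohlc_errors(o: float, h: float, l: float, c: float) -> list[str]:
    """Return ERROR-severity violations for one candle (empty list = valid)."""
    errors = []
    for name, v in {"open": o, "high": h, "low": l, "close": c}.items():
        if v is None or math.isnan(v):
            errors.append(f"{name} is null/NaN")
        elif v < 0:
            errors.append(f"{name} is negative")
    if not errors:
        if l > min(o, h, c):
            errors.append("low above open/high/close")
        if h < max(o, l, c):
            errors.append("high below open/low/close")
    return errors

assert ohlc_errors(100, 105, 99, 103) == []   # valid candle
assert ohlc_errors(100, 98, 99, 103) != []    # high below close: row rejected
```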

Spike Detection

If the close-to-close change exceeds 5%, it's flagged as a WARNING. This catches data errors like a misplaced decimal point (18500 vs 1850). The candle isn't rejected but is flagged for review.

There's also a zero-volume check during active sessions (a candle with no trades during market hours is suspicious) and a gap detector that identifies missing candles by comparing actual timestamps against expected 5-minute intervals.

The BackfillService

This is the orchestrator that ties it all together. It processes data in 30-day batches with per-batch commits for crash resilience:

  1. Query MAX(timestamp) from the database to find where we left off (resume point).
  2. For each 30-day batch: fetch from the CandleSource, validate, upsert into the database, commit.
  3. If the process crashes halfway, restart picks up from the last committed batch. Upserts are idempotent - re-inserting an existing candle just updates it.

Why TimescaleDB?

TimescaleDB is PostgreSQL with time-series superpowers. Our candle data is inherently time-series - ordered by timestamp, frequently queried by time range, and write-heavy during backfills. Hypertables automatically partition data by time, making range queries fast without manual shard management. Continuous aggregates precompute our 15m/4h/1d rollups from 5m data, so we never aggregate at query time.

If you've worked with PostgreSQL before, you'll feel right at home. The connection is configured via Pydantic settings (DatabaseConfig: host, port 5432, name/user "gotham") and exposed as both a synchronous connection string (.url, using the postgresql:// scheme) and an async one (.async_url, using postgresql+asyncpg://). The Docker Compose setup runs TimescaleDB on PostgreSQL 18 with health checks.

Gap Detection

After backfilling, the GapDetector verifies data completeness. It generates expected 5-minute timestamps for each instrument's session window, skips weekends, and compares against actual timestamps in the database. Missing candles are reported as GapInfo objects with the expected time and gap type ("missing_candle" or "unexpected_gap"). Sessions that cross midnight (like Tokyo) are handled by adding a day to the end time when end ≤ start. This is critical for data quality - a missing candle during the agent's trading window could cause incorrect feature calculations.
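The expected-vs-actual comparison can be sketched with the stdlib (a hypothetical simplification of the GapDetector that ignores weekends and midnight-crossing sessions):

```python
from datetime import datetime, timedelta

def find_gaps(session_start: datetime, session_end: datetime,
              actual: set[datetime]) -> list[datetime]:
    """Expected 5-minute timestamps that are missing from `actual`."""
    expected, ts = [], session_start
    while ts < session_end:
        expected.append(ts)
        ts += timedelta(minutes=5)
    return [t for t in expected if t not in actual]

start = datetime(2024, 1, 15, 13, 30)   # US session open
end   = datetime(2024, 1, 15, 14, 0)
# All six expected bars except 13:45.
have = {start + timedelta(minutes=5 * i) for i in range(6)} - {start + timedelta(minutes=15)}
gaps = find_gaps(start, end, have)      # the single missing candle at 13:45
```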

Contract Rolls

Backend Dev Note

Futures contracts expire quarterly. When the March contract approaches expiration, traders "roll" to the June contract. The problem: the June contract typically trades at a slightly different price than the expiring March one. If you naively stitch them together, you get artificial price jumps at every roll date. We track contract_month in each candle for this reason, though back-adjustment (ratio or difference method) is handled downstream during feature computation.

Reading the Market

The features/ package transforms raw price data into meaningful signals. Think of features as the agent's senses - the eyes, ears, and intuition that let it perceive what's happening in the market.

IFVG: Inverse Fair Value Gaps

This is the core trading concept in the system. Let's build it from first principles.

A Fair Value Gap (FVG) is a three-candle pattern where price moves so aggressively that it leaves a "gap" - a price range where very little trading occurred. Imagine three candles in a row where the middle candle is exceptionally long. If the high of candle 1 doesn't overlap with the low of candle 3, there's a gap between them.

     Bullish FVG (price rising)           Bearish FVG (price falling)

   c3 ──────────────┐                   c1 ──────────────┐
      candle 3 low ─┤                      candle 1 low ─┤
                    │ ░░ THE GAP ░░                      │ ░░ THE GAP ░░
     candle 1 high ─┤                     candle 3 high ─┤
   c1 ──────────────┘                   c3 ──────────────┘

   (candle 2, the long displacement candle, spans the whole range in both cases)

An Inverse Fair Value Gap is a regular FVG that gets "inverted" - price comes back and crosses through the gap zone from the other side. When a bullish FVG is inverted (price closes below it), that zone often becomes a resistance level. When a bearish FVG is inverted (price closes above it), it often becomes support. This is the trading edge: IFVGs mark zones where price is likely to react.

The code in features/ifvg.py detects FVGs by scanning 3-candle windows. The minimum gap size is 4 ticks (configurable). Each IFVG goes through a lifecycle:

Active

Just inverted. The zone is fresh and hasn't been retested. Fill percentage is low.

Tested

Price has touched the zone boundary at least once. The more tests without breaking, the stronger the zone.

Mitigated

Price has penetrated more than 50% into the zone. The zone is losing its power.

Expired

Either fully filled (100% penetration) or older than 100 bars. No longer relevant.

Each IFVG also gets a quality score based on the displacement candle (the big candle that created the gap). The scoring logic is:

# IFVG Quality Scoring (features/ifvg.py)
if gap_ticks >= 8 and body_ratio > 0.70:
    quality = "high"    # score = 3
elif gap_ticks >= 6 or body_ratio > 0.60:
    quality = "medium"  # score = 2
else:
    quality = "low"     # score = 1

The intuition: a big, decisive candle with a large gap signals strong institutional activity. A high-quality IFVG with a gap of 10+ ticks and a body that fills 80% of the candle range is much more likely to act as a reliable support/resistance zone than a small, indecisive one.

Fill percentage tracks how much price has penetrated the IFVG zone. For a bullish IFVG: fill = clip(penetration / gap_size, 0, 1) where penetration = high - zone_lower. When fill hits 100%, the zone is dead. Test count increments each time price touches the zone boundary without fully filling it.
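As a concrete sketch of the fill formula (a hypothetical helper for the bullish case, using the same clip-to-[0, 1] rule):

```python
def bullish_fill_pct(bar_high: float, zone_lower: float, zone_upper: float) -> float:
    """How far price has penetrated a bullish IFVG zone, clipped to [0, 1]."""
    gap_size = zone_upper - zone_lower
    penetration = bar_high - zone_lower
    return max(0.0, min(penetration / gap_size, 1.0))

# Zone spans 38,000–38,020 (a 20-point gap on NIY).
assert bullish_fill_pct(37990, 38000, 38020) == 0.0   # price hasn't reached the zone
assert bullish_fill_pct(38010, 38000, 38020) == 0.5   # halfway in: "mitigated" past 50%
assert bullish_fill_pct(38050, 38000, 38020) == 1.0   # fully filled: zone is dead
```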

The IFVG module outputs 5 features per bar: ifvg_count_active (number of live IFVG zones), ifvg_nearest_dist (distance in ticks to the nearest zone, 0 if inside one), ifvg_best_quality (0 = none, 1–3 scale), ifvg_avg_fill_pct (how "used up" the active zones are), and ifvg_direction_bias (bullish count minus bearish count, normalized to [-1, 1]).

Trend Analysis: Multi-Timeframe Momentum

Backend Dev Note

An EMA (Exponential Moving Average) is a weighted average of recent prices where newer prices count more. Unlike a simple average, it "forgets" old data gradually. A 20-period EMA on 5-minute bars mostly reflects the last ~100 minutes of price action. The slope of an EMA tells you if the trend is rising, flat, or falling. We compute: slope = (ema[i] - ema[i-5]) / ema[i-5]. If this exceeds 0.1%, it's "rising"; below -0.1%, "falling"; otherwise "flat".
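The EMA recurrence and the ±0.1% slope rule fit in a few lines of plain Python (illustrative only; the real code uses Polars expressions):

```python
def ema(prices: list[float], period: int) -> list[float]:
    """Standard exponential moving average, alpha = 2 / (period + 1)."""
    alpha = 2.0 / (period + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def classify_slope(ema_vals: list[float], i: int, lookback: int = 5) -> str:
    """Same rule as above: slope over `lookback` bars, thresholds at +/-0.1%."""
    slope = (ema_vals[i] - ema_vals[i - lookback]) / ema_vals[i - lookback]
    if slope > 0.001:
        return "rising"
    if slope < -0.001:
        return "falling"
    return "flat"

rising = ema([100.0 + i for i in range(30)], 20)
flat = ema([100.0] * 30, 20)
assert classify_slope(rising, 29) == "rising"
assert classify_slope(flat, 29) == "flat"
```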

We compute trend features at three timeframes - M15, H4, and D1 - because trends have a hierarchy. A stock can be trending up on the daily chart while pulling back on the 15-minute chart. The compute_trend_features function computes EMA-20, EMA-50, their slopes, ATR-14 (a volatility measure), and a structure classification per timeframe.

Structure classification uses swing points (local highs and lows detected with a 5-bar lookback). If the last 4 swing highs are rising and the last 4 swing lows are rising, it's an UPTREND. Both falling: DOWNTREND. Otherwise: RANGING.

The final composite trend score blends all three timeframes:

trend_score = clip(
    D1_score × 0.40 +    # Daily trend matters most
    H4_score × 0.35 +    # 4-hour trend is next
    M15_score × 0.25,    # 15-min trend is tactical
    -1.0, 1.0
)
# Where: uptrend = +1.0, ranging = 0.0, downtrend = -1.0

Session Features: Time-of-Day Context

Markets behave differently at different times. The first hour of a session is typically volatile (the "opening drive"), while the middle is often choppy. Session features capture:

  • minutes_since_open: how far into the session we are
  • in_trading_window: whether we're in the prime 60-minute slot
  • window_progress_pct: position within the window (0.0 = start, 1.0 = end)
  • overnight_range: high minus low during the 2 hours before session open. A wide overnight range often signals a directional day.
  • prior_session_high/low/close: yesterday's key levels, which act as magnets and barriers for today's price.

Room-to-Right: How Far Can Price Go?

Entering a trade is pointless if the price immediately runs into a wall of resistance. The Room-to-Right (RTR) module estimates how much space the price has to move before hitting obstacles.

It works by detecting liquidity pools: clusters of swing highs or swing lows that are close together in price. The algorithm greedily clusters swing points within 0.5% of each other. When 3 or more swing points cluster together, that's a liquidity pool. Pools are classified as "resistance" (mostly swing highs), "support" (mostly swing lows), or "mixed". Pools act as magnets: the more "touches" at a level, the more likely price will react there.

The RTR score starts at 100 and is decremented for each pool: penalty = 10.0 × touches / max(distance_factor, 0.1) where distance_factor = distance_to_pool / ATR. Nearby pools with many touches cause the biggest reductions. The rtr_score_long considers pools above the current price (obstacles to upside moves), while rtr_score_short considers pools below (obstacles to downside moves).

There's also an exhaustion measure: exhaustion_pct = clip(intraday_range / ADR_20, 0, 1) where ADR_20 is the average daily range over the past 20 trading days. If today's range already exceeds 70% of the 20-day average, the market may be running out of energy. When exhaustion exceeds 70%, an additional penalty is applied to both RTR scores: penalty = (exhaustion_pct - 0.70) × 100. The module also tracks obstacle_count_long and obstacle_count_short: the raw count of pools in each direction.
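The penalty arithmetic can be made concrete with a hypothetical helper that follows the formulas above (pool dicts and argument names are illustrative):

```python
def rtr_score(pools: list[dict], atr: float,
              intraday_range: float = 0.0, adr_20: float = 1.0) -> float:
    """Room-to-right: start at 100, subtract pool and exhaustion penalties."""
    score = 100.0
    for pool in pools:
        distance_factor = pool["distance"] / atr
        score -= 10.0 * pool["touches"] / max(distance_factor, 0.1)
    exhaustion_pct = max(0.0, min(intraday_range / adr_20, 1.0))
    if exhaustion_pct > 0.70:
        score -= (exhaustion_pct - 0.70) * 100
    return score

# A 4-touch pool one ATR away costs 40 points; the same pool ten ATRs away barely matters.
near  = rtr_score([{"touches": 4, "distance": 50.0}], atr=50.0)       # 100 - 40 = 60
far   = rtr_score([{"touches": 4, "distance": 500.0}], atr=50.0)      # 100 - 4  = 96
tired = rtr_score([], atr=50.0, intraday_range=90.0, adr_20=100.0)    # exhaustion penalty: ~80
```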

The Pre-Screen Gate

Not every 5-minute bar is worth evaluating for a trade. The pre-screen is a simple AND filter that all four conditions must pass:

pre_screen_passed = (
    ifvg_count_active >= 2         # At least 2 active IFVGs
    AND abs(trend_score) >= 0.3    # Clear enough directional bias
    AND max(rtr_long, rtr_short) >= 30.0  # Enough room to move
    AND in_trading_window == True   # Within our 60-min window
)
Key Insight

The pre-screen dramatically reduces the search space. Instead of asking Claude to assess thousands of candles per session, we only assess the ones where the numerical features already suggest a potential opportunity. This saves both money (API calls) and training time.

Displacement Detection

Within the trend module, there's a displacement detector that identifies aggressive candles - ones where institutions are likely driving price. A candle qualifies as a "displacement" if both conditions hold: the body fills >70% of the total range (it's decisive, not indecisive) and the range exceeds 1.5× ATR (it's large relative to recent volatility). Displacement candles are the building blocks of FVGs and trend structure changes.
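The two conditions compose into a tiny predicate (a hypothetical helper mirroring the thresholds above):

```python
def is_displacement(o: float, h: float, l: float, c: float, atr: float) -> bool:
    """Decisive candle: body fills > 70% of range AND range exceeds 1.5 x ATR."""
    candle_range = h - l
    if candle_range == 0:
        return False
    body_ratio = abs(c - o) / candle_range
    return body_ratio > 0.70 and candle_range > 1.5 * atr

assert is_displacement(o=100, h=130, l=99, c=128, atr=15) is True    # big and decisive
assert is_displacement(o=100, h=130, l=99, c=110, atr=15) is False   # big but indecisive
assert is_displacement(o=100, h=110, l=99, c=109, atr=15) is False   # decisive but small
```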

The Pipeline: Putting It All Together

The run_feature_pipeline function orchestrates all six modules in sequence:

  1. build_multi_timeframe(candles_5m): aggregates 5m bars to M15, H4, and D1 candles.
  2. compute_trend_features(candles_5m, htf_frames): computes EMAs, slopes, structure, ATR, displacement per timeframe, then joins to 5m via join_asof with backward strategy (forward-fills the most recent higher-timeframe value).
  3. compute_ifvg_features(result, tick_size): detects FVGs, tracks inversions, scores quality, outputs 5 features per bar. Also computes structural stop levels from IFVG zone boundaries for each direction.
  4. compute_session_features(result, instrument): adds session-aware columns: minutes since open, window position, overnight range, prior session levels.
  5. compute_rtr_features(result): detects liquidity pools, computes room-to-right scores and exhaustion metrics. Also computes structural target levels from nearest liquidity pools for each direction.
  6. compute_structural_rr(result): computes risk-reward ratios from IFVG-derived stop levels and liquidity-pool-derived target levels. Falls back to ATR-based distances when structural levels are unavailable. Outputs structural_rr_long and structural_rr_short per bar.
  7. apply_pre_screen(result): applies the AND filter, adds the pre_screen_passed boolean column.

The output is a Polars DataFrame with the original 5m OHLCV data plus all computed feature columns - roughly 45 additional columns (including structural stop/target levels and R:R ratios). This enriched DataFrame is saved as a Parquet file for downstream consumption by the LLM assessment generator and the RL training pipeline.

The AI Eye

The llm/ package uses Anthropic's Claude to add qualitative intelligence that's hard to capture with math alone. Think of it as a junior analyst who reviews each setup and gives a structured opinion.

Why Use an LLM at All?

Some market patterns are easy to quantify (a 50-bar moving average crossing above a 200-bar one). Others are harder: "this looks like a failed breakdown that's about to squeeze higher" or "this choppy price action suggests institutional accumulation." Experienced traders see these patterns intuitively. An LLM can encode some of that pattern recognition into features that the RL agent can learn from.

The LLM is not making trading decisions. It's providing 6 additional features to the observation vector. The RL agent decides what to do with them.

What Claude Sees

For each pre-screened bar, we build a compact prompt with these sections:

  1. Compact CSV: the last 50 candles as ts,O,H,L,C,V, timestamps as HH:MM. This is the raw price action.
  2. Trend Summary: structure and slope at each timeframe, plus the composite trend score.
  3. IFVG Context: count of active IFVGs, nearest distance, best quality, direction bias.
  4. Session Context: minutes since open, window progress, overnight range, prior session levels.
  5. Room to Right: long/short RTR scores, exhaustion percentage.

Forcing Structured Output

We don't want free-text responses. We use Claude's tool use API with tool_choice={"type": "tool", "name": "submit_assessment"} to force the output into a predefined JSON schema. The LLMAssessment schema has 12 structured fields:

Field                   Type       Range          What It Captures
setup_type              Enum       5 values       Bullish/bearish reversal, continuation, or no_setup
confidence              float      [0, 1]         How confident the LLM is in its assessment
ifvg_quality            Literal    high/med/low   LLM's independent quality assessment of the IFVG
trend_alignment         dict       5 values/tf    Bullish/bearish/neutral alignment per timeframe
room_to_right_estimate  float      [0, 100]       LLM's estimate of how much room price has
risk_reward_estimate    float      ≥ 0            Expected reward per unit of risk
narrative               str        free text      Brief market narrative
concerns                list[str]  0–5 items      Risk factors the LLM identifies
regime                  Enum       4 values       trending_day, choppy, event_driven, low_liquidity
narrative_sentiment     float      [-1, 1]        Sentiment polarity of the narrative
nearest_target          float      price          Most likely profit target level
nearest_invalidation    float      price          Price where the thesis breaks down
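The forced-tool pattern can be sketched without any network call. The schema below is an illustrative subset of the real LLMAssessment fields (the enum member names are assumptions), and the request dict mirrors the shape of an Anthropic Messages API call:

```python
# Illustrative subset of the LLMAssessment schema as an Anthropic tool definition.
ASSESSMENT_TOOL = {
    "name": "submit_assessment",
    "description": "Submit a structured market assessment.",
    "input_schema": {
        "type": "object",
        "properties": {
            "setup_type": {"enum": ["bullish_reversal", "bearish_reversal",
                                    "bullish_continuation", "bearish_continuation",
                                    "no_setup"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            "regime": {"enum": ["trending_day", "choppy",
                                "event_driven", "low_liquidity"]},
            "concerns": {"type": "array", "items": {"type": "string"}, "maxItems": 5},
        },
        "required": ["setup_type", "confidence", "regime", "concerns"],
    },
}

# tool_choice forces Claude to answer via the tool - never free text.
request_kwargs = {
    "tools": [ASSESSMENT_TOOL],
    "tool_choice": {"type": "tool", "name": "submit_assessment"},
}

def extract_assessment(content_blocks: list[dict]) -> dict:
    """Pull the structured JSON out of a response's tool_use content block."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == "submit_assessment":
            return block["input"]
    raise ValueError("model did not return a submit_assessment tool call")

# What a response's content looks like once forced through the tool:
sample_response = [{"type": "tool_use", "name": "submit_assessment",
                    "input": {"setup_type": "no_setup", "confidence": 0.4,
                              "regime": "choppy", "concerns": []}}]
assessment = extract_assessment(sample_response)
```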

Encoding: Text to Numbers

Neural networks eat numbers, not strings. The encode_assessment function converts the LLM's structured output into 6 floats, all scaled to [0, 1] or [-1, 1]:

llm_confidence         = confidence                    # [0, 1]
llm_setup_type         = setup_type_index / 4.0         # [0, 1]  (5 types → 0..4)
llm_rr_estimate        = min(rr_estimate / 5.0, 1.0)    # [0, 1]
llm_regime             = regime_index / 3.0              # [0, 1]  (4 regimes → 0..3)
llm_concern_count      = min(len(concerns) / 5.0, 1.0)  # [0, 1]
llm_narrative_sentiment = narrative_sentiment             # [-1, 1]

Caching to Parquet

Calling Claude during RL training would be impossibly slow and expensive. Instead, we pre-generate all assessments once and cache them as Parquet files. The generate_assessments function processes all pre-screened bars in batches of 50, saving checkpoints along the way. It supports resume - if the process crashes, it skips already-cached timestamps. Failed API calls get a null encoding (all zeros).

Quality Checks

Two automated checks catch potential problems: confidence saturation (more than 30% of assessments at confidence = 1.0, suggesting the LLM is overconfident) and regime concentration (more than 80% of assessments sharing the same regime, suggesting lack of discrimination).
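Both checks reduce to frequency counts over the cached assessments (a hypothetical sketch using the thresholds above):

```python
from collections import Counter

def quality_flags(assessments: list[dict]) -> list[str]:
    """Flag overconfidence and regime concentration."""
    flags = []
    n = len(assessments)
    saturated = sum(1 for a in assessments if a["confidence"] >= 1.0)
    if saturated / n > 0.30:
        flags.append("confidence_saturation")
    top_regime_count = Counter(a["regime"] for a in assessments).most_common(1)[0][1]
    if top_regime_count / n > 0.80:
        flags.append("regime_concentration")
    return flags

healthy = [{"confidence": 0.6, "regime": r} for r in ["choppy", "trending_day"] * 5]
flagged = [{"confidence": 1.0, "regime": "choppy"} for _ in range(10)]
assert quality_flags(healthy) == []
assert quality_flags(flagged) == ["confidence_saturation", "regime_concentration"]
```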

Cost Tracking

The client tracks token usage per request and computes cost using keyword-based model identification. The pricing table:

Model           Input ($/M tokens)   Output ($/M tokens)
Claude Haiku    $1.00                $5.00
Claude Sonnet   $3.00                $15.00
Claude Opus     $15.00               $75.00

For bulk assessment generation, we use Claude Haiku to keep costs manageable. The generate_assessments function logs progress with running cost and ETA, so you can see "450/1200 assessments done (37%), $2.14 spent, ETA 18 min" in the console. The client uses exponential backoff for retries (sleeping 2**attempt seconds between attempts) but doesn't retry on client errors like 400 or 401.
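The keyword-based cost computation is a rate lookup over the table above (sketch; the model-name string and token counts are illustrative):

```python
PRICING = {  # keyword -> (input $/M tokens, output $/M tokens)
    "haiku":  (1.00, 5.00),
    "sonnet": (3.00, 15.00),
    "opus":   (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Keyword-based model identification, then per-token pricing in dollars."""
    for keyword, (in_rate, out_rate) in PRICING.items():
        if keyword in model.lower():
            return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    raise ValueError(f"unknown model: {model}")

# One Haiku assessment: ~2,000 prompt tokens in, ~300 tokens out.
cost = request_cost("claude-haiku", 2_000, 300)   # $0.002 + $0.0015 = $0.0035
```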

Prompt Versioning

The system prompt loaded from prompt_v1.txt (or prompt_v2.txt) defines the persona and evaluation criteria for the LLM. Version 2 added narrative_sentiment as a new field. Each cached assessment records its prompt_version, so you can retrain with v2 assessments without invalidating v1 data. The resume logic in generate_assessments filters by version when deciding what to skip.

RL for Backend Devs

This chapter explains reinforcement learning concepts from scratch. No math prerequisites - just analogies and intuition.

The Dog Training Analogy

Reinforcement learning is like training a dog. You don't show the dog a million examples of "good behavior" and "bad behavior" (that would be supervised learning). Instead, you let the dog try things, and you give it a treat when it does something good and a stern "no" when it does something bad. Over many repetitions, the dog figures out which behaviors lead to treats.

The Vocabulary

  • Agent: the decision-maker (our trading bot)
  • Environment: the world it interacts with (market sim)
  • Observation: what it can see (36 numbers)
  • Action: what it can do (enter/skip + params)
  • Reward: the treat or punishment (profit/loss)

The Loop

  • Agent observes the environment state
  • Agent chooses an action
  • Environment simulates the result
  • Environment returns a reward and new state
  • Agent updates its strategy
  • Repeat millions of times

Episodes: One Game Level

An episode is one complete run of the game. In our case, one episode is one trading session - say, the Tokyo session on January 15th. The agent sees each 5-minute bar in sequence, decides whether to trade at each bar, and the episode ends when the session closes. Then we reset and start a new episode (maybe the US session on January 15th, or Tokyo on January 16th).

Episodes always end by truncation (the session ran out of time), never by termination (the agent didn't "die"). If the agent has an open position at session end, it's force-closed at the current price. This is realistic - a day trader never holds overnight. The reset logic selects episodes via a seeded random number generator (np.random.default_rng(42)) for reproducibility.

Policy: The Strategy

The agent's policy is its strategy - a function that maps observations to actions. In our case, it's a small neural network with 2 hidden layers of 64 neurons each (written in config as policy_net_arch: [64, 64]). The input is 36 numbers (the observation), and the output is a probability distribution over possible actions. At the start of training, this is essentially random. By the end, it should have learned patterns like "when the trend is strongly bullish and there's a fresh IFVG with favorable structural R:R, enter long."

Why such a small network? Trading decisions don't require the deep pattern recognition that image classification does. A [64, 64] MLP has only a few thousand trainable parameters - enough to learn the relationships between 36 input features and 2 output sub-decisions (entry and size), but small enough to train quickly and avoid overfitting. If the network were too large, it might memorize specific market patterns from the training period rather than learning generalizable rules.

Backend Dev Note

A neural network is just a chain of matrix multiplications with nonlinear functions in between. Our [64, 64] MLP (Multi-Layer Perceptron) works like this: take 36 input numbers, multiply by a 36×64 matrix, apply a nonlinear function (ReLU), multiply by a 64×64 matrix, apply ReLU again, then multiply by a 64×output matrix. The "learning" part is adjusting those matrix values so the outputs become useful. It's essentially a very fancy lookup table that can interpolate.
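That chain of matrix multiplications can be written out directly. A pure-Python sketch with tiny random toy weights (this is the shape of the computation, not the trained policy):

```python
import random

def relu(v: list[float]) -> list[float]:
    return [max(0.0, x) for x in v]

def linear(v: list[float], weights: list[list[float]], bias: list[float]) -> list[float]:
    """One layer: `weights` is (out x in); returns weights @ v + bias."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def make_layer(n_in: int, n_out: int, rng: random.Random):
    """Random small weights, zero bias - stand-ins for learned values."""
    return ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

rng = random.Random(0)
layers = [make_layer(36, 64, rng), make_layer(64, 64, rng), make_layer(64, 6, rng)]

def forward(obs: list[float]) -> list[float]:
    """36 features -> 64 -> 64 -> 6 logits (3 entry options + 3 size options)."""
    h = obs
    for i, (w, b) in enumerate(layers):
        h = linear(h, w, b)
        if i < len(layers) - 1:   # ReLU between hidden layers, not on the output
            h = relu(h)
    return h

logits = forward([0.5] * 36)   # any 36-number observation in, 6 logits out
```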

PPO: Learning Carefully

PPO (Proximal Policy Optimization) is the specific algorithm we use to update the policy. The key idea: don't change your strategy too much in one update. If a few lucky trades make "go all-in on every trade" look great, PPO prevents the agent from swinging wildly to that extreme. It uses a "clip range" (set to 0.2) that limits how much the probability of any action can change in a single update step.

We specifically use MaskablePPO from the sb3-contrib library. The "maskable" part is crucial - it means we can tell the agent "these actions are not allowed right now" and it respects those constraints during both action selection and learning.

Key Hyperparameters

Parameter       Value   Intuition
gamma           0.99    Discount factor. Future rewards are worth 99% of present rewards. The agent thinks long-term.
ent_coef        0.01    Entropy coefficient. A small bonus for trying random things. Prevents premature convergence to a boring strategy.
learning_rate   3e-4    Step size for updates. Small enough for stability, large enough to learn in reasonable time.
clip_range      0.2     Maximum allowed change per update. Keeps learning stable.
n_steps         2,048   Play 2,048 steps, then study what happened and update the policy.
batch_size      256     Review 256 steps at a time during each update pass.

Action Masking: Guardrails

The agent can't do whatever it wants. Action masking enforces rules like "you can't enter a new trade when you're already in one", "you can't trade if you've hit the daily loss limit", and "you can't enter a direction where the structural risk-reward is unfavorable." This is implemented as a binary mask - an array of 6 booleans (one per sub-action option across the 2 dimensions) where False means "this option is blocked."
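Mechanically, masking zeroes the probability of blocked options before sampling. A minimal sketch over one 3-option sub-dimension (not the sb3-contrib internals):

```python
import math

def masked_probs(logits: list[float], mask: list[bool]) -> list[float]:
    """Softmax over logits with blocked options forced to probability 0."""
    masked = [x if allowed else float("-inf") for x, allowed in zip(logits, mask)]
    peak = max(masked)
    exps = [math.exp(x - peak) if x != float("-inf") else 0.0 for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Entry dimension: [skip, long, short]. Already in a position -> only skip is legal.
only_skip = masked_probs([0.2, 1.5, -0.3], [True, False, False])
assert only_skip == [1.0, 0.0, 0.0]

# No position, but short blocked (e.g. unfavorable structural R:R on that side).
probs = masked_probs([0.2, 1.5, -0.3], [True, True, False])
assert probs[2] == 0.0 and abs(sum(probs) - 1.0) < 1e-9
```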

Key Insight

Action masking is what makes RL practical for trading. Without it, the agent would waste millions of training steps trying to learn rules we already know (like "don't enter a second trade while the first is still open"). By encoding constraints as masks, the agent focuses its learning capacity on the hard question: when and how to trade.

The Trading Brain

The rl/ package implements the Gymnasium environment, observation builder, action space, position simulator, and reward function. This is where all the pieces come together.

The 36 Observation Features

Every 5-minute bar, the agent sees exactly 36 numbers, all scaled to the range [-1, 1]. Why? Neural networks learn best when inputs are small and centered near zero. Here they are, grouped by source:

LLM 6
  • llm_confidence
  • llm_setup_type
  • llm_rr_estimate
  • llm_regime
  • llm_concern_count
  • llm_narrative_sentiment
IFVG 5
  • ifvg_count_active / 10
  • ifvg_nearest_dist / ATR
  • ifvg_best_quality / 3
  • ifvg_avg_fill_pct
  • ifvg_direction_bias
Trend 7
  • trend_score
  • ema20_slope_15m
  • ema20_slope_4h
  • ema20_slope_1d
  • structure_15m
  • structure_4h
  • structure_1d
Session 5
  • minutes_since_open / 480
  • window_progress_pct
  • overnight_range / ATR
  • prior_session_range / ATR
  • in_trading_window
Room-to-Right 4
  • rtr_score_long / 100
  • rtr_score_short / 100
  • exhaustion_pct
  • exhaustion_flag
Structural R:R 2
  • structural_rr_long / 5
  • structural_rr_short / 5
Microstructure 3
  • vol_ratio / 5
  • bar_range_norm / ATR
  • bar_body_ratio
Portfolio 4
  • in_position (0/1)
  • unrealized_pnl_r / 5
  • daily_pnl_r / 5
  • trades_today / 10

String-valued features get mapped to numbers: slopes become {rising: 1.0, flat: 0.0, falling: -1.0}; structures become {uptrend: 1.0, ranging: 0.0, downtrend: -1.0}. Distances are divided by ATR (Average True Range) to make them volatility-adaptive - "20 ticks away" means something very different on a calm day versus a volatile one.
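As a rough sketch (the dictionary and function names here are illustrative, not the actual module API), the encodings and ATR scaling look like this:

```python
# Hypothetical encodings matching the mappings described in the text.
SLOPE_MAP = {"rising": 1.0, "flat": 0.0, "falling": -1.0}
STRUCTURE_MAP = {"uptrend": 1.0, "ranging": 0.0, "downtrend": -1.0}

def normalize_distance(dist_ticks, atr_ticks):
    """Express a distance in ATR units so the feature adapts to volatility."""
    return dist_ticks / atr_ticks if atr_ticks > 0 else 0.0

# The same 20-tick distance reads as "far" on a calm day, "near" on a wild one:
print(normalize_distance(20, 10))  # 2.0 ATRs away
print(normalize_distance(20, 80))  # 0.25 ATRs away
print(SLOPE_MAP["rising"], STRUCTURE_MAP["downtrend"])  # 1.0 -1.0
```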

The Action Space: 9 Combinations

The action space is MultiDiscrete([3, 3]): two independent sub-decisions, each with 3 options. That's 3 × 3 = 9 possible combinations. Stop-loss and take-profit are not agent decisions - they're derived from market structure (explained below).

Entry
  • 0 Skip
  • 1 Long
  • 2 Short
Size
  • 0 1 contract
  • 1 2 contracts
  • 2 3 contracts
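Decoding a sampled action pair is straightforward; here is a minimal sketch that assumes nothing beyond the two tables above (no gymnasium dependency, names illustrative):

```python
# Decode a MultiDiscrete([3, 3]) sample into a trade decision.
ENTRY = {0: "skip", 1: "long", 2: "short"}
SIZE = {0: 1, 1: 2, 2: 3}  # contracts

def decode_action(action):
    entry_idx, size_idx = action
    return ENTRY[entry_idx], SIZE[size_idx]

print(decode_action((1, 0)))  # ('long', 1)
print(decode_action((2, 2)))  # ('short', 3)
# 3 entry options x 3 size options = 9 combinations in total
```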

Structure-Aware Stop-Loss and Take-Profit

Rather than letting the agent choose stop and target distances (which adds combinatorial complexity and leads to over-fitting), the system derives them from market structure:

Stop-Loss: IFVG Zone Boundaries

For a long entry, the stop is placed just below the nearest active IFVG zone boundary below the current price. For a short entry, just above the nearest zone above. These are natural invalidation levels - if price breaks through an IFVG zone, the thesis is likely wrong.

Fallback: when no structural level is available, the stop defaults to 1.5 × ATR from the entry price.

Take-Profit: Liquidity Pools

For a long entry, the target is the nearest liquidity pool (cluster of swing highs) above the current price. For a short entry, the nearest pool below. Price tends to gravitate toward these clusters, making them natural profit-taking zones.

Fallback: when no liquidity pool is found, the target defaults to 2.0R (twice the risk distance).
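Putting the stop, target, and both fallback rules together, a hypothetical helper might look like this (`zone_boundary` and `pool` stand in for levels the feature pipeline would supply; the function name is an assumption):

```python
def derive_stop_target(entry, direction, zone_boundary=None, pool=None,
                       atr=0.0, stop_atr_mult=1.5, target_fallback_r=2.0):
    """Sketch of structure-derived SL/TP with the documented fallbacks."""
    sign = 1 if direction == "long" else -1
    if zone_boundary is not None:
        stop = zone_boundary                       # IFVG invalidation level
    else:
        stop = entry - sign * stop_atr_mult * atr  # 1.5 x ATR fallback
    risk = abs(entry - stop)
    if pool is not None:
        target = pool                              # nearest liquidity pool
    else:
        target = entry + sign * target_fallback_r * risk  # 2R fallback
    structural_rr = abs(target - entry) / risk
    return stop, target, structural_rr

# Long at 38260, zone boundary at 38240, liquidity pool at 38310 -> R:R of 2.5
print(derive_stop_target(38260, "long", zone_boundary=38240, pool=38310))
# (38240, 38310, 2.5)
```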

Backend Dev Note

R-multiples measure reward relative to risk. If your stop-loss is 10 ticks away (your risk = 10 ticks), a 2R target is 20 ticks of profit, 3R is 30 ticks. This makes results comparable across instruments with different price scales. Structural R:R is the ratio of target distance to stop distance derived from these market-structure levels. A structural R:R of 2.5 means the nearest liquidity pool is 2.5 times farther away than the nearest IFVG zone boundary. When the trade's MFE (maximum favorable excursion) reaches 1R, the stop automatically moves to breakeven (entry price), locking in a risk-free trade.

Action Masking Rules

Certain actions are blocked based on the current state. The mask is a flat array of 6 booleans (3 for entry + 3 for size). When any of these conditions are true, long and short entry are masked (only skip is allowed):

  • Already in a position: can't enter a second trade
  • Daily P&L ≤ -3R: daily loss limit hit, stop trading
  • Trades today ≥ 5: maximum trades per session reached
  • Structural R:R below 1.5: long is blocked if structural_rr_long < 1.5; short is blocked if structural_rr_short < 1.5. This prevents the agent from entering trades where the market structure doesn't offer a favorable risk-reward ratio.
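A sketch of how these rules could produce the 6-boolean mask (function and parameter names are assumptions, not the actual rl/ API):

```python
def build_action_mask(in_position, daily_pnl_r, trades_today,
                      rr_long, rr_short, min_rr=1.5,
                      max_daily_loss_r=3.0, max_trades=5):
    """Mask layout: [skip, long, short, size_1, size_2, size_3]."""
    blocked = (in_position
               or daily_pnl_r <= -max_daily_loss_r
               or trades_today >= max_trades)
    long_ok = not blocked and rr_long >= min_rr
    short_ok = not blocked and rr_short >= min_rr
    # Skip is always allowed; size options stay open (they only matter on entry)
    return [True, long_ok, short_ok, True, True, True]

print(build_action_mask(False, -1.0, 2, rr_long=2.1, rr_short=1.2))
# [True, True, False, True, True, True]  -> short blocked by the R:R gate
print(build_action_mask(True, 0.5, 1, rr_long=3.0, rr_short=3.0))
# [True, False, False, True, True, True] -> already in a trade, only skip
```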

Position Simulation

When the agent enters a trade, we simulate it with 1-tick slippage (you always get a slightly worse fill than the ideal price - this is realistic). Each subsequent bar, we check: did the price hit the stop-loss? (Checked first - conservative assumption.) Did it hit the take-profit target? We track MAE (Maximum Adverse Excursion - worst drawdown during the trade) and MFE (Maximum Favorable Excursion - best unrealized profit during the trade). Commissions are realistic: $1.25 per contract for NQ, 80 JPY per contract for NIY.

The trailing stop has a special mechanism: when the trade's MFE reaches 1R (i.e., the trade has moved one risk-unit in your favor), the stop automatically moves to breakeven (entry price). This locks in a risk-free trade. The stop_moved_to_breakeven flag ensures this happens only once per position.

A completed trade produces a CompletedTrade dataclass with detailed statistics: entry/exit prices, P&L in ticks, risk in ticks, the realized R-multiple (P&L / risk, net of commission), whether it hit the target or stop, bars held, MAE, MFE, commission in the instrument's currency, and the exit reason ("stop", "target", or "session_end"). The realized_rr formula accounts for commissions by converting them to ticks: commission_ticks = commission / (size × tick_size × point_value), then realized_rr = (pnl_ticks - commission_ticks) / risk_ticks.
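The realized_rr formula from the text, as runnable arithmetic (the trade values below are made up for illustration):

```python
def realized_rr(pnl_ticks, risk_ticks, commission, size, tick_size, point_value):
    """Realized R-multiple net of commission, per the formula in the text."""
    commission_ticks = commission / (size * tick_size * point_value)
    return (pnl_ticks - commission_ticks) / risk_ticks

# Hypothetical NQ trade: 40 ticks of profit on 20 ticks of risk, 1 contract,
# $1.25 commission, tick size 0.25 points, $20 per point.
# commission_ticks = 1.25 / (1 * 0.25 * 20) = 0.25 ticks
print(realized_rr(40, 20, commission=1.25, size=1, tick_size=0.25, point_value=20))
# 1.9875  (= (40 - 0.25) / 20)
```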

The Reward Function

The reward has three components:

  • Trade Reward: realized_rr, plus 0.3 if the target was hit
  • Step Penalty: if daily_pnl < -2R, subtract 0.8 × the drawdown beyond -2R
  • Patience Bonus: if skipping with no open position, +0.001 per step

The trade reward is the realized R-multiple of the closed trade, with a 0.3 bonus for hitting the target (encouraging the agent to let winners run rather than cutting them short). The step penalty accelerates when the daily P&L drops below -2R, punishing the agent for digging deeper into a losing day. The patience bonus is tiny (0.001) but crucial - it gives the agent a small positive reward for not trading when there's no position. Without this, the agent would see "skip" as a zero-reward action and prefer to always enter trades.
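The three components can be sketched as one function (an illustrative reconstruction, not the actual rl/reward.py code):

```python
def compute_reward(realized_rr=None, hit_target=False,
                   daily_pnl_r=0.0, skipped=False, in_position=False):
    """Sketch of the three-part reward; names are illustrative."""
    reward = 0.0
    if realized_rr is not None:              # a trade closed this step
        reward += realized_rr + (0.3 if hit_target else 0.0)
    if daily_pnl_r < -2.0:                   # penalty grows past -2R
        reward += -0.8 * (-2.0 - daily_pnl_r)
    if skipped and not in_position:          # patience bonus
        reward += 0.001
    return reward

print(round(compute_reward(realized_rr=1.87, hit_target=True), 2))  # 2.17
print(round(compute_reward(daily_pnl_r=-2.5), 2))                   # -0.4
print(compute_reward(skipped=True))                                 # 0.001
```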

Key Insight

Reward shaping is the art of RL. The patience bonus is a great example: without it, agents often over-trade (entering 5 trades per session, most losers). With it, they learn to wait for quality setups. A 0.001 bonus per skip step adds up to ~0.06 per hour of patience - enough to shift behavior without overwhelming the trade reward signal.

Teaching and Testing

Training, evaluation, baselines, and the criteria for deciding whether a model is good enough.

The Training Loop

Training follows the standard PPO loop: collect experience, compute advantages, update the policy network. In concrete terms:

  1. The agent plays through episodes, collecting 2,048 steps of experience (observations, actions, rewards).
  2. Using those 2,048 steps, PPO computes "advantages" - how much better or worse each action was compared to what was expected.
  3. The policy network is updated in mini-batches of 256 steps, with the clip range preventing drastic changes.
  4. Repeat from step 1 until we've reached 1,000,000 total timesteps.

1 million timesteps sounds like a lot, but each "timestep" is just one 5-minute bar. A near-24-hour futures session contains a few hundred 5-minute bars, so 1,000,000 timesteps works out to roughly 3,000 simulated sessions - at ~250 trading days per year, that's about 12 years of daily trading compressed into a few hours of compute time.
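Step 2 of the loop - computing advantages - is typically done with Generalized Advantage Estimation (GAE), which is what Stable-Baselines3's PPO uses under the hood. A minimal sketch of the standard GAE recursion (the λ value is an assumption; the text only specifies gamma):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (sketch).

    values has len(rewards) + 1 entries (bootstrap value appended).
    Works backward: each advantage mixes the one-step TD error with
    the discounted advantage of the following step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy 3-step rollout: the big final reward propagates backward, credited
# (with discounting) to the earlier actions that led to it.
print(gae_advantages([1.0, 0.0, 2.17], [0.5, 0.4, 0.6, 0.0]))
```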

The training hyperparameters are tuned for this specific problem:

# TrainingConfig (common/config.py)
total_timesteps:   1_000_000   # Total experience to collect
learning_rate:     3e-4        # Adam optimizer step size
n_steps:           2_048       # Steps per rollout buffer
batch_size:        256         # Mini-batch size for updates
gamma:             0.99        # Discount factor (long-term focus)
clip_range:        0.2         # PPO clipping parameter
ent_coef:          0.01        # Entropy bonus for exploration
checkpoint_freq:   50_000      # Save every 50K steps
eval_freq:         50_000      # Evaluate every 50K steps
max_daily_loss_r:  3.0         # Daily loss limit (in R-multiples)
max_trades_per_session: 5      # Max trades before forced skip
policy_net_arch:   [64, 64]    # Hidden layers

Callbacks: Checkpoints and Evaluation

During training, two callbacks run automatically:

  • Checkpoint callback: saves a snapshot every 50,000 steps to models/checkpoints/. If training crashes, you can resume from the latest checkpoint.
  • Evaluation callback: every 50,000 steps, runs the current policy on the validation set (up to 50 episodes) with deterministic=True and logs metrics. If the Sharpe ratio beats the previous best, saves the model to models/best/best_model.zip.
Key Insight

We save the best model by Sharpe ratio, not by total reward. A model with high total reward might just be entering many trades in one lucky period. The Sharpe ratio measures return per unit of risk, which is what we actually care about. A Sharpe of 2.0 means the model's returns are 2 standard deviations above zero - consistently profitable, not just occasionally lucky.

Evaluation Metrics

  • Sharpe Ratio ≥ 1.0
  • Max Drawdown ≤ 5R
  • Win Rate ≥ 40%

A model must pass all three criteria above to be "promoted".

Backend Dev Note

Sharpe ratio = (mean return / std of returns) × √252. The √252 annualizes it (252 trading days/year). A Sharpe ≥ 1.0 means the strategy's annualized mean return is at least as large as the annualized volatility of its returns - the edge outweighs the typical fluctuation. Max drawdown is the worst peak-to-trough decline - if your cumulative P&L goes from +8R to +3R, that's a 5R drawdown. Win rate of 40% means 4 out of 10 trades are profitable - which is fine if your winners are larger than your losers (which the structural R:R gate, with its 1.5 minimum, encourages).
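Both formulas are easy to verify in a few lines (the daily R values below are a toy series, illustrative only):

```python
import math

def sharpe_annualized(daily_returns_r):
    """Annualized Sharpe: (mean / std) * sqrt(252), population std."""
    n = len(daily_returns_r)
    mean = sum(daily_returns_r) / n
    var = sum((r - mean) ** 2 for r in daily_returns_r) / n
    return (mean / math.sqrt(var)) * math.sqrt(252)

def max_drawdown(daily_returns_r):
    """Worst peak-to-trough decline of the cumulative R curve."""
    peak, cum, worst = 0.0, 0.0, 0.0
    for r in daily_returns_r:
        cum += r
        peak = max(peak, cum)
        worst = max(worst, peak - cum)
    return worst

rets = [1.0, -0.5, 2.0, -1.5, 0.5, 1.0]
print(round(sharpe_annualized(rets), 2))  # annualized Sharpe of this toy series
print(max_drawdown(rets))                 # 1.5 (cumulative R fell from +2.5 to +1.0)
```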

Additional metrics tracked but not used as promotion gates: avg_rr (average R-multiple per trade), profit_factor (gross profit / gross loss - infinity if no losses), total_r (sum of all R-multiples across all evaluated episodes), and trades_per_session (helps spot over-trading).

The evaluation function runs model.predict(obs, deterministic=True, action_masks=masks) for up to 500 steps per episode and collects all CompletedTrade objects. Setting deterministic=True means the agent picks the most probable action (no exploration noise), which gives a cleaner picture of what the policy has actually learned versus what it stumbles into randomly.

Baseline Agents: The Minimum Bar

How do you know if your trained agent is any good? Compare it against baselines that require no learning:


RandomAgent

Picks random valid actions (respecting masks). The absolute floor - if your trained agent can't beat random, something is fundamentally wrong.


AlwaysEnterAgent

Always goes long with 1 contract. Stop and target are set by structural levels (or ATR fallback). Tests whether "just being in the market" is profitable (it usually isn't).


RulesBasedAgent

Uses simple rules: go long if trend > 0.3 and IFVG bias is bullish; go short if trend < -0.3 and bias bearish; skip otherwise. The "can an if-statement do this?" baseline.
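The RulesBasedAgent's logic fits in one function - a sketch assuming the thresholds quoted above, returning the same [entry, size] pair the RL agent emits:

```python
def rules_based_action(trend_score, ifvg_bias):
    """The if-statement baseline: [entry, size] with size fixed at 1 contract."""
    if trend_score > 0.3 and ifvg_bias > 0:
        return (1, 0)   # enter long, 1 contract
    if trend_score < -0.3 and ifvg_bias < 0:
        return (2, 0)   # enter short, 1 contract
    return (0, 0)       # skip

print(rules_based_action(0.62, 1.0))   # (1, 0) - bullish trend + bullish bias
print(rules_based_action(0.1, 1.0))    # (0, 0) - trend too weak, skip
```

If the trained agent can't clearly beat this, the neural network isn't earning its complexity.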

LLM Ablation

To measure whether the LLM features actually help, we train a separate model with ablate_llm=True. This zeros out the first 6 observation features (the LLM-derived ones) by setting obs[:LLM_FEATURE_COUNT] = 0.0 at each step. The agent sees only the 30 numerical features from IFVG, trend, session, RTR, structural R:R, microstructure, and portfolio modules. If the full model significantly outperforms the ablated one on the same validation set, the LLM assessment adds measurable value. This is a controlled experiment - everything else stays identical, so any performance difference is attributable to the LLM features.
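The ablation mechanism itself is trivial - a sketch of zeroing the LLM slice (plain Python lists here for clarity; the real code operates on the observation array):

```python
LLM_FEATURE_COUNT = 6  # the first 6 observation features are LLM-derived

def maybe_ablate(obs, ablate_llm):
    """Return the observation with the LLM slice zeroed when ablating."""
    if not ablate_llm:
        return obs
    return [0.0] * LLM_FEATURE_COUNT + obs[LLM_FEATURE_COUNT:]

obs = [0.5] * 36
ablated = maybe_ablate(obs, ablate_llm=True)
print(ablated[:LLM_FEATURE_COUNT], len(ablated))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 36  -> the other 30 features are untouched
```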

Time-Based Splits

We split data chronologically, never randomly:

  • Train: ≤ June 2024 - the agent learns from this data
  • Validation: Jul – Dec 2024 - used during training to select the best model
  • Test: > Dec 2024 - final evaluation, never seen during training
Why Not Random Splits?

In financial data, random shuffling causes data leakage. If a Friday candle is in the test set but the surrounding Thursday and Monday candles are in training, the model has implicitly "seen" the test data. Chronological splits ensure the model is always predicting the future from the past - exactly what it would do in production.

The Full Journey

Let's trace one data point's lifecycle from raw CSV to policy update. This is where all eight chapters converge.

One Data Point, Thirteen Steps

1. Raw 1-Minute CSV

A semicolon-delimited CSV from HistData: 20240115 003000;38250;38265;38245;38260;127. One row per minute, columns: timestamp, open, high, low, close, volume.

data/sources.py

2. Aggregated to 5-Minute Bar

Polars' group_by_dynamic("timestamp", every="5m") combines 5 one-minute rows into one candle: first open, max high, min low, last close, summed volume.

data/sources.py
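The first/max/min/last/sum semantics of this aggregation can be expressed in plain Python - a sketch of what the Polars call computes, not the actual pipeline code:

```python
def aggregate_ohlcv(minute_bars):
    """Roll 1-minute (o, h, l, c, v) tuples into one 5-minute candle:
    first open, max high, min low, last close, summed volume."""
    opens, highs, lows, closes, vols = zip(*minute_bars)
    return (opens[0], max(highs), min(lows), closes[-1], sum(vols))

# Five hypothetical 1-minute bars -> one 5-minute candle
bars = [(38250, 38265, 38245, 38260, 127),
        (38260, 38270, 38255, 38268, 98),
        (38268, 38272, 38260, 38262, 110),
        (38262, 38266, 38250, 38255, 87),
        (38255, 38261, 38252, 38259, 140)]
print(aggregate_ohlcv(bars))  # (38250, 38272, 38245, 38259, 562)
```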

3. Validated

OHLC relationship checks pass (low ≤ open ≤ high, etc.). No spike detected (<5% change from previous bar). Volume is positive. The bar survives validation.

data/quality.py
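The checks in this step can be sketched as one predicate (function name and signature are illustrative, not the actual data/quality.py API):

```python
def validate_bar(prev_close, o, h, l, c, v, max_spike_pct=0.05):
    """OHLC sanity, spike, and volume checks for a single candle (sketch)."""
    ohlc_ok = l <= min(o, c) and h >= max(o, c) and l <= h
    spike_ok = prev_close is None or abs(c - prev_close) / prev_close < max_spike_pct
    return ohlc_ok and spike_ok and v > 0

print(validate_bar(38250, 38250, 38272, 38245, 38259, 562))  # True - bar survives
print(validate_bar(38250, 38250, 38240, 38245, 38259, 562))  # False - high < close
```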

4. Stored in TimescaleDB

Upserted into the candles_5m hypertable with the instrument symbol and contract month. The BackfillService commits this batch and moves to the next 30-day window.

data/backfill.py

5. Aggregated to Higher Timeframes

The 5m bars are rolled up to 15-minute, 4-hour, and daily candles using the same OHLCV aggregation logic. These are needed for multi-timeframe trend analysis.

features/pipeline.py

6. Features Computed

The feature pipeline runs in sequence: trend features (EMAs, slopes, structure), IFVG detection and scoring (including structural stop levels from zone boundaries), session features (overnight range, window position), room-to-right (liquidity pools, exhaustion, and structural target levels), and structural R:R ratios (risk-reward from IFVG stops to liquidity pool targets). Higher-timeframe features are joined to 5m bars via join_asof.

features/

7. Pre-Screen Applied

This bar has 3 active IFVGs, a trend score of +0.62, an RTR long score of 45, and it's within the trading window. All four conditions pass: pre_screen_passed = True.

features/pre_screen.py

8. LLM Assesses the Setup

Claude receives the last 50 candles as compact CSV plus all feature summaries. It returns a structured assessment: setup_type=bullish_continuation, confidence=0.72, regime=trending_day, narrative_sentiment=0.4, one concern about overhead resistance.

llm/

9. Assessment Encoded

The structured output becomes 6 floats: [0.72, 0.75, 0.44, 0.0, 0.20, 0.40]. These are cached to Parquet alongside the timestamp and prompt version for reproducibility.

llm/schema.py

10. Observation Vector Built

The 6 LLM + 5 IFVG + 7 trend + 5 session + 4 RTR + 2 structural R:R + 3 microstructure + 4 portfolio = 36 numbers, all clipped to [-1, 1]. This is what the agent sees.

rl/obs.py

11. Agent Picks an Action

The policy network (36 → 64 → 64 → output) processes the observation and outputs probabilities. With action masking applied (including the structural R:R gate), it selects: [1, 0]: enter long, 1 contract. The stop-loss is placed at the nearest IFVG zone boundary below price; the take-profit target at the nearest liquidity pool above. The structural R:R for this setup is 2.1, well above the 1.5 minimum.

rl/env.py

12. Position Simulated, Reward Computed

Entry fills at close + 1 tick slippage. Over the next 8 bars, the trade hits its 2R target. Realized RR after commission: +1.87. Trade reward: 1.87 + 0.3 (target bonus) = +2.17.

rl/position.py rl/reward.py

13. Policy Updated

This experience joins 2,047 other steps in the rollout buffer. PPO computes advantages, then updates the network weights in mini-batches of 256. The probability of "enter long when trend is bullish and IFVGs are active" increases slightly.

rl/train.py

The Model Registry

When training completes, save_model writes four files to models/{timestamp}/: the model weights (model.zip), a metadata.json (instrument, training duration, git hash), eval_metrics.json (Sharpe, drawdown, win rate, profit factor), and a config_snapshot.yaml (frozen copy of the config used for training). The best model is also saved to models/best/best_model.zip.

list_models scans the models directory and returns all saved runs sorted by timestamp, making it easy to compare experiments.

Config System in Action

Change one value in config/default.yaml and it ripples through the entire pipeline. Here are some examples of how single config changes affect the whole system:

  • Set features.pre_screen_min_trend: 0.5 (up from 0.3) → fewer bars pass pre-screen → fewer LLM calls → fewer training episodes → a more selective but potentially undertrained agent.
  • Set training.ent_coef: 0.05 (up from 0.01) → the agent explores more random actions during training → slower convergence but wider strategy search.
  • Set features.ifvg_min_gap_ticks: 8 (up from 4) → only large FVGs are detected → fewer active IFVGs per bar → pre-screen becomes harder to pass.
  • Set training.gamma: 0.95 (down from 0.99) → the agent discounts future rewards more heavily → prefers quick trades over patient setups.
  • Set features.ema_fast: 10 (down from 20) → the fast EMA reacts quicker → more slope changes detected → potentially noisier trend signals.

The config snapshot saved with each model makes experiments reproducible. You can always answer "what settings produced this model?" by reading config_snapshot.yaml in the model directory.

Feature Configuration Reference

For reference, here are the key configurable thresholds that shape the feature pipeline:

# FeatureConfig defaults (common/config.py)
ifvg_min_gap_ticks:     4.0     # Minimum gap size to detect an FVG
ifvg_max_age_bars:      100     # Bars before an IFVG expires
ema_fast:               20      # Fast EMA period
ema_slow:               50      # Slow EMA period
atr_period:             14      # ATR lookback period
displacement_body_pct:  0.70    # Min body ratio for displacement candle
displacement_atr_mult:  1.5     # Min range as multiple of ATR
rtr_lookback_days:      20      # Days for liquidity pool detection
pre_screen_min_ifvgs:   2       # Min active IFVGs to pass pre-screen
pre_screen_min_trend:   0.3     # Min |trend_score| to pass pre-screen
pre_screen_min_rtr:     30.0    # Min RTR score to pass pre-screen

# Structural SL/TP defaults (rl/env.py EnvConfig)
stop_atr_fallback_mult: 1.5     # ATR multiple for stop when no IFVG zone
target_fallback_r:      2.0     # R-multiple for target when no liquidity pool
min_structural_rr:      1.5     # Minimum R:R to allow entry (action mask gate)

What's Next

The system has two modules shown with dashed borders in the architecture diagram - they exist as stubs, ready for implementation:

  • Execution (execution/) - connects to Interactive Brokers via their API on port 4002. Takes the agent's action and converts it into real orders with proper position sizing, bracket orders (stop + target), and session-end flatten logic.
  • Monitoring (monitoring/) - Telegram bot for real-time alerts (trade entries, exits, daily P&L summaries) and health checks (is IB connected? Is data flowing? Is the model loaded?).

The transition from offline to live trading is the next frontier. All the infrastructure is in place: the config system supports ib.readonly=False, the position simulator can be swapped for real order management, and the evaluation metrics define what "good enough" looks like. The gap between simulation and reality is always non-trivial, but the architecture was designed with this transition in mind.

Key Insight

The most important thing about this system isn't any single module - it's how they compose. Raw data becomes features, features become observations, observations become actions, actions become rewards, and rewards shape the policy. Every module has a clear interface and a single responsibility. Understanding this flow is understanding the system.