A technical deep-dive for backend developers. No ML background required. We'll walk through every piece of this algorithmic trading system - from raw market data to a reinforcement learning agent that learns to trade futures.
Imagine a video game where the player is an artificial intelligence, the game world is a financial market, and the score is measured in profit and loss. That is Gotham RL in one sentence: an offline trading simulator where an AI agent learns, through millions of simulated trades, how to enter and exit positions on futures contracts.
The word "offline" is key. We are not plugging this into a live brokerage and letting it loose on real money. Instead, we replay historical market data, let the agent make decisions, simulate fills with realistic slippage, and measure whether the agent's strategy improves over time. Think of it like a flight simulator for a trading bot - all the physics of real markets, none of the financial risk during training.
Every part of the system maps to one of three responsibilities:
Ingest, validate, and store historical market data. Raw 1-minute CSVs become clean 5-minute OHLCV bars in TimescaleDB.
Extract features from raw data - trend direction, price gaps, session context - and ask Claude for a qualitative market assessment.
A reinforcement learning agent observes 36 numerical features and picks one of 9 possible actions: enter long, enter short, or skip - with configurable position size. Stop-loss and take-profit are derived from market structure: IFVG zones for stops, liquidity pools for targets.
A futures contract is a legally binding agreement to buy or sell an asset at a specific price on a specific future date. You don't actually own the asset - you're trading the contract. Think of it like a ticket that says "I will buy 500 units of the Nikkei 225 index at price X on date Y." Futures trade nearly 24 hours, 5 days a week. They expire quarterly (March, June, September, December), so you periodically "roll" to the next contract.
We trade two specific futures: the Nikkei 225 (NIY) and the Nasdaq-100 (NQ), both listed on CME.
The entire system is designed to answer one question per 5-minute bar: "Should I enter a trade right now, and if so, with what parameters?" Every module exists to give the agent better information for that decision.
The common/ package is imported by everything else. It provides configuration, logging, and domain constants - the bedrock that every other module stands on.
If you've worked with Spring Boot's property resolution or Rails' environment configs, the pattern here is familiar. We layer multiple configuration sources, where higher-priority sources override lower ones:
1. Environment variables, prefixed with `GOTHAM_` and nested with `__`. For example, `GOTHAM_DATABASE__HOST=10.0.0.5` sets `database.host`.
2. `config/{GOTHAM_ENV}.yaml`. Set `GOTHAM_ENV=prod` to load `config/prod.yaml`.
3. `config/default.yaml`, always loaded.

All of this is implemented with Pydantic Settings, which gives us runtime type validation for free. If someone sets `GOTHAM_DATABASE__PORT=banana`, the app fails immediately at startup with a clear validation error, not deep in a database connection two hours later.
```python
@lru_cache(maxsize=1)
def get_settings(**overrides: Any) -> GothamSettings:
    """Singleton access to settings. Cached after first call."""
    return GothamSettings(**overrides)
```
The @lru_cache means the first call creates the settings object, and every subsequent call returns the same instance. Tests clear this cache in conftest.py so each test gets a fresh config.
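The caching behavior is easy to see with a toy stand-in (the real `GothamSettings` is a Pydantic model; `ToySettings` here is purely illustrative):

```python
from functools import lru_cache
from typing import Any

class ToySettings:
    """Stand-in for GothamSettings (the real class is a Pydantic model)."""
    def __init__(self, **overrides: Any) -> None:
        self.database_port = overrides.get("database_port", 5432)

@lru_cache(maxsize=1)
def get_settings(**overrides: Any) -> ToySettings:
    return ToySettings(**overrides)

a = get_settings()
b = get_settings()
assert a is b                  # second call returns the same cached instance

get_settings.cache_clear()     # what conftest.py does between tests
assert get_settings() is not a # fresh instance after clearing
```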
Two frozen dataclasses define the financial instruments and their trading sessions:
From `src/gotham/common/constants.py`:

```python
@dataclass(frozen=True)
class InstrumentSpec:
    symbol: str         # "NIY" or "NQ"
    exchange: str       # "CME"
    currency: str       # "JPY" or "USD"
    tick_size: float    # 5.0 or 0.25
    point_value: float  # 500.0 or 20.0
    session: SessionName

NIKKEI = InstrumentSpec("NIY", "CME", "JPY", 5.0, 500.0, SessionName.TOKYO)
NASDAQ = InstrumentSpec("NQ", "CME", "USD", 0.25, 20.0, SessionName.US)
```
Tick size is the minimum price movement. For the Nikkei, price moves in steps of 5 (e.g., 38,000 → 38,005 → 38,010). For Nasdaq, it moves in steps of 0.25 (e.g., 18,500.00 → 18,500.25). Point value is how much money each point of price movement is worth per contract. A 5-point move in Nikkei = 5 × 500 = 2,500 JPY per contract.
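As a quick sanity check, the point-value arithmetic above can be sketched in a couple of lines (the helper function is illustrative, not part of the codebase):

```python
# P&L per contract for a price move measured in points.
def move_value(points: float, point_value: float) -> float:
    return points * point_value

# Nikkei: a 5-point move is worth 5 × 500 = 2,500 JPY per contract.
assert move_value(5, 500.0) == 2500.0
# Nasdaq: a 10-point move is worth 10 × 20 = 200 USD per contract.
assert move_value(10, 20.0) == 200.0
```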
Everything in the system uses UTC internally. The session windows define when each market is "active":
| Session | UTC Window | Prime Window | Duration |
|---|---|---|---|
| Tokyo (NIY) | 00:00 – 06:00 | 00:30 – 01:30 | 60 min |
| US (NQ) | 13:30 – 20:00 | 15:00 – 16:00 | 60 min |
The prime trading window is where the agent actually makes decisions. It's a narrow 60-minute slot within each session that we've identified as having the best liquidity and trend behavior. Outside this window, the agent is forced to skip. This is an intentional constraint - most professional day traders focus on a specific window rather than trading all day.
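A minimal sketch of the window check, using the UTC times from the table above (the function name and dict layout are assumptions, not the actual session module API):

```python
from datetime import datetime, time, timezone

# Prime windows from the table above, in UTC.
PRIME_WINDOW = {"TOKYO": (time(0, 30), time(1, 30)),
                "US": (time(15, 0), time(16, 0))}

def in_prime_window(ts: datetime, session: str) -> bool:
    start, end = PRIME_WINDOW[session]
    return start <= ts.time() < end

assert in_prime_window(datetime(2024, 1, 15, 1, 0, tzinfo=timezone.utc), "TOKYO")
assert not in_prime_window(datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc), "US")
```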
The logging setup uses structlog with two rendering modes: JSON for production (machine-parseable, feeds into log aggregation) and colored console output for development. Both share the same pipeline of processors - timestamp injection, log level annotation, exception formatting - only the final renderer differs. Log files rotate at 50 MB with 5 backups via RotatingFileHandler.
The Timeframe enum defines four granularities: M5 (5 minutes), M15 (15 minutes), H4 (4 hours), and D1 (1 day). Raw data enters as 5-minute bars. We aggregate up to higher timeframes for trend analysis - the idea being that a trend visible on a daily chart is more significant than one visible only on a 5-minute chart. The constant TIMEFRAME_MINUTES maps each to its duration: 5, 15, 240, 1440.
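A minimal sketch of that enum and mapping, following the names given in the text:

```python
from enum import Enum

class Timeframe(Enum):
    M5 = "5m"
    M15 = "15m"
    H4 = "4h"
    D1 = "1d"

TIMEFRAME_MINUTES = {Timeframe.M5: 5, Timeframe.M15: 15,
                     Timeframe.H4: 240, Timeframe.D1: 1440}

# Sanity check: one day contains exactly six 4-hour bars.
assert TIMEFRAME_MINUTES[Timeframe.D1] // TIMEFRAME_MINUTES[Timeframe.H4] == 6
```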
The data/ package handles everything from raw CSV files to validated, stored candles in TimescaleDB. This is a classic ETL pipeline with financial-data-specific validation.
A candle (or "bar") summarizes price action over a fixed time period. Each candle has five values: Open (first price), High (highest price), Low (lowest price), Close (last price), and Volume (number of contracts traded). A 5-minute candle at 10:00 tells you: "Between 10:00 and 10:05, the price opened at X, went as high as Y, as low as Z, and closed at W, with V contracts changing hands."
Different data sources implement the same Protocol:
```python
from datetime import date
from typing import Protocol, runtime_checkable

import polars as pl

@runtime_checkable
class CandleSource(Protocol):
    def fetch(self, instrument: str, start: date, end: date) -> pl.DataFrame:
        """Return 5m OHLCV DataFrame with columns:
        timestamp, instrument, open, high, low, close, volume, contract_month"""
        ...
```
We currently have two implementations: HistDataSource (reads 1-minute CSVs from the HistData website and aggregates to 5-minute bars) and CSVSource (reads pre-formatted 5-minute CSVs). The Protocol pattern means adding a new source - say, a direct Interactive Brokers feed - requires zero changes to the rest of the pipeline.
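The structural-typing payoff is easy to demonstrate with a dependency-free sketch (the DataFrame is simplified to a list of dicts, and `InMemorySource` is a hypothetical extra implementation):

```python
from datetime import date
from typing import Protocol, runtime_checkable

@runtime_checkable
class CandleSource(Protocol):
    def fetch(self, instrument: str, start: date, end: date) -> list[dict]: ...

class InMemorySource:
    """Hypothetical third source - note: no inheritance from CandleSource."""
    def fetch(self, instrument: str, start: date, end: date) -> list[dict]:
        return [{"timestamp": start, "instrument": instrument, "open": 1.0,
                 "high": 2.0, "low": 0.5, "close": 1.5, "volume": 10}]

# Structural typing: InMemorySource satisfies the Protocol automatically.
assert isinstance(InMemorySource(), CandleSource)
```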
The HistDataSource reads raw 1-minute data and aggregates it using Polars' group_by_dynamic:
```python
# 1m → 5m: group_by_dynamic("timestamp", every="5m", label="left")
# Aggregation rules:
#   open   = first()
#   high   = max()
#   low    = min()
#   close  = last()
#   volume = sum()  (or 0 if no data)
```
Why 5-minute bars and not 1-minute or 15-minute? It's a balance. One-minute bars are noisy - too much random fluctuation for the agent to learn from. Fifteen-minute bars are too coarse - you miss important price movements. Five minutes is the sweet spot used by many institutional quant systems.
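The same aggregation rules can be expressed with the standard library alone, which makes the semantics explicit (Polars' `group_by_dynamic` does this far more efficiently; the bar layout below is illustrative):

```python
from datetime import datetime, timedelta

def aggregate_5m(bars_1m):
    """Bucket 1-minute bars into 5-minute bars: first open, max high,
    min low, last close, summed volume."""
    out = {}
    for b in bars_1m:
        key = b["ts"] - timedelta(minutes=b["ts"].minute % 5,
                                  seconds=b["ts"].second)
        g = out.setdefault(key, {"ts": key, "open": b["open"], "high": b["high"],
                                 "low": b["low"], "close": b["close"], "volume": 0})
        g["high"] = max(g["high"], b["high"])
        g["low"] = min(g["low"], b["low"])
        g["close"] = b["close"]           # last bar in the bucket wins
        g["volume"] += b["volume"]
    return [out[k] for k in sorted(out)]

bars = [{"ts": datetime(2024, 1, 15, 10, m), "open": 100 + m, "high": 101 + m,
         "low": 99 + m, "close": 100.5 + m, "volume": 10} for m in range(5)]
agg = aggregate_5m(bars)
assert len(agg) == 1
assert agg[0]["open"] == 100 and agg[0]["high"] == 105 and agg[0]["volume"] == 50
```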
The CandleValidator catches two categories of problems:
The low must be ≤ everything (open, high, close). The high must be ≥ everything. No nulls, no NaN, no negative prices. Violations are severity ERROR: the row is rejected.
If the close-to-close change exceeds 5%, it's flagged as a WARNING. This catches data errors like a misplaced decimal point (18500 vs 1850). The candle isn't rejected but is flagged for review.
There's also a zero-volume check during active sessions (a candle with no trades during market hours is suspicious) and a gap detector that identifies missing candles by comparing actual timestamps against expected 5-minute intervals.
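A compact sketch of the two severity levels described above (this is not the actual `CandleValidator` API, just the rules restated as code):

```python
import math

def validate(candle, prev_close=None, spike_pct=0.05):
    """Return a list of (severity, message) tuples for one candle."""
    o, h, l, c = (candle[k] for k in ("open", "high", "low", "close"))
    issues = []
    vals = [o, h, l, c]
    # Structural violations -> ERROR, row rejected.
    if any(v is None or math.isnan(v) or v < 0 for v in vals) \
            or l > min(o, h, c) or h < max(o, l, c):
        issues.append(("ERROR", "invalid OHLC"))
    # Close-to-close spike over 5% -> WARNING, flagged for review.
    if prev_close and abs(c - prev_close) / prev_close > spike_pct:
        issues.append(("WARNING", "price spike"))
    return issues

assert validate({"open": 10, "high": 12, "low": 9, "close": 11}) == []
assert validate({"open": 10, "high": 8, "low": 9, "close": 11})[0][0] == "ERROR"
assert validate({"open": 10, "high": 12, "low": 9, "close": 11},
                prev_close=100)[0][0] == "WARNING"
```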
This is the orchestrator that ties it all together. It processes data in 30-day batches with per-batch commits for crash resilience:
1. Query `MAX(timestamp)` from the database to find where we left off (resume point).
2. For each 30-day batch: fetch from the `CandleSource`, validate, upsert into the database, commit.

TimescaleDB is PostgreSQL with time-series superpowers. Our candle data is inherently time-series - ordered by timestamp, frequently queried by time range, and write-heavy during backfills. Hypertables automatically partition data by time, making range queries fast without manual shard management. Continuous aggregates precompute our 15m/4h/1d rollups from 5m data, so we never aggregate at query time.
If you've worked with PostgreSQL before, you'll feel right at home. The connection is configured via Pydantic settings (DatabaseConfig: host, port 5432, name/user "gotham") and exposed as both synchronous (.url → postgresql://) and async (.async_url → postgresql+asyncpg://) connection strings. The Docker Compose setup runs TimescaleDB on PostgreSQL 18 with health checks.
After backfilling, the GapDetector verifies data completeness. It generates expected 5-minute timestamps for each instrument's session window, skips weekends, and compares against actual timestamps in the database. Missing candles are reported as GapInfo objects with the expected time and gap type ("missing_candle" or "unexpected_gap"). Sessions that cross midnight (like Tokyo) are handled by adding a day to the end time when end ≤ start. This is critical for data quality - a missing candle during the agent's trading window could cause incorrect feature calculations.
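The midnight-crossing rule can be sketched as follows (function name and signature are illustrative, not the actual `GapDetector` API):

```python
from datetime import date, datetime, time, timedelta

def expected_timestamps(day: date, start: time, end: time, step_min: int = 5):
    """Generate the expected 5-minute timestamps for one session,
    adding a day to the end time when end <= start (midnight crossing)."""
    s = datetime.combine(day, start)
    e = datetime.combine(day, end)
    if e <= s:                      # session crosses midnight
        e += timedelta(days=1)
    stamps = []
    while s < e:
        stamps.append(s)
        s += timedelta(minutes=step_min)
    return stamps

# A 23:00 -> 01:00 session spans 2 hours = 24 five-minute bars.
assert len(expected_timestamps(date(2024, 1, 15), time(23, 0), time(1, 0))) == 24
```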
Futures contracts expire quarterly. When the March contract approaches expiration, traders "roll" to the June contract. The problem: the June contract typically trades at a slightly different price than the expiring March one. If you naively stitch them together, you get artificial price jumps at every roll date. We track contract_month in each candle for this reason, though back-adjustment (ratio or difference method) is handled downstream during feature computation.
The features/ package transforms raw price data into meaningful signals. Think of features as the agent's senses - the eyes, ears, and intuition that let it perceive what's happening in the market.
This is the core trading concept in the system. Let's build it from first principles.
A Fair Value Gap (FVG) is a three-candle pattern where price moves so aggressively that it leaves a "gap" - a price range where very little trading occurred. Imagine three candles in a row where the middle candle is exceptionally long. If the high of candle 1 doesn't overlap with the low of candle 3, there's a gap between them.
An Inverse Fair Value Gap is a regular FVG that gets "inverted" - price comes back and crosses through the gap zone from the other side. When a bullish FVG is inverted (price closes below it), that zone often becomes a resistance level. When a bearish FVG is inverted (price closes above it), it often becomes support. This is the trading edge: IFVGs mark zones where price is likely to react.
The code in features/ifvg.py detects FVGs by scanning 3-candle windows. The minimum gap size is 4 ticks (configurable). Each IFVG goes through a lifecycle:
- Just inverted. The zone is fresh and hasn't been retested. Fill percentage is low.
- Price has touched the zone boundary at least once. The more tests without breaking, the stronger the zone.
- Price has penetrated more than 50% into the zone. The zone is losing its power.
- Either fully filled (100% penetration) or older than 100 bars. No longer relevant.
Each IFVG also gets a quality score based on the displacement candle (the big candle that created the gap). The scoring logic is:
```python
# IFVG Quality Scoring (features/ifvg.py)
if gap_ticks >= 8 and body_ratio > 0.70:
    quality = "high"    # score = 3
elif gap_ticks >= 6 or body_ratio > 0.60:
    quality = "medium"  # score = 2
else:
    quality = "low"     # score = 1
```
The intuition: a big, decisive candle with a large gap signals strong institutional activity. A high-quality IFVG with a gap of 10+ ticks and a body that fills 80% of the candle range is much more likely to act as a reliable support/resistance zone than a small, indecisive one.
Fill percentage tracks how much price has penetrated the IFVG zone. For a bullish IFVG: fill = clip(penetration / gap_size, 0, 1) where penetration = high - zone_lower. When fill hits 100%, the zone is dead. Test count increments each time price touches the zone boundary without fully filling it.
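Restating the fill formula as code (a minimal sketch of the bullish case described above; the bearish mirror is an assumption):

```python
def fill_pct(zone_lower: float, zone_upper: float,
             bar_high: float, bar_low: float, bullish: bool = True) -> float:
    """clip(penetration / gap_size, 0, 1) as described in the text."""
    gap = zone_upper - zone_lower
    penetration = (bar_high - zone_lower) if bullish else (zone_upper - bar_low)
    return max(0.0, min(penetration / gap, 1.0))

assert fill_pct(100.0, 110.0, 105.0, 99.0) == 0.5   # halfway into the zone
assert fill_pct(100.0, 110.0, 115.0, 99.0) == 1.0   # fully filled -> zone dead
```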
The IFVG module outputs 5 features per bar: ifvg_count_active (number of live IFVG zones), ifvg_nearest_dist (distance in ticks to the nearest zone, 0 if inside one), ifvg_best_quality (0 = none, 1–3 scale), ifvg_avg_fill_pct (how "used up" the active zones are), and ifvg_direction_bias (bullish count minus bearish count, normalized to [-1, 1]).
An EMA (Exponential Moving Average) is a weighted average of recent prices where newer prices count more. Unlike a simple average, it "forgets" old data gradually. A 20-period EMA on 5-minute bars mostly reflects the last ~100 minutes of price action. The slope of an EMA tells you if the trend is rising, flat, or falling. We compute: slope = (ema[i] - ema[i-5]) / ema[i-5]. If this exceeds 0.1%, it's "rising"; below -0.1%, "falling"; otherwise "flat".
We compute trend features at three timeframes - M15, H4, and D1 - because trends have a hierarchy. A stock can be trending up on the daily chart while pulling back on the 15-minute chart. The compute_trend_features function computes EMA-20, EMA-50, their slopes, ATR-14 (a volatility measure), and a structure classification per timeframe.
Structure classification uses swing points (local highs and lows detected with a 5-bar lookback). If the last 4 swing highs are rising and the last 4 swing lows are rising, it's an UPTREND. Both falling: DOWNTREND. Otherwise: RANGING.
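The swing-based rule translates directly into code (a sketch of the classification logic, assuming swing points are already detected):

```python
def classify_structure(swing_highs, swing_lows):
    """UPTREND if the last 4 swing highs and lows are both rising,
    DOWNTREND if both falling, otherwise RANGING."""
    rising = lambda xs: all(a < b for a, b in zip(xs, xs[1:]))
    falling = lambda xs: all(a > b for a, b in zip(xs, xs[1:]))
    hs, ls = swing_highs[-4:], swing_lows[-4:]
    if rising(hs) and rising(ls):
        return "UPTREND"
    if falling(hs) and falling(ls):
        return "DOWNTREND"
    return "RANGING"

assert classify_structure([1, 2, 3, 4], [0, 1, 2, 3]) == "UPTREND"
assert classify_structure([4, 3, 2, 1], [3, 2, 1, 0]) == "DOWNTREND"
assert classify_structure([1, 3, 2, 4], [0, 1, 2, 3]) == "RANGING"
```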
The final composite trend score blends all three timeframes:
```python
trend_score = clip(
    D1_score  * 0.40 +   # Daily trend matters most
    H4_score  * 0.35 +   # 4-hour trend is next
    M15_score * 0.25,    # 15-min trend is tactical
    -1.0, 1.0
)
# Where: uptrend = +1.0, ranging = 0.0, downtrend = -1.0
```
Markets behave differently at different times. The first hour of a session is typically volatile (the "opening drive"), while the middle is often choppy. Session features capture:
- `minutes_since_open`: how far into the session we are
- `in_trading_window`: whether we're in the prime 60-minute slot
- `window_progress_pct`: position within the window (0.0 = start, 1.0 = end)
- `overnight_range`: high minus low during the 2 hours before session open. A wide overnight range often signals a directional day.
- `prior_session_high/low/close`: yesterday's key levels, which act as magnets and barriers for today's price.

Entering a trade is pointless if the price immediately runs into a wall of resistance. The Room-to-Right (RTR) module estimates how much space the price has to move before hitting obstacles.
It works by detecting liquidity pools: clusters of swing highs or swing lows that are close together in price. The algorithm greedily clusters swing points within 0.5% of each other. When 3 or more swing points cluster together, that's a liquidity pool. Pools are classified as "resistance" (mostly swing highs), "support" (mostly swing lows), or "mixed". Pools act as magnets: the more "touches" at a level, the more likely price will react there.
The RTR score starts at 100 and is decremented for each pool: penalty = 10.0 × touches / max(distance_factor, 0.1) where distance_factor = distance_to_pool / ATR. Nearby pools with many touches cause the biggest reductions. The rtr_score_long considers pools above the current price (obstacles to upside moves), while rtr_score_short considers pools below (obstacles to downside moves).
There's also an exhaustion measure: exhaustion_pct = clip(intraday_range / ADR_20, 0, 1) where ADR_20 is the average daily range over the past 20 trading days. If today's range already exceeds 70% of the 20-day average, the market may be running out of energy. When exhaustion exceeds 70%, an additional penalty is applied to both RTR scores: penalty = (exhaustion_pct - 0.70) × 100. The module also tracks obstacle_count_long and obstacle_count_short: the raw count of pools in each direction.
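Both penalty rules can be combined into one sketch (the pool representation and the clamp at zero are assumptions; the formulas are from the text):

```python
def rtr_score(pools, price: float, atr: float, exhaustion_pct: float = 0.0) -> float:
    """Room-to-right score for one direction. `pools` is a list of
    (price_level, touches) tuples on the relevant side of price."""
    score = 100.0
    for level, touches in pools:
        distance_factor = abs(level - price) / atr
        score -= 10.0 * touches / max(distance_factor, 0.1)
    if exhaustion_pct > 0.70:                      # exhaustion penalty kicks in
        score -= (exhaustion_pct - 0.70) * 100
    return max(score, 0.0)

# One pool 2 ATRs away with 3 touches: penalty = 10 * 3 / 2 = 15.
assert rtr_score([(110.0, 3)], price=100.0, atr=5.0) == 85.0
```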
Not every 5-minute bar is worth evaluating for a trade. The pre-screen is a simple AND filter: all four conditions must pass.
```
pre_screen_passed = (
    ifvg_count_active >= 2                # At least 2 active IFVGs
    AND abs(trend_score) >= 0.3           # Clear enough directional bias
    AND max(rtr_long, rtr_short) >= 30.0  # Enough room to move
    AND in_trading_window == True         # Within our 60-min window
)
```
The pre-screen dramatically reduces the search space. Instead of asking Claude to assess thousands of candles per session, we only assess the ones where the numerical features already suggest a potential opportunity. This saves both money (API calls) and training time.
Within the trend module, there's a displacement detector that identifies aggressive candles - ones where institutions are likely driving price. A candle qualifies as a "displacement" if both conditions hold: the body fills >70% of the total range (it's decisive, not indecisive) and the range exceeds 1.5× ATR (it's large relative to recent volatility). Displacement candles are the building blocks of FVGs and trend structure changes.
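The two displacement conditions restated as a sketch (not the actual detector's API):

```python
def is_displacement(o: float, h: float, l: float, c: float, atr: float) -> bool:
    """A candle is a displacement if its body fills >70% of the range
    AND the range exceeds 1.5x ATR."""
    rng = h - l
    body = abs(c - o)
    return rng > 0 and body / rng > 0.70 and rng > 1.5 * atr

assert is_displacement(100, 110, 99.5, 109.5, atr=5)   # big, decisive candle
assert not is_displacement(100, 103, 99, 101, atr=5)   # range too small vs ATR
```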
The run_feature_pipeline function orchestrates all six modules in sequence:
1. `build_multi_timeframe(candles_5m)`: aggregates 5m bars to M15, H4, and D1 candles.
2. `compute_trend_features(candles_5m, htf_frames)`: computes EMAs, slopes, structure, ATR, and displacement per timeframe, then joins to 5m via `join_asof` with backward strategy (forward-fills the most recent higher-timeframe value).
3. `compute_ifvg_features(result, tick_size)`: detects FVGs, tracks inversions, scores quality, outputs 5 features per bar. Also computes structural stop levels from IFVG zone boundaries for each direction.
4. `compute_session_features(result, instrument)`: adds session-aware columns: minutes since open, window position, overnight range, prior session levels.
5. `compute_rtr_features(result)`: detects liquidity pools, computes room-to-right scores and exhaustion metrics. Also computes structural target levels from the nearest liquidity pools for each direction.
6. `compute_structural_rr(result)`: computes risk-reward ratios from IFVG-derived stop levels and liquidity-pool-derived target levels. Falls back to ATR-based distances when structural levels are unavailable. Outputs `structural_rr_long` and `structural_rr_short` per bar.
7. `apply_pre_screen(result)`: applies the AND filter, adds the `pre_screen_passed` boolean column.

The output is a Polars DataFrame with the original 5m OHLCV data plus all computed feature columns - roughly 45 additional columns (including structural stop/target levels and R:R ratios). This enriched DataFrame is saved as a Parquet file for downstream consumption by the LLM assessment generator and the RL training pipeline.
The llm/ package uses Anthropic's Claude to add qualitative intelligence that's hard to capture with math alone. Think of it as a junior analyst who reviews each setup and gives a structured opinion.
Some market patterns are easy to quantify (a 50-bar moving average crossing above a 200-bar one). Others are harder: "this looks like a failed breakdown that's about to squeeze higher" or "this choppy price action suggests institutional accumulation." Experienced traders see these patterns intuitively. An LLM can encode some of that pattern recognition into features that the RL agent can learn from.
The LLM is not making trading decisions. It's providing 6 additional features to the observation vector. The RL agent decides what to do with them.
For each pre-screened bar, we build a compact prompt whose core is the recent price action: candles as CSV rows `ts,O,H,L,C,V`, with timestamps as `HH:MM`.

We don't want free-text responses. We use Claude's tool use API with `tool_choice={"type": "tool", "name": "submit_assessment"}` to force the output into a predefined JSON schema. The `LLMAssessment` schema has 12 structured fields:
| Field | Type | Range | What It Captures |
|---|---|---|---|
| `setup_type` | Enum | 5 values | Bullish/bearish reversal, continuation, or no_setup |
| `confidence` | float | [0, 1] | How confident the LLM is in its assessment |
| `ifvg_quality` | Literal | high/med/low | LLM's independent quality assessment of the IFVG |
| `trend_alignment` | dict | 5 values/tf | Bullish/bearish/neutral alignment per timeframe |
| `room_to_right_estimate` | float | [0, 100] | LLM's estimate of how much room price has |
| `risk_reward_estimate` | float | ≥ 0 | Expected reward per unit of risk |
| `narrative` | str | free text | Brief market narrative |
| `concerns` | list[str] | 0–5 items | Risk factors the LLM identifies |
| `regime` | Enum | 4 values | trending_day, choppy, event_driven, low_liquidity |
| `narrative_sentiment` | float | [-1, 1] | Sentiment polarity of the narrative |
| `nearest_target` | float | price | Most likely profit target level |
| `nearest_invalidation` | float | price | Price where the thesis breaks down |
Neural networks eat numbers, not strings. The encode_assessment function converts the LLM's structured output into 6 floats, all scaled to [0, 1] or [-1, 1]:
```python
llm_confidence          = confidence                     # [0, 1]
llm_setup_type          = setup_type_index / 4.0         # [0, 1] (5 types → 0..4)
llm_rr_estimate         = min(rr_estimate / 5.0, 1.0)    # [0, 1]
llm_regime              = regime_index / 3.0             # [0, 1] (4 regimes → 0..3)
llm_concern_count       = min(len(concerns) / 5.0, 1.0)  # [0, 1]
llm_narrative_sentiment = narrative_sentiment            # [-1, 1]
```
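A runnable version of the same encoding, with the enum orderings as explicit lists (the orderings here are assumptions; only the scaling formulas come from the text):

```python
# Assumed orderings for the two enums (illustrative).
SETUP_TYPES = ["no_setup", "bullish_reversal", "bearish_reversal",
               "bullish_continuation", "bearish_continuation"]
REGIMES = ["trending_day", "choppy", "event_driven", "low_liquidity"]

def encode_assessment(a: dict) -> list[float]:
    """Convert a structured assessment into 6 floats in [0, 1] or [-1, 1]."""
    return [
        a["confidence"],
        SETUP_TYPES.index(a["setup_type"]) / 4.0,
        min(a["risk_reward_estimate"] / 5.0, 1.0),
        REGIMES.index(a["regime"]) / 3.0,
        min(len(a["concerns"]) / 5.0, 1.0),
        a["narrative_sentiment"],
    ]

enc = encode_assessment({"confidence": 0.8, "setup_type": "bullish_reversal",
                         "risk_reward_estimate": 10.0, "regime": "choppy",
                         "concerns": ["news"], "narrative_sentiment": -0.2})
assert enc == [0.8, 0.25, 1.0, 1 / 3, 0.2, -0.2]   # R:R capped at 1.0
```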
Calling Claude during RL training would be impossibly slow and expensive. Instead, we pre-generate all assessments once and cache them as Parquet files. The generate_assessments function processes all pre-screened bars in batches of 50, saving checkpoints along the way. It supports resume - if the process crashes, it skips already-cached timestamps. Failed API calls get a null encoding (all zeros).
Two automated checks catch potential problems: confidence saturation (more than 30% of assessments at confidence = 1.0, suggesting the LLM is overconfident) and regime concentration (more than 80% of assessments sharing the same regime, suggesting lack of discrimination).
The client tracks token usage per request and computes cost using keyword-based model identification. The pricing table:
| Model | Input ($/M tokens) | Output ($/M tokens) |
|---|---|---|
| Claude Haiku | $1.00 | $5.00 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Opus | $15.00 | $75.00 |
For bulk assessment generation, we use Claude Haiku to keep costs manageable. The generate_assessments function logs progress with running cost and ETA, so you can see "450/1200 assessments done (37%), $2.14 spent, ETA 18 min" in the console. The client uses exponential backoff for retries (sleep(2^attempt)) but doesn't retry on client errors like 400 or 401.
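The retry policy can be sketched like this (the error taxonomy is simplified - `ValueError` stands in for HTTP 4xx client errors, which the real client inspects by status code):

```python
import time

def with_retries(call, max_attempts: int = 4, sleep=time.sleep):
    """Retry transient failures with exponential backoff (2**attempt),
    never retry client errors."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ValueError:                 # stand-in for a 400/401 client error
            raise                          # not retried
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(2 ** attempt)            # 1s, 2s, 4s, ...

attempts, delays = [], []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient")
    return "ok"

assert with_retries(flaky, sleep=delays.append) == "ok"
assert delays == [1, 2]                    # backoff doubled between attempts
```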
The system prompt loaded from prompt_v1.txt (or prompt_v2.txt) defines the persona and evaluation criteria for the LLM. Version 2 added narrative_sentiment as a new field. Each cached assessment records its prompt_version, so you can retrain with v2 assessments without invalidating v1 data. The resume logic in generate_assessments filters by version when deciding what to skip.
This chapter explains reinforcement learning concepts from scratch. No math prerequisites - just analogies and intuition.
Reinforcement learning is like training a dog. You don't show the dog a million examples of "good behavior" and "bad behavior" (that would be supervised learning). Instead, you let the dog try things, and you give it a treat when it does something good and a stern "no" when it does something bad. Over many repetitions, the dog figures out which behaviors lead to treats.
An episode is one complete run of the game. In our case, one episode is one trading session - say, the Tokyo session on January 15th. The agent sees each 5-minute bar in sequence, decides whether to trade at each bar, and the episode ends when the session closes. Then we reset and start a new episode (maybe the US session on January 15th, or Tokyo on January 16th).
Episodes always end by truncation (the session ran out of time), never by termination (the agent didn't "die"). If the agent has an open position at session end, it's force-closed at the current price. This is realistic - a day trader never holds overnight. The reset logic selects episodes via a seeded random number generator (np.random.default_rng(42)) for reproducibility.
The agent's policy is its strategy - a function that maps observations to actions. In our case, it's a small neural network with 2 hidden layers of 64 neurons each (written in config as policy_net_arch: [64, 64]). The input is 36 numbers (the observation), and the output is a probability distribution over possible actions. At the start of training, this is essentially random. By the end, it should have learned patterns like "when the trend is strongly bullish and there's a fresh IFVG with favorable structural R:R, enter long."
Why such a small network? Trading decisions don't require the deep pattern recognition that image classification does. A [64, 64] MLP has only a few thousand trainable parameters - enough to learn the relationships between 36 input features and 2 output sub-decisions (entry and size), but small enough to train quickly and avoid overfitting. If the network were too large, it might memorize specific market patterns from the training period rather than learning generalizable rules.
A neural network is just a chain of matrix multiplications with nonlinear functions in between. Our [64, 64] MLP (Multi-Layer Perceptron) works like this: take 36 input numbers, multiply by a 36×64 matrix, apply a nonlinear function (ReLU), multiply by a 64×64 matrix, apply ReLU again, then multiply by a 64×output matrix. The "learning" part is adjusting those matrix values so the outputs become useful. It's essentially a very fancy lookup table that can interpolate.
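A back-of-the-envelope parameter count for that shape (the 9-way output head is illustrative; MaskablePPO actually uses separate policy and value heads, which changes the exact total):

```python
def mlp_params(sizes):
    """Parameter count of a fully connected MLP:
    per layer, weights (fan_in * fan_out) plus biases (fan_out)."""
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

# 36 -> 64 -> 64 -> 9: 2,368 + 4,160 + 585 parameters.
assert mlp_params([36, 64, 64, 9]) == 7113   # a few thousand parameters
```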
PPO (Proximal Policy Optimization) is the specific algorithm we use to update the policy. The key idea: don't change your strategy too much in one update. If a few lucky trades make "go all-in on every trade" look great, PPO prevents the agent from swinging wildly to that extreme. It uses a "clip range" (set to 0.2) that limits how much the probability of any action can change in a single update step.
We specifically use MaskablePPO from the sb3-contrib library. The "maskable" part is crucial - it means we can tell the agent "these actions are not allowed right now" and it respects those constraints during both action selection and learning.
| Parameter | Value | Intuition |
|---|---|---|
| `gamma` | 0.99 | Discount factor. Future rewards are worth 99% of present rewards. The agent thinks long-term. |
| `ent_coef` | 0.01 | Entropy coefficient. A small bonus for trying random things. Prevents premature convergence to a boring strategy. |
| `learning_rate` | 3e-4 | Step size for updates. Small enough for stability, large enough to learn in reasonable time. |
| `clip_range` | 0.2 | Maximum allowed change per update. Keeps learning stable. |
| `n_steps` | 2,048 | Play 2,048 steps, then study what happened and update the policy. |
| `batch_size` | 256 | Review 256 steps at a time during each update pass. |
The agent can't do whatever it wants. Action masking enforces rules like "you can't enter a new trade when you're already in one", "you can't trade if you've hit the daily loss limit", and "you can't enter a direction where the structural risk-reward is unfavorable." This is implemented as a binary mask - an array of 6 booleans (one per sub-action option across the 2 dimensions) where False means "this option is blocked."
Action masking is what makes RL practical for trading. Without it, the agent would waste millions of training steps trying to learn rules we already know (like "don't enter a second trade while the first is still open"). By encoding constraints as masks, the agent focuses its learning capacity on the hard question: when and how to trade.
The rl/ package implements the Gymnasium environment, observation builder, action space, position simulator, and reward function. This is where all the pieces come together.
Every 5-minute bar, the agent sees exactly 36 numbers, all scaled to the range [-1, 1]. Why? Neural networks learn best when inputs are small and centered near zero. Here they are, grouped by source:
- LLM features (6): `llm_confidence`, `llm_setup_type`, `llm_rr_estimate`, `llm_regime`, `llm_concern_count`, `llm_narrative_sentiment`
- IFVG features (5): `ifvg_count_active / 10`, `ifvg_nearest_dist / ATR`, `ifvg_best_quality / 3`, `ifvg_avg_fill_pct`, `ifvg_direction_bias`
- Trend features (7): `trend_score`, `ema20_slope_15m`, `ema20_slope_4h`, `ema20_slope_1d`, `structure_15m`, `structure_4h`, `structure_1d`
- Session features (5): `minutes_since_open / 480`, `window_progress_pct`, `overnight_range / ATR`, `prior_session_range / ATR`, `in_trading_window`
- RTR features (4): `rtr_score_long / 100`, `rtr_score_short / 100`, `exhaustion_pct`, `exhaustion_flag`
- Structural R:R (2): `structural_rr_long / 5`, `structural_rr_short / 5`
- Bar features (3): `vol_ratio / 5`, `bar_range_norm / ATR`, `bar_body_ratio`
- Position state (4): `in_position` (0/1), `unrealized_pnl_r / 5`, `daily_pnl_r / 5`, `trades_today / 10`

String-valued features get mapped to numbers: slopes become {rising: 1.0, flat: 0.0, falling: -1.0}; structures become {uptrend: 1.0, ranging: 0.0, downtrend: -1.0}. Distances are divided by ATR (Average True Range) to make them volatility-adaptive - "20 ticks away" means something very different on a calm day versus a volatile one.
The action space is MultiDiscrete([3, 3]): two independent sub-decisions, each with 3 options. That's 3 × 3 = 9 possible combinations. Stop-loss and take-profit are not agent decisions - they're derived from market structure (explained below).
Rather than letting the agent choose stop and target distances (which adds combinatorial complexity and leads to over-fitting), the system derives them from market structure:
For a long entry, the stop is placed just below the nearest active IFVG zone boundary below the current price. For a short entry, just above the nearest zone above. These are natural invalidation levels - if price breaks through an IFVG zone, the thesis is likely wrong.
Fallback: when no structural level is available, the stop defaults to 1.5 × ATR from the entry price.
For a long entry, the target is the nearest liquidity pool (cluster of swing highs) above the current price. For a short entry, the nearest pool below. Price tends to gravitate toward these clusters, making them natural profit-taking zones.
Fallback: when no liquidity pool is found, the target defaults to 2.0R (twice the risk distance).
R-multiples measure reward relative to risk. If your stop-loss is 10 ticks away (your risk = 10 ticks), a 2R target is 20 ticks of profit, 3R is 30 ticks. This makes results comparable across instruments with different price scales. Structural R:R is the ratio of target distance to stop distance derived from these market-structure levels. A structural R:R of 2.5 means the nearest liquidity pool is 2.5 times farther away than the nearest IFVG zone boundary.
Certain actions are blocked based on the current state. The mask is a flat array of 6 booleans (3 for entry + 3 for size). Entry in a given direction is masked (only skip is allowed) when its conditions fail - for example, long entry is blocked if `structural_rr_long < 1.5`, and short entry is blocked if `structural_rr_short < 1.5`. This prevents the agent from entering trades where the market structure doesn't offer a favorable risk-reward ratio.

When the agent enters a trade, we simulate it with 1-tick slippage (you always get a slightly worse fill than the ideal price - this is realistic). Each subsequent bar, we check: did the price hit the stop-loss? (Checked first - conservative assumption.) Did it hit the take-profit target? We track MAE (Maximum Adverse Excursion - worst drawdown during the trade) and MFE (Maximum Favorable Excursion - best unrealized profit during the trade). Commissions are realistic: $1.25 per contract for NQ, 80 JPY per contract for NIY.
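Building the mask can be sketched as follows (the entry ordering [skip, long, short] and the exact condition set are assumptions based on the rules named in the text):

```python
def action_mask(in_position: bool, daily_loss_hit: bool, in_window: bool,
                rr_long: float, rr_short: float) -> list[bool]:
    """6 booleans: 3 entry options + 3 size options. False = blocked."""
    can_enter = not in_position and not daily_loss_hit and in_window
    entry = [True,                               # skip is always allowed
             can_enter and rr_long >= 1.5,       # long entry
             can_enter and rr_short >= 1.5]      # short entry
    size = [True, True, True]                    # size options left unmasked here
    return entry + size

mask = action_mask(False, False, True, rr_long=2.0, rr_short=1.0)
assert mask == [True, True, False, True, True, True]   # short blocked by low R:R
```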
The trailing stop has a special mechanism: when the trade's MFE reaches 1R (i.e., the trade has moved one risk-unit in your favor), the stop automatically moves to breakeven (entry price). This locks in a risk-free trade. The stop_moved_to_breakeven flag ensures this happens only once per position.
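The breakeven move described above fits in a few lines. This sketch assumes MFE and risk are both expressed in ticks; `maybe_move_to_breakeven` is an illustrative name, not the actual function.

```python
def maybe_move_to_breakeven(entry_price, stop_price, mfe_ticks, risk_ticks, moved):
    """Once MFE reaches 1R, move the stop to entry; the flag ensures
    this fires only once per position."""
    if not moved and risk_ticks > 0 and mfe_ticks >= risk_ticks:
        return entry_price, True   # stop now at breakeven, flag set
    return stop_price, moved
```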
A completed trade produces a CompletedTrade dataclass with detailed statistics: entry/exit prices, P&L in ticks, risk in ticks, the realized R-multiple (P&L / risk, net of commission), whether it hit the target or stop, bars held, MAE, MFE, commission in the instrument's currency, and the exit reason ("stop", "target", or "session_end"). The realized_rr formula accounts for commissions by converting them to ticks: commission_ticks = commission / (size × tick_size × point_value), then realized_rr = (pnl_ticks - commission_ticks) / risk_ticks.
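The realized_rr formula translates directly to code. The example values use NQ's real contract specs (tick size 0.25, point value $20, so $5 per tick per contract) with the $1.25 commission quoted earlier.

```python
def realized_rr(pnl_ticks, risk_ticks, commission, size, tick_size, point_value):
    """Net R-multiple: convert commission to ticks, subtract from P&L,
    divide by risk - exactly the formula described in the text."""
    commission_ticks = commission / (size * tick_size * point_value)
    return (pnl_ticks - commission_ticks) / risk_ticks
```

For an NQ trade that gains 20 ticks against a 10-tick risk, the $1.25 commission costs 0.25 ticks, so the realized R-multiple is (20 − 0.25) / 10 = 1.975 rather than a clean 2.0.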
The reward has three components:
The trade reward is the realized R-multiple of the closed trade, with a 0.3 bonus for hitting the target (encouraging the agent to let winners run rather than cutting them short). The step penalty accelerates when the daily P&L drops below -2R, punishing the agent for digging deeper into a losing day. The patience bonus is tiny (0.001) but crucial - it gives the agent a small positive reward for not trading when there's no position. Without this, the agent would see "skip" as a zero-reward action and prefer to always enter trades.
Reward shaping is the art of RL. The patience bonus is a great example: without it, agents often over-trade (entering 5 trades per session, most of them losers). With it, they learn to wait for quality setups. A 0.001 bonus per skip step adds up to ~0.012 per hour of patience (12 five-minute bars) - enough to shift behavior without overwhelming the trade reward signal.
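A minimal sketch of how the three components could combine. The source gives only the 0.3 target bonus, the -2R threshold, and the 0.001 patience bonus; the step-penalty magnitude and its acceleration factor (`step_penalty`, `accel`) are illustrative assumptions, and `compute_reward` is a hypothetical name.

```python
def compute_reward(closed_rr, hit_target, daily_pnl_r, in_position,
                   step_penalty=0.0005, accel=5.0):
    """closed_rr is None unless a trade closed on this step."""
    reward = 0.0
    if closed_rr is not None:
        reward += closed_rr + (0.3 if hit_target else 0.0)  # trade reward + target bonus
    penalty = step_penalty
    if daily_pnl_r < -2.0:
        penalty *= accel          # penalty accelerates below -2R daily P&L
    reward -= penalty
    if not in_position and closed_rr is None:
        reward += 0.001           # patience bonus for sitting out
    return reward
```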
Training, evaluation, baselines, and the criteria for deciding whether a model is good enough.
Training follows the standard PPO loop: collect experience, compute advantages, update the policy network. In concrete terms:
1 million timesteps sounds like a lot, but each "timestep" is just one 5-minute bar (12 bars per hour). At a few hundred bars per session and ~250 trading days per year, 1 million timesteps works out to roughly 3,000 simulated sessions - about 12 years of daily trading compressed into a few hours of compute time.
The training hyperparameters are tuned for this specific problem:
```yaml
# TrainingConfig (common/config.py)
total_timesteps: 1_000_000 # Total experience to collect
learning_rate: 3e-4        # Adam optimizer step size
n_steps: 2_048             # Steps per rollout buffer
batch_size: 256            # Mini-batch size for updates
gamma: 0.99                # Discount factor (long-term focus)
clip_range: 0.2            # PPO clipping parameter
ent_coef: 0.01             # Entropy bonus for exploration
checkpoint_freq: 50_000    # Save every 50K steps
eval_freq: 50_000          # Evaluate every 50K steps
max_daily_loss_r: 3.0      # Daily loss limit (in R-multiples)
max_trades_per_session: 5  # Max trades before forced skip
policy_net_arch: [64, 64]  # Hidden layers
```
During training, two callbacks run automatically:
- A checkpoint callback saves the model to models/checkpoints/ every 50K steps. If training crashes, you can resume from the latest checkpoint.
- An evaluation callback runs the policy with deterministic=True, logs metrics, and saves the model to models/best/best_model.zip whenever the Sharpe ratio beats the previous best.

We save the best model by Sharpe ratio, not by total reward. A model with high total reward might just be entering many trades in one lucky period. The Sharpe ratio measures return per unit of risk, which is what we actually care about. A Sharpe of 2.0 means the model's mean return sits 2 standard deviations above zero - consistently profitable, not just occasionally lucky.
A model must pass all three criteria to be "promoted":
Sharpe ratio = (mean daily return / std of daily returns) × √252. The √252 annualizes it (252 trading days/year). A Sharpe ≥ 1.0 means the annualized return is at least as large as its annualized volatility - the strategy earns more than its typical swings cost. Max drawdown is the worst peak-to-trough decline: if your cumulative P&L goes from +8R to +3R, that's a 5R drawdown. A win rate of 40% means 4 out of 10 trades are profitable - which is fine if your winners are larger than your losers (which the R-multiple system encourages).
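Both metrics are simple to compute from per-day returns (in R). A stdlib-only sketch, with the drawdown check reproducing the example from the text (+8R peak falling to +3R is a 5R drawdown):

```python
import math

def sharpe(daily_returns):
    """Annualized Sharpe: mean/std of daily returns, scaled by sqrt(252)."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in daily_returns) / n)
    return (mean / std) * math.sqrt(252) if std > 0 else 0.0

def max_drawdown(r_multiples):
    """Worst peak-to-trough decline of cumulative R across a trade sequence."""
    peak = equity = worst = 0.0
    for r in r_multiples:
        equity += r
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst
```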
Additional metrics tracked but not used as promotion gates: avg_rr (average R-multiple per trade), profit_factor (gross profit / gross loss - infinity if no losses), total_r (sum of all R-multiples across all evaluated episodes), and trades_per_session (helps spot over-trading).
The evaluation function runs model.predict(obs, deterministic=True, action_masks=masks) for up to 500 steps per episode and collects all CompletedTrade objects. Setting deterministic=True means the agent picks the most probable action (no exploration noise), which gives a cleaner picture of what the policy has actually learned versus what it stumbles into randomly.
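Deterministic masked prediction reduces to "the most probable action among those the mask allows". A dependency-free sketch of that selection rule, mirroring what deterministic=True with action masks amounts to (`masked_argmax` is an illustrative helper, not the library's API):

```python
def masked_argmax(action_probs, mask):
    """Return the index of the highest-probability action whose mask
    entry is True; None if the mask blocks everything."""
    best, best_p = None, -1.0
    for i, (p, ok) in enumerate(zip(action_probs, mask)):
        if ok and p > best_p:
            best, best_p = i, p
    return best
```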
How do you know if your trained agent is any good? Compare it against baselines that require no learning:
- Random baseline: picks random valid actions (respecting masks). The absolute floor - if your trained agent can't beat random, something is fundamentally wrong.
- Always-long baseline: always goes long with 1 contract. Stop and target are set by structural levels (or the ATR fallback). Tests whether "just being in the market" is profitable (it usually isn't).
- Rule-based baseline: go long if trend > 0.3 and the IFVG bias is bullish; go short if trend < -0.3 and the bias is bearish; skip otherwise. The "can an if-statement do this?" baseline.
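The rule-based baseline really is just an if-statement. A sketch using the thresholds from the text (the string-valued `ifvg_bias` encoding is an assumption for illustration):

```python
def rule_based_action(trend, ifvg_bias):
    """The 'can an if-statement do this?' baseline: trend threshold
    plus IFVG bias agreement, otherwise skip."""
    if trend > 0.3 and ifvg_bias == "bullish":
        return "long"
    if trend < -0.3 and ifvg_bias == "bearish":
        return "short"
    return "skip"
```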
To measure whether the LLM features actually help, we train a separate model with ablate_llm=True. This zeros out the first 6 observation features (the LLM-derived ones) by setting obs[:LLM_FEATURE_COUNT] = 0.0 at each step. The agent sees only the 30 numerical features from IFVG, trend, session, RTR, structural R:R, microstructure, and portfolio modules. If the full model significantly outperforms the ablated one on the same validation set, the LLM assessment adds measurable value. This is a controlled experiment - everything else stays identical, so any performance difference is attributable to the LLM features.
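The ablation mechanism is a one-line mutation of the observation vector, applied before the agent sees it. A sketch (the wrapper name `maybe_ablate` is illustrative; the slice matches the obs[:LLM_FEATURE_COUNT] = 0.0 described above):

```python
import numpy as np

LLM_FEATURE_COUNT = 6  # the first 6 observation slots are LLM-derived

def maybe_ablate(obs, ablate_llm):
    """Zero out the LLM features in place when running the ablation,
    leaving the other 30 numerical features untouched."""
    if ablate_llm:
        obs[:LLM_FEATURE_COUNT] = 0.0
    return obs
```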
We split data chronologically, never randomly:
| Split | Period | Purpose |
|---|---|---|
| Train | ≤ June 2024 | The agent learns from this data |
| Validation | Jul – Dec 2024 | Used during training to select the best model |
| Test | > Dec 2024 | Final evaluation, never seen during training |
In financial data, random shuffling causes data leakage. If a Friday candle is in the test set but the surrounding Thursday and Monday candles are in training, the model has implicitly "seen" the test data. Chronological splits ensure the model is always predicting the future from the past - exactly what it would do in production.
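The chronological split itself is just date filtering - no shuffling anywhere. A stdlib sketch using the boundaries from the table (the bar dicts and their ts field are illustrative):

```python
from datetime import date

def split_chronologically(bars):
    """Three strictly chronological buckets: train <= Jun 2024,
    validation Jul-Dec 2024, test from Jan 2025 on."""
    train = [b for b in bars if b["ts"] <= date(2024, 6, 30)]
    val = [b for b in bars if date(2024, 7, 1) <= b["ts"] <= date(2024, 12, 31)]
    test = [b for b in bars if b["ts"] > date(2024, 12, 31)]
    return train, val, test
```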
Let's trace one data point's lifecycle from raw CSV to policy update. This is where all eight chapters converge.
A semicolon-delimited CSV from HistData: 20240115 003000;38250;38265;38245;38260;127. One row per minute, columns: timestamp, open, high, low, close, volume.
Polars' group_by_dynamic("timestamp", every="5m") combines 5 one-minute rows into one candle: first open, max high, min low, last close, summed volume.
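Polars does this in a single group_by_dynamic call; the same aggregation can be made explicit in plain Python. This stdlib-only sketch assumes the minute bars arrive sorted and carry a `minute` field (minutes since midnight) - both illustrative simplifications:

```python
def resample_5m(minute_bars):
    """Bucket 1-minute bars into 5-minute candles: first open, max high,
    min low, last close, summed volume - mirroring group_by_dynamic."""
    buckets = {}
    for b in minute_bars:
        key = b["minute"] // 5          # 5-minute bucket index
        if key not in buckets:
            buckets[key] = dict(open=b["open"], high=b["high"], low=b["low"],
                                close=b["close"], volume=b["volume"])
        else:
            agg = buckets[key]
            agg["high"] = max(agg["high"], b["high"])
            agg["low"] = min(agg["low"], b["low"])
            agg["close"] = b["close"]   # last close wins
            agg["volume"] += b["volume"]
    return [buckets[k] for k in sorted(buckets)]
```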
OHLC relationship checks pass (low ≤ open ≤ high, etc.). No spike detected (<5% change from previous bar). Volume is positive. The bar survives validation.
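The three validation checks can be sketched as a single predicate (`bar_is_valid` is an illustrative name; the real checks live in data/quality.py):

```python
def bar_is_valid(bar, prev_close, max_spike_pct=0.05):
    """OHLC ordering, <5% spike vs the previous close, positive volume."""
    o, h, l, c, v = bar["open"], bar["high"], bar["low"], bar["close"], bar["volume"]
    if not (l <= o <= h and l <= c <= h):
        return False  # OHLC relationship violated
    if prev_close is not None and abs(c - prev_close) / prev_close >= max_spike_pct:
        return False  # spike: >=5% jump from the previous bar
    return v > 0
```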
The validated bar (checks live in data/quality.py) is upserted into the candles_5m hypertable with the instrument symbol and contract month. The BackfillService commits this batch and moves on to the next 30-day window.
The 5m bars are rolled up to 15-minute, 4-hour, and daily candles using the same OHLCV aggregation logic. These are needed for multi-timeframe trend analysis.
The feature pipeline (features/pipeline.py) runs in sequence: trend features (EMAs, slopes, structure), IFVG detection and scoring (including structural stop levels from zone boundaries), session features (overnight range, window position), room-to-right (liquidity pools, exhaustion, and structural target levels), and structural R:R ratios (risk-reward from IFVG stops to liquidity pool targets). Higher-timeframe features are joined to 5m bars via join_asof.
This bar has 3 active IFVGs, a trend score of +0.62, an RTR long score of 45, and it's within the trading window. All four conditions pass: pre_screen_passed = True.
Claude receives the last 50 candles as compact CSV plus all feature summaries. It returns a structured assessment: setup_type=bullish_continuation, confidence=0.72, regime=trending_day, narrative_sentiment=0.4, one concern about overhead resistance.
The structured output becomes 6 floats: [0.72, 0.75, 0.44, 0.0, 0.20, 0.40]. These are cached to Parquet alongside the timestamp and prompt version for reproducibility.
The 6 LLM + 5 IFVG + 7 trend + 5 session + 4 RTR + 2 structural R:R + 3 microstructure + 4 portfolio = 36 numbers, all clipped to [-1, 1]. This is what the agent sees.
The observation is assembled in rl/obs.py. The policy network (36 → 64 → 64 → output) processes it and outputs action probabilities. With action masking applied (including the structural R:R gate), it selects [1, 0]: enter long, 1 contract. The stop-loss is placed at the nearest IFVG zone boundary below price; the take-profit target at the nearest liquidity pool above. The structural R:R for this setup is 2.1, well above the 1.5 minimum.
Entry fills at close + 1 tick slippage. Over the next 8 bars, the trade hits its 2R target. Realized RR after commission: +1.87. Trade reward: 1.87 + 0.3 (target bonus) = +2.17.
The fill simulation and reward computation live in rl/position.py and rl/reward.py. This experience joins 2,047 other steps in the rollout buffer. PPO computes advantages, then updates the network weights in mini-batches of 256. The probability of "enter long when trend is bullish and IFVGs are active" increases slightly.
The training loop itself lives in rl/train.py. When training completes, save_model writes four files to models/{timestamp}/: the model weights (model.zip), a metadata.json (instrument, training duration, git hash), eval_metrics.json (Sharpe, drawdown, win rate, profit factor), and a config_snapshot.yaml (a frozen copy of the config used for training). The best model is also saved to models/best/best_model.zip.
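The artifact layout is easy to picture in code. A stdlib-only sketch of the side files (the real save_model also writes model.zip and a YAML config snapshot; JSON stands in for YAML here to stay dependency-free, and `save_model_artifacts` is a hypothetical name):

```python
import json
import tempfile
from pathlib import Path

def save_model_artifacts(out_dir, metadata, metrics, config):
    """Write the per-run metadata, eval metrics, and config snapshot
    alongside the model weights, as described in the text."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    (out_dir / "eval_metrics.json").write_text(json.dumps(metrics, indent=2))
    (out_dir / "config_snapshot.json").write_text(json.dumps(config, indent=2))

# Demo: write one run's artifacts to a temporary directory
run_dir = Path(tempfile.mkdtemp())
save_model_artifacts(run_dir, {"instrument": "NQ"}, {"sharpe": 1.2}, {"gamma": 0.99})
saved = sorted(p.name for p in run_dir.iterdir())
```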
list_models scans the models directory and returns all saved runs sorted by timestamp, making it easy to compare experiments.
Change one value in config/default.yaml and it ripples through the entire pipeline. Here are some examples of how single config changes affect the whole system:
- features.pre_screen_min_trend: 0.5 (up from 0.3) → fewer bars pass pre-screen → fewer LLM calls → fewer training episodes → a more selective but potentially undertrained agent.
- training.ent_coef: 0.05 (up from 0.01) → the agent explores more random actions during training → slower convergence but a wider strategy search.
- features.ifvg_min_gap_ticks: 8 (up from 4) → only large FVGs are detected → fewer active IFVGs per bar → pre-screen becomes harder to pass.
- training.gamma: 0.95 (down from 0.99) → the agent discounts future rewards more heavily → prefers quick trades over patient setups.
- features.ema_fast: 10 (down from 20) → the fast EMA reacts quicker → more slope changes detected → potentially noisier trend signals.

The config snapshot saved with each model makes experiments reproducible. You can always answer "what settings produced this model?" by reading config_snapshot.yaml in the model directory.
For reference, here are the key configurable thresholds that shape the feature pipeline:
```yaml
# FeatureConfig defaults (common/config.py)
ifvg_min_gap_ticks: 4.0      # Minimum gap size to detect an FVG
ifvg_max_age_bars: 100       # Bars before an IFVG expires
ema_fast: 20                 # Fast EMA period
ema_slow: 50                 # Slow EMA period
atr_period: 14               # ATR lookback period
displacement_body_pct: 0.70  # Min body ratio for displacement candle
displacement_atr_mult: 1.5   # Min range as multiple of ATR
rtr_lookback_days: 20        # Days for liquidity pool detection
pre_screen_min_ifvgs: 2      # Min active IFVGs to pass pre-screen
pre_screen_min_trend: 0.3    # Min |trend_score| to pass pre-screen
pre_screen_min_rtr: 30.0     # Min RTR score to pass pre-screen

# Structural SL/TP defaults (rl/env.py EnvConfig)
stop_atr_fallback_mult: 1.5  # ATR multiple for stop when no IFVG zone
target_fallback_r: 2.0       # R-multiple for target when no liquidity pool
min_structural_rr: 1.5       # Minimum R:R to allow entry (action mask gate)
```
The system has two modules shown with dashed borders in the architecture diagram - they exist as stubs, ready for implementation:
- Live execution (execution/) - connects to Interactive Brokers via their API on port 4002. Takes the agent's action and converts it into real orders with proper position sizing, bracket orders (stop + target), and session-end flatten logic.
- Monitoring (monitoring/) - a Telegram bot for real-time alerts (trade entries, exits, daily P&L summaries) and health checks (is IB connected? Is data flowing? Is the model loaded?).

The transition from offline to live trading is the next frontier. All the infrastructure is in place: the config system supports ib.readonly=False, the position simulator can be swapped for real order management, and the evaluation metrics define what "good enough" looks like. The gap between simulation and reality is always non-trivial, but the architecture was designed with this transition in mind.
The most important thing about this system isn't any single module - it's how they compose. Raw data becomes features, features become observations, observations become actions, actions become rewards, and rewards shape the policy. Every module has a clear interface and a single responsibility. Understanding this flow is understanding the system.