Dashboard
① RL Parameters
Learning Rate (α)
0.10
Q-table update step size
Discount Factor (γ)
0.95
future reward weighting
Exploration ε₀
1.0
initial exploration rate
Episodes
200
training episodes (patient encounters)
② Bellman Equation
Q(s,a) ← Q(s,a) + α·δ
δ = r + γ·max_a′ Q(s′,a′) − Q(s,a)
δ is the temporal-difference (TD) error
Policy π(s) = argmax_a Q(s,a)
State space:
MAP × Lactate (5 × 5 bins) = 25 states
Actions:
↑ dose · hold · ↓ dose
Episode:
—
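The update rule in the panel above can be sketched as a minimal tabular Q-learning loop. The environment (`step`), its reward, and the multiplicative ε-decay schedule are assumptions for illustration; only the parameters (α = 0.10, γ = 0.95, ε₀ = 1.0, 200 episodes), the 25-state / 3-action layout, and the update equations come from the dashboard.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 25, 3          # MAP × Lactate grid; ↑ dose / hold / ↓ dose
ALPHA, GAMMA = 0.10, 0.95            # learning rate α, discount factor γ
EPS0, EPISODES = 1.0, 200            # initial exploration ε₀, training episodes

Q = np.zeros((N_STATES, N_ACTIONS))

def step(state, action):
    """Hypothetical stand-in environment: returns (next_state, reward, done)."""
    next_state = int(rng.integers(N_STATES))
    reward = float(rng.normal())
    return next_state, reward, bool(rng.random() < 0.1)

eps = EPS0
for ep in range(EPISODES):
    s, done = int(rng.integers(N_STATES)), False
    while not done:
        # ε-greedy action selection
        a = int(rng.integers(N_ACTIONS)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # δ = r + γ·max_a′ Q(s′,a′) − Q(s,a);  Q(s,a) ← Q(s,a) + α·δ
        delta = r + GAMMA * (0.0 if done else Q[s2].max()) - Q[s, a]
        Q[s, a] += ALPHA * delta
        s = s2
    eps *= 0.98                      # assumed multiplicative ε decay (schedule not shown)

policy = Q.argmax(axis=1)            # π(s) = argmax_a Q(s,a)
```

The greedy policy extracted on the last line is what "Policy Stability" would track: how often argmax_a Q(s,a) stops changing between episodes.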
▶ Train Agent
Reset
③ Agent Performance
Mean Survival Reward
—
Survival Rate
—
Final ε
—
Avg |TD Error|
—
Policy Stability
—
④ Training Curve — cumulative reward per episode (green) · ε-exploration decay (amber) · TD error (red) · rolling average (bright green)
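Two of the plotted curves can be reproduced directly from the training history. This is a sketch under assumptions: the multiplicative decay rate (0.98) and the rolling window (10 episodes) are illustrative choices, not values stated by the dashboard.

```python
import numpy as np

def epsilon_schedule(eps0=1.0, decay=0.98, episodes=200):
    """Assumed multiplicative ε decay (the amber curve)."""
    return eps0 * decay ** np.arange(episodes)

def rolling_mean(rewards, window=10):
    """Rolling average of per-episode cumulative reward (the bright-green curve)."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode="valid")

eps_curve = epsilon_schedule()
smoothed = rolling_mean(np.arange(20))
```

With `mode="valid"`, the smoothed curve is `window − 1` points shorter than the raw reward curve, which is why rolling averages are usually plotted starting a few episodes in.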