Dashboard
① RL Parameters
Learning Rate (α): 0.10
Q-table update step size
Discount Factor (γ): 0.95
future reward weighting
Exploration ε₀: 1.0
initial exploration rate
Episodes: 200
training episodes (patient encounters)
② Bellman Equation
Q(s,a) ← Q(s,a) + α·δ
δ = r + γ·max_a′ Q(s',a') − Q(s,a)
δ: temporal-difference (TD) error
Policy π(s) = argmax_a Q(s,a)
State space: MAP × Lactate = 25 states
Actions: ↑ dose · hold · ↓ dose
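The update in panel ② can be sketched as tabular Q-learning with ε-greedy exploration over the 25-state, 3-action space above. The environment dynamics, reward values, episode-termination probability, and the per-episode ε decay rate are all placeholder assumptions, not the dashboard's actual simulator:

```python
import numpy as np

rng = np.random.default_rng(0)

# 25 states (MAP × Lactate grid), 3 actions: 0 = ↑ dose, 1 = hold, 2 = ↓ dose
N_STATES, N_ACTIONS = 25, 3
ALPHA, GAMMA = 0.10, 0.95        # learning rate α, discount factor γ (panel ①)
EPISODES, EPS0 = 200, 1.0        # training episodes, initial exploration ε₀
EPS_DECAY = 0.98                 # assumed multiplicative per-episode decay

Q = np.zeros((N_STATES, N_ACTIONS))

def step(s, a):
    """Hypothetical stand-in environment: random transition, toy reward."""
    s_next = int(rng.integers(N_STATES))
    r = 1.0 if s_next > N_STATES // 2 else -0.1
    done = rng.random() < 0.05   # assumed chance the encounter ends
    return s_next, r, done

eps = EPS0
for ep in range(EPISODES):
    s, done = int(rng.integers(N_STATES)), False
    while not done:
        # ε-greedy action selection
        a = int(rng.integers(N_ACTIONS)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # TD error δ = r + γ·max_a′ Q(s',a') − Q(s,a), zero bootstrap at terminal
        delta = r + GAMMA * Q[s_next].max() * (not done) - Q[s, a]
        Q[s, a] += ALPHA * delta  # Q(s,a) ← Q(s,a) + α·δ
        s = s_next
    eps *= EPS_DECAY             # exploration decay (amber curve, panel ④)

policy = Q.argmax(axis=1)        # π(s) = argmax_a Q(s,a)
```

The greedy policy read off the final Q-table is what panel ③ would score for survival rate and stability.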
③ Agent Performance
Mean Survival Reward
Survival Rate
Final ε
Avg |TD Error|
Policy Stability
④ Training Curve — cumulative reward per episode (green) · ε-exploration decay (amber) · TD error (red) · rolling average (bright green)
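The amber ε-decay and bright-green rolling-average curves in panel ④ can be reproduced from a per-episode reward log; a minimal sketch, assuming multiplicative decay and a 20-episode smoothing window (both assumptions, and the reward log here is synthetic stand-in data):

```python
import numpy as np

EPISODES, EPS0, DECAY = 200, 1.0, 0.98      # decay rate is an assumption
rng = np.random.default_rng(1)

# Synthetic stand-in for the cumulative-reward log (green curve)
rewards = rng.normal(loc=np.linspace(-1.0, 2.0, EPISODES), scale=0.5)

# ε-exploration decay schedule (amber curve)
eps_curve = EPS0 * DECAY ** np.arange(EPISODES)

# Rolling average of rewards (bright-green curve), assumed 20-episode window
WINDOW = 20
rolling = np.convolve(rewards, np.ones(WINDOW) / WINDOW, mode="valid")
```

`mode="valid"` drops the partially filled edges, so the smoothed curve starts WINDOW − 1 episodes in.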