Dashboard
① RL Parameters
Learning Rate (α)
0.10
Q-table update step size
Discount Factor (γ)
0.95
future reward weighting
Exploration ε₀
1.0
initial exploration rate
Episodes
200
training episodes (patient encounters)
② Bellman Equation
Q(s,a) ← Q(s,a) + α·δ
δ = r + γ·max_a′ Q(s′,a′) − Q(s,a)
δ is the temporal-difference (TD) error
Policy π(s) = argmax_a Q(s,a)
State space:
MAP × Lactate (5 × 5 bins) = 25 states
Actions:
↑ dose · hold · ↓ dose
Episode:
—
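The update rule in the panel above can be sketched as a minimal tabular Q-learning loop. The environment (`step`), its reward, and the multiplicative ε-decay schedule are assumptions for illustration; only the parameters (α = 0.10, γ = 0.95, ε₀ = 1.0, 200 episodes), the 25-state / 3-action layout, and the update equations come from the dashboard.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 25, 3          # MAP × Lactate grid; ↑ dose / hold / ↓ dose
ALPHA, GAMMA = 0.10, 0.95            # learning rate α, discount factor γ
EPS0, EPISODES = 1.0, 200            # initial exploration ε₀, training episodes

Q = np.zeros((N_STATES, N_ACTIONS))

def step(state, action):
    """Hypothetical stand-in environment: returns (next_state, reward, done)."""
    next_state = int(rng.integers(N_STATES))
    reward = float(rng.normal())
    return next_state, reward, bool(rng.random() < 0.1)

eps = EPS0
for ep in range(EPISODES):
    s, done = int(rng.integers(N_STATES)), False
    while not done:
        # ε-greedy action selection
        a = int(rng.integers(N_ACTIONS)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # δ = r + γ·max_a′ Q(s′,a′) − Q(s,a);  Q(s,a) ← Q(s,a) + α·δ
        delta = r + GAMMA * (0.0 if done else Q[s2].max()) - Q[s, a]
        Q[s, a] += ALPHA * delta
        s = s2
    eps *= 0.98                      # assumed multiplicative ε decay (schedule not shown)

policy = Q.argmax(axis=1)            # π(s) = argmax_a Q(s,a)
```

The greedy policy extracted on the last line is what "Policy Stability" would track: how often argmax_a Q(s,a) stops changing between episodes.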
▶ Train Agent
Reset
③ Agent Performance
Mean Survival Reward
—
Survival Rate
—
Final ε
—
Avg |TD Error|
—
Policy Stability
—
④ Training Curve — cumulative reward per episode (green) · ε-exploration decay (amber) · TD error (red) · rolling average (bright green)
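Two of the plotted curves can be reproduced directly from the training history. This is a sketch under assumptions: the multiplicative decay rate (0.98) and the rolling window (10 episodes) are illustrative choices, not values stated by the dashboard.

```python
import numpy as np

def epsilon_schedule(eps0=1.0, decay=0.98, episodes=200):
    """Assumed multiplicative ε decay (the amber curve)."""
    return eps0 * decay ** np.arange(episodes)

def rolling_mean(rewards, window=10):
    """Rolling average of per-episode cumulative reward (the bright-green curve)."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode="valid")

eps_curve = epsilon_schedule()
smoothed = rolling_mean(np.arange(20))
```

With `mode="valid"`, the smoothed curve is `window − 1` points shorter than the raw reward curve, which is why rolling averages are usually plotted starting a few episodes in.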