OVISS: Optimal Vasopressin Initiation in Septic Shock with Reinforcement Learning

Kalimouttou et al. for the OVISS Study group, JAMA 2025. doi:10.1001/jama.2025.3046

Clinical Question

  • In patients with septic shock, does a reinforcement learning model identify a strategy for vasopressin initiation that improves mortality?

Background 

  • Vasopressin is a non-catecholamine vasoconstrictor that is currently recommended as a secondary agent to be considered alongside noradrenaline if mean arterial blood pressure remains low
  • Previous randomised controlled trials on vasopressin in sepsis have failed to show a significant mortality benefit or much improvement in the rate of renal dysfunction
    • VASST and VANISH are two important randomised controlled trials; this patient-level meta-analysis from 2019 summarises two more
  • A criticism of these trials (and of all negative intensive care studies!) is that initiation and dosing are, in practice, usually focussed on a specific subset of particularly sick patients; a more nuanced, individualised initiation strategy may show benefits
    • This is particularly important for vasopressin, as its effects on systemic vascular resistance, heart rate, and diuresis depend on drug dose, degree of intrinsic pituitary vasopressin depletion, concurrent adrenergic stimulation, endogenous V1 receptor distribution, etc
  • Reinforcement learning (RL) is a machine learning approach in which an ‘agent’ is trained through trial-and-error to maximise a ‘reward’ (e.g. +1 point for a reduction in SOFA score) by refining a set of rules (the ‘policy’)
    • In the training stage, the agent is given clinical information up to a time point (a ‘state’), the action taken at that time point, and the ‘reward’ for that patient. It then works out which actions generally move patients to states with higher ‘rewards’ (a minimal toy sketch of this training loop appears at the end of this section)
    • This initially seems similar to a retrospective observational study – the agent identifies associations between decisions made in particular clinical situations and outcomes of interest. However, the agent has significantly more flexibility to combine time-varying inputs, and so can better infer causal links and avoid confounding
    • Outside of medicine, RL agents can interact directly with virtual or real worlds. Within medicine, RL agents are trained and evaluated on historic datasets
  • Reinforcement learning has been studied within critical care before – most famously the AI Clinician. This 2018 paper trained a model to increase or decrease intravenous fluids or vasopressor dose, and concluded that the reinforcement learning model selected treatments that were, on average, better than those chosen by human clinicians
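
To make the ‘state’, ‘action’, ‘reward’ and ‘policy’ vocabulary concrete, here is a minimal toy sketch of offline Q-learning on a handful of made-up logged transitions. Everything in it (states, actions, numbers) is hypothetical and purely illustrative; the paper itself uses fitted Q-iteration on real electronic health record data.

```python
# A minimal, illustrative sketch of offline Q-learning on logged (state, action,
# reward, next_state) transitions -- NOT the OVISS implementation. All states,
# actions and numbers are made up purely to show the state/action/reward/policy idea.
import numpy as np

n_states, n_actions = 4, 2        # e.g. coarse "sickness" bins; actions: 0 = no vasopressin, 1 = initiate
gamma, alpha, sweeps = 0.99, 0.1, 200

# Hypothetical logged transitions: (state, action, reward, next_state, terminal)
logged = [
    (0, 0, +1.0, 1, False),
    (1, 1, +1.0, 2, False),
    (2, 0, -20.0, 3, True),       # e.g. a large penalty applied at death
    (1, 0, +1.0, 3, True),
]

Q = np.zeros((n_states, n_actions))
for _ in range(sweeps):
    for s, a, r, s_next, terminal in logged:
        target = r if terminal else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])   # move Q(s, a) towards the bootstrapped target

policy = Q.argmax(axis=1)          # the learned rule: highest-value action in each state
print("Learned policy (action per state):", policy)
```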

Design 

  • Reinforcement learning on retrospective electronic health record data, validated on separate patients from three external datasets.  
  • Each hospital stay was divided into 1-hour epochs, and the model was tuned to place similar importance on long-term outcomes as on short-term outcomes.
  • The model was given baseline characteristics as well as 14 further measurements per hour, including: time from shock onset, noradrenaline dose, MAP, lactate, SOFA score, steroid therapy (binary), urea, creatinine, RRT (binary), mechanical ventilation (binary), fluid received, urine output, and mortality.
  • The ‘reward’ that the model was trained to maximise was a patient outcome score (sketched in toy form at the end of this Design section) that penalised death (-20 points, applied to the final 1-hour epoch), and rewarded:
    • short-term survival (+1 point per hour)
    • decreased lactate over the next six hours (+1 point)
    • increased MAP to ≥65 mmHg, if below this, over the next four hours (+1 point)
    • decreased SOFA score over the next six hours (+3 points)
    • decreased noradrenaline use over the next four hours (+1 point)
  • At each time point, the model could choose to “initiate vasopressin”, or not. There was no dose titration
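
As a toy illustration of the reward structure described above, the sketch below codes one possible per-epoch formulation. The point values are taken from the summary; the function name, arguments and exact bookkeeping are hypothetical and not the authors' implementation.

```python
# A toy sketch of a per-epoch reward with the structure summarised above --
# illustration only, not the authors' code. Point values come from the paper as
# summarised; the argument names and bookkeeping are hypothetical.
def epoch_reward(died_this_epoch: bool,
                 lactate_fell_within_6h: bool,
                 map_recovered_above_65_within_4h: bool,
                 sofa_fell_within_6h: bool,
                 noradrenaline_fell_within_4h: bool) -> float:
    if died_this_epoch:
        return -20.0                 # death penalty, applied to the final 1-hour epoch
    reward = 1.0                     # +1 point per hour of short-term survival
    if lactate_fell_within_6h:
        reward += 1.0
    if map_recovered_above_65_within_4h:
        reward += 1.0
    if sofa_fell_within_6h:
        reward += 3.0
    if noradrenaline_fell_within_4h:
        reward += 1.0
    return reward

print(epoch_reward(False, True, False, True, False))   # -> 5.0
```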

Population

  • Four datasets used:
    • UCSF De-Identified Clinical Data Warehouse: 120,000 critically ill admissions to several California hospitals; 2012 – 2023, used for training, testing, and internal validation 
    • MIMIC-IV: 65,000 patients, 2008 – 2019, ED or ICU admissions to Beth Israel Deaconess Medical Center, Boston
    • eICU-CRD: 200,000 patients, 2014 – 2015, several ICUs across the US
    • UPMC dataset: 2018 – 2020
  • Inclusions: Adults with first episode of septic shock (defined by Sepsis-3 criteria), already receiving noradrenaline
  • Exclusions: No laboratory values, repeat episode of septic shock
  • 499,596 admissions screened. 
    • 482,366 excluded; mostly (475,881) for not being adults in septic shock
    • 14,453 ultimately included:
      • 3,608 used in the derivation cohort (i.e. to train the reinforcement learning model)
      • 10,845 in the validation cohort
  • Baseline characteristics showed reasonable variations between datasets
    • Median age of 63 in derivation cohort, 63-67 in validation sets
    • Male 57% derivation; 51-59% validation
    • White 45% derivation; 44-80% validation
    • SOFA 5 derivation; 4-5 validation
    • Noradrenaline dose 0.05 mcg/kg/min derivation; 0.06 – 0.26 mcg/kg/min validation
    • Vasopressin initiated in 25% derivation; 20-48% validation
    • Mechanical ventilation in 35% derivation; 9-75% validation
    • RRT in 7% derivation; 6-20% validation

Outcomes

  • Evaluation of the model:
    • Weighted importance sampling to estimate the mean “reward” (as detailed above) that would be obtained with the reinforcement learning policy compared with a policy trained to copy what the clinicians did (a toy sketch of weighted importance sampling appears at the end of this Outcomes section)
      • The overall reward was higher with the reinforcement learning rule
    • In the patients who did receive vasopressin, the authors used inverse probability of treatment weighted pooled logistic regression to evaluate whether the odds of the primary outcome (in-hospital death) were influenced by whether the clinician initiated vasopressin at the time point that would have been indicated by the model
  • Concordance with the model policy was associated with decreased odds of in-hospital mortality (adjusted odds ratio of 0.81; 95% CI 0.73-0.91)
  • Concordance was also associated with a decreased need for renal replacement therapy (aOR 0.47), but not for mechanical ventilation (aOR 1.0).
  • Interrogation of the model validity: 
    • The authors calculated an E-value of 1.46 (a worked reconstruction of this figure appears at the end of this Outcomes section). This implies that, for the improvement in mortality to be explained away by an unmeasured confounder, that confounder would need to be associated with both the exposure and the outcome by a risk ratio of at least 1.46 each (roughly a 46% increase in risk for both)
    • Addition of extra falsification variables to attempt to trick the model failed to lead to a false positive result
    • The authors used “SHAP values” – these are commonly used in machine learning to explain the contribution of different features to a model’s output. They determined the impact of each clinical feature at the time the RL model chose to initiate vasopressin. The most important parameters were time since shock onset, SOFA score, noradrenaline dose, and serum lactate 
  • Interrogation of the model for biological insights:
    • The authors describe the patient characteristics at the time the model policy recommended vasopressin initiation. Across the datasets, the model generally initiated vasopressin at lower noradrenaline doses (0.2mcg/kg/min rather than clinician-observed 0.37mcg/kg/min), with lower SOFA scores (7 vs 9), lower lactate levels (2.5 vs 3.6); it usually opted to initiate sooner after shock onset (4hrs vs 5hrs)
    • Using the same pooled logistic regression approach as above, the model significantly outperformed a few simpler human-interpretable rules, such as specific lactate, MAP, or noradrenaline cut-offs
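
For readers unfamiliar with weighted importance sampling, the sketch below shows the basic self-normalised, per-trajectory estimator, under the simplifying assumption that both policies' action probabilities are known. It illustrates the general technique named in the paper; it is not the authors' implementation, and the example policies and trajectories are entirely hypothetical.

```python
# A toy sketch of per-trajectory weighted importance sampling (WIS) for
# off-policy evaluation -- illustration only, not the authors' code.
import numpy as np

def wis_value(trajectories, pi_eval, pi_behaviour, gamma=0.99):
    """Estimate the mean return of pi_eval from data logged under pi_behaviour.

    Each trajectory is a list of (state, action, reward) tuples; the policies
    return the probability of taking `action` in `state`.
    """
    weights, returns = [], []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            ratio *= pi_eval(s, a) / pi_behaviour(s, a)    # importance ratio accumulates per step
            ret += (gamma ** t) * r                        # discounted return of the logged trajectory
        weights.append(ratio)
        returns.append(ret)
    weights, returns = np.array(weights), np.array(returns)
    return float(np.sum(weights * returns) / np.sum(weights))   # self-normalised weighted average

# Hypothetical usage: the evaluation policy strongly prefers action 1,
# while the behaviour (clinician-mimicking) policy chose actions at random
pi_e = lambda s, a: 0.9 if a == 1 else 0.1
pi_b = lambda s, a: 0.5
trajs = [[(0, 1, 1.0), (1, 1, 1.0)], [(0, 0, 1.0), (1, 0, -20.0)]]
print(wis_value(trajs, pi_e, pi_b))
```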
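
The reported E-value of 1.46 is consistent with the standard VanderWeele–Ding formula applied to the adjusted odds ratio of 0.81, using the common-outcome approximation that converts an odds ratio to an approximate risk ratio via a square root. This reconstruction is my assumption about how the figure was derived, not something stated in the paper as summarised above.

```python
# Worked reconstruction of the reported E-value -- assumes (my assumption) the
# VanderWeele-Ding formula with the common-outcome OR -> RR conversion.
import math

aor = 0.81                        # adjusted odds ratio for mortality with model-concordant care
rr = math.sqrt(1 / aor)           # approximate risk ratio, inverted so it sits above 1
e_value = rr + math.sqrt(rr * (rr - 1))
print(round(e_value, 2))          # -> 1.46, matching the value reported in the paper
```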

Authors’ Conclusions 

  • In adult patients with septic shock receiving norepinephrine, the use of vasopressin was variable. A reinforcement learning model developed and validated in several observational datasets recommended more frequent and earlier use of vasopressin than average care patterns and was associated with reduced mortality

Strengths 

  • The use and production of open datasets speak to the value of the substantial efforts needed to produce and maintain such resources
  • Significant heterogeneity in validation cohorts strengthens future applicability; explicit consideration of racial and gender fairness and bias 
  • Thorough and considered statistical approach, including transparent sensitivity analyses on various model hyperparameters
  • Use of logistic regression / other measures in addition to weighted importance sampling 
  • The overall evaluation focussed on the very relevant outcome of mortality
  • It is both a strength and a weakness that (apart from the selection of input parameters the model was given access to) there is no biological reasoning built in, nor any requirement to follow physiological rationales or smoothing. This open-minded approach can lead to new insights, but it does mean extra steps are needed to be reassured that the model output would always be safe or logical

Weaknesses 

  • This retrospective observational study still requires prospective validation before inferring a causal link
  • The baseline and hourly measurements the model had access to are not necessarily the ones used clinically to decide whether to start vasopressin. The use of retrospective datasets precluded the model being informed by, e.g., echocardiography findings, fluid responsiveness, within-hour changes, capillary refill time, etc.
  • The only decision this model could make was “start vasopressin” or “continue without vasopressin”. This makes it trickier to trial in clinical practice, as this decision will interact with other decisions being made at each time point
  • Specific methodological concerns:
    • The authors used Fitted-Q-iteration, which allows the selection of actions that were never done in a given clinical state. There was no penalty for selecting these unseen actions. In a safety-critical field this may be concerning
    • Weighted importance sampling introduces bias by under-emphasising cases where the model did something different from the clinician. Clinician – and therefore model – variability is likely greatest in difficult cases – so the model is biased by evaluating itself just on the more clear-cut decisions. In reality, we would likely want an evaluation method that biases toward the trickier cases
    • The authors also do not provide data on the accuracy of their baseline policy (the one that aims to copy the clinician actions)
    • The RL agent chose quite different actions from the clinicians. This means there will be fewer points of overlap; inverse-probability-of-treatment-weighted pooled logistic regression will weight these few points highly, which can lead to unstable results. The authors therefore truncated the weights as per the supplementary materials – but it would be useful to know what proportion of weights needed this truncation, as the truncation itself adds bias (a toy sketch of weight truncation appears at the end of this Weaknesses section)
    • The sensitivity analyses show that a small change in the reward function (giving an equal weight to each non-mortality component of the reward) stopped the model from successfully identifying a policy that was “better” in terms of reward. The values in the reward function seem arbitrarily chosen rather than based on clinical absolutes or patient preferences, and can certainly be questioned – for example, the penalty of -20 for death combined with +1 for each hour survived in ICU implies the same score for a patient who was stepped down to a ward after 24 hours as for a patient who died in ICU after 44 hours.
    • It is also unusual for the reward function to include information on several hours of future states. Valuing several future hours is intrinsically part of the way the Q function is trained, and so adding extra reward points based on multiple time blocks in the future would “double-count” certain rewards
  • The article’s Data Sharing Statement reads “Data available: No”. I have not yet contacted the authors but would be surprised and disappointed if the code is not accessible for analysis in any way for reproducibility and review
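
To make the truncation concern concrete, here is a minimal sketch assuming percentile-based clipping of the inverse-probability weights; the cut-offs, helper name and data are hypothetical. The quantity worth reporting is the proportion of weights that the clipping actually alters.

```python
# A toy sketch of truncating extreme inverse-probability-of-treatment weights --
# illustration only, not the authors' code. The percentile cut-offs and data are
# hypothetical; the proportion truncated is the detail argued for above.
import numpy as np

def truncate_weights(weights, lower_pct=1, upper_pct=99):
    """Clip extreme weights to the given percentiles and report how many changed."""
    lo, hi = np.percentile(weights, [lower_pct, upper_pct])
    clipped = np.clip(weights, lo, hi)
    prop_truncated = float(np.mean((weights < lo) | (weights > hi)))
    return clipped, prop_truncated

# Hypothetical example: a handful of stable weights plus one extreme one
w = np.array([0.8, 1.1, 0.9, 1.3, 25.0])
clipped, prop = truncate_weights(w)
print(clipped, f"{prop:.0%} of weights truncated")
```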

The Bottom Line 

  • Vasopressin has varied and individualised effects. This study would support a survival benefit with earlier initiation, as triggered by this particular reinforcement learning algorithm
  • Despite the concerns above, I am generally enthusiastic about a real-world clinical evaluation of a reinforcement-learning agent, with a human checking that the decision-making is not dangerous. However, there are other significant recent advances in the field of offline reinforcement learning which could be applied to update this model and ensure the agent chosen for this evaluation has the best chance of success

External Links 

Metadata

Summary author: Hrishee Vaidya – bsky / twitter. With thanks to Tom Frost for insight and discussions regarding reinforcement learning.

Summary date: 19th May 2025
Peer-review editor: David Slessor
Picture by: Richard Bell (Unsplash)
