Patient-Level Prediction Guide
Introduction & Purpose
Patient-Level Prediction (PLP) studies are designed to build a “risk calculator” that can predict an individual patient’s probability of experiencing a future health outcome. Unlike comparative cohort studies that estimate an average effect for a population, PLP models provide a personalised risk score for a single patient based on their unique clinical history.
The purpose of PLP is to support proactive clinical decision-making. By identifying high-risk individuals before an event occurs, clinicians can intervene earlier with preventative treatments or increased monitoring. The central question is: “Based on a patient’s baseline characteristics, can we accurately predict who is at highest risk of a future outcome?”
Study Design
The design is a prognostic model development and validation study. It involves the following key steps:
- Defining the Prediction Problem: Clearly specifying the target population, the outcome to be predicted, and the time window for the prediction.
- Feature Engineering: Extracting a large number of potential predictor variables (covariates) from the patient’s historical data.
- Model Training: Applying machine learning algorithms to a “training” dataset to learn the relationship between the baseline features and the future outcome.
- Model Validation: Evaluating the performance of the trained model on a separate “testing” dataset to ensure it is accurate and generalisable.
Participants
The study starts with a target cohort of individuals for whom we want to make a prediction (e.g., “patients newly diagnosed with diabetes”). Within this cohort, the model is trained on individuals who have sufficient prior observation time to characterise their history and sufficient follow-up to determine whether they experience the outcome. A rough sketch of this eligibility step follows.
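As an illustration only (not the OHDSI implementation; the table and column names below are hypothetical, not OMOP CDM fields), the eligibility filter might look like this in Python:

```python
import pandas as pd

# Hypothetical cohort table: one row per patient with their index date and
# the span of their observation in the database.
cohort = pd.DataFrame({
    "person_id": [1, 2, 3],
    "index_date": pd.to_datetime(["2015-03-01", "2016-07-15", "2017-01-10"]),
    "observation_start": pd.to_datetime(["2013-01-01", "2016-05-01", "2014-06-01"]),
    "observation_end": pd.to_datetime(["2018-01-01", "2018-12-31", "2017-03-01"]),
})

# Require >= 365 days of history before index (to build predictors) and
# >= 365 days afterwards (to observe the full time-at-risk window).
lookback = pd.to_timedelta(365, "D")
time_at_risk = pd.to_timedelta(365, "D")
eligible = cohort[
    (cohort["index_date"] - cohort["observation_start"] >= lookback)
    & (cohort["observation_end"] - cohort["index_date"] >= time_at_risk)
]
print(eligible["person_id"].tolist())  # -> [1]
```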
Exposures / Predictors
There is no single “exposure.” Instead, the model uses a vast number of predictor variables (also called features or covariates) extracted from the patient’s history before the prediction start date; a minimal feature-construction sketch follows the list. These can include:
- Demographics
- All prior medical diagnoses
- All prior drug exposures
- All prior medical procedures
- Data from lab tests or measurements
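A minimal sketch of feature construction, assuming patient history has already been extracted into a long-format table. The concept names and columns are hypothetical; in a real study these records would come from OMOP CDM tables such as condition_occurrence and drug_exposure.

```python
import pandas as pd

# Hypothetical long-format history: one row per (patient, clinical event).
history = pd.DataFrame({
    "person_id":  [1, 1, 2, 3, 3],
    "concept":    ["hypertension", "statin", "hypertension", "copd", "statin"],
    "event_date": pd.to_datetime(
        ["2014-02-01", "2014-09-10", "2016-06-20", "2015-11-05", "2016-12-01"]),
})
index_dates = pd.Series(
    pd.to_datetime(["2015-03-01", "2016-07-15", "2017-01-10"]),
    index=[1, 2, 3], name="index_date")

# Keep only events strictly before each patient's index date, then pivot
# into one binary column per concept (1 = present in prior history).
prior = history[history["event_date"] < history["person_id"].map(index_dates)]
X = (prior.assign(value=1)
          .pivot_table(index="person_id", columns="concept",
                       values="value", fill_value=0, aggfunc="max"))
print(X)
```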
Outcomes
The outcome is the event we are trying to predict. It must be a binary (yes/no) event that occurs within a pre-specified time-at-risk window. For example, a prediction problem could be defined as follows (the sketch after the list shows one way to encode these choices):
- Target Cohort: Patients newly diagnosed with atrial fibrillation.
- Outcome: Ischemic stroke.
- Time-at-Risk: Within 1 year after the diagnosis of atrial fibrillation.
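To make these design choices explicit, they can be captured in a small configuration object. This is a hedged sketch: the class and field names are invented for illustration and are not part of any OHDSI API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PredictionProblem:
    """Illustrative container for the three design choices above.

    Time-at-risk offsets are in days relative to the index date, so
    (1, 365) means "from the day after index through one year".
    """
    target_cohort: str
    outcome: str
    tar_start_days: int
    tar_end_days: int

problem = PredictionProblem(
    target_cohort="newly diagnosed atrial fibrillation",
    outcome="ischemic stroke",
    tar_start_days=1,
    tar_end_days=365,
)
```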
Follow-up
Each patient in the target cohort is followed from their index date (the start of the time-at-risk window) until the outcome occurs, the time-at-risk window ends, or their observation in the database ends, whichever comes first.
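A minimal labelling sketch, assuming the cohort and outcome records are available as flat tables with the hypothetical columns named below. A full implementation would also handle censoring for patients who leave the database before the window closes.

```python
import pandas as pd

def label_outcomes(cohort: pd.DataFrame, outcomes: pd.DataFrame,
                   tar_start_days: int = 1, tar_end_days: int = 365) -> pd.Series:
    """Return a 0/1 label per patient: did the outcome occur inside the
    time-at-risk window [index + tar_start_days, index + tar_end_days]?

    `cohort` needs person_id and index_date; `outcomes` needs person_id
    and outcome_date (hypothetical column names, not OMOP CDM fields).
    """
    merged = cohort.merge(outcomes, on="person_id", how="left")
    start = merged["index_date"] + pd.to_timedelta(tar_start_days, "D")
    end = merged["index_date"] + pd.to_timedelta(tar_end_days, "D")
    in_window = merged["outcome_date"].between(start, end)
    # A patient is positive if any of their outcome records falls in the window.
    return in_window.groupby(merged["person_id"]).max().astype(int)
```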
Analyses
The analysis involves applying various machine learning algorithms to the data. The OHDSI PLP framework is designed to make this a standardised process. Key steps include (a combined sketch follows the list):
- Data Splitting: The data is split into a training set (used to build the model) and a testing set (used to evaluate it).
- Model Training: Common algorithms include Logistic Regression, Gradient Boosting Machines, and Random Forests. During training, the model learns how each predictor relates to the outcome (in logistic regression, for example, a weight for each variable).
- Performance Evaluation: The model’s performance is assessed on the testing set using metrics like:
- Discrimination: How well the model separates those who have the outcome from those who do not (measured by the Area Under the Receiver Operating Characteristic Curve, or AUC).
- Calibration: How well the model’s predicted probabilities match the observed outcome rates (e.g., among patients given a 10% predicted risk, roughly 10% should go on to experience the outcome).
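The OHDSI framework carries out these steps in R; purely to illustrate the concepts, here is a self-contained sketch in Python with scikit-learn on synthetic data (the feature matrix and the specific model settings are stand-ins, not the framework’s defaults):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

# Toy data standing in for the feature matrix X and labels y built earlier.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 20))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 2000) > 1.5).astype(int)

# Data splitting: hold out 25% of patients for testing, stratified so the
# outcome rate is similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(),
    "random forest": RandomForestClassifier(n_estimators=200),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]  # personalised risk scores
    auc = roc_auc_score(y_test, p)         # discrimination
    frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)  # calibration
    print(f"{name}: AUC={auc:.3f}, "
          f"max calibration gap={np.max(np.abs(frac_pos - mean_pred)):.3f}")
```

The single calibration-gap number printed here is a crude summary; in practice calibration is usually inspected as a plot of predicted versus observed risk across deciles.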
The final output is a validated prediction model that can be applied to new patients to generate a personalised risk score.
How to Implement This Study
In practice, OHDSI PLP studies are implemented with the open-source PatientLevelPrediction R package (https://github.com/OHDSI/PatientLevelPrediction), which standardises cohort extraction from the OMOP Common Data Model, feature construction, model training, and validation.
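As a language-neutral illustration of the full workflow (in Python, consistent with the sketches above, and on synthetic data with hypothetical variable names rather than real CDM fields), the four design steps can be strung together end to end:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Problem (from the design above): patients newly diagnosed with atrial
# fibrillation; predict ischemic stroke within 365 days of the index date.
rng = np.random.default_rng(1)
n = 2000

# Steps 1-2: in a real study these flags come from the feature-engineering
# step (history before the index date); here they are synthetic stand-ins.
X = pd.DataFrame(
    rng.integers(0, 2, size=(n, 4)),
    columns=["prior_hypertension", "prior_diabetes",
             "prior_heart_failure", "prior_anticoagulant"])

# Synthetic labels with a built-in signal, mimicking the labelling step.
logit = -3 + 1.2 * X["prior_hypertension"] + 1.5 * X["prior_heart_failure"]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Step 3: train on 75% of patients.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Step 4: validate on the held-out 25%, then score a new patient.
print("test AUC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
new_patient = pd.DataFrame(
    [[1, 0, 1, 0]], columns=X.columns)  # hypertension + heart failure present
print("1-year risk:", round(model.predict_proba(new_patient)[0, 1], 3))
```

Swapping the synthetic tables for real cohort, history, and outcome extracts, and the single logistic regression for the algorithms and hyperparameter search of a real study, turns this outline into the actual analysis.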