Patient-Level Characterisation Guide

Introduction & Purpose

A Patient-Level Characterisation study is the observational research equivalent of creating a “Table 1” in a clinical trial. Its primary purpose is to generate a detailed clinical and demographic profile of a specific group of patients (a cohort). This answers the fundamental question: “Who are these patients?”

This type of analysis is a cornerstone of transparent and reproducible research. By providing a comprehensive baseline description of the study population, it helps researchers and readers understand the context of the study, assess the generalisability of the findings, and identify potential sources of confounding or bias.

Study Design

The design is a descriptive cohort analysis. It focuses on summarising the characteristics of one or more cohorts of patients at a specific point in time (the index date), or within a defined time window before it.

Participants

The study includes one or more cohorts of interest. These cohorts are typically defined by a shared characteristic, such as:

A new diagnosis of a specific condition.
The initiation of a particular medication.
Undergoing a specific medical procedure.

A key requirement is that participants have a period of data visibility (e.g., one year) before their index date to allow for the assessment of baseline characteristics.

Exposures / Covariates

In this context, there is no “exposure” in the comparative sense. Instead, the analysis focuses on summarising a wide range of patient covariates (characteristics) present at or before the index date. These can include:

Demographics: Age, gender, race, ethnicity.
Clinical History: All recorded medical conditions, often summarised into comorbidity scores like the Charlson Comorbidity Index.
Medication History: All prior medications the patients have used.
Procedures: A history of medical procedures.

Outcomes

The “outcomes” of this study are the summary statistics for the covariates of interest. The analysis produces tables and visualisations that describe the distribution of these characteristics within the cohort. This can include:

For categorical variables: Frequencies and percentages (e.g., % of patients with a history of diabetes).
For continuous variables: Means, medians, and standard deviations (e.g., mean age of the cohort).

Follow-up

There is typically no follow-up period after the index date in a pure characterisation study. The focus is entirely on the baseline period before the index date.

Analyses

The analysis is descriptive and involves summarising the covariates. This can be done in two ways:

Large-Scale Characterisation: An automated process that summarises thousands of clinical features from the database to provide an unbiased, data-driven overview of the cohort.
Pre-Specified Characterisation: An analysis focused on a limited set of clinically important covariates that have been defined in advance by the researchers.

The results are typically presented in a summary table, often referred to as “Table 1,” which provides a comprehensive snapshot of the cohort.

How to Implement This Study

Let’s implements the study using the OHDSI CohortCharacteristics package to generate baseline profiles (“Table 1”) of one or more cohorts in the OMOP CDM, directly addressing the guide’s descriptive design.

How CohortCharacteristics works

Summarise -> table/plot: summariseCohortCount/summariseCharacteristics produce tidy summarised_result outputs consumed by table…/plot… helpers in visOmopResults.
Cohorts in, summaries out: You provide one or more cohort tables (OMOP CDM structure). Functions compute counts, attrition, demographics, and clinical features at/before index.
Windows: For granular baselines, large-scale summaries can use windows (e.g., -365:-1, 0:0, 1:365) around index.
Groups/strata: Each cohort produces groups; you can stratify or facet by cohort_name and other variables in plots/tables.

Practical guidance

Use mock data for reproducibility.
Keep cohort inclusion/exclusion logic in cohort construction; keep summarise functions descriptive.
Save the summarised_result object and then render tables/plots.

Step 1: Setup

This initial step focuses on preparing the R environment for the analysis. It involves loading libraries like CDMConnector and CohortCharacteristics makes their functions available for data manipulation and analysis. This foundational step is crucial for a reproducible and error-free workflow.

library(CDMConnector)
library(CohortCharacteristics)
library(PatientProfiles)
library(visOmopResults)
library(DrugUtilisation)

Step 2: Create a mock CDM and cohort

To ensure the analysis is reproducible and self-contained, we generate a mock Common Data Model (CDM) using the DrugUtilisation package. This simulated dataset mimics the structure of a real OMOP CDM, providing a safe and standardized environment for developing and testing analytical code. After creating the CDM, we define a cohort of interest—in this case, individuals exposed to acetaminophen. This step is vital for establishing the study population that will be characterized in the subsequent steps.

# Create a mock CDM object for demonstration purposes.
# This simulates a real CDM and populates it with synthetic data.
cdm <- DrugUtilisation::mockDrugUtilisation(numberIndividuals = 10000)

# Generate a cohort of individuals based on the ingredient "acetaminophen".
# This cohort will be used for the characterization analysis.
cdm <- generateIngredientCohortSet(
  cdm = cdm,
  name = "cohort",
  ingredient = "acetaminophen"
)

Step 3: Cohort counts and attrition

Before diving into detailed characterization, it’s important to understand the size and composition of the cohort. This step involves calculating the number of subjects and records in the generated cohort and examining the attrition process, which details how many individuals were excluded at each stage of the cohort definition. These summaries provide a high-level overview of the study population and help identify any potential issues with the cohort definition, ensuring the transparency and validity of the analysis.

# Summarize the number of subjects and records in the cohort.
counts <- summariseCohortCount(cdm$cohort)
# Display the cohort counts in a table.
tableCohortCount(counts)

CDM name	Variable name	Estimate name	Cohort name
CDM name	Variable name	Estimate name	acetaminophen
DUS MOCK	Number records	N	7,280
	Number subjects	N	5,975

# Summarize the attrition of the cohort, showing how many subjects are dropped at each step of the cohort definition.
attr <- summariseCohortAttrition(cdm$cohort)
# Display the cohort attrition in a table.
tableCohortAttrition(attr)

Reason	Variable name
Reason	number_records	number_subjects	excluded_records	excluded_subjects
DUS MOCK; acetaminophen
Initial qualifying events	7,297	5,975	0	0
Collapse records separated by 1 or less days	7,280	5,975	17	0

Step 4: Baseline characteristics (“Table 1”) and plots

This is the core of the patient-level characterization analysis. We generate a comprehensive summary of the cohort’s baseline characteristics, often referred to as “Table 1.” This table includes demographics, clinical events, and other relevant attributes of the study population at the time of cohort entry. Following the summary, we create visualizations to explore specific aspects of the data, such as the age distribution of the cohort. This step provides deep insights into the study population, which is essential for interpreting the results of any subsequent analyses.

# Summarize the baseline characteristics of the cohort.
# This creates a "Table 1" with demographics, conditions, drugs, etc.
chars <- summariseCharacteristics(cdm$cohort)
# Display the baseline characteristics in a table.
tableCharacteristics(chars)

			CDM name
			DUS MOCK
Variable name	Variable level	Estimate name	Cohort name
Variable name	Variable level	Estimate name	acetaminophen
Number records	-	N	7,280
Number subjects	-	N	5,975
Cohort start date	-	Median [Q25 - Q75]	2013-03-26 [2002-04-18 - 2019-05-28]
		Range	1954-04-29 to 2022-12-31
Cohort end date	-	Median [Q25 - Q75]	2015-08-03 [2006-08-08 - 2020-06-06]
		Range	1954-05-17 to 2022-12-31
Age	-	Median [Q25 - Q75]	19 [9 - 34]
		Mean (SD)	22.71 (16.73)
		Range	0 to 71
Sex	Female	N (%)	3,527 (48.45%)
	Male	N (%)	3,753 (51.55%)
Prior observation	-	Median [Q25 - Q75]	663 [171 - 2,049]
		Mean (SD)	1,627.62 (2,393.99)
		Range	0 to 20,955
Future observation	-	Median [Q25 - Q75]	700 [181 - 2,194]
		Mean (SD)	1,726.26 (2,519.87)
		Range	0 to 22,957
Days in cohort	-	Median [Q25 - Q75]	282 [60 - 1,044]
		Mean (SD)	937.70 (1,664.07)
		Range	1 to 18,119

# Create a plot of the age distribution by cohort.
# First, filter the characteristics data to select only the "Age" variable.
plot_data <- dplyr::filter(chars, variable_name == "Age")
# Generate a boxplot of the age distribution, colored by cohort name.
plotCharacteristics(plot_data, plotType = "boxplot", colour = "cohort_name")

Step 5: Advanced Characterization with Flags

The summariseCharacteristics function offers powerful features for creating custom variables. One such feature is conceptIntersectFlag, which allows you to generate binary flags indicating the presence of specific concepts for each individual in the cohort. This is particularly useful for assessing the prevalence of comorbidities, prior medications, or other specific clinical events. In this step, we’ll create flags for a few example concepts to demonstrate how to enrich the baseline characteristics table.

# Define a list of concepts to create flags for.
# We'll use arbitrary concept IDs for this example,
# representing conditions like 'headache' and 'fever'.
cdm$cohort |> 
  addSex() |>
  summariseCharacteristics(
    demographics = FALSE,
    strata = list("sex"), 
    conceptIntersectFlag = list(
    "Concomitant disorders" = list(
      conceptSet = list(headache = 378253, asthma = 317009, influenza = 4266367),
      window = c(-Inf, Inf)
    ))
  ) |>
  # Display the characteristics table, which now includes the new flags.
  tableCharacteristics(
    header = c("sex"),
    hide = c("cdm_name", "cohort_name", "variable_level", "table_name")
  )

Variable name	Variable level	Estimate name	Sex
Variable name	Variable level	Estimate name	overall	Female	Male
Number records	-	N	7,280	3,527	3,753
Number subjects	-	N	5,975	2,904	3,071
Concomitant disorders	Headache	N (%)	3,482 (47.83%)	1,666 (47.24%)	1,816 (48.39%)
	Influenza	N (%)	3,588 (49.29%)	1,740 (49.33%)	1,848 (49.24%)
	Asthma	N (%)	3,563 (48.94%)	1,719 (48.74%)	1,844 (49.13%)