Performing an Analysis
This guide walks you through the process of executing a study in the OMOP/R ecosystem. Think of your R script as a dynamic, executable Statistical Analysis Plan (SAP), where your cohort definitions, variable derivations, and analyses are all documented and run in a single, reproducible workflow.
The following diagram provides a high-level overview of the sequential phases involved in this workflow.
flowchart LR
subgraph "Setup"
A[Environment &
Database Setup]
end
subgraph "Cohort Generation"
B["Codelist & Cohort Generation
Attrition Tracking
Phenotype Diagnostics
"]
end
subgraph "Analysis"
C["Characterization
& Analysis"]
end
subgraph "Results"
D["Generate
Tables, Charts
& Export Results"]
end
A --> B --> C --> D
Phase 1: Study Setup and Configuration
The initial phase involves setting up the R environment and establishing a connection to the database.
Environment Setup
- Dependency Management: It is best practice to use
renvto ensure a reproducible environment.renvcreates a lockfile that captures the exact versions of all R packages used in your project. This is like ensuring every statistician on your team is using the exact same version of SAS and its modules, preventing “it works on my machine” problems.
Database Connection
- Connection: The first step in your R script is to establish a live connection to the OMOP CDM database using the
CDMConnectorpackage. This creates acdmobject that points to the database tables without loading them into memory. - Schemas: When connecting, you will define a
cdm_schemaand awrite_schema. Think ofcdm_schemaas your input library (like alibnamefor your SDTM data) andwrite_schemaas your output library (for your analysis datasets and results).
library(CDMConnector)
library(duckdb)
# Connect to a local DuckDB database file
db_file <- "omop_<data_order_id>_<date>.duckdb"
con <- DBI::dbConnect(
duckdb(),
dbdir = db_file
)
# Define database schemas to read the data and store the results
cdm_schema <- "cdm"
write_schema <- "analysis"
# Create the cdm object
cdm <- cdmFromCon(
con = con,
cdm_schema = cdm_schema,
write_schema = write_schema
)
Phase 2: Cohort Definition and Instantiation
This phase involves defining the medical concepts of interest and constructing patient cohorts based on these concepts. The CodelistGenerator package provides systematic methods for identifying relevant medical concepts, while CohortConstructor transforms these concepts into patient cohorts.
Base cohort creation functions:
conceptCohort()- creates cohorts based on clinical conceptsdemographicsCohort()- creates cohorts based on demographic criteriameasurementCohort()- creates cohorts based on measurement valuesdeathCohort()- creates cohorts based on death records
Codelist and Cohort Generation
CodelistGenerator: UseCodelistGeneratorto create codelists from concept sets. A codelist is a set of medical codes (e.g., for a disease or a drug) that define a clinical idea.CohortConstructor: UseCohortConstructorto build cohorts from these codelists.
For example, to create a cohort of patients with Hypertrophic Cardiomyopathy (HCM) who have received a first-line therapy:
library(CodelistGeneratoror)
library(CohortConstructor)
# 1. Get codelists for the disease and treatments
hcm_codes <- getCodelistFromConceptSet(4247, cdm)
beta_blockers_codes <- getCodelistFromConceptSet(4886, cdm)
ccb_non_dhp_codes <- getCodelistFromConceptSet(4887, cdm)
# 2. Combine treatment codes into a single "first-line therapy" codelist
first_line_therapy_codes <- merge_codelists(
beta_blockers_codes,
ccb_non_dhp_codes
)
# 3. Create initial cohorts based on the codelists
cdm$hcm <- concept_cohort(cdm, hcm_codes, name = "hcm")
cdm$first_line_therapy <- concept_cohort(cdm, first_line_therapy_codes, name = "first_line_therapy")
# 4. Intersect the cohorts to find patients with HCM who have received the therapy
cdm$hcm_first_line_treated <- cdm$first_line_therapy |>
require_cohort_intersect(
target_cohort_table = "hcm",
window = c(-Inf, 0) # The HCM diagnosis must be on or before the treatment start date
)
Phase 3: Cohort Attrition Tracking
After instantiating your cohort, it’s crucial to understand how many patients were included or excluded at each step.
- Attrition Tracking: The
summarise_cohort_attrition()function provides a detailed record of patient exclusions. This is often visualized withplot_cohort_attrition().
cdm$hcm_first_line_treated |>
summarise_cohort_attrition() |>
plot_cohort_attrition()
This plot helps ensure the logic of your cohort definition is working as expected and provides transparency for your final study report.
Phase 4: Phenotype and Cohort Validation
Before proceeding to analysis, it is essential to validate that your cohort definition has captured the correct patient population. The PhenotypeR package provides tools to “diagnose” your cohort, allowing you to review the clinical characteristics of the selected patients to ensure they align with the clinical understanding of the phenotype. This is a powerful quality step that goes beyond simple inclusion/exclusion counts.
PhenotypeR: This package provides a comprehensive suite of diagnostics to evaluate the quality of your clinical phenotype definitions. This helps ensure that your study cohorts accurately represent the patient populations you intend to study.
Phase 5: Analysis Pipeline
This is the core of the study, where you characterize the study population and perform your main analyses.
Baseline Characteristics
A critical first step is to generate the baseline characteristics of your study population. The CohortCharacteristics package is designed for this purpose. Functions like add_age() and add_sex() are standardized helpers from the PatientProfiles package that add common demographic variables to your cohort table. By default, age is calculated at the cohort_start_date, but this can be configured.
- Demographics: You can easily compute age and sex for your cohort.
- Comorbidities and Clinical Events: You can assess the prevalence of various conditions or events in your cohort’s history.
library([`CohortCharacteristics`](https://darwin-eu.github.io/CohortCharacteristics/))
# Example: Summarize age and sex for the study cohort
cdm$hcm_first_line_treated |>
add_age() |>
add_sex() |>
summarise_characteristics(
demographics = TRUE,
age_group = list(c(18, 49), c(50, 69), c(70, 150))
)
Specialized Analyses
Depending on your study’s objectives, you can now run various specialized analyses:
IncidencePrevalence: To calculate the incidence and prevalence of health outcomes.DrugUtilisation: To study patterns of medication use.CohortSurvival: To perform survival analysis.
Phase 6: Results Generation
The final phase involves generating and exporting the results in a clear and shareable format.
- Tables and Figures: The
visOmopResultspackage provides tools to visualize your results. For example, you can create publication-ready tables of baseline characteristics.
library(visOmopResults)
# Generate and table the characteristics summary
results <- cdm$hcm_first_line_treated |>
add_age() |>
add_sex() |>
summarise_characteristics()
table_characteristics(results)
- Export Results: You can export all results to structured formats like CSV for transparency, meta-analysis, or use in other programs.