Using OMOP with R

This guide demonstrates how to apply R programming fundamentals to OMOP CDM data. It assumes basic R knowledge and focuses on OMOP-specific concepts and tools.

  1. Creating a CDM Reference
    1. Connecting to a Local OMOP Database
    2. Using Mock Data for Learning
    3. Exploring the CDM Object
    4. Verifying Connection
  2. Exploring the OMOP CDM
    1. Basic Counts and Summaries
    2. Demographic Summary
  3. Identifying Patient Characteristics
    1. Calculating Age at Observation Start
    2. Using CohortCharacteristics for Standardized Summaries
  4. Adding Cohorts to the CDM
    1. Creating Codelists with CodelistGenerator
    2. Generating Cohorts with CohortConstructor
  5. Working with Cohorts
    1. Characterizing Cohort Members
    2. Comparing Cohorts
  6. Bridging the Gap for Different Backgrounds
    1. For Data Scientists New to Healthcare (like Martina)
  7. Glossary of Key Terms
  8. Applying Tidyverse Principles to the OMOP CDM
    1. The CDMConnector Package: Your Gateway to OMOP Data
    2. compute() vs. collect(): Working in the Database
    3. Defining Clinical Ideas with CodelistGenerator
    4. A Realistic Cohort with CohortConstructor
    5. Answering a Clinical Question: Joining Cohorts and Data

1. Creating a CDM Reference

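library(CDMConnector)  # cdmFromCon() and cohort generation helpers
library(omock)         # mock CDM objects
library(DBI)           # dbConnect()
library(duckdb)        # DuckDB driver
library(dplyr)         # count(), summarise(), joins, collect()
library(lubridate)     # year()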

Connecting to a Local OMOP Database

For a local DuckDB file:

cdm <- cdmFromCon(
  con = dbConnect(duckdb(), dbdir = "path/to/omop.db"),
  cdmSchema = "main",
  writeSchema = "main"
)

Using Mock Data for Learning

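cdm <- mockCdmReference() |>
  mockPerson(nPerson = 100) |>
  mockObservationPeriod()

This creates a cdm object holding synthetic person and observation_period tables. Called on its own, mockCdmReference() returns a CDM whose clinical tables are empty, and the mock CDM lives in R memory rather than in a database.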

Exploring the CDM Object

# List available tables
names(cdm)

The cdm object gives you access to every OMOP table by name. For a database-backed CDM these are lazy tibbles, so dplyr verbs are translated to SQL and run in the database.

# Access specific tables
cdm$person

Verifying Connection

# Count patients
cdm$person |> count() |> collect()

2. Exploring the OMOP CDM

Basic Counts and Summaries

Start with overall statistics:

# Total number of patients
cdm$person |> count() |> collect()

Demographic Summary

Analyze patient demographics:

demographics <- cdm$person |>
  summarise(
    total_patients = n(),
    # year_of_birth is a calendar year, so its mean is a birth year, not an age
    mean_birth_year = mean(year_of_birth, na.rm = TRUE),
    distinct_genders = n_distinct(gender_concept_id)
  ) |>
  collect()

3. Identifying Patient Characteristics

Calculating Age at Observation Start

Join person and observation_period tables:

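age_at_observation <- cdm$observation_period |>
  inner_join(cdm$person, by = "person_id") |>
  # year_of_birth is constant per person, so grouping by it keeps the column
  # without relying on first(), which not every database backend translates
  group_by(person_id, year_of_birth) |>
  summarise(
    first_observation = min(observation_period_start_date, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(age_at_observation = year(first_observation) - year_of_birth) |>
  collect()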
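
To see what this pipeline computes, the same logic can be traced on a few made-up rows. The following is a minimal sketch in base R on in-memory data frames, so it runs without a database or extra packages; the patient values are invented for illustration. Because only year_of_birth is used, the resulting age can be off by up to one year.

```r
# Toy stand-ins for the OMOP person and observation_period tables (invented values)
person <- data.frame(
  person_id = c(1L, 2L),
  year_of_birth = c(1980L, 1955L)
)
observation_period <- data.frame(
  person_id = c(1L, 1L, 2L),
  observation_period_start_date = as.Date(c("2015-01-01", "2010-03-01", "2001-07-15"))
)

# Keep each person's earliest observation period
ordered <- observation_period[
  order(observation_period$person_id, observation_period$observation_period_start_date), ]
first_period <- ordered[!duplicated(ordered$person_id), ]

# Join to person and subtract the birth year from the year of first observation
age_check <- merge(first_period, person, by = "person_id")
age_check$age_at_observation <-
  as.integer(format(age_check$observation_period_start_date, "%Y")) - age_check$year_of_birth

print(age_check[, c("person_id", "age_at_observation")])
```

Person 1 has two observation periods; the earlier one (2010) is used, giving an age of 30.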

Using CohortCharacteristics for Standardized Summaries

Instead of manual joins, use the CohortCharacteristics package for standardized, reproducible summaries.

library(CohortCharacteristics)

# First create a cohort (generateConceptCohortSet() is exported by CDMConnector)
cdm <- generateConceptCohortSet(
  cdm = cdm,
  name = "diabetes",
  conceptSet = list("type_2_diabetes" = 201826),
  end = "observation_period_end_date",
  limit = "first"
)

# Summarize characteristics of the diabetes cohort.
# demographics = TRUE reports age, sex, and prior observation time;
# the result is returned to R directly, so no collect() is needed.
characteristics <- cdm$diabetes |>
  summariseCharacteristics(
    ageGroup = list(c(0, 17), c(18, 64), c(65, 999)),
    demographics = TRUE
  )

4. Adding Cohorts to the CDM

library(CodelistGenerator)
library(CohortConstructor)

Creating Codelists with CodelistGenerator

Define clinical concepts:

# Type 2 diabetes mellitus (concept 201826) plus all of its descendant concepts
diabetes_codes <- getDescendants(cdm, conceptId = 201826) |>
  pull("concept_id")

Generating Cohorts with CohortConstructor

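Create a diabetes cohort from the codelist. The generateConceptCohortSet() function used here is exported by CDMConnector; CohortConstructor's own entry point for concept-based cohorts is conceptCohort().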

cdm <- generateConceptCohortSet(
  cdm = cdm,
  name = "diabetes",
  conceptSet = list("type_2_diabetes" = diabetes_codes),
  end = "observation_period_end_date",
  limit = "first"
)

5. Working with Cohorts

Characterizing Cohort Members

Join cohort with person data:

cohort_characteristics <- cdm$diabetes |>
  inner_join(cdm$person, by = c("subject_id" = "person_id")) |>
  summarise(
    total_patients = n(),
    avg_age = mean(2023 - year_of_birth, na.rm = TRUE),  # age in 2023, a fixed reference year
    distinct_genders = n_distinct(gender_concept_id)
  ) |>
  collect()

Comparing Cohorts

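Compare the diabetes cohort with everyone else in the database:

# union_all() stacks the two queries inside the database;
# bind_rows() only works on local data frames
diabetes_vs_general <- union_all(
  cdm$diabetes |>
    inner_join(cdm$person, by = c("subject_id" = "person_id")) |>
    select(person_id = "subject_id", "year_of_birth") |>
    mutate(group = "diabetes"),
  cdm$person |>
    anti_join(cdm$diabetes, by = c("person_id" = "subject_id")) |>
    select("person_id", "year_of_birth") |>
    mutate(group = "general")
) |>
  group_by(group) |>
  summarise(avg_age = mean(2023 - year_of_birth, na.rm = TRUE)) |>
  collect()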

6. Bridging the Gap for Different Backgrounds

This guide is designed to be accessible to readers from various backgrounds. Here’s how we address common challenges:

For Data Scientists New to Healthcare (like Martina)

If you’re proficient in R but unfamiliar with clinical research, we’ll explain the “why” behind the code. For example, when we create a “cohort” of patients, it’s not just filtering data—it’s defining a study population based on clinical criteria to answer specific research questions.

Martina, a data scientist transitioning to healthcare analytics, often finds that the clinical context adds meaning to the technical work. When building cohorts, remember that each patient represents a real person with a medical history, and your analyses can directly impact healthcare decisions.

7. Glossary of Key Terms

  • Cohort: A defined group of patients who meet specific inclusion/exclusion criteria for a study.
  • Concept ID: A standardized numeric identifier for medical terms in the OMOP vocabulary.
  • Domain: The category of a concept (e.g., Condition, Drug, Measurement).
  • Index Date: The date that defines cohort entry (e.g., first diagnosis date).
  • Incidence: The rate of new cases of a condition in a population over time.
  • Prevalence: The proportion of a population with a condition at a specific point in time.
  • Vocabulary: A controlled set of terms used to standardize medical concepts across different data sources.
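
The difference between prevalence and incidence is easiest to see with numbers. The following sketch uses hypothetical counts, invented purely for illustration, and plain R arithmetic.

```r
# Hypothetical study numbers, invented for illustration
population_on_index_date <- 1000  # people observed on the chosen date
existing_cases <- 50              # of those, people who already have the condition
new_cases <- 12                   # first-ever diagnoses during follow-up
person_years_at_risk <- 2400      # follow-up time among people without the condition

# Point prevalence: proportion of the population with the condition on one date
point_prevalence <- existing_cases / population_on_index_date

# Incidence rate: new cases per unit of person-time at risk
incidence_rate <- new_cases / person_years_at_risk
incidence_per_1000_py <- incidence_rate * 1000

cat("Point prevalence:", point_prevalence, "\n")
cat("Incidence per 1,000 person-years:", incidence_per_1000_py, "\n")
```

Prevalence has no time unit because it describes a single moment, while the incidence rate always carries a person-time denominator.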

8. Applying Tidyverse Principles to the OMOP CDM

While the principles of dplyr are powerful for any database, the OHDSI and DARWIN EU communities have developed a suite of R packages that build on this foundation to provide a seamless experience for working with the OMOP CDM.

The CDMConnector Package: Your Gateway to OMOP Data

The cornerstone of this ecosystem is the CDMConnector package. It allows you to create a cdm object, which is a special type of database connection that understands the structure of the OMOP CDM.

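library(CDMConnector)
library(duckdb)
library(omock)  # provides mockCdmReference() and the mock*() table builders

# For this example, we'll use a mock dataset with synthetic people
cdm <- mockCdmReference() |>
  mockPerson(nPerson = 100) |>
  mockObservationPeriod()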

The cdm object simplifies OMOP analysis by providing a consistent interface to the complex CDM structure.

compute() vs. collect(): Working in the Database

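  • collect(): Runs the query and pulls the result out of the database into R memory as a tibble.
  • compute(): Runs the query in the database and stores the result there as a table (temporary by default), so later steps can reuse it without moving data into R.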
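
The two verbs can be compared on a throwaway database. This is a minimal sketch, assuming the dplyr, dbplyr, DBI, and duckdb packages are installed; the person table and its values are invented for illustration and do not come from a real CDM.

```r
library(dplyr)
library(DBI)
library(duckdb)

# In-memory DuckDB database with a toy person table (invented values)
con <- dbConnect(duckdb())
dbWriteTable(con, "person", data.frame(
  person_id = 1:5,
  year_of_birth = c(1950L, 1962L, 1975L, 1988L, 2001L)
))

# A lazy query: nothing has been executed yet
older <- tbl(con, "person") |>
  filter(year_of_birth < 1980)

# compute(): the query runs in the database and the result stays there
older_db <- compute(older)
is_lazy <- inherits(older_db, "tbl_lazy")

# collect(): the result is copied into R as a regular tibble
older_local <- collect(older_db)
is_local <- is.data.frame(older_local)
n_older <- nrow(older_local)

dbDisconnect(con, shutdown = TRUE)
```

A common pattern is to call compute() after an expensive intermediate step and to reserve collect() for the final, small result.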

Defining Clinical Ideas with CodelistGenerator

The CodelistGenerator package gathers relevant concept IDs for clinical ideas.

A Realistic Cohort with CohortConstructor

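Let's create a cohort of patients entering at their first recorded Type 2 Diabetes diagnosis. The call below uses generateConceptCohortSet(), which is exported by CDMConnector, and assumes diabetes_codes is a vector of Type 2 Diabetes concept IDs.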

library(CohortConstructor)

cdm <- generateConceptCohortSet(
  cdm = cdm,
  name = "diabetes",
  conceptSet = list("type_2_diabetes" = diabetes_codes),
  end = "observation_period_end_date",
  limit = "first"
)

This command performs complex filtering, grouping, and joining on the database side.
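
The filtering, grouping, and joining can be pictured with a small in-memory example. The sketch below is written in base R on invented rows and only illustrates the idea; it is not the package's implementation, which runs as SQL in the database and applies further checks, such as requiring events to fall inside an observation period.

```r
# Toy stand-ins for OMOP tables (invented values)
condition_occurrence <- data.frame(
  person_id = c(1L, 1L, 2L, 3L),
  # the last row uses a concept outside the concept set
  condition_concept_id = c(201826L, 201826L, 201826L, 316866L),
  condition_start_date = as.Date(c("2015-06-01", "2012-02-10", "2018-09-30", "2011-01-01"))
)
observation_period <- data.frame(
  person_id = 1:3,
  observation_period_end_date = as.Date(c("2020-12-31", "2021-06-30", "2019-12-31"))
)
concept_set <- c(201826L)  # Type 2 diabetes mellitus

# 1. Filter: keep records whose concept is in the concept set
events <- condition_occurrence[condition_occurrence$condition_concept_id %in% concept_set, ]

# 2. Group: keep the earliest record per person (the effect of limit = "first")
events <- events[order(events$person_id, events$condition_start_date), ]
first_event <- events[!duplicated(events$person_id), ]

# 3. Join: end each cohort entry at the person's observation period end
joined <- merge(first_event, observation_period, by = "person_id")
cohort <- data.frame(
  cohort_definition_id = 1L,
  subject_id = joined$person_id,
  cohort_start_date = joined$condition_start_date,
  cohort_end_date = joined$observation_period_end_date
)
print(cohort)
```

Person 1 enters on the earlier of their two diagnoses, and person 3 is excluded because their only record is outside the concept set.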

Answering a Clinical Question: Joining Cohorts and Data

Now that we have our cohort, we can ask questions like: “What is the age and gender distribution of our new diabetes cohort?”

diabetes_cohort <- cdm$diabetes
person_table <- cdm$person

cohort_demographics <- diabetes_cohort |>
  inner_join(person_table, by = c("subject_id" = "person_id")) |>
  select("subject_id", "cohort_start_date", "gender_concept_id", "year_of_birth") |>
  mutate(age_at_diagnosis = year(cohort_start_date) - year_of_birth) |>
  collect()

summary(cohort_demographics$age_at_diagnosis)
table(cohort_demographics$gender_concept_id)

This workflow of defining concepts, generating cohorts, and analyzing results is the foundation of scalable analysis in OHDSI.