Methods

To ensure the Research Hub collects the highest-quality data possible, the All of Us Research Program uses the following methodology to curate data for registered and approved researchers. Read more about the Program Protocol.

Data Curation Process


Data Sources

Beyond recruiting a diverse participant community, the All of Us Research Program collects data from a wide variety of sources, including surveys, electronic health records (EHR), biosamples, physical measurements, and mobile health devices.

Data Harmonization

The All of Us Research Program uses the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) to ensure EHR data is standardized for all researchers.

Data Refinements

After harmonizing the data to meet the specifications of the OMOP Common Data Model, we process the data to ensure participant privacy is protected.

Curated Data Repository

After harmonization and refinement, the All of Us Research Program data is known as the Curated Data Repository (CDR).

Public Tier Curated Data Repository (CDR) may be accessed through the interactive Data Browser application and the

Registered Tier Curated Data Repository (CDR) data will become available to approved researchers in the Workbench this winter.

OMOP Common Data Model

The OMOP CDM is maintained by an international collaboration called the Observational Health Data Sciences and Informatics (OHDSI) program. The Data and Research Center leverages the OMOP CDM to empower researchers by using existing, standardized vocabularies and a harmonized data representation. These factors enable connection to other ontologies, data sets, and tools that use the same codes or data model. Learn more about OHDSI’s OMOP CDM initiative here.

As a researcher, here’s what you should know about OMOP

OMOP is a relational database. A relational database is a set of formally described tables with defined relationships from which data can be accessed and connected in many different ways without having to rebuild the original database tables. For researchers, it may be helpful to get familiar with the CDR’s OMOP tables.
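As a rough illustration of what "tables with defined relationships" means in practice, the sketch below builds two tiny tables in an in-memory SQLite database and joins them on `person_id`. The table and column names follow common OMOP naming, but the schema is heavily trimmed and every row is invented for this example; it is not CDR data and says nothing about how the Workbench is actually queried.

```python
import sqlite3

# In-memory database with a trimmed, illustrative subset of two OMOP-style tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    condition_concept_id INTEGER,   -- made-up concept IDs below
    condition_start_date TEXT
);
INSERT INTO person VALUES (1, 1970), (2, 1985);
INSERT INTO condition_occurrence VALUES
    (10, 1, 1001, '2018-03-02'),
    (11, 1, 1002, '2018-07-15'),
    (12, 2, 1001, '2019-01-20');
""")

# The shared person_id key lets us combine the tables without restructuring them.
rows = conn.execute("""
SELECT p.person_id, p.year_of_birth, COUNT(*) AS n_conditions
FROM person AS p
JOIN condition_occurrence AS c ON c.person_id = p.person_id
GROUP BY p.person_id, p.year_of_birth
ORDER BY p.person_id
""").fetchall()

print(rows)  # [(1, 1970, 2), (2, 1985, 1)]
```

The same two tables could just as easily be joined in the other direction (for example, listing every condition with the person's birth year), which is the flexibility the paragraph above describes.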

OMOP is standardized. Standard vocabularies mean that, in spite of the differing ways each data element may be captured (e.g. variation among the many electronic health records), all of the data is brought together in a consistent representation. For each broad category of data, termed “domains”, OMOP incorporates important existing vocabularies so that everyone can “speak the same language” about the data.
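To make the idea of standardization concrete, here is a minimal sketch in which two different source codes, as might come from two different EHR systems, resolve to one standard concept. It assumes the usual OMOP vocabulary layout (a `concept` table plus a `concept_relationship` table with a "Maps to" relationship); the vocabulary names, codes, and concept IDs are all hypothetical, and `to_standard` is a helper invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    vocabulary_id TEXT,
    concept_code TEXT,
    standard_concept TEXT          -- 'S' marks a standard concept
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT
);
-- Hypothetical vocabularies and IDs, for illustration only.
INSERT INTO concept VALUES
    (9001, 'Example condition', 'StandardVocab', 'S-1', 'S'),
    (9101, 'Example condition (site A code)', 'SourceVocabA', 'A-42', NULL),
    (9102, 'Example condition (site B code)', 'SourceVocabB', 'B-7', NULL);
INSERT INTO concept_relationship VALUES
    (9101, 9001, 'Maps to'),
    (9102, 9001, 'Maps to');
""")

def to_standard(vocabulary_id, concept_code):
    """Return the standard concept_id a source code maps to, or None."""
    row = conn.execute("""
        SELECT std.concept_id
        FROM concept AS src
        JOIN concept_relationship AS cr
          ON cr.concept_id_1 = src.concept_id
         AND cr.relationship_id = 'Maps to'
        JOIN concept AS std
          ON std.concept_id = cr.concept_id_2
         AND std.standard_concept = 'S'
        WHERE src.vocabulary_id = ? AND src.concept_code = ?
    """, (vocabulary_id, concept_code)).fetchone()
    return row[0] if row else None

# Two differently coded source records land on the same standard concept.
print(to_standard("SourceVocabA", "A-42"))  # 9001
print(to_standard("SourceVocabB", "B-7"))   # 9001
```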

OMOP is where metadata rules. The use of these vocabularies and “concept IDs” allows flexibility in extracting data. The original source data often consists of mutually incompatible codes made up of letters and/or numbers; OMOP adds concept IDs alongside them, which helps ensure that a query retrieves what was requested. The vocabulary tables provide the names of these different representations and the relationships among them.
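A small sketch of why concept IDs make extraction easier: records whose original source codes differ can be retrieved with a single filter on the concept ID, and the vocabulary table supplies a readable name. The column names follow common OMOP naming, while the concept IDs, source values, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT
);
CREATE TABLE condition_occurrence (
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_source_value TEXT    -- the code as originally recorded
);
-- Hypothetical concepts and records.
INSERT INTO concept VALUES
    (9001, 'Example condition'),
    (9002, 'Other condition');
INSERT INTO condition_occurrence VALUES
    (1, 9001, 'A-42'),
    (2, 9001, 'B-7'),
    (3, 9002, 'A-99');
""")

# One filter on the concept ID finds both records, despite different source codes.
matches = conn.execute("""
SELECT co.person_id, co.condition_source_value, c.concept_name
FROM condition_occurrence AS co
JOIN concept AS c ON c.concept_id = co.condition_concept_id
WHERE co.condition_concept_id = ?
ORDER BY co.person_id
""", (9001,)).fetchall()

for person_id, source_value, name in matches:
    print(person_id, source_value, name)
```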

Resources can help. Don’t know what the standardized vocabulary is for your search term? Check out Athena, a platform that maps OMOP standardized vocabularies to other non-standard vocabularies. Want to take a deep dive into OMOP? Discover more on their GitHub Wiki.

Which OMOP Tables does the Curated Data Repository use?

Person

Contains basic demographic information describing a participant including biological sex, birth date, race, and ethnicity.


Visit_occurrence

Visits capture encounters with healthcare providers or similar events. Contains the type of visit a Person has (outpatient care, inpatient confinement, emergency room, or long-term care), as well as date and duration information. Rows in other tables can reference this table, e.g. Condition Occurrences related to a specific visit.
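The sketch below shows one way rows in another table can point back to a visit: each condition record carries a `visit_occurrence_id`, so conditions can be counted per visit and the visit length derived from its start and end dates. The schema is trimmed, and the visit-type concept IDs and all rows are made up for this example.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    visit_concept_id INTEGER,      -- hypothetical visit-type IDs below
    visit_start_date TEXT,
    visit_end_date TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    visit_occurrence_id INTEGER REFERENCES visit_occurrence(visit_occurrence_id)
);
INSERT INTO visit_occurrence VALUES
    (100, 1, 7001, '2019-05-01', '2019-05-04'),
    (101, 1, 7002, '2019-06-10', '2019-06-10'),
    (102, 1, 7002, '2019-08-22', '2019-08-22');
INSERT INTO condition_occurrence VALUES
    (10, 1, 9001, 100),
    (11, 1, 9002, 100),
    (12, 1, 9001, 101);
""")

# LEFT JOIN keeps visits that have no linked condition records.
rows = conn.execute("""
SELECT v.visit_occurrence_id, v.visit_start_date, v.visit_end_date,
       COUNT(c.condition_occurrence_id) AS n_conditions
FROM visit_occurrence AS v
LEFT JOIN condition_occurrence AS c
  ON c.visit_occurrence_id = v.visit_occurrence_id
GROUP BY v.visit_occurrence_id, v.visit_start_date, v.visit_end_date
ORDER BY v.visit_occurrence_id
""").fetchall()

# (visit id, length in days, number of linked conditions)
summary = [
    (visit_id, (date.fromisoformat(end) - date.fromisoformat(start)).days, n)
    for visit_id, start, end, n in rows
]
print(summary)  # [(100, 3, 2), (101, 0, 1), (102, 0, 0)]
```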


Condition_occurrence

Conditions are records of a Person indicating the presence of a disease or medical condition stated as a diagnosis, a sign, or a symptom, which is either observed by a Provider or reported by the patient.


Drug_exposure

Captures records about the utilization of a medication. Drug exposures include prescription and over-the-counter medicines, vaccines, and large-molecule biologic therapies. Radiological devices ingested or applied locally do not count as drugs. Drug exposure is inferred from clinical events associated with orders, prescriptions written, pharmacy dispensings, procedural administrations, and other patient-reported information.


Measurement

Contains both orders and results of a systematic and standardized examination or testing of a participant or participant’s sample, including laboratory tests, vital signs, quantitative findings from pathology reports, etc.


Procedure_occurrence

Contains records of activities or processes ordered by, or carried out by, a health care provider on the patient for a diagnostic or therapeutic purpose.


Observation

Captures clinical facts about a Person obtained in the context of examination, questioning, or a procedure. Any data that cannot be represented in another domain, such as social and lifestyle facts, medical history, and family history, is recorded here.


Location

Represents a generic way to capture physical location or address information of participants and Care Sites.


Device_exposure

Captures information about a person’s exposure to a foreign physical object or instrument which is used for diagnostic or therapeutic purposes. Devices include implantable objects (e.g. pacemakers, stents, artificial joints), blood transfusions, medical equipment and supplies (e.g. bandages, crutches, syringes), other instruments used in medical procedures (e.g. sutures, defibrillators) and material used in clinical care (e.g. adhesives, body material, dental material, surgical material).


Death

Contains the clinical events surrounding how and when a participant dies.


Care_site

Contains a list of uniquely identified institutional (physical or organizational) units where health care delivery is practiced (offices, wards, hospitals, clinics, etc.).


Fact_relationship

Contains records about the relationships between facts stored as records in any table of the CDM. Relationships can be defined between facts from the same domain, or different domains. Examples of Fact Relationships include: Person relationships (parent-child), care site relationships (hierarchical organizational structure of facilities within a health system), etc.
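As a sketch of the care site example, the code below stores a small organizational hierarchy as fact relationships and looks up the units that belong to a parent site. The column names follow the usual OMOP `fact_relationship` layout, but the domain and relationship concept IDs (5001 and 6001) and the care sites themselves are placeholders invented for this example, not real vocabulary entries.

```python
import sqlite3

DOMAIN_CARE_SITE = 5001   # hypothetical concept ID for the care site domain
REL_PART_OF = 6001        # hypothetical concept ID for a "part of" relationship

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE care_site (
    care_site_id INTEGER PRIMARY KEY,
    care_site_name TEXT
);
CREATE TABLE fact_relationship (
    domain_concept_id_1 INTEGER,
    fact_id_1 INTEGER,
    domain_concept_id_2 INTEGER,
    fact_id_2 INTEGER,
    relationship_concept_id INTEGER
);
INSERT INTO care_site VALUES
    (1, 'Example Health System'),
    (2, 'Cardiology Clinic'),
    (3, 'Emergency Department'),
    (4, 'Unrelated Practice');
-- Sites 2 and 3 are recorded as part of site 1.
INSERT INTO fact_relationship VALUES
    (5001, 2, 5001, 1, 6001),
    (5001, 3, 5001, 1, 6001);
""")

# Find every care site recorded as "part of" care site 1.
children = [
    name for (name,) in conn.execute("""
        SELECT child.care_site_name
        FROM fact_relationship AS fr
        JOIN care_site AS child ON child.care_site_id = fr.fact_id_1
        WHERE fr.domain_concept_id_1 = ?
          AND fr.domain_concept_id_2 = ?
          AND fr.relationship_concept_id = ?
          AND fr.fact_id_2 = ?
        ORDER BY child.care_site_name
    """, (DOMAIN_CARE_SITE, DOMAIN_CARE_SITE, REL_PART_OF, 1))
]
print(children)  # ['Cardiology Clinic', 'Emergency Department']
```

Because each side of a relationship carries its own domain concept ID, the same table can link facts across different domains as well as within one.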


Specimen

Contains the records identifying biological samples from a person.