Data Methods

Data Curation

Data Sources

The All of Us Research Program collects data from a wide variety of sources, including surveys, electronic health records (EHRs), biosamples, physical measurements, and wearables like Fitbit.

EHR Data Harmonization

The All of Us Research Program uses the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) to standardize EHR data for all researchers.

Data Refinements

After harmonizing the EHR data to meet the specifications of the OMOP CDM, we process the data to ensure participant privacy is protected. We also conform and clean the data to improve their quality.

Curated Datasets

The All of Us Research Program dataset in its final format, after harmonization and refinement, is referred to as a curated dataset. Three different levels of information are available.

The Public Tier dataset displays high-level summaries of the data available for research. Through the Data Browser, one can explore aggregated participant data and summary statistics.

The Registered Tier dataset in the Researcher Workbench includes individual-level data from surveys, physical measurements, longitudinal EHRs, and wearables like Fitbit devices. These individual-level data must be analyzed within the secure Researcher Workbench platform.

The Controlled Tier dataset is also accessed in the Researcher Workbench. In addition to all of the data in the Registered Tier, the Controlled Tier dataset includes genomic data and expanded demographic, survey, and EHR data. Genomic data include short-read whole genome sequences (WGS), long-read WGS, structural variants, and genotyping arrays.

Visit the Data Access Tiers page to learn more about our tiered-data access model.

Data Dictionary

The All of Us Data Dictionary documents what data are available from participants and what modifications the program makes to protect participant privacy. It provides a description for each data field, noting whether it is a standard OMOP field or a custom field created to help capture data unique to the program. The Data Dictionary also provides information on whether the data in each field come from participant health records or from information the participants provide themselves, like survey responses. The Data Dictionary details some ways we clean the data to improve data quality and lists many of the program's custom concept IDs for easy reference. This resource includes versioning data so you can see what has been changed, added, or removed since the previous curated dataset.

Explore the Registered and Controlled Tier Data Dictionaries.

GENOMIC DATA CURATION

Individual-level genomic data from short-read whole genome sequencing (srWGS), long-read whole genome sequencing (lrWGS), and genome-wide genotyping arrays are available within the Researcher Workbench’s Controlled Tier.

DATA SOURCE

Most All of Us participants contribute biosamples such as blood and/or saliva. DNA from these samples is extracted and sent to genome centers for genomic analysis, including whole genome sequencing (WGS) and genome-wide genotyping.

FILE TYPES & FORMAT

Registered researchers with Controlled Tier access can use WGS and array data. The All of Us genomic dataset contains the following variant call files and raw genomic data:

Arrays: Variant Call Format (VCF), Hail MatrixTable (MT), PLINK 1.9, IDAT
srWGS (short-read whole genome sequences) single nucleotide polymorphisms (SNPs) and insertions/deletions (indels): Hail 0.2 Variant Dataset (VDS)
srWGS compressed sequence alignment: CRAM
srWGS SNP and Indel (exome only): VCF, PLINK, Hail MT, and BGEN
srWGS SNP and Indel (common variants): VCF, PLINK, Hail MT, and BGEN
srWGS SNP and Indel (clinically relevant variants): VCF, PLINK, Hail MT, and BGEN
srWGS SNP and Indel (annotated variants): Variant Annotation Table
srWGS genomic metrics, srWGS genetic ancestry, srWGS admixture estimation, and srWGS pharmacogenomics haplotype calls and predicted phenotypes: TSV files
srWGS structural variants (SV): VCF
lrWGS (long-read whole genome sequences) SNP and Indel and SV: VCF and Genomic VCF (GVCF)
lrWGS SNP and Indel: Hail MT
lrWGS binary alignment map (BAM), Graphical Fragment Assembly (GFA), and FASTA
lrWGS sample metrics: TSV files

Researchers can access auxiliary information, such as computed ancestry and quality reports, in the User Support Hub.
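
Several of the files listed above use the Variant Call Format (VCF). As a minimal sketch, and not the program's own tooling, the following Python builds a tiny VCF in memory and labels each record as a SNP or an indel by comparing the lengths of the REF and ALT alleles. The sample IDs (S1, S2) and all records are invented for illustration, and the parser assumes one ALT allele per line. Real files in the Researcher Workbench are far larger and are normally read with dedicated tools such as Hail or PLINK.

```python
# Toy VCF content; sample IDs and variants are invented for illustration.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2"]
ROWS = [
    ["chr1", "1000", ".", "A", "G", "50", "PASS", ".", "GT", "0/1", "1/1"],
    ["chr1", "2000", ".", "AT", "A", "50", "PASS", ".", "GT", "0/0", "0/1"],
    ["chr2", "3000", ".", "C", "CTT", "50", "PASS", ".", "GT", "./.", "0/1"],
]
TOY_VCF = "\n".join(
    ["##fileformat=VCFv4.2", "\t".join(HEADER)] + ["\t".join(row) for row in ROWS]
)


def classify_variant(ref, alt):
    """Label a biallelic variant by comparing REF and ALT lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions


def parse_vcf(text):
    """Return (sample_ids, records) from VCF text with one ALT per line."""
    samples, records = [], []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            continue
        chrom, pos, _variant_id, ref, alt = fields[:5]
        records.append({
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "type": classify_variant(ref, alt),
            "genotypes": dict(zip(samples, fields[9:])),
        })
    return samples, records


samples, records = parse_vcf(TOY_VCF)
for rec in records:
    print(rec["chrom"], rec["pos"], rec["ref"], rec["alt"], rec["type"])
```

The same REF/ALT comparison is what separates the "SNP and Indel" callsets from the structural variant callsets conceptually, although production pipelines apply many more rules than this sketch does.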

GENOMIC DATA QUALITY CONTROL

The All of Us Research Program performs stringent quality control (QC) procedures to ensure that we provide researchers with high-quality genomic data. Our QC methods confirm sample quality and genetic variants within DNA sequences. Short-read and long-read whole genome sequence (srWGS and lrWGS) samples are joint called, which combines evidence from multiple samples to filter out systematic biases. Samples that do not pass QC thresholds are not released.

Array QC

All array samples undergo QC processes to determine any issues with sample swapping, contamination, or preparation. All genome centers follow identical pipelines to generate array variant call format files (VCFs). After sequencing, QC processes include sex concordance, call rate, and cross-individual contamination rate.

srWGS QC

srWGS QC is performed using the same protocol and software at each genome center. Each sample is checked individually to determine if a swap, contamination, or preparation issue has occurred. We verify genotype fingerprints, identify appropriate sex chromosomes, and check the sequencing coverage. We also run QC processes on the single nucleotide polymorphism, insertion, and deletion variant joint callset. With a joint callset, our analysis uses information across all samples to help us detect noisy samples, remove artifacts, and ensure sequencing quality meets genomic data standards.

srWGS SV QC

Additional QC is performed on the srWGS dataset for structural variant (SV) calling. Each sample is checked individually for contamination, and sequencing metrics are evaluated to check for outliers. We also run QC processes on the SV joint callset to refine the variant calls and remove variants that are not supported by high-quality sequencing evidence.

lrWGS QC

Individual lrWGS samples are checked with genotype fingerprinting and sequencing metrics to detect data issues. These per-sample QC checks confirm that the data match what we expect, including the sex chromosomes and the absence of sample contamination. The joint long-read callset uses information across samples to identify variants.
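
Two of the per-sample checks named in the QC subsections above, call rate and sex concordance, can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the 0.98 threshold, the field names (reported_sex, inferred_sex), and the three sample records are hypothetical and are not All of Us values or pipelines. The sketch only shows the general shape of "compute a metric per sample, then withhold samples that fail".

```python
MIN_CALL_RATE = 0.98  # hypothetical threshold, not an All of Us value
MISSING = "./."       # VCF-style missing genotype


def call_rate(genotypes):
    """Fraction of genotype calls that are not missing."""
    if not genotypes:
        return 0.0
    called = sum(1 for gt in genotypes if gt != MISSING)
    return called / len(genotypes)


def sex_concordant(sample):
    """True when the sex inferred from genetic data matches the recorded sex."""
    return sample["reported_sex"] == sample["inferred_sex"]


def passes_qc(sample):
    """A sample passes only if every check succeeds."""
    return call_rate(sample["genotypes"]) >= MIN_CALL_RATE and sex_concordant(sample)


# Invented samples: S1 passes, S2 has a low call rate, S3 has a sex mismatch.
toy_samples = [
    {"id": "S1", "genotypes": ["0/1"] * 99 + [MISSING],
     "reported_sex": "F", "inferred_sex": "F"},
    {"id": "S2", "genotypes": ["0/0"] * 90 + [MISSING] * 10,
     "reported_sex": "M", "inferred_sex": "M"},
    {"id": "S3", "genotypes": ["1/1"] * 100,
     "reported_sex": "F", "inferred_sex": "M"},
]

released = [s["id"] for s in toy_samples if passes_qc(s)]
print("released:", released)
```

Keeping each check as its own small function makes it easy to report which check a sample failed, which matters more in practice than the single pass/fail flag shown here.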

DOWNSTREAM VALIDITY

All of Us runs analyses to demonstrate specific capabilities of the data and to communicate caveats in the data to researchers. We do this through genome-wide association studies (GWAS), which help us validate the All of Us genomic data by replicating previously published results.
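
To make the idea of replication concrete, the sketch below runs a single allelic chi-square test for one variant and checks whether the estimated odds ratio points in the same direction as a previously reported one. It is an illustration only and not the program's validation pipeline: the allele counts and the "published" odds ratio are invented, and a real GWAS tests millions of variants with regression models that adjust for covariates such as ancestry.

```python
import math


def allelic_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio).
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Invented allele counts for one variant (alternate vs. reference alleles).
chi2, p_value, odds_ratio = allelic_test(
    case_alt=300, case_ref=700, ctrl_alt=200, ctrl_ref=800
)

PUBLISHED_ODDS_RATIO = 1.5  # hypothetical value standing in for a prior study
same_direction = (odds_ratio > 1) == (PUBLISHED_ODDS_RATIO > 1)

print(f"chi2={chi2:.2f} p={p_value:.2e} OR={odds_ratio:.2f} "
      f"same direction as published: {same_direction}")
```

A replication check usually compares both the direction and the rough size of the effect, so agreement in direction alone, as tested here, is a weak criterion.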

Observational Medical Outcomes Partnership (OMOP) Common Data Model

The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is maintained by an international collaborative called the Observational Health Data Sciences and Informatics (OHDSI) program. The All of Us Data and Research Center leverages the OMOP CDM to empower researchers by using existing, standardized vocabularies and a harmonized data representation. These factors enable connection to other ontologies, datasets, and tools that use the same codes or data model. Learn more about OHDSI’s OMOP CDM initiative.

As a researcher, here’s what you should know about OMOP:

OMOP is a relational database. A relational database is a set of formally described tables with defined relationships allowing data to be accessed in many different ways. For researchers, it may be helpful to get familiar with the curated dataset’s OMOP Tables.

OMOP is standardized. Standard vocabularies mean that, despite differences in how each data element may be captured (e.g., variation among the many electronic health records), all of the data are represented consistently in the data model. For each broad category of data, or domain, OMOP incorporates important existing vocabularies so that everyone using the data can speak the same language.

OMOP is where metadata rules. The use of these vocabularies and concept IDs allows flexibility in extracting data. Instead of just the source data, which are often highly specific to individual institutions, OMOP provides concept IDs. This ensures that the data are represented in a standardized way, are common across many institutions, and are easily retrievable using standardized search methodologies. The vocabulary tables are available to provide the names and relationships among these different representations.

Resources can help. Don’t know what the standardized vocabulary is for your search term? Check out Athena, a platform that maps OMOP standardized vocabularies to other nonstandard vocabularies. Want to take a deep dive into OMOP? Discover more on the GitHub Wiki.
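
The points above about standardization and concept IDs can be sketched with a small in-memory database. This is a simplified illustration under stated assumptions: SQLite stands in for whatever database actually hosts the data, the tables carry only a few OMOP columns, and the concept IDs (9000001, 9000002) and rows are made up rather than taken from the real vocabularies (look up real IDs in Athena). It shows two records captured with different source codes resolving to one standard concept, so a single query retrieves both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id    INTEGER PRIMARY KEY,
    concept_name  TEXT,
    vocabulary_id TEXT
);
CREATE TABLE condition_occurrence (
    person_id              INTEGER,
    condition_concept_id   INTEGER REFERENCES concept(concept_id),
    condition_source_value TEXT   -- code as recorded by the source institution
);
""")

# Made-up concept IDs; real IDs come from the OMOP vocabulary tables.
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?)",
    [
        (9000001, "Type 2 diabetes mellitus", "SNOMED"),
        (9000002, "Essential hypertension", "SNOMED"),
    ],
)

# Two sites record the same condition with different source codes.
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
    [
        (1, 9000001, "E11.9"),   # ICD-10-CM style code
        (2, 9000001, "250.00"),  # ICD-9-CM style code
        (3, 9000002, "I10"),
    ],
)

rows = conn.execute(
    """
    SELECT co.person_id, co.condition_source_value, c.concept_name
    FROM condition_occurrence AS co
    JOIN concept AS c ON c.concept_id = co.condition_concept_id
    WHERE c.concept_name = ?
    ORDER BY co.person_id
    """,
    ("Type 2 diabetes mellitus",),
).fetchall()

for row in rows:
    print(row)
```

Querying on the standard concept rather than on the source value is what lets one search cover records from institutions that code the same condition differently.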

Which OMOP Tables Does All of Us Use?

The All of Us dataset includes EHR data found in the following OMOP tables:

Person	Contains basic demographic information describing a person including sex, birth date, race, and ethnicity. Although it is common to get this information from EHR data, the All of Us Research Program uses data provided directly by participants through surveys when this information is available.
Visit_occurrence	Visits capture encounters with health care providers or similar events. Contains the type of visit a person has (outpatient care, inpatient confinement, emergency room, or long-term care), as well as date and duration information. Rows in other tables can reference this table, e.g., condition occurrences related to a specific visit.
Condition_occurrence	Conditions are records of a person indicating the presence of a disease or medical condition stated as a diagnosis, a sign, or a symptom, which is either observed by a provider or reported by the patient.
Drug_exposure	Captures records about the utilization of a medication. Drug exposures include prescription and over-the-counter medicines, vaccines, and large-molecule biologic therapies. Radiological devices ingested or applied locally do not count as drugs. Drug exposure is inferred from clinical events associated with orders, prescriptions written, pharmacy dispensings, procedural administrations, and other patient-reported information.
Measurement	Contains both orders and results of a systematic and standardized examination or testing of a person or person’s sample, including laboratory tests, vital signs, and quantitative findings from pathology reports. Physical measurements collected by All of Us are also stored in this table.
Procedure_occurrence	Contains records of activities or processes ordered or carried out by a health care provider for a diagnostic or therapeutic purpose.
Observation	Captures clinical facts about a person obtained in the context of examination, questioning, or a procedure. Any data that cannot be represented by another domain, such as social and lifestyle facts, medical history, and family history, are recorded here. Survey information is also located in this table.
Device_exposure	Captures information about a person’s exposure to a foreign physical object or instrument used for diagnostic or therapeutic purposes. Devices include implantable objects (e.g., pacemakers, stents, artificial joints), blood transfusions, medical equipment and supplies (e.g., bandages, crutches, syringes), other instruments used in medical procedures (e.g., sutures, defibrillators), and material used in clinical care (e.g., adhesives, body material, dental material, surgical material).
Death	Contains the clinical events surrounding how and when a person dies.
Fact_relationship	Contains records about the relationships between facts stored as records in any table of the CDM. Relationships can be defined between facts from the same domain or different domains. Examples of fact relationships include person relationships (parent–child), care site relationships (hierarchical organizational structure of facilities within a health system), etc.
Specimen	Contains the records identifying biological samples from a person.
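
As the Visit_occurrence description notes, rows in other tables can reference a visit. The following sketch shows that relationship with an in-memory SQLite database, which is only a stand-in chosen so the example is self-contained. The tables include just a handful of OMOP columns, and every row is invented. The query counts the conditions recorded at each visit and derives the visit duration from its start and end dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id           INTEGER,
    visit_start_date    TEXT,
    visit_end_date      TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id               INTEGER,
    visit_occurrence_id     INTEGER REFERENCES visit_occurrence(visit_occurrence_id),
    condition_start_date    TEXT
);
""")

# Invented rows: person 1 has two visits, person 2 has one.
conn.executemany(
    "INSERT INTO visit_occurrence VALUES (?, ?, ?, ?)",
    [
        (10, 1, "2020-01-03", "2020-01-05"),
        (11, 1, "2020-03-01", "2020-03-01"),
        (12, 2, "2021-06-10", "2021-06-10"),
    ],
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
    [
        (100, 1, 10, "2020-01-03"),
        (101, 1, 10, "2020-01-04"),
        (102, 2, 12, "2021-06-10"),
    ],
)

# LEFT JOIN keeps visits that have no recorded conditions.
visit_summary = conn.execute(
    """
    SELECT v.visit_occurrence_id,
           v.person_id,
           CAST(julianday(v.visit_end_date) - julianday(v.visit_start_date) AS INTEGER)
               AS duration_days,
           COUNT(c.condition_occurrence_id) AS n_conditions
    FROM visit_occurrence AS v
    LEFT JOIN condition_occurrence AS c
           ON c.visit_occurrence_id = v.visit_occurrence_id
    GROUP BY v.visit_occurrence_id
    ORDER BY v.visit_occurrence_id
    """
).fetchall()

for row in visit_summary:
    print(row)
```

The same pattern, joining on visit_occurrence_id or person_id, applies to the other event tables listed above, such as Drug_exposure, Measurement, and Procedure_occurrence.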
