Research Projects Directory

18,835 active projects

This information was updated 6/21/2025

The Research Projects Directory includes information about all projects that currently exist in the Researcher Workbench to help provide transparency about how the Workbench is being used. Each project specifies whether Registered Tier or Controlled Tier data are used.

Note: Researcher Workbench users provide information about their research projects independently. Views expressed in the Research Projects Directory belong to the relevant users and do not necessarily represent those of the All of Us Research Program. Information in the Research Projects Directory is also cross-posted on AllofUs.nih.gov in compliance with the 21st Century Cures Act.

GLP-1 and Physical Activity

Scientific Questions Being Studied

We are conducting a study using Fitbit and prescription data from the NIH-funded All of Us Research Program to examine how GLP-1 receptor agonists (GLP-1 RAs) and physical activity jointly impact cardiometabolic health. GLP-1 RAs, used for type 2 diabetes and obesity, may reduce cardiovascular disease (CVD) risk, and their effects may be enhanced by physical activity.

Project Purpose(s)

  • Disease Focused Research (Cardiovascular Disease)
  • Population Health
  • Social / Behavioral

Scientific Approaches

We will assess adults aged 18–75 with at least 6 months of data to: (1) compare cardiometabolic outcomes between GLP-1 RA users and non-users among Fitbit wearers; and (2) test whether step counts modify the effect of GLP-1 RAs. The primary outcome is incident CVD (heart attack, stroke, heart failure); secondary outcomes include changes in A1c and BMI over 6–12 months.

Anticipated Findings

The anticipated outcomes of this study include quantifying the cardiovascular benefits of GLP-1 receptor agonist (GLP-1 RA) use in real-world settings, particularly among physically active individuals. We expect to generate evidence of interaction effects between GLP-1 RA use and physical activity, such as daily step counts, highlighting the potential synergistic impact on cardiometabolic outcomes. Additionally, the study will provide comparative effectiveness insights by evaluating differences in outcomes between GLP-1 RA users and non-users among Fitbit wearers, offering practical guidance for optimizing treatment strategies in diverse patient populations.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • JungAe Lee - Early Career Tenure-track Researcher, University of Massachusetts Medical School

Surgical Complication Prediction

Scientific Questions Being Studied

We intend to leverage aspects of this dataset to predict complications after surgery. We hope to use this data to inform knowledge on surgical risk.

Project Purpose(s)

  • Population Health

Scientific Approaches

We will analyze the EHR data to predict surgical complications. We will use statistical modeling to predict surgical risk and complications.

Anticipated Findings

We do not anticipate any findings at this early stage of the project; however, we hope our findings inform surgical risk management.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

interface_testing

Scientific Questions Being Studied

No specific scientific question at this point. The goal will simply be to test loading the programs and data.

Project Purpose(s)

  • Other Purpose (To learn about the workbench interface.)

Scientific Approaches

At this point there are no scientific questions and this is just to test the interface.

Anticipated Findings

I anticipate learning whether the interface is amenable to the types of analyses I do or whether I will need to request that the interface be updated.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

  • James Hokanson - Early Career Tenure-track Researcher, Medical College of Wisconsin

BMI tutorial

Scientific Questions Being Studied

Obesity & HDL distribution. We are exploring the data to formalize a specific research question: how does BMI relate to HDL levels?

Project Purpose(s)

  • Educational

Scientific Approaches

Relationship between obesity and HDL. The analysis will be conducted on the general population to study the relationship between BMI and HDL.

Anticipated Findings

Obesity negatively correlates with HDL levels. The findings may deepen understanding of the relationship between BMI and HDL.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Maria Zhao - Undergraduate Student, University of Chicago

rqi24

Scientific Questions Being Studied

Genome-wide association studies (GWAS) have identified tens of thousands of genotype-phenotype associations for human complex traits. A polygenic risk score (PRS) for a trait is typically calculated as a weighted sum of trait-associated allele counts across numerous loci in the genome, where each weight is obtained from a corresponding GWAS. PRS is an effective tool to quantify the aggregated genetic propensity for a trait or disease. With rapid advances in GWAS sample size and statistical methodologies, PRS has shown substantially improved prediction accuracy and great potential in disease risk screening and precision medicine. The main goals of this project are 1) to run GWAS on numerous complex traits to identify and interpret genetic associations through integrative modeling of annotation data, and 2) to produce a set of PRS for hundreds of complex traits using newly released genomic data in All of Us.
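
The weighted-sum definition of a PRS given above can be illustrated with a minimal Python sketch. It is not this project's pipeline (which names PRS-CS); the function name, the variant IDs, and the effect sizes are hypothetical, and treating a missing variant as zero copies is an assumption made only for illustration.

```python
from typing import Mapping


def polygenic_risk_score(
    allele_counts: Mapping[str, int],
    weights: Mapping[str, float],
) -> float:
    """Return the weighted sum of trait-associated allele counts.

    allele_counts maps a variant ID to the number of effect alleles (0, 1, or 2).
    weights maps a variant ID to its per-allele weight from a GWAS.
    """
    score = 0.0
    for variant_id, weight in weights.items():
        # Assumption: a variant absent from the genotype counts as 0 copies.
        score += weight * allele_counts.get(variant_id, 0)
    return score


# Hypothetical variant IDs and weights, for illustration only.
example_weights = {"rs_a": 0.12, "rs_b": -0.05, "rs_c": 0.30}
example_genotype = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
print(polygenic_risk_score(example_genotype, example_weights))
```

In practice the weights would come from GWAS summary statistics after adjustment by a method such as PRS-CS, and scores are usually standardized within a reference population before use.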

Project Purpose(s)

  • Social / Behavioral
  • Methods Development
  • Ancestry

Scientific Approaches

We will use software such as Hail, Regenie, and/or BOLT-LMM to run GWAS. We will implement a state-of-the-art method named PRS-CS to compute PRS for each GWAS trait. We will benchmark and optimize the performance of PRS models using a summary statistics-based cross-validation approach called PUMAS developed by our group (Zhao et al. Genome Biology 22(1), 2021). All of Us genomic data will undergo rigorous quality control (QC) procedures, including removing variants with low sequencing depth or low variant calling quality.

Anticipated Findings

We will produce GWAS summary statistics for numerous complex traits and disorders. We will also produce PRS for all individuals with whole-genome sequencing (WGS) data in All of Us. Every individual will have hundreds of scores quantifying their genetic propensity for a large collection of diseases and traits. These scores will be immediately applicable in future studies. For example, one planned future study is to integrate breast cancer PRS with electronic health record data in All of Us to improve risk screening accuracy.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Ranjun Qi - Undergraduate Student, University of Wisconsin, Madison

Duplicate of C-Tier Predictive Model of Cardiovascular Risk in Type II Diabetes

Scientific Questions Being Studied

This project will explore machine learning models to predict cardiovascular risk in people with type 2 diabetes. Individuals with Type 2 Diabetes are at substantially increased risk for developing cardiovascular disease. A reliable predictive model could be very helpful for identifying individuals at the greatest risk so that they can modify lifestyle and health habits and their physicians can consider pharmacologic interventions and closer monitoring.

We are particularly interested in determining whether the model is as effective for historically underrepresented groups as it is for historically well-represented groups. Historically, medical studies examining prognostic predictors, including for cardiac health, have focused on predominantly Caucasian populations and might not be as applicable to underrepresented minorities. Our goal is to create a model that can effectively predict risk of cardiovascular disease for people of all demographic backgrounds.

Project Purpose(s)

  • Disease Focused Research (type 2 diabetes mellitus)
  • Population Health
  • Social / Behavioral

Scientific Approaches

- EHR, survey data, and potentially wearables data will be used by the model to predict cardiovascular risk.
- We will use an XGBoost model to make predictions and compare it with other models.
- The Fairlearn Python package will be used to evaluate whether the model is fair across different demographics.

Anticipated Findings

- We are optimistic that we will be able to create an effective model to predict cardiovascular risk in people with type 2 diabetes. Our goal is to produce a model with a predictive capacity that exceeds currently available predictive models, especially when examining individuals from underrepresented populations. Additionally, because the All of Us dataset contains so many unique types of data, evaluation of feature importance will determine which features are most impactful for this prediction.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Geography

Data Set Used

Controlled Tier

Research Team

Owner:

  • Jack Cummins - Other, University of Massachusetts Medical School

Parkinson's disease

Scientific Questions Being Studied

How do genetic and environmental factors contribute to the development of Parkinson’s Disease?

I am exploring the data surrounding Parkinson's disease for a college course assignment.

Project Purpose(s)

  • Disease Focused Research (Parkinson's Disease)
  • Educational

Scientific Approaches

N/A

Anticipated Findings

N/A

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Wearable_circadian_sleep_psychiatric_risks

Scientific Questions Being Studied

We are interested in investigating the relationships between mental health and circadian rhythms. We hope to understand how sleep and circadian rhythms contribute to psychiatric disorder risks, such as depression and anxiety.

Project Purpose(s)

  • Population Health

Scientific Approaches

We plan to use time series analysis to study circadian patterns and their relationships with depression and anxiety risks.

Anticipated Findings

We plan to use time series analysis to study circadian patterns and their relationships with depression and anxiety risks.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Minki Lee - Research Assistant, University of Michigan

Collaborators:

  • Ruby Kim - Research Fellow, University of Michigan

Duplicate of How to Work with All of Us Genomic Data (Hail - Plink)(v8)

Scientific Questions Being Studied

Not applicable - these notebooks demonstrate, through example analyses, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Project Purpose(s)

  • Disease Focused Research (Long QT syndrome)
  • Ancestry
  • Other Purpose (Demonstrate to the All of Us Researcher Workbench users how to get started with the All of Us genomic data and tools. It includes an overview of all the All of Us genomic data and shows some simple examples of how to use these data.)

Scientific Approaches

Not applicable - these notebooks demonstrate, through example analyses, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Anticipated Findings

Not applicable - these notebooks demonstrate, through example analyses, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Salivary Stones Study

Scientific Questions Being Studied

We aim to study which health conditions most commonly occur with salivary stones. Salivary stones are deposits in the submandibular or parotid glands that can cause persistent infection and often present with pain and swelling. We want to investigate risk factors such as diabetes, hypertension, and environmental factors in predicting the emergence of salivary stones. We hope that gaining insight into these predictive metrics will help guide prevention strategies and prioritize environmental cleanup.

Project Purpose(s)

  • Disease Focused Research (Sialolithiasis)

Scientific Approaches

We aim to structure this study as a retrospective analysis of patients with salivary stones matched to patients without salivary stones. We will collect patient demographic data as well as condition data about hypertension, diabetes, alcohol use, obesity, and others. We will then perform a logistic regression to identify the effect sizes of these condition and demographic variables in predicting salivary stones. We will optimize the logistic regression with various machine learning packages in Python.

Anticipated Findings

We anticipate that a combination of health factors can help predict a prognosis of salivary stones. We would contribute to the body of scientific knowledge on prevention strategies.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Collaborators:

  • Khushi Bhatt - Graduate Trainee, University of California, Irvine

FedGeneGen

Scientific Questions Being Studied

"How can we develop a robust and privacy-preserving distributed learning framework for analyzing large-scale genomic data across multiple institutions without compromising individual patient privacy or data ownership?"

This question is of critical importance for several reasons:

  • Advancing Precision Medicine: By enabling collaborative analysis of diverse genomic datasets, we can improve our understanding of genetic variations and their impact on disease susceptibility and treatment responses. This knowledge is crucial for advancing precision medicine initiatives.
  • Overcoming Data Silos: Many valuable genomic datasets are currently siloed within individual institutions due to privacy concerns and regulatory restrictions. Our research seeks to break down these barriers while maintaining strict privacy protections.

Project Purpose(s)

  • Methods Development

Scientific Approaches

Tools:

  • TensorFlow Federated & PyTorch (machine learning)
  • PLINK (genomic analysis)
  • R (statistical analysis)
  • Docker (deployment)
  • Jupyter Notebooks (development)

Approach:

  • Generate realistic synthetic datasets
  • Develop privacy-preserving distributed learning algorithms
  • Evaluate performance, privacy, and efficiency
  • Compare with centralized methods
  • Assess scalability through simulations

Anticipated Findings

  • Demonstration of a scalable, privacy-preserving distributed learning framework for genomic data analysis.
  • Quantification of privacy-utility trade-offs in federated genomic analysis.
  • Comparison of statistical power between distributed and centralized approaches.
  • Identification of optimal techniques for secure multi-party computation in genomic contexts.
  • Assessment of the impact of differential privacy on rare variant detection.
  • Enable large-scale collaborative genomic studies while preserving individual privacy and data ownership.
  • Provide a robust methodology for cross-institutional genomic data analysis, potentially accelerating precision medicine initiatives.
  • Offer insights into the effectiveness of federated learning and other privacy-preserving techniques for sensitive biomedical data.
  • Establish benchmarks for privacy and utility in distributed genomic analysis.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Sitao Min - Graduate Trainee, Rutgers, The State University of New Jersey

PheWAS Analysis for R206C DNASE1L3 Variant Associated with Autoimmune Diseases

Scientific Questions Being Studied

We hope to address the question "Which disease phenotypes are conferred by genetic variants in the DNASE1/DNASE1L3 genes?". In GWAS, a common coding variant in the DNASE1L3 gene (R206C) is associated with risk for several autoimmune diseases: systemic sclerosis (SSc), systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), and autoimmune thyroid diseases (AITD). Complete deficiency of DNASE1L3 leads to monogenic pediatric SLE (or a similar pre-lupus condition, HUVS). The disease associations with risk for systemic autoimmune disease are mainly limited to studies in cohorts of European cases and controls. This question is important because we stand to learn the penetrance of autoimmune traits in carriers of this common variant, and we may identify new traits that result from insufficient or deficient DNase1L3, which is relevant to identifying patient groups that could benefit from treatment with a DNASE biologic drug.

Project Purpose(s)

  • Educational
  • Ancestry

Scientific Approaches

We will use Jupyter Notebook to interrogate the Hail 0.2 MatrixTables to perform a phenome-wide association study according to the method described by Tran et al., Bioinformatics, 2024, 41(1).

Anticipated Findings

We expect that some of the disease associations at the DNASE1L3 locus in the GWAS catalog will be replicated in this PheWAS (such as associations with SSc, SLE, RA and AITD). Existing associations in the GWAS catalog are mostly in cases of European ancestry. Because of the diversity of participants in All of Us, this PheWAS analysis may reveal that traits associated with DNASE1L3 in Europeans are also associated (with similar effect sizes) in non-European groups. We also expect the PheWAS to reveal some new traits (not present in the GWAS catalog) that are associated with the common missense variant of DNASE1L3, R206C. These findings will serve to replicate (and strengthen) existing associations, and may reveal new conditions that are treatable with DNASE1/1L3 biologics, which are under development.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Nicole Adelson - Undergraduate Student, Feinstein Institute for Medical Research

Burden of Carriers for Autosomal Recessive Conditions in All of Us Genomes

Scientific Questions Being Studied

What is the frequency of heterozygote carriers of pathogenic/likely-pathogenic variants in All of Us population datasets?

Reproductive carrier screening is conducted before or during pregnancy to inform parents about their likelihood of having a child affected by a genetic condition if both partners carry a pathogenic variant in a particular gene. This information enables couples to make decisions regarding reproductive options, ranging from preimplantation genetic testing to pregnancy management and adoption. The American College of Medical Genetics and Genomics recommends that screening be offered for 113 genetic conditions. All of these conditions have carrier frequencies exceeding 1 in 200 in the general population data from the Genome Aggregation Database version 2, comprising over 125,000 exomes from various regions. In this study, we use the genetic information from All of Us to estimate the frequency of carriers for these conditions in the US population represented in All of Us.

Project Purpose(s)

  • Ancestry

Scientific Approaches

We will use our previously published approach (PMID: 39615480, 35532184). Briefly, a filtered ClinVar variant list will be queried against the All of Us population VCF files for matching variants. The information from these files will be used to estimate variant carrier frequency (VCF) and gene carrier frequency (GCF).
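
As a rough illustration of the two quantities named above, the following Python sketch computes a per-variant carrier frequency from carrier counts and combines several variants into a gene-level frequency. It is not the published pipeline cited above: the function names are hypothetical, and the gene-level formula assumes that carrier status at different variants in the same gene is independent, which is a simplifying assumption made here for illustration.

```python
from typing import Iterable


def variant_carrier_frequency(het_carriers: int, n_individuals: int) -> float:
    """Fraction of individuals who are heterozygous carriers of one variant."""
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    if not 0 <= het_carriers <= n_individuals:
        raise ValueError("het_carriers must be between 0 and n_individuals")
    return het_carriers / n_individuals


def gene_carrier_frequency(variant_frequencies: Iterable[float]) -> float:
    """Probability of carrying at least one qualifying variant in a gene.

    Assumption (not from the source): carrier status at each variant is
    independent, so the chance of carrying none is the product of (1 - f).
    """
    non_carrier_probability = 1.0
    for frequency in variant_frequencies:
        non_carrier_probability *= 1.0 - frequency
    return 1.0 - non_carrier_probability


# Hypothetical counts, for illustration only.
per_variant = [
    variant_carrier_frequency(10, 1000),
    variant_carrier_frequency(20, 1000),
]
print(gene_carrier_frequency(per_variant))
```

With individual-level genotypes, a gene carrier frequency can instead be counted directly as the share of individuals carrying any qualifying variant, which avoids the independence assumption.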

Anticipated Findings

An important challenge in estimation of carrier frequency of genetic conditions in population datasets is the lack of a bioinformatic pipeline that can accurately estimate the frequency of carriers for disease-causing variants in any population. Our team recently developed and verified a bioinformatic pipeline for comprehensive analysis of Genome Aggregation Database (gnomAD) datasets. Utilizing All of Us genomic datasets, we seek to provide updated data that could be used for carrier screening recommendations based on the data available from the US populations.

Together, this study takes a novel approach and provides new information utilizing large population datasets for further use in both research and clinical applications.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Vishal Soman - Project Personnel, University of Pittsburgh
  • Leon Xu - Undergraduate Student, University of Pittsburgh
  • Mahmoud Aarabi - Early Career Tenure-track Researcher, University of Pittsburgh

GWAS USC LA's BeST

Scientific Questions Being Studied

We are using the All of Us dataset to perform GWAS on a specific health trait. We hope to find the main genotypes associated with the specific trait using GWA techniques. We will use the results to better understand how genotypes affect the trait in question.

Project Purpose(s)

  • Ancestry

Scientific Approaches

We plan to analyze All of Us data using GWAS methods, including but not limited to Principal Components Analysis.

Anticipated Findings

Because we are using GWAS methodology, we are not going to be making any causal statements or conclusions. Rather, our findings will tell us whether or not certain genes are associated with a particular phenotype/trait. Furthermore, it is unclear whether or not the genes we identify are the primary drivers for that specific trait. Our methods do not account for coding/noncoding genes, and we may not know the specific processes the genes we identify are involved in.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Associations Between Common Autoimmune Conditions & Vulvar Carcinoma

Scientific Questions Being Studied

1. What is the prevalence of autoimmune conditions among patients with a documented diagnosis of vulvar carcinoma in the All of Us database?
2. Are certain autoimmune diseases (e.g., lichen sclerosus, systemic lupus erythematosus, rheumatoid arthritis, Sjogren’s syndrome) more frequently reported in patients with vulvar carcinoma compared to matched controls without vulvar carcinoma?
3. Is there a statistically significant association between a history of autoimmune conditions and the risk of developing vulvar carcinoma?
4. Among patients with vulvar carcinoma, does the presence of autoimmune conditions correlate with differences in demographic factors such as age, race/ethnicity, or socioeconomic status?
5. Does the timing of autoimmune diagnosis (before, concurrent with, or after vulvar carcinoma diagnosis) show any patterns or associations?
6. Are there differences in treatment patterns or outcomes for vulvar carcinoma patients with versus without autoimmune conditions?

Project Purpose(s)

  • Disease Focused Research (Vulvar Carcinoma, Autoimmune Diseases)
  • Population Health

Scientific Approaches

This retrospective cohort study will use the All of Us Research Program dataset, which includes EHR, survey, and demographic data from a diverse U.S. population. We will identify cases with vulvar carcinoma and matched controls without the diagnosis, using ICD codes to define autoimmune conditions. Statistical analyses will include logistic regression to evaluate associations between autoimmune diseases and vulvar carcinoma risk, adjusting for confounders like age, race, and medication use. Subgroup and temporal analyses will explore specific autoimmune conditions and timing relative to cancer diagnosis. Data will be managed and analyzed using R or Python within the All of Us Researcher Workbench environment. Visualization tools will illustrate findings, and version control will ensure reproducibility.

Anticipated Findings

We expect to find a higher prevalence of certain autoimmune conditions, such as lichen sclerosus or systemic lupus erythematosus, among patients with vulvar carcinoma compared to controls. The study may reveal statistically significant associations between autoimmune diseases and increased vulvar carcinoma risk, potentially influenced by demographic factors or immunosuppressive treatments. Temporal analysis might show autoimmune conditions preceding cancer diagnosis, suggesting a possible contributory role of chronic inflammation.

These findings would enhance understanding of autoimmune diseases as potential risk factors for vulvar carcinoma, supporting more vigilant screening in affected populations. The study may identify subgroups at higher risk, guiding personalized preventive strategies. Additionally, it could inform future research on the pathophysiology linking autoimmunity and vulvar cancer, ultimately improving early detection and patient outcomes.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hamna Khalid - Graduate Trainee, University of Illinois at Urbana Champaign

SMI and cardiometabolic traits

Scientific Questions Being Studied

This study aims to explore the genetic and phenotypic correlations between severe mental illnesses (schizophrenia, major depressive disorder, and bipolar disorder) and cardiometabolic traits.

Project Purpose(s)

  • Disease Focused Research (psychiatric disorders, metabolic disorders)

Scientific Approaches

For phenotypic correlation, I will curate phenotypes of SMI and cardiometabolic traits, and compute Pearson correlations between each pair of phenotypes. For genetic correlation, I will use LDSC to estimate genetic correlations based on pairs of summary statistics from the All-by-All table. If any SMI shows a strong correlation with cardiometabolic traits, I will construct PRS for SMI to assess their predictive power for cardiometabolic outcomes.
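
The pairwise Pearson correlation step described above could look roughly like the following Python sketch. The function names and the phenotype labels are hypothetical, the example values are made up, and the sketch assumes each phenotype is a numeric vector measured on the same ordered set of participants with no missing values.

```python
import math
from typing import Dict, Mapping, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / math.sqrt(var_x * var_y)


def pairwise_correlations(
    smi: Mapping[str, Sequence[float]],
    cardiometabolic: Mapping[str, Sequence[float]],
) -> Dict[Tuple[str, str], float]:
    """Correlate every SMI phenotype with every cardiometabolic trait."""
    return {
        (smi_name, trait_name): pearson_r(smi_values, trait_values)
        for smi_name, smi_values in smi.items()
        for trait_name, trait_values in cardiometabolic.items()
    }


# Hypothetical phenotype labels and values, for illustration only.
smi_phenotypes = {"depression_score": [1.0, 2.0, 3.0, 4.0]}
cardio_traits = {"bmi": [22.0, 25.0, 27.0, 31.0]}
print(pairwise_correlations(smi_phenotypes, cardio_traits))
```

For binary diagnoses, the same formula yields the point-biserial or phi coefficient, so the choice of phenotype coding affects how the values should be interpreted.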

Anticipated Findings

I expect that SMI will show some correlation with cardiometabolic traits. These results could help identify genes that contribute to both SMI and cardiometabolic conditions.

Demographic Categories of Interest

  • Race / Ethnicity
  • Sex at Birth
  • Disability Status

Data Set Used

Controlled Tier

Research Team

Owner:

  • Yi-Sian Lin - Graduate Trainee, Baylor College of Medicine

Psychiatric_exome_v8

Scientific Questions Being Studied

We are interested in rare genetic variants associated with psychological/psychiatric phenotypes (e.g., depression, anxiety, cognitive functions). As a first step, we plan to estimate the prevalence and penetrance, as well as the phenotypic consequences, of mutations in risk genes reported in the literature. We then plan to perform an exome-wide association study of mental health phenotypes in the All of Us cohort. We hope to identify relatively large-effect variants affecting the risk of psychiatric/psychological disorders to aid in understanding the mechanisms of these disorders.

Project Purpose(s)

  • Ancestry

Scientific Approaches

Participants with a history or current diagnosis of psychiatric disorders (including major depressive disorder and anxiety disorders) will be ascertained from self-report and EHR data. The burden of mutations in risk genes reported in the literature will be compared between cases and controls and correlated with psychiatric symptoms. Single-variant analyses and gene-based tests (e.g., burden, SKAT, SKAT-O) will be performed and interpreted.
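The case/control burden comparison can be pictured with a minimal sketch: collapse each person's qualifying variants in one gene into a carrier flag, then tabulate carriers by case status. This is a simplified illustration under stated assumptions, not the project's pipeline. A real analysis would typically use regression with covariates, and SKAT and SKAT-O are variance-component tests that this sketch does not implement. All names and genotypes below are hypothetical.

```python
def carrier_table(genotypes, is_case):
    """Collapse per-variant allele counts to a carrier flag and tabulate it.

    genotypes: one list per person of alternate-allele counts (0/1/2) for the
               qualifying rare variants in a single gene.
    is_case:   one bool per person.
    Returns (case_carriers, case_noncarriers,
             control_carriers, control_noncarriers).
    """
    table = [0, 0, 0, 0]
    for person, case in zip(genotypes, is_case):
        carrier = any(count > 0 for count in person)
        index = (0 if case else 2) + (0 if carrier else 1)
        table[index] += 1
    return tuple(table)


def odds_ratio(a, b, c, d):
    """Odds ratio with a 0.5 continuity correction so zero cells do not divide by zero."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


# Made-up genotypes for six people at three qualifying variants.
genotypes = [
    [0, 1, 0], [0, 0, 0], [1, 0, 0],  # cases
    [0, 0, 0], [0, 0, 0], [0, 0, 2],  # controls
]
is_case = [True, True, True, False, False, False]

table = carrier_table(genotypes, is_case)
print(table, round(odds_ratio(*table), 3))
```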

Anticipated Findings

We will estimate the prevalence, penetrance, and pleiotropy of mutations in genes associated with neuropsychiatric disorders in the general population. This can inform clinical integration of exome-wide association study findings (e.g., how much risk a given mutation confers in general) and identify factors modulating penetrance (e.g., environment, regulatory genome). Discovery of relatively large-effect mutations associated with psychiatric disorders can lead to better elucidation of the mechanisms of such disorders by pinpointing targets for further experimental and clinical validation.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Controlled Tier

Research Team

Owner:

v8_NF1

It has been found that substantial participant bias by sex exists in large-scale population-based biobank studies such as UK Biobank, FinnGen, and BioBank Japan. For example, the body mass index–increasing allele at FTO was observed at…

Scientific Questions Being Studied

It has been found that substantial participant bias by sex exists in large-scale population-based biobank studies such as UK Biobank, FinnGen, and BioBank Japan. For example, the body mass index–increasing allele at FTO was observed at higher frequency in males than in females in several biobank-based studies, suggesting that females with a lower BMI were more likely to participate in such studies. To identify potential participant biases by sex in the AoU cohort in each ancestral group, we propose to perform a GWAS comparing the allele frequency of each variant in males vs. females in each ancestral group.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

We will use a genome-wide association study (GWAS) approach, with sex at birth as the trait. The entire genotyped cohort will be divided into males and females (by sex at birth), and we will compare the allele frequency of each variant in males vs. females across the genome. A logistic regression model adjusted for age at enrollment and PCs will be used. Additionally, Fisher's exact test will be used for rare variants.
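For the rare-variant step, a two-sided Fisher's exact test on a 2x2 table of allele counts (alternate vs. reference alleles, in males vs. females) might look like the sketch below. It is a plain-Python illustration using the hypergeometric distribution. The counts are invented, the function name is hypothetical, and a production analysis would more likely rely on an established statistics package.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # The small relative tolerance keeps floating-point ties with the observed table.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Invented allele counts: rows are males and females,
# columns are alternate and reference alleles.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))
```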

Anticipated Findings

We anticipate identifying variants associated with being female or male in each ancestral group. These variants may reflect participant bias or biological traits. We aim to investigate the variants attributable to participant bias in each ancestral group to understand the causes of these biases, particularly for historically understudied groups.

This study can help identify the roots of participant biases by sex across ancestral groups. This can inform future study design and participant recruitment.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Controlled Tier

Research Team

Owner:

  • Chenjie Zeng - Research Fellow, National Human Genome Research Institute (NIH - NHGRI)

NTD Project

Research Question: How do climate-driven vector expansion, healthcare access disparities, and surveillance limitations shape the risk and recognition of chronic Chikungunya outcomes among Latino communities in the U.S.? Specific Aims: To describe demographic and health system characteristics of CHIKV-diagnosed individuals…

Scientific Questions Being Studied

Research Question: How do climate-driven vector expansion, healthcare access disparities, and surveillance limitations shape the risk and recognition of chronic Chikungunya outcomes among Latino communities in the U.S.?

Specific Aims:
To describe demographic and health system characteristics of CHIKV-diagnosed individuals in the U.S.
To quantify chronic sequelae and healthcare utilization patterns post-diagnosis.
To map environmental risk and demographic vulnerability intersections.
To inform culturally and climate-relevant preparedness strategies.

Project Purpose(s)

  • Educational

Scientific Approaches

I propose a mixed-methods, exploratory approach using a nationally representative EHR-linked cohort such as All of Us. The study will describe the demographic, clinical, and geographic characteristics of individuals coded with chikungunya in U.S. health records.

Key components:
Descriptive statistics: Age, sex, race/ethnicity, nativity, language, insurance status
Clinical outcomes: Presence of ICD codes for arthritis, chronic pain, specialist visits
Health access: Number of follow-up visits, geographic location
Mapping: Overlay Aedes suitability and Latino population density with patient ZIP codes

Anticipated Findings

A descriptive report on CHIKV chronic burden in a U.S. context
Geospatial maps linking environmental risk with social vulnerability
Policy brief recommending surveillance enhancements and care equity reforms

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Geography
  • Disability Status
  • Access to Care
  • Education Level
  • Income Level

Data Set Used

Controlled Tier

Research Team

Owner:

GWAS and PGS

Genome-wide association studies (GWAS) have identified tens of thousands of genotype-phenotype associations for human complex traits. A polygenic risk score (PRS) for a trait is typically calculated as a weighted sum of trait-associated allele counts across numerous loci in the genome,…

Scientific Questions Being Studied

Genome-wide association studies (GWAS) have identified tens of thousands of genotype-phenotype associations for human complex traits. A polygenic risk score (PRS) for a trait is typically calculated as a weighted sum of trait-associated allele counts across numerous loci in the genome, where each weight is obtained from a corresponding GWAS. PRS is an effective tool to quantify the aggregated genetic propensity for a trait or disease. With rapid advances in GWAS sample size and statistical methodologies, PRS has shown substantially improved prediction accuracy and great potential in disease risk screening and precision medicine. The main goals of this project are 1) to run GWAS on numerous complex traits to identify and interpret genetic associations through integrative modeling of annotation data, and 2) to produce a set of PRS for hundreds of complex traits using newly released genomic data in AllofUs.
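The weighted-sum definition of a PRS given above can be sketched in a few lines. The variant IDs, weights, and dosages below are placeholders invented for illustration, and `polygenic_score` is a hypothetical helper. Methods such as PRS-CS differ in how the weights are estimated, not in this final summation.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele counts over the variants that have a weight.

    dosages: {variant_id: effect-allele count (0, 1, or 2)} for one person.
    weights: {variant_id: per-allele effect size taken from a GWAS}.
    Variants missing from `dosages` are skipped (an assumption of this sketch).
    """
    return sum(
        weight * dosages[variant]
        for variant, weight in weights.items()
        if variant in dosages
    )


# Placeholder variant IDs and made-up numbers.
weights = {"variantA": 0.2, "variantB": -0.1, "variantC": 0.05}
person = {"variantA": 2, "variantB": 1, "variantC": 0}

score = polygenic_score(person, weights)
print(round(score, 3))
```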

Project Purpose(s)

  • Social / Behavioral
  • Methods Development
  • Ancestry

Scientific Approaches

We will use software such as Hail, Regenie, and/or BOLT-LMM to run GWAS. We will implement a state-of-the-art method named PRS-CS to compute PRS for each GWAS trait. We will benchmark and optimize the performance of PRS models using a summary statistics-based cross-validation approach called PUMAS developed by our group (Zhao et al. Genome Biology 22(1), 2021). AllofUs genomic data will undergo rigorous quality control (QC) procedures, including removing variants with low sequencing depth or low variant-calling quality.

Anticipated Findings

We will produce GWAS summary statistics for numerous complex traits and disorders. We will also produce PRS for all individuals with whole-genome sequencing (WGS) data in AllofUs. Every individual will have hundreds of scores quantifying their genetic propensity for a large collection of diseases and traits. These scores will be immediately applicable in future studies. For example, one planned future study is to integrate breast cancer PRS with electronic health record data in AllofUs to improve risk screening accuracy.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Ranjun Qi - Undergraduate Student, University of Wisconsin, Madison

Collaborators:

  • Yuchang Wu - Research Fellow, University of Wisconsin, Madison
  • Shu Cao - Graduate Trainee, University of Wisconsin, Madison

Rel v8 Phenotypic Consequences of Elastin Related Genes

The Elastin (ELN) gene encodes a protein that provides recoil to tissues that stretch repetitively, such as the lungs, vasculature, and skin. Its presence across many tissues makes it a good candidate for study as alterations may lead to a…

Scientific Questions Being Studied

The Elastin (ELN) gene encodes a protein that provides recoil to tissues that stretch repetitively, such as the lungs, vasculature, and skin. Its presence across many tissues makes it a good candidate for study as alterations may lead to a range of phenotypic presentations. Our goal is to define genotype-phenotype associations of ELN and other elastic fiber genes, at the gene and variant level to learn about the mechanism and impact of abnormal elastin, and to discern the diseases caused by variants in connective tissue genes. To do this, we will A) screen an unselected population (AllOfUs) for variants in elastin and other elastic fiber genes, prioritizing the study of variants predicted to be of high consequence and B) examine their association with phenotypic features mined from the electronic health record and provided through surveys, questionnaires, etc.

Project Purpose(s)

  • Ancestry

Scientific Approaches

We expect an enrichment of elastic fiber phenotypes in patients with damaging variants and that the risk of these phenotypes is correlated with the predicted deleteriousness of the variant. We will identify participants with variants in elastic fiber genes. We will prioritize variants by frequency, type, location of variant, predicted alteration of function, and prior relevant phenotypic evidence in other study populations. All patients with prioritized variants will be comprehensively phenotyped based on a list of known and hypothetical elastic fiber related phenotypes. Second, we will use automated phenotyping based on ICD codes, lab values and survey responses to perform PheWAS with elastin fiber related gene variation to identify new phenotypes associated with variation in elastin fiber related genes. We will conduct penetrance analysis for each significant variant to estimate relative risk, then examine its correlation with deleteriousness.

Anticipated Findings

By using this gene-first approach, we anticipate that we will identify novel genotype-phenotype associations for elastin-related genes and will provide initial epidemiological data about progression of disease and gene-specific risk factors. The other rationale is to more broadly test the hypothesis that a genome-first approach that identifies clinically significant genetic variation coupled with deep phenotyping using electronic health records can be used to rapidly discover and expand knowledge about the impact of genes on human health and disease. The intent is to develop generalizable approaches derived from this study focusing on elastin and related genes.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • TR Luperchio - Research Fellow, National Heart, Lung, and Blood Institute (NIH - NHLBI)
  • Navya Shilpa Josyula - Project Personnel, Geisinger Clinic

Phenotypic Consequences of Elastin Related Genes

The Elastin (ELN) gene encodes a protein that provides recoil to tissues that stretch repetitively, such as the lungs, vasculature, and skin. Its presence across many tissues makes it a good candidate for study as alterations may lead to a…

Scientific Questions Being Studied

The Elastin (ELN) gene encodes a protein that provides recoil to tissues that stretch repetitively, such as the lungs, vasculature, and skin. Its presence across many tissues makes it a good candidate for study as alterations may lead to a range of phenotypic presentations. Our goal is to define genotype-phenotype associations of ELN and other elastic fiber genes, at the gene and variant level to learn about the mechanism and impact of abnormal elastin, and to discern the diseases caused by variants in connective tissue genes. To do this, we will A) screen an unselected population (AllOfUs) for variants in elastin and other elastic fiber genes, prioritizing the study of variants predicted to be of high consequence and B) examine their association with phenotypic features mined from the electronic health record and provided through surveys, questionnaires, etc.

Project Purpose(s)

  • Ancestry

Scientific Approaches

We expect an enrichment of elastic fiber phenotypes in patients with damaging variants and that the risk of these phenotypes is correlated with the predicted deleteriousness of the variant. We will identify participants with variants in elastic fiber genes. We will prioritize variants by frequency, type, location of variant, predicted alteration of function, and prior relevant phenotypic evidence in other study populations. All patients with prioritized variants will be comprehensively phenotyped based on a list of known and hypothetical elastic fiber related phenotypes. Second, we will use automated phenotyping based on ICD codes, lab values and survey responses to perform PheWAS with elastin fiber related gene variation to identify new phenotypes associated with variation in elastin fiber related genes. We will conduct penetrance analysis for each significant variant to estimate relative risk, then examine its correlation with deleteriousness.

Anticipated Findings

By using this gene-first approach, we anticipate that we will identify novel genotype-phenotype associations for elastin-related genes and will provide initial epidemiological data about progression of disease and gene-specific risk factors. The other rationale is to more broadly test the hypothesis that a genome-first approach that identifies clinically significant genetic variation coupled with deep phenotyping using electronic health records can be used to rapidly discover and expand knowledge about the impact of genes on human health and disease. The intent is to develop generalizable approaches derived from this study focusing on elastin and related genes.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • TR Luperchio - Research Fellow, National Heart, Lung, and Blood Institute (NIH - NHLBI)
  • Navya Shilpa Josyula - Project Personnel, Geisinger Clinic

Lauren's project_validation

Our research project investigates the genetic regulation of soluble Triggering Receptor Expressed on Myeloid Cells 1 (sTREM1) in childhood asthma as a potential mechanism underlying bronchodilator responsiveness (BDR). We aim to identify genetic variants associated with circulating sTREM1 levels and…

Scientific Questions Being Studied

Our research project investigates the genetic regulation of soluble Triggering Receptor Expressed on Myeloid Cells 1 (sTREM1) in childhood asthma as a potential mechanism underlying bronchodilator responsiveness (BDR). We aim to identify genetic variants associated with circulating sTREM1 levels and assess their association with BDR in children with asthma. This could reveal novel genetic pathways influencing treatment response and airway inflammation. Additionally, we will explore whether these genetic associations persist in non-asthmatic individuals to determine the disease specificity of the sTREM1-BDR link. This question is important because improving our understanding of asthma pharmacogenetics could enable more targeted therapies for children and identify biomarkers predictive of treatment response, ultimately improving public health outcomes in pediatric asthma.

Project Purpose(s)

  • Disease Focused Research (Asthma)
  • Methods Development
  • Control Set
  • Ancestry

Scientific Approaches

We will use individual-level genotype, phenotype, and protein expression data to assess genetic regulation of sTREM1 and its relationship with bronchodilator responsiveness (BDR). Our primary dataset includes childhood asthma cases with available genotyping and sTREM1 plasma levels. We will perform genome-wide association analyses to identify variants linked to sTREM1 expression. Significant SNPs will then be tested for association with BDR using linear and logistic regression models, adjusting for age, sex, and ancestry principal components. Mediation analysis will assess whether sTREM1 mediates genetic effects on BDR. For replication and comparison, we will evaluate these SNPs in non-asthmatic individuals. Tools will include PLINK for genotype QC and association testing, R for statistical modeling and visualization. We aim to uncover mechanisms linking innate immunity and pharmacologic response in pediatric asthma.

Anticipated Findings

We anticipate identifying genetic variants that regulate soluble TREM1 levels and are associated with bronchodilator responsiveness (BDR) in children with asthma. We expect that these variants will highlight immune pathways, particularly involving innate immune signaling, that influence treatment response. If mediation by sTREM1 is observed, this would suggest a biologically plausible mechanism linking genetic variation to asthma pharmacodynamics. Identifying whether these associations are specific to asthma or also present in non-asthmatic individuals will inform disease specificity. Our findings could contribute novel insights into the immunogenetic regulation of asthma treatment response, offer potential biomarkers for predicting BDR, and guide future precision medicine strategies in pediatric asthma care.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Collaborators:

  • Sung Chun - Senior Researcher, Boston Children's Hospital
  • Lauren Flynn - Project Personnel, Boston Children's Hospital

GATK-SV analysis HEDGE and AoU_2025.01.21

We are interested in identifying the genetic contributors to hypermobile Ehlers Danlos syndrome (hEDS). We are using the AllOfUs genetic data as population controls to identify genetic differences between people suffering from hEDS and a diverse selection of people from…

Scientific Questions Being Studied

We are interested in identifying the genetic contributors to hypermobile Ehlers Danlos syndrome (hEDS). We are using the AllOfUs genetic data as population controls to identify genetic differences between people suffering from hEDS and a diverse selection of people from around the United States. Learning more about the genetic basis for this disease will help us develop new treatments and therapies for people who suffer from this debilitating condition.

Project Purpose(s)

  • Disease Focused Research (Hypermobile Ehlers-Danlos Syndrome (hEDS))
  • Control Set
  • Ancestry

Scientific Approaches

We have structural variant calls from cases, generated by the GATK-SV pipeline. We plan to compare cases with controls from AllofUs with similar genetic ancestry, matched on principal components of ancestry. We will perform association analyses of individual variants and, for rare variants, of variants in aggregate (e.g., aggregated by chromosome location or predicted functional consequence).

Anticipated Findings

This analysis will enable rare and common structural variant studies of hEDS and have the potential to help define the genetic bases of this disease. The ancestral diversity in AllofUs will also help extend these findings to individuals with non-European ancestry, reducing the disparities of the uneven application of genetic discoveries.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Bob Handsaker - Other, Broad Institute

Local Ancestry and Ancestry Specific PRS on the All of Us dataset

We aim to study the genetic ancestry and admixture patterns in the 'All of Us' cohort using high-resolution local ancestry inference. This is crucial for understanding genetic variation, health disparities, and population-specific disease risks, ultimately contributing to more inclusive and…

Scientific Questions Being Studied

We aim to study the genetic ancestry and admixture patterns in the 'All of Us' cohort using high-resolution local ancestry inference. This is crucial for understanding genetic variation, health disparities, and population-specific disease risks, ultimately contributing to more inclusive and accurate genetic research and personalized medicine.

Project Purpose(s)

  • Educational
  • Methods Development
  • Ancestry

Scientific Approaches

We will use the 'All of Us' genomic dataset and apply the Gnomix model for high-resolution local ancestry inference. This involves preprocessing the data, generating initial ancestry estimates, and refining these with a smoothing module. We will validate the results using reference populations and calibrate for accuracy, uncovering detailed ancestry compositions and admixture patterns. Additionally, polygenic risk score (PRS) models that incorporate ancestry information will be trained and evaluated to measure the utility of these ancestry estimates.

Anticipated Findings

We expect to find detailed ancestry compositions and admixture patterns, revealing genetic variation within the cohort. These findings will improve our understanding of population structure, enhance genetic research accuracy, and support personalized medicine by identifying population-specific genetic risks, benefiting public health strategies.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Thomas Gillespie - Graduate Trainee, University of California, Santa Cruz
  • Qudsi Aljabiri - Undergraduate Student, University of California, Santa Cruz
  • Harrison Heath - Graduate Trainee, University of California, Santa Cruz
  • Cecilia Padilla Iglesias - Research Fellow, University of California, Santa Cruz
Request a Review of this Research Project

You can request that the All of Us Resource Access Board (RAB) review a research purpose description if you have concerns that this research project may stigmatize All of Us participants or violate the Data User Code of Conduct in some other way. To request a review, you must fill in a form, which you can access by selecting ‘request a review’ below.