Jaan Altosaar

One Fact Foundation

5 active projects

Payless.Health

Scientific Questions Being Studied

We intend to study the potential of synthetic patient data in augmenting health informatics research. The specific questions are:

1. Can synthetic patient data alleviate limitations of scope and access inherent in de-identified patient datasets?
2. How can aggregated, de-identified statistics for health-related variables within All of Us data be used to enhance statistics around health systems research?
3. Which anonymization algorithm is most effective in preserving privacy while maximizing the utility of health data (e.g. rounding up for sparse populations or comorbidities)?

This study is important because it can lead to advancements in health informatics, ensuring broader, more diverse datasets that can aid in precision medicine, disease prevention strategies, and insights into human health while ensuring patient privacy. Because of the rich resources within All of Us (AoU), the work at this stage is exploratory rather than prescriptively defined.

Project Purpose(s)

  • Population Health
  • Methods Development
  • Other Purpose (Not-For-Profit Purpose: The data will be used by a not-for-profit entity for research or for product or service development, e.g. for understanding hospital costs, prevalence rates, and improving health transparency.)

Scientific Approaches

We will utilize the All of Us Research Hub dataset, which includes genomic data, survey responses, physical measurements, EHRs, and wearables data. Our approaches include statistical analyses, exploratory data analysis (EDA), and the computation of marginals for relevant variables. Tools like Python, R, and SQL will be employed for data manipulation and analysis. Additionally, we'll evaluate various anonymization algorithms and assess their efficacy.
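As a rough illustration of computing a marginal and then rounding counts up for sparse categories, here is a minimal Python sketch. The rounding base of 20, the function names, and the toy category labels are assumptions made for this example; they are not taken from the project or from any All of Us policy document.

```python
import math
from collections import Counter


def round_up(count, base=20):
    """Round a non-zero count up to the next multiple of `base` (assumed value)."""
    if count == 0:
        return 0
    return math.ceil(count / base) * base


def protected_marginal(values, base=20):
    """Compute the marginal count of each category, then round every count up.

    Rounding up to a multiple of `base` means no reported count can reveal
    a category with fewer than `base` members.
    """
    counts = Counter(values)
    return {category: round_up(n, base) for category, n in sorted(counts.items())}


if __name__ == "__main__":
    # Toy, made-up category labels; not real participant data.
    toy = ["A"] * 3 + ["B"] * 41 + ["C"] * 20
    print(protected_marginal(toy))
```

Rounding up trades some accuracy for privacy: the larger the base, the less a published marginal says about small groups, and the less useful it is for rare conditions.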

Anticipated Findings

We anticipate that synthetic patient data can address limitations in existing datasets and that proper anonymization algorithms can safeguard privacy without compromising data utility. These findings would contribute to the body of knowledge by showcasing the potential of synthetic data in health informatics, informing best practices for data anonymization, and facilitating precision medicine and population health research through enriched datasets.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

SQL query example for creating a cohort

Scientific Questions Being Studied

The goal is to create an example SQL query that constructs a cohort, because no documentation currently covers this. A SQL query is needed because the cohort builder tool only allows selecting people age 18 or older, whereas the questionnaire includes an option for 'adolescents'. Our studies need the age groups 12-17 and 18-25, so we are creating this example.

Project Purpose(s)

  • Educational

Scientific Approaches

The goal is to create an example SQL query that constructs a cohort, because no documentation currently covers this. A SQL query is needed because the cohort builder tool only allows selecting people age 18 or older, whereas the questionnaire includes an option for 'adolescents'. Our studies need the age groups 12-17 and 18-25, so we are creating this example.
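A minimal sketch of what such an age-band cohort query could look like is shown here, run from Python against an in-memory SQLite table so that it is self-contained. The table name `person` and the columns `person_id` and `year_of_birth` are assumptions modeled on OMOP-style naming, the rows are made up, and age is approximated as the reference year minus the birth year. The SQL dialect and schema in the actual Workbench may differ.

```python
import sqlite3

# Assumed schema: person(person_id, year_of_birth). Age is approximated as
# (reference year - year of birth), which can be off by up to one year.
COHORT_SQL = """
SELECT person_id,
       :as_of_year - year_of_birth AS approx_age,
       CASE
           WHEN :as_of_year - year_of_birth BETWEEN 12 AND 17 THEN '12-17'
           WHEN :as_of_year - year_of_birth BETWEEN 18 AND 25 THEN '18-25'
       END AS age_band
FROM person
WHERE :as_of_year - year_of_birth BETWEEN 12 AND 25
ORDER BY person_id
"""


def demo_connection():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER)"
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(1, 2010), (2, 2005), (3, 1999), (4, 1990), (5, 2015)],
    )
    return conn


def build_cohort(conn, as_of_year):
    """Return (person_id, approx_age, age_band) for people aged 12 to 25."""
    return conn.execute(COHORT_SQL, {"as_of_year": as_of_year}).fetchall()


if __name__ == "__main__":
    for row in build_cohort(demo_connection(), as_of_year=2024):
        print(row)
```

Keeping the age bands in a single CASE expression makes it easy to add or adjust bands without changing the rest of the query.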

Anticipated Findings

The goal is to create an example SQL query that constructs a cohort, because no documentation currently covers this. A SQL query is needed because the cohort builder tool only allows selecting people age 18 or older, whereas the questionnaire includes an option for 'adolescents'. Our studies need the age groups 12-17 and 18-25, so we are creating this example.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Sex at Birth
  • Gender Identity
  • Sexual Orientation
  • Geography
  • Disability Status
  • Access to Care
  • Education Level
  • Income Level

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Kammarauche Aneni - Other, Yale University
  • Kakuyon Mataeh - Project Personnel, One Fact Foundation
  • Eugeniu Plamadeala - Other, One Fact Foundation

Pharmaceutical Companies Targeting Black Communities for Profit

Scientific Questions Being Studied

The scientific question we hope to answer by using the data is whether pharmaceutical companies use biased algorithms to exploit Black communities for profit. We hope to address the cultural and societal impact of such machine learning models in health care. This proposed project will enable personalized models for health care to best treat minority populations affected by behavioral health disorders, regardless of their insurance status, serious mental illness status, or any other legally protected class.

Project Purpose(s)

  • Population Health
  • Social / Behavioral
  • Educational
  • Methods Development
  • Ethical, Legal, and Social Implications (ELSI)

Scientific Approaches

Use the electronic health records available through AIM-AHEAD, such as the OCHIN database, and All of Us to characterize the cohort of Black patients whose health records show a diagnosis of serious mental illness, such as bipolar disorder, schizophrenia, or borderline personality disorder. Conduct a value chain analysis of how psychiatric medicine is distributed to minority populations across the United States, and develop a cross-walk methodology using natural language processing to link the cohort of patients with serious mental illness to the database of hospital prices. The next step will be to conduct statistical hypothesis testing to assess the policy impacts of our analysis: quantify whether the cohort of Black patients is subject to diagnoses that require more or less medication or incur higher or lower cost, and whether these patients reside in areas with higher or lower median income, alongside analyzing other social and environmental determinants and behavioral health.
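The hypothesis-testing step could take many forms. As one hedged illustration, the sketch below runs a two-sided permutation test on the difference in mean cost between two groups. The choice of a permutation test, the function name, and the toy numbers are all assumptions for this example; the project description does not specify which test will be used.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    The +1 terms count the observed labeling itself, so p is never zero.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up cost values for two hypothetical groups; not real data.
    costs_a = [120, 135, 150, 160, 142, 138]
    costs_b = [118, 140, 149, 155, 139, 141]
    print(permutation_test(costs_a, costs_b))
```

A permutation test makes no distributional assumption about costs, which tend to be skewed, but it only addresses whether the group means differ, not why.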

Anticipated Findings

The project has the potential to expose dangerous algorithms within the pharmaceutical industry and suggest new ones that benefit minority patients. This research hopes to enable health care models that best treat minority populations subject to behavioral diagnoses, regardless of any legally protected status.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Disability Status
  • Access to Care
  • Income Level

Data Set Used

Registered Tier

Research Team

Owner:

Substance misuse predictions

Scientific Questions Being Studied

The purpose of this study is to investigate social and behavioral features that are predictive of substance misuse among young adults (18-30 years) using machine learning algorithms.

Aim 1: Identify electronic health record data features that predict substance use disorder, depression, and anxiety among young adults.
Aim 2: Determine whether the electronic health record features that predict the behavioral disorders identified above differ by racial/ethnic group, using the National Institutes of Health racial/ethnic categories.

Project Purpose(s)

  • Disease Focused Research (substance misuse)
  • Population Health
  • Social / Behavioral
  • Methods Development

Scientific Approaches

Machine learning, through the use of large-scale electronic health records data and predictive analytics, offers an innovative approach for identifying adolescents who are at risk for future substance misuse and mental disorders. Electronic health records capture a large amount of information over a patient's care journey, and for some adolescents, data from birth is available. As such, the electronic health record is a rich dataset that allows for the use of machine learning methods to identify at-risk adolescents early. The overall aim of this study is to use electronic health record data from the All of Us research database to develop a machine learning model that identifies adolescents struggling with mental health disorders and substance use disorders among those receiving care at Fair Haven.
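One simple way to check whether a predictive model behaves differently across demographic groups is to compute a ranking metric such as AUC separately per group. The sketch below does this in plain Python for illustration. The record layout `(group, label, score)`, the group names, and the scores are hypothetical, and the study may well use a different metric or toolkit.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as half).
    Returns None when a class is missing and AUC is undefined.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(records):
    """Compute AUC separately for each group in (group, label, score) records."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: auc(ys, ss) for g, (ys, ss) in sorted(grouped.items())}


if __name__ == "__main__":
    # Made-up predictions for two hypothetical groups; not real data.
    toy = [
        ("g1", 1, 0.9), ("g1", 0, 0.1), ("g1", 1, 0.8), ("g1", 0, 0.3),
        ("g2", 1, 0.4), ("g2", 0, 0.6), ("g2", 1, 0.7), ("g2", 0, 0.2),
    ]
    print(auc_by_group(toy))
```

A large gap in per-group AUC would suggest that the model ranks risk less reliably for some groups, which is one signal worth investigating before any deployment.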

Anticipated Findings

An automated method of identifying adolescents at risk has several benefits: it can allow for early identification and treatment referral, which can potentially prevent the onset of future mental health problems; it can alleviate the current care burden on providers with preventive, automated screening; it can have downstream effects on reducing wait times for outpatient mental health visits and at-capacity emergency rooms through early implementation of targeted preventive interventions; and it can reveal bias that can exist in the identification and referral for substance use or mental health treatment and thus improve health equity.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Sex at Birth
  • Gender Identity
  • Sexual Orientation
  • Geography
  • Disability Status
  • Access to Care
  • Education Level
  • Income Level

Data Set Used

Controlled Tier

Research Team

Owner:

immunology

Scientific Questions Being Studied

Compared to men, women are often diagnosed with autoimmune diseases much later; sometimes the delay in diagnosis is years. We will build methods using machine learning and mutual information to learn representations of immune system function dependent on gender, to enable the extraction of clinically significant patterns from large observational datasets in the All of Us database for the research program related to immunological disease. The first aim of this study is to build machine learning models that aggregate information across these diseases and account for genetic information and biological sex. We will use tools from semi-supervised machine learning and information theory to describe gender disparities in autoimmune disease, in addition to building predictive models of disease. The second aim of this study will be to derive new biomarkers that correlate with risk scores of immunological diseases. Risk scores will be validated using existing immunological disease risk scores.
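For readers unfamiliar with mutual information, the following minimal sketch estimates it (in nats) between two discrete variables from paired observations using plug-in frequency estimates. The variable pairing in the demo (a recorded-sex code and a bucketed diagnosis delay) and the toy values are hypothetical; the representation-learning methods described above would be considerably more involved.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in nats from a list of (x, y) observations.

    I(X; Y) = sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) * p(y)) ).
    It is zero when X and Y are independent in the sample.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Made-up observations: (sex code, delay bucket). Not real data.
    toy = [("F", "long"), ("F", "long"), ("F", "short"),
           ("M", "short"), ("M", "short"), ("M", "long")]
    print(mutual_information(toy))
```

Plug-in estimates are biased upward on small samples, so in practice such values are usually compared against a permutation baseline or corrected before being interpreted.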

Project Purpose(s)

  • Disease Focused Research (autoimmune diseases)
  • Methods Development
  • Control Set
  • Ancestry
  • Ethical, Legal, and Social Implications (ELSI)

Scientific Approaches

The representations of immune function will be learned using my previous work, ClinicalBERT (Huang, Altosaar, and Ranganath 2020), by pooling patients in All of Us across autoimmune diseases alongside controls. Such pooling will enable extracting the maximum amount of clinically significant information from gender differences in many low-prevalence diseases such as rheumatoid arthritis, type 1 diabetes, vitiligo, Graves' disease, psoriasis, inflammatory bowel disease, and others. We will build on this work by incorporating multi-modal information in All of Us into the ClinicalBERT model, for example by using recent work for learning latent associations in psychiatric data (Grotzinger et al. 2020).

Anticipated Findings

Validating our methods on patient-level prediction tasks such as immunological disease risk scores will reinforce that biological sex is an important variable for addressing the historical under-representation of women. Additional validation will inspect the learned associations between immune system function, disease state, and treatment utilization, and contribute to the knowledge of autoimmune diseases derived from machine learning approaches. Such population-level effect estimation can impact downstream tasks such as product safety assessment or studies of comparative effectiveness. For example, some clinical trials have a male bias (Fahed et al. 2020), and the methods we build might infer such bias in clinical guidelines and give policymakers a tool to remedy such guidelines. Our work could be extended to develop risk scores for immuno-metabolic markers associated with diseases such as depression or other outcome variables, as in Kappelmann et al. (2021), that account for gender bias.

Demographic Categories of Interest

  • Others

Data Set Used

Registered Tier

Research Team

Owner:
