## Ethan Young

Undergraduate Student, University of California, Los Angeles

2 active projects

Emory KG Healthcare

Scientific Questions Being Studied

We intend to study problems relating to gene data. The main goal of our project is to use machine learning and computational algorithms to find a representative subset of the data that is as accurate as the whole data set, or more accurate, while reducing computational cost. Further research on these subsets could then address future problems (e.g., drug analysis, regression).

Project Purpose(s)

- Ancestry

Scientific Approaches

The first data set we want to use contains diabetes gene data. We will use it to test whether we can find subsets that work well within the data set. We will apply different subset selection techniques, such as k-means and topological data analysis (TDA). For example, we would (1) compute embeddings of the observations, (2) find centroids using k-means, and (3) use TDA to extract important features. With these methods we hope to identify the most important features and the most representative subsets of the data. Supporting methods include SHAP, t-SNE, and UMAP. SHAP is a mathematical method for explaining the predictions of machine learning models. t-SNE is a statistical method for visualizing data by giving each data point a location in a two- or three-dimensional map. UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis.
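Steps (1) and (2) above can be illustrated with a minimal sketch. It is not the project's implementation: the points are synthetic 2-D values standing in for already-computed embeddings, the deterministic farthest-first initialization is an assumption chosen for reproducibility, and the names `farthest_first`, `kmeans`, and `representative_subset` are hypothetical. The idea shown is to keep, for each k-means cluster, the single observation closest to the centroid.

```python
from math import dist

def farthest_first(points, k):
    """Deterministic initialization (an assumption): start from the first
    point, then repeatedly add the point farthest from all chosen ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels

def representative_subset(points, k):
    """Indices of the observation nearest each centroid, one per cluster."""
    centroids, labels = kmeans(points, k)
    chosen = []
    for c, centroid in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            chosen.append(min(members, key=lambda i: dist(points[i], centroid)))
    return sorted(chosen)

# Synthetic stand-in for embedded observations: three small, separated groups.
centers = [(0, 0), (10, 10), (20, 0)]
offsets = [(-0.5, 0), (0.5, 0), (0, -0.5), (0, 0.5), (0, 0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

subset = representative_subset(points, k=3)
print(subset)  # [4, 9, 14]
```

Here 15 observations are reduced to 3, one per cluster. On real gene data the embedding step, the choice of k, and the number of representatives kept per cluster would all need to be tuned.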

Anticipated Findings

The anticipated finding is a subset of the given data set that performs as well as the full data set, if not better. This would reduce computational cost in future work and could improve results on various problems relating to the given disease.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Emory Gene Clustering Work

Scientific Questions Being Studied

We intend to study problems relating to gene data. The main goal of our project is to use machine learning and computational algorithms to find a representative subset of the data that is as accurate as the whole data set, or more accurate, while reducing computational cost. Further research on these subsets could then address future problems (e.g., drug analysis, regression).

Project Purpose(s)

- Ancestry
- Other Purpose (The purpose of using this workspace is to explore potential new research on gene clustering data, with the intent to publish)

Scientific Approaches

The first data set we want to use contains diabetes gene data. We will use it to test whether we can find subsets that work well within the data set. We will apply different subset selection techniques, such as k-means and topological data analysis (TDA). For example, we would (1) compute embeddings of the observations, (2) find centroids using k-means, and (3) use TDA to extract important features. With these methods we hope to identify the most important features and the most representative subsets of the data. Supporting methods include SHAP, t-SNE, and UMAP. SHAP is a mathematical method for explaining the predictions of machine learning models. t-SNE is a statistical method for visualizing data by giving each data point a location in a two- or three-dimensional map. UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis.

Anticipated Findings

The anticipated finding is a subset of the given data set that performs as well as the full data set, if not better. This would reduce computational cost in future work and could improve results on various problems relating to the given disease.
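One way to check whether a subset "performs as well" is to fit the same model on the full data and on the subset, then compare accuracy on held-out observations. The sketch below is only an illustration under stated assumptions: the data are synthetic, the 1-nearest-neighbor classifier is a stand-in for whatever model the project would use, and the names `predict_1nn` and `accuracy` are hypothetical.

```python
from math import dist

def predict_1nn(train, query):
    """Return the label of the training point closest to `query`."""
    _, label = min(train, key=lambda item: dist(item[0], query))
    return label

def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

# Synthetic labeled data: two classes, five observations each.
offsets = [(-0.5, 0), (0.5, 0), (0, -0.5), (0, 0.5), (0, 0)]
full = [((dx, dy), 0) for dx, dy in offsets] + \
       [((5 + dx, 5 + dy), 1) for dx, dy in offsets]

# Hypothetical representative subset: one central observation per class.
subset = [full[4], full[9]]

# Held-out observations used only for evaluation.
test = [((0.2, 0.1), 0), ((4.8, 5.3), 1), ((1, 1), 0), ((4, 4), 1)]

acc_full = accuracy(full, test)
acc_subset = accuracy(subset, test)
print(acc_full, acc_subset)  # 1.0 1.0

# Each 1-NN prediction costs one distance computation per training point,
# so the subset needs 2 per query instead of 10.
print(len(full), len(subset))  # 10 2
```

On this toy data the two-point subset matches the full set's accuracy with a fifth of the distance computations. Whether that holds on real gene data is exactly the question the project sets out to test.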

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

- Carl Yang - Early Career Tenure-track Researcher, Emory University
- Ethan Young - Undergraduate Student, University of California, Los Angeles
- Mathias Heider - Undergraduate Student, University of Delaware
