https://github.com/lu-m-dev/biostatistics-eda
Exploratory data analysis and visualization system for biostatistical research
biostatistics data-analysis data-visualization eda
- Host: GitHub
- URL: https://github.com/lu-m-dev/biostatistics-eda
- Owner: lu-m-dev
- Created: 2025-11-02T20:31:03.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-01-20T02:03:21.000Z (5 months ago)
- Last Synced: 2026-01-20T09:06:57.806Z (5 months ago)
- Topics: biostatistics, data-analysis, data-visualization, eda
- Language: Jupyter Notebook
- Homepage:
- Size: 12.2 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Exploratory Data Analysis and Visualization System for Biostatistical Research
## [`data/`](./data/)
_Please note: For the public repository, `data/` has been omitted to respect data privacy/licensing._
The [`data/`](./data/) directory contains the datasets used in the project. It includes two subdirectories:
original/
Original data files imported from outside this repository
- [`adni/`](./data/original/adni/)
- [`aric/`](./data/original/aric/)
- [`calcium/`](./data/original/calcium/)
- [`other/`](./data/original/other/)
processed/
Interim data files generated manually or by a script within this repository
- [`adni/`](./data/processed/adni/)
- [`aric/`](./data/processed/aric/)
- [`other/`](./data/processed/other/)
## [`assets/`](./assets/)
The [`assets/`](./assets/) directory contains final, presentation-ready tables, figures, and slides. It includes three subdirectories:
tables/
Demographic characteristics tables, summary statistics tables, etc.
- [`adni/`](./assets/tables/adni/)
- [`aric/`](./assets/tables/aric/)
figures/
Figures saved from the interactive Dash apps in PNG (raster) and PDF (vector) formats
- [`adni/`](./assets/figures/adni/)
- [`aric/`](./assets/figures/aric/)
slides/
Summary slides
## [`src/`](./src/)
The [`src/`](./src/) directory contains the source code for this project. It is organized into the following subdirectories:
### [`src/lib/`](./src/lib/)
This directory contains the library code for the project. The utility functions are organized into the following categories:
general.py
General utility functions
- `get_stage_list()` returns the list of stage names, where stages are stratified by the biomarkers Aβ42, amyloid PET, and tau (p-Tau or t-Tau)
stats.py
Statistical analysis
- `demographics_characteristics()` computes a summary statistics table of the study population
- `multiple_linear_regression()` fits a multiple linear regression model and returns model statistics
- `cluster_corr_df()` hierarchically clusters a correlation matrix
- `get_linkage_methods()` returns the list of available linkage methods for hierarchical clustering
- `get_cluster_criteria()` returns the list of available cluster criteria for hierarchical clustering
- `remove_diagonal()` masks the diagonal of a square matrix with NaN
- `fill_mirror()` fills a triangular matrix to a square matrix with its transpose
- `mask_outlier()` returns a mask that removes outliers when applied. Outliers are determined by [Local Outlier Factor](https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html)
- `corr_remove_outliers()` computes the outlier-removed correlation coefficient for each pair of variables; returns a correlation matrix
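The two matrix helpers listed above, `remove_diagonal()` and `fill_mirror()`, could look roughly like the following. This is a minimal NumPy sketch, not the repository's actual code: the signatures, the use of NaN to mark the empty half of a triangular matrix, and the choice to return copies are all assumptions.

```python
import numpy as np


def remove_diagonal(matrix):
    """Return a float copy of a square matrix with its diagonal set to NaN."""
    out = np.array(matrix, dtype=float)  # copy, so the caller's data is untouched
    if out.ndim != 2 or out.shape[0] != out.shape[1]:
        raise ValueError("expected a square 2-D matrix")
    np.fill_diagonal(out, np.nan)
    return out


def fill_mirror(triangular):
    """Fill the NaN entries of a triangular matrix from its transpose.

    Assumption: the empty half of the triangle is marked with NaN.
    """
    out = np.array(triangular, dtype=float)
    if out.ndim != 2 or out.shape[0] != out.shape[1]:
        raise ValueError("expected a square 2-D matrix")
    missing = np.isnan(out)
    # For each missing (i, j), copy the value stored at (j, i).
    out[missing] = out.T[missing]
    return out


# Hypothetical upper-triangular correlation matrix (values are illustrative).
upper = np.array([
    [1.0, 0.3, -0.2],
    [np.nan, 1.0, 0.5],
    [np.nan, np.nan, 1.0],
])
full = fill_mirror(upper)        # symmetric matrix
masked = remove_diagonal(full)   # self-correlations hidden, e.g. for a heatmap
```

Masking the diagonal is useful before plotting or ranking correlations, because the self-correlations of 1.0 otherwise dominate the color scale.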
dialog.py
Tkinter dialogs
- `dialog_select_directory()` prompts the user to select a directory in a dialog selection window and returns its absolute path
- `dialog_select_file()` prompts the user to select a file in a dialog selection window and returns its absolute path
plotly.py
Modifications to Plotly figure objects
- `standard_layout()` configures a standard layout for plot template, axes, and fonts
- `add_box()` draws a box plot on top of a strip plot
- `add_pairwise_comparison()` annotates pairwise comparison results on top of a strip plot or box plot
- `annotation_t_test()` computes the p-value from an independent-samples t-test
- `annotation_cohens_d()` computes Cohen's d effect size
- `annotation_tukey()` performs Tukey's post-hoc multiple-comparison test to obtain the p-value
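As an illustration of the effect size that `annotation_cohens_d()` reports, here is a minimal sketch of Cohen's d using the pooled standard deviation. The function name `cohens_d` and the pooled-variance variant are assumptions; the README does not say which variant the library implements.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Hypothetical helper for illustration; requires at least two values per group.
    """
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


print(cohens_d([2, 4, 6], [1, 3, 5]))  # prints 0.5
```

In this example both groups have a sample variance of 4, so the pooled standard deviation is 2 and the mean difference of 1 gives d = 0.5.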
r_interface.py
Interface to R
- `tukey()` runs Tukey's post-hoc multiple-comparison test in R for a single dependent variable
- `tukey_multiple_dvs()` runs Tukey's post-hoc multiple-comparison test in R sequentially for a list of dependent variables; returns a table of the resulting p-values
### [`src/processing/`](./src/processing/)
This directory contains notebooks that convert [_original files_](./data/original/) and, where needed, existing [_processed files_](./data/processed/) into new [_processed files_](./data/processed/) for downstream analyses.
bmi.ipynb
Body Mass Index (BMI)
- Input: [`VITALS_14Jul2023.csv`](./data/original/adni/VITALS_14Jul2023.csv)
- Output: [`bmi.csv`](./data/processed/adni/bmi.csv)
strem2.ipynb
CSF soluble triggering receptor expressed on myeloid cells 2 (sTREM2)
- Input: [`ADNI_HAASS_WASHU_LAB_13Jul2023.csv`](./data/original/adni/ADNI_HAASS_WASHU_LAB_13Jul2023.csv)
- Output: [`strem2.csv`](./data/processed/adni/strem2.csv)
demographics.ipynb
Basic demographics
- Input: [`ADNIMERGE_14Jul2023.csv`](./data/original/adni/ADNIMERGE_14Jul2023.csv), [`bmi.csv`](./data/processed/adni/bmi.csv)
- Output: [`demographics.csv`](./data/processed/adni/demographics.csv)
demographics_tau.ipynb
Demographics with tau biomarker data
- Input: [`ADNIMERGE_14Jul2023.csv`](./data/original/adni/ADNIMERGE_14Jul2023.csv), [`bmi.csv`](./data/processed/adni/bmi.csv)
- Output: [`demographics_tau.csv`](./data/processed/adni/demographics_tau.csv)
demographics_biomarkers.ipynb
Demographics with amyloid and tau biomarker data and stage assignment
- Input: [`ADNIMERGE_14Jul2023.csv`](./data/original/adni/ADNIMERGE_14Jul2023.csv), [`bmi.csv`](./data/processed/adni/bmi.csv), [`strem2.csv`](./data/processed/adni/strem2.csv)
- Output: [`demographics_biomarkers.csv`](./data/processed/adni/demographics_biomarkers.csv)
lipidomics.ipynb
Plasma lipidomics, Meikle lab, longitudinal
- Input: [`ADMCLIPIDOMICSMEIKLELABLONG_13Jul2023.csv`](./data/original/adni/ADMCLIPIDOMICSMEIKLELABLONG_13Jul2023.csv), [`Lipid_Models_Final.xlsx`](./data/original/adni/Lipid_Models_Final.xlsx)
- Output: [`lipidomics.csv`](./data/processed/adni/lipidomics.csv), [`lipidomics_total.csv`](./data/processed/adni/lipidomics_total.csv), [`lipidomics_dictionary.csv`](./data/processed/adni/lipidomics_dictionary.csv)
lipoprotein.ipynb
Nightingale NMR analysis of lipoproteins and metabolites
- Input: [`ADNINIGHTINGALELONG_05_24_21_27Jul2023.csv`](./data/original/adni/ADNINIGHTINGALELONG_05_24_21_27Jul2023.csv)
- Output: [`lipoprotein.csv`](./data/processed/adni/lipoprotein.csv), [`lipoprotein_dict.csv`](./data/processed/adni/lipoprotein_dict.csv)
somascan.ipynb
CSF proteomics SOMAscan 7000+ proteins post-QC, Cruchaga lab
- Input: [`CruchagaLab_CSF_SOMAscan7k_Protein_matrix_postQC_20230620.csv`](./data/original/adni/CruchagaLab_CSF_SOMAscan7k_Protein_matrix_postQC_20230620.csv), [`ADNI_Cruchaga_lab_CSF_SOMAscan7k_analyte_information_20_06_2023.csv`](./data/original/adni/ADNI_Cruchaga_lab_CSF_SOMAscan7k_analyte_information_20_06_2023.csv)
- Output: [`somascan.csv`](./data/processed/adni/somascan.csv), [`somascan_dict.csv`](./data/processed/adni/somascan_dict.csv)
converters.ipynb
Longitudinal decline in cognitive status (CN to MCI, MCI to AD, or CN to AD), excluding participants diagnosed with AD at baseline
- Input: [`ADNIMERGE_14Jul2023.csv`](./data/original/adni/ADNIMERGE_14Jul2023.csv)
- Output: [`converters.csv`](./data/processed/adni/converters.csv)
converters_to_ad.ipynb
Longitudinal decline in cognitive status from CN or MCI to AD, excluding participants diagnosed with AD at baseline
- Input: [`ADNIMERGE_14Jul2023.csv`](./data/original/adni/ADNIMERGE_14Jul2023.csv)
- Output: [`converters_to_ad.csv`](./data/processed/adni/converters_to_ad.csv)
pilot.ipynb
Demographics and brain MRI data from the ARIC server for participants included in the pilot study
- Input from ARIC server: `ARIC_NP/DATA_NP/Visits/Visit 5/derive54_np.sas7bdat`, `DATA_NP/Visits/Visit 5/derive_ncs51_np.sas7bdat`, `DATA_NP/Visits/Visit 1/derive13_np.sas7bdat`
- Input: [`all_eleigible_samples_AS2021_25v3.xlsx`](./data/original/aric/sample_selection/all_eleigible_samples_AS2021_25v3.xlsx), [`lipoproteins_6_29_23.csv`](./data/original/aric/lipoproteins_6_29_23.csv), [`dictionary.csv`](./data/processed/aric/dictionary.csv)
- Output: [`lipoprotein_list.csv`](./data/processed/aric/lipoprotein_list.csv), [`pilot.csv`](./data/processed/aric/pilot.csv), [`demographic_characteristics.csv`](./downloads/tables/aric/demographic_characteristics.csv)
pilot_eligible.ipynb
Demographics and brain MRI data for ARIC participants eligible under the inclusion criteria
- Input from ARIC server: `ARIC_NP/DATA_NP/Visits/Visit 5/derive54_np.sas7bdat`, `DATA_NP/Visits/Visit 5/derive_ncs51_np.sas7bdat`, `DATA_NP/Visits/Visit 1/derive13_np.sas7bdat`, `DATA_NP/Visits/MultiVisit/V5_V11 Longitudinal MRI data/v5_v11_mri_derv_np_240221.sas7bdat`
- Input: [`all_eleigible_samples_AS2021_25v3.xlsx`](./data/original/aric/sample_selection/all_eleigible_samples_AS2021_25v3.xlsx), [`ARIC_Pilot_Updated_06032022.csv`](./data/original/aric/ARIC_Pilot_Updated_06032022.csv), [`lipoproteins_6_29_23.csv`](./data/original/aric/lipoproteins_6_29_23.csv), [`dictionary.csv`](./data/processed/aric/dictionary.csv)
- Output: [`lipoprotein_list.csv`](./data/processed/aric/lipoprotein_list.csv), [`pilot.csv`](./data/processed/aric/pilot.csv), [`demographic_characteristics.csv`](./downloads/tables/aric/demographic_characteristics.csv)
davidson.ipynb
Sean Davidson HDL Proteome Watch 2023
- Input: [`HDL Proteome Watch 2023 Final.xlsx`](./data/original/other/HDL%20Proteome%20Watch%202023%20Final.xlsx)
- Output: [`hdl_proteome_davidson.csv`](./data/processed/other/hdl_proteome_davidson.csv)
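The conversion rule described for `converters.ipynb` (CN to MCI, MCI to AD, or CN to AD, with participants who have AD at baseline excluded) can be sketched as a small pure-Python function. Everything here is an assumption for illustration: the function name, the diagnosis labels, and the choice to classify by the most severe diagnosis reached during follow-up. The notebook itself may define conversion differently.

```python
# Hypothetical severity ordering of the cognitive statuses named in this README.
STAGE_ORDER = {"CN": 0, "MCI": 1, "AD": 2}


def conversion_label(baseline, followups):
    """Return 'CN to MCI', 'MCI to AD', 'CN to AD', or None.

    baseline  -- diagnosis at the baseline visit
    followups -- diagnoses at later visits (None entries are skipped)

    None is returned for participants with AD at baseline (excluded) and for
    participants who never reach a more severe status than their baseline.
    """
    if baseline not in STAGE_ORDER:
        raise ValueError(f"unknown diagnosis: {baseline!r}")
    if baseline == "AD":
        return None  # excluded: already AD at baseline
    worst = baseline
    for dx in followups:
        if dx is None:
            continue  # missing diagnosis at this visit
        if STAGE_ORDER[dx] > STAGE_ORDER[worst]:
            worst = dx
    if worst == baseline:
        return None  # no decline observed
    return f"{baseline} to {worst}"


print(conversion_label("CN", ["CN", "MCI", "AD"]))  # prints CN to AD
print(conversion_label("MCI", ["CN", "MCI"]))       # prints None
```

Under the same assumptions, `converters_to_ad.ipynb` would correspond to keeping only the labels that end in AD.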
### [`src/analysis/`](./src/analysis/)
This directory contains Jupyter notebook files that perform analyses.
- [`lipidomics_tukey.ipynb`](./src/analysis/adni/lipidomics_tukey.ipynb) ANCOVA followed by Tukey post-hoc to determine which plasma lipids or biomarkers differ significantly between stages
- [`lipidomics_boxplot.ipynb`](./src/analysis/adni/lipidomics_boxplot.ipynb) Distribution of plasma lipids or biomarkers across stages
- [`survival.ipynb`](./src/analysis/adni/survival.ipynb) Survival analysis (Kaplan-Meier survival curves, Cox proportional hazards model) comparing the risk of conversion to AD between biomarker groups.
- [`survival_hdl_ratio.ipynb`](./src/analysis/adni/survival_hdl_ratio.ipynb) Survival analysis comparing cognitive decline between tertiles of non-small HDL FC-to-CE ratio.
- [`somascan_pca.ipynb`](./src/analysis/adni/somascan_pca.ipynb) Clustering of CSF proteins by PCA, followed by linear regression with dependent variable pTau
- [`somascan_boxplot.ipynb`](./src/analysis/adni/somascan_boxplot.ipynb) Distribution of CSF proteins across cognitive statuses
- [`strem2_lipidomics_regression.ipynb`](./src/analysis/adni/strem2_lipidomics_regression.ipynb) Linear regression of CSF sTREM2 on plasma lipids.
- [`strem2_lipoprotein_regression.ipynb`](./src/analysis/adni/strem2_lipoprotein_regression.ipynb) Linear regression of CSF sTREM2 on plasma lipoprotein subclasses.
- [`calcium_all_sites.ipynb`](./src/analysis/calcium/calcium_all_sites.ipynb) Distribution of calcium measurements compared between Vista and Roche, data from all sites combined
- [`imagej_particle_results_hdl.ipynb`](./src/processing/other/imagej_particle_results_hdl.ipynb) HDL1 and HDL2 particle analysis on EM images using results exported from ImageJ
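For readers unfamiliar with the Kaplan-Meier curves mentioned for the survival notebooks, the product-limit estimator can be written in a few lines of plain Python. This is a sketch for illustration only, with a hypothetical function name and made-up data. The README does not state which survival library the notebooks use, and a dedicated package would normally be preferred in practice.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier product-limit estimate of the survival function.

    durations -- follow-up time for each participant
    events    -- truthy if the event (e.g. conversion to AD) was observed,
                 falsy if the participant was censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e)
        n_leaving = sum(1 for d in durations if d == t)  # events + censored at t
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Made-up data: five participants, two of them censored (event = 0).
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print([(t, round(s, 3)) for t, s in curve])  # prints [(1, 0.8), (2, 0.6), (3, 0.3)]
```

Censored participants reduce the number at risk at later times without lowering the survival estimate, which is why the curve only steps down at observed event times.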
## Publications
The analysis in this repository contributed to the following publications:
- Li, D.; Mantyh, W. G.; Men, L.; Jain, I.; Glittenberg, M.; An, B.; Zhang, L.; Li, L.; for the Alzheimer’s Disease Neuroimaging Initiative. sTREM2 in Discordant CSF Aβ42 and P‐tau181. _Alz & Dem Diag Ass & Dis Mo_ **2025**, _17_ (1), e70072. https://doi.org/10.1002/dad2.70072.
- Li, D.; An, B.; Men, L.; Glittenberg, M.; Lutsey, P. L.; Mielke, M. M.; Yu, F.; Hoogeveen, R. C.; Gottesman, R.; Zhang, L.; Meyer, M.; Sullivan, K.; Zantek, N.; Alonso, A.; Walker, K. A. The Association of High-Density Lipoprotein Cargo Proteins with Brain Volume in Older Adults in the Atherosclerosis Risk in Communities (ARIC). _Journal of Alzheimer’s Disease_ **2025**, _103_ (3), 724–734. https://doi.org/10.1177/13872877241305806.
## Data Sources
- [Alzheimer's Disease Neuroimaging Initiative (ADNI)](https://adni.loni.usc.edu/)
- [The Atherosclerosis Risk in Communities Study (ARIC)](https://aric.cscc.unc.edu/aric9/)