https://github.com/johnsesana/statistics-for-ds

General guide on topics to study for Data Science and Machine Learning

![image](https://github.com/user-attachments/assets/1483c2ec-19e9-40c5-85c5-f09932815fd6)

# Statistics for Data Science

This repo contains a list of essential statistics topics to study for data science and machine learning.
- Each topic will have a Jupyter notebook with explanations and code examples (work in progress; topics 1 to 4 are linked so far).

## 1. [Descriptive Statistics](https://github.com/JohnSesana/Statistics-for-DS/blob/main/01-Descriptive-Statistics.ipynb)

- **Measures of Central Tendency**: Mean, Median, Mode
- **Measures of Dispersion**: Variance, Standard Deviation, Range, Interquartile Range (IQR)
- **Skewness and Kurtosis**: Understanding the shape of data distributions
- **Percentiles and Quartiles**: Breaking data into sections
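
As a rough illustration of the measures listed above, here is a minimal sketch using only Python's standard `statistics` module. The sample values are hypothetical, and the `"inclusive"` quartile method is just one of several conventions.

```python
import statistics

# Hypothetical sample, chosen only for illustration
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency
mean = statistics.mean(data)        # arithmetic average
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequent value

# Measures of dispersion
variance = statistics.variance(data)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(data)       # sample standard deviation
value_range = max(data) - min(data)

# Quartile cut points; "inclusive" treats the data as the whole population
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, variance, std_dev, value_range, iqr)
```

Other quartile conventions (for example the default `"exclusive"` method) can give slightly different cut points on small samples.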

## 2. [Probability Theory](https://github.com/JohnSesana/Statistics-for-DS/blob/main/02-Probability-Theory.ipynb)

- **Probability Distributions**: Uniform, Normal (Gaussian), Binomial, Poisson, Exponential, etc.
- **Conditional Probability**: Bayes’ Theorem, Independence, and Conditional Independence
- **Law of Large Numbers and Central Limit Theorem (CLT)**
- **Combinatorics**: Permutations and Combinations
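
Bayes' theorem and the counting rules above can be sketched in a few lines. The function name `posterior` and the screening-test numbers are assumptions made up for this example.

```python
import math

def posterior(prior, sensitivity, false_positive_rate):
    """Return P(condition | positive test) using Bayes' theorem."""
    # Law of total probability: P(positive) summed over both condition states
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical screening test: 1% prevalence, 95% sensitivity,
# 5% false positive rate
p_condition_given_positive = posterior(0.01, 0.95, 0.05)
print(f"P(condition | positive) = {p_condition_given_positive:.3f}")

# Combinatorics: ordered vs. unordered selections of 2 items from 5
n_permutations = math.perm(5, 2)   # order matters
n_combinations = math.comb(5, 2)   # order does not matter
print(n_permutations, n_combinations)
```

Even with a fairly accurate test, the low prior keeps the posterior well below 50%, which is the usual point of this kind of example.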

## 3. [Statistical Inference](https://github.com/JohnSesana/Statistics-for-DS/blob/main/03-Statistical-Inference.ipynb)

- **Hypothesis Testing**: Null and Alternative Hypotheses, p-value, Significance Levels (α)
- **Confidence Intervals**: Estimating population parameters
- **Z-test, t-test, ANOVA, Chi-Square Test**: Parametric tests for means, plus the chi-square test for categorical data
- **Type I and Type II Errors**: False positives and false negatives
- **Power of a Test**: Understanding how likely a test is to detect an effect
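
To make the p-value and confidence interval ideas concrete, here is a minimal sketch of a two-sided one-sample z-test. It assumes the population standard deviation `sigma` is known, which is rarely true in practice (a t-test is the usual alternative). The function name and the data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def z_test_and_ci(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test plus a (1 - alpha) confidence interval.

    Assumes the population standard deviation `sigma` is known.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    se = sigma / math.sqrt(n)                         # standard error of the mean
    z = (mean - mu0) / se                             # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided p-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # critical value
    ci = (mean - z_crit * se, mean + z_crit * se)
    return z, p_value, ci

# Hypothetical measurements; H0: population mean equals 5.0
sample = [5.1, 4.9, 5.3, 5.2, 5.0, 5.4, 5.1, 5.2, 5.0]
z, p_value, ci = z_test_and_ci(sample, mu0=5.0, sigma=0.2)
print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Rejecting the null hypothesis at level α is equivalent to the hypothesized mean falling outside the (1 - α) confidence interval.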

## 4. [Regression Analysis](https://github.com/JohnSesana/Statistics-for-DS/blob/main/04-Regression-Analysis.ipynb)

- **Linear Regression**: Simple and Multiple Regression, Assumptions, Interpretation of Coefficients
- **Logistic Regression**: Binary classification, Log-Odds, Interpretation
- **Polynomial Regression**: Fitting non-linear data with polynomial terms
- **Ridge and Lasso Regression**: Regularization techniques to prevent overfitting
- **Bias-Variance Tradeoff**: Understanding the balance between model complexity and prediction accuracy
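
For the simple (one-predictor) case, the least-squares coefficients have a closed form. The sketch below covers only that case and uses made-up data that lies exactly on a line. In practice a library such as statsmodels or scikit-learn would normally be used.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data generated from y = 1 + 2x with no noise
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
intercept, slope = fit_simple_linear_regression(x, y)
print(intercept, slope)
```

The slope is read as the expected change in `y` for a one-unit increase in `x`.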

## 5. Experimental Design

- **A/B Testing**: Randomized controlled experiments, hypothesis testing
- **Randomization and Blocking**: Techniques for reducing bias in experiments
- **Design of Experiments (DOE)**: Factorial design, fractional factorial design, Latin squares

## 6. Bayesian Statistics

- **Bayesian Inference**: Prior, Likelihood, Posterior, and Updating Beliefs
- **Markov Chain Monte Carlo (MCMC)**: Techniques for Bayesian estimation
- **Hierarchical Bayesian Models**: Modeling complex data structures
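
The prior-to-posterior update is easiest to see in a conjugate case. This sketch assumes a Beta prior on a success probability and binomial data, where the update has a closed form. The helper name `update_beta` and the counts are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    With a Beta(alpha, beta) prior, the posterior is
    Beta(alpha + successes, beta + failures).
    """
    return alpha + successes, beta + failures

# Uniform prior Beta(1, 1), then 7 successes and 3 failures (made-up counts)
post_alpha, post_beta = update_beta(1, 1, successes=7, failures=3)
posterior_mean = post_alpha / (post_alpha + post_beta)
print(post_alpha, post_beta, round(posterior_mean, 3))
```

Models without a conjugate form are where sampling methods such as MCMC come in.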

## 7. Time Series Analysis

- **Stationarity**: Understanding trends, seasonality, and how to make data stationary
- **Autocorrelation and Partial Autocorrelation**
- **ARIMA Models**: Autoregressive Integrated Moving Average
- **Exponential Smoothing**: Simple, Holt’s, and Holt-Winters models
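
Simple exponential smoothing, the first of the smoothing models above, is a one-line recurrence. This is a minimal sketch that initializes the smoothed series with the first observation, which is one common convention among several.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    The first smoothed value is set to the first observation (an assumption).
    """
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical series with an upward drift
smoothed = simple_exponential_smoothing([10, 12, 14], alpha=0.5)
print(smoothed)
```

Larger `alpha` values track recent observations more closely, while smaller values give a smoother curve.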

## 8. Multivariate Statistics

- **Principal Component Analysis (PCA)**: Dimensionality reduction techniques
- **Factor Analysis**: Identifying latent variables
- **Clustering**: K-Means, Hierarchical Clustering, DBSCAN
- **Discriminant Analysis**: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA)

## 9. Probability Distributions

- **Discrete Distributions**: Bernoulli, Binomial, Poisson, Geometric
- **Continuous Distributions**: Normal, Exponential, Gamma, Beta
- **Multivariate Distributions**: Multivariate Normal Distribution, Covariance, and Correlation

## 10. Resampling Methods

- **Bootstrap**: Estimating the sampling distribution of a statistic by resampling with replacement
- **Cross-Validation**: Techniques like k-fold cross-validation for model validation
- **Jackknife**: Estimating bias and variance in small samples
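
Resampling with replacement can be sketched with the standard library alone. The example below bootstraps the sample mean and reads off a percentile interval. The function names, the data, and the choice of 2000 resamples are illustrative assumptions.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement, each the size of the data."""
    rng = random.Random(seed)   # seeded for reproducibility
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.95):
    """Simple percentile interval from sorted bootstrap statistics."""
    ordered = sorted(values)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]

# Hypothetical sample
data = [12, 15, 9, 20, 14, 11, 18, 16, 13, 17]
boot = bootstrap_means(data)
ci_low, ci_high = percentile_interval(boot)
print(f"mean = {statistics.fmean(data):.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The percentile interval is the simplest bootstrap interval. Refinements such as BCa exist but are not shown here.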

## 11. Advanced Machine Learning Metrics

- **Confusion Matrix**: True Positives, True Negatives, False Positives, False Negatives
- **Precision, Recall, F1-Score**: Evaluating classification models
- **ROC Curves and AUC**: Receiver Operating Characteristic curve for evaluating classifiers
- **Lift and Gain Charts**: Performance of models in marketing analytics
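
The confusion-matrix counts and the metrics built on them can be computed directly. This is a minimal sketch for binary labels coded as 0 and 1, with made-up predictions. Returning 0.0 for an undefined ratio is a convention chosen here, not a universal rule.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts plus precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Undefined ratios (zero denominator) are reported as 0.0 by convention here
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```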

## 12. Survival Analysis

- **Censoring and Survival Functions**: Understanding time-to-event data
- **Kaplan-Meier Estimator**
- **Cox Proportional Hazards Model**
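
The Kaplan-Meier estimator multiplies, at each event time, the fraction of at-risk subjects who survive. Here is a minimal sketch with hypothetical durations, where `observed` is `False` for right-censored subjects. A dedicated survival library would normally be used for real analyses.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    `observed[i]` is True if the event occurred at durations[i],
    False if the subject was right-censored at that time.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        censored = sum(1 for d, o in zip(durations, observed) if d == t and not o)
        if events:
            survival *= 1 - events / at_risk   # product-limit step
            curve.append((t, survival))
        # Everyone who had an event or was censored at t leaves the risk set
        at_risk -= events + censored
    return curve

# Hypothetical data: five subjects, two of them censored
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
print(curve)
```

Censored subjects never trigger a drop in the curve, but they do shrink the risk set for later event times.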

## 13. Non-Parametric Methods

- **Mann-Whitney U Test**
- **Kruskal-Wallis Test**
- **Spearman’s Rank Correlation**
- **Wilcoxon Signed-Rank Test**

## 14. Spatial Statistics

- **Spatial Autocorrelation**: Moran’s I, Geary’s C
- **Kriging**: Geostatistical interpolation
- **Point Pattern Analysis**: Analyzing spatial point data

## 15. Missing Data Handling

- **Imputation Techniques**: Mean/Median Imputation, KNN Imputation, Multiple Imputation
- **Dealing with Missingness**: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR)

## 16. Causal Inference

- **Correlation vs. Causation**
- **Causal Diagrams (DAGs)**: Direct and Indirect Effects
- **Instrumental Variables**: Estimating causal effects in the presence of unmeasured confounding
- **Difference-in-Differences (DiD)**: Comparing treatment and control groups over time
- **Propensity Score Matching**: Reducing selection bias
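
In its simplest two-group, two-period form, difference-in-differences is plain arithmetic on four group means. The sketch below uses made-up outcomes and assumes the parallel-trends condition holds, which the code itself cannot check.

```python
from statistics import fmean

def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """2x2 DiD estimate: change in the treated group minus change in the control group.

    Assumes both groups would have followed parallel trends without treatment.
    """
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change

# Hypothetical outcomes before and after an intervention
did_estimate = difference_in_differences(
    treated_pre=[10, 12], treated_post=[15, 17],
    control_pre=[9, 11], control_post=[11, 13],
)
print(did_estimate)
```

Subtracting the control group's change removes the time trend shared by both groups, leaving the part attributed to the treatment.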

## 17. Generalized Linear Models (GLM)

- **Log-Linear Models**: Extension of linear models for count data
- **Poisson and Negative Binomial Regression**: For modeling count data
- **Probit and Tobit Models**: Probit for binary (or ordered) outcomes, Tobit for censored outcomes

## 18. Statistical Programming

- **Python/R for Statistics**: Libraries like numpy, pandas, statsmodels, scikit-learn, and R packages like ggplot2, dplyr, caret, and glmnet
- **Simulation and Monte Carlo Methods**: Simulating data and probabilistic models
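
As a small taste of Monte Carlo simulation, the classic example is estimating π from random points in the unit square. This sketch uses only the standard library. The function name, sample size, and seed are arbitrary choices.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi from the share of random points inside the quarter unit circle."""
    rng = random.Random(seed)   # seeded for reproducibility
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area is pi / 4, so scale the hit rate by 4
    return 4 * inside / n_samples

pi_estimate = estimate_pi(100_000)
print(pi_estimate)
```

The error shrinks roughly with the square root of the number of samples, so each extra digit of accuracy costs about 100 times more draws.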