https://github.com/johnsesana/statistics-for-ds
General guide on topics to study for Data Science and Machine Learning
- Host: GitHub
- URL: https://github.com/johnsesana/statistics-for-ds
- Owner: JohnSesana
- Created: 2024-09-08T20:39:22.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-10-13T05:48:52.000Z (4 months ago)
- Last Synced: 2024-11-16T23:07:30.270Z (2 months ago)
- Topics: data-science, machine-learning, python, statistics
- Language: Jupyter Notebook
- Homepage:
- Size: 1020 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
![image](https://github.com/user-attachments/assets/1483c2ec-19e9-40c5-85c5-f09932815fd6)
# Statistics for Data Science
This repo contains a list of essential statistics topics to study for data science and machine learning.
- Every topic has a Jupyter notebook with explanations and code examples (work in progress).

## 1. [Descriptive Statistics](https://github.com/JohnSesana/Statistics-for-DS/blob/main/01-Descriptive-Statistics.ipynb)
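As a quick orientation for the descriptive statistics topics, here is a minimal sketch using only Python's standard `statistics` module. The sample values are made up for illustration and are not taken from the notebook.

```python
import statistics

# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of dispersion (population versions).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)
value_range = max(data) - min(data)

# Quartiles; the exact cut points depend on the interpolation method used.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(mean, median, mode, variance, std_dev, value_range, iqr)
```

Use `statistics.variance` and `statistics.stdev` instead when the data is a sample and the unbiased (n - 1) estimators are wanted.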
- **Measures of Central Tendency**: Mean, Median, Mode
- **Measures of Dispersion**: Variance, Standard Deviation, Range, Interquartile Range (IQR)
- **Skewness and Kurtosis**: Understanding the shape of data distributions
- **Percentiles and Quartiles**: Breaking data into sections

## 2. [Probability Theory](https://github.com/JohnSesana/Statistics-for-DS/blob/main/02-Probability-Theory.ipynb)
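To make the conditional probability topic concrete, the sketch below applies Bayes' theorem to a diagnostic-test scenario. The function name `posterior` and all the numbers are hypothetical and chosen only for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(condition | positive test) using Bayes' theorem.

    prior               -- P(condition)
    sensitivity         -- P(positive | condition)
    false_positive_rate -- P(positive | no condition)
    """
    # Law of total probability: overall chance of a positive test.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical inputs: 1% prevalence, 95% sensitivity, 5% false positives.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low because the condition is rare, which is the usual motivation for studying Bayes' theorem.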
- **Probability Distributions**: Uniform, Normal (Gaussian), Binomial, Poisson, Exponential, etc.
- **Conditional Probability**: Bayes’ Theorem, Independence, and Conditional Independence
- **Law of Large Numbers and Central Limit Theorem (CLT)**
- **Combinatorics**: Permutations and Combinations

## 3. [Statistical Inference](https://github.com/JohnSesana/Statistics-for-DS/blob/main/03-Statistical-Inference.ipynb)
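A hypothesis test can be sketched in a few lines. The example below is a two-sided one-sample z-test, assuming the population standard deviation is known. The helper name `one_sample_z_test` and the data are hypothetical.

```python
import math
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    standard_error = sigma / math.sqrt(len(sample))
    z = (mean(sample) - mu0) / standard_error
    # p-value: probability of a statistic at least this extreme under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical measurements; H0 says the true mean is 5.0.
measurements = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.5, 5.3]
z, p_value = one_sample_z_test(measurements, mu0=5.0, sigma=0.3)

alpha = 0.05  # significance level
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

When sigma is unknown and estimated from the sample, a t-test is the usual replacement. This sketch does not cover that case.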
- **Hypothesis Testing**: Null and Alternative Hypotheses, p-value, Significance Levels (α)
- **Confidence Intervals**: Estimating population parameters
- **Z-test, t-test, Chi-Square Test, ANOVA**: Parametric and Non-parametric tests
- **Type I and Type II Errors**: False positives and false negatives
- **Power of a Test**: Understanding how likely a test is to detect an effect

## 4. [Regression Analysis](https://github.com/JohnSesana/Statistics-for-DS/blob/main/04-Regression-Analysis.ipynb)
- **Linear Regression**: Simple and Multiple Regression, Assumptions, Interpretation of Coefficients
- **Logistic Regression**: Binary classification, Log-Odds, Interpretation
- **Polynomial Regression**: Fitting non-linear data with polynomial terms
- **Ridge and Lasso Regression**: Regularization techniques to prevent overfitting
- **Bias-Variance Tradeoff**: Understanding the balance between model complexity and prediction accuracy

## 5. Experimental Design
- **A/B Testing**: Randomized controlled experiments, hypothesis testing
- **Randomization and Blocking**: Techniques for reducing bias in experiments
- **Design of Experiments (DOE)**: Factorial design, fractional factorial design, Latin squares

## 6. Bayesian Statistics
- **Bayesian Inference**: Prior, Likelihood, Posterior, and Updating Beliefs
- **Markov Chain Monte Carlo (MCMC)**: Techniques for Bayesian estimation
- **Hierarchical Bayesian Models**: Modeling complex data structures

## 7. Time Series Analysis
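Of the time series topics, simple exponential smoothing is the easiest to sketch. The version below assumes the smoothed series is initialised with the first observation, which is one common convention among several. The function name and data are hypothetical.

```python
def simple_exponential_smoothing(series, alpha):
    """Return s_t = alpha * x_t + (1 - alpha) * s_{t-1} for each t."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    # Assumption: start the recursion at the first observation.
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical observations, for illustration only.
observations = [10.0, 12.0, 11.0, 15.0]
smoothed = simple_exponential_smoothing(observations, alpha=0.5)
print(smoothed)  # [10.0, 11.0, 11.0, 13.0]
```

A larger `alpha` tracks recent observations more closely, while a smaller one gives a smoother series.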
- **Stationarity**: Understanding trends, seasonality, and how to make data stationary
- **Autocorrelation and Partial Autocorrelation**
- **ARIMA Models**: Autoregressive Integrated Moving Average
- **Exponential Smoothing**: Simple, Holt’s, and Holt-Winters models

## 8. Multivariate Statistics
- **Principal Component Analysis (PCA)**: Dimensionality reduction techniques
- **Factor Analysis**: Identifying latent variables
- **Clustering**: K-Means, Hierarchical Clustering, DBSCAN
- **Discriminant Analysis**: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA)

## 9. Probability Distributions
- **Discrete Distributions**: Bernoulli, Binomial, Poisson, Geometric
- **Continuous Distributions**: Normal, Exponential, Gamma, Beta
- **Multivariate Distributions**: Multivariate Normal Distribution, Covariance, and Correlation

## 10. Resampling Methods
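The bootstrap is simple enough to sketch with the standard library. The example below builds a percentile confidence interval for the mean by resampling with replacement. The helper `bootstrap_ci`, its defaults, and the sample are assumptions made for illustration, and the percentile method is only one of several bootstrap interval constructions.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` (illustrative helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return estimates[lower_idx], estimates[upper_idx]

# Hypothetical sample, for illustration only.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.3, 4.0]
low, high = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

Passing `stat=statistics.median` gives an interval for the median with no other changes.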
- **Bootstrap**: Estimating the sampling distribution of a statistic by resampling with replacement
- **Cross-Validation**: Techniques like k-fold cross-validation for model validation
- **Jackknife**: Estimating bias and variance in small samples

## 11. Advanced Machine Learning Metrics
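The classification metrics in this section all derive from the four confusion matrix counts. A minimal sketch follows. The function name and counts are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not a universal rule.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion matrix counts."""
    # Assumption: report 0.0 when a ratio is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts precision is 0.8 and recall is about 0.667, so F1, their harmonic mean, lands between the two at about 0.727.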
- **Confusion Matrix**: True Positives, True Negatives, False Positives, False Negatives
- **Precision, Recall, F1-Score**: Evaluating classification models
- **ROC Curves and AUC**: Receiver Operating Characteristic curve for evaluating classifiers
- **Lift and Gain Charts**: Performance of models in marketing analytics

## 12. Survival Analysis
- **Censoring and Survival Functions**: Understanding time-to-event data
- **Kaplan-Meier Estimator**
- **Cox Proportional Hazards Model**

## 13. Non-Parametric Methods
- **Mann-Whitney U Test**
- **Kruskal-Wallis Test**
- **Spearman’s Rank Correlation**
- **Wilcoxon Signed-Rank Test**

## 14. Spatial Statistics
- **Spatial Autocorrelation**: Moran’s I, Geary’s C
- **Kriging**: Geostatistical interpolation
- **Point Pattern Analysis**: Analyzing spatial point data

## 15. Missing Data Handling
- **Imputation Techniques**: Mean/Median Imputation, KNN Imputation, Multiple Imputation
- **Dealing with Missingness**: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR)

## 16. Causal Inference
- **Correlation vs. Causation**
- **Causal Diagrams (DAGs)**: Direct and Indirect Effects
- **Instrumental Variables**: Handling unobserved confounding variables
- **Difference-in-Differences (DiD)**: Comparing treatment and control groups over time
- **Propensity Score Matching**: Reducing selection bias

## 17. Generalized Linear Models (GLM)
- **Log-Linear Models**: Extension of linear models for count data
- **Poisson and Negative Binomial Regression**: For modeling count data
- **Probit and Tobit Models**: Probit for binary or ordinal outcomes, Tobit for censored outcomes

## 18. Statistical Programming
- **Python/R for Statistics**: Libraries like numpy, pandas, statsmodels, scikit-learn, and R packages like ggplot2, dplyr, caret, and glmnet
- **Simulation and Monte Carlo Methods**: Simulating data and probabilistic models
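
As a small illustration of Monte Carlo simulation, the sketch below estimates π by sampling random points in the unit square and counting how many fall inside the quarter circle. It uses only the standard library. The function name `estimate_pi` and the sample size are arbitrary choices made for this example.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Monte Carlo estimate of pi from uniform points in the unit square."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        # The point lies inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi / 4 of the unit square.
    return 4 * inside / n_samples

pi_hat = estimate_pi(100_000)
print(f"Estimated pi: {pi_hat:.4f}")
```

The error shrinks roughly with the square root of the number of samples, so ten times more accuracy needs about a hundred times more draws.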