https://github.com/predicate-logic/awesome-data-science

Links and references for tools

# Awesome Data Science
Links and references useful for data science.

## General / Introduction
* [A Visual Introduction To Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
* One of the slickest visual introductions to machine learning ever produced.
* [Seeing Theory](https://seeing-theory.brown.edu/#firstPage)
* Great D3 visualization tutorial on the basics of statistics.

## Libraries (Python)
* [Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)
* The current standard for data frames in Python.
* [Yellowbrick](http://www.scikit-yb.org/en/latest/index.html)
* "Visualizers" that allow human steering of the model selection process.
* [Boruta](https://github.com/scikit-learn-contrib/boruta_py)
* Finds a maximal subset of information-carrying features in a data set.
* [Featuretools](https://github.com/Featuretools/featuretools)
* Automated feature engineering library.
* [Surprise](http://surpriselib.com/)
* Build and test recommendation systems with a variety of prebuilt algorithms.
* [Snorkel](https://hazyresearch.github.io/snorkel/)
* Library to extract data from structured data or "dark data". Includes references to workflows/tools to augment the labeling process.
* [SpaCy](https://spacy.io/_)
* Fast natural language processing (NLP) library.
* [Voluptuous](https://github.com/alecthomas/voluptuous)
* Data validation library.
* [Lime](https://github.com/marcotcr/lime)
* Explain classifiers.
* [FeatureTools](https://docs.featuretools.com/index.html)
* Automated feature engineering.
* [Snorkel](https://www.snorkel.org/)
* Generate training/sample data.

## Notebooks (Python)
* [Parameters & More for Jupyter Notebooks](https://github.com/nteract/papermill)
* Parameterization and reproducibility tools for Jupyter notebooks. Used by Netflix.

* [Pyodide](https://github.com/iodide-project/pyodide)
* Python and NumPy compiled to WebAssembly, enabling a browser-based notebook designed for publishing.

## Forecasting (Python)
* [Forecasting Website Traffic using Facebook's Prophet](http://pbpython.com/prophet-overview.html)

## Visualization
* [So You Want to Build a Scroller](http://vallandingham.me/scroller.html)
* Example code on how to put together a web-based animated scrolling presentation.
* [R Graphics / ggplot2 Tutorial](http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html)
* Overview of graphing options in R with a good tutorial on using `ggplot2`.
* [Facets](https://pair-code.github.io/facets/)
* Graphical tool to explore your data.
* [Tableau's Make-over Monday](http://www.makeovermonday.co.uk/blog/)
* Blog (and data) for slick visualizations using Tableau. "A weekly social data project".
* [R Base Graphics: An Idiot's Guide](http://rstudio-pubs-static.s3.amazonaws.com/7953_4e3efd5b9415444ca065b1167862c349.html)
* Good overview of using R's base graphics plotting system.
* [Top 50 Matplotlib Visualizations Cookbook](https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/)
* Examples and explanations of use cases for 50 Matplotlib visualizations.
* [Data Visualization Catalog](https://datavizcatalogue.com/index.html)
* Catalog of useful visualizations with examples and cross-referenced by utility type.

## Blog Posts
* [Stitchfix Algorithms Tour](http://algorithms-tour.stitchfix.com/#data-platform)
* Really great D3 presentation on the algorithms and data science process used at StitchFix.
* [On Average, You're Doing It Wrong](https://towardsdatascience.com/on-average-youre-using-the-wrong-average-geometric-harmonic-means-in-data-analysis-2a703e21ea0)
* Detailed analysis of arithmetic, geometric, and harmonic means, when to use them, and how, with examples.
* [Minikube & Spark Local Setup](http://blog.madhukaraphatak.com/categories/kubernetes-series/)
* Amazing tutorial on setting up Spark inside of a local Kubernetes cluster.
* [Statistical Significance](https://towardsdatascience.com/statistical-significance-hypothesis-testing-the-normal-curve-and-p-values-93274fa32687)
* Good, simple overview of the core concepts of hypothesis testing.
* [Building Toy Neural Networks (Python)](https://iamtrask.github.io/2015/07/12/basic-python-network/)
* Very detailed (yet simple) walkthrough of two toy neural networks in Python. Very educational.
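
As a small illustration of the arithmetic, geometric, and harmonic means discussed in "On Average, You're Doing It Wrong" above, here is a minimal sketch using Python's standard `statistics` module (Python 3.8+). The growth factors and speeds are made-up numbers chosen for illustration, not data from the post.

```python
from math import prod
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical yearly growth factors: +10%, -20%, +30%.
growth_factors = [1.10, 0.80, 1.30]

# The geometric mean is the constant yearly factor that would
# produce the same compound growth over the three years.
avg_factor = geometric_mean(growth_factors)
compound_growth = prod(growth_factors)

# The arithmetic mean of the factors overstates the typical growth.
naive_factor = mean(growth_factors)

# Hypothetical trip: the same distance driven once at 30 and once at 60
# (any speed unit). The harmonic mean gives the overall average speed.
speeds = [30, 60]
avg_speed = harmonic_mean(speeds)

print(f"geometric mean factor:  {avg_factor:.4f}")
print(f"arithmetic mean factor: {naive_factor:.4f}")
print(f"harmonic mean speed:    {avg_speed:.1f}")
```

Rule of thumb: multiplicative quantities (growth rates, ratios) suit the geometric mean, and rates measured over a fixed amount of work (speed over equal distances) suit the harmonic mean.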

## References
* [Cross Validated](https://stats.stackexchange.com/)
* User-answered Q&A site for statistics, machine learning, and data analytics.
* [Statistics HowTo](http://www.statisticshowto.com/probability-and-statistics/)
* Simple video explanations of basic statistical concepts. Very informative.
* [Practical Guide To SQL Isolation](https://begriffs.com/posts/2017-08-01-practical-guide-sql-isolation.html)
* The clearest guide to SQL isolation levels I've seen.
* [15 Types of Regression](https://www.listendata.com/2018/03/regression-analysis.html)
* Great resource.
* [39 Machine Learning Resources](https://medium.com/@karamanbk/39-machine-learning-resources-that-will-help-you-in-every-essential-step-b2696515ed9)

## References (R)
* [R Documentation](https://rdrr.io/)
* Nice documentation resource for popular R packages.


## How To (Python)
* [Simulating Chutes & Ladders in Python](https://jakevdp.github.io/blog/2017/12/18/simulating-chutes-and-ladders/?utm_campaign=Data%2BElixir&utm_medium=web&utm_source=Data_Elixir_162)
* Very thorough introduction to simulation, Markov chains, and entropy, applied to the board game in Python/Jupyter.
* [Simple/Multiple Linear Regression Tutorial](https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9)
* Complete tutorial on linear regression in scikit-learn (`sklearn`).
* [Python Machine Learning Example: Linear Regression](http://devarea.com/python-machine-learning-example-linear-regression/)
* Complete example of Linear Regression in Python/Pandas.
* [Machine Learning Regression of 911 Calls](http://machinelearningexp.com/machine-learning-regression-911-calls/)
* [Install IRkernel for Jupyter Notebooks](https://www.datacamp.com/community/blog/jupyter-notebook-r)
* Best installation instructions I've found thus far.
* [Twitter Sentiment Analysis](https://towardsdatascience.com/another-twitter-sentiment-analysis-bb5b01ebad90)
* In-depth, easy-to-understand 4-part NLTK sentiment analysis tutorial.
* [Markov Chains From Scratch](http://www.johnwittenauer.net/markov-chains-from-scratch/)
* Easy-to-understand tutorial on coding a Trump Tweet generator using Markov chains.
* [Random Forest In Python](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)
* [Introduction to Python Ensembles](https://www.kdnuggets.com/2018/02/introduction-python-ensembles.html)
* Detailed how-to on constructing and evaluating ensemble ML methods.
* [101 NumPy Exercises](https://www.machinelearningplus.com/101-numpy-exercises-python/)
* [Calculate PCA From Scratch In Python](https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/)
* [Machine Learning Basics](https://github.com/zotroneneis/machine_learning_basics)
* Basic ML algorithms implemented in Python/Jupyter notebooks.
* [How To Use HDF5 Files In Python](https://www.uetke.com/blog/python/how-to-use-hdf5-files-in-python/)
* [ARIMA Forecasting](https://www.datasciencecentral.com/profiles/blogs/tutorial-forecasting-with-seasonal-arima)
* Detailed walkthrough of creating an ARIMA forecast in a notebook format.
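
To give a flavor of the "Markov Chains From Scratch" entry above, here is a minimal sketch of a first-order, word-level Markov text generator using only the standard library. It is not the linked tutorial's code. The toy corpus and the function names `build_chain` and `generate` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Walk the chain from `start`, sampling a successor at each step."""
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:  # dead end: the last word never has a successor
            break
        out.append(rng.choice(followers))
    return " ".join(out)


# Toy corpus (hypothetical); a real generator would use far more text.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(corpus)

# A seeded generator makes the output reproducible.
sentence = generate(chain, "the", 8, random.Random(0))
print(sentence)
```

Because successors are stored with repetition, `rng.choice` samples each next word in proportion to how often it followed the current word in the corpus.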

## How To (R)
* [Fundamentals of Linear Regression](https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220)
* One of the better tutorials of Linear Regression (with code).
* [Gradient Descent](http://www.machinegurning.com/rstats/gradient-descent/)
* Intuitive tutorial on the Gradient Descent algorithm (with code).
* [Select Optimal # of Topics for LDA](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html)
* [Linear Regression By Hand](https://dsgazette.com/2018/01/10/linear-regression-by-hand/)

## Datasets
* [data.world](https://data.world/)
* Free sample data sets (requires registration).

## To Read / Categorize
* [Practical AI Jupyter Notebooks](https://github.com/GokuMohandas/practicalAI/blob/master/README.md)
* Lots of example notebooks on a variety of ML topics.
* [Understanding Empirical Bayes Estimation (Using Baseball Statistics)](http://varianceexplained.org/r/empirical_bayes_baseball/)
* Using Bayesian priors to assist in estimation of batting averages in R.
* [Understanding Beta Distributions (Using Baseball Statistics)](http://varianceexplained.org/statistics/beta_distribution_and_baseball/)
* Same author as above but analysis introduces and uses Beta distributions.
* The [post](https://stats.stackexchange.com/questions/47771/what-is-the-intuition-behind-beta-distribution/47782#47782) that started it all.
* [Introductory Econometrics](http://www3.wabash.edu/econometrics/EconometricsBook/index.htm)
* Online Excel examples from textbook (includes discussions of Monte Carlo simulations and Heteroskedasticity).
* [Probabilistic Graphical Models Tutorial](https://blog.statsbot.co/probabilistic-graphical-models-tutorial-and-solutions-e4f1d72af189)
* Novel solution to the "Monty Hall Problem"
* [Applying Bayes' Theorem: Simulating the Monty Hall Problem with Python](https://medium.com/@NickDoesData/applying-bayes-theorem-simulating-the-monty-hall-problem-with-python-5054976d1fb5)
* Simulation of "Monty Hall Problem" in Python with decent explanations.
* [Fast.AI Deep Learning Course](http://course.fast.ai/lessons/lesson1.html)
* [Machine Learning Tasks for Beginners](https://elitedatascience.com/machine-learning-projects-for-beginners)
* [Deep Learning: From Image to Webpage](https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/)
* Learn how to write a Deep Learning model to code a webpage from a source image.
* [Kubernetes Cheatsheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/)
* [Installing PostgreSQL via Helm](https://medium.com/@nicdoye/installing-postgresql-via-helm-237e026453b1)

## To Find Resources To Show
* How to fix Heteroskedasticity
* How to fix Collinearity
* F-Scores
* ROC-AUC
* p Test
* Linear Regression
* Logistic Regression
* CART (Random Forest)
* Gradient Descent
* Bias vs. Variance Tradeoff
* Binomial Distribution (discrete)
* Poisson Distribution (discrete)