{"id":13071,"url":"https://github.com/predicate-logic/awesome-data-science","name":"awesome-data-science","description":"Links and references for tools","projects_count":65,"last_synced_at":"2026-06-13T20:00:31.625Z","repository":{"id":82254390,"uuid":"116692993","full_name":"predicate-logic/awesome-data-science","owner":"predicate-logic","description":"Links and references for tools","archived":false,"fork":false,"pushed_at":"2020-07-22T15:45:23.000Z","size":50,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-28T05:02:10.395Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/predicate-logic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-01-08T15:20:30.000Z","updated_at":"2020-07-22T15:45:26.000Z","dependencies_parsed_at":null,"dependency_job_id":"1f2ac407-bc07-4d4e-98ec-dab8588b91d1","html_url":"https://github.com/predicate-logic/awesome-data-science","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/predicate-logic/awesome-data-science","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predicate-logic%2Fawesome-data-science","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predicate-logic%2Fawesome-data-science/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predicate-logic%2Fawesome-data-science/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/predicate-logic%2Fawesome-data-science/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/predicate-logic","download_url":"https://codeload.github.com/predicate-logic/awesome-data-science/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predicate-logic%2Fawesome-data-science/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34298247,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-13T02:00:06.617Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"created_at":"2024-01-12T20:23:44.980Z","updated_at":"2026-06-13T20:00:31.626Z","primary_language":null,"list_of_lists":false,"displayable":true,"categories":["References (R)","Visualization","References","General / Introduction","Datasets","Blog Posts","To Read / Categorize","Libraries (Python)","Forecasting (Python)","How To (Python)","How To (R)","Notebooks (Python)"],"sub_categories":[],"readme":"# Awesome Data Science\nLinks and references useful for data science.\n\n## General / Introduction\n   * [A Visual Introduction To Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n      * One of the slickest visual introductions to machine learning ever produced.\n   * [Seeing Theory](https://seeing-theory.brown.edu/#firstPage)\n      * Great D3 
visualization tutorial on the basics of statistics.\n      \n## Libraries (Python)\n   * [Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)\n      * The current standard for data frames in Python.   \n   * [Yellowbrick](http://www.scikit-yb.org/en/latest/index.html)\n      * \"Visualizers\" to allow for human-steering of the model selection process.\n   * [Boruta](https://github.com/scikit-learn-contrib/boruta_py)\n      * Finds a maximal subset of information-carrying features in a data set.\n   * [Featuretools](https://github.com/Featuretools/featuretools)\n      * Automated feature engineering library.\n   * [Surprise](http://surpriselib.com/)\n      * Build and test recommendation systems with a variety of prebuilt algorithms.\n   * [Snorkel](https://hazyresearch.github.io/snorkel/)\n      * Library to extract data from structured or \"dark data\".  Includes references to workflow/tools to augment the labeling process.\n   * [SpaCy](https://spacy.io/_)\n      * Fast natural language processing (NLP) library.\n   * [Voluptuous](https://github.com/alecthomas/voluptuous)\n      * Data validation library.\n   * [Lime](https://github.com/marcotcr/lime)\n       * Explain classifiers.\n   * [FeatureTools](https://docs.featuretools.com/index.html)\n       * Automated feature engineering.\n   * [Snorkel](https://www.snorkel.org/)\n       * Generate training/sample data.\n       \n## Notebooks (Python)\n   * [Parameters \u0026 More for Jupyter Notebooks](https://github.com/nteract/papermill)\n      * Parameterization and reproducibility tools for Jupyter notebooks.  
Used by Netflix.\n      \n   * [Pyodide](https://github.com/iodide-project/pyodide)\n      * Python and NumPy compiled into a WebAssembly-enabled notebook designed for publishing.\n   \n## Forecasting (Python)\n   * [Forecasting Website Traffic using Facebook's Prophet](http://pbpython.com/prophet-overview.html)\n   \n## Visualization\n   * [So You Want to Build a Scroller](http://vallandingham.me/scroller.html)\n      * Example code on how to put together a web-based animated scrolling presentation.\n   * [R Graphics / ggplot2 Tutorial](http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html)\n       * Overview of graphing options in R with a good tutorial on using `ggplot2`.\n   * [Facets](https://pair-code.github.io/facets/)\n       * Graphical tool to explore your data.\n   * [Tableau's Make-over Monday](http://www.makeovermonday.co.uk/blog/)\n       * Blog (and data) for slick visualizations using Tableau.  \"A weekly social data project\".\n   * [R Base Graphics: An Idiot's Guide](http://rstudio-pubs-static.s3.amazonaws.com/7953_4e3efd5b9415444ca065b1167862c349.html)\n       * Good overview of using R's base graphics plotting system.\n   * [Top 50 Matplotlib Visualizations Cookbook](https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/)\n       * Examples and explanations of use cases for 50 Matplotlib visualizations.\n   * [Data Visualization Catalog](https://datavizcatalogue.com/index.html)\n       * Catalog of useful visualizations with examples and cross-referenced by utility type.\n   \n## Blog Posts\n   * [Stitchfix Algorithms Tour](http://algorithms-tour.stitchfix.com/#data-platform)\n      * Really great D3 presentation on the algorithms and data science process used at StitchFix.\n   * [On Average, You're Doing It Wrong](https://towardsdatascience.com/on-average-youre-using-the-wrong-average-geometric-harmonic-means-in-data-analysis-2a703e21ea0)\n      * Detailed analysis of arithmetic, geometric, and 
harmonic means, when to use them, and how, with examples.\n   * [Minikube \u0026 Spark Local Setup](http://blog.madhukaraphatak.com/categories/kubernetes-series/)\n      * Amazing tutorial on setting up Spark inside of a local Kubernetes cluster.\n   * [Statistical Significance](https://towardsdatascience.com/statistical-significance-hypothesis-testing-the-normal-curve-and-p-values-93274fa32687)\n      * Good, simple overview of the core concepts of hypothesis testing.\n   * [Building Toy Neural Networks (Python)](https://iamtrask.github.io/2015/07/12/basic-python-network/)\n      * Very detailed (yet simple) walkthrough of two toy neural networks in Python.  Very educational.\n\n## References\n   * [Cross Validated](https://stats.stackexchange.com/)\n      * User-answered question site related to Statistics, Machine Learning, and Data Analytics.\n   * [Statistics HowTo](http://www.statisticshowto.com/probability-and-statistics/)\n      * Simple video explanations of basic statistical concepts.  Very informative. 
\n   * [Practical Guide To SQL Isolation](https://begriffs.com/posts/2017-08-01-practical-guide-sql-isolation.html)\n      * Most understandable guide to SQL isolation levels I've seen.\n   * [15 Types of Regression](https://www.listendata.com/2018/03/regression-analysis.html)\n      * Great resource.\n   * [39 Machine Learning Resources](https://medium.com/@karamanbk/39-machine-learning-resources-that-will-help-you-in-every-essential-step-b2696515ed9)\n      \n## References (R)\n   * [R Documentation](https://rdrr.io/)\n      * Nice documentation resource for popular R packages.\n  \n      \n## How To (Python)\n   * [Simulating Chutes \u0026 Ladders in Python](https://jakevdp.github.io/blog/2017/12/18/simulating-chutes-and-ladders/?utm_campaign=Data%2BElixir\u0026utm_medium=web\u0026utm_source=Data_Elixir_162)\n       * Very thorough introduction to simulation, Markov Chains, entropy for the board game in Python/Jupyter.\n   * [Simple/Multiple Linear Regression Tutorial](https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9)\n       * Complete tutorial of linear regression in `SKlearn`.\n   * [Python Machine Learning Example: Linear Regression](http://devarea.com/python-machine-learning-example-linear-regression/)\n       * Complete example of Linear Regression in Python/Pandas.\n   * [Machine Learning Regression of 911 Calls](http://machinelearningexp.com/machine-learning-regression-911-calls/)\n   * [Install IRKernel for Jupyter Notebooks](https://www.datacamp.com/community/blog/jupyter-notebook-r)\n       * Best installation instructions I've found thus far.\n   * [Twitter Sentiment Analysis](https://towardsdatascience.com/another-twitter-sentiment-analysis-bb5b01ebad90)\n       * In-depth, easy-to-understand 4-part NLTK sentiment analysis tutorial.\n   * [Markov Chains From Scratch](http://www.johnwittenauer.net/markov-chains-from-scratch/)\n       * Easy-to-understand tutorial on coding a Trump Tweet 
generator using Markov Chains.\n   * [Random Forest In Python](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)\n   * [Introduction to Python Ensembles](https://www.kdnuggets.com/2018/02/introduction-python-ensembles.html)\n      * Detailed how-to on constructing and evaluating ensemble ML methods.\n   * [101 NumPy Exercises](https://www.machinelearningplus.com/101-numpy-exercises-python/)\n   * [Calculate PCA From Scratch In Python](https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/)\n   * [Machine Learning Basics](https://github.com/zotroneneis/machine_learning_basics)\n      * Basic ML algos implemented in Python/Jupyter notebooks.\n   * [How To Use HDF5 Files In Python](https://www.uetke.com/blog/python/how-to-use-hdf5-files-in-python/)\n   * [ARIMA Forecasting](https://www.datasciencecentral.com/profiles/blogs/tutorial-forecasting-with-seasonal-arima)\n      * Detailed walkthrough of creating an ARIMA forecast in a notebook format.\n       \n## How To (R)\n   * [Fundamentals of Linear Regression](https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220)\n       * One of the better tutorials of Linear Regression (with code).\n   * [Gradient Descent](http://www.machinegurning.com/rstats/gradient-descent/)\n       * Intuitive tutorial on Gradient Descent algorithm (with code).\n   * [Select Optimal # of Topics for LDA](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html)\n   * [Linear Regression By Hand](https://dsgazette.com/2018/01/10/linear-regression-by-hand/)\n   \n## Datasets\n   * [data.world](https://data.world/)\n       * Free sample data sets (requires registration).\n       \n## To Read / Categorize\n   * [Practical AI Jupyter Notebooks](https://github.com/GokuMohandas/practicalAI/blob/master/README.md)\n      * Lots of example notebooks on a variety of ML topics.\n   * [Understanding Empirical Bayes Estimation (Using Baseball 
Statistics)](http://varianceexplained.org/r/empirical_bayes_baseball/)\n      * Using Bayesian priors to assist in estimation of batting averages in R.\n   * [Understanding Beta Distributions (Using Baseball Statistics)](http://varianceexplained.org/statistics/beta_distribution_and_baseball/)\n       * Same author as above but analysis introduces and uses Beta distributions.\n       * The [post](https://stats.stackexchange.com/questions/47771/what-is-the-intuition-behind-beta-distribution/47782#47782) that started it all.\n   * [Introductory Econometrics](http://www3.wabash.edu/econometrics/EconometricsBook/index.htm)\n       * Online Excel examples from textbook (includes discussions of Monte Carlo simulations and Heteroskedasticity).\n   * [Probabilistic Graphical Models Tutorial](https://blog.statsbot.co/probabilistic-graphical-models-tutorial-and-solutions-e4f1d72af189)\n       * Novel solution to the \"Monty Hall Problem\".\n   * [Applying Bayes Theorem: Simulating the Monty Hall Problem with Python](https://medium.com/@NickDoesData/applying-bayes-theorem-simulating-the-monty-hall-problem-with-python-5054976d1fb5)\n       * Simulation of \"Monty Hall Problem\" in Python with decent explanations.\n   * [Fast.AI Deep Learning Course](http://course.fast.ai/lessons/lesson1.html)\n   * [Machine Learning Tasks for Beginners](https://elitedatascience.com/machine-learning-projects-for-beginners)\n   * [Deep Learning: From Image to Webpage](https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/)\n       * Learn how to write a Deep Learning model to code a webpage from a source image. 
\n   * [Kubernetes Cheatsheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/)\n   * [Installing PostgreSQL via Helm](https://medium.com/@nicdoye/installing-postgresql-via-helm-237e026453b1)\n       \n## To Find Resources To Show\n   * How to fix Heteroskedasticity\n   * How to fix Collinearity\n   * F-Scores\n   * ROC-AUC\n   * p Test\n   * Linear Regression\n   * Logistic Regression\n   * CART (Random Forest)\n   * Gradient Descent\n   * Bias vs. Variance Tradeoff\n   * Binomial Distribution (discrete)\n   * Poisson Distribution (discrete)\n   \n\n","projects_url":"https://awesome.ecosyste.ms/api/v1/lists/predicate-logic%2Fawesome-data-science/projects"}