https://github.com/justmarkham/dat5

General Assembly's Data Science course in Washington, DC

- Host: GitHub
- URL: https://github.com/justmarkham/dat5
- Owner: justmarkham
- Created: 2015-03-05T17:06:13.000Z
- Default Branch: master
- Last Pushed: 2022-08-19T14:17:55.000Z
- Language: Jupyter Notebook
- Size: 22.2 MB
- Stars: 185
- Watchers: 24
- Forks: 206
- Open Issues: 1

Metadata Files:
- Readme: README.md
## DAT5 Course Repository
Course materials for [General Assembly's Data Science course](https://generalassemb.ly/education/data-science/washington-dc/) in Washington, DC (3/18/15 - 6/3/15).
**Instructors:** Brandon Burroughs and Kevin Markham ([Data School blog](http://www.dataschool.io/), [email newsletter](http://www.dataschool.io/subscribe/), [YouTube channel](https://www.youtube.com/user/dataschool))
Monday | Wednesday
--- | ---
*No Class* | 3/18: Introduction and Python
3/23: Git and Command Line | 3/25: Exploratory Data Analysis
**3/30:** Visualization and APIs | 4/1: Machine Learning and KNN
**4/6:** Bias-Variance and Model Evaluation | 4/8: Kaggle Titanic
4/13: Web Scraping, Tidy Data, Reproducibility | 4/15: Linear Regression
4/20: Logistic Regression and Confusion Matrices | 4/22: ROC and Cross-Validation
**4/27:** Project Presentation #1 | 4/29: Naive Bayes
5/4: Natural Language Processing | 5/6: Kaggle Stack Overflow
5/11: Decision Trees | 5/13: Ensembles
**5/18:** Clustering and Regularization | 5/20: Advanced scikit-learn and Regex
**5/25:** *No Class* | 5/27: Databases and SQL
6/1: Course Review | **6/3:** Project Presentation #2
### Key Project Dates
* **3/30:** Deadline for discussing your project idea(s) with an instructor
* **4/6:** Project question and dataset (write-up)
* **4/27:** Project presentation #1 (slides, code, visualizations)
* **5/18:** First draft due (draft of project paper, code, visualizations)
* **5/25:** Peer review due
* **6/3:** Project presentation #2 (project paper, slides, code, visualizations, data, data dictionary)
### Key Project Links
* [Course project requirements](other/project.md)
* [Public data sources](other/public_data.md)
* [Kaggle competitions](http://www.kaggle.com/)
* [Examples of student projects](https://github.com/justmarkham/DAT-project-examples)
* [Peer review guidelines](other/peer_review.md)
### Logistics
* Office hours will take place every Saturday and Sunday.
* Homework will be assigned every Wednesday and due on Monday, and you'll receive feedback by Wednesday.
* Our primary tool for out-of-class communication will be a private chat room through [Slack](https://slack.com/).
### Submission Forms
* [Homework submission form](http://bit.ly/dat5homework) (also for project submissions)
* [Gist](https://gist.github.com/) is an easy way to put your homework online
* [Feedback submission form](http://bit.ly/dat5feedback) (at the end of every class)
### Before the Course Begins
* Install the [Anaconda distribution](http://continuum.io/downloads) of Python 2.7.x.
* Install [Git](http://git-scm.com/book/en/v2/Getting-Started-Installing-Git) and create a [GitHub](https://github.com/) account.
* Once you receive an email invitation from Slack, join our "DAT5 team" and add your photo.
* Choose a [Python workshop](https://generalassemb.ly/education?format=classes-workshops) to attend, depending upon your current skill level:
* Beginner: [Saturday 3/7 10am-2pm](https://generalassemb.ly/education/introduction-to-python-programming/washington-dc/11137) or [Thursday 3/12 6:30pm-9pm](https://generalassemb.ly/education/introduction-to-python-programming/washington-dc/11136)
* Intermediate: [Saturday 3/14 10am-2pm](https://generalassemb.ly/education/python-for-data-science-intermediate/washington-dc/11167)
* Practice your Python using the resources below.
### Python Resources
* [Codecademy's Python course](http://www.codecademy.com/en/tracks/python): Good beginner material, including tons of in-browser exercises.
* [DataQuest](https://dataquest.io/missions): Similar interface to Codecademy, but focused on teaching Python in the context of data science.
* [Google's Python Class](https://developers.google.com/edu/python/): Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
* [A Crash Course in Python for Scientists](http://nbviewer.ipython.org/gist/rpmuller/5920182): Read through the Overview section for a quick introduction to Python.
* [Python for Informatics](http://www.pythonlearn.com/book.php): A very beginner-oriented book, with associated [slides](https://drive.google.com/folderview?id=0B7X1ycQalUnyal9yeUx3VW81VDg&usp=sharing) and [videos](https://www.youtube.com/playlist?list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ).
* Code from our [beginner](code/00_python_beginner_workshop.py) and [intermediate](code/00_python_intermediate_workshop.py) workshops: Useful for review and reference.
-----
### Class 1: Introduction and Python
* Introduction to General Assembly
* Course overview ([slides](slides/01_course_overview.pdf))
* Brief tour of Slack
* Checking the setup of your laptop
* Python lesson with [airline safety data](https://github.com/fivethirtyeight/data/tree/master/airline-safety) ([code](code/01_reading_files.py))
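For review, here's a minimal sketch of the file-reading pattern from this lesson, using Python's built-in `csv` module (written for Python 3, and assuming a local copy of the airline safety data saved as `airline_safety.csv`):

```python
import csv

# read the file into a list of lists
with open('airline_safety.csv') as f:
    data = [row for row in csv.reader(f)]

header = data[0]   # the first row holds the column names
rows = data[1:]    # each remaining row is one airline

print(header)
print(len(rows), 'rows')
```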
**Homework:**
* Python exercises with [Chipotle order data](https://github.com/TheUpshot/chipotle) (listed at bottom of [code](code/01_reading_files.py) file) ([solution](code/01_chipotle_homework_solution.py))
* Work through GA's excellent introductory [command line tutorial](http://generalassembly.github.io/prework/command-line/#/) and then take this brief [quiz](https://gahub.typeform.com/to/J6xirf).
* Read through the [course project requirements](other/project.md) and start thinking about your own project!
**Optional:**
* If we discovered any setup issues with your laptop, please resolve them before Monday.
* If you're not feeling comfortable in Python, keep practicing using the resources above!
-----
### Class 2: Git and Command Line
* Any questions about the course project?
* Command line ([slides](slides/02_Introduction_to_the_Command_Line.md))
* Git and GitHub ([slides](slides/02_git_github.pdf))
**Homework:**
* Command line exercises with [SMS Spam Data](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) (listed at the bottom of [Introduction to the Command Line](slides/02_Introduction_to_the_Command_Line.md)) ([solution](homework/02_command_line_hw_soln.md))
* **Note**: This homework is not due until Monday. You might want to create a GitHub repo for your homework instead of using Gist!
**Optional:**
* Browse through some [example student projects](https://github.com/justmarkham/DAT-project-examples) to stimulate your thinking and give you a sense of project scope.
**Resources:**
* This [Command Line Primer](http://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything) goes a bit more into command line scripting.
* Read the first two chapters of [Pro Git](http://git-scm.com/book/en/v2) to gain a much deeper understanding of version control and basic Git commands.
* Watch [Introduction to Git and GitHub](https://www.youtube.com/playlist?list=PL5-da3qGB5IBLMp7LtN8Nc3Efd4hJq0kD) (36 minutes) for a quick review of a lot of today's material.
* [GitRef](http://gitref.org/) is an excellent reference guide for Git commands, and [Git quick reference for beginners](http://www.dataschool.io/git-quick-reference-for-beginners/) is a shorter guide with commands grouped by workflow.
* The [Markdown Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) covers standard Markdown and a bit of "[GitHub Flavored Markdown](https://help.github.com/articles/github-flavored-markdown/)."
-----
### Class 3: Pandas
* Pandas for data exploration, analysis, and visualization ([code](code/03_exploratory_analysis_pandas.py))
* [Split-Apply-Combine](http://i.imgur.com/yjNkiwL.png) pattern
* Simple examples of [joins in Pandas](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/#joining)
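To make the split-apply-combine pattern concrete, here's a minimal pandas sketch (the toy drink data is invented for illustration):

```python
import pandas as pd

# toy data: one row per order
df = pd.DataFrame({'drink': ['beer', 'beer', 'wine', 'wine', 'wine'],
                   'price': [5, 6, 9, 8, 10]})

# split by drink, apply the mean to each group, combine into one result
print(df.groupby('drink')['price'].mean())

# the same pattern supports multiple aggregations at once
print(df.groupby('drink')['price'].agg(['count', 'mean', 'max']))
```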
**Homework:**
* Pandas practice with [Automobile MPG Data](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) (listed at the bottom of [Exploratory Analysis in Pandas](code/03_exploratory_analysis_pandas.py)) ([solution](homework/03_pandas_hw_soln.py))
* Talk to an instructor about your project
* Don't forget about the command line exercises (listed at the bottom of [Introduction to the Command Line](slides/02_Introduction_to_the_Command_Line.md))
**Optional:**
* To learn more Pandas, review this [three-part tutorial](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/), or review these two excellent (but extremely long) notebooks on Pandas: [introduction](http://nbviewer.ipython.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_5-Introduction-to-Pandas.ipynb) and [data wrangling](http://nbviewer.ipython.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_6-Data-Wrangling-with-Pandas.ipynb).
* Read [How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips](http://iquantny.tumblr.com/post/107245431809/how-software-in-half-of-nyc-cabs-generates-5-2) for an excellent example of exploratory data analysis.
-----
### Class 4: Visualization and APIs
* Visualization ([slides](slides/04_visualization.pdf) and [code](code/04_visualization.py))
* APIs ([slides](slides/04_apis.pdf) and [code](code/04_apis.py))
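As a quick illustration of the API pattern from class, here's a minimal sketch using the `requests` package against GitHub's public API (any JSON API works the same way):

```python
import requests

# GitHub's API returns JSON metadata about this very repository
r = requests.get('https://api.github.com/repos/justmarkham/dat5')

print(r.status_code)   # 200 means the request succeeded
data = r.json()        # parse the JSON body into a Python dict
print(data['stargazers_count'], data['forks_count'])
```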
**Homework:**
* Visualization practice with [Automobile MPG Data](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) (listed at the bottom of [the visualization code](code/04_visualization.py)) ([solution](homework/04_visualization_hw_soln.py))
* **Note**: This homework isn't due until Monday.
**Optional:**
* Watch [Look at Your Data](https://www.youtube.com/watch?v=coNDCIMH8bk) (18 minutes) for an excellent example of why visualization is useful for understanding your data.
**Resources:**
* For more on Pandas plotting, read this [notebook](http://nbviewer.ipython.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_7-Plotting-with-Pandas.ipynb) or the [visualization page](http://pandas.pydata.org/pandas-docs/stable/visualization.html) from the official Pandas documentation.
* To learn how to customize your plots further, browse through this [notebook on matplotlib](http://nbviewer.ipython.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_4-Matplotlib.ipynb) or this [similar notebook](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb).
* To explore different types of visualizations and when to use them, [Choosing a Good Chart](http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf) and [The Graphic Continuum](http://www.coolinfographics.com/storage/post-images/The-Graphic-Continuum-POSTER.jpg) are handy one-page references, or check out the [R Graph Catalog](http://shinyapps.stat.ubc.ca/r-graph-catalog/).
* For a more in-depth introduction to visualization, browse through these [PowerPoint slides](http://www2.research.att.com/~volinsky/DataMining/Columbia2011/Slides/Topic2-EDAViz.ppt) from Columbia's Data Mining class.
* [Mashape](https://www.mashape.com/explore) and [Apigee](https://apigee.com/providers) allow you to explore tons of different APIs. Alternatively, a [Python API wrapper](http://www.pythonforbeginners.com/api/list-of-python-apis) is available for many popular APIs.
-----
### Class 5: Data Science Workflow, Machine Learning, KNN
* Iris dataset
* [What does an iris look like?](http://sebastianraschka.com/Images/2014_python_lda/iris_petal_sepal.png)
* [Data](http://archive.ics.uci.edu/ml/datasets/Iris) hosted by the UCI Machine Learning Repository
* "Human learning" exercise ([solution](code/05_iris_exercise.py))
* Introduction to data science ([slides](slides/05_intro_to_data_science.pdf))
* [Quora: What is data science?](https://www.quora.com/What-is-data-science/answer/Michael-Hochster)
* [Data science Venn diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
* [Quora: What is the workflow of a data scientist?](http://www.quora.com/What-is-the-work-flow-or-process-of-a-data-scientist/answer/Ryan-Fox-Squire)
* Example student project: [MetroMetric](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/bus_presentation.pdf)
* Machine learning and KNN ([slides](slides/05_machine_learning_knn.pdf))
* [Reddit AMA with Yann LeCun](http://www.reddit.com/r/MachineLearning/comments/25lnbt/ama_yann_lecun)
* [Characteristics of your zip code](http://www.esri.com/landing-pages/tapestry/)
* Introduction to scikit-learn ([code](code/05_sklearn_knn.py))
* Documentation: [user guide](http://scikit-learn.org/stable/modules/neighbors.html), [module reference](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors), [class documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
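Here's a minimal KNN sketch on the built-in iris data. Note that it uses the current `sklearn.model_selection` import path, which differs from the 2015-era scikit-learn API in the course code:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X is 150 flower measurements (4 features), y is the species (0, 1, or 2)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=1)

# fit a KNN model with K=5 and evaluate it on the held-out data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))
```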
**Homework:**
* Complete your visualization homework assigned in class 4
* [Reading assignment on the bias-variance tradeoff](homework/06_bias_variance.md)
* A write-up about your [project question and dataset](other/project.md) is due on Monday! ([example one](https://github.com/justmarkham/DAT4-students/blob/master/jason/jk_project_idea.md), [example two](https://github.com/justmarkham/DAT4-students/blob/master/alexlee/project_question.md))
**Optional:**
* For a useful look at the different types of data scientists, read [Analyzing the Analyzers](http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf) (32 pages).
* For some thoughts on what it's like to be a data scientist, read these short posts from [Win-Vector](http://www.win-vector.com/blog/2012/09/on-being-a-data-scientist/) and [Datascope Analytics](http://datascopeanalytics.com/what-we-think/2014/07/31/six-qualities-of-a-great-data-scientist).
* For a fun (yet enlightening) look at the data science workflow, read [What I do when I get a new data set as told through tweets](http://simplystatistics.org/2014/06/13/what-i-do-when-i-get-a-new-data-set-as-told-through-tweets/).
* For a more in-depth introduction to data science, browse through these [PowerPoint slides](http://www2.research.att.com/~volinsky/DataMining/Columbia2011/Slides/Topic1-DMIntro.ppt) from Columbia's Data Mining class.
* For a more in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/). (It's a free PDF download!)
* For a really nice comparison of supervised versus unsupervised learning, plus an introduction to reinforcement learning, watch this [video](http://work.caltech.edu/library/014.html) (13 minutes) from Caltech's [Learning From Data](http://work.caltech.edu/telecourse.html) course.
**Resources:**
* Quora has a [data science topic FAQ](https://www.quora.com/What-is-the-Data-Science-topic-FAQ) with lots of interesting Q&A.
* Keep up with local data-related events through the Data Community DC [event calendar](http://www.datacommunitydc.org/calendar) or [weekly newsletter](http://www.datacommunitydc.org/thenewsletter/).
-----
### Class 6: Bias-Variance Tradeoff and Model Evaluation
* Brief introduction to the IPython Notebook
* Exploring the bias-variance tradeoff ([notebook](notebooks/06_bias_variance.ipynb))
* Discussion of the [assigned reading](homework/06_bias_variance.md) on the bias-variance tradeoff
* Model evaluation procedures ([notebook](notebooks/06_model_evaluation_procedures.ipynb))
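One way to see the tradeoff for yourself is to compare training accuracy with testing accuracy as model complexity varies. A minimal sketch reusing the iris setup from class 5 (for KNN, a lower K means a more complex, higher-variance model):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=1)

# watch training accuracy fall (and testing accuracy peak) as K grows
for k in [1, 5, 25, 75]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```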
**Resources:**
* If you would like to learn the IPython Notebook, the official [Notebook tutorials](http://nbviewer.ipython.org/github/ipython/ipython/blob/master/examples/Notebook/Index.ipynb) are useful.
* To get started with Seaborn for visualization, the official website has a series of [tutorials](http://web.stanford.edu/~mwaskom/software/seaborn/tutorial.html) and an [example gallery](http://web.stanford.edu/~mwaskom/software/seaborn/examples/index.html).
* Hastie and Tibshirani have an excellent [video](https://www.youtube.com/watch?v=_2ij6eaaSl0&t=2m34s) (12 minutes, starting at 2:34) that covers training error versus testing error, the bias-variance tradeoff, and train/test split (which they call the "validation set approach").
* Caltech's Learning From Data course includes a fantastic [video](http://work.caltech.edu/library/081.html) (15 minutes) that may help you to visualize bias and variance.
-----
### Class 7: Kaggle Titanic
* Guest instructor: [Josiah Davis](https://generalassemb.ly/instructors/josiah-davis/3315)
* Participate in Kaggle's [Titanic competition](http://www.kaggle.com/c/titanic-gettingStarted)
* Work in pairs, but the goal is for every person to make at least one submission by the end of the class period!
**Homework:**
* Option 1 is to do the [Glass identification homework](homework/07_glass_identification.md). This is a good option if you are still getting comfortable with what we have learned so far, and prefer a very structured assignment. ([solution](code/07_glass_id_homework_solution.py))
* Option 2 is to keep working on the Titanic competition, and see if you can make some additional progress! This is a good assignment if you are feeling comfortable with the material and want to learn a bit more on your own.
* In either case, please submit your code as usual, and include lots of code comments!
-----
### Class 8: Web Scraping, Tidy Data, Reproducibility
* Web scraping ([slides](slides/08_web_scraping.pdf) and [code](code/08_web_scraping.py)); see the minimal Beautiful Soup sketch after this list
* [HTML Tree](http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png)
* Tidy data:
* [Introduction](http://stat405.had.co.nz/lectures/18-tidy-data.pdf)
* Example datasets: [Bob Ross](https://github.com/fivethirtyeight/data/blob/master/bob-ross/elements-by-episode.csv), [NFL ticket prices](https://github.com/fivethirtyeight/data/blob/master/nfl-ticket-prices/2014-average-ticket-price.csv), [airline safety](https://github.com/fivethirtyeight/data/blob/master/airline-safety/airline-safety.csv), [Jets ticket prices](https://github.com/fivethirtyeight/data/blob/master/nfl-ticket-prices/jets-buyer.csv), [Chipotle orders](https://github.com/TheUpshot/chipotle/blob/master/orders.tsv)
* Reproducibility:
* [Introduction](http://www.dataschool.io/reproducibility-is-not-just-for-researchers/), [Tweet](https://twitter.com/jakevdp/status/519563939177197571)
* [Components of reproducible analysis](https://github.com/jtleek/datasharing)
* Examples: [Classic rock](https://github.com/fivethirtyeight/data/tree/master/classic-rock), [student project 1](https://github.com/jwknobloch/DAT4_final_project), [student project 2](https://github.com/justmarkham/DAT4-students/tree/master/Jonathan_Bryan/Project_Files)
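Here is the Beautiful Soup sketch promised above; `example.com` is just a stand-in for whatever page you want to scrape (check a site's terms of use before scraping it):

```python
import requests
from bs4 import BeautifulSoup

# fetch a page and parse its HTML into a navigable tree
html = requests.get('https://example.com/').text
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; get() pulls out an attribute
for link in soup.find_all('a'):
    print(link.get('href'), link.text)
```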
**Resources:**
* This [web scraping tutorial from Stanford](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html) provides an example of getting a list of items.
* If you want to learn more about tidy data, [Hadley Wickham's paper](http://www.jstatsoft.org/v59/i10/paper) has a lot of nice examples.
* If your co-workers tend to create spreadsheets that are [unreadable by computers](https://bosker.wordpress.com/2014/12/05/the-government-statistical-services-terrible-spreadsheet-advice/), perhaps they would benefit from reading this list of [tips for releasing data in spreadsheets](http://www.clean-sheet.org/). (There are some additional suggestions in this [answer](http://stats.stackexchange.com/questions/83614/best-practices-for-creating-tidy-data/83711#83711) from Cross Validated.)
* Here's [Colbert on reproducibility](http://thecolbertreport.cc.com/videos/dcyvro/austerity-s-spreadsheet-error) (8 minutes).
-----
### Class 9: Linear Regression
* Linear regression ([notebook](notebooks/09_linear_regression.ipynb))
* Simple linear regression
* Estimating and interpreting model coefficients
* Confidence intervals
* Hypothesis testing and p-values
* R-squared
* Multiple linear regression
* Feature selection
* Model evaluation metrics for regression
* Handling categorical predictors
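Since the resources below lean on Statsmodels, here's a minimal sketch of its formula interface, with invented advertising-style data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# toy data: does TV ad spending predict sales?
df = pd.DataFrame({'tv': [230, 44, 17, 151, 180, 8, 57, 120],
                   'sales': [22, 10, 9, 18, 13, 5, 12, 15]})

# fit a simple linear regression: sales = beta_0 + beta_1 * tv
lm = smf.ols(formula='sales ~ tv', data=df).fit()

print(lm.params)      # estimated coefficients
print(lm.conf_int())  # 95% confidence intervals
print(lm.pvalues)     # p-values for each coefficient
print(lm.rsquared)    # R-squared
```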
**Homework:**
* If you're behind on homework, use this time to catch up.
* Keep working on your project... your first presentation is in less than two weeks!!
**Resources:**
* To go much more in-depth on linear regression, read Chapter 3 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), from which this lesson was adapted. Alternatively, watch the [related videos](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/) or read my [quick reference guide](http://www.dataschool.io/applying-and-interpreting-linear-regression/) to the key points in that chapter.
* To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on [simple linear regression](http://www.datarobot.com/blog/ordinary-least-squares-in-python/) and [multiple linear regression](http://www.datarobot.com/blog/multiple-regression-using-statsmodels/).
* This [introduction to linear regression](http://people.duke.edu/~rnau/regintro.htm) is much more detailed and mathematically thorough, and includes lots of good advice.
* This is a relatively quick post on the [assumptions of linear regression](http://pareonline.net/getvn.asp?n=2&v=8).
-----
### Class 10: Logistic Regression and Confusion Matrices
* Logistic regression ([slides](slides/10_logistic_regression_confusion_matrix.pdf) and [code](code/10_logistic_regression_confusion_matrix.py))
* Confusion matrices (same links as above)
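A minimal scikit-learn sketch tying the two topics together, using a binary version of the iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# binary task: is this flower virginica (species 2) or not?
iris = load_iris()
y = (iris.target == 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, y, random_state=1)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, logreg.predict(X_test)))
# predicted probabilities of class 1, useful for adjusting the threshold
print(logreg.predict_proba(X_test)[:5, 1])
```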
**Homework:**
* Video assignment on [ROC Curves and Area Under the Curve](homework/11_roc_auc.md)
* Review the notebook from class 6 on [model evaluation procedures](notebooks/06_model_evaluation_procedures.ipynb)
**Resources:**
* For more on logistic regression, watch the [first three videos](https://www.youtube.com/playlist?list=PL5-da3qGB5IC4vaDba5ClatUmFppXLAhE) (30 minutes total) from Chapter 4 of An Introduction to Statistical Learning.
* UCLA's IDRE has a handy table to help you remember the [relationship between probability, odds, and log-odds](http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm).
* Better Explained has a very friendly introduction (with lots of examples) to the [intuition behind "e"](http://betterexplained.com/articles/an-intuitive-guide-to-exponential-functions-e/).
* Here are some useful lecture notes on [interpreting logistic regression coefficients](http://www.unm.edu/~schrader/biostat/bio2/Spr06/lec11.pdf).
* Kevin wrote a [simple guide to confusion matrix terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) that you can use as a reference guide.
-----
### Class 11: ROC Curves and Cross-Validation
* ROC curves and Area Under the Curve
* Discuss the [video assignment](homework/11_roc_auc.md)
* Exercise: [drawing an ROC curve](slides/11_drawing_roc.pdf)
* Calculating AUC and plotting an ROC curve ([notebook](notebooks/11_roc_auc.ipynb))
* Cross-validation ([notebook](notebooks/11_cross_validation.ipynb))
* Discuss this article on [Smart Autofill for Google Sheets](http://googleresearch.blogspot.com/2014/10/smart-autofill-harnessing-predictive.html)
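A minimal sketch of both ideas, reusing the binary iris setup from class 10 (note that AUC is computed from predicted probabilities, not from class predictions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

iris = load_iris()
y = (iris.target == 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, y, random_state=1)

# AUC on a single train/test split
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))

# 5-fold cross-validated AUC, a more reliable estimate
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         iris.data, y, cv=5, scoring='roc_auc')
print(scores.mean())
```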
**Homework:**
* Your first [project presentation](other/project.md) is on Monday! Please submit a link to your project repository (with slides, code, and visualizations) before class using the homework submission form.
**Optional:**
* Titanic exercise ([notebook](notebooks/11_titanic_exercise.ipynb))
**Resources:**
* scikit-learn has extensive documentation on [model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html).
* For more on cross-validation, read section 5.1 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (11 pages) or watch the related videos: [K-fold and leave-one-out cross-validation](https://www.youtube.com/watch?v=nZAM5OXrktY) (14 minutes), [cross-validation the right and wrong ways](https://www.youtube.com/watch?v=S06JpVoNaA0) (10 minutes).
-----
### Class 12: Project Presentation #1
* Project presentations!
**Homework:**
* Read these [Introduction to Probability](https://docs.google.com/presentation/d/1cM2dVbJgTWMkHoVNmYlB9df6P2H8BrjaqAcZTaLe9dA/edit#slide=id.gfc3caad2_00) slides (from the [OpenIntro Statistics textbook](https://www.openintro.org/stat/textbook.php)) and try the included quizzes. Pay specific attention to the following terms: probability, sample space, mutually exclusive, independent.
* Reading assignment on [spam filtering](homework/13_spam_filtering.md).
-----
### Class 13: Naive Bayes
* Conditional probability and Bayes' theorem
* [Slides](slides/13_bayes_theorem.pdf) (adapted from [Visualizing Bayes' theorem](http://oscarbonilla.com/2009/05/visualizing-bayes-theorem/))
* [Visualization of conditional probability](http://setosa.io/conditional/)
* Applying Bayes' theorem to iris classification ([notebook](notebooks/13_bayes_iris.ipynb))
* Naive Bayes classification
* [Slides](slides/13_naive_bayes.pdf)
* Example with spam email ([notebook](notebooks/13_naive_bayes_spam.ipynb))
* Discuss the reading assignment on [spam filtering](homework/13_spam_filtering.md)
* [Airport security example](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt)
* Classifying [SMS messages](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) ([code](code/13_naive_bayes.py))
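A minimal sketch of the text classification pattern, with a few invented messages standing in for the SMS data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# tiny invented stand-in for the SMS spam data (1 = spam, 0 = ham)
texts = ['win a free prize now', 'free cash win win',
         'are we still on for lunch', 'see you at lunch']
labels = [1, 1, 0, 0]

# convert the text into a document-term matrix of token counts
vect = CountVectorizer()
dtm = vect.fit_transform(texts)

# Naive Bayes learns per-class word frequencies from the counts
nb = MultinomialNB().fit(dtm, labels)
print(nb.predict(vect.transform(['free prize if you win'])))
```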
**Homework:**
* Please download/install the following for the NLP class on Monday:
* In Spyder, `import nltk` and run `nltk.download('all')`. This downloads all of the necessary resources for the Natural Language Toolkit (NLTK).
* We'll be using two new packages/modules for this class: textblob and lda. Please install them. **Hint**: In the Terminal (Mac) or Git Bash (Windows), run `pip install textblob` and `pip install lda`.
**Resources:**
* For other intuitive introductions to Bayes' theorem, here are two good blog posts that use [ducks](https://planspacedotorg.wordpress.com/2014/02/23/bayes-rule-for-ducks/) and [legos](http://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego).
* For more on conditional probability, these [slides](https://docs.google.com/presentation/d/1psUIyig6OxHQngGEHr3TMkCvhdLInnKnclQoNUr4G4U/edit#slide=id.gfc69f484_00) may be useful.
* For more details on Naive Bayes classification, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes).
* If you enjoyed Paul Graham's article, you can read [his follow-up article](http://www.paulgraham.com/better.html) on how he improved his spam filter and this [related paper](http://www.merl.com/publications/docs/TR2004-091.pdf) about state-of-the-art spam filtering in 2004.
* If you're planning on using text features in your project, it's worth exploring the different types of [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html) and the many options for [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
-----
### Class 14: Natural Language Processing
* Natural Language Processing ([notebook](notebooks/14_nlp.ipynb))
* NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition, LDA
* Alternative: TextBlob
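A minimal NLTK sketch of tokenization, stemming, and lemmatization; it assumes you've already downloaded the NLTK data (see the class 13 homework):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = nltk.word_tokenize('The geese were charging the chargers.')
print(tokens)

# stemming: crude suffix chopping
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# lemmatization: dictionary-based, e.g. 'geese' becomes 'goose'
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
```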
**Resources:**
* [Natural Language Processing with Python](http://www.nltk.org/book/): free online book to go in-depth with NLTK
* [NLP online course](https://www.coursera.org/course/nlp): no sessions are available, but [video lectures](https://class.coursera.org/nlp/lecture) and [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) are still accessible
* [Brief slides](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) on the major task areas of NLP
* [Detailed slides](https://github.com/ga-students/DAT_SF_9/blob/master/16_Text_Mining/DAT9_lec16_Text_Mining.pdf) on a lot of NLP terminology
* [A visual survey of text visualization techniques](http://textvis.lnu.se/): for exploration and inspiration
* [DC Natural Language Processing](http://www.meetup.com/DC-NLP/): active Meetup group
* [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml): suite of tools if you want to get serious about NLP
* Getting started with regex: [Python introductory lesson](https://developers.google.com/edu/python/regular-expressions) and [reference guide](https://github.com/justmarkham/DAT3/blob/master/code/99_regex_reference.py), [real-time regex tester](https://regex101.com/#python), [in-depth tutorials](http://www.rexegg.com/)
* [A good explanation of LDA](http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation)
* [Textblob documentation](http://textblob.readthedocs.org/en/dev/)
* [SpaCy](http://honnibal.github.io/spaCy/): a new NLP package
-----
### Class 15: Kaggle Stack Overflow
* Overview of how Kaggle works ([slides](slides/15_kaggle.pdf))
* Kaggle In-Class competition: [Predict whether a Stack Overflow question will be closed](https://inclass.kaggle.com/c/dat5-stack-overflow) ([code](code/15_kaggle.py))
**Optional:**
* Keep working on this competition! You can make up to 5 submissions per day, and the competition doesn't close until 6:30pm ET on Wednesday, May 27 (class 20).
**Resources:**
* For a great overview of the diversity of problems tackled by Kaggle competitions, watch [Kaggle Transforms Data Science Into Competitive Sport](https://www.youtube.com/watch?v=8w4UY66GKcM) (28 minutes) by Jeremy Howard (past president of Kaggle).
* [Getting in Shape for the Sport of Data Science](https://www.youtube.com/watch?v=kwt6XEh7U3g) (74 minutes), also by Jeremy Howard, contains a lot of tips for competitive machine learning.
* [Learning from the best](http://blog.kaggle.com/2014/08/01/learning-from-the-best/) is an excellent blog post covering top tips from Kaggle Masters on how to do well on Kaggle.
* [Feature Engineering Without Domain Expertise](https://www.youtube.com/watch?v=bL4b1sGnILU) (17 minutes), a talk by Kaggle Master Nick Kridler, provides some simple advice about how to iterate quickly and where to spend your time during a Kaggle competition.
* Kevin's [project presentation video](https://www.youtube.com/watch?v=HGr1yQV3Um0) (16 minutes) gives a nice tour of the end-to-end machine learning process for a Kaggle competition. (Or, just check out the [slides](https://speakerdeck.com/justmarkham/allstate-purchase-prediction-challenge-on-kaggle).)
-----
### Class 16: Decision Trees
* Decision trees ([notebook](notebooks/16_decision_trees.ipynb))
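A minimal decision tree sketch on the iris data; `max_depth` is one of several knobs for limiting how deep (and how overfit) the tree can grow:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# fit a shallow tree and inspect which features drive the splits
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(iris.data, iris.target)

for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(name, round(importance, 3))
```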
**Resources:**
* scikit-learn documentation: [Decision Trees](http://scikit-learn.org/stable/modules/tree.html)
**Installing Graphviz (optional):**
* Mac:
* [Download and install PKG file](http://www.graphviz.org/Download_macos.php)
* Windows:
* [Download and install MSI file](http://www.graphviz.org/Download_windows.php)
* **Add it to your Path:** Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: `C:\Program Files (x86)\Graphviz2.38\bin`
-----
### Class 17: Ensembles
* Ensembles and random forests ([notebook](notebooks/17_ensembling.ipynb))
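A minimal random forest sketch on the iris data; each tree is trained on a bootstrap sample of the rows and considers a random subset of features at each split, and the forest combines their predictions:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

# 100 decorrelated trees, evaluated with 5-fold cross-validation
rf = RandomForestClassifier(n_estimators=100, random_state=1)
print(cross_val_score(rf, iris.data, iris.target, cv=5).mean())
```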
**Homework:**
* Your [project draft](other/project.md#may-18-first-draft-due) is due on Monday! Please submit a link to your project repository (with paper, code, and visualizations) before class using the homework submission form.
* Your peers and your instructors will be giving you feedback on your project draft.
* Here's an example of a great [final project paper](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/nba_paper.pdf) from a past student.
* Make at least one new submission to our [Kaggle competition](https://inclass.kaggle.com/c/dat5-stack-overflow)! We suggest trying Random Forests or building your own ensemble of models. For assistance, you could use this [framework code](code/17_ensembling_exercise.py), or refer to the [complete code](code/15_kaggle.py) from class 15. You can optionally submit your code to us if you want feedback.
**Resources:**
* scikit-learn documentation: [Ensembles](http://scikit-learn.org/stable/modules/ensemble.html)
* Quora: [How do random forests work in layman's terms?](http://www.quora.com/How-do-random-forests-work-in-laymans-terms/answer/Edwin-Chen-1)
-----
### Class 18: Clustering and Regularization
* Clustering ([slides](slides/18_clustering.pdf) and [code](code/18_clustering.py))
* Regularization ([notebook](notebooks/18_regularization.ipynb) and [code](code/18_regularization.py))
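A minimal sketch of both topics on the iris data; scaling matters for K-means because it is distance-based, and `alpha` controls the strength of the ridge penalty:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# clustering: group the flowers into 3 clusters, ignoring the labels
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_scaled)
print(km.labels_[:10])

# regularization: predict sepal length from the other (scaled) features
ridge = Ridge(alpha=1.0).fit(X_scaled[:, 1:], iris.data[:, 0])
print(ridge.coef_)  # coefficients shrink toward zero as alpha increases
```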
**Homework:**
* You will be assigned to review the project drafts of two of your peers. You have until next Monday to provide them with feedback, according to [these guidelines](other/peer_review.md).
**Resources:**
* [Introduction to Data Mining](http://www-users.cs.umn.edu/~kumar/dmbook/index.php) has a thorough [chapter on cluster analysis](http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf).
* The scikit-learn user guide has a nice [section on clustering](http://scikit-learn.org/stable/modules/clustering.html).
* Wikipedia article on [determining the number of clusters](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).
* This [K-means clustering visualization](http://shiny.rstudio.com/gallery/kmeans-example.html) allows you to set different numbers of clusters for the iris data, and this [other visualization](http://asa.1gb.ru/kmeans/1.html) allows you to see the effects of different initial positions for the centroids.
* Fun examples of clustering: [A Statistical Analysis of the Work of Bob Ross](http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/) (with [data and Python code](https://github.com/fivethirtyeight/data/tree/master/bob-ross)), [How a Math Genius Hacked OkCupid to Find True Love](http://www.wired.com/2014/01/how-to-hack-okcupid/all/), and [characteristics of your zip code](http://www.esri.com/landing-pages/tapestry/).
* An Introduction to Statistical Learning has useful videos on [K-means clustering](https://www.youtube.com/watch?v=aIybuNt9ps4&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2&index=3) (17 minutes), [ridge regression](https://www.youtube.com/watch?v=cSKzqb0EKS0&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=6) (13 minutes), and [lasso regression](https://www.youtube.com/watch?v=A5I1G1MfUmA&index=7&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI) (15 minutes).
* Caltech's Learning From Data course has a great video introducing [regularization](http://work.caltech.edu/library/121.html) (8 minutes) that builds upon their video about the [bias-variance tradeoff](http://work.caltech.edu/library/081.html).
* Here is a longer example of [feature scaling](http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/preprocessing/about_standardization_normalization.ipynb) in scikit-learn, with additional discussion of the types of scaling you can use.
* [Clever Methods of Overfitting](http://hunch.net/?p=22) is a classic post by John Langford.
-----
### Class 19: Advanced scikit-learn and Regular Expressions
* Advanced scikit-learn ([code](code/19_advanced_sklearn.py))
* Searching for optimal parameters: [GridSearchCV](http://scikit-learn.org/stable/modules/grid_search.html)
* [Exercise](code/19_gridsearchcv_exercise.py)
* Standardization of features: [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* Chaining steps: [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html)
* Regular expressions ("regex")
* Motivating example: [data](data/homicides.txt), [code](code/19_regex_exercise.py)
* Reference guide: [code](code/19_regex_reference.py)
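Tying the scikit-learn items together, here's a minimal sketch that chains StandardScaler and KNN in a Pipeline and tunes K with GridSearchCV:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()

# chain scaling and KNN so each cross-validation fold re-fits the scaler
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# 'knn__n_neighbors' targets the n_neighbors parameter of the 'knn' step
grid = GridSearchCV(pipe, {'knn__n_neighbors': list(range(1, 31))}, cv=5)
grid.fit(iris.data, iris.target)
print(grid.best_params_, grid.best_score_)
```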
**Optional:**
* Use regular expressions to create a list of causes from the homicide data. Your list should look like this: `['shooting', 'shooting', 'blunt force', ...]`. If the cause is not listed for a particular homicide, include it in the list as `'unknown'`.
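A minimal starting sketch for this exercise, assuming each record marks the cause with the text `Cause:` (check the actual file for the exact format):

```python
import re

causes = []
with open('data/homicides.txt') as f:
    for record in f:
        # capture the words following 'Cause:', case-insensitively
        match = re.search(r'Cause: ?([a-z ]+)', record, flags=re.IGNORECASE)
        causes.append(match.group(1).strip().lower() if match else 'unknown')

print(causes[:5])
```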
**Resources:**
* scikit-learn has an incredibly active [mailing list](https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/index.html) that is often much more useful than Stack Overflow for researching a particular function.
* The scikit-learn documentation includes a [machine learning map](http://scikit-learn.org/stable/tutorial/machine_learning_map/) that may help you to choose the "best" model for your task.
* If you want to build upon the regex material presented in today's class, Google's Python Class includes an excellent [lesson](https://developers.google.com/edu/python/regular-expressions) (with an associated [video](https://www.youtube.com/watch?v=kWyoYtvJpe4&index=4&list=PL5-da3qGB5IA5NwDxcEJ5dvt8F9OQP7q5)).
* [regex101](https://regex101.com/#python) is an online tool for testing your regular expressions in real time.
* If you want to go really deep with regular expressions, [RexEgg](http://www.rexegg.com/) includes endless articles and tutorials.
* [Exploring Expressions of Emotions in GitHub Commit Messages](http://geeksta.net/geeklog/exploring-expressions-emotions-github-commit-messages/) is a fun example of how regular expressions can be used for data analysis.
-----
### Class 20: Databases and SQL
* Databases and SQL ([slides](slides/20_sql.pdf) and [code](code/20_sql.py))
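For practice outside of class, Python's built-in `sqlite3` module lets you run SQL against a local database; here's a minimal sketch with invented order data:

```python
import sqlite3

# an in-memory database for illustration; pass a filename to persist it
conn = sqlite3.connect(':memory:')
c = conn.cursor()

c.execute('CREATE TABLE orders (item TEXT, quantity INTEGER, price REAL)')
c.executemany('INSERT INTO orders VALUES (?, ?, ?)',
              [('burrito', 2, 8.50), ('taco', 3, 2.25), ('burrito', 1, 8.50)])

# SELECT with GROUP BY: total revenue per item
c.execute('SELECT item, SUM(quantity * price) FROM orders GROUP BY item')
print(c.fetchall())
conn.close()
```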
**Homework:**
* Read this classic paper, which may help you to connect many of the topics we have studied throughout the course: [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf).
* Your [final project](other/project.md#june-3-project-presentation-2) is due next Wednesday!
* Please submit a link to your project repository before Wednesday's class using the homework submission form.
* Your presentation should start with a recap of the key information from the previous presentation, but you should spend most of your presentation discussing what has happened since then.
* Don't forget to practice your presentation and time yourself!
**Resources:**
* [SQLZOO](http://sqlzoo.net/wiki/SQL_Tutorial), [Mode Analytics](http://sqlschool.modeanalytics.com/), and [Code School](http://campus.codeschool.com/courses/try-sql/contents) all have online SQL tutorials that look promising.
* [w3schools](http://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all) has a sample database that allows you to practice your SQL.
* [10 Easy Steps to a Complete Understanding of SQL](http://tech.pro/tutorial/1555/10-easy-steps-to-a-complete-understanding-of-sql) is a good article for those who have some SQL experience and want to understand it at a deeper level.
* [A Comparison Of Relational Database Management Systems](https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems) gives the pros and cons of SQLite, MySQL, and PostgreSQL.
* If you want to go deeper into databases and SQL, Stanford has a well-respected series of [14 mini-courses](https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/about).
-----
### Class 21: Course Review
* Pipelines ([code](code/19_advanced_sklearn.py))
* Class review
* Creating an ensemble ([code](code/21_ensembles_example.py))
**Resources:**
* [Data science review](https://docs.google.com/document/d/1XCdyrsQwU5OC5os7RHdVTEtS-tpHBbsoKKWLpYI6Svo/edit?usp=sharing): A summary of key concepts from the Data Science course.
* [Comparing supervised learning algorithms](https://docs.google.com/spreadsheets/d/15_QJXm6urctsbIXO-C_eXrsSffbHedio8z0E5ozxO-M/edit?usp=sharing): Kevin's table comparing the machine learning models we studied in the course.
* [Choosing a Machine Learning Classifier](http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/): Edwin Chen's short and highly readable guide.
* [Machine Learning Done Wrong](http://ml.posthaven.com/machine-learning-done-wrong) and [Common Pitfalls in Machine Learning](http://danielnee.com/?p=155): Thoughtful advice on common mistakes to avoid in machine learning.
* [Practical machine learning tricks from the KDD 2011 best industry paper](http://blog.david-andrzejewski.com/machine-learning/practical-machine-learning-tricks-from-the-kdd-2011-best-industry-paper/): More advanced advice than the resources above.
* [An Empirical Comparison of Supervised Learning Algorithms](http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf): Research paper from 2006.
* [Many more resources for continued learning!](other/resources.md)
-----
### Class 22: Project Presentation #2
* Presentations!
**Class is over! What should I do now?**
* Take a break!
* Go back through class notes/code/videos to make sure you feel comfortable with what we've learned.
* Take a look at the **Resources** for each class to get a deeper understanding of what we've learned. Start with the **Resources** from Class 21 and move to topics you are most interested in.
* You might not realize it, but you are at a point where you can continue learning on your own. You have all of the skills necessary to read papers, blogs, documentation, etc.
* GA Data Guild
* [8/24/2015](https://generalassemb.ly/education/data-science-guild/washington-dc/13274)
* [9/21/2015](https://generalassemb.ly/education/data-science-guild/washington-dc/13275)
* [10/19/2015](https://generalassemb.ly/education/data-science-guild/washington-dc/13276)
* [11/9/2015](https://generalassemb.ly/education/data-science-guild/washington-dc/13277)
* Follow data scientists on Twitter. This will help you stay up on the latest news/models/applications/tools.
* Participate in [Data Community DC](http://www.datacommunitydc.org/) events. They sponsor meetups, workshops, and more, notably the [Data Science DC Meetup](http://www.meetup.com/Data-Science-DC/). Sign up for their [newsletter](http://www.datacommunitydc.org/newsletter/) as well!
* Read blogs to keep learning. I really like [District Data Labs](http://districtdatalabs.silvrback.com/).
* Do Kaggle competitions! This is a good way to continue and hone your skillset. Plus, you'll learn a ton along the way.
And finally, don't forget about [graduation](https://generalassemb.ly/education/graduation-april-may-june-courses/washington-dc/12892)!