## DAT4 Course Repository
Course materials for [General Assembly's Data Science course](https://generalassemb.ly/education/data-science/washington-dc/) in Washington, DC (12/15/14 - 3/16/15).
**Instructors:** Sinan Ozdemir and Kevin Markham ([Data School blog](http://www.dataschool.io/), [email newsletter](http://www.dataschool.io/subscribe/), [YouTube channel](https://www.youtube.com/user/dataschool))
**Teaching Assistant:** Brandon Burroughs
**Office hours:** 1-3pm on Saturday and Sunday ([Starbucks at 15th & K](http://www.yelp.com/biz/starbucks-washington-15)), 5:15-6:30pm on Monday (GA)
**[Course Project information](project.md)**
Monday | Wednesday
--- | ---
12/15: [Introduction](#class-1-introduction) | 12/17: [Python](#class-2-python)
12/22: [Getting Data](#class-3-getting-data) | 12/24: *No Class*
12/29: *No Class* | 12/31: *No Class*
1/5: [Git and GitHub](#class-4-git-and-github) | 1/7: [Pandas](#class-5-pandas)<br>**Milestone:** Question and Data Set
1/12: [Numpy, Machine Learning, KNN](#class-6-numpy-machine-learning-knn) | 1/14: [scikit-learn, Model Evaluation Procedures](#class-7-scikit-learn-model-evaluation-procedures)
1/19: *No Class* | 1/21: [Linear Regression](#class-8-linear-regression)
1/26: [Logistic Regression, Preview of Other Models](#class-9-logistic-regression-preview-of-other-models) | 1/28: [Model Evaluation Metrics](#class-10-model-evaluation-metrics)<br>**Milestone:** Data Exploration and Analysis Plan
2/2: [Working a Data Problem](#class-11-working-a-data-problem) | 2/4: [Clustering and Visualization](#class-12-clustering-and-visualization)<br>**Milestone:** Deadline for Topic Changes
2/9: [Naive Bayes](#class-13-naive-bayes) | 2/11: [Natural Language Processing](#class-14-natural-language-processing)
2/16: *No Class* | 2/18: [Decision Trees](#class-15-decision-trees)<br>**Milestone:** First Draft
2/23: [Ensembling](#class-16-ensembling) | 2/25: [Databases and MapReduce](#class-17-databases-and-mapreduce)
3/2: [Recommenders](#class-18-recommenders) | 3/4: [Advanced scikit-learn](#class-19-advanced-scikit-learn)<br>**Milestone:** Second Draft (Optional)
3/9: [Course Review](#class-20-course-review) | 3/11: [Project Presentations](#class-21-project-presentations)
3/16: [Project Presentations](#class-22-project-presentations) |
### Installation and Setup
* Install the [Anaconda distribution](http://continuum.io/downloads) of Python 2.7.x.
* Install [Git](http://git-scm.com/book/en/v2/Getting-Started-Installing-Git) and create a [GitHub](https://github.com/) account.
* Once you receive an email invitation from [Slack](https://slack.com/), join our "DAT4 team" and add your photo!
### Class 1: Introduction
* Introduction to General Assembly
* Course overview: our philosophy and expectations ([slides](slides/01_course_overview.pdf))
* Data science overview ([slides](slides/01_intro_to_data_science.pdf))
* Tools: check for proper setup of Anaconda, overview of Slack
**Homework:**
* Resolve any installation issues before next class.
**Optional:**
* Review the [code](code/00_python_refresher.py) from Saturday's Python refresher for a recap of some Python basics.
* Read [Analyzing the Analyzers](http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf) for a useful look at the different types of data scientists.
* Subscribe to the [Data Community DC newsletter](http://www.datacommunitydc.org/thenewsletter/) or check out their [event calendar](http://www.datacommunitydc.org/calendar) to become acquainted with the local data community.
### Class 2: Python
* Brief overview of Python environments: Python interpreter, IPython interpreter, Spyder
* Python quiz ([solution](code/02_python_quiz_solution.py))
* Working with data in Python
* Obtain data from a [public data source](public_data.md)
* [FiveThirtyEight alcohol data](https://github.com/fivethirtyeight/data/tree/master/alcohol-consumption), and [revised data](data/drinks.csv) (continent column added)
* Reading and writing files in Python ([code](code/02_file_io.py))
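
For a taste of the file I/O covered in the class code, here is a minimal sketch of reading and writing a text file; `example.txt` and `output.txt` are placeholder names, not files in this repo:

```python
# Read all lines from a file, then write an uppercased copy.
# 'example.txt' is a placeholder; substitute any text file you have.
with open('example.txt') as f:          # the file is closed automatically
    lines = [line.rstrip('\n') for line in f]

with open('output.txt', 'w') as f:      # 'w' mode overwrites an existing file
    for line in lines:
        f.write(line.upper() + '\n')
```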
**Homework:**
* [Python exercise](code/02_file_io_homework.py) ([solution](code/02_file_io_homework_solution.py))
* Read through the [project page](project.md) in detail.
* Review a few [projects from past Data Science courses](https://github.com/justmarkham/DAT-project-examples) to get a sense of the variety and scope of student projects.
* Check for proper setup of Git by running `git clone https://github.com/justmarkham/DAT-project-examples.git`
**Optional:**
* If you need more practice with Python, review the "Python Overview" section of [A Crash Course in Python](http://nbviewer.ipython.org/gist/rpmuller/5920182), work through some of [Codecademy's Python course](http://www.codecademy.com/en/tracks/python), or work through [Google's Python Class](https://developers.google.com/edu/python/) and its exercises.
* For more project inspiration, browse the [student projects](http://cs229.stanford.edu/projects2013.html) from Andrew Ng's [Machine Learning course](http://cs229.stanford.edu/) at Stanford.
**Resources:**
* [Online Python Tutor](http://pythontutor.com/) is useful for visualizing (and debugging) your code.
### Class 3: Getting Data
* Checking your homework
* Regular expressions, web scraping, APIs ([slides](slides/03_getting_data.pdf), [regex code](code/03_re_example.py), [web scraping and API code](code/03_getting_data.py))
* Any questions about the course project?
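
For a minimal taste of the regex workflow from the slides (a sketch, not the class code itself), here is how the `re` module extracts structured pieces from raw text:

```python
import re

# Pull the area code out of each phone number in a block of text.
text = 'Call 202-555-0171 or 703-555-0199 for details.'
pattern = re.compile(r'(\d{3})-\d{3}-\d{4}')
print(pattern.findall(text))   # ['202', '703'] -- one match per number
```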
**Homework:**
* Think about your project question, and start looking for data that will help you to answer your question.
* Prepare for our next class on Git and GitHub:
* You'll need to know some command line basics, so please work through GA's excellent [command line tutorial](http://generalassembly.github.io/prework/command-line/#/) and then take this brief [quiz](https://gahub.typeform.com/to/J6xirf).
* Check for proper setup of Git by running `git clone https://github.com/justmarkham/DAT-project-examples.git`. If that doesn't work, you probably need to [install Git](http://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
* Create a [GitHub account](https://github.com/). (You don't need to download anything from GitHub.)
**Optional:**
* If you aren't feeling comfortable with the Python we've done so far, keep practicing using the resources above!
**Resources:**
* [regex101](https://regex101.com/#python) is an excellent tool for testing your regular expressions. For learning more regular expressions, Google's Python Class includes an [excellent regex lesson](https://developers.google.com/edu/python/regular-expressions) (which includes a [video](http://www.youtube.com/watch?v=kWyoYtvJpe4)).
* [Mashape](https://www.mashape.com/explore) and [Apigee](https://apigee.com/providers) allow you to explore tons of different APIs. Alternatively, a [Python API wrapper](http://www.pythonforbeginners.com/api/list-of-python-apis) is available for many popular APIs.
### Class 4: Git and GitHub
* Special guest: Nick DePrey presenting his class project from DAT2
* Git and GitHub ([slides](slides/04_git_github.pdf))
**Homework:**
* Project milestone: Submit your [question and data set](project.md) to your folder in [DAT4-students](https://github.com/justmarkham/DAT4-students) before class on Wednesday! (This is a great opportunity to practice writing Markdown and creating a pull request.)
**Optional:**
* Clone this repo (DAT4) for easy access to the course files.
**Resources:**
* Read the first two chapters of [Pro Git](http://git-scm.com/book/en/v2) to gain a much deeper understanding of version control and basic Git commands.
* [GitRef](http://gitref.org/) is an excellent reference guide for Git commands.
* [Git quick reference for beginners](http://www.dataschool.io/git-quick-reference-for-beginners/) is a shorter reference guide with commands grouped by workflow.
* The [Markdown Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) covers standard Markdown and a bit of "[GitHub Flavored Markdown](https://help.github.com/articles/github-flavored-markdown/)."
### Class 5: Pandas
* Pandas for data exploration, analysis, and visualization ([code](code/05_pandas.py))
* [Split-Apply-Combine](http://i.imgur.com/yjNkiwL.png) pattern
* Simple examples of [joins in Pandas](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/#joining)
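
Here is a minimal sketch of the Split-Apply-Combine pattern in Pandas, using the `drinks.csv` file introduced in Class 2 (run it from the root of this repo):

```python
import pandas as pd

# Split the rows by continent, apply a mean to each group, combine the results.
drinks = pd.read_csv('data/drinks.csv')
print(drinks.head())                                      # quick exploration
print(drinks.groupby('continent').beer_servings.mean())   # split-apply-combine
```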
**Homework:**
* [Pandas homework](homework/05_pandas.md)
**Optional:**
* To learn more Pandas, review this [three-part tutorial](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/), or review these three excellent (but extremely long) notebooks on Pandas: [introduction](http://nbviewer.ipython.org/urls/raw.github.com/fonnesbeck/Bios366/master/notebooks/Section2_5-Introduction-to-Pandas.ipynb), [data wrangling](http://nbviewer.ipython.org/urls/raw.github.com/fonnesbeck/Bios366/master/notebooks/Section2_6-Data-Wrangling-with-Pandas.ipynb), and [plotting](http://nbviewer.ipython.org/urls/raw.github.com/fonnesbeck/Bios366/master/notebooks/Section2_7-Plotting-with-Pandas.ipynb).
**Resources:**
* For more on Pandas plotting, read the [visualization page](http://pandas.pydata.org/pandas-docs/stable/visualization.html) from the official Pandas documentation.
* To learn how to customize your plots further, browse through this [notebook on matplotlib](http://nbviewer.ipython.org/github/fonnesbeck/Bios366/blob/master/notebooks/Section2_4-Matplotlib.ipynb).
* To explore different types of visualizations and when to use them, [Choosing a Good Chart](http://www.extremepresentation.com/uploads/documents/choosing_a_good_chart.pdf) is a handy one-page reference, and Columbia's Data Mining class has an excellent [slide deck](http://www2.research.att.com/~volinsky/DataMining/Columbia2011/Slides/Topic2-EDAViz.ppt).
### Class 6: Numpy, Machine Learning, KNN
* Numpy ([code](code/06_numpy.py))
* "Human learning" with iris data ([code](code/06_iris_prework.py), [solution](code/06_iris_solution.py))
* Machine Learning and K-Nearest Neighbors ([slides](slides/06_ml_knn.pdf))
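
As a sketch of why NumPy matters for KNN, here is the vectorized distance computation at the heart of the algorithm (illustrative numbers, not the class exercise):

```python
import numpy as np

# Euclidean distance from one query point to every stored point, no loops.
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
query = np.array([2.0, 3.0])
distances = np.sqrt(((points - query) ** 2).sum(axis=1))   # broadcasting
print(points[distances.argsort()[:2]])                     # the 2 nearest neighbors
```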
**Homework:**
* Read this excellent article, [Understanding the Bias-Variance Tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html), and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
* In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
* In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
* In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
* How does the choice of K affect model bias? How about variance?
* As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
* Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
* Does a high value for K cause over-fitting or under-fitting?
**Resources:**
* For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/). (It's a free PDF download!)
### Class 7: scikit-learn, Model Evaluation Procedures
* Introduction to scikit-learn with iris data ([code](code/07_sklearn_knn.py))
* Exploring the scikit-learn documentation: [user guide](http://scikit-learn.org/stable/modules/neighbors.html), [module reference](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors), [class documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* Discuss the [article](http://scott.fortmann-roe.com/docs/BiasVariance.html) on the bias-variance tradeoff
* Model evaluation procedures ([slides](slides/07_model_evaluation_procedures.pdf), [code](code/07_model_evaluation_procedures.py))
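
Here is a minimal sketch of the train/test split procedure with KNN on the iris data. (Imports are shown for a modern scikit-learn; course-era versions imported `train_test_split` from `sklearn.cross_validation`.)

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out 30% of the iris data, fit KNN on the rest, score on the holdout.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, knn.predict(X_test)))   # testing accuracy
```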
**Homework:**
* Keep working on your project. Your [data exploration and analysis plan](project.md) is due in two weeks!
**Optional:**
* Practice what we learned in class today!
* If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
* If you don't yet have your project data: Pick a suitable dataset from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets.html), try using KNN for classification, and evaluate your model. The [Glass Identification Data Set](http://archive.ics.uci.edu/ml/datasets/Glass+Identification) is a good one to start with.
* Either way, you can submit your commented code to DAT4-students, and we'll give you feedback.
**Resources:**
* Here's a great [30-second explanation of overfitting](http://www.quora.com/What-is-an-intuitive-explanation-of-overfitting/answer/Jessica-Su).
* For more on today's topics, these videos from Hastie and Tibshirani are useful: [overfitting and train/test split](https://www.youtube.com/watch?v=_2ij6eaaSl0) (14 minutes), [cross-validation](https://www.youtube.com/watch?v=nZAM5OXrktY) (14 minutes). (Note that they use the terminology "validation set" instead of "test set".)
* Alternatively, read section 5.1 (12 pages) of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), which covers the same content as the videos.
* This video from Caltech's machine learning course presents an [excellent, simple example of the bias-variance tradeoff](http://work.caltech.edu/library/081.html) (15 minutes) that may help you to visualize bias and variance.
### Class 8: Linear Regression
* Linear regression ([IPython notebook](http://nbviewer.ipython.org/github/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb))
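
As a companion to the notebook, here is a minimal sketch of fitting a linear regression in scikit-learn on synthetic data where the true relationship (y = 3x + 2) is known, so you can check the learned coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a line to noisy synthetic data generated from y = 3x + 2.
rng = np.random.RandomState(0)
X = rng.rand(50, 1) * 10                # feature matrix: 50 rows, 1 column
y = 3 * X.ravel() + 2 + rng.randn(50)   # true slope 3, intercept 2, plus noise
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)    # should land close to 2 and [3]
```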
**Homework:**
* Keep working on your project. Your [data exploration and analysis plan](project.md) is due next Wednesday!
**Optional:**
* Similar to last class, your optional exercise is to practice what we have been learning in class, either on your project data or on another dataset.
**Resources:**
* To go much more in-depth on linear regression, read Chapter 3 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), from which this lesson was adapted. Alternatively, watch the [related videos](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/) or read my [quick reference guide](http://www.dataschool.io/applying-and-interpreting-linear-regression/) to the key points in that chapter.
* To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on [simple linear regression](http://www.datarobot.com/blog/ordinary-least-squares-in-python/) and [multiple linear regression](http://www.datarobot.com/blog/multiple-regression-using-statsmodels/).
* This [introduction to linear regression](http://people.duke.edu/~rnau/regintro.htm) is much more detailed and mathematically thorough, and includes lots of good advice.
* This is a relatively quick post on the [assumptions of linear regression](http://pareonline.net/getvn.asp?n=2&v=8).
### Class 9: Logistic Regression, Preview of Other Models
* Logistic regression ([slides](slides/09_logistic_regression.pdf), [exercise](code/09_logistic_regression_exercise.py), [solution](code/09_logistic_regression_class.py))
* Preview of other models
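
Here is a minimal sketch of what distinguishes logistic regression in practice: it outputs predicted probabilities, not just class labels. (Iris data for illustration, not the class exercise; shown for a modern scikit-learn.)

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Logistic regression predicts class probabilities, not just labels.
iris = load_iris()
logreg = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)
print(logreg.predict(iris.data[:3]))         # predicted classes
print(logreg.predict_proba(iris.data[:3]))   # one probability per class, per row
```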
**Resources:**
* For more on logistic regression, watch the [first three videos](https://www.youtube.com/playlist?list=PL5-da3qGB5IC4vaDba5ClatUmFppXLAhE) (30 minutes total) from Chapter 4 of An Introduction to Statistical Learning.
* UCLA's IDRE has a handy table to help you remember the [relationship between probability, odds, and log-odds](http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm).
* Better Explained has a very friendly introduction (with lots of examples) to the [intuition behind "e"](http://betterexplained.com/articles/an-intuitive-guide-to-exponential-functions-e/).
* Here are some useful lecture notes on [interpreting logistic regression coefficients](http://www.unm.edu/~schrader/biostat/bio2/Spr06/lec11.pdf).
### Class 10: Model Evaluation Metrics
* Finishing model evaluation procedures ([slides](slides/07_model_evaluation_procedures.pdf), [code](code/07_model_evaluation_procedures.py))
* Review of test set approach
* Cross-validation
* Model evaluation metrics ([slides](slides/10_model_evaluation_metrics.pdf))
* Regression:
* Root Mean Squared Error ([code](code/10_rmse.py))
* Classification:
* Confusion matrix ([code](code/10_confusion_roc.py))
* ROC curve ([video](https://www.youtube.com/watch?v=OAl6eAyP-yo))
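
Here is a minimal sketch of these metrics on hand-made labels (illustrative values only, not the class data):

```python
import numpy as np
from sklearn import metrics

# Classification metrics on hand-made labels.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
print(metrics.confusion_matrix(y_true, y_pred))   # rows = true, columns = predicted
print(metrics.accuracy_score(y_true, y_pred))

# Regression metric: RMSE is the square root of mean squared error.
actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.5, 5.0, 8.0])
print(np.sqrt(metrics.mean_squared_error(actual, predicted)))
```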
**Homework:**
* [Model evaluation homework](homework/10_model_evaluation.md), due by midnight on Sunday.
    * [Sample solution code](code/10_glass_id_homework_solution.py)
* Watch Kevin's [Kaggle project presentation video](https://www.youtube.com/watch?v=HGr1yQV3Um0) (16 minutes) for an overview of the end-to-end machine learning process, including some aspects that we have not yet covered in class.
* Read this short article on Google's [Smart Autofill](http://googleresearch.blogspot.com/2014/10/smart-autofill-harnessing-predictive.html), and see if you can figure out exactly how the system works.
**Optional:**
* For more on Kaggle, watch [Kaggle Transforms Data Science Into Competitive Sport](https://www.youtube.com/watch?v=8w4UY66GKcM) (28 minutes).
**Resources:**
* scikit-learn has extensive documentation on [model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html).
* The Kaggle wiki has a decent page describing other common [model evaluation metrics](https://www.kaggle.com/wiki/Metrics).
* Kevin wrote a [simple guide to confusion matrix terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) that you can use as a reference guide.
* Kevin's [blog post about the ROC video](http://www.dataschool.io/roc-curves-and-auc-explained/) includes the complete transcript and screenshots, in case you learn better by reading instead of watching.
* Rahul Patwari has two excellent and highly accessible videos on [Sensitivity and Specificity](https://www.youtube.com/watch?v=U4_3fditnWg&list=PL41ckbAGB5S2PavLIXUETzAmi5reIod23) (9 minutes) and [ROC Curves](https://www.youtube.com/watch?v=21Igj5Pr6u4&list=PL41ckbAGB5S2PavLIXUETzAmi5reIod23) (12 minutes).
### Class 11: Working a Data Problem
* Today we will work on a real-world data problem! Our [data](data/ZYX_prices.csv) is seven months of stock data for a fictional company, ZYX, including Twitter sentiment, trading volume, and stock price. Our goal is to create a predictive model of forward returns.
* Project overview ([slides](slides/11_GA_Stocks.pdf))
* Be sure to read the documentation thoroughly and ask questions! We may not have included all of the information you need...
### Class 12: Clustering and Visualization
* Today's [slides](slides/12_clustering.pdf) give our first look at unsupervised learning: K-Means clustering!
* The [code](code/) for today focuses on two main examples:
    * We will investigate simple clustering using the iris data set.
    * We will take a look at a harder example, using Pandora songs as data ([data](data/songs.csv)).
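
Here is a minimal sketch of K-Means on the iris measurements; note that the species labels are never shown to the algorithm, which is what makes this unsupervised:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Cluster the iris measurements into 3 groups, using features only.
iris = load_iris()
km = KMeans(n_clusters=3, random_state=1)
km.fit(iris.data)
print(km.labels_[:10])       # cluster assignment for the first 10 flowers
print(km.cluster_centers_)   # coordinates of the 3 cluster centers
```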
**Homework:**
* Read Paul Graham's [A Plan for Spam](http://www.paulgraham.com/spam.html) and be prepared to **discuss it in class on Monday**. Here are some questions to think about while you read:
* Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
* Before he tried the "statistical approach" to spam filtering, what was his approach?
* How exactly does his statistical filtering system work?
* What did Paul say were some of the benefits of the statistical approach?
* How good was his prediction of the "spam of the future"?
* Below are the foundational topics upon which Monday's class will depend. Please review these materials before class:
* **Confusion matrix:** [Kevin's guide](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) roughly mirrors the lecture from class 10.
* **Sensitivity and specificity:** Rahul Patwari has an [excellent video](https://www.youtube.com/watch?v=U4_3fditnWg&list=PL41ckbAGB5S2PavLIXUETzAmi5reIod23) (9 minutes).
* **Basics of probability:** These [introductory slides](https://docs.google.com/presentation/d/1cM2dVbJgTWMkHoVNmYlB9df6P2H8BrjaqAcZTaLe9dA/edit#slide=id.gfc3caad2_00) (from the [OpenIntro Statistics textbook](https://www.openintro.org/stat/textbook.php)) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
* You should definitely be working on your project! **Your rough draft is due in two weeks!**
**Resources:**
* [Introduction to Data Mining](http://www-users.cs.umn.edu/~kumar/dmbook/index.php) has a nice [chapter on cluster analysis](http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf).
* The scikit-learn user guide has a nice [section on clustering](http://scikit-learn.org/stable/modules/clustering.html).
### Class 13: Naive Bayes
* Briefly discuss [A Plan for Spam](http://www.paulgraham.com/spam.html)
* Probability and Bayes' theorem
* [Slides](slides/13_naive_bayes.pdf) part 1
* [Visualization of conditional probability](http://setosa.io/conditional/)
* Applying Bayes' theorem to iris classification ([code](code/13_bayes_iris.py))
* Naive Bayes classification
* [Slides](slides/13_naive_bayes.pdf) part 2
* Example with spam email
* [Airport security example](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt)
* Naive Bayes classification in scikit-learn ([code](code/13_naive_bayes.py))
* Data set: [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)
* scikit-learn documentation: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html)
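
Here is a minimal sketch of the text-classification workflow above, with made-up messages standing in for the SMS data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn text into a document-term matrix of token counts, then fit
# multinomial Naive Bayes. The messages and labels are made up.
train_text = ['win cash now', 'cheap meds now', 'meeting at noon', 'lunch tomorrow?']
train_labels = [1, 1, 0, 0]                  # 1 = spam, 0 = ham
vect = CountVectorizer()
X_train = vect.fit_transform(train_text)     # learn vocabulary, count tokens
nb = MultinomialNB().fit(X_train, train_labels)
print(nb.predict(vect.transform(['win cheap cash'])))   # most likely [1]
```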
**Resources:**
* The first part of the slides was adapted from [Visualizing Bayes' theorem](http://oscarbonilla.com/2009/05/visualizing-bayes-theorem/), which includes an additional example (using Venn diagrams) of how this applies to testing for breast cancer.
* For an alternative introduction to Bayes' Theorem, [Bayes' Rule for Ducks](https://planspacedotorg.wordpress.com/2014/02/23/bayes-rule-for-ducks/), this [5-minute video on conditional probability](https://www.youtube.com/watch?v=Zxm4Xxvzohk), or these [slides on conditional probability](https://docs.google.com/presentation/d/1psUIyig6OxHQngGEHr3TMkCvhdLInnKnclQoNUr4G4U/edit#slide=id.gfc69f484_00) may be helpful.
* For more details on Naive Bayes classification, Wikipedia has two useful articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has an excellent [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes).
* If you enjoyed Paul Graham's article, you can read [his follow-up article](http://www.paulgraham.com/better.html) on how he improved his spam filter and this [related paper](http://www.merl.com/publications/docs/TR2004-091.pdf) about state-of-the-art spam filtering in 2004.
**Homework:**
* Download all of the NLTK collections.
    * In Python, use the following commands to bring up the download menu: `import nltk`, then `nltk.download()`. Choose "all".
    * Alternatively, just type `nltk.download('all')`.
* Install two new packages: `textblob` and `lda`.
    * Open a terminal or command prompt.
    * Type `pip install textblob` and `pip install lda`.
### Class 14: Natural Language Processing
* Overview of Natural Language Processing ([slides](slides/14_natural_language_processing.pdf))
* Real World Examples
* Natural Language Processing ([code](code/14_nlp_class.py))
* NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, LDA, document summarization
* Alternative: TextBlob
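
Here is a minimal sketch of tokenization and stemming with NLTK (this assumes the NLTK data from the Class 13 homework has been downloaded):

```python
import nltk
from nltk.stem.porter import PorterStemmer

# Tokenize a sentence, then reduce each token to its stem.
sentence = 'The cats are running faster than the dogs ran yesterday.'
tokens = nltk.word_tokenize(sentence)      # requires the 'punkt' NLTK data
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])   # e.g. 'running' -> 'run'
```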
**Resources:**
* [Natural Language Processing with Python](http://www.nltk.org/book/): free online book to go in-depth with NLTK
* [NLP online course](https://www.coursera.org/course/nlp): no sessions are available, but [video lectures](https://class.coursera.org/nlp/lecture) and [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) are still accessible
* [Brief slides](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) on the major task areas of NLP
* [Detailed slides](https://github.com/ga-students/DAT_SF_9/blob/master/16_Text_Mining/DAT9_lec16_Text_Mining.pdf) on a lot of NLP terminology
* [A visual survey of text visualization techniques](http://textvis.lnu.se/): for exploration and inspiration
* [DC Natural Language Processing](http://www.meetup.com/DC-NLP/): active Meetup group
* [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml): suite of tools if you want to get serious about NLP
* Getting started with regex: [Python introductory lesson](https://developers.google.com/edu/python/regular-expressions) and [reference guide](https://github.com/justmarkham/DAT3/blob/master/code/99_regex_reference.py), [real-time regex tester](https://regex101.com/#python), [in-depth tutorials](http://www.rexegg.com/)
* [SpaCy](http://honnibal.github.io/spaCy/): a new NLP package
### Class 15: Decision Trees
* Decision trees ([IPython notebook](http://nbviewer.ipython.org/github/justmarkham/DAT4/blob/master/notebooks/15_decision_trees.ipynb))
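
As a companion to the notebook, here is a minimal sketch of fitting a depth-limited decision tree on iris; capping `max_depth` is one simple guard against overfitting:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a tree no deeper than 3 levels; limiting depth controls overfitting.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(iris.data, iris.target)
print(tree.feature_importances_)   # relative importance of each feature
```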
**Homework:**
* By next Wednesday (before class), review the project drafts of your two assigned peers according to [these guidelines](peer_review.md). You should upload your feedback as a Markdown (or plain text) document to the "reviews" folder of DAT4-students. If your last name is Smith and you are reviewing Jones, you should name your file `smith_reviews_jones.md`.
**Resources:**
* scikit-learn documentation: [Decision Trees](http://scikit-learn.org/stable/modules/tree.html)
**Installing Graphviz (optional):**
* Mac:
* [Download and install PKG file](http://www.graphviz.org/Download_macos.php)
* Windows:
* [Download and install MSI file](http://www.graphviz.org/Download_windows.php)
* Add it to your Path: Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: `C:\Program Files (x86)\Graphviz2.38\bin`
### Class 16: Ensembling
* Ensembling ([IPython notebook](http://nbviewer.ipython.org/github/justmarkham/DAT4/blob/master/notebooks/16_ensembling.ipynb))
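
Here is a minimal sketch of the core ensembling idea: a random forest averages many decision trees grown on bootstrapped samples. (Imports shown for a modern scikit-learn; course-era versions imported `cross_val_score` from `sklearn.cross_validation`.)

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A random forest averages many decision trees, each grown on a
# bootstrapped sample of the rows (and a random subset of features).
iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=1)
print(cross_val_score(rf, iris.data, iris.target, cv=5).mean())
```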
**Resources:**
* scikit-learn documentation: [Ensemble Methods](http://scikit-learn.org/stable/modules/ensemble.html)
* Quora: [How do random forests work in layman's terms?](http://www.quora.com/How-do-random-forests-work-in-laymans-terms/answer/Edwin-Chen-1)
### Class 17: Databases and MapReduce
* Basics of databases ([code](code/17_sql.py))
* MapReduce basics ([slides](slides/17_db_mr.pdf))
* MapReduce example in Python ([code](code/17_map_reduce.py))
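
Here is a minimal sketch of the MapReduce pattern in plain Python, independent of any Hadoop machinery: map each word to a (word, 1) pair, sort by key (the "shuffle" step), then reduce each group by summing:

```python
from itertools import groupby
from operator import itemgetter

# Word count, MapReduce-style, over a tiny made-up corpus.
documents = ['the quick brown fox', 'the lazy dog', 'the quick dog']

mapped = [(word, 1) for doc in documents for word in doc.split()]   # map
mapped.sort(key=itemgetter(0))                                      # shuffle/sort
counts = {key: sum(n for _, n in group)                             # reduce
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)   # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}
```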
**Resources:**
* [Forbes: Is it Time for Hadoop Alternatives?](http://www.forbes.com/sites/johnwebster/2014/12/08/is-it-time-for-hadoop-alternatives/)
* [IBM: What is MapReduce?](http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/)
* [Wakari MapReduce IPython notebook](https://www.wakari.io/sharing/bundle/nkorf/MapReduce%20Example)
* [What Every Data Scientist Needs to Know about SQL](http://joshualande.com/data-science-sql/)
* [Brandon's SQL Bootcamp](https://github.com/brandonmburroughs/sql_bootcamp)
* SQL tutorials from [SQLZOO](http://sqlzoo.net/wiki/Main_Page) and [Mode Analytics](http://sqlschool.modeanalytics.com/)
### Class 18: Recommenders
* Recommendation engines ([slides](slides/18_recommendation_engines.pdf))
* Recommendation engine example ([code](code/18_recommenders_class.py))
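
Here is a minimal sketch of item-based collaborative filtering on a made-up user-by-item ratings matrix: items whose rating columns point in similar directions (high cosine similarity) are candidates to recommend together. (Data and similarity choice are illustrative, not the class example.)

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity between item 0 and every item (including itself).
sims = [cosine(ratings[:, 0], ratings[:, j]) for j in range(ratings.shape[1])]
print(np.round(sims, 2))   # item 1 looks most similar to item 0
```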
**Resources:**
* [The Netflix Prize](http://www.netflixprize.com/)
* [Why Netflix never implemented the winning solution](https://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml)
* [Visualization of the Music Genome Project](http://www.music-map.com/)
* [The People Inside Your Machine](http://www.npr.org/blogs/money/2015/01/30/382657657/episode-600-the-people-inside-your-machine) (23 minutes) is a Planet Money podcast episode about how Amazon Mechanical Turks can assist with recommendation engines (and machine learning in general).
### Class 19: Advanced scikit-learn
* Advanced scikit-learn ([code](code/19_advanced_sklearn.py))
* Searching for optimal parameters: [GridSearchCV](http://scikit-learn.org/stable/modules/grid_search.html)
* Standardization of features: [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* Chaining steps: [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html)
* Regularized regression ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT4/blob/master/notebooks/19_regularization.ipynb)): [Ridge, RidgeCV, Lasso, LassoCV](http://scikit-learn.org/stable/modules/linear_model.html)
* Regularized classification: [LogisticRegression](http://scikit-learn.org/stable/modules/linear_model.html)
* Feature selection: [RFE, RFECV](http://scikit-learn.org/stable/modules/feature_selection.html)
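
Here is a minimal sketch chaining the first three ideas above: standardize the features, fit KNN, and grid-search over K with cross-validation. (Shown for a modern scikit-learn; course-era versions imported `GridSearchCV` from `sklearn.grid_search`.)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features, then fit KNN; search over K with 5-fold CV.
iris = load_iris()
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
param_grid = {'knn__n_neighbors': list(range(1, 31))}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(iris.data, iris.target)
print(grid.best_params_, grid.best_score_)
```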
**Homework:**
* Read this classic paper, which may help you to connect many of the topics we have studied throughout the course: [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf).
**Resources:**
* Here is a longer example of [feature scaling](http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/preprocessing/about_standardization_normalization.ipynb) in scikit-learn, with additional discussion of the types of scaling you can use.
* [Clever Methods of Overfitting](http://hunch.net/?p=22) is a classic post by John Langford.
* [Common Pitfalls in Machine Learning](http://danielnee.com/?p=155) is similar to Langford's post, but broader and a bit more readable.
### Class 20: Course Review
* [Data science review](https://docs.google.com/document/d/1XCdyrsQwU5OC5os7RHdVTEtS-tpHBbsoKKWLpYI6Svo/edit?usp=sharing)
* [Comparing supervised learning algorithms](https://docs.google.com/spreadsheets/d/15_QJXm6urctsbIXO-C_eXrsSffbHedio8z0E5ozxO-M/edit?usp=sharing)
**Resources:**
* [Choosing a Machine Learning Classifier](http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/): Edwin Chen's short and highly readable guide.
* [scikit-learn "machine learning map"](http://scikit-learn.org/stable/tutorial/machine_learning_map/): Their guide for choosing the "right" estimator for your task.
* [Machine Learning Done Wrong](http://ml.posthaven.com/machine-learning-done-wrong): Thoughtful advice on common mistakes to avoid in machine learning.
* [Practical machine learning tricks from the KDD 2011 best industry paper](http://blog.david-andrzejewski.com/machine-learning/practical-machine-learning-tricks-from-the-kdd-2011-best-industry-paper/): More advanced advice than the resources above.
* [An Empirical Comparison of Supervised Learning Algorithms](http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf): Research paper from 2006.
* [Getting in Shape for the Sport of Data Science](https://www.youtube.com/watch?v=kwt6XEh7U3g): 75-minute video of practical tips for machine learning (by the past president of Kaggle).
* [Resources for continued learning!](resources.md)
### Class 21: Project Presentations
### Class 22: Project Presentations