Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jasonmdev/learning-python-predictive-analytics
Tracking, notes and programming snippets while learning predictive analytics
https://github.com/jasonmdev/learning-python-predictive-analytics
dataset linear-regression logistic-regression predictive-analytics python
Last synced: about 4 hours ago
JSON representation
Tracking, notes and programming snippets while learning predictive analytics
- Host: GitHub
- URL: https://github.com/jasonmdev/learning-python-predictive-analytics
- Owner: JasonMDev
- Created: 2016-04-29T08:05:52.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2016-05-23T19:30:17.000Z (over 8 years ago)
- Last Synced: 2024-08-08T23:23:22.464Z (3 months ago)
- Topics: dataset, linear-regression, logistic-regression, predictive-analytics, python
- Language: Python
- Size: 1.86 MB
- Stars: 43
- Watchers: 5
- Forks: 38
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Predictive Analytics with Python
These are my notes from working through the book
[*Learning Predictive Analytics with Python*](https://www.packtpub.com/big-data-and-business-intelligence/learning-predictive-analytics-python)
by [Ashish Kumar](https://in.linkedin.com/in/ashishk64)
and published on Feb 2016.## General
###Chapter 1: Getting Started with Predictive Modelling
- [x] Installed Anaconda Package.
- [x] Python3.5 has been installed.
- [x] Book follows python2, so some codes is modified along the way for python3.###Chapter 2: Data Cleaning
- [x] Reading the data: variations and examples
- [x] Data frames and delimiters.####Case 1: Reading a dataset using the read_csv method
- [x] File: titanicReadCSV.py
- [x] File: titanicReadCSV1.py
- [x] File: readCustomerChurn.py
- [x] File: readCustomerChurn2.py
- [x] File: changeDelimiter.py####Case 2: Reading a dataset using the open method of Python
- [x] File: readDatasetByOpenMethod.py####Case 3: Reading data from a URL
- [x] Modified the code that it works and prints out line by line dictionary of the dataset.
- [x] File: readURLLib2Iris.py
- [x] File: readURLMedals.py####Case 4: Miscellaneous cases
- [x] File: readXLS.py
- [x] Created the file above to read from both .xls an .xlsx####Basics: Summary, dimensions, and structure
- [x] File: basicDataCheck.py
- [x] Created the file above to read from both .xls an .xlsx####Handling missing values
- [x] File: basicDataCheck.py
- [x] RE: Treating missing data like NaN or None
- [x] Deletion orr imputaion####Creating dummy variables
- [x] File: basicDataCheck.py
- [x] Split into new variable 'sex_female' and 'sex_male'
- [x] Remove column 'sex'
- [x] Add both dummy column created above.####Visualizing a dataset by basic plotting
- [x] File: plotData.py
- [x] Figure file: ScatterPlots.jpeg
- [x] Plot Types: Scatterplot, Histograms and boxplots###Chapter 3: Data Wrangling
####Subsetting a dataset
- [x] Selecting Columns
- [x] File: subsetDataset.py
- [x] Selecting Rows
- [x] File: subsetDatasetRows.py
- [x] Selecting a combination of rows and columns
- [x] File: subsetColRows.py
- [x] Creating new columns
- [x] File: subsetNewCol.py####Generating random numbers and their usage
- [x] Various methods for generating random numbers
- [x] File: generateRandomNumbers.py
- [x] Seeding a random number
- [x] File: generateRandomNumbers.py
- [x] Generating random numbers following probability distributions
- [x] File: generateRandomProbDistr.py
- [x] Probability density function: PDF = Prob(X=x)
- [x] Cumulative density function: CDF(x) = Prob(X<=x)
- [x] Uniform distribution: random variables occur with the same (uniform) frequency/probability
- [x] Normal distribution: Bell Curve and most ubiquitous and versatile probability distribution
- [x] Using the Monte-Carlo simulation to find the value of pi
- [x] File: calcPi.py
- [x] Geometry and mathematics behind the calculation of pi
- [x] Generating a dummy data frame
- [x] File: generateDummyDataFrame.py####Grouping the data – aggregation, filtering, and transformation
- [x] File: groupData.py
- [x] Grouping
- [x] Aggregation
- [x] Filtering
- [x] Transformation
- [x] Miscellaneous operations####Random sampling – splitting a dataset in training and testing datasets
- [ ] File: splitDataTrainTest.py
- [x] Method 1: using the Customer Churn Model
- [x] Method 2: using sklearn
- [ ] Method 3: using the shuffle function####Concatenating and appending data
- [x] File: concatenateAndAppend.py
- [x] File: appendManyFiles.py####Merging/joining datasets
- [x] File: mergeJoin.py
- [x] Inner Join
- [x] Left Join
- [x] Right Join
- [x] An example of the Inner Join
- [x] An example of the Left Join
- [x] An example of the Right Join
- [x] Summary of Joins in terms of their length###Chapter 4: Statistical Concepts for Predictive Modelling
####Random sampling and central limit theorem
####Hypothesis testing
- [x] Null versus alternate hypothesis
- [x] Z-statistic and t-statistic
- [x] Confidence intervals, significance levels, and p-values
- [x] Different kinds of hypothesis test
- [x] A step-by-step guide to do a hypothesis test
- [x] An example of a hypothesis test####Chi-square testing
####Correlation
- [x] File: linearRegression.py
- [x] File: linearRegressionFunction.py
- [x] Picture: TVSalesCorrelationPlot.png
- [x] Picture: RadioSalesCorrelationPlot.png
- [x] Picture: NewspaperSalesCorrelationPlot.png###Chapter 5: Linear Regression with Python
####Understanding the maths behind linear regression
- [x] Linear regression using simulated data
- [x] File: linearRegression.py
- [x] Picture: CurrentVsPredicted1.png
- [x] Picture: CurrentVsPredictedVsMean1.png
- [x] Picture: CurrentVsPredictedVsModel1.png####Making sense of result parameters
- [x] File: linearRegression.py
- [x] p-values
- [x] F-statistics
- [x] Residual Standard Error (RSE)####Implementing linear regression with Python
- [x] File: linearRegressionSMF.py
- [x] Linear regression using the statsmodel library
- [x] Multiple linear regression
- [x] Multi-collinearity: sub-optimal performance of the model
- [x] Variance Inflation Factor
- [x] It is a method to quantify the rise in the variability of the coefficient estimate of a particular variable because of high correlation between two or more than two predictor variables.####Model validation
- [x] Training and testing data split
- [x] File: linearRegressionSMF.py
- [x] Linear regression with scikit-learn
- [x] File: linearRegressionSKL.py
- [x] Feature selection with scikit-learn
- [x] Recursive Feature Elimination (RFE)
- [x] File: linearRegressionRFE.py####Handling other issues in linear regression
- [x] Handling categorical variables
- [x] File: linearRegressionECom.py
- [x] Transforming a variable to fit non-linear relations
- [x] File: nonlinearRegression.py
- [x] Picture: MPGVSHorsepower.png
- [x] Picture: MPGVSHorsepowerVsLine.png
- [x] Picture: MPGVSHorsepowerModels.png
- [x] Handling outliers
- [x] Other considerations and assumptions for linear regression###Chapter 6: Logistic Regression with Python
####Linear regression versus logistic regression
####Understanding the math behind logistic regression
- [x] File: logisticRegression.py
- [x] Contingency tables
- [x] Conditional probability
- [x] Odds ratio
- [x] Moving on to logistic regression from linear regression
- [x] Estimation using the Maximum Likelihood Method
- [x] Building the logistic regression model from scratch
- [x] File: logisticRegressionScratch.py
- [ ] Read above again.
- [x] Making sense of logistic regression parameters
- [x] Wald test
- [x] Likelihood Ratio Test statistic
- [x] Chi-square test
- [x]####Implementing logistic regression with Python
- [x] File: logisticRegressionImplementation.py
- [x] Processing the data
- [x] Data exploration
- [x] Data visualization
- [x] Creating dummy variables for categorical variables
- [x] Feature selection
- [x] Implementing the model####Model validation and evaluation
- [x] File: logisticRegressionImplementation.py
- [x] Cross validation####Model validation
- [x] File: logisticRegressionImplementation.py
- [x] The ROC curve {see terms}###Chapter 7: Clustering with Python
####Introduction to clustering – what, why, and how?
- [x] What is clustering?
- [x] How is clustering used?
- [x] Why do we do clustering?####Mathematics behind clustering
- [x] Distances between two observations
- [x] Euclidean distance
- [x] Manhattan distance
- [x] Minkowski distance
- [x] The distance matrix
- [x] Normalizing the distances
- [x] Linkage methods
- [x] Single linkage
- [x] Compete linkage
- [x] Average linkage
- [x] Centroid linkage
- [x] Ward's method uses ANOVA method
- [x] Hierarchical clustering
- [x] K-means clustering
- [x] File: kMeanClustering.py####Implementing clustering using Python
- [x] File: clusterWine.py
- [x] Importing and exploring the dataset
- [x] Normalizing the values in the dataset
- [x] Hierarchical clustering using scikit-learn
- [x] K-Means clustering using scikit-learn
- [x] Interpreting the cluster####Fine-tuning the clustering
- [x] The elbow method
- [x] Silhouette Coefficient###Chapter 8: Trees and Random Forests with Python
####Introducing decision trees
- [x] A decision tree####Understanding the mathematics behind decision trees
- [x] Homogeneity
- [x] Entropy
- [x] Information gain
- [x] ID3 algorithm to create a decision tree
- [x] Gini index
- [x] Reduction in Variance
- [x] Pruning a tree
- [x] Handling a continuous numerical variable
- [x] Handling a missing value of an attribute####Implementing a decision tree with scikit-learn
- [x] File: decisionTreeIris.py
- [x] Visualizing the tree
- [x] Picture: dtree2.png
- [x] File: dtree2.dot
- [x] Cross-validating and pruning the decision tree####Understanding and implementing regression trees
- [x] File: regressionTree.py
- [x] Regression tree algorithm
- [x] Implementing a regression tree using Python####Understanding and implementing random forests
- [x] File: randomForest.py
- [x] The random forest algorithm
- [x] Implementing a random forest using Python
- [x] Why do random forests work?
- [x] Important parameters for random forests###Chapter 9: Best Practices for Predictive Modelling
####Best practices for coding
- [x] Commenting the codes
- [x] Defining functions for substantial individual tasks
- [x] Example 1
- [x] Example 2
- [x] Example 3
- [x] Avoid hard-coding of variables as much as possible
- [x] Version control
- [x] Using standard libraries, methods, and formulas####Best practices for data handling
####Best practices for algorithms
####Best practices for statistics
####Best practices for business contexts