https://github.com/kamicollo/practical-ml

Course project for Practical Machine learning class

---
title: "Course project - Practical machine learning"
author: "Aurimas R."
date: "02/17/2015"
output:
  html_document:
    keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, fig.path='figures/')
```

This report presents the process of building a machine learning algorithm that differentiates between correct and incorrect use of a dumbbell, based on data collected from wearable sensors attached to dumbbell users. First, the collected data is described; then, its preprocessing and feature selection are discussed. Three different ML models are built, and the best one is selected using cross-validation techniques. The best model uses the random forests technique and achieves nearly 100% accuracy in out-of-sample testing.

#Data overview

```{r, echo=FALSE, message=FALSE}
source("~/Coursera/predmachlearn-011/project/preprocessing.R")
```

The data used to build the ML algorithms comes from a study by Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. ([Qualitative Activity Recognition of Weight Lifting Exercises](http://groupware.les.inf.puc-rio.br/har)). Note that the original study does not include a codebook, so this report infers the meaning of the variables from their names. The training dataset includes `r dim(o_training)[1]` observations and `r dim(o_training)[2]` variables. The variables can be grouped into the following categories:

* Metadata: observation id ("X"), name of participant ("user_name")
* Time-related metadata: timestamps of observations and variables indicating the sliding window of observations (`r paste0("'", names(o_training)[3:7], "'", collapse=", ")`)
* Sensor data from four sensors attached to the belt, arm, forearm and dumbbell:
    * Six types of data are included: yaw, roll and pitch (3-dimensional axes), acceleration (x/y/z dimensions), magnetometer data (x/y/z dimensions), and gyroscope data (x/y/z dimensions).
    * For each sensor-data combination the following statistics are included: raw data, kurtosis, skewness, maximum, minimum, amplitude, average, standard deviation and variance.
* Class, indicating whether the dumbbell was used correctly (class A indicates correct lifting; classes B-E indicate different types of incorrect lifting; see the original paper for details)

The data is quite balanced, with all classes represented relatively equally (slightly more of class A):
```{r echo=FALSE}
table(o_training$classe)
```

#Preprocessing

Several pre-processing steps were performed:

1) All variables were converted to appropriate types (e.g. numeric / time or date / factors).
2) Variables with more than 50% missing entries were deleted (this removed `r length(missing_features)` variables).
3) Metadata (the first two categories in the data description above) was excluded, as an ML prediction algorithm should be generic and not depend on this data.
4) Zero-variance variables were deleted (defined here as variables where fewer than 5% of values are unique). This removed an additional `r length(zerovar)` variables.
5) A check for linearly related features was performed (none were found).
6) All numeric features were centered at 0 and scaled to a standard deviation of 1.

This resulted in a training dataset with `r dim(training)[1]` observations and `r dim(training)[2]` remaining features. For implementation details, see the [preprocessing R script](preprocessing.R).
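
To make steps 2, 4, 5 and 6 above concrete, here is a minimal base R sketch on a toy data frame. The column names and data are hypothetical, and this is not the code from `preprocessing.R`, which may well use different functions (caret, for instance, offers `nearZeroVar()` and `preProcess()` for similar purposes).

```r
# Toy data standing in for the sensor table (hypothetical columns).
set.seed(1)
df <- data.frame(
  roll      = rnorm(100),
  pitch     = rnorm(100, mean = 5, sd = 2),
  mostly_na = c(rnorm(20), rep(NA, 80)),        # 80% missing
  flag      = rep(c(0, 1), times = c(97, 3))    # only 2 unique values in 100 rows
)

# Step 2: drop variables with more than 50% missing entries.
missing_share <- colMeans(is.na(df))
df <- df[, missing_share <= 0.5, drop = FALSE]

# Step 4: drop variables where fewer than 5% of values are unique.
unique_share <- sapply(df, function(col) length(unique(col)) / length(col))
df <- df[, unique_share >= 0.05, drop = FALSE]

# Step 5: full column rank means no feature is a linear combination of the others.
x <- as.matrix(df)
no_linear_dependence <- qr(x)$rank == ncol(x)

# Step 6: center at 0 and scale to a standard deviation of 1.
x_scaled <- scale(x, center = TRUE, scale = TRUE)
```

In this toy example only `roll` and `pitch` survive the two filters.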

#Analysis performed

Before a model is built, it is important to consider whether all variables should be included as features. However, due to the lack of domain knowledge and of a detailed codebook, the author of this report had to rule out any knowledge-based selection, and an ad-hoc graphical inspection was performed instead. As the density plots indicate, four randomly selected variables all show limited variation between classes. A more detailed review of correlation matrices also did not reveal clear indications for feature selection. Therefore, all variables are included as features in the models.

```{r, echo=FALSE}

featurePlot(x = training[, c(6, 20, 22, 50), with = FALSE],
            y = training$classe,
            plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            adjust = 1.5,
            pch = "|",
            layout = c(4, 1),
            auto.key = list(columns = 3))
```

#Model selection

```{r echo=FALSE, message=FALSE}
source("~/Coursera/predmachlearn-011/project/prediction.R")
```

Model selection was performed as follows:

1) Three different algorithms were considered: classification trees, stochastic gradient boosting and random forests.
2) The training dataset was divided into three parts: a training set, a cross-validation set and a testing set.
    * The training set is used to train the three different models.
    * The cross-validation set is used to determine the best performing algorithm (based on accuracy).
    * The testing set is used to approximate the out-of-sample accuracy.
3) All algorithms were tuned using simple bootstrapping (25 resampling iterations, the caret default). In particular, this means that the classification trees were tuned over 1 parameter, stochastic gradient boosting over 3 parameters and random forests over 1 parameter.
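
The three-way split in step 2 can be sketched in base R as follows. The 60/20/20 proportions, the stratification by class and the toy outcome vector are all assumptions made for illustration; the report does not state how `prediction.R` performs the split.

```r
set.seed(135790)  # arbitrary seed for a reproducible split

# Toy outcome with five classes, mimicking `classe` (hypothetical data).
classe <- factor(sample(LETTERS[1:5], 500, replace = TRUE))

# Assign each row to one of three parts, stratified by class.
# The 60/20/20 proportions are an assumption, not taken from the report.
part <- character(length(classe))
for (rows in split(seq_along(classe), classe)) {
  shuffled <- rows[sample.int(length(rows))]
  n_train <- floor(0.6 * length(rows))
  n_cv    <- floor(0.2 * length(rows))
  part[shuffled[seq_len(n_train)]] <- "train"
  part[shuffled[n_train + seq_len(n_cv)]] <- "cv"
  part[shuffled[-seq_len(n_train + n_cv)]] <- "test"
}

# Each class should appear in every part.
table(part, classe)
```

Stratifying by class keeps the class balance roughly the same in all three parts, which matters when accuracy is the selection criterion.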

The implementation details of the model selection are presented in [a prediction R script](prediction.R). Note that model training is resource-heavy: while the underlying code relies on the `data.table` and `parallel` packages to speed up the process, the whole procedure takes ~1 hour on a modern 8-core laptop with 8 GB of RAM.
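
As an illustration of bootstrap-based tuning in caret, the sketch below fits a classification tree with 25 bootstrap resamples. It uses the built-in `iris` data as a stand-in for the sensor data and assumes caret's `"rpart"` method for the tree; the actual calls in `prediction.R` may differ. Gradient boosting and random forests would follow the same pattern with other `method` values and their respective packages installed.

```r
library(caret)

set.seed(135790)  # arbitrary seed

# 25 bootstrap resamples, written out explicitly (this matches caret's default).
ctrl <- trainControl(method = "boot", number = 25)

# iris stands in for the sensor data; "rpart" tunes one parameter (cp).
fit_rpart <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# The complexity parameter chosen by resampled accuracy.
fit_rpart$bestTune
```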

##Performance of three models

The three models achieved the following accuracy on the training set:
```{r, echo=FALSE}
print(paste("Accuracy on training data - CART: ", percent(cm_tr_rpart$overall[["Accuracy"]])))
print(paste("Accuracy on training data - random forests: ", percent(cm_tr_rf$overall[["Accuracy"]])))
print(paste("Accuracy on training data - Gradient Boosting: ", percent(cm_tr_gbm$overall[["Accuracy"]])))
```

The performance on the cross-validation set was as follows:
```{r, echo=FALSE}
print(paste("Accuracy on CV data - CART: ", percent(cm_cv1_rpart$overall[["Accuracy"]])))
print(paste("Accuracy on CV data - random forests: ", percent(cm_cv1_rf$overall[["Accuracy"]])))
print(paste("Accuracy on CV data - Gradient Boosting: ", percent(cm_cv1_gbm$overall[["Accuracy"]])))
```

Based on the above, we selected random forests as the top performing algorithm. To estimate its out-of-sample accuracy (1 - error rate), we evaluated it on the testing set:

```{r, echo=FALSE}
print(cm_cv2_rf)
```

It appears that overfitting should not be a significant issue, as the model performed nearly perfectly on the testing set, just as it did on the previous datasets. Note that all three models are provided as RData files in the [`data`](data) folder in case the reader would like to try them out independently.
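
For reference, the overall accuracy used throughout this section is the share of observations on the diagonal of the confusion matrix. A minimal base R sketch, using made-up predictions rather than the report's models:

```r
# Hypothetical predictions and true labels for six observations.
truth <- factor(c("A", "A", "B", "C", "D", "E"), levels = LETTERS[1:5])
pred  <- factor(c("A", "B", "B", "C", "D", "E"), levels = LETTERS[1:5])

# Rows are predictions, columns are the reference labels.
cm <- table(Prediction = pred, Reference = truth)

# Accuracy is the share of observations on the diagonal.
accuracy   <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - accuracy
```

Here five of the six predictions match, so `accuracy` is 5/6 and `error_rate` is 1/6.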

#Potential improvements

While the above model performed nearly perfectly, there is room for improvement. In particular, the training time is very high (~30 minutes for the selected model). We therefore investigated whether reducing the number of features in the model could cut down training time.

This was done by selecting only the top 5 features identified as most important by the `varImp()` function.

```{r, echo=FALSE}
imp <- varImp(m_rf$finalModel, scale = TRUE)
# Row indices of the 5 most important features, in decreasing order of importance
n <- order(imp[, 1], decreasing = TRUE)[1:5]
key_features <- rownames(imp)[n]
key_features
```

We then retrained the random forest model using only the top 5 features. The results were as follows:
```{r, echo=FALSE}
tr_selected <- tr_data[, (c(".outcome", key_features)), with=F]
m_rf_s <- train(.outcome ~ ., data=tr_selected, method="rf", trControl=trainControl(seeds=setSeeds(135790)))
trs_pred_rf <- predict(m_rf_s, tr_data)
cm_trs_rf <- confusionMatrix(trs_pred_rf, tr_data$.outcome)
cv1s_pred_rf <- predict(m_rf_s, cv1_data)
cm_cv1s_rf <- confusionMatrix(cv1s_pred_rf, cv1_data$classe)
print(paste("Accuracy on training data - random forests (5-features): ", percent(cm_trs_rf$overall[["Accuracy"]])))
print(paste("Accuracy on CV data - random forests (5-features): ", percent(cm_cv1s_rf$overall[["Accuracy"]])))
```

In conclusion, if the small loss in accuracy is acceptable, significant efficiency gains can be achieved by limiting the features to the top 5. This already indicates that not all sensors are needed, and further analysis could reveal that high accuracy is achievable with two sensors or even just one. This would have significant implications for implementing such monitoring in the real world.