How well are we working out? / by Usamah Khan


Overview

Nowadays it seems like everyone at the gym or in the park is wearing the coloured band of a FitBit, FuelBand, Apple Watch, Jawbone Up or some other activity tracker. You might very well have one on as you read this. These devices collect huge amounts of data about our personal activity, relatively inexpensively and easily, letting us quantify how much we do to improve our health by measuring ourselves performing activities such as running, swimming and lifting weights. Rarely, however, do we try to quantify how well we perform an activity.

The Human Activity Recognition Project (HAR) aimed to do just that. Accelerometers were placed on the belt, forearm, arm and dumbbell of 6 participants, who were then asked to perform lifts both correctly and incorrectly (using light weights under proper supervision to avoid injury). The goal was to work out how a reference point for how well an exercise is being performed could be determined.

Using the data from these sets of “good” and “bad” workouts, we can create tools that help us keep performing our workouts well and safely. The purpose of this project is to create and contrast Machine Learning models using different packages in R to answer that question. R has many ML libraries available, so determining the pros and cons of each is a useful undertaking.


Loading libraries

Now that the purpose of this project has been outlined, to test different R packages and contrast them while determining a reference for proper workout technique, the first part of the workflow is to load the packages we’ll need and set a random seed for reproducibility. If anyone wants to have a go, these are all the packages you’ll need.

library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(rattle)
library(e1071)

set.seed(12321)

Data


Reading

All information describing the experiment can be found here: http://groupware.les.inf.puc-rio.br/har. The data can be found at the following URLs and downloaded as shown.

fileUrl1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileUrl2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

download.file(fileUrl1, destfile = "./train.csv", method = "curl")
download.file(fileUrl2, destfile = "./test.csv", method = "curl")

train_set <- read.csv("train.csv", header = TRUE, sep = ",", na.strings = c("", "NA"))
test_set <- read.csv("test.csv", header = TRUE, sep = ",", na.strings = c("", "NA"))

dim(train_set)
## [1] 19622   160
dim(test_set)
## [1]  20 160

Pre-processing

The first step in any Machine Learning problem is to pre-process and clean the data.

As far as we can tell the data is clean: the training set has the 160 variables we need and the test set comprises the full 20 observations with the same variables. Looking through the data, however, we can observe many missing values and NAs. This poses an issue, since fitting models on data with missing values usually results in errors.
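To see how systematic these missing values are, a quick check is to tabulate the number of NAs in each column; the result should show that every column is either essentially complete or almost entirely empty.

table(colSums(is.na(train_set)))   # NA count per column: close to 0 or close to 19622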

To deal with this we can either impute the missing values or remove them. Imputation has its own complications, and since the missing values here occur systematically, the best course of action seems to be to remove the columns where a certain threshold of missing values is exceeded. In this case, let’s say 90%.

# Compute the fraction of missing values in each column of the training set
# and record which columns exceed the 90% threshold
na_frac <- numeric(ncol(train_set))
drop_cols <- integer(0)

for (i in 1:ncol(train_set)) {
  na_frac[i] <- sum(is.na(train_set[, i])) / nrow(train_set)
  if (na_frac[i] > 0.9) {
    drop_cols <- c(drop_cols, i)
  }
}

# Drop the flagged columns from both the training and test sets
train_set <- train_set[, -drop_cols]
test_set <- test_set[, -drop_cols]

Finally, since the first seven columns (row index, user name, timestamps and window markers) are only for bookkeeping, they too can be dropped.

train_set <- train_set[,-c(1,2,3,4,5,6,7)]
test_set <- test_set[,-c(1,2,3,4,5,6,7)]

Voila. Now, with the data fully pre-processed, we can begin to fit and play with some models.
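As a quick sanity check, no missing values should remain in the training set, and it should now be down to 53 columns, 52 predictors plus the classe outcome.

sum(is.na(train_set))   # expected to be 0 after dropping the sparse columns
dim(train_set)          # expected 19622 rows and 53 columns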


Building Models


Now that we’re trying to determine an effective model, we can’t very well build it on our test set. To move forward we have to partition the training data into a set for building models and a set for validating them. Any reasonable partition can be used; I generally like to stick to a 60/40 split.

inTrain <- createDataPartition(train_set$classe, p = 0.6, list = FALSE)
train_model <- train_set[inTrain,]
train_validate <- train_set[-inTrain,]

dim(train_model)    
## [1] 11776    53
dim(train_validate) 
## [1] 7846   53

We can now begin using these data sets to create our prediction models. To compare the packages on both effectiveness and efficiency, we record the system time before and after each fit to determine how long the models take to train.
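If you want to avoid repeating the timing boilerplate, it can be wrapped in a small helper like the hypothetical time_fit below; the sections that follow simply call Sys.time() directly instead.

# Hypothetical helper: run a model-fitting call and return the fit plus the elapsed time
time_fit <- function(fit_call) {
  startTime <- Sys.time()
  fit <- fit_call()
  list(fit = fit, runTime = Sys.time() - startTime)
}

# Example: time_fit(function() train(classe ~ ., data = train_model, method = "rpart"))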


Model 1: Decision Trees (caret)

Let’s start by building the first model using decision trees for prediction. A decision tree maps observations of the variables in a data set to conclusions about the target value, in this case the classification of the lift. It is one of the simpler, more basic forms of Machine Learning, but it is also an extremely useful visual tool: leaves show the predicted classification, while branches represent the conjunctions of features, and their strength in the model, that lead to those conclusions.

A simple tree can be built with the caret package by calling train with the rpart method.

startTime <- Sys.time()

modFit <- train(classe ~ ., data = train_model, method = "rpart")

runTime <- Sys.time() - startTime
runTime
## Time difference of 38.73353 secs
pred_1 <- predict(modFit, train_validate)
confusionMatrix(pred_1, train_validate$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2037  627  677  599  206
##          B   28  385   21  229   90
##          C  163  506  670  458  458
##          D    0    0    0    0    0
##          E    4    0    0    0  688
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4818          
##                  95% CI : (0.4707, 0.4929)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3224          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9126  0.25362  0.48977   0.0000  0.47712
## Specificity            0.6243  0.94185  0.75533   1.0000  0.99938
## Pos Pred Value         0.4913  0.51129  0.29712      NaN  0.99422
## Neg Pred Value         0.9473  0.84027  0.87516   0.8361  0.89460
## Prevalence             0.2845  0.19347  0.17436   0.1639  0.18379
## Detection Rate         0.2596  0.04907  0.08539   0.0000  0.08769
## Detection Prevalence   0.5284  0.09597  0.28741   0.0000  0.08820
## Balanced Accuracy      0.7685  0.59773  0.62255   0.5000  0.73825

We can also call on the rattle package to plot the tree and see the flow.

fancyRpartPlot(modFit$finalModel)

The results of the confusion matrix aren’t promising: the statistics show this model is very weak, with only about 48% accuracy and a run time of 38.7 seconds. This was to be expected, since with this many variables it is hard to fit a single tree accurately. This tool comes from the caret package, so we can ask: would another package, attempting the same thing, be as effective?
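For comparing models later, the overall accuracy can also be pulled straight from the confusionMatrix object rather than read off the printout (the variable name below is just for illustration).

# Overall accuracy of the caret/rpart tree on the validation set
acc_tree_caret <- confusionMatrix(pred_1, train_validate$classe)$overall["Accuracy"]
acc_tree_caret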


Model 2: Decision Trees (rpart)

Another package that can build decision trees is the standalone rpart package. Calling and using it works very similarly to train from the caret package.

startTime <- Sys.time()

modFit_rp <- rpart(classe ~ ., data = train_model, method = "class")

runTime <- Sys.time() - startTime
runTime
## Time difference of 2.049898 secs
pred_2 <- predict(modFit_rp, train_validate, type = "class")
confusionMatrix(pred_2, train_validate$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1947  173   45   46   53
##          B  114  946  144  170  256
##          C   53  166 1086  157  135
##          D   93  127   92  818   95
##          E   25  106    1   95  903
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7265          
##                  95% CI : (0.7165, 0.7363)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6539          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8723   0.6232   0.7939   0.6361   0.6262
## Specificity            0.9435   0.8919   0.9211   0.9380   0.9646
## Pos Pred Value         0.8600   0.5804   0.6800   0.6678   0.7991
## Neg Pred Value         0.9489   0.9080   0.9549   0.9293   0.9197
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2482   0.1206   0.1384   0.1043   0.1151
## Detection Prevalence   0.2886   0.2077   0.2035   0.1561   0.1440
## Balanced Accuracy      0.9079   0.7575   0.8575   0.7870   0.7954

Again, we can plot the tree with the rattle package.

fancyRpartPlot(modFit_rp)