knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(caret)
library(Hmisc)
library(GGally)
library(e1071)
load("variables.RData") ## because machine learning models are time consuming!
my_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
my_url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

download.file(my_url, "training")
download.file(my_url2, "testing")


testing <- read.csv("testing")
training <- read.csv("training")

Introduction

In this paper we analyse the Weight Lifting Exercise (WLE) dataset from Groupware@LES. The data were collected from weight lifters performing an exercise in 5 different ways (encoded in the variable “classe”), with measurements taken by motion sensors on the arm, the forearm, the belt (waist), and the dumbbell.

Data Preparation

We first prepare the data for analysis by creating training, validation, and test sets. The data set is of medium size, with roughly 20,000 observations. The goal of this analysis is to produce a good estimate of the out-of-sample error rate for 20 held-out samples, per the requirements of Johns Hopkins University’s Practical Machine Learning course. So while we treat the test set produced here as a truly untouched sample of the data, we will not be excessively concerned with squeezing out performance on it, and it does not need to be huge. We therefore split the data into 80% training (further split into 60% training and 20% validation) and 20% testing.

set.seed(1233)
fortest <- createDataPartition(training$classe,p=.2,list=FALSE)
test <- training[fortest,]


pretrain <- training[-fortest,]
  
forvalidation <- createDataPartition(pretrain$classe,p=.25,list=FALSE) # 0.25 * 0.80 = 0.20 of the full data
train <- pretrain[-forvalidation,]
validation <- pretrain[forvalidation,]
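
As a quick sanity check (not part of the original chunk), the resulting partitions should come out close to the intended 60/20/20 proportions:

# Fraction of the full data in each partition.
round(c(train = nrow(train),
        validation = nrow(validation),
        test = nrow(test)) / nrow(training), 2)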

Exploratory Data Analysis

We explore the pretrain object, which is our training set before the train/validation split. It has 15,695 observations and 160 variables. From the user_name field we see that 6 individuals participated in producing the data, each repeating the exercise in the 5 different forms. Some fields will not concern us, including the timestamps and the “window” variables, which carry little useful variation.
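
These statements can be checked directly; a short sketch using caret's nearZeroVar diagnostic (the column names follow the WLE conventions):

dim(pretrain)                # roughly 15,695 rows and 160 columns
table(pretrain$user_name)    # the 6 participants
# Variance diagnostics for the timestamp/"window" fields set aside above.
nzv <- nearZeroVar(pretrain, saveMetrics = TRUE)
nzv[grep("window|timestamp", rownames(nzv)), ]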

One immediate observation is that the variables for each sensor include roll, pitch, yaw, and total acceleration, along with many other components: directional (x/y/z) readings and descriptive statistics summarising them. Most of the descriptive statistics are not tabulated for every observation, so they can be thrown away automatically (in this case, the mean, standard deviation, kurtosis, and skewness are not sufficient statistics for our class probabilities in the classification models below).
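
As a concrete illustration of that automatic clean-up, the summary-statistic columns can be dropped by name prefix (a sketch assuming the usual WLE prefixes kurtosis_, skewness_, max_, min_, avg_, stddev_, var_, and amplitude_):

# These columns hold per-window summary statistics and are mostly NA/blank.
summary_cols <- grep("^(kurtosis|skewness|max|min|avg|stddev|var|amplitude)_",
                     names(pretrain))
pretrain_reduced <- pretrain[, -summary_cols]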

If we look at a table of the outcome variable, we notice that class A is over-represented, which may need to be handled by penalizing the fit on that class.
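
The counts below come from a simple frequency table of classe on the pretrain set; a sketch of the chunk that produces them (the original chunk is not shown):

print("Number of Observations Per Workout Form")
table(pretrain$classe)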

## [1] "Number of Observations Per Workout Form"
## 
##    A    B    C    D    E 
## 4464 3037 2737 2572 2885

Here are some Very Hungry Caterpillar-esque graphs relating user and exercise form to a few of the various movements.
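
The ggpairs calls below reference a smaller object that is never defined in the chunks shown. A plausible reconstruction, assumed here, is a modest random subsample of pretrain restricted to classe, user_name, and the roll/pitch/yaw columns, which keeps ggpairs responsive (the exact definition in the saved workspace may differ):

# Hypothetical reconstruction of `smaller` for the pairs plots.
set.seed(42)
smaller <- pretrain %>%
  select(classe, user_name, matches("^(roll|pitch|yaw)_")) %>%
  sample_n(1000)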

ggpairs(data=smaller, mapping=aes(color=classe, alpha=.5),columns=c(2:5))

ggpairs(data=smaller, mapping=aes(color=classe),columns=c(6:9))

ggpairs(data=smaller, mapping=aes(color=classe),columns=c(10:13))

ggplot(data=pretrain, aes(x=total_accel_arm, y=total_accel_dumbbell)) +
    geom_jitter(aes(colour=raw_timestamp_part_1)) + facet_grid(classe~user_name)

Model Selection and Feature Selection

One initial caveat is that we should not use user_name to predict new data. If we include user_name, the value of our model drops sharply because it cannot generalize to people besides the 6 in the data set. If we eventually turn this model into a data product, we cannot ask users to specify their names and hope that something about the name gives us predictive power. Within our data set, however, the name gives us extraordinary predictive power. Take a gander at the graphs above for roll_forearm and pitch_belt - every user has a characteristic range for how they move the weight around. Fortunately, the variance of these movements is tight enough to provide sufficient information without user names.

Baseline Model

Our baseline model uses the chosen features and the defaults for the random forests algorithm.

modelFitInitail <- train(classe~roll_belt+pitch_belt+yaw_belt+total_accel_belt+
                           roll_arm+pitch_arm+yaw_arm+total_accel_arm+
                           roll_dumbbell+pitch_dumbbell+yaw_dumbbell+total_accel_dumbbell+
                           roll_forearm+pitch_forearm+yaw_forearm+total_accel_forearm, method='rf',
                         data=train)
predsMFI <- predict(modelFitInitail,validation)
x <- confusionMatrix(predsMFI,validation$classe)

Our in-sample accuracy is

predsMFIIS <- predict(modelFitInitail,training)
confusionMatrix(predsMFIIS,training$classe)$overall[1]
##  Accuracy 
## 0.9934257

while our accuracy on the validation set (our out-of-sample estimate) is

print(x$overall[1])
##  Accuracy 
## 0.9844626

One interesting note on the model above: when user_name is added as a predictor, variable importance ranks the user dummies at the very bottom. We do not lose much accuracy by leaving them out, and we already have a pretty good model.
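
That importance ranking can be inspected with caret's varImp; the check below shows which movement variables drive the baseline fit (for the point about user names, the same call would be run on a refit that also includes user_name):

# Variable importance for the baseline random forest.
varImp(modelFitInitail)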

Parameter search for Random Forests, Boosting, and SVM Models

SVM

I have not had much luck with SVMs in the past, but I include one here since it was covered in class. Tuning with e1071's tune() is quite different from caret's built-in tuning.

predictors <- subset(train,select = c(roll_belt,pitch_belt,yaw_belt,total_accel_belt,
                           roll_arm,pitch_arm,yaw_arm,total_accel_arm,
                           roll_dumbbell,pitch_dumbbell,yaw_dumbbell,total_accel_dumbbell,
                           roll_forearm,pitch_forearm,yaw_forearm,total_accel_forearm))
outcome <- subset(train,select= c(classe))


svm_tune <- tune(svm, train.x=predictors, train.y=outcome$classe, 
              kernel="radial", ranges=list(cost=c(1,50,100), gamma=(c(.5,1,1.5))))
predictors.validation <- subset(validation,select = c(roll_belt,pitch_belt,yaw_belt,total_accel_belt,
                           roll_arm,pitch_arm,yaw_arm,total_accel_arm,
                           roll_dumbbell,pitch_dumbbell,yaw_dumbbell,total_accel_dumbbell,
                           roll_forearm,pitch_forearm,yaw_forearm,total_accel_forearm))
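
The cost and gamma used in the final SVM below were presumably chosen from this tuning run; e1071's tune() object exposes the best combination and its cross-validated error (a check not shown in the original):

svm_tune$best.parameters   # best cost/gamma combination from the grid search
svm_tune$best.performance  # cross-validated error of that combination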

svm_final <- svm(predictors,outcome$classe,cost=100,gamma=.5,kernel="radial")
svmpreds <- predict(svm_final, predictors.validation)
svmaccuracy <- predict(svm_final, predictors)
confusionMatrix(svmpreds,validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1107    9    2    4    1
##          B    5  736   13    4    8
##          C    1   11  659   17    3
##          D    0    1    9  612    5
##          E    3    3    2    6  705
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9727          
##                  95% CI : (0.9672, 0.9776)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9655          
##  Mcnemar's Test P-Value : 0.1987          
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9919   0.9684   0.9620   0.9518   0.9765
## Specificity            0.9943   0.9905   0.9901   0.9954   0.9956
## Pos Pred Value         0.9858   0.9608   0.9537   0.9761   0.9805
## Neg Pred Value         0.9968   0.9924   0.9920   0.9906   0.9947
## Prevalence             0.2843   0.1936   0.1745   0.1638   0.1839
## Detection Rate         0.2820   0.1875   0.1679   0.1559   0.1796
## Detection Prevalence   0.2860   0.1951   0.1760   0.1597   0.1831
## Balanced Accuracy      0.9931   0.9795   0.9761   0.9736   0.9860
confusionMatrix(svmaccuracy,train$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3348    0    0    0    0
##          B    0 2274    1    0    0
##          C    0    3 2048   13    0
##          D    0    0    3 1914    2
##          E    0    0    0    2 2161
## 
## Overall Statistics
##                                          
##                Accuracy : 0.998          
##                  95% CI : (0.997, 0.9987)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9974         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9987   0.9981   0.9922   0.9991
## Specificity            1.0000   0.9999   0.9984   0.9995   0.9998
## Pos Pred Value         1.0000   0.9996   0.9922   0.9974   0.9991
## Neg Pred Value         1.0000   0.9997   0.9996   0.9985   0.9998
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1932   0.1740   0.1626   0.1836
## Detection Prevalence   0.2845   0.1933   0.1754   0.1631   0.1838
## Balanced Accuracy      1.0000   0.9993   0.9982   0.9959   0.9994

Although Random Forests, SVM, and Boosting were all very accurate, I turn to Random Forests for the final model, since they take the least time to fit.

Estimating the Out-of-Sample Accuracy

We now have a model that appears to be well validated, and we want to form an estimate of its accuracy on the test set.

I proceed with a random 5-fold cross-validation of the final model specification on the “pretrain” object, the 80% of the data that was used to train and validate the models above.

kfold<- createFolds(pretrain$classe,5)

# initialize variables ('metric' and 'control' are not defined in the chunks shown;
# they are presumably loaded from variables.RData along with the fitted models)
results <- data.frame(matrix(ncol=7,nrow=5))
trainfinal <- data.frame()
validationfinal <- data.frame()
grid2 <- expand.grid(.mtry=c(4))

for(i in 1:5) {
  trainfinal <- pretrain[-kfold[[i]],]
  validationfinal <- pretrain[kfold[[i]],]
  
  ## build model on the four training folds
  rf_random2 <- train(classe~roll_belt+pitch_belt+yaw_belt+total_accel_belt+
                           roll_arm+pitch_arm+yaw_arm+total_accel_arm+
                           roll_dumbbell+pitch_dumbbell+yaw_dumbbell+total_accel_dumbbell+
                           roll_forearm+pitch_forearm+yaw_forearm+total_accel_forearm, data=trainfinal, 
                           method="rf", metric=metric, tuneGrid=grid2, trControl=control)
  
  ## evaluate on the held-out fold
  predfinal <- predict(rf_random2,validationfinal)
  results[i,] <- confusionMatrix(predfinal,validationfinal$classe)$overall
}
print(results[,1])
## [1] 0.9837528 0.9910800 0.9875757 0.9869344 0.9882166

The values above are the five fold-wise estimates of our model’s out-of-sample accuracy. When we compare them to the final accuracy, given by the confusion matrix below, we see that the test-set accuracy was actually slightly higher, but still within 0.3% of the cross-validated estimates. Averaging the folds gives an expected out-of-sample error rate of a little over 1%.
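
Making the averaging explicit (using the results object from the loop above):

mean(results[, 1])      # average cross-validated accuracy, roughly 0.988
1 - mean(results[, 1])  # implied out-of-sample error rate, roughly 1.2%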

rf_random_final <- train(classe~roll_belt+pitch_belt+yaw_belt+total_accel_belt+
                           roll_arm+pitch_arm+yaw_arm+total_accel_arm+
                           roll_dumbbell+pitch_dumbbell+yaw_dumbbell+total_accel_dumbbell+
                           roll_forearm+pitch_forearm+yaw_forearm+total_accel_forearm, data=pretrain, 
                           method="rf", metric=metric, tuneGrid=grid2, trControl=control)
predtest <- predict(rf_random_final,test)
confusionMatrix(predtest,test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1113    3    0    0    0
##          B    3  741    2    1    2
##          C    0   12  679    6    2
##          D    0    4    4  637    0
##          E    0    0    0    0  718
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9901          
##                  95% CI : (0.9864, 0.9929)
##     No Information Rate : 0.2842          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9874          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9973   0.9750   0.9912   0.9891   0.9945
## Specificity            0.9989   0.9975   0.9938   0.9976   1.0000
## Pos Pred Value         0.9973   0.9893   0.9714   0.9876   1.0000
## Neg Pred Value         0.9989   0.9940   0.9981   0.9979   0.9988
## Prevalence             0.2842   0.1935   0.1744   0.1640   0.1839
## Detection Rate         0.2834   0.1887   0.1729   0.1622   0.1828
## Detection Prevalence   0.2842   0.1907   0.1780   0.1642   0.1828
## Balanced Accuracy      0.9981   0.9862   0.9925   0.9933   0.9972
predholdout <- predict(rf_random_final,testing)

Our final model does well, with about 99% accuracy on the test set. This is in line with the cross-validated estimate of the out-of-sample error rate above.

Conclusion

With little effort, we were able to fit 3 strong models (random forests, boosting, and SVM) to the WLE dataset. Models like these could be built into wearables such as Fitbit in the future to give users real-time feedback on form issues. Further extensions would include more users, more forms, and different types of exercise. Integration with apps such as MyFitnessPal could let users upload their own ‘good’ and ‘bad’ examples of exercise form, building a large shared data set from which the exercise community would benefit.