Weightlifting Exercise Prediction

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, I used data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

The training data for this project are available here.

The test data are available here.

The data for this project come from this source.

What was the goal of the project?

The goal of this project is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set. Certain variables had to be excluded for the prediction to be done correctly: for example, it was not appropriate to use the test subject as a predictor, since in general we would not have information mapping who did an exercise to how they did it. The model was validated on a held-out set and correctly predicted all 20 test cases.

Load Libraries

suppressMessages(library(caret))        # machine learning framework (train, createDataPartition, confusionMatrix)
suppressMessages(library(e1071))        # support vector machines (svm)
suppressMessages(library(rpart))        # recursive partitioning trees
suppressMessages(library(randomForest)) # random forests (training is faster than going through caret)

Reproducibility

set.seed(1) # fix the random seed so the partition and model fits are reproducible

Explore the Data

Missing values appear in the raw data in a number of different formats, so we treat empty strings, "NA", and "#DIV/0!" as NA when reading the files.

training <- read.csv("pml-training.csv", na.strings=c("", "NA", "#DIV/0!"), row.names = 1)
testing <- read.csv("pml-testing.csv", na.strings=c("", "NA", "#DIV/0!"), row.names = 1)

If we kept only the complete.cases of the training and test sets, we would eliminate most of the data. Fortunately, the missing values are concentrated in a subset of columns, so we drop those columns instead.

training <- training[,!sapply(training,function(x) any(is.na(x)))]
testing <- testing[,!sapply(testing,function(x) any(is.na(x)))]
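
A quick sanity check confirms that no missing values remain and shows the bookkeeping columns we are about to drop. This is a minimal sketch (output omitted):

sum(complete.cases(training)) == nrow(training) # should be TRUE
sum(complete.cases(testing)) == nrow(testing)   # should be TRUE
names(training)[1:6]                            # identifier columns (subject name, timestamps, windows)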

This check shows that we no longer have any NA values, so we could simply use the remaining features. But if we do, we will run into problems later: the random forest will not know what to do when it encounters the name of a new person, and the subject name is one of these features. We therefore drop information with no practical predictive ability, such as people's names and timestamps, by removing the first 6 columns.

training <- training[,-c(1:6)]
testing <- testing[,-c(1:6)]

My definition of the validation set follows the convention of the Coursera machine learning specialization for Python, which corresponds to the ‘test’ set of the data science specialization.

Partition Data

Now we need to split our training set into a training and validation set. We do so below.

indices <- createDataPartition(y=training$classe, p=0.8, list=FALSE)
Train <- training[indices, ] 
Validation <- training[-indices, ]

Train Data

This is a large data set, so we expect random forests to perform best. It is also cheap to train additional models, so we fit an SVM and an LDA model for comparison; we could try many more, but we limit ourselves to these. Note: training will take a few minutes.

model_forests <- randomForest(classe ~ ., data = Train)      # random forest
model_svm <- svm(classe ~ ., data = Train)                    # support vector machine
model_lda <- train(classe ~ ., data = Train, method = "lda")  # linear discriminant analysis via caret
## Loading required package: MASS
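
As an aside, randomForest also reports a built-in out-of-bag (OOB) error estimate, which acts as a cross-validation-style check without needing a separate hold-out. A minimal sketch for inspecting it (output omitted):

model_forests                   # printing the model shows the OOB error estimate and confusion matrix
tail(model_forests$err.rate, 1) # OOB error rate after the final tree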

Predict

Now that we have our models, let's see how well the random forest does on the validation set.

prediction_forest <- predict(model_forests, Validation, type = "class")
confusionMatrix(prediction_forest, Validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1114    2    0    0    0
##          B    1  756    6    0    0
##          C    0    1  675    6    0
##          D    0    0    3  636    1
##          E    1    0    0    1  720
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9944          
##                  95% CI : (0.9915, 0.9965)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9929          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9960   0.9868   0.9891   0.9986
## Specificity            0.9993   0.9978   0.9978   0.9988   0.9994
## Pos Pred Value         0.9982   0.9908   0.9897   0.9938   0.9972
## Neg Pred Value         0.9993   0.9991   0.9972   0.9979   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2840   0.1927   0.1721   0.1621   0.1835
## Detection Prevalence   0.2845   0.1945   0.1738   0.1631   0.1840
## Balanced Accuracy      0.9987   0.9969   0.9923   0.9939   0.9990
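
For reference, the estimated out-of-sample error rate is simply one minus the validation accuracy reported above. A minimal sketch using the objects already defined:

cm_rf <- confusionMatrix(prediction_forest, Validation$classe)
1 - unname(cm_rf$overall["Accuracy"]) # roughly 0.006, given the accuracy above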

We only missed a handful of cases, so this is very good performance. Let's compare to the confusion matrix on the training set, which should be at least as good.

prediction_forest_train <- predict(model_forests, Train, type = "class")
confusionMatrix(prediction_forest_train, Train$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4464    0    0    0    0
##          B    0 3038    0    0    0
##          C    0    0 2738    0    0
##          D    0    0    0 2573    0
##          E    0    0    0    0 2886
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Refreshingly, the random forest does even better on the training data, as it should. Note that if we keep all variables (including the subject names), we get even better apparent performance on the validation data, but the model cannot predict on the test data when it encounters new people's names; in other words, we would have overfit to the names of the participants. Even without those variables, the performance is still quite good.

Given this, we expect near-perfect accuracy on the test data (the validation data in the data science specialization convention).

prediction_forest_test <- predict(model_forests, testing, type = "class")
prediction_forest_test
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

So the random forest is probably sufficient, but let's compare it to the other methods on the validation set.

mean(predict(model_forests, Validation) == Validation$classe)
## [1] 0.994647
mean(predict(model_svm, Validation) == Validation$classe)
## [1] 0.9541167
mean(predict(model_lda, Validation) == Validation$classe)
## [1] 0.7037981

We could probably do better by combining the models, but we won’t worry about that here.
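
For completeness, here is a minimal sketch of what combining the models could look like: stack the three models' validation-set predictions and train a simple combiner on top. The object names are the ones defined above; in practice the combiner would need its own hold-out set to give an honest accuracy estimate.

stacked <- data.frame(
  rf  = predict(model_forests, Validation),
  svm = predict(model_svm, Validation),
  lda = predict(model_lda, Validation),
  classe = Validation$classe)
model_combined <- train(classe ~ ., data = stacked, method = "rf") # simple stacked combiner
mean(predict(model_combined, stacked) == stacked$classe)           # optimistic: evaluated on the data it was trained on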