With wearable health and fitness devices becoming increasingly popular, and the amount of data collected by such devices correspondingly burgeoning, the question of how to make use of these vast stores of information holds much potential. Many devices report simple aggregate summaries of the quantity of particular activities performed, perhaps in relation to some goal. Assessing the quality of those activities, on the other hand, is more challenging and may represent significant untapped potential hidden in the data.
The Weight Lifting Exercises Dataset aims to test the feasibility of assessing the quality of weight lifting exercises performed by inexperienced individuals, potentially as a precursor to personalized digital personal trainers. The data were collected by attaching sensors to a dumbbell and to each participant’s arm, forearm, and belt. Participants were then instructed to perform 10 repetitions of unilateral dumbbell biceps curls in one correct and four incorrect ways: “exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).” More details are available at the researchers’ website.
The goal is to develop machine learning algorithms that correctly classify the quality of the exercise performed given the sensor data.
The training and testing data sets are downloaded and read, if not already present.
if(!exists("pml_training") | !exists("pml_testing")){
  training_file <- "pml-training.csv"
  testing_file  <- "pml-testing.csv"
  if(!file.exists(training_file) | !file.exists(testing_file)){
    training_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
    testing_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
    download.file(training_url, training_file)
    download.file(testing_url, testing_file)
  }
  # treat both literal "NA" strings and empty fields as missing
  pml_training <- read.csv(training_file, na.strings = c("NA", ""))
  pml_testing  <- read.csv(testing_file,  na.strings = c("NA", ""))
}
dim(pml_training)
## [1] 19622 160
names(pml_training)
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt"
## [9] "pitch_belt" "yaw_belt"
## [11] "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt"
## [15] "skewness_roll_belt" "skewness_roll_belt.1"
## [17] "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
## [21] "min_roll_belt" "min_pitch_belt"
## [23] "min_yaw_belt" "amplitude_roll_belt"
## [25] "amplitude_pitch_belt" "amplitude_yaw_belt"
## [27] "var_total_accel_belt" "avg_roll_belt"
## [29] "stddev_roll_belt" "var_roll_belt"
## [31] "avg_pitch_belt" "stddev_pitch_belt"
## [33] "var_pitch_belt" "avg_yaw_belt"
## [35] "stddev_yaw_belt" "var_yaw_belt"
## [37] "gyros_belt_x" "gyros_belt_y"
## [39] "gyros_belt_z" "accel_belt_x"
## [41] "accel_belt_y" "accel_belt_z"
## [43] "magnet_belt_x" "magnet_belt_y"
## [45] "magnet_belt_z" "roll_arm"
## [47] "pitch_arm" "yaw_arm"
## [49] "total_accel_arm" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "gyros_arm_x"
## [61] "gyros_arm_y" "gyros_arm_z"
## [63] "accel_arm_x" "accel_arm_y"
## [65] "accel_arm_z" "magnet_arm_x"
## [67] "magnet_arm_y" "magnet_arm_z"
## [69] "kurtosis_roll_arm" "kurtosis_picth_arm"
## [71] "kurtosis_yaw_arm" "skewness_roll_arm"
## [73] "skewness_pitch_arm" "skewness_yaw_arm"
## [75] "max_roll_arm" "max_picth_arm"
## [77] "max_yaw_arm" "min_roll_arm"
## [79] "min_pitch_arm" "min_yaw_arm"
## [81] "amplitude_roll_arm" "amplitude_pitch_arm"
## [83] "amplitude_yaw_arm" "roll_dumbbell"
## [85] "pitch_dumbbell" "yaw_dumbbell"
## [87] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [89] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [91] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [93] "max_roll_dumbbell" "max_picth_dumbbell"
## [95] "max_yaw_dumbbell" "min_roll_dumbbell"
## [97] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [99] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell" "total_accel_dumbbell"
## [103] "var_accel_dumbbell" "avg_roll_dumbbell"
## [105] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [107] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [109] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [111] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [113] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [115] "gyros_dumbbell_z" "accel_dumbbell_x"
## [117] "accel_dumbbell_y" "accel_dumbbell_z"
## [119] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [121] "magnet_dumbbell_z" "roll_forearm"
## [123] "pitch_forearm" "yaw_forearm"
## [125] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [127] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [129] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [131] "max_roll_forearm" "max_picth_forearm"
## [133] "max_yaw_forearm" "min_roll_forearm"
## [135] "min_pitch_forearm" "min_yaw_forearm"
## [137] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [139] "amplitude_yaw_forearm" "total_accel_forearm"
## [141] "var_accel_forearm" "avg_roll_forearm"
## [143] "stddev_roll_forearm" "var_roll_forearm"
## [145] "avg_pitch_forearm" "stddev_pitch_forearm"
## [147] "var_pitch_forearm" "avg_yaw_forearm"
## [149] "stddev_yaw_forearm" "var_yaw_forearm"
## [151] "gyros_forearm_x" "gyros_forearm_y"
## [153] "gyros_forearm_z" "accel_forearm_x"
## [155] "accel_forearm_y" "accel_forearm_z"
## [157] "magnet_forearm_x" "magnet_forearm_y"
## [159] "magnet_forearm_z" "classe"
Before any exploration, the data are first partitioned into training and validation sets, with the validation set left aside so that no analysis decisions are made based on data that should, at this point, be treated as “unseen.”
set.seed(1234)  # added for reproducibility; the original run's seed is unknown
ind_training <- createDataPartition(pml_training$classe, p = 0.75, list = FALSE)
training <- pml_training[ind_training, ]
validation <- pml_training[-ind_training, ]
str(training, list.len = 15)
## 'data.frame': 14718 obs. of 160 variables:
## $ X : int 1 2 3 5 6 7 8 9 11 12 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 196328 304277 368296 440390 484323 500302 528316 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.45 1.42 1.42 1.43 1.45 1.43 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.07 8.06 8.09 8.13 8.16 8.18 8.18 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 396 levels "0.000673","0.005503",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : Factor w/ 316 levels "0.006078","-0.021887",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt : Factor w/ 394 levels "0.000000","0.000748",..: NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
A few strategic decisions are made in treating the data. First, we notice that several variables are very sparsely populated. We could impute missing values where the gaps are small, but where the ratio of missing to non-missing values is high, we prefer to omit those variables entirely.
rows <- nrow(training)
(na_ratio <- sapply(1:ncol(training), function(x) sum(is.na(training[ , x]))/rows))
## [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [8] 0.0000000 0.0000000 0.0000000 0.0000000 0.9798206 0.9798206 0.9798206
## [15] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [22] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [29] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [36] 0.9798206 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [43] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [50] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [57] 0.9798206 0.9798206 0.9798206 0.0000000 0.0000000 0.0000000 0.0000000
## [64] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.9798206 0.9798206
## [71] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [78] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.0000000
## [85] 0.0000000 0.0000000 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [92] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [99] 0.9798206 0.9798206 0.9798206 0.0000000 0.9798206 0.9798206 0.9798206
## [106] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [113] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [120] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.9798206 0.9798206
## [127] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [134] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.0000000
## [141] 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206 0.9798206
## [148] 0.9798206 0.9798206 0.9798206 0.0000000 0.0000000 0.0000000 0.0000000
## [155] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
(nonsparse <- na_ratio < 0.8)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [67] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [122] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE TRUE TRUE TRUE TRUE
sum(nonsparse)
## [1] 60
There are 100 sparse variables. Eliminating them should make our models simpler and more computationally efficient.
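The same sparsity mask can be computed more compactly; a minimal equivalent sketch, assuming the same training data frame as above:

```r
# fraction of missing values per column, then keep columns with < 80% missing
na_ratio  <- colMeans(is.na(training))
nonsparse <- na_ratio < 0.8
sum(nonsparse)  # 60 columns retained
```

colMeans(is.na(df)) averages the logical missingness matrix column-wise, avoiding the explicit sapply loop.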
Second, although the data were clearly collected as a time series, as the experimental design and timestamp variables show, we initially attempt a simplification: discarding all time and individual-participant information (the first 7 columns/variables in the data) and instead training our algorithms on a snapshot-like, non-sequential view of the data. In the event that this approach fails to produce a sufficient level of performance, a time series approach can be adopted.
We implement a custom preprocessing function to keep only the non-sparse variables, eliminate the time and individual data, and move the response variable to the first column for convenience.
myPreProc <- function(df, keep.cols, skip.cols = 7) {
  proc_df <- df[ , keep.cols][ , -(1:skip.cols)]
  # move the response variable to the first column
  ncols <- ncol(proc_df)
  proc_df <- proc_df[ , c(ncols, 1:(ncols - 1))]
  # make all columns (except the response) numeric
  proc_df[ , 2:ncols] <- as.data.frame(lapply(proc_df[ , -1], as.numeric))
  proc_df
}
pr_training <- myPreProc(training, nonsparse)
pr_validation <- myPreProc(validation, nonsparse)
pr_testing <- myPreProc(pml_testing, nonsparse)
The exact same preprocessing that is done to the training set is also done to the validation and testing sets. Note that the testing set does not contain the response variable (correct classes) but a placeholder index which will ultimately be ignored.
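A quick sanity check (a suggested addition, not part of the original analysis) confirms that the preprocessed sets line up: training and validation should have identical columns, while the testing set is compared on the predictor columns only, since its first column is the placeholder rather than classe:

```r
# verify consistent preprocessing across the data sets
stopifnot(identical(names(pr_training), names(pr_validation)))
stopifnot(identical(names(pr_training)[-1], names(pr_testing)[-1]))
```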
One approach to modeling is to try many different types of machine learning algorithms initially, evaluate their performance, and then tune the more successful one(s) as needed. This has the benefit of being data-driven, based on empirical results. We will try ten algorithms, several of which use very different underlying approaches to modeling. The caretEnsemble package makes it easy to train many models simultaneously with caretList().
5-fold cross-validation is used for choosing model parameters. The training data set is split into 5 stratified groups; the model is trained on 4 of them and its performance tested on the remaining group, and the groups are rotated until each has been used for testing. For better model performance, 10-fold repeated cross-validation is often suggested; however, we begin here with 5-fold non-repeated cross-validation in the interest of computational speed and will consider more stringent schemes if the results indicate the need. caret automatically tries several parameter combinations (as required by each model) to build the optimal model. If this fails to produce good results, we can later specify the number and values of parameters to try.
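Should 5-fold cross-validation prove too noisy, the more stringent scheme mentioned above is a one-line change to the control object; a sketch (the variable name mycontrol_strict is illustrative):

```r
# 10-fold cross-validation repeated 3 times (30 resamples per model)
mycontrol_strict <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```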
mycontrol <- trainControl(method = "cv", number = 5)
mymethods <- c("bayesglm", "C5.0", "fda", "gbm", "knn", "qda", "rf", "sda", "svmPoly", "xgbLinear")
# load models if already trained, since training can take several hours;
# if models must be changed, delete "model_list.rds" in the working directory before rerunning
model_fname <- "model_list.rds"
if(file.exists(model_fname)){
  model_list <- readRDS(model_fname)
} else {
  start_time <- proc.time()
  model_list <- caretList(x = pr_training[ , -1],
                          y = pr_training[ , 1],
                          trControl = mycontrol,
                          methodList = mymethods)
  # model training runtime, for curiosity
  runtime <- proc.time() - start_time
  # save models to speed up later analyses
  saveRDS(model_list, model_fname)
}
if(exists("runtime")) print(runtime)  # print runtime only if training was run
dotplot(resamples(model_list))
The plot shows promising performance for several of the models. Both the accuracy and the kappa score (which corrects for agreement expected by chance, and so accounts for class imbalance) look good on the resampled training data.
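The resampling results behind the plot can also be inspected numerically; a sketch assuming the fitted model_list from above:

```r
# per-model distribution of accuracy and kappa across the 5 folds
summary(resamples(model_list))
```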
Next, the built models are used to make predictions on the validation set to obtain our best unbiased estimates of model performance on unseen data. This also helps to make sure we do not have high variance (overfitting) issues. These results guide the decision of which model(s) to use in the future.
pred_list <- lapply(model_list, function(x) predict(x, newdata = pr_validation[ , -1]))
(accuracy <- sort(sapply(pred_list, function(x) confusionMatrix(x, pr_validation[ , 1])$overall[1]), decreasing = TRUE))
## xgbLinear.Accuracy C5.0.Accuracy rf.Accuracy
## 0.9971452 0.9953100 0.9946982
## svmPoly.Accuracy gbm.Accuracy knn.Accuracy
## 0.9934747 0.9647227 0.9119086
## qda.Accuracy sda.Accuracy fda.Accuracy
## 0.8941680 0.7002447 0.6831158
## bayesglm.Accuracy
## 0.4023246
Several of the models have high accuracies, with xgbLinear having the highest at 99.7%. Equivalently, the out-of-sample, or generalization, error is estimated at 0.3%. Furthermore, the confusion matrix below indicates the accuracy is consistently high across all classes. Interestingly, the model achieves perfect sensitivity on Class A, the class for a correctly performed exercise: when an individual performs the exercise well, the model is very unlikely to flag it as incorrect, which would be quite frustrating for the user.
confusionMatrix(pred_list$xgbLinear, pr_validation[ , 1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 1 0 0 0
## B 0 946 3 0 0
## C 0 2 850 4 1
## D 0 0 2 800 1
## E 0 0 0 0 899
##
## Overall Statistics
##
## Accuracy : 0.9971
## 95% CI : (0.9952, 0.9984)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9964
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9968 0.9942 0.9950 0.9978
## Specificity 0.9997 0.9992 0.9983 0.9993 1.0000
## Pos Pred Value 0.9993 0.9968 0.9918 0.9963 1.0000
## Neg Pred Value 1.0000 0.9992 0.9988 0.9990 0.9995
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1929 0.1733 0.1631 0.1833
## Detection Prevalence 0.2847 0.1935 0.1748 0.1637 0.1833
## Balanced Accuracy 0.9999 0.9980 0.9962 0.9971 0.9989
Considering the high accuracies achieved, we apply the models to the test set. Instead of picking only the single model with the highest accuracy, it is interesting to see to what extent the different models agree with each other. We can also use a simple ensembling strategy of unweighted votes to make more robust predictions.
results <- as.data.frame(lapply(model_list, function(x) predict(x, newdata = pr_testing[ , -1])))
results$ensemble <- apply(results, 1, function(x) names(which.max(table(x))))
results
## bayesglm C5.0 fda gbm knn qda rf sda svmPoly xgbLinear ensemble
## 1 B B D B B A B B B B B
## 2 A A A A A A A A A A A
## 3 B B C B B B B A B B B
## 4 B A C A A A A C A A A
## 5 B A C A A A A C A A A
## 6 B E C E E E E C E E E
## 7 B D D D D D D D D D D
## 8 B B D B B B B D B B B
## 9 A A A A A A A A A A A
## 10 A A A A A A A A A A A
## 11 B B D B B B B D B B B
## 12 A C C C C C C A C C C
## 13 B B B B D B B B B B B
## 14 A A A A A A A A A A A
## 15 B E C E E E E E E E E
## 16 B E B E E E E A E E E
## 17 A A A A A A A C A A A
## 18 B B D B B B B B B B B
## 19 B B D B B B B B B B B
## 20 B B A B B B B B B B B
The ensemble result is used as our predictions for the test data set. We see that 5 of the models agree completely with each other and the ensemble result, which gives us more confidence in the predictions.
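The agreement claim can be checked programmatically; a sketch assuming the results data frame above:

```r
# which of the ten individual models match the majority-vote ensemble
# on every one of the 20 test cases
full_agreement <- sapply(results[1:10], function(x) all(x == results$ensemble))
names(full_agreement)[full_agreement]
```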
We have deployed ten machine learning algorithms that use motion sensor data to classify the quality and correctness of exercises performed by individuals. The best performing model was xgbLinear with 99.7% accuracy. The predictions from all ten models were ensembled together for more robust performance.
Although the performance is sufficient for this application, a more thorough exploration of the models’ parameters could possibly further improve performance and/or reduce computational time by finding equivalently performing simpler models or improving convergence rates. Also, ensembling could use other strategies such as averaging the prediction probabilities from the individual models. Model stacking is also an option, applying another machine learning algorithm to the data set of individual model results to obtain the final predictions.
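The probability-averaging alternative mentioned above could be sketched as follows. Note the assumptions: all ten models must support class probabilities and must have been trained with classProbs = TRUE in trainControl (the control object used here was not, so the models would need retraining), and the commented caretStack call is caretEnsemble's stacking interface:

```r
# average class probabilities across models, then take the most probable class
prob_list <- lapply(model_list, function(m)
  predict(m, newdata = pr_testing[ , -1], type = "prob"))
avg_prob <- Reduce(`+`, prob_list) / length(prob_list)
ensemble_avg <- colnames(avg_prob)[max.col(as.matrix(avg_prob))]

# model stacking: train a meta-model on the individual models' predictions
# stack <- caretStack(model_list, method = "rf")
```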
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.