Cleaning Weightlifting Exercise Data

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, I used data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

What was the goal of the project?

The goal of this project is to obtain and clean the data so that it has the characteristics of a tidy data set. In brief, tidy data has three characteristics: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Real datasets often violate these three precepts, most commonly in five ways. Column headers are values, not variable names. Multiple variables are stored in one column. Variables are stored in both rows and columns. Multiple types of observational units are stored in the same table. A single observational unit is stored in multiple tables. To read more about why the concept of tidy data might be interesting, please visit this page.
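As a small illustration of the first violation listed above (column headers that are values), the following base R sketch reshapes a made-up table by hand. The data frame `messy` and its numbers are hypothetical and are not part of the project data.

```r
# Hypothetical toy table: the headers X, Y and Z are values of an "axis" variable.
messy <- data.frame(subject = c(1, 2),
                    X = c(0.1, 0.2),
                    Y = c(0.3, 0.4),
                    Z = c(0.5, 0.6))

axes <- c("X", "Y", "Z")

# Stack the three measurement columns so that each row is one observation.
tidy <- data.frame(subject = rep(messy$subject, times = length(axes)),
                   axis    = rep(axes, each = nrow(messy)),
                   value   = unlist(messy[axes], use.names = FALSE))

print(tidy)
```

The result has one row per subject and axis, with the axis stored as a value in its own column rather than in the header.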

Load Libraries

suppressMessages(library(data.table))
suppressMessages(library(stringr))
suppressMessages(library(dplyr))
suppressMessages(library(tidyr))
suppressMessages(library(reshape2))

Download Files

One can download the data in the following manner.

FileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

if (!file.exists("./data")) { dir.create("./data") }

# mode = "wb" keeps the zip archive intact when downloading on Windows
download.file(FileUrl, destfile = "./data/UCI_HAR_Dataset.zip", mode = "wb")

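Once the archive has been obtained, unzip it and read in the files.

# Extract into ./data; the paths below assume the archive unpacks into a "UCI HAR Dataset" folder
unzip("./data/UCI_HAR_Dataset.zip", exdir = "./data")

subject_train <- fread("data/UCI HAR Dataset/train/subject_train.txt")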

subject_test <- fread("data/UCI HAR Dataset/test/subject_test.txt")

y_train <- fread("data/UCI HAR Dataset/train/y_train.txt")

y_test <- fread("data/UCI HAR Dataset/test/y_test.txt")

x_train <- fread("data/UCI HAR Dataset/train/X_train.txt", sep = " ")

x_test <- fread("data/UCI HAR Dataset/test/X_test.txt", sep = " ")

We notice that this data set is organized as a set of features (x) and labels (y), split into training and test parts. In this project we are not interested in analyzing the data, only in making it satisfy the precepts of tidy data.

We therefore bundle the subjects, features, and labels into one dataset with all variables as columns.

subject <- rbind(subject_train, subject_test)
names(subject) <- c("Subject")
y <- rbind(y_train, y_test)
names(y) <- c("Label")
x <- rbind(x_train, x_test)

subjectxy <- cbind(subject, x, y)
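Since cbind() pairs rows purely by position, the three tables need the same number of rows in the same order. A minimal sketch of that check on made-up stand-ins (the `toy_` objects and their values are hypothetical):

```r
# Hypothetical stand-ins for the subject, feature and label tables.
toy_subject <- data.frame(Subject = c(1, 1, 2))
toy_x       <- data.frame(V1 = c(0.28, 0.27, 0.29), V2 = c(-0.02, -0.01, -0.03))
toy_y       <- data.frame(Label = c(5, 5, 4))

# cbind() matches rows by position, so the row counts must agree.
stopifnot(nrow(toy_subject) == nrow(toy_x), nrow(toy_x) == nrow(toy_y))

toy_all <- cbind(toy_subject, toy_x, toy_y)
print(dim(toy_all))
```

The same stopifnot() line could be applied to subject, x, and y before they are combined.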

We then set the key on Subject and Label. This sorts the table by those two columns and lets us refer to them later through key().

setkey(subjectxy, Subject, Label)
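To see what setting a key does, here is a short sketch on a made-up table (the object `toy` and its values are hypothetical). It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table, deliberately out of order.
toy <- data.table(Subject = c(2, 1, 1),
                  Label   = c(3, 5, 1),
                  V1      = c(0.3, 0.1, 0.2))

# setkey() sorts the table by reference and records the key columns.
setkey(toy, Subject, Label)

print(key(toy))  # the key columns, in order
print(toy)       # rows are now sorted by Subject, then Label
```

Because the sort happens by reference, no assignment is needed; the table itself is reordered.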

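Here we are only interested in the mean and standard deviation for each measurement. These are features of the data, and their names are listed in features.txt, so we read that file and keep only the names that contain mean() or std(). If this is unclear or you’d like to know more, you can read about characteristics of the data here.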

features <- fread("data/UCI HAR Dataset/features.txt")

names(features) <- c("featureIndex", "feature")

features <- features[grepl("mean\\(\\)|std\\(\\)", feature)]
features
##     featureIndex                     feature
##  1:            1           tBodyAcc-mean()-X
##  2:            2           tBodyAcc-mean()-Y
##  3:            3           tBodyAcc-mean()-Z
##  4:            4            tBodyAcc-std()-X
##  5:            5            tBodyAcc-std()-Y
##  6:            6            tBodyAcc-std()-Z
##  7:           41        tGravityAcc-mean()-X
##  8:           42        tGravityAcc-mean()-Y
##  9:           43        tGravityAcc-mean()-Z
## 10:           44         tGravityAcc-std()-X
## 11:           45         tGravityAcc-std()-Y
## 12:           46         tGravityAcc-std()-Z
## 13:           81       tBodyAccJerk-mean()-X
## 14:           82       tBodyAccJerk-mean()-Y
## 15:           83       tBodyAccJerk-mean()-Z
## 16:           84        tBodyAccJerk-std()-X
## 17:           85        tBodyAccJerk-std()-Y
## 18:           86        tBodyAccJerk-std()-Z
## 19:          121          tBodyGyro-mean()-X
## 20:          122          tBodyGyro-mean()-Y
## 21:          123          tBodyGyro-mean()-Z
## 22:          124           tBodyGyro-std()-X
## 23:          125           tBodyGyro-std()-Y
## 24:          126           tBodyGyro-std()-Z
## 25:          161      tBodyGyroJerk-mean()-X
## 26:          162      tBodyGyroJerk-mean()-Y
## 27:          163      tBodyGyroJerk-mean()-Z
## 28:          164       tBodyGyroJerk-std()-X
## 29:          165       tBodyGyroJerk-std()-Y
## 30:          166       tBodyGyroJerk-std()-Z
## 31:          201          tBodyAccMag-mean()
## 32:          202           tBodyAccMag-std()
## 33:          214       tGravityAccMag-mean()
## 34:          215        tGravityAccMag-std()
## 35:          227      tBodyAccJerkMag-mean()
## 36:          228       tBodyAccJerkMag-std()
## 37:          240         tBodyGyroMag-mean()
## 38:          241          tBodyGyroMag-std()
## 39:          253     tBodyGyroJerkMag-mean()
## 40:          254      tBodyGyroJerkMag-std()
## 41:          266           fBodyAcc-mean()-X
## 42:          267           fBodyAcc-mean()-Y
## 43:          268           fBodyAcc-mean()-Z
## 44:          269            fBodyAcc-std()-X
## 45:          270            fBodyAcc-std()-Y
## 46:          271            fBodyAcc-std()-Z
## 47:          345       fBodyAccJerk-mean()-X
## 48:          346       fBodyAccJerk-mean()-Y
## 49:          347       fBodyAccJerk-mean()-Z
## 50:          348        fBodyAccJerk-std()-X
## 51:          349        fBodyAccJerk-std()-Y
## 52:          350        fBodyAccJerk-std()-Z
## 53:          424          fBodyGyro-mean()-X
## 54:          425          fBodyGyro-mean()-Y
## 55:          426          fBodyGyro-mean()-Z
## 56:          427           fBodyGyro-std()-X
## 57:          428           fBodyGyro-std()-Y
## 58:          429           fBodyGyro-std()-Z
## 59:          503          fBodyAccMag-mean()
## 60:          504           fBodyAccMag-std()
## 61:          516  fBodyBodyAccJerkMag-mean()
## 62:          517   fBodyBodyAccJerkMag-std()
## 63:          529     fBodyBodyGyroMag-mean()
## 64:          530      fBodyBodyGyroMag-std()
## 65:          542 fBodyBodyGyroJerkMag-mean()
## 66:          543  fBodyBodyGyroJerkMag-std()
##     featureIndex                     feature
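Because the pattern escapes the parentheses, it keeps only names that contain the literal text mean() or std(). The base R check below illustrates this on a few names. The first two appear in the output above; the other two are, as far as I know, typical of other entries in features.txt, but treat them as illustrative.

```r
# Illustrative feature names; only the first two are taken from the output above.
candidates <- c("tBodyAcc-mean()-X",
                "tBodyAcc-std()-Z",
                "fBodyAcc-meanFreq()-X",
                "angle(tBodyAccMean,gravity)")

# Same pattern as in the filtering step: a literal "mean()" or "std()".
keep <- grepl("mean\\(\\)|std\\(\\)", candidates)

print(candidates[keep])
```

Names such as meanFreq() are dropped because "mean" is not immediately followed by "()".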

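Next, we build the column names that correspond to the selected features (fread names unnamed columns V1, V2, and so on) and keep only those columns, plus the key columns, in subjectxy. Note that the filter keeps names containing "mean()" or "std()", not every name containing "mean" or "std".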

raw_feature_name <- paste0("V", features$featureIndex)

subjectxy <- subjectxy[, c(key(subjectxy), raw_feature_name), with = FALSE]
raw_feature_name
##  [1] "V1"   "V2"   "V3"   "V4"   "V5"   "V6"   "V41"  "V42"  "V43"  "V44" 
## [11] "V45"  "V46"  "V81"  "V82"  "V83"  "V84"  "V85"  "V86"  "V121" "V122"
## [21] "V123" "V124" "V125" "V126" "V161" "V162" "V163" "V164" "V165" "V166"
## [31] "V201" "V202" "V214" "V215" "V227" "V228" "V240" "V241" "V253" "V254"
## [41] "V266" "V267" "V268" "V269" "V270" "V271" "V345" "V346" "V347" "V348"
## [51] "V349" "V350" "V424" "V425" "V426" "V427" "V428" "V429" "V503" "V504"
## [61] "V516" "V517" "V529" "V530" "V542" "V543"

We then use the descriptive activity names provided with the data to name the activities, and we label the columns of the data set with descriptive variable names.

descriptive_activity_names <- fread("data/UCI HAR Dataset/activity_labels.txt")

names(descriptive_activity_names) <- c("Label", "activity")

subjectxy <- merge(subjectxy, descriptive_activity_names, by = "Label", all.x = TRUE)
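The merge above is a left join on Label. A minimal base R sketch of the same idea on made-up data (the `toy_` objects are hypothetical; the two activity names only mimic the layout of activity_labels.txt):

```r
# Hypothetical toy data: three observations and a two-row lookup table.
toy_obs    <- data.frame(Label = c(2, 1, 2), Subject = c(1, 1, 2))
toy_labels <- data.frame(Label = c(1, 2),
                         activity = c("WALKING", "WALKING_UPSTAIRS"))

# all.x = TRUE keeps every observation, even if its code had no match.
toy_merged <- merge(toy_obs, toy_labels, by = "Label", all.x = TRUE)

print(toy_merged)
```

merge() returns the rows sorted by the join column, so the row order after merging can differ from the order before it.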

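# Capture groups: domain, frame, device, jerk, magnitude, statistic, axis
search_string <- "(^[ft])(Body|Gravity)*(Gyro|Acc)*(Jerk)*(Mag)*-(mean|std)\\(\\)[-]*([XYZ])*"

for (i in seq_along(raw_feature_name)) {
    # Groups that do not match come back as NA and are pasted as the text "NA"
    new_name <- paste(str_match(features$feature[i], search_string)[, 2:8], collapse = "_")

    # Rename the raw column (for example V1) to its descriptive name
    setnames(subjectxy, raw_feature_name[i], new_name)
}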
colnames(subjectxy)
##  [1] "Label"                        "Subject"                     
##  [3] "t_Body_Acc_NA_NA_mean_X"      "t_Body_Acc_NA_NA_mean_Y"     
##  [5] "t_Body_Acc_NA_NA_mean_Z"      "t_Body_Acc_NA_NA_std_X"      
##  [7] "t_Body_Acc_NA_NA_std_Y"       "t_Body_Acc_NA_NA_std_Z"      
##  [9] "t_Gravity_Acc_NA_NA_mean_X"   "t_Gravity_Acc_NA_NA_mean_Y"  
## [11] "t_Gravity_Acc_NA_NA_mean_Z"   "t_Gravity_Acc_NA_NA_std_X"   
## [13] "t_Gravity_Acc_NA_NA_std_Y"    "t_Gravity_Acc_NA_NA_std_Z"   
## [15] "t_Body_Acc_Jerk_NA_mean_X"    "t_Body_Acc_Jerk_NA_mean_Y"   
## [17] "t_Body_Acc_Jerk_NA_mean_Z"    "t_Body_Acc_Jerk_NA_std_X"    
## [19] "t_Body_Acc_Jerk_NA_std_Y"     "t_Body_Acc_Jerk_NA_std_Z"    
## [21] "t_Body_Gyro_NA_NA_mean_X"     "t_Body_Gyro_NA_NA_mean_Y"    
## [23] "t_Body_Gyro_NA_NA_mean_Z"     "t_Body_Gyro_NA_NA_std_X"     
## [25] "t_Body_Gyro_NA_NA_std_Y"      "t_Body_Gyro_NA_NA_std_Z"     
## [27] "t_Body_Gyro_Jerk_NA_mean_X"   "t_Body_Gyro_Jerk_NA_mean_Y"  
## [29] "t_Body_Gyro_Jerk_NA_mean_Z"   "t_Body_Gyro_Jerk_NA_std_X"   
## [31] "t_Body_Gyro_Jerk_NA_std_Y"    "t_Body_Gyro_Jerk_NA_std_Z"   
## [33] "t_Body_Acc_NA_Mag_mean_NA"    "t_Body_Acc_NA_Mag_std_NA"    
## [35] "t_Gravity_Acc_NA_Mag_mean_NA" "t_Gravity_Acc_NA_Mag_std_NA" 
## [37] "t_Body_Acc_Jerk_Mag_mean_NA"  "t_Body_Acc_Jerk_Mag_std_NA"  
## [39] "t_Body_Gyro_NA_Mag_mean_NA"   "t_Body_Gyro_NA_Mag_std_NA"   
## [41] "t_Body_Gyro_Jerk_Mag_mean_NA" "t_Body_Gyro_Jerk_Mag_std_NA" 
## [43] "f_Body_Acc_NA_NA_mean_X"      "f_Body_Acc_NA_NA_mean_Y"     
## [45] "f_Body_Acc_NA_NA_mean_Z"      "f_Body_Acc_NA_NA_std_X"      
## [47] "f_Body_Acc_NA_NA_std_Y"       "f_Body_Acc_NA_NA_std_Z"      
## [49] "f_Body_Acc_Jerk_NA_mean_X"    "f_Body_Acc_Jerk_NA_mean_Y"   
## [51] "f_Body_Acc_Jerk_NA_mean_Z"    "f_Body_Acc_Jerk_NA_std_X"    
## [53] "f_Body_Acc_Jerk_NA_std_Y"     "f_Body_Acc_Jerk_NA_std_Z"    
## [55] "f_Body_Gyro_NA_NA_mean_X"     "f_Body_Gyro_NA_NA_mean_Y"    
## [57] "f_Body_Gyro_NA_NA_mean_Z"     "f_Body_Gyro_NA_NA_std_X"     
## [59] "f_Body_Gyro_NA_NA_std_Y"      "f_Body_Gyro_NA_NA_std_Z"     
## [61] "f_Body_Acc_NA_Mag_mean_NA"    "f_Body_Acc_NA_Mag_std_NA"    
## [63] "f_Body_Acc_Jerk_Mag_mean_NA"  "f_Body_Acc_Jerk_Mag_std_NA"  
## [65] "f_Body_Gyro_NA_Mag_mean_NA"   "f_Body_Gyro_NA_Mag_std_NA"   
## [67] "f_Body_Gyro_Jerk_Mag_mean_NA" "f_Body_Gyro_Jerk_Mag_std_NA" 
## [69] "activity"
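To make the renaming step easier to follow, here is the same pattern applied to a single feature name. This is only a sketch of what happens inside the loop; it assumes the stringr package is installed.

```r
library(stringr)

search_string <- "(^[ft])(Body|Gravity)*(Gyro|Acc)*(Jerk)*(Mag)*-(mean|std)\\(\\)[-]*([XYZ])*"

# Column 1 of the result is the full match; columns 2 to 8 are the capture groups.
pieces <- str_match("tBodyAcc-mean()-X", search_string)[1, 2:8]
print(pieces)

# paste() turns missing groups into the text "NA", which is where the
# "_NA_" parts of the column names come from.
new_name <- paste(pieces, collapse = "_")
print(new_name)
```

The "Jerk" and "Mag" groups do not occur in this name, so their slots are filled with the placeholder text "NA", which keeps every column name at exactly seven parts.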

We now have a data table with descriptive column names, but each name encodes more than one variable. To remedy this, we melt the table into long format, so that each row holds one measurement together with the name of the feature it came from, and then separate that name so that there is only one variable per column. For example, the variable "space" tells us whether we are dealing with the frequency (f) or time (t) domain, while the variable "axis" tells us whether the measurement is along the X, Y, or Z axis.

subjectxy <- melt(subjectxy, id.vars = c("Label", "Subject", "activity"),
                  value.name = "activity_values",
                  variable.name = "space_frame_device_jerk_magnitude_statistic_axis")

subjectxy <- as_tibble(subjectxy) # conversion to a tibble; tbl_df() is deprecated in current dplyr

subjectxy <- subjectxy %>% separate(space_frame_device_jerk_magnitude_statistic_axis, c("space","frame","device","jerk","magnitude","statistic","axis"))

subjectxy <- subjectxy %>% select(-Label) # drop the numeric Label; activity carries the same information
subjectxy
## # A tibble: 679,734 × 10
##    Subject activity space frame device  jerk magnitude statistic  axis
##      <int>    <chr> <chr> <chr>  <chr> <chr>     <chr>     <chr> <chr>
## 1        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 2        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 3        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 4        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 5        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 6        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 7        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 8        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 9        1  WALKING     t  Body    Acc    NA        NA      mean     X
## 10       1  WALKING     t  Body    Acc    NA        NA      mean     X
## # ... with 679,724 more rows, and 1 more variables: activity_values <dbl>
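The separation step can be tried on its own with two of the column names built earlier. The data frame `toy_long` and its values are hypothetical, and the final loop is an optional extra that is not part of the original pipeline. The sketch assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical long-format rows using two of the column names built earlier.
toy_long <- data.frame(name  = c("t_Body_Acc_NA_NA_mean_X",
                                 "f_Body_Gyro_Jerk_Mag_std_NA"),
                       value = c(0.28, -0.5),
                       stringsAsFactors = FALSE)

parts <- c("space", "frame", "device", "jerk", "magnitude", "statistic", "axis")

# Split the name on underscores into one column per variable.
toy_sep <- separate(toy_long, name, into = parts, sep = "_")

# Optional: turn the placeholder text "NA" into a real missing value.
for (p in parts) {
  toy_sep[[p]][toy_sep[[p]] %in% "NA"] <- NA
}

print(toy_sep)
```

Converting the placeholders to real missing values lets functions such as is.na() recognize them, which may be convenient in later filtering.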

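Finally, we create a summary data frame using the package dplyr that gives the average value of each measurement for each subject. Note that the grouping below uses Subject and the measurement characteristics but not activity, so each average is taken across all of a subject's activities; adding activity to group_by() would give one average per subject and activity instead. This is the end product of the project and gives us summary statistics about the data set as a whole.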

subjectxy_tidy <- subjectxy %>% group_by(Subject, space,frame,device,jerk,magnitude,statistic,axis) %>% summarize(avg_value = mean(activity_values))
subjectxy_tidy
## Source: local data frame [1,980 x 9]
## Groups: Subject, space, frame, device, jerk, magnitude, statistic [?]
## 
##    Subject space frame device  jerk magnitude statistic  axis  avg_value
##      <int> <chr> <chr>  <chr> <chr>     <chr>     <chr> <chr>      <dbl>
## 1        1     f  Body    Acc  Jerk       Mag      mean    NA -0.4990758
## 2        1     f  Body    Acc  Jerk       Mag       std    NA -0.5418231
## 3        1     f  Body    Acc  Jerk        NA      mean     X -0.5473489
## 4        1     f  Body    Acc  Jerk        NA      mean     Y -0.5073436
## 5        1     f  Body    Acc  Jerk        NA      mean     Z -0.6953051
## 6        1     f  Body    Acc  Jerk        NA       std     X -0.5439798
## 7        1     f  Body    Acc  Jerk        NA       std     Y -0.4662517
## 8        1     f  Body    Acc  Jerk        NA       std     Z -0.7378619
## 9        1     f  Body    Acc    NA       Mag      mean    NA -0.4784485
## 10       1     f  Body    Acc    NA       Mag       std    NA -0.5897102
## # ... with 1,970 more rows

The summary table is then written to a text file.

write.table(subjectxy_tidy, file = "tidy_smartphone_data.txt", row.names = FALSE)
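
The effect of the grouping columns on the summary can be seen on a small made-up table. The data frame `toy` and its values are hypothetical; the sketch assumes the dplyr package is installed and is not a replacement for the summary step used in this project.

```r
suppressMessages(library(dplyr))

# Hypothetical long-format data: two subjects, two activities.
toy <- data.frame(Subject  = c(1, 1, 1, 2, 2, 2),
                  activity = c("WALKING", "WALKING", "SITTING",
                               "WALKING", "SITTING", "SITTING"),
                  value    = c(0.2, 0.4, -0.9, 0.1, -0.8, -1.0))

# Grouping by Subject alone averages over all of a subject's activities.
by_subject <- toy %>%
  group_by(Subject) %>%
  summarize(avg_value = mean(value))

# Adding activity to group_by() gives one average per subject and activity.
by_subject_activity <- toy %>%
  group_by(Subject, activity) %>%
  summarize(avg_value = mean(value))

print(by_subject)
print(by_subject_activity)
```

The first summary has one row per subject, the second one row per subject and activity, so the choice of grouping columns determines what each average means.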