Nowadays, wearable devices are becoming increasingly capable of providing useful information about many aspects of our lives. Among these, smart watches and smart wristbands are growing in popularity among customers because they can monitor, detect, and report the activities their owner performs while wearing them.
An interesting feature for owners of these devices is getting feedback on how well they performed an activity. In this project, we investigate the possibility of detecting correct and incorrect barbell lifts performed by participants wearing accelerometers on the belt, forearm, arm, and dumbbell. The 6 participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information on the data is available here.
library(caret)
library(randomForest)
First we load the caret and randomForest libraries so we can run the machine learning algorithms. Next we load the data into R. The training and test data were provided in two CSV files and are imported as data frames.
## Loading data into R
train_data = read.csv("pml-training.csv")
test_data = read.csv("pml-testing.csv")
The training data has 160 columns and 19622 rows. The columns are the variables monitored during each test, and each row represents one observation. The last column of train_data, “classe”, is the response that we want to predict.
Looking at the test data, we can see that many columns contain only “NA” values and cannot be used to predict the response. We remove these columns from both train_data and test_data since they will not be useful in the model.
## Remove columns that are entirely NA in test_data
train_data = train_data[, colSums(is.na(test_data)) != nrow(test_data)]
test_data = test_data[, colSums(is.na(test_data)) != nrow(test_data)]
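The all-NA filter above keeps only those columns that are not entirely NA in the test set. A minimal, self-contained sketch on toy data frames (not the project data) shows the mechanic:

```r
# Toy frames: column "b" is all NA in the test frame
toy_test  <- data.frame(a = 1:3, b = NA, c = 4:6)
toy_train <- data.frame(a = 7:9, b = 10:12, c = 13:15)

# Keep a column only if it is not entirely NA in the test frame
keep <- colSums(is.na(toy_test)) != nrow(toy_test)
toy_train <- toy_train[, keep]
toy_test  <- toy_test[, keep]

names(toy_train)  # column "b" has been dropped from both frames
```

Note that the filter is computed on the test frame and then applied to both frames, so the two keep identical column sets.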
The first 7 columns of the dataset are identifiers for the experiments and participants, so we remove them from the data.
train_data = train_data[, -(1:7)]
test_data = test_data[, -(1:7)]
To build and evaluate the model we split the data into train (70%) and test (30%) sets.
inTrain = createDataPartition(y = train_data$classe, p = 0.7)[[1]]
training = train_data[inTrain, ]
testing = train_data[-inTrain, ]
The training data set now has 53 columns and 13737 rows. An important part of any machine learning workflow is finding the variables that help predict the response. There are many ways to achieve this. The method used here is to identify pairs of highly correlated variables and remove one member of each pair. The findCorrelation function examines the correlation matrix and, whenever two variables have a correlation higher than the cutoff value (0.7 was chosen here), removes the one with the largest mean absolute correlation.
# Convert factor predictors (excluding the response "classe") to numeric
colClass = sapply(testing[, 1:(ncol(testing) - 1)], class)
factorCols = names(colClass)[colClass == "factor"]
testing[factorCols] = sapply(testing[factorCols], as.numeric)
training[factorCols] = sapply(training[factorCols], as.numeric)
# Scale the predictors and compute the correlation matrix
training.scale <- scale(training[, !names(training) %in% "classe"], center = TRUE, scale = TRUE)
corMatMy <- cor(training.scale)
# Apply the correlation filter at 0.70: flag one variable from each
# pair with correlation above 0.7 and remove it from both sets
highlyCor <- findCorrelation(corMatMy, 0.70)
training = training[, -highlyCor]
testing = testing[, -highlyCor]
After this filtering step the training data set has 31 columns and 13737 rows.
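The behaviour of findCorrelation described above can be illustrated on a small synthetic example (toy variables, not the project data): when two predictors are near-duplicates, exactly one of them is flagged for removal.

```r
library(caret)  # for findCorrelation

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # x2 nearly duplicates x1 (cor ~ 0.99)
x3 <- rnorm(100)                 # x3 is independent of both

corMat <- cor(data.frame(x1, x2, x3))
dropIdx <- findCorrelation(corMat, cutoff = 0.7)

colnames(corMat)[dropIdx]  # one of the near-duplicate pair is flagged
```

Only one member of the correlated pair is dropped; the independent variable x3 survives the filter.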
Now, to model the response, we use a random forest. In preliminary attempts at fitting this model it was found that the default settings of the random forest make for a very time-consuming procedure. The key parameter turned out to be the number of trees to grow, ntree. By trial and error ntree = 50 was chosen (the default is ntree = 500).
train_control <- trainControl(method = "cv", number = 5)
modfit = train(classe ~ ., data = training, method = "rf", ntree = 50,
               proximity = TRUE, trControl = train_control)
print(modfit)
## Random Forest
##
## 13737 samples
## 30 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 10989, 10989, 10990, 10991, 10989
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9820194 0.9772479 0.001918297 0.002433150
## 16 0.9795443 0.9741167 0.002499929 0.003165286
## 30 0.9712454 0.9636149 0.002347061 0.002972132
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
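The trial-and-error choice of ntree above can be sketched by timing a few candidate values. This is a minimal illustration on the built-in iris data, not the project data; the candidate values are arbitrary:

```r
library(randomForest)

set.seed(42)
# Time a randomForest fit for a few candidate ntree values
times <- sapply(c(50, 200, 500), function(nt)
  system.time(randomForest(Species ~ ., data = iris, ntree = nt))["elapsed"])

round(times, 2)  # fitting time grows roughly linearly with ntree
```

On the much larger project data the same linear growth makes the default of 500 trees considerably slower than 50, which motivates the smaller value when accuracy remains acceptable.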
The variables most important for predicting the response class can be inspected with a variable-importance plot:
varImpPlot(modfit$finalModel)
Now we can use this model to predict the classe variable in the testing dataset.
pred <- predict(modfit, testing[, 1:(ncol(testing) - 1)])
confusionMatrix(testing$classe, pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1667 5 0 1 1
## B 17 1112 7 0 3
## C 0 15 999 12 0
## D 2 0 19 939 4
## E 0 0 0 3 1079
##
## Overall Statistics
##
## Accuracy : 0.9849
## 95% CI : (0.9814, 0.9878)
## No Information Rate : 0.2865
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9809
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9887 0.9823 0.9746 0.9832 0.9926
## Specificity 0.9983 0.9943 0.9944 0.9949 0.9994
## Pos Pred Value 0.9958 0.9763 0.9737 0.9741 0.9972
## Neg Pred Value 0.9955 0.9958 0.9946 0.9967 0.9983
## Prevalence 0.2865 0.1924 0.1742 0.1623 0.1847
## Detection Rate 0.2833 0.1890 0.1698 0.1596 0.1833
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9935 0.9883 0.9845 0.9891 0.9960
As can be seen from the table above, the out-of-sample error is around 1.5% (one minus the overall accuracy of 0.9849).
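The out-of-sample error is simply one minus the accuracy on held-out data, i.e. the fraction of off-diagonal entries in the confusion matrix. A minimal sketch on toy predictions (two mistakes out of ten, not the project data):

```r
# Toy held-out labels and predictions: 2 errors out of 10
truth <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"))
pred  <- factor(c("A", "A", "B", "B", "B", "B", "C", "C", "C", "A"))

cm <- table(Prediction = pred, Reference = truth)
accuracy  <- sum(diag(cm)) / sum(cm)  # correct predictions on the diagonal
oos_error <- 1 - accuracy             # 0.2 here; ~0.015 for the model above
```

Applying the same computation to the confusion matrix above gives 1 - 0.9849, the roughly 1.5% error quoted.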