The training data is read using read.csv. It was observed that some values are stored as the literal string "NA" or as empty fields, so both are converted to true missing values.

setwd("D:\\R")
# Read the training data, treating empty fields as missing values
training <- read.csv("pml-training.csv", na.strings = "")
# Some cells contain the literal string "NA"; convert them to true missing values
training[training == "NA"] <- NA
# Flag columns that are almost entirely missing (more than 19000 of 19622 rows)
onlyNAs <- sapply(training, function(x) sum(is.na(x)) > 19000)
# Keep only the mostly complete columns, then only the numeric ones,
# and re-attach the classe outcome
trainingDataClean <- training[!onlyNAs]
trainingDataClean <- trainingDataClean[sapply(trainingDataClean, is.numeric)]
trainingDataClean$classe <- training$classe

# Drop the row index, the raw timestamps, and the window number,
# which carry no predictive information about the activity class
trainingDataClean <- subset(trainingDataClean, select = -c(X, raw_timestamp_part_1, raw_timestamp_part_2, num_window))
dim(trainingDataClean)
## [1] 19622    53
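
As a quick sanity check (a sketch, not part of the original analysis), one can confirm that no missing values remain in the cleaned training set:

# Sketch: the cleaned training set should contain no remaining NAs
sum(is.na(trainingDataClean))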

The same processing is applied to the test data.

# Apply the identical cleaning steps to the test data, reusing the
# column selection (onlyNAs) derived from the training data
testing <- read.csv("pml-testing.csv", na.strings = "")
testing[testing == "NA"] <- NA
testingDataClean <- testing[!onlyNAs]
testingDataClean <- testingDataClean[sapply(testingDataClean, is.numeric)]

testingDataClean<-subset(testingDataClean,select=-c(X,raw_timestamp_part_1,raw_timestamp_part_2,num_window))
dim(testingDataClean)
## [1] 20 53
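
As a sketch, one can also verify that the cleaned test set carries the same predictor columns as the cleaned training set; only the outcome-related columns should differ:

# Sketch: columns present in one cleaned set but not the other
setdiff(names(trainingDataClean), names(testingDataClean))
setdiff(names(testingDataClean), names(trainingDataClean))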

Covariates with near-zero variance would not be helpful for prediction, so the cleaned data is checked for them; none are found.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Compute near-zero-variance diagnostics for every remaining covariate
zeroVar <- nearZeroVar(trainingDataClean, saveMetrics = TRUE)
# TRUE would indicate at least one near-zero-variance covariate
any(zeroVar$nzv)
## [1] FALSE
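
Had any near-zero-variance covariates been flagged, they could be dropped using the index form of nearZeroVar; a minimal sketch:

# Sketch: remove near-zero-variance columns if any had been flagged
nzvCols <- nearZeroVar(trainingDataClean)
if (length(nzvCols) > 0) trainingDataClean <- trainingDataClean[, -nzvCols]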

Split the training data into training and validation sets using createDataPartition.

# 70/30 split into a training set (t) and a validation set (v)
inTrain <- createDataPartition(y = trainingDataClean$classe, p = 0.7, list = FALSE)
t <- trainingDataClean[inTrain, ]
v <- trainingDataClean[-inTrain, ]
# Correlation matrix of the 52 numeric predictors in the training split
corr <- cor(t[, 1:52])
library(corrplot)
corrplot(corr, order = "FPC", method = "color", type = "lower", tl.cex = 0.5,
         tl.col = rgb(0, 0, 0))
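
As a sketch (the 0.9 cutoff is an assumed value, not part of the original analysis), caret's findCorrelation can list the predictors that exceed a chosen correlation threshold:

# Sketch: identify highly correlated predictors at an assumed cutoff of 0.9
highCorr <- findCorrelation(corr, cutoff = 0.9)
names(t)[highCorr]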

Train a random forest on the training split, using 5-fold cross-validation.

# Fit a random forest, tuning mtry with 5-fold cross-validation
model <- train(classe ~ ., data = t, method = "rf",
               trControl = trainControl(method = "cv", 5), ntree = 251)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
model
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 10989, 10989, 10990, 10990, 10990 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9911190  0.9887646  0.001678972  0.002124451
##   27    0.9916285  0.9894094  0.002560695  0.003239873
##   52    0.9877705  0.9845304  0.002017683  0.002551378
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
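
As a sketch, the relative influence of the predictors in the fitted random forest can be inspected with caret's varImp:

# Sketch: rank predictors by importance in the final model
varImp(model)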

The out-of-sample error is estimated on the validation set held out earlier.

# Predict on the validation set and compare against the true classes
p <- predict(model, v)
acc <- postResample(p, v$classe)
acc
##  Accuracy     Kappa 
## 0.9942226 0.9926926
confusionMatrix(v$classe,p)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1669    2    2    0    1
##          B    4 1131    4    0    0
##          C    0    5 1020    1    0
##          D    0    0    9  955    0
##          E    0    0    2    4 1076
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9942         
##                  95% CI : (0.9919, 0.996)
##     No Information Rate : 0.2843         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9927         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9938   0.9836   0.9948   0.9991
## Specificity            0.9988   0.9983   0.9988   0.9982   0.9988
## Pos Pred Value         0.9970   0.9930   0.9942   0.9907   0.9945
## Neg Pred Value         0.9991   0.9985   0.9965   0.9990   0.9998
## Prevalence             0.2843   0.1934   0.1762   0.1631   0.1830
## Detection Rate         0.2836   0.1922   0.1733   0.1623   0.1828
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9982   0.9961   0.9912   0.9965   0.9989
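
The estimated out-of-sample error is one minus the validation accuracy, roughly 0.58%; a minimal sketch:

# Out-of-sample error estimate derived from the validation accuracy
oose <- 1 - as.numeric(acc["Accuracy"])
oose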

The predictions for the 20 test cases are obtained by applying the model to the cleaned test data.

# Apply the final model to the cleaned test data
p <- predict(model, testingDataClean)
p
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
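
For submission, each of the 20 predictions could be written to its own text file; a sketch of one common approach (the file-naming scheme is an assumption, not taken from the original analysis):

# Sketch: write each prediction to a separate file named problem_id_<i>.txt
writePredictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(as.character(preds[i]), file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictions(p)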