The given training data is read using read.csv. It was observed that some values are stored as the literal string "NA" rather than as missing values.
setwd("D:\\R")
training<-read.csv("pml-training.csv",na.strings="")
training[training=="NA"]=NA
# Flag columns that are almost entirely missing (more than 19000 NAs out of 19622 rows)
onlyNAs<-sapply(training,function(x) sum(is.na(x))>19000)
trainingDataClean<-training[!onlyNAs]
# Keep only the numeric covariates, then re-attach the outcome
trainingDataClean<-trainingDataClean[sapply(trainingDataClean,function(x) is.numeric(x))]
trainingDataClean$classe<-training$classe
# Drop the row index, raw timestamps and window number, which carry no predictive signal
trainingDataClean<-subset(trainingDataClean,select=-c(X,raw_timestamp_part_1,raw_timestamp_part_2,num_window))
dim(trainingDataClean)
## [1] 19622    53
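As an aside, the two read/replace steps above can be collapsed into a single call by listing both missing-value markers in na.strings; a minimal sketch, with trainingAlt as a hypothetical name:
# Treat both empty strings and the literal string "NA" as missing at read time
trainingAlt <- read.csv("pml-training.csv", na.strings = c("NA", ""))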
The same processing is applied to the test data as well.
testing<-read.csv("pml-testing.csv",na.strings="")
testing[testing=="NA"]=NA
testingDataClean<-testing[!onlyNAs]
testingDataClean<-testingDataClean[sapply(testingDataClean,function(x) is.numeric(x))]
testingDataClean<-subset(testingDataClean,select=-c(X,raw_timestamp_part_1,raw_timestamp_part_2,num_window))
dim(testingDataClean)
## [1] 20 53
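Before modelling, it is worth confirming that the cleaned test set carries the same covariate columns as the cleaned training set; a minimal sketch, assuming the raw test file has a problem_id column in place of classe:
# Only the outcome column (training) and the problem identifier (testing) should differ
setdiff(names(trainingDataClean), names(testingDataClean))
setdiff(names(testingDataClean), names(trainingDataClean))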
Next, check for covariates with near zero variance; such covariates would not be helpful for prediction. It was observed that there are none.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
zeroVar<-nearZeroVar(trainingDataClean,saveMetrics=TRUE)
any(zeroVar$nzv)
## [1] FALSE
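Had any covariates been flagged, they could be dropped using the logical nzv column that nearZeroVar returns when saveMetrics=TRUE; a minimal sketch (a no-op here, since nothing is flagged):
# Remove any near-zero-variance covariates (no columns are dropped in this case)
trainingDataClean <- trainingDataClean[, !zeroVar$nzv]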
Split the training data into a training set and a validation set using createDataPartition.
inTrain<-createDataPartition(y=trainingDataClean$classe,p=0.7,list=FALSE)
t<-trainingDataClean[inTrain,]
v<-trainingDataClean[-inTrain,]
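createDataPartition samples within each class, so the split should preserve the class proportions; a quick check, as a minimal sketch:
# Class proportions should be nearly identical in the training and validation splits
round(prop.table(table(t$classe)), 3)
round(prop.table(table(v$classe)), 3)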
# Correlation matrix of the 52 numeric covariates in the training split
corr<-cor(t[,1:52])
library(corrplot)
corrplot(corr, order = "FPC", method = "color", type = "lower", tl.cex = 0.5, 
         tl.col = rgb(0, 0, 0))
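If the plot reveals strongly correlated covariates, caret's findCorrelation can list candidates for removal; a minimal sketch with an assumed cutoff of 0.9 (not applied to the model below):
# Names of covariates whose pairwise absolute correlation exceeds 0.9
highCorr <- findCorrelation(corr, cutoff = 0.9)
colnames(corr)[highCorr]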
Train a random forest model with 5-fold cross-validation.
model<-train(classe~.,data=t,method="rf",trControl=trainControl(method="cv",5),ntree=251)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
model
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 10989, 10989, 10990, 10990, 10990 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9911190  0.9887646  0.001678972  0.002124451
##   27    0.9916285  0.9894094  0.002560695  0.003239873
##   52    0.9877705  0.9845304  0.002017683  0.002551378
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
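To see which covariates drive the fit, caret's varImp can be applied to the trained model; a minimal sketch:
# Rank covariates by importance in the final random forest and plot the top 20
imp <- varImp(model)
plot(imp, top = 20)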
The out-of-sample error is estimated on the validation data set aside earlier.
p<-predict(model,v)
acc<-postResample(p,v$classe)
acc
##  Accuracy     Kappa 
## 0.9942226 0.9926926
confusionMatrix(v$classe,p)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1669    2    2    0    1
##          B    4 1131    4    0    0
##          C    0    5 1020    1    0
##          D    0    0    9  955    0
##          E    0    0    2    4 1076
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9942         
##                  95% CI : (0.9919, 0.996)
##     No Information Rate : 0.2843         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9927         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9938   0.9836   0.9948   0.9991
## Specificity            0.9988   0.9983   0.9988   0.9982   0.9988
## Pos Pred Value         0.9970   0.9930   0.9942   0.9907   0.9945
## Neg Pred Value         0.9991   0.9985   0.9965   0.9990   0.9998
## Prevalence             0.2843   0.1934   0.1762   0.1631   0.1830
## Detection Rate         0.2836   0.1922   0.1733   0.1623   0.1828
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9982   0.9961   0.9912   0.9965   0.9989
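The expected out-of-sample error follows directly as one minus the validation accuracy; a minimal sketch of the calculation (about 0.58% given the accuracy reported above):
# Estimated out-of-sample error rate on the held-out validation set
oosError <- 1 - as.numeric(acc["Accuracy"])
oosError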
The prediction on the test data is obtained by applying the model to the cleaned test set.
p<-predict(model,testingDataClean)
p
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
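For readability, the predicted classes can be paired with the problem identifiers from the test file; a minimal sketch, assuming the raw test data contains a problem_id column:
# Tabulate each test case identifier with its predicted classe
data.frame(problem_id = testing$problem_id, prediction = p)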