r - How to specify a validation holdout set to caret


I am using caret at most stages of modeling, and it's easy to use resampling methods. However, I'm working on a model whose training set has a fair number of cases added via semi-supervised self-training, and my cross-validation results are skewed because of it. One solution is to use a validation set to measure model performance, but I can't see a way to use a validation set directly within caret - is this missing, or just not supported? I know I can write my own wrappers around caret, but it would be nice if there were a work-around without having to do that.

Here is a trivial example of what I'm experiencing:

> library(caret)
> set.seed(1)
>
> # training/validation sets
> i <- sample(150, 50)
> train <- iris[-i,]
> valid <- iris[i,]
>
> # make model
> tc <- trainControl(method = "cv")
> model.rf <- train(Species ~ ., data = train, method = "rf", trControl = tc)
>
> # model parameters selected using CV results...
> model.rf
100 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 89, 90, 92, ...

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.971     0.956  0.0469       0.0717
  3     0.971     0.956  0.0469       0.0717
  4     0.971     0.956  0.0469       0.0717

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

> # have to manually check against the validation set
> valid.pred <- predict(model.rf, valid)
> table(valid.pred, valid$Species)

valid.pred   setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         20         1
  virginica       0          2        10
> mean(valid.pred == valid$Species)
[1] 0.94

I thought about creating a custom summaryFunction() for the trainControl() object, but I cannot see how to reference the model object in order to make predictions on the validation set (the documentation - http://caret.r-forge.r-project.org/training.html - lists "data", "lev" and "model" as the only possible parameters). For example, this does not work:

tc$summaryFunction <- function(data, lev = NULL, model = NULL){
  data.frame(Accuracy = mean(predict(<model object>, valid) == valid$Species))
}
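As far as I can tell from the documentation, a summary function is only handed the held-out observations and predictions for the current resample (as the obs and pred columns of data), plus the class levels and the method name, so a valid custom summary can only be computed from those. A sketch of one that at least runs (the documented return value is a named numeric vector, not a data frame):

# computes accuracy from the resample's own held-out predictions;
# data$obs holds the observed classes, data$pred the predicted ones
accSummary <- function(data, lev = NULL, model = NULL){
  c(Accuracy = mean(data$pred == data$obs))
}
tc <- trainControl(method = "cv", summaryFunction = accSummary)

It never sees the external validation set, though, which is exactly the problem.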

Edit: In an attempt to come up with an ugly fix, I've been looking to see if I can access the model object from within the scope of the summary function, but I'm not seeing the model stored anywhere. Maybe there is an elegant solution that I'm not even coming close to seeing...

> tc$summaryFunction <- function(data, lev = NULL, model = NULL){
+   browser()
+   data.frame(Accuracy = mean(predict(model, valid) == valid$Species))
+ }
> train(Species ~ ., data = train, method = "rf", trControl = tc)
note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

Called from: trControl$summaryFunction(testOutput, classLevels, method)
Browse[1]> lapply(sys.frames(), function(x) ls(envir = x))
[[1]]
[1] "x"

[[2]]
 [1] "cons"      "contrasts" "data"      "form"      "m"         "na.action" "subset"
 [8] "terms"     "w"         "weights"   "x"         "xint"      "y"

[[3]]
[1] "x"

[[4]]
 [1] "classLevels" "funcCall"    "maximize"    "method"      "metric"      "modelInfo"
 [7] "modelType"   "paramCols"   "ppMethods"   "preProcess"  "startTime"   "testOutput"
[13] "trainData"   "trainInfo"   "trControl"   "tuneGrid"    "tuneLength"  "weights"
[19] "x"           "y"

[[5]]
[1] "data"  "lev"   "model"

Take a look at trainControl. There are options to directly specify which rows of the data are used to model the data (the index argument) and which rows should be used to compute the hold-out estimates (called indexOut). I think that is what you are looking for.

Max
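A minimal sketch of how that could look for the trivial example above (the stacked data frame and the "holdout" list name are just one way to wire it up, not code from the answer):

library(caret)
set.seed(1)

# stack the sets so row positions are known: training rows first,
# then validation rows
i <- sample(150, 50)
train <- iris[-i,]
valid <- iris[i,]
dat <- rbind(train, valid)

# fit on the training rows only, and compute the hold-out estimates
# on the validation rows only
tc <- trainControl(method = "cv",
                   index    = list(holdout = seq_len(nrow(train))),
                   indexOut = list(holdout = nrow(train) + seq_len(nrow(valid))))

model.rf <- train(Species ~ ., data = dat, method = "rf", trControl = tc)
model.rf$results  # Accuracy/Kappa are now computed on the validation set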

