How to pass a customized formula to learing_curve_dat of caret?
train(mpg ~ wt + cyl,
      mtcars,
      method = "lm")
learing_curve_dat(dat = mtcars,
                  outcome = "mpg",
                  test_prop = 1/5,
                  method = "lm",
                  trControl = trainControl(method = "cv",
                                           number = 3,
                                           repeats = 1))
Using train in caret, we can easily customize the formula for lm. How can we pass a customized formula to learing_curve_dat?
Regarding the parameters for train, the documentation explicitly says that "These should not include x, y, formula, or data". How can I work around this constraint? Do I have to create and use my own model to do so?
BTW, why is it called learing_curve_dat, not learning_curve?
Updates:
I found that the train call in learing_curve_dat only supports the default S3 method; to pass a formula as a parameter, we need the S3 method for class 'formula'. I replaced the train call in the source code with the formula method and added form as an argument.
mod <- train(form = form,
             data = dat[in_mod, , drop = FALSE],
             ...)
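Another workaround that avoids patching the package is a small learning-curve helper of your own that accepts a formula. The sketch below is not caret's implementation; learning_curve_formula and its argument defaults are made-up names for illustration, and it only tracks RMSE for regression models:

```r
library(caret)

# Made-up helper: learning-curve data for a user-supplied formula.
learning_curve_formula <- function(form, dat, test_prop = 1/4,
                                   proportion = (1:10)/10, ...) {
  outcome <- all.vars(form)[1]                    # dependent-variable name
  in_test <- createDataPartition(dat[[outcome]], p = test_prop, list = FALSE)
  test_dat  <- dat[in_test, , drop = FALSE]
  train_dat <- dat[-in_test, , drop = FALSE]

  out <- lapply(proportion, function(p) {
    idx <- sample(nrow(train_dat), size = floor(p * nrow(train_dat)))
    mod <- train(form, data = train_dat[idx, , drop = FALSE], ...)
    data.frame(Training_Size = length(idx),
               Train_RMSE = getTrainPerf(mod)$TrainRMSE,
               Test_RMSE  = RMSE(predict(mod, test_dat), test_dat[[outcome]]))
  })
  do.call(rbind, out)
}

# usage, mirroring the call in the question:
# learning_curve_formula(mpg ~ wt + cyl, mtcars, test_prop = 1/5, method = "lm",
#                        trControl = trainControl(method = "cv", number = 3))
```

Because the formula goes straight to train's formula method, any transformation lm accepts (interactions, log terms, etc.) works without touching the x/y interface.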
See also questions close to this topic

Retrieve currency data with oanda in R
I am trying to retrieve data about the currency pair USD/EUR in R:
getSymbols("USD/EUR", src = "oanda")
And I get an error:
Error in open.connection(con, "rb") : HTTP error 404.
Why is it not working?

Why does R not display and recognize all factor levels?
When declaring a variable as a factor, R does not recognize all levels as levels:
dataset$search_term_id <- factor(dataset$search_term_id,
                                 levels = unique(dataset$search_term_id),
                                 nmax = 100000)
There are only 3,000 levels in the dataset. However, I have tried multiple ways, using unique() etc., and R does not display the number of factor levels correctly. Any ideas?
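For reference, nlevels() reports how many levels a factor actually carries even when printed output truncates the list; a quick base-R check on made-up IDs (data illustrative only):

```r
# made-up IDs with ~3,000 distinct values
set.seed(1)
ids <- sample(1:3000, 100000, replace = TRUE)
f <- factor(ids, levels = unique(ids))

nlevels(f)                          # how many levels R actually recorded
nlevels(f) == length(unique(ids))   # TRUE: factor() keeps every distinct value
```

If nlevels() agrees with length(unique(...)), the factor is complete and only the display is truncated.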
How to extract repeating rows in a matrix
In R I have this matrix:
> Y
     [,1] [,2] [,3] [,4]
[1,] "0"  "2"  "9"  "5"
[2,] "4"  "7"  "7"  "3"
[3,] "1"  "5"  "7"  "9"
[4,] "7"  "8"  "7"  "4"
[5,] "7"  "8"  "7"  "4"
[6,] "1"  "1"  "7"  "2"
[7,] "7"  "8"  "7"  "4"
...
From this matrix I want to get all the repeating rows: those that repeat 1 time, 2 times, 3 times and so on.
So for example "7" "8" "7" "4" occurs 3 times in Y. How do I find all the other cases?
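A compact base-R approach is to collapse each row into a key string and tabulate; sketched here on the rows shown above (the full matrix is longer):

```r
# Hypothetical copy of the rows shown in the question
Y <- matrix(c("0","2","9","5",
              "4","7","7","3",
              "1","5","7","9",
              "7","8","7","4",
              "7","8","7","4",
              "1","1","7","2",
              "7","8","7","4"),
            ncol = 4, byrow = TRUE)

keys   <- apply(Y, 1, paste, collapse = " ")  # one string per row
counts <- table(keys)                          # how often each distinct row occurs
counts[counts > 1]                             # rows appearing more than once
```

counts gives the multiplicity of every distinct row at once, so rows repeating 2, 3, ... times can be read off with counts[counts == k].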

How should I convert words that contain both letters and numbers into only numbers so that KNeighbors classifier can train it to classify them?
My training data consists of text like
EMI3776438, U9BA7E, 20FXU84P, 4506067765, N8UZ00351
I am using the KNeighbors classifier algorithm.
Right now, the method I am using is to convert the letters to numbers. For example, a/A would map to 10, b/B would map to 11, and c/C would map to 12. After the conversion, I send this data to the KNeighbors classifier. So, for example, ABI37 becomes 1011I37.
The problem with this method is that both AA and 1010 will map to 1010, and there is no way for the algorithm to differentiate them and classify properly. Is there a good method to convert these to only numbers (since this algorithm only works on numbers) so that the real values are preserved and classification can be done correctly?
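One way to avoid such collisions is a fixed-width code per character, so no two different strings can share an encoding; encode_token below is a made-up helper sketching the idea (shown in R, though the question itself is language-agnostic):

```r
# Give every character a two-digit code: "0"-"9" -> 00-09, "A"-"Z" -> 10-35.
# Because every code has the same width, distinct strings cannot collide.
encode_token <- function(s) {
  chars <- strsplit(toupper(s), "")[[1]]
  codes <- vapply(chars, function(ch) {
    if (ch %in% as.character(0:9)) sprintf("%02d", as.integer(ch))
    else sprintf("%02d", match(ch, LETTERS) + 9)
  }, character(1))
  paste(codes, collapse = "")
}

encode_token("AA")    # "1010"
encode_token("1010")  # "01000100" -- no longer the same as "AA"
```

For a distance-based learner like k-nearest-neighbors, a usually better design is one numeric feature per character position rather than one long concatenated number, since digit concatenation distorts distances (and long tokens would overflow double precision as a single number).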

Best Approach For Feature Selection
I have more than six different tables with more than 300 features (attributes). Now I am a little bit confused about the right approach to selecting features for model building. I am thinking about two processes:
 Pick one attribute at a time, calculate its importance for the output, and add it to the data mart.
 Take all the features from all the tables, calculate their correlation and importance, and remove the less important features.
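For the correlation half of the second option, the caret package ships a helper that flags near-duplicate columns; a small sketch on made-up data:

```r
library(caret)

set.seed(1)
# made-up data where column b nearly duplicates column a
x <- data.frame(a = rnorm(100), c = rnorm(100))
x$b <- x$a + rnorm(100, sd = 0.01)

drop_idx <- findCorrelation(cor(x), cutoff = 0.9)  # indices of columns to drop
names(x)[drop_idx]
```

findCorrelation returns the indices of columns whose pairwise correlation exceeds the cutoff, chosen so that removing them leaves the least-correlated remainder.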

Keyerror: 1 While appending a dataframe/list
I am trying to use the least-variance filter (machine learning) technique to reduce dimensionality. The code I tried is
numeric = dataset
var = numeric.var()
numeric = numeric.head(0)
variable = []
for j in range(0, len(var)):
    if var[j] >= 10:  # setting the threshold as 10%
        variable.append(numeric[j+1])
The error is KeyError: 1.
Since I am not using any dictionaries, what does the error mean and how can I rectify it?

linear regression with a two level factor classification variable
I'm trying to do a linear regression with a two-level factor classification variable:
train.control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# Train the model
model <- train(Classification ~ ., data = ds, method = "lm", trControl = train.control)
Error: wrong model type for classification
Why do I get this error? I tried the same thing yesterday and didn't get any error.
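The error itself means that lm only fits numeric outcomes, so a factor outcome needs a classification method. A hedged sketch using method = "glm" (logistic regression), with made-up data standing in for ds since it isn't shown:

```r
library(caret)

set.seed(1)
# made-up stand-in for the real ds
ds <- data.frame(Classification = factor(sample(c("yes", "no"), 100, replace = TRUE)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

train.control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# glm fits a logistic regression when the outcome is a two-level factor
model <- train(Classification ~ ., data = ds, method = "glm",
               trControl = train.control)
```

If the same call worked yesterday, the likely difference is the outcome column's type: numeric then, factor (or character) now.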

Multiple regression of variables with different units
I'm new to statistical modelling and to R, so please excuse any mistakes in this question.
I want to build a multiple regression model with these variables:
 Revenue (in million USD) as the dependent variable
 Customer experience score (on a scale of 1 to 5) as an independent variable
 Number of package returns (in units) as an independent variable
Since they have different units and the variation is quite big, I'm thinking about standardizing the variables before performing the regression. Would it be better to model with standardized variables or to run the regression directly? I also read from the following source about how to rescale them with R.
But how do I interpret the model if the variables are rescaled and no longer have a certain unit?
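A standardized fit changes only the interpretation, not the predictions: each slope becomes "standard deviations of revenue per one standard deviation of the predictor". A base-R sketch on made-up data (variable names illustrative):

```r
set.seed(42)
# made-up data in the spirit of the question
d <- data.frame(cx_score = runif(200, 1, 5),
                returns  = rpois(200, 50))
d$revenue <- 10 + 8 * d$cx_score - 0.2 * d$returns + rnorm(200, sd = 3)

fit_raw <- lm(revenue ~ cx_score + returns, data = d)
fit_std <- lm(scale(revenue) ~ scale(cx_score) + scale(returns), data = d)

coef(fit_raw)  # per-unit effects (million USD per scale point / per return)
coef(fit_std)  # per-SD effects, comparable across predictors
```

The two fits are equivalent: each standardized slope equals the raw slope times sd(x)/sd(y), and the fitted values are identical, so standardizing is purely a choice of interpretation.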

Positive and negative correlation in linear regression
Can we use linear regression for a problem where one variable is positively correlated and the other variable is negatively correlated? For example, I am trying to predict the wage of a football player and the variables are overall rating and age. As age increases, wage decreases, but as overall rating increases, the wage increases.
Can I apply linear regression to this? As in y = x1*overall + x2*age + c?
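Yes: nothing in linear regression requires the coefficients to share a sign, since each slope is estimated separately. A quick simulation in the spirit of the example (made-up numbers):

```r
set.seed(7)
# simulated players: wage rises with rating, falls with age
overall <- runif(300, 50, 95)
age     <- runif(300, 17, 38)
wage    <- 1000 * overall - 800 * age + rnorm(300, sd = 2000)

coef(lm(wage ~ overall + age))  # positive slope for overall, negative for age
```

The fitted model recovers one positive and one negative coefficient, which is exactly the mixed-direction relationship described.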

Similar sensitivity and specificity but different area under ROC - comparison of different methods with caret
I used the caret package in R to compare different methods (PLS-DA, support vector machine, artificial neural network, random forest) using the same dataset and a stratified 10-fold cross-validation. The dataset has 1394 records. When comparing the results, I noticed that the area under the ROC curve was higher for random forest than for the other models, which had similar sensitivity and specificity. Is that possible, or should models with similar sensitivity and specificity always have a similar area under ROC?
Here is the code for PLS-DA (ANN and linear SVM gave similar results) and random forest:
PLS-DA
Ycalib <- factor(file2[,1121], levels = c("1","0"), labels = c("pregnant","open"))  # create the factor vector
names(Ycalib) <- c("y")
Xcalib <- data.frame(file2[,1126:1663])  # create the data frame with spectral data
set.seed(1001)
folds <- createFolds(Ycalib, k = 10, list = TRUE, returnTrain = TRUE)  # stratified folds for cross-validation
set.seed(1001)
ctrl <- trainControl(method = "repeatedcv", index = folds, classProbs = TRUE,
                     summaryFunction = twoClassSummary, savePredictions = TRUE)
set.seed(1001)
plsda <- train(x = Xcalib,                      # spectral data
               y = Ycalib,                      # factor vector
               method = "pls",                  # PLS-DA algorithm
               tuneLength = 60,                 # number of components
               trControl = ctrl,                # ctrl contains the cross-validation options
               preProc = c("center","scale"),   # the data are centered and scaled
               metric = "ROC")                  # metric is ROC for 2 classes
plsda
Random forest
Ycalib <- factor(file2[,1121], levels = c("1","0"), labels = c("pregnant","open"))  # create the factor vector
names(Ycalib) <- c("y")
Xcalib <- data.frame(file2[,1126:1663])  # create the data frame with spectral data
mtry <- tuneRF(Xcalib, Ycalib, stepFactor = 1)  # automatically set a good value for mtry
mtry
set.seed(1001)
folds <- createFolds(Ycalib, k = 10, list = TRUE, returnTrain = TRUE)
set.seed(1001)
ctrl <- trainControl(method = "repeatedcv", index = folds, classProbs = TRUE,
                     summaryFunction = twoClassSummary, savePredictions = TRUE)
# custom model so that mtry and ntree can both be chosen with a grid in train() below
customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("mtry", "ntree"),
                                  class = rep("numeric", 2),
                                  label = c("mtry", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes
customRF
grid <- expand.grid(mtry = 23, ntree = c(500, 1000))  # set mtry according to the tuneRF results above; I can also change the ntree
set.seed(1001)
rdforest <- train(x = Xcalib,                   # spectral data
                  y = Ycalib,                   # factor vector
                  method = customRF,            # customRF instead of 'rf' so mtry and ntree can be chosen with a grid
                  trControl = ctrl,             # ctrl contains the cross-validation options
                  preProc = c("center","scale"),# the data are centered and scaled
                  metric = "ROC",               # metric is ROC for 2 classes; Accuracy is used for multiple classes
                  tuneGrid = grid)
rdforest
Here are the results:
PLS-DA results
ncomp       ROC       Sens      Spec
   47 0.7382311 0.57758621 0.8119994
Random forest results
mtry ntree       ROC      Sens      Spec
  23   500 0.8434449 0.5896552 0.8158085
PLS-DA Cross-Validated (10 fold, repeated 1 times) Confusion Matrix
          Reference
Prediction pregnant open
  pregnant     24.0 11.0
  open         17.6 47.4
Accuracy (average) : 0.7145
Random forest Cross-Validated (10 fold, repeated 1 times) Confusion Matrix
          Reference
Prediction pregnant open
  pregnant     25.7 10.0
  open         15.9 48.4
Accuracy (average) : 0.7403

final model in "timeslice" validation method from caret
I have a couple of questions about the "timeslice" train control method from caret. Let's pretend that I have to forecast one day ahead. I set the initial window to
nrow(data) - 7 # days
and tuneLength = 10.
 How is the best model chosen? Is a model trained with nrow(data) - 1 as good as one with nrow(data) - 7? In other words, how does the choice of the initial window's width affect model performance?
 For example, say that today I trained a model on the whole dataset with initialWindow = nrow(data) - 7, but it was very time consuming. The next day I have new data and I want to retrain my model with the new observation. Should I retrain the model with initialWindow = nrow(data) - 7? How can I speed up the retraining process?
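For concreteness, here is how the setup described above maps onto trainControl; the data frame and model are made-up placeholders:

```r
library(caret)

set.seed(1)
# made-up daily data standing in for the real series
data <- data.frame(y = rnorm(100), x = rnorm(100))

ctrl <- trainControl(method = "timeslice",
                     initialWindow = nrow(data) - 7,  # train on all but the last 7 days
                     horizon = 1,                     # forecast one day ahead
                     fixedWindow = TRUE)              # slide, don't grow, the window

fit <- train(y ~ x, data = data, method = "lm",
             trControl = ctrl, tuneLength = 10)
```

On the first question: caret averages the chosen metric over all the time slices to pick the tuning parameters, then refits that winning configuration once on the full training set as the final model; the window width controls how many slices exist and how much history each one sees.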

Are the same number of trees required while comparing Random Forest to GBM?
My training set has 13,737 observations with 53 predictors. I need to compare the accuracy of Random Forest and GBM.
For Random Forest, I set ntree = 128 [based on Oshiro et al. (2012)] in
train(y ~ ., data = trainset, method = "rf", ntree = 128)
because the default (500) was taking far too long.
Now in
train(y ~ ., data = trainset, method = "gbm", verbose = FALSE)
I have not changed the default n.trees value (100).
Should I set n.trees in gbm also to 128? Would it be wrong to compare it against random forest otherwise?
Please advise.
Thank you!
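If you do want gbm at 128 trees for a like-for-like budget, note that in caret the tree count is a tuning parameter, so it goes through tuneGrid rather than a direct argument. A sketch, with a made-up trainset standing in for the real one and the other grid values chosen for illustration:

```r
library(caret)

set.seed(1)
# made-up stand-in for the question's trainset
trainset <- data.frame(y  = factor(sample(c("yes", "no"), 500, replace = TRUE)),
                       x1 = rnorm(500), x2 = rnorm(500))

# pin gbm to 128 trees; the remaining grid values are illustrative
gbm_grid <- expand.grid(n.trees = 128,
                        interaction.depth = c(1, 2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

gbm_fit <- train(y ~ ., data = trainset, method = "gbm",
                 tuneGrid = gbm_grid, verbose = FALSE)
```

That said, a random forest's independent bagged trees and gbm's sequential boosted trees are not directly comparable units, so matching the counts equalizes neither capacity nor compute; comparing each model at its own tuned settings is also defensible.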

Building a regression results table
I'm attempting to build a regression results table and I'm stuck. I'm getting the error:
Error in summary(mod)$coefficients[vars, "Estimate"] : subscript out of bounds
I have all these models run and labeled as shown. What I want my table to look like:
           model1L  model2L  model3L  model1P  model2P  model3P
price      coef1L   coef2L   coef3L   coef1P   coef2P   coef3P
           sd1L     sd2L     sd3L     sd1P     sd2P     sd3P
promoflag  coef1L   coef2L   coef3L   coef1P   coef2P   coef3P
           sd1L     sd2L     sd3L     sd1P     sd2P     sd3P
my functions to extract key regression results from an estimated model
model_list = c("model1L", "model2L", "model3L", "model1P", "model2P", "model3P")
vars = c("price", "promoflag")
building the table
results_table1 = function(model_list, vars) {
  # build leftmost column of results table
  outrec = c()
  for (j in 1:length(vars)) {
    outrec = c(outrec, sprintf("%s", vars[j]))
    outrec = c(outrec, "")
  }
  outrec = c(outrec, "R^2")
  outrec = c(outrec, "Observations")
  outdf = as.data.frame(outrec)
  # process each model
  for (i in 1:length(model_list)) {
    # extract estimates for this model
    mod = eval(parse(text = model_list[i]))
    estimates = summary(mod)$coefficients[vars, "Estimate"]
    ses = summary(mod)$coefficients[vars, "Std. Error"]
    pvals = summary(mod)$coefficients[vars, "Pr(>t)"]
    # process each parameter of interest
    outrec = c()
    for (j in 1:length(vars)) {
      # set significance stars
      star = ""
      if (pvals[j] <= .05) {star = "*"}
      if (pvals[j] <= .01) {star = "**"}
      if (pvals[j] <= .001) {star = "***"}
      # output estimate and std err
      outrec = c(outrec, sprintf("%.4f%s", estimates[j], star))
      outrec = c(outrec, sprintf("(%.4f)", ses[j]))
    }
    # add R^2, # of observations to output
    outrec = c(outrec, sprintf("%.4f", summary(mod)$r.squared[1]))
    outrec = c(outrec, sprintf("%d", nobs(mod)))
    outdf = cbind(outdf, outrec)
  }
  # set column names to model names
  names(outdf) = c("", model_list)
  outdf
}
outputting the sample results table
model_list = c("model1L", "model2L", "model3L", "model1P", "model2P", "model3P")
vars = c("price", "promoflag")
outdf = results_table1(model_list, vars)
library(knitr)
kable(outdf, align = 'c')
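One common cause of exactly this error is indexing the coefficient table with a name it doesn't contain; note in particular that the p-value column of an lm summary is spelled "Pr(>|t|)", not "Pr(>t)". A quick check on a built-in dataset:

```r
fit <- lm(mpg ~ wt + cyl, data = mtcars)
colnames(summary(fit)$coefficients)
# "Estimate"  "Std. Error"  "t value"  "Pr(>|t|)"

vars <- c("wt", "cyl")
# indexing with a name that isn't present raises "subscript out of bounds",
# so verify names before subsetting
stopifnot(all(vars %in% rownames(summary(fit)$coefficients)))
pvals <- summary(fit)$coefficients[vars, "Pr(>|t|)"]
```

The same membership check on rownames catches the other failure mode: a variable in vars that one of the six models doesn't actually include.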

lm formula with variable names in it
I want to write a function that takes an lm model, tries to add some feature, and tests its statistical significance. I've given it a go with the following code:
library(rlang)
library(tidyverse)

dataset <- data.frame(y = rnorm(100, 2, 3),
                      x1 = rnorm(100, 0, 4),
                      x2 = rnorm(100, 2, 1),
                      x3 = rnorm(100, 9, 1))
model1 <- lm(y ~ ., data = dataset)
dataset2 <- dataset %>%
  mutate(x10 = rnorm(100, 20, 9),
         x11 = rnorm(100, 3, 3))

test_var <- function(data, var, model){
  y_name <- names(model$model)[1]
  dataset_new <- data %>%
    select_at(vars(y_name, str_remove_all(labels(model), '`'), var))
  model_new <- lm(y_name ~ ., data = dataset_new)
  return(summary(model_new))
}
As you can notice, to create a new model from the available dataset I need to specify which variable should be the dependent variable. However, I don't know this name directly; I need to pull it out of the original model. So I did that in the function above, but it results in an error:
Error in model.frame.default(formula = y_name ~ ., data = dataset_new, : variable lengths differ (found for 'y')
Correct me if I'm wrong, but I believe this is due to y_name being a string, not a symbol. So I have tried the following edits:
test_var <- function(data, var, model){
  y_name <- sym(names(model$model)[1])
  dataset_new <- data %>%
    select_at(vars(!!y_name, str_remove_all(labels(model), '`'), var))
  model_new <- lm(eval(y_name) ~ ., data = dataset_new)
  return(summary(model_new))
}
Although it seems to work, the resulting model is a perfect fit, as y is taken not only as the dependent variable but also as one of the features. Specifying the formula with eval(y_name) ~ . - eval(y_name) doesn't help here. So my question is: how should I pass the dependent variable name to the lm formula to build a correct model?
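One string-friendly route is reformulate(), which builds a formula from character vectors so the response can never leak into the predictors; build_fit is a made-up name for illustration:

```r
# build y ~ x1 + x2 + ... from character names; the response stays off the
# right-hand side by construction
build_fit <- function(data, y_name) {
  form <- reformulate(setdiff(names(data), y_name), response = y_name)
  lm(form, data = data)
}

set.seed(1)
d <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
fit <- build_fit(d, "y")
formula(fit)  # y ~ x1 + x2
```

Because the formula is assembled from strings before lm() ever sees it, no rlang quoting (sym/!!/eval) is needed at all.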
How to extract dependent variable name from lm object in R
As in the topic, I need to extract the name of the dependent variable from an object of class lm. Neither variable.names nor labels does that, as they extract only the regressors' names. One way would be names(model.frame(model))[1], but this also extracts all the data as an intermediate step, which is quite problematic when your dataset is large and you care about speed, especially for such a tiny task. What are the other ways? Here's a minimal example:
dataset <- data.frame(y = rnorm(100),
                      x1 = rnorm(100, 2, 5),
                      x2 = rnorm(100, 33, 1))
model <- lm(y ~ ., data = dataset)
names(model.frame(model))[1]