How to pass a custom formula to learing_curve_dat in caret?
train(mpg ~ wt+cyl,
mtcars,
method = "lm")
learing_curve_dat(dat=mtcars,
outcome="mpg",
test_prop = 1/5,
method="lm",
trControl = trainControl(method = "cv",
number = 3,
repeats = 1))
Using train in caret, we can easily customize the formula for lm. How can we pass a custom formula to learing_curve_dat?
Regarding the parameters for train, the documentation explicitly says that "These should not include x, y, formula, or data". How can I work around this constraint? Do I have to create and use my own model to do so?
By the way, why is it called learing_curve_dat and not learning_curve?
Update:
I found that the train call inside learing_curve_dat only supports the default S3 method. To pass a formula as a parameter, we need the S3 method for class 'formula'. I replaced the train call in the source code with the S3 method for class 'formula' and added form as an argument.
mod <- train(form = form,
             data = dat[in_mod, , drop = FALSE],
             ...)
See also questions close to this topic

R regression: Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) : undefined columns selected
I've been trying to build regression models in different ways, and I was trying out this code using the caret package:
library(caret)
set.seed(222)
ind <- sample(2, nrow(model2), replace = T, prob = c(0.7, 0.3))
train <- model2[ind == 1, ]
test <- model2[ind == 2, ]
custom <- trainControl(method = "repeatedcv",
                       number = 6,
                       repeats = 6,
                       verboseIter = T)
lm <- train(train$SS ~ ., train, method = 'lm', trControl = custom)
lm$results
But I kept receiving this error note:
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
  undefined columns selected
Here is the str() of my data set:

> str(train)
'data.frame': 19 obs. of 15 variables:
 $ SST      : num 0 0 0 0 1 0 0 0 0 1 ...
 $ SSA      : num 0 1 0 0 0 0 0 1 0 0 ...
 $ SSR      : num 0 0 0 1 0 0 0 0 0 0 ...
 $ SSC      : num 0 0 0 0 0 0 0 0 0 1 ...
 $ SSF      : num 1 1 1 0 1 1 1 1 1 1 ...
 $ SSS      : num 0 0 0 1 0 0 0 0 0 0 ...
 $ SST      : num 1 1 1 0 1 1 1 1 1 1 ...
 $ SSH      : num 1 1 1 0 1 1 1 1 1 1 ...
 $ SSC      : num 1 1 1 0 1 1 1 1 1 1 ...
 $ SSW      : num 0 0 0 0 0 0 0 0 0 0 ...
 $ QTY      : num 45 45 49 13 48 109 45 42 45 31 ...
 $ SS       : num 470000 550000 460000 630000 1060000 530000 480000 510000 460000 630000 ...
 $ BASE..SS : num 6.27e+09 6.67e+09 6.14e+09 8.54e+09 1.43e+10 ...
 $ Ex       : num 13341 13341 13341 13341 13341 ...
 $ TPH.     : num 45 45 45 90 95 65 45 45 45 45 ...
Any kind of help will be greatly appreciated. Thank you for taking time to look at this question!

How to execute a function without argument using lapply
I have the following functions:
set.seed(1)
make_seq <- function() {
  paste0(sample(LETTERS, size = 30, replace = TRUE), collapse = "")
}
make_seq()
#> [1] "GJOXFXYRQBFERJUMSZJUYFQDGKAJWI"
It takes no argument and spits out a sequence.
What I want to do is to compactly create 100 sequences with the above function using lapply. But why does this fail?

> lapply(1:100, make_seq())
Error in get(as.character(FUN), mode = "function", envir = envir) :
  object 'GJOXFXYRQBFERJUMSZJUYFQDGKAJWI' of mode 'function' was not found
What's the right way to do it?
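The same mistake is easy to reproduce in Python, which may make the error clearer. This is a sketch by analogy, not R-specific: make_seq below is a hypothetical Python stand-in for the R function, and the point is that writing make_seq() calls the function first and passes its string result where a function is expected.

```python
import random
import string

# Hypothetical Python stand-in for the R make_seq(): a zero-argument
# function returning a random 30-letter string.
def make_seq():
    return "".join(random.choices(string.ascii_uppercase, k=30))

# Wrong (the analogue of lapply(1:100, make_seq())): this would hand map
# the *result string* instead of the function:
#   map(make_seq(), range(100))
# Right: call the function once per element, ignoring the index.
seqs = [make_seq() for _ in range(100)]
```

In R the analogous fix is to wrap the call, e.g. lapply(1:100, function(i) make_seq()), or to use replicate(100, make_seq()).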

ggplot change color of one bar from stacked bar chart
Is there a way to manually change the color of one bar (one x value) in ggplot?
Data:
for_plot_test=structure(list(name = c("A", "B", "C", "A1", "A2", "A3", "A4", "BI", "A", "B", "C", "A1", "A2", "A3", "A4", "BI"), n = c(1L, 3L, 5L, 7L, 9L, 11L, 13L, 15L, 2L, 4L, 6L, 8L, 10L, 12L, 14L, 16L), value = c(0, 0.05, 0, 0.05, 0.05, 0.1, 0.05, 0, 1, 0.7, 0.6, 0.5, 0.4, 0.2, 0.2, 0.1), variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("PROGRESS", "prev_progress"), class = "factor")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 16L), vars = "name", labels = structure(list(name = c("Applications", "BI", "Clients", "CRE & Scoring", "Portfolio & Production", "SG Russia", "Transactions", "УКЛ & Prescoring")), row.names = c(NA, 8L), class = "data.frame", vars = "name", drop = TRUE, indices = list(0:1, 14:15, 6:7, 10:11, 2:3, 12:13, 8:9, 4:5), group_sizes = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), biggest_group_size = 2L, .Names = "name"), indices = list(c(0L, 8L), c(7L, 15L), c(3L, 11L), c(5L, 13L), c(1L, 9L), c(6L, 14L), c(4L, 12L), c(2L, 10L)), drop = TRUE, group_sizes = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), biggest_group_size = 2L, .Names = c("name", "n", "value", "variable"))
Current plot:

colot_progress <- c("#be877a", "#dcbfad")
s <- ggplot(for_plot_test,
            aes(x = reorder(name, n), y = value, fill = variable,
                label = ifelse(for_plot$value == 0, "",
                               scales::percent(for_plot$value)))) +
  geom_bar(stat = 'identity', position = "stack") +
  scale_fill_manual(values = colot_progress, aesthetics = "fill") +
  coord_flip() +
  theme_minimal() +
  theme(axis.title = element_blank(),
        axis.text.x = element_blank(),
        panel.grid = element_blank(),
        legend.position = "none") +
  geom_text(size = 5, position = position_stack(vjust = 0.5))
s

Is there a way to name a tensorflow variable based on the value of another tensorvariable
I want to be able to do the following
n = str(tf.constant(2))
v = tf.get_variable(name=n, shape=(256, 256),
                    initializer=tf.contrib.layers.xavier_initializer())
but doing this converts n into the string representation of the tensor object, i.e.
"<tf.Tensor 'Const_4:0' shape=() dtype=int32>"
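One common workaround, sketched under the assumption that the integer is known as a plain Python value before it is wrapped in a tensor (the names here are illustrative), is to build the name string from the Python value rather than from the tensor object:

```python
# Keep the plain Python value and use it to build the variable name,
# instead of calling str() on the tensor, which yields its repr.
value = 2
name = "var_{}".format(value)  # a clean graph name such as "var_2"

# The tensor is then created from the same value (TensorFlow calls are
# commented out so this sketch stays self-contained):
# c = tf.constant(value)
# v = tf.get_variable(name=name, shape=(256, 256),
#                     initializer=tf.contrib.layers.xavier_initializer())
```

If the value only exists as a tensor, it has to be fetched back into Python first (e.g. via sess.run) before it can be used to build a name.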

MNIST data denormalising does not give me back the same image
This is part of my learning. I understood that normalization really helps improve accuracy, so I divided the MNIST values by 255. This divides every pixel by 255, so all pixels of the 28*28 images have values in the range 0.0 to 1.0.
Now I tried to multiply by 255 again, which should give back the original values. But when I display the picture, the original and denormalised pictures are different.
(trainX, trainY), (testX, testY) = mnist.load_data()
plt.subplot(2, 2, 1)
plt.imshow(trainX[143])
trainX /= 255
plt.subplot(2, 2, 2)
plt.imshow(trainX[143])
trainX *= 255
plt.subplot(2, 2, 3)
plt.imshow(trainX[143])
plt.show()
Output:

What am I missing? Is it something related to the float vs. int data type of the input data?
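A likely culprit is the integer dtype of the arrays returned by mnist.load_data(), combined with imshow's per-image autoscaling. A minimal NumPy sketch (assuming uint8 input, as MNIST provides) shows that converting to float first makes the normalise/denormalise round trip exact:

```python
import numpy as np

# A tiny stand-in for one MNIST image: uint8 pixels in [0, 255].
img = np.array([[0, 128], [200, 255]], dtype=np.uint8)

# Convert to float *before* dividing; an in-place `img /= 255` on a
# uint8 array either raises an error or cannot hold fractional values.
x = img.astype("float32") / 255.0       # normalized to [0.0, 1.0]

# Denormalizing recovers the original pixel values exactly.
restored = (x * 255.0).round().astype(np.uint8)
```

Note also that plt.imshow rescales each float image to its own min/max by default, so a correctly normalized image should look identical anyway; a visible difference usually means the array was still an integer type when it was divided.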

Which is the best way of logging to a file in C?
I'm dealing with a deep learning model written in C, and I want to write a log file that I can check later. The log will have one line per step, and each step takes a few seconds. Sometimes I use a keyboard interrupt to stop the procedure.
The ways I have thought of are:
// Way 1
fp = fopen("log.txt", "a");
for each step:
    fprintf(fp, "Log content\n");
fclose(fp);
I think way 1 may have lower file open/close overhead. But when I use a keyboard interrupt to stop the procedure, the log file will never be closed properly. Is that OK? Or can I pass the file pointer as an argument to my own signal handler?

// Way 2
for each step:
    fp = fopen("log.txt", "a");
    fprintf(fp, "Log content\n");
    fclose(fp);
I think way 2 will have file open/close overhead on every step. Could this slow down overall performance? Could it be significant?

// Way 3
for each step:
    fprintf(buffer, "### Log content ###");
    if step % 100 == 0:
        fp = fopen("log.txt", "a");
        fprintf(fp, buffer);
        fclose(fp);
        flush buffer;
For way 3, I am considering two kinds of buffer:
1. an array of strings
2. one long string with a line feed between items
Which way works best, or does it depend? And if there is a good logging library, could you recommend it to me?

map emmeans from a list of linear models in R
I have a list of over 100 linear models, and I want to extract the estimated means and standard errors from each model.
Let's use mtcars as an example.

library(tidyverse); library(magrittr); library(emmeans)

mtcars %<>% mutate(cyl = as.factor(cyl))
df <- mtcars %>% select(cyl, hp, mpg)
I can easily get the estimated means and standard errors for one model with emmeans:

mod <- lm(hp ~ cyl, data = df)
emmeans(mod, "cyl")
But what if I have a list of models?
list_lm <- df %>%
  select(-c(cyl)) %>%
  map(function(dv) lm(dv ~ df$cyl, data = .))
I cannot use:

emmeans(list_lm$hp, "cyl")
Error in ref_grid(object, ...) :
  Perhaps a 'data' or 'params' argument is needed
And ideally, I want something that would give me these statistics for all the models, something like broom::tidy for the coefficients of the model, but for emmeans:

list_lm %>% map(broom::tidy)

Calculating a correction factor
I have a variable that I suspect is influenced by Temperature, and I'd like to calculate a correction factor that accounts for the effect of temperature.
So, given a time series of temperature data:
Temp <- c(23.545, 23.475, 23.382, 23.328, 23.251, 23.247, 23.241, 23.227,
          23.146, 23.133, 23.127, 23.567, 23.561, 23.521, 23.496, 23.348,
          23.274, 23.270, 23.258, 23.244, 23.158, 23.152, 23.132, 23.123,
          23.083, 23.025, 22.999, 22.666, 22.330, 22.072, 21.794, 21.532,
          21.063, 20.742, 19.183, 18.556, 17.165, 15.233, 13.844, 12.818,
          12.236, 11.914)
And the variable in question:
var <- c(0.080, 0.003, 0.018, 0.035, 0.005, 0.023, 0.080, 0.035, 0.065,
         0.055, 0.030, 0.038, 0.010, 0.013, 0.018, 0.033, 0.028, 0.105,
         0.085, 0.010, 0.018, 0.065, 0.048, 0.013, 0.103, 0.013, 0.002,
         0.053, 0.018, 0.080, 0.057, 0.083, 0.060, 0.085, 0.158, 0.155,
         0.232, 0.245, 0.390, 0.400, 0.568, 0.508)
I can plot the two together to see the effect of Temp on var:

plot(Temp, var)
I have one Temp observation at 22.330 that I know is correct. So I'd like to adjust the var data with some sort of correction factor, by assuming either:
A - all the var values are close to the same value, and most of the variance is due to Temp
B - the point at 22.330 is correct, and values above and below it should be corrected accordingly

How to increase the model accuracy of multiple linear regression
This is the custom code
# Custom model for multiple linear regression
import numpy as np
import pandas as pd

dataset = pd.read_csv("50s.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4:5].values

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
x[:, 3] = lb.fit_transform(x[:, 3])

from sklearn.preprocessing import OneHotEncoder
on = OneHotEncoder(categorical_features=[3])
x = on.fit_transform(x).toarray()
x = x[:, 1:]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=1/5, random_state=0)

con = np.matrix(X_train)
z = np.matrix(y_train)

# training model
result1 = con.transpose() * con
result1 = np.linalg.inv(result1)
p = con.transpose() * z
f = result1 * p

l = []
for i in range(len(X_test)):
    temp = (f[0] * X_test[i][0] + f[1] * X_test[i][1] + f[2] * X_test[i][2]
            + f[3] * X_test[i][3] + f[4] * X_test[i][4])
    l.append(temp)

import matplotlib.pyplot as plt
plt.scatter(y_test, l)
plt.show()
Then I created a model with scikit-learn and compared its results with y_test and l (the predicted values from the code above). The comparisons are as follows:

for i in range(len(prediction)):
    print(y_test[i], prediction[i], l[i], sep=' ')

103282.38 103015.20159795816 [[116862.44205399]]
144259.4 132582.27760816005 [[118661.40080974]]
146121.95 132447.73845175043 [[124952.97891882]]
77798.83 71976.09851258533 [[60680.01036438]]

These are the comparisons between y_test, the scikit-learn model predictions, and the custom code predictions. Please help me improve the accuracy of the model.
Blue: custom model predictions
Yellow: scikit-learn model predictions
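For reference, the fit the custom code is attempting is the ordinary least-squares normal equation, beta = (X'X)^(-1) X'y. Below is a minimal self-contained sketch on synthetic data (the matrix sizes and names are illustrative, not from the 50s.csv dataset) that also adds the intercept column the custom code appears to omit:

```python
import numpy as np

# Synthetic regression problem with a known coefficient vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.hstack([np.ones((50, 1)), X])        # intercept column of ones
true_beta = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ true_beta

# Normal equation: solve (X'X) beta = X'y. np.linalg.solve is preferred
# over explicitly inverting X'X, which is what the custom code does.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
pred = X @ beta_hat
```

With noiseless data, beta_hat recovers true_beta exactly; with real data, a remaining gap to scikit-learn's predictions is often a missing intercept or a column-ordering mismatch rather than the algebra itself.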
What does "~ ." do in R caret dummyVars?
I know ~. is used to include all the available variables in a data frame when training a machine learning model. My situation is a bit different. I had to join the test and train data in order to do some feature engineering work. Before combining, I had to remove the y variable from the train dataset, since the test dataset doesn't have that column. Now, since there are a lot of factor variables with many levels, I thought I would use the one-hot encoding technique on the entire combined dataset using the caret package. Here is my code:
# Creating dummy variables is converting a categorical variable to as
# many binary variables as there are categories.
dummies_model <- dummyVars(" ~ .", data = trial)

# Create the dummy variables using predict. The Y variable (Purchase)
# will not be present in trainData_mat.
trainData_mat <- predict(dummies_model, newdata = trial)
My question is: what does ~ . in the code do? The initial tutorial I got this code from used the y variable, like y ~ ., and since I don't have the y variable in this data frame, I used another example where the author had used ~ ., and it works. I would like to know what the tilde means in this situation so I can verify the work done. Thanks a lot.
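As for what dummyVars(" ~ .", data = trial) does: the ~ . formula means "every column", so each factor column is expanded into one binary column per level. By analogy (an illustration in Python with a made-up data frame, not caret's API), pandas.get_dummies performs the same expansion:

```python
import pandas as pd

# Made-up categorical data frame standing in for `trial`.
trial = pd.DataFrame({"size": ["S", "M", "S"],
                      "color": ["red", "blue", "red"]})

# One binary column per category level, for every column -- the
# counterpart of dummyVars(" ~ .", data = trial) plus predict().
encoded = pd.get_dummies(trial)
```

Each original column name becomes a prefix, so the encoded frame has columns like size_M, size_S, color_blue, color_red.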
Using caret with recipes is leading to difficulties with resample
I've been using recipes to pipe into caret::train, which has been going well, but now that I've tried some step_transforms, I'm getting the error:

Error in resamples.default(model_list) :
  There are different numbers of resamples in each model
when I compare models with and without the transformations. The same code with step_center and step_scale works fine.

library(caret)
library(tidyverse)
library(tidymodels)

formula <- price ~ carat
model_recipe <- recipe(formula, data = diamonds)
quadratic_model_recipe <- recipe(formula, data = diamonds) %>%
  step_poly(all_predictors())

model_list <- list(
  linear_model = NULL,
  quadratic = NULL
)

model_list$linear_model <- model_recipe %>%
  train(data = diamonds,
        method = "lm",
        trControl = trainControl(method = "cv"))

model_list$quadratic_model <- quadratic_model_recipe %>%
  train(data = diamonds,
        method = "lm",
        trControl = trainControl(method = "cv"))

resamp <- resamples(model_list)

Loop lm and msmFit extracting coefficients
I have an xts object with 81 variables. Of these, I need to extract 25 that share a common string. For each element of this subset, I need to run the following estimation, which works for one element (PortAvilliq#): first an lm, then msmFit.
Port1 <- lm(PortAvilliq1 ~ 1, data = ger_ts)

# The msm
summary(msmPort1 <- msmFit(Port1, 2, sw = rep(TRUE, 2)))

# Two variables to determine the greater and smaller coefficients
Port1HighIll <- ifelse(msmPort1@Coef[1,] > msmPort1@Coef[2,],
                       msmPort1@Coef[1,], msmPort1@Coef[2,])
Port1LowIll <- ifelse(msmPort1@Coef[1,] < msmPort1@Coef[2,],
                      msmPort1@Coef[1,], msmPort1@Coef[2,])

# The associated probabilities
Port1ProbLow <- ifelse(msmPort1@Coef[1,] > msmPort1@Coef[2,],
                       msmPort1@transMat[2,2], msmPort1@transMat[1,1])
Port1ProbHigh <- ifelse(msmPort1@Coef[1,] > msmPort1@Coef[2,],
                        msmPort1@transMat[1,1], msmPort1@transMat[2,2])

# The main variable of interest
Port1EIll <- Port1LowIll * Port1ProbLow + (Port1HighIll - Port1LowIll) * Port1ProbHigh
How can I do it?

Saving lm model outputs in data frame
I have a data set with 14 variables: 2 are products with their sales, and the other 12 are month dummy-coded variables.
I first tried the following model with the first product in the table:

mod <- lm(p1[,1] ~ ene + feb + mar + abr + may + jun + jul + ago + sep + oct + nov + dic, p1)
I wish to save the name of the product and the model in a format like the following table:

Name      | Model
Product 1 | Product1 = 2.3 + 0.2*jan etc
Later, I want to write a loop to run the same operation on 1000 different products and fill that table. Does someone have any idea how I can do that?

select nonmissing variables in a purrr loop
Consider this example
mydata <- data_frame(ind_1 = c(NA, NA, 3, 4),
                     ind_2 = c(2, 3, 4, 5),
                     ind_3 = c(5, 6, NA, NA),
                     y = c(28, 34, 25, 12),
                     group = c('a', 'a', 'b', 'b'))

> mydata
# A tibble: 4 x 5
  ind_1 ind_2 ind_3     y group
  <dbl> <dbl> <dbl> <dbl> <chr>
1    NA     2     5    28 a
2    NA     3     6    34 a
3     3     4    NA    25 b
4     4     5    NA    12 b
Here I want, for each group, to regress y on whatever variables are not missing in that group, and to store the corresponding lm object in a list-column. That is:
- for group a, these variables are ind_2 and ind_3
- for group b, they are ind_1 and ind_2
I tried the following, but it does not work:

mydata %>%
  group_by(group) %>%
  nest() %>%
  do(filtered_df <- . %>% select(which(colMeans(is.na(.)) == 0)),
     myreg = lm(y ~ names(filtered_df)))
Any ideas? Thanks!