Why does it take longer to train my Elastic Net model with caret vs glmnet?
I am fitting an Elastic Net model to a very wide matrix. I like the preprocessing functions in caret, but I have found that it takes about 5 times longer to train than if I just use glmnet directly. Why?
# Sample data.
set.seed(123)
trainX <- replicate(1000, rnorm(30))
colnames(trainX) <- paste0("var", 1:1000)
trainOutcome <- gl(2, 15)

# Train model using glmnet.
alpha_to_test <- seq(0, 1, 0.1)
system.time({
  sapply(alpha_to_test, function(a) {
    fit <- glmnet::cv.glmnet(
      x = trainX, y = as.numeric(trainOutcome),
      alpha = a,
      nfolds = nrow(trainX) # LOOCV
    )
  })
}) # 7.272s
# Train model using caret over the same search space.
fit <- glmnet::cv.glmnet(trainX, y = as.numeric(trainOutcome))
lambda_to_test <- fit$lambda
grid <- expand.grid(alpha = alpha_to_test, lambda = lambda_to_test)
system.time({
  fit <- caret::train(
    x = trainX, y = trainOutcome,
    method = "glmnet",
    trControl = trainControl(method = "LOOCV", selectionFunction = "oneSE"),
    tuneGrid = grid
  )
}) # 45.316s
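One contributor can be ruled out with simple arithmetic: the number of underlying glmnet fits is nearly the same in both approaches. A back-of-envelope sketch (in Python, purely for the arithmetic; the per-resample fit count is an assumption based on caret's submodel handling for glmnet, where one fitted path per alpha covers every lambda in the grid):

```python
# Back-of-envelope count of underlying glmnet fits in each approach.
n_obs = 30    # rows in trainX
n_alpha = 11  # seq(0, 1, 0.1)

# cv.glmnet loop: per alpha, one fit per LOOCV fold plus one full-data fit.
fits_glmnet = n_alpha * (n_obs + 1)

# caret with LOOCV: assumed one fit per (alpha, resample), plus the final model.
fits_caret = n_alpha * n_obs + 1

print(fits_glmnet, fits_caret)  # 341 331
```

If that assumption holds, the extra wall-clock time comes not from extra model fits but from caret's per-resample bookkeeping: predicting the full alpha x lambda grid, aggregating summary metrics, and applying the oneSE selection rule on every one of the 30 resamples.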
See also questions close to this topic

R error while running ARIMA for Time Series Forecasting
I am getting the error below after running an ARIMA model. My code and the error are as follows; please help me resolve this.
> model <- arima(log(gas_train), order = c(8,1,2),
                 seasonal = list(order = c(8,1,2), period = 12))
Error in optim(init[mask], armafn, method = optim.method, hessian = TRUE, :
  non-finite finite-difference value [17]
In addition: Warning messages:
1: In log(s2) : NaNs produced
2: In log(s2) : NaNs produced
3: In log(s2) : NaNs produced
4: In log(s2) : NaNs produced
5: In log(s2) : NaNs produced

R, HPD (Highest posterior density) interval based on samples from posterior, WinBUGS
How do I calculate an HPD (highest posterior density) interval from posterior samples? I have four parameters and I generated 1000 samples from the posterior distribution of each. How can I calculate the HPD interval in R? I used the coda package, but I got the following error:
HPDinterval(winbugsresult$sims.list, prob = 0.05)
Error in UseMethod("HPDinterval") :
  no applicable method for 'HPDinterval' applied to an object of class "list"
where "winbugsresult" is a list that contains posterior samples.
I also tried a single vector and got the following error:
HPDinterval(winbugsresult$sims.list$alpha, prob = 0.05)
Error in UseMethod("HPDinterval") :
  no applicable method for 'HPDinterval' applied to an object of class "c('double', 'numeric')"
I also tried just a random vector from a normal distribution and got the error again:
HPDinterval(rnorm(100))
Error in UseMethod("HPDinterval") :
  no applicable method for 'HPDinterval' applied to an object of class "c('double', 'numeric')"

If statement with employees data
I have a data set containing data about employees in a company. You can see the data below:
# Data
output_test <- data.frame(
  Employees = c(1, 2, 3, 10, 15, 122, 143, 150, 250, 300, 500, 1000)
)
The next step is classification: I need to classify employees by company size, where the number of employees determines the size. For example, if the number is below 10 the company is a "micro" company; if it is greater than 10 but less than or equal to 50, it is a "small" company; a "medium" company has more than 50 but at most 250 employees; and a "large" company has more than 250 employees. To do this I wrote the following if/else code:
# Code
library(dplyr)
output_test_final <- output_test %>%
  mutate(
    Size = if (Employees >= 10) {
      "Micro"
    } else {
      if (Employees >= 50) {
        "Small"
      } else {
        if (Employees >= 250) {
          "Medium"
        } else {
          "Large"
        }
      }
    }
  )
The results from this code are not right. Can anybody help me fix it so that I get a table like the one below?
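The banding rule above is a plain interval lookup. As an illustration of the thresholds (sketched here in Python with pandas rather than R; the boundary handling — micro below 10, small up to 50, medium up to 250, large above 250 — is an assumption where the question leaves exact cut-offs ambiguous):

```python
import pandas as pd

employees = pd.Series([1, 2, 3, 10, 15, 122, 143, 150, 250, 300, 500, 1000])

# Right-closed bins: (0, 9] -> Micro (below 10), (9, 50] -> Small,
# (50, 250] -> Medium, (250, inf] -> Large.
size = pd.cut(
    employees,
    bins=[0, 9, 50, 250, float("inf")],
    labels=["Micro", "Small", "Medium", "Large"],
)

print(list(size))
```

The same idea in R would use `cut()` or `dplyr::case_when()`; the key point is that a vectorized interval lookup replaces the scalar `if`/`else`, which only inspects the first element of the column.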

How to speed up nested loops for groupby multiindex
I have two MultiIndex dataframes, panel1 and panel2. Both have the same level-0 index (the dates) but different level-1 indexes; see the sample code below:
# panel1:
idx1 = pd.MultiIndex.from_product(
    [['20170502', '20170503', '20170504'], ['id1', 'id2', 'id3']],
    names=['Dates', 'id'])
panel1 = pd.DataFrame(np.random.randn(9, 2), index=idx1, columns=['ytm', 'mat'])

# panel2:
idx2 = pd.MultiIndex.from_product(
    [['20170502', '20170503', '20170504'], ['0.5', '1.5', '2.5']],
    names=['Dates', 'yr'])
panel2 = pd.DataFrame(np.random.randn(9), index=idx2, columns=['curve'])
I want to loop over the two panels by Dates (the level-0 index). For each day (e.g. '20170502'), I search for the mat of each id/row of panel1 in the yr column of panel2; if there is a match, I want to take the corresponding curve value from panel2 and add it as a new column (named CDB) in panel1.
My current code is as follows:
group1 = panel1.groupby(level=0)
group2 = panel2.groupby(level=0)
lst = []
for ytm in group1:                        # loop over each day
    for yr in group2:                     # loop over each day
        df_ytm = ytm[1]                   # get df of id, ytm & mat
        df_ytm = df_ytm.assign(CDB=np.nan)  # add a col of nan, later replaced by matched curve values
        df_curve = yr[1].reset_index()      # drop the index to match yr with t_mat
        df_curve.yr = df_curve.yr.astype(float)
        for i in range(df_ytm.shape[0]):    # loop over each row
            if (df_ytm.iloc[i, 1] == df_curve.yr).any():  # search if each 'mat' value is in the 'yr' column
                # if matched, set 'CDB' as the curve value
                df_ytm.iloc[i, 2] = df_curve[df_curve.yr.isin([df_ytm.t_mat[i]])].curve.values
        lst.append(df_ytm)                # collect the modified 'df_ytm' (with matched 'CDB')
The code works on a small sample, but I have a huge panel1 (about 800 days by 10,000 ids) and a big panel2 as well, so the code has been running for more than 24 hours.
I wonder how I could rewrite the code (with vectorization where possible) to speed it up?
Any comments would be much appreciated!
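For reference, this kind of per-date lookup can usually be expressed as a single merge instead of nested loops. A sketch under the assumption that mat and yr are comparable after casting yr to float (column names follow the question; the sample values here are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small stand-ins for panel1 and panel2 with known 'mat' values.
idx1 = pd.MultiIndex.from_product(
    [['20170502', '20170503'], ['id1', 'id2', 'id3']], names=['Dates', 'id'])
panel1 = pd.DataFrame({'ytm': rng.standard_normal(6),
                       'mat': [0.5, 1.5, 9.9, 2.5, 0.5, 9.9]}, index=idx1)

idx2 = pd.MultiIndex.from_product(
    [['20170502', '20170503'], ['0.5', '1.5', '2.5']], names=['Dates', 'yr'])
panel2 = pd.DataFrame({'curve': rng.standard_normal(6)}, index=idx2)

# Align the join keys: bring the indexes into columns and make
# 'yr' numeric so it is comparable with 'mat'.
left = panel1.reset_index()
right = panel2.reset_index()
right['yr'] = right['yr'].astype(float)

# One left merge on (date, maturity) replaces all three loops;
# unmatched rows get NaN in the new 'CDB' column.
merged = left.merge(right, left_on=['Dates', 'mat'],
                    right_on=['Dates', 'yr'], how='left')
result = merged.set_index(['Dates', 'id'])[['ytm', 'mat']].assign(
    CDB=merged['curve'].values)
```

Because the merge is implemented in vectorized C code, it scales to the 800-day by 10,000-id case far better than row-by-row `iloc` assignment.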

Does using reduce inside a for loop make the code O(n) complexity, or higher, like O(n^2) or another kind?
For an interview I was asked to do some exercises, and the third one was the following:
We have an unknown quantity of elements in a vector/array v1 with random integer numbers.
- Make a v2 vector/array of the same length as v1, in which v2[k] is the product of all the elements of v1 except v1[k].
- Try to do it without the division operator and with complexity O(n).
And I wrote the following code:
const v1 = [4, 2, 7, 8, 6, 7, 9, 3, 2, 6, 7]; // it's just an example array
const l = v1.length;
let v2 = [];
for (let i = 0; i < l; i++) {
  // save the element at the front of the array; it will be excluded from the product
  let segment = v1.splice(0, 1);
  let product = v1.reduce((total, number) => {
    return total * number;
  }, 1);
  v2.push(product); // add the result to the v2 array at the position of the number of v1 array
  v1.push(segment); // necessary to add the segment back to v1 to keep the original length
}
console.log('v2', v2);

/* Results
Reference product of all array: 42674688
product - position 00: 10668672
product - position 01: 21337344
product - position 02:  6096384
product - position 03:  5334336
product - position 04:  7112448
product - position 05:  6096384
product - position 06:  4741632
product - position 07: 14224896
product - position 08: 21337344
product - position 09:  7112448
product - position 10:  6096384
*/
My question is: is my code O(n) complexity, or O(n^2), or some other kind of complexity?
Thanks
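For comparison: the loop above runs reduce (an O(n) pass) once per element, so it is O(n^2) overall. The standard O(n) solution uses prefix and suffix products, sketched here in Python rather than JavaScript:

```python
def product_except_self(v1):
    """O(n) product-of-all-except-self using prefix and suffix products."""
    n = len(v1)
    v2 = [1] * n
    # First pass: v2[k] becomes the product of everything to the left of k.
    prefix = 1
    for k in range(n):
        v2[k] = prefix
        prefix *= v1[k]
    # Second pass: multiply in the product of everything to the right of k.
    suffix = 1
    for k in range(n - 1, -1, -1):
        v2[k] *= suffix
        suffix *= v1[k]
    return v2

print(product_except_self([4, 2, 7, 8, 6, 7, 9, 3, 2, 6, 7])[0])  # 10668672
```

Two linear passes and no division, so the whole thing is O(n) time and O(1) extra space beyond the output array.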

Why is Update Layer Tree taking so much time on Chrome, whereas Firefox works fine?
My web application works fine on Firefox and other browsers, but on Chrome, particularly Chrome Mobile, the same app is almost unusable.
On investigation, it turned out that Update Layer Tree is what takes the time on Chrome, which makes scrolling very slow. Below are the details of the time consumption in the Chrome and Firefox browsers.
Chrome Performance Call Tree:
Firefox Performance Call Tree:
I wonder why Update Layer Tree (or an equivalent job) does not consume the same time on Firefox. And how can I optimize this for the Chrome browser?

Error while fitting data to the model: TypeError: 'NoneType' object is not callable
I am using this code for the generator:
train_generator = train_datagen.flow_from_directory(
    "C:\\Users\\Rahul\\Desktop\\AI\\Audio\\Classifying music notes\\output\\train",
    batch_size = 20,
    class_mode = 'binary',
    target_size = (864, 432))

validation_generator = test_datagen.flow_from_directory(
    "C:\\Users\\Rahul\\Desktop\\AI\\Audio\\Classifying music notes\\output\\val",
    batch_size = 20,
    class_mode = 'binary',
    target_size = (864, 432))
I get this error when I run the code below: TypeError: 'NoneType' object is not callable
history = model.fit_generator(
    train_generator,
    validation_data = validation_generator,
    steps_per_epoch = 100,
    epochs = 100,
    validation_steps = 50,
    verbose = 2,
    callbacks = [callbacks])

Problems initializing model in pytorch
I can't initialize my model in PyTorch; I get:
TypeError                                 Traceback (most recent call last)
<ipython-input-82-9bfee30a439d> in <module>()
    288 dataset = News_Dataset(true_path=args.true_news_file, fake_path=args.fake_news_file,
    289                        embeddings_path=args.embeddings_file)
--> 290 classifier = News_classifier_resnet_based().cuda()
    291 try:
    292     classifier.load_state_dict(torch.load(args.model_state_file))

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

TypeError: forward() missing 1 required positional argument: 'input'
Someone asked for the code; it is given below:
class News_classifier_resnet_based(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.activation = torch.nn.ReLU6()
        self.sigmoid = torch.nn.Sigmoid()
        self.positional_encodings = PositionalEncoder()
        self.resnet = list(torch.hub.load('pytorch/vision:v0.6.0', 'resnet18',
                                          pretrained=True).children())
        self.to_appropriate_shape = torch.nn.Conv2d(in_channels=1, out_channels=1,
                                                    kernel_size=77)
        self.conv1 = torch.nn.Conv2d(in_channels=1, out_channels=64, kernel_size=7,
                                     stride=2, padding=3)
        self.conv1.weight = torch.nn.Parameter(self.resnet[0].weight[:, 0, :, :].data)
        self.center = torch.nn.Sequential(*self.resnet[1:2])
        self.conv2 = torch.nn.Conv2d(in_channels=512, out_channels=1, kernel_size=1)
        self.conv3 = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=7)
        self.title_conv = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=3),
            self.activation(),
            torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, stride=2),
            self.activation(),
            torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, stride=2)
        )
        self.title_lin = torch.nn.Linear(25, 1)
        self.year_lin = torch.nn.Linear(10, 1)
        self.month_lin = torch.nn.Linear(12, 1)
        self.day_lin = torch.nn.Linear(31, 1)
        self.date_lin = torch.nn.Linear(3, 1)
        self.final_lin = torch.nn.Linear(3, 1)

    def forward(self, x_in):
        # input shape  - (batch_size, 3 + title_len + seq_len, embedding_dim)
        # output shape - (batch_size, 1)
        year = x_in[:, 0, :10]
        month = x_in[:, 1, :12]
        day = x_in[:, 2, :31]
        title = x_in[:, 3:3 + args.title_len, :]
        text = x_in[:, 3 + args.title_len:, :]
        title = self.positional_encodings(title)
        text = self.positional_encodings(text)
        text = text.unsqueeze(1)
        text = self.activation(self.to_appropriate_shape(text))
        text = self.activation(self.conv1(text))
        text = self.activation(self.center(text))
        text = self.activation(self.conv2(text))
        text = self.activation(self.conv3(text))
        text = text.reshape(args.batch_size, 1)
        title = title.unsqueeze(1)
        title = self.activation(self.title_conv(title))
        title = title.reshape(args.batch_size, 1)
        title = self.activation(self.title_lin(title))
        year = self.activation(self.year_lin(year))
        month = self.activation(self.month_lin(month))
        day = self.activation(self.day_lin(day))
        date = torch.cat([year, month, day], dim=1)
        date = self.activation(self.date_lin(date))
        final = torch.cat([date, title, text], dim=1)
        final = self.sigmoid(self.final_lin(final))
        return final

classifier = News_classifier_resnet_based().cuda()
What should I do? Stack Overflow asked for more details: I'm trying to classify texts using word embeddings, and the problem lies in the last line. I am working in Google Colab. Also, when I created some models in other code blocks, I had no problems.

CNN-LSTM model on a non-image dataset
I have a dataset in this form:

name    project_grade  project_summary    Cost_on_project  Subject    Topics  Rating
Wilcox  7              this project is..  455$             Chemistry  Atomic  4.2
Its shape is (51722, 7).
I have been asked to apply a CNN-LSTM model to this dataset. How is that possible? It's not an image dataset like MNIST, so how can I apply a CNN-LSTM model here?
The basic code for a CNN-LSTM looks like this:
model = Sequential()
model.add(TimeDistributed(Conv1D(32, kernel_size=(3, 3), padding='same'),
                          input_shape=(frames, 224, 224, 3)))
model.add(TimeDistributed(Activation('relu')))
model.add(TimeDistributed(Conv1D(32, (3, 3))))
model.add(TimeDistributed(Activation('relu')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Dropout(0.25)))
model.add(TimeDistributed(Dense(512)))
model.add(TimeDistributed(Flatten()))
model.add(LSTM(20, return_sequences=True, name="lstm_layer_rgb"))
model.add(TimeDistributed(Dense(num_classes), name="time_distr_dense_one_rgb"))
model.add(GlobalAveragePooling1D(name="global_avg_rgb"))
But what input_shape, kernel_size and other changes do I need to make for this dataset?

Caret confusionMatrix measures are wrong?
I made a function to compute sensitivity and specificity from a confusion matrix, and only later found out the caret package has one, confusionMatrix(). When I tried it, things got very confusing, as it appears caret is using the wrong formulae?? Example data:
dat <- data.frame(real = as.factor(c(1,1,1,0,0,1,1,1,1)),
                  pred = as.factor(c(1,1,0,1,0,1,1,1,0)))
cm <- table(dat$real, dat$pred)
cm
    0 1
  0 1 1
  1 2 5
My function:
model_metrics <- function(cm){
  # accuracy = ratio of correctly labelled subjects to the whole pool = (TP+TN)/(TP+FP+FN+TN)
  acc <- (cm[1] + cm[4]) / sum(cm[1:4])
  # sensitivity/recall = ratio of correctly +ve labelled to all who are +ve in reality = TP/(TP+FN)
  sens <- cm[4] / (cm[4] + cm[3])
  # specificity = ratio of correctly -ve labelled cases to all who are -ve in reality = TN/(TN+FP)
  spec <- cm[1] / (cm[1] + cm[2])
  # error = all incorrect / all
  err <- (cm[2] + cm[3]) / sum(cm[1:4])
  metrics <- data.frame(Accuracy = acc, Sensitivity = sens,
                        Specificity = spec, Error = err)
  return(metrics)
}
Now compare the results of confusionMatrix() to those of my function:

library(caret)
c_cm <- confusionMatrix(dat$real, dat$pred)
c_cm
          Reference
Prediction 0 1
         0 1 1
         1 2 5

c_cm$byClass
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision    Recall
  0.3333333   0.8333333      0.5000000      0.7142857 0.5000000 0.3333333

model_metrics(cm)
   Accuracy Sensitivity Specificity     Error
1 0.6666667   0.8333333   0.3333333 0.3333333
Sensitivity and specificity seem to be swapped around between my function and confusionMatrix(). I assumed I had used the wrong formulae, but I double-checked on Wikipedia and I was right. I also double-checked that I was calling the right values from the confusion matrix, and I'm pretty sure I am. The caret documentation also suggests it is using the correct formulae, so I have no idea what's going on.
Is the caret function wrong, or (more likely) have I made some embarrassingly obvious mistake?
How to improve computing time performance when using caret to train a model over large datasets
I am working with caret's train() function to develop a support vector machine model. My dataset Matrix has a considerable number of rows (255099) and few columns/variables (8 including the response/target variable). The target variable has 10 groups and is a factor. My issue is the speed of training the model. My dataset Matrix is generated below, along with the code I used for the model. I have also used parallel processing to make it faster, but it is not working.

# Libraries
library(rsample)
library(caret)
library(dplyr)
library(doParallel)

# Original dataframe
set.seed(1854)
Matrix <- data.frame(Var1 = rnorm(255099, mean = 20, sd = 1),
                     Var2 = rnorm(255099, mean = 30, sd = 10),
                     Var3 = rnorm(255099, mean = 15, sd = 11),
                     Var4 = rnorm(255099, mean = 50, sd = 12),
                     Var5 = rnorm(255099, mean = 100, sd = 20),
                     Var6 = rnorm(255099, mean = 180, sd = 30),
                     Var7 = rnorm(255099, mean = 200, sd = 50),
                     Target = sample(1:10, 255099,
                                     prob = c(0.15, 0.1, 0.1,
                                              0.15, 0.1, 0.14,
                                              0.10, 0.05, 0.06,
                                              0.05), replace = TRUE))

# Format target variable
Matrix %>% mutate(Target = as.factor(Target)) -> Matrix

# Create training and test sets
set.seed(1854)
strat <- initial_split(Matrix, prop = 0.7, strata = 'Target')
traindf <- training(strat)
testdf <- testing(strat)

# SVM model
# Enable parallel computing
cl <- makePSOCKcluster(7)
registerDoParallel(cl)

# SVM with radial basis kernel
set.seed(1854) # for reproducibility
svmmod <- caret::train(
  Target ~ .,
  data = traindf,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)

# Stop parallel
stopCluster(cl)
Even with parallel processing, the train() call above did not finish. My Windows computer with an Intel Core i3 and 6 GB of RAM was not able to complete this training in 3 days; the computer was on the whole time, but the model never finished training, so I stopped it.
Maybe I am doing something wrong that is making train() pretty slow. I would like to know if there is any way to speed up the training method I defined. I also do not understand why it takes so long when there are only 8 variables.
Please, could you help me solve this issue? I have looked for solutions to this problem without success, and any suggestion on how to improve my training method is welcome. Moreover, some solutions mention that h2o can be used, but I do not know how to set up my SVM scheme in that architecture.
Many thanks for your help.

"Error in seeds[[num_rs + 1L]] : subscript out of bounds" when using caret to create an LVQ model?
I'm using the caret package to create an LVQ model and select features on a dataset of 579 independent variables and 55 samples:
set.seed(123)
data = data
control <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
But when I run the command to train the model I get the following error:
model <- train(remission~., data = data, method = "lvq", preProcess = "scale",
               trControl = control, importance = T)
Error in seeds[[num_rs + 1L]] : subscript out of bounds
Can you suggest any solutions? Considering the number of variables I have, this seems the best way to find the important features for my model. I even tried trimming my variables down to 40 and then 10, but I still get the same error.

glmnet & selectiveInference: issues when calculating confidence intervals of a LASSO fit
I'm trying to use glmnet to run a LASSO fit on a large dataset (n = 15000, 21 variables). I want to look at the confidence intervals and p-values for the selected variables, so I've tried pushing the results through fixedLassoInf() from the selectiveInference library. I have a couple of issues with my results that aren't well explained in the vignettes, so I'm posting them here in case someone can help:
- For one of my variables the CI range goes to +Inf. According to the vignette:
The confidence interval construction involves numerical search and can be fragile: if the observed statistic is too close to either end of the truncation interval (vlo and vup, see references), then one or possibly both endpoints of the interval of desired coverage cannot be computed, and default to +/- Inf.
Looking at the references in the vignette, it seems the variables involved are calculated directly from the data, and so this is a consequence of my dataset. Is there anything I can do to avoid it happening? If not, how would I report this issue in a publication?
- For the p-values there are two calculation options under the "type" argument, "partial" and "full": whether the contrasts are tested for the variables that remain after the LASSO, or for all the candidate variables. When I try "full" I get the following error:
Polyhedral constraints not satisfied; you must recompute beta more accurately. With glmnet, make sure to use exact=TRUE in coef(), and check whether the specified value of lambda is too small (beyond the grid of values visited by glmnet). You might also try rerunning glmnet with a lower setting of the 'thresh' parameter, for a more accurate convergence.
The lambda I get from cv.glmnet is 3.8E-4 (according to the vignette this needs to be divided by the number of observations, but I don't see any difference in the coefficients or selected variables when I do this). Is this a typical value for lambda? I fear it may be too small to be meaningful, but I still see variables being removed by the LASSO, so I guess it is still useful? The value of lambda, and so the coefficients and selected variables, also changes when I lower "thresh". Currently I'm using a value of 1E-20, but I have no idea what this value should be. I've seen it mentioned that it depends on the data, but how can that be inferred?

Glmnet: great performance with zero features selected
I'm training an Elastic Net model to predict clinical severity after COVID infection using leave-one-out cross-validation. The performance looks great when you look at the predictions, but it appears that all the coefficients were zero. How is this possible?
I can't share the original gene matrix because it is very large but here is my script.
# Test and train Elastic Net model for each fold.
results <- mclapply(1:nrow(data), function(i) {
  fit <- cv.glmnet(
    x = data[-i, ],
    y = as.numeric(outcome[-i]),
    family = "gaussian",
    alpha = 0.5
  )
  pred <- predict(
    fit,
    newx = data[i, , drop = F],
    lambda = "lambda.1se"
  )
  data.frame(
    sample = rownames(data)[i],
    score = pred[1],
    actual = outcome[i],
    nzero = fit$nzero[fit$lambda == fit$lambda.1se]
  )
}, mc.cores = 2)
And here are the results. Notice that the nzero column (the number of non-zero coefficients) is all zeros. What's going on here?
Edit: convert outcome to numeric.
Does glmnet package support multivariate grouped lasso regression?
I'm trying to perform a multivariate lasso regression on a dataset with 300 independent variables and 11 response variables using the glmnet library. I'd like to group some of the input variables and then apply multivariate grouped lasso regression, so that all the variables in a group are either selected or discarded by the lasso model depending on their significance. How can I achieve this? I did look into the grplasso package, but it doesn't support multivariate regression.