R programming for linear model
model2 <- lm(formula = Losses.in.Thousands~Age, Years.of.Experience, Gender, Married, data = default)
Error in model.frame.default(formula = Losses.in.Thousands ~ Age, data = default, : object 'Married' not found
See also questions close to this topic

Strange behavior of ggplot2 stacked bar plot, why groups are not grouped together properly?
I have a data frame like the following:
set.seed(1)
mydf <- data.frame(ID="id1", Type=c(rep("C",6),rep("A",4)), Value=abs(rnorm(n=10)))
mydf$Value.log <- log2(mydf$Value+0.1)
mydf
#     ID Type     Value   Value.log
# 1  id1    C 0.6264538 -0.46105702
# 2  id1    C 0.1836433 -1.81785019
# 3  id1    C 0.8356286 -0.09599211
# 4  id1    C 1.5952808  0.76152426
# 5  id1    C 0.3295078 -1.21924386
# 6  id1    C 0.8204684 -0.11955993
# 7  id1    A 0.4874291 -0.76751348
# 8  id1    A 0.7383247 -0.25441895
# 9  id1    A 0.5757814 -0.56537156
# 10 id1    A 0.3053884 -1.30262333
I make this stacked bar plot and it looks good:
png(filename="test.png")
print( # or ggsave()
  ggplot(mydf, aes(ID, Value.log, fill=Type)) + geom_bar(stat="identity")
)
dev.off()
Why, if I just change the name of Type "A" to "E", keeping the rest exactly the same, does the stacked bar plot not do the grouping the same way?

mydf$Type <- as.character(mydf$Type)
mydf$Type[which(mydf$Type=="A")] <- "E"
mydf$Type <- factor(mydf$Type)
png(filename="test2.png")
print( # or ggsave()
  ggplot(mydf, aes(ID, Value.log, fill=Type)) + geom_bar(stat="identity")
)
dev.off()
Why is Type "C" (now in red) broken in two now? How can I avoid it? Thanks!
Plotting 2 unevenly spaced time series on a single graph using R
I have 2 data sets of ocean temperatures for the past 2 million years.
The first data set is from the Atlantic and has 70 observations unevenly spaced over a 2 million year interval. The second data set is from the Pacific and has 500 observations, also unevenly spaced over the same interval.
I want to plot both data sets on one graph so I can compare the temperature differences between the 2 oceans over the last 2 million years.
Any and all suggestions welcome!!

Changing data structure in R
I am currently having the following data structure (as data.frame)
NA a1  a2  a3
t1 y11 y21 y31
t2 y12 y22 y32
t3 y13 y23 y33

and want to change it into
t1 s1 y11
t2 s1 y12
t3 s1 y13
t1 s2 y21
t2 s2 y22
t3 s2 y23
t1 s3 y31
t2 s3 y32
t3 s3 y33

Any suggestion how to proceed? Thanks for your help ;)

Create loop/function to remove negative varImp results
I would like to create a loop that models the data, gets the variable importance, identifies the columns with negative importance, filters them from the data, and models it again until there are no negative values. Below you can see the example code for creating the model and getting the variable importance:
library(party)
library(caret)
model_cforest <- cforest(drat~., data=mtcars, controls=cforest_unbiased())
cforest_var <- varImp(model_cforest, conditional=TRUE)
As we can see, cforest_var gives us this table:

          Overall
mpg   0.009778909
cyl   0.033507134
disp  0.056359569
hp    0.000000000
wt    0.044186730
qsec  0.000000000
vs   -0.000309504
am    0.050791540
gear  0.060967894
carb  0.000000000
On the basis of this table I would then like to remove the column vs (which has a negative value) and run the cforest model again (and if there is again a negative value, remove it and run the model again until there are no negative values). The final result should be a table with the most important variables.
Here is as far as I got:

removeNeg <- function(data){
  model_cforest <- cforest(drat~., mtcars, controls=cforest_unbiased())
  cforest_var <- varImp(model_cforest, conditional=TRUE)
  varImp_neg <- row.names(cforest_var)[apply(cforest_var, 1, function(u) any(u < 0))]
}
but I have a feeling that this is the wrong direction and I'm stuck in one place. Thanks for the help!

Statsmodel intercept is different to Seaborn lmplot intercept
What could explain the difference in intercepts between statsmodel OLS regression and also seaborn lmplot?
My statsmodel code:
X = mmm_ma[['Xvalue']]
Y = mmm_ma['Yvalue']
model2 = sm.OLS(Y, sm.add_constant(X), data=mmm_ma)
model_fit = model2.fit()
model_fit.summary()
My seaborn lmplot code:
sns.lmplot(x='Xvalue', y='Yvalue', data=mmm_ma)
My statsmodel intercept is 28.9775 and my seaborn lmplot's intercept is around 45.5.
Questions:
- Should the intercepts be the same?
- What might explain why these are different? (Can I change some code to make them equal?)
- Is there a way to achieve a plot similar to seaborn lmplot but using the exact regression results to ensure they align?
Thanks
[EDIT  19th July]
@Massoud thanks for posting that. I think I have realised what the problem is. My x values range between 1400 and 2600 and my y values range from 40 to 70. So using seaborn lmplot, it just plots the regression, and the intercept shown is based on the lowest X value in the range, which gives an intercept of about 46.
However for statsmodel OLS, it keeps the line going until X = 0, which is why I get an intercept of 28 or so.
So I guess the question is: is there a way to continue the trend line using seaborn all the way to x = 0?
I tried changing the axis but it doesn't seem to extend the line.
axes = lm.axes
axes[0,0].set_xlim(0,)
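The two intercepts can in fact be the same number: statsmodels reports the coefficient at x = 0, while the left edge of an lmplot drawn over x in [1400, 2600] sits far from x = 0 (seaborn's lmplot also accepts truncate=False, which extends the fitted line to the axis limits, so combined with set_xlim it can reach x = 0). A numpy-only sketch of the arithmetic, with made-up data in the ranges described:

```python
import numpy as np

# made-up data in the reported ranges: x in [1400, 2600], y around 40-70
rng = np.random.default_rng(0)
x = rng.uniform(1400, 2600, 100)
y = 30.0 + 0.01 * x + rng.normal(0, 1, 100)   # true intercept 30 at x = 0

slope, intercept = np.polyfit(x, y, 1)

# the regression intercept (line height at x = 0) ...
# ... versus the line height at the left edge of the plotted range
left_edge = intercept + slope * x.min()
```

The gap between `intercept` and `left_edge` is exactly the 28-vs-46 discrepancy described above: same fitted line, read off at two different x positions.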

"TypeError: can't pickle NotImplementedType objects" in KerasRegression model
I'm creating a simple regression neural network in Keras. However, when I try to run it as follows
seed = 7
numpy.random.seed(seed)
dataset = numpy.loadtxt("instancesFipo.txt", delimiter=", ")
#testset = numpy.loadtxt("instancesFipo6.txt", delimiter=", ")
Xtrain = dataset[:,0:8] #All rows, first 8 columns
Ytrain = dataset[:,8] #All rows, 9th column
model = Sequential()
#Create layers of neural net
model.add(Dense(50, input_dim=8, kernel_initializer='normal', activation='relu'))
model.add(Dense(50, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
#Create loss function and algorithm
model.compile(loss='mean_squared_error', optimizer='adam')
estimator = KerasRegressor(build_fn=model, epochs=100, batch_size=10, verbose=0)
kfold = KFold(n_splits=10, random_state=seed)
results = cross_val_score(estimator, Xtrain, Ytrain, cv=kfold)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
I'm getting "TypeError: can't pickle NotImplementedType objects", which is induced by the call to cross_val_score. Not sure what's going on. Any help would be appreciated, and thanks!
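One likely cause, offered as a guess from the code above: build_fn=model hands KerasRegressor a compiled model instance, but the wrapper expects a callable that builds and returns a compiled model. cross_val_score has to copy the estimator for each fold, and a compiled Keras model holds objects that refuse to pickle, whereas a module-level builder function pickles by reference. A pure-Python sketch of the failure mode (Unpicklable is a made-up stand-in, not a Keras class):

```python
import pickle

class Unpicklable:
    """Stand-in for a compiled Keras model: refuses to be pickled."""
    def __getstate__(self):
        raise TypeError("can't pickle this object")

def build_model():
    # KerasRegressor expects a callable like this, invoked per fold,
    # rather than an already-built instance
    return Unpicklable()

try:
    pickle.dumps(build_model())      # copying the *instance* fails
    failure = None
except TypeError as err:
    failure = str(err)
```

Defining the network inside a `build_model()` function and passing `build_fn=build_model` (no parentheses) is the documented pattern for the scikit-learn wrappers.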
SciPy curve_fit returns weird fitted curve
I just tried to fit a curve to a bunch of points that looks like a logistic function, and the result is like a tangled curve.
This is the code:
from scipy.optimize import curve_fit

def logistic(v, m, n, a, t):
    return a * (1 + m * np.exp(v/t))/(1 + n * np.exp(v/t))

def power_curve_fit(xvalues, yvalues):
    xdata = xvalues
    ydata = yvalues
    popt, pcov = curve_fit(logistic, xdata, ydata)
    pc = pd.DataFrame()
    pc['wind_speed'] = xdata
    pc['power_gen'] = ydata
    pc['Fit'] = logistic(xdata, *popt)
    plt.plot(xdata, logistic(xdata, *popt), 'red')
    plt.scatter(xdata, ydata, c='pink', marker='o')
    return pc
Thank you!
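Two things commonly produce a "tangled" fit line in code like the above: curve_fit starts every parameter at 1.0 when p0 is not given, which for exponential-style models often lands in a poor local minimum, and plt.plot connects points in data order, so unsorted xdata draws a scribble even when the fit is fine. A sketch with a simpler three-parameter logistic (the data and parameter values are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def simple_logistic(v, a, t, v0):
    # a standard three-parameter logistic, used here for illustration
    return a / (1.0 + np.exp(-(v - v0) / t))

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 80)                 # deliberately left unsorted
y = simple_logistic(x, 2.0, 1.0, 0.5)      # noise-free illustrative data

# give curve_fit a starting point instead of the default all-ones p0
popt, _ = curve_fit(simple_logistic, x, y, p0=[1.0, 1.0, 0.0])

# sort before drawing the fitted line; plotting in data order is what
# makes the curve look tangled
order = np.argsort(x)
x_line = x[order]
y_line = simple_logistic(x_line, *popt)
```

With the original four-parameter model the same two fixes apply: pass a sensible p0 to curve_fit, and sort xdata before the plt.plot call.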

all coefficients turn zero in Logistic regression using scikit learn
I am working on logistic regression using scikit learn in python. I have the data file that can be downloaded via the following link.
Below is my code for machine learning part.
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd

scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome', axis=1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=33)
lasso = Lasso(alpha=.3)
lasso.fit(X_train, y_train)
print("MC learning completed")
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
print(lasso.coef_)
When I print the coefficients, they all turn out to be zero. Can anyone advise me on that?
Let me explain a little about my objective. The problem seems to be a classification problem, as we can only see 0 or 1 in Ytrain and Ytest. To put it simply, 0 can be considered a miss and 1 a score. What I am trying to do is compute the scoring probability for each event when a shot takes place.
Thanks in advance.
Regards,
Zep
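All-zero coefficients here are plausible rather than a bug: Lasso with alpha=0.3 on standardized data can shrink every coefficient to exactly zero, and Lasso is a linear-regression model in any case. For a 0/1 outcome with per-shot probabilities, LogisticRegression with an L1 penalty is the usual tool. A sketch on synthetic data (make_classification stands in for data.csv, and C=1.0 is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the shot data: 0 = missed, 1 = scored
X, y = make_classification(n_samples=500, n_features=10, random_state=33)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# C is the *inverse* of the regularization strength; too small a C
# zeroes out coefficients just like too large an alpha does in Lasso
clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]   # scoring probability per shot
```

If coefficients still vanish, loosening the penalty (larger C) is the first knob to turn.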

CPU utilization while running scikit-learn logistic regression
While running scikit-learn logistic regression, I only utilize ~14% of the computer's CPU, even when adding the n_jobs=1 parameter. Is there a way to increase the CPU usage / memory usage for a faster process?
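Worth noting: n_jobs=1 explicitly requests a single worker, and ~14% is about one core of an 8-thread machine, so the observation matches the setting. n_jobs=-1 asks for every core, but LogisticRegression only uses it to parallelize over classes in one-vs-rest fitting, so on a binary problem the solver choice matters more. A sketch (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# illustrative stand-in data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=-1 requests every available core; 'saga' is one of the
# solvers better suited to larger problems
clf = LogisticRegression(solver='saga', n_jobs=-1, max_iter=500)
clf.fit(X, y)
```

For a genuinely large dataset, SGDClassifier(loss='log_loss') is another scikit-learn route that scales better than batch solvers.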

Matlab: Linprog separate points with line
I have two sets of dots with different colors. They should be separated by a straight line. The program should use linprog from Matlab.
The straight-line function is defined by a * x + b * y = c. The optimization target should be the maximum distance to the line. I have no idea how f, A, b should look. Thanks for the help!
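Matlab's linprog and SciPy's scipy.optimize.linprog take the same ingredients (an objective vector f, an inequality system A, b, plus variable bounds), so a Python sketch of one standard formulation may help map the pieces: maximize a margin d by which each set clears the line, with the line coefficients capped so the LP stays bounded. The point coordinates are made up:

```python
import numpy as np
from scipy.optimize import linprog

# two linearly separable point sets; coordinates are illustrative
red  = np.array([[2.0, 2.0], [3.0, 2.0], [2.0, 3.0]])
blue = np.array([[-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0]])

# variables v = [a, b, c, d], line a*x + b*y = c, margin d:
#   red:  a*x + b*y - c >=  d   ->   -a*x - b*y + c + d <= 0
#   blue: a*x + b*y - c <= -d   ->    a*x + b*y - c + d <= 0
ones = np.ones((3, 1))
A = np.vstack([np.hstack([-red,  ones, ones]),
               np.hstack([blue, -ones, ones])])
b = np.zeros(len(A))

f = [0, 0, 0, -1]                  # maximize d  ==  minimize -d
bounds = [(-1, 1), (-1, 1), (-1, 1), (0, None)]  # cap a,b,c so the LP is bounded

res = linprog(f, A_ub=A, b_ub=b, bounds=bounds)
a_opt, b_opt, c_opt, margin = res.x
```

In Matlab the same v, f, A, b and bounds plug into linprog(f, A, b, [], [], lb, ub).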
lpSolve Knapsack Linear Programming with multiple constraints
I've been attempting to modify the example in the Knapsack Linear Programming post to solve a multiple-knapsack-type problem with a number of constraints. A reprex is provided below. The solution from the linked post works fine when constraining the knapsack to an x number of items. However, my attempt breaks down when adding the additional item-limit constraints (2 and 3) noted below. The knapsack has to meet the following conditions:
- exactly 10 items have to be selected (exact.num.elt <- 10)
- each item has a label (Type1, Type2, etc., contained in vector t) and a corresponding weight (contained in vector w)
- each Type in the knapsack must sum to its respective limit of either 1, 2, or 3
- I would also like to have the ability to produce multiple possible knapsack solutions for comparative analysis
library(lpSolve)
p <- c(rnorm(50, 15, 5))
t <- rep(c("Type1", "Type2", "Type3", "Type4", "Type5"), length.out = 50) #item labels
w <- c(rnorm(50, 3500)) #the weight of each item
cap <- 32000 #sum of 'w' cannot exceed this amount
exact.num.elt <- 10 #total knapsack capacity

# max item limit by Type - set const.dir to "=" for each type
Type1 <- 2
Type2 <- 1
Type3 <- 2
Type4 <- 2
Type5 <- 3

mod <- lp(direction = "max",
          objective.in = p,
          const.mat = rbind(w, rep(1, length(p))) & rbind(t, rep(1, length(t))),
          const.dir = c("<=", "=", "=", "=", "=", "=", "="),
          const.rhs = c(cap, exact.num.elt, Type1, Type2, Type3, Type4, Type5),
          all.bin = TRUE)
Running the above reprex produces the following error:
Error in rbind(w, rep(1, length(p))) & rbind(t, rep(1, length(t))) : operations are possible only for numeric, logical or complex types
I can see that this error has something to do with the way I'm coding the const.mat argument. However, I have not been able to figure out how to add the additional Type constraint without producing one error or another. I would also like to have the ability to generate multiple (say 10 or 20) different solutions so that I can produce summary statistics.

Allocation of available resources (MILP? optimisation)
I want to run an optimization considering that I have a set of available times to depart (for example 5:20, 6:25, 6:30...) and a set of desired times (5:25, 7:00, 7:10...) for a set of flights.
The idea is to optimize which available time is used for every departure to minimize the deviation:
A wrong solution would be to use the 5:20 available time for the 7:00 departure. The closest available time for the 7:00 departure is the 6:30... I hope it makes sense...
I believe this is a type of MILP, but I am just an amateur and not sure how to approach it.
Many thanks for your help!
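If every departure gets exactly one available slot, this can be framed as a linear assignment problem rather than a full MILP: build a matrix of deviations |available - desired| and minimize the total. A SciPy sketch using the times from the example above, converted to minutes since midnight:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# times in minutes since midnight
available = np.array([5*60+20, 6*60+25, 6*60+30])   # 5:20, 6:25, 6:30
desired   = np.array([5*60+25, 7*60+0,  7*60+10])   # 5:25, 7:00, 7:10

# cost[i, j] = deviation if desired departure i uses available slot j
cost = np.abs(desired[:, None] - available[None, :])

# optimal one-to-one matching minimizing the summed deviation
row_ind, col_ind = linear_sum_assignment(cost)
total_deviation = cost[row_ind, col_ind].sum()
```

As expected, the 5:25 departure takes the 5:20 slot rather than wasting a later one. For harder variants (reusable slots, hard time windows, aircraft constraints) a MILP through a solver such as PuLP would be the next step.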