test variable with demographic
I want to test whether there is any relationship between gender and an "I do not know" response in my database. Is there any way I can do this in R? Say gender is coded 1 and 2, or Male and Female. I have a scale question from 1 to 10, where 99 is the "I do not know" option. How can I test the relationship between them?
Thanks
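Once the 99s are recoded into a yes/no "don't know" indicator, this becomes a test of independence between two categorical variables, and a chi-squared test on the gender × don't-know cross-tab is one standard option. Below is a minimal sketch in Python with scipy; the data are made up for illustration. In R the same idea is one line: chisq.test(table(gender, answer == 99)).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical survey data: gender coded 1/2, answers on a 1-10 scale, 99 = "I do not know"
rng = np.random.default_rng(0)
gender = rng.choice([1, 2], size=200)
answer = rng.choice(list(range(1, 11)) + [99], size=200)

# Recode the scale item into a binary "don't know" indicator
dont_know = (answer == 99).astype(int)

# 2x2 contingency table: gender vs. don't-know
table = np.zeros((2, 2), dtype=int)
for g, d in zip(gender, dont_know):
    table[g - 1, d] += 1

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a large p here means no evidence of a relationship
```

If cell counts are small, Fisher's exact test (fisher.test in R, scipy.stats.fisher_exact in Python) is the usual fallback.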
See also questions close to this topic

R error while running ARIMA for Time Series Forecasting
I am getting the error below after running an ARIMA model. My code is below; please help me resolve it.
> model <- arima(log(gas_train), order = c(8,1,2), seasonal = list(order = c(8,1,2), period = 12))
Error in optim(init[mask], armafn, method = optim.method, hessian = TRUE, :
  non-finite finite-difference value [17]
In addition: Warning messages:
1: In log(s2) : NaNs produced
2: In log(s2) : NaNs produced
3: In log(s2) : NaNs produced
4: In log(s2) : NaNs produced
5: In log(s2) : NaNs produced

R, HPD (Highest posterior density) interval based on samples from posterior, WinBUGS
How do I calculate an HPD (highest posterior density) interval from posterior samples? I have four parameters and I generated 1000 samples from the posterior distribution of each. How do I calculate the HPD in R? I used the coda package, but I got this error:
HPDinterval(winbugsresult$sims.list, prob = 0.05)
Error in UseMethod("HPDinterval") :
  no applicable method for 'HPDinterval' applied to an object of class "list"
where "winbugsresult" is a list that contains posterior samples.
I also passed a single vector and got the following error:
HPDinterval(winbugsresult$sims.list$alpha, prob = 0.05)
Error in UseMethod("HPDinterval") :
  no applicable method for 'HPDinterval' applied to an object of class "c('double', 'numeric')"
I even used just a random vector from a normal distribution and got the error again:
HPDinterval(rnorm(100))
Error in UseMethod("HPDinterval") :
  no applicable method for 'HPDinterval' applied to an object of class "c('double', 'numeric')"
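coda's HPDinterval expects an mcmc object, not a plain list or numeric vector, so wrapping the samples usually fixes exactly this error: HPDinterval(as.mcmc(rnorm(100)), prob = 0.95). (Note also that prob = 0.05 requests a 5% interval; 0.95 is more likely what is wanted.) For intuition, here is a minimal sketch of computing an empirical HPD interval directly from samples in Python; the function name is made up.

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing `prob` of the samples (empirical HPD)."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    k = int(np.ceil(prob * n))          # number of points inside the interval
    widths = x[k - 1:] - x[:n - k + 1]  # width of every candidate interval
    i = int(np.argmin(widths))          # pick the shortest one
    return x[i], x[i + k - 1]

rng = np.random.default_rng(0)
draws = rng.normal(0, 1, 10_000)
lo, hi = hpd_interval(draws, prob=0.95)
print(lo, hi)  # roughly symmetric around 0 for a standard normal
```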

If statement with employees data
I have one data set, which contains data about employees in a company. You can see the data below:
#Data
output_test <- data.frame(
  Employees = c(1, 2, 3, 10, 15, 122, 143, 150, 250, 300, 500, 1000)
)
So the next step should be classification: I need to classify Employees by the size of the company. The rule is that the number of Employees determines the company size. If the number is below 10, it is a "micro" company; if the number is greater than 10 but less than or equal to 50, it is a "small" company; for a "medium" company the number of Employees is greater than 50 but less than or equal to 250; and last is a "large" company, which has more than 250 Employees. In order to do this I wrote this code with an if/else statement:
# Code
library(dplyr)
output_test_final <- output_test %>%
  mutate(
    Size = if (Employees >= 10) {
      "Micro"
    } else {
      if (Employees >= 50) {
        "Small"
      } else {
        if (Employees >= 250) {
          "Medium"
        } else {
          "Large"
        }
      }
    }
  )
The results from this code are not right. Can anybody help me fix this code so I get a table like the one below?
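Two things go wrong in the posted code: if() is not vectorised (it only looks at the first element of Employees), and the thresholds are inverted (>= 10 maps to "Micro"). In R, dplyr::case_when or cut() are the usual fixes. Here is a sketch of the same binning in Python with pandas (data copied from the post; the handling of the boundary value 10 as "Small" is an assumption):

```python
import pandas as pd
import numpy as np

employees = pd.Series([1, 2, 3, 10, 15, 122, 143, 150, 250, 300, 500, 1000])

# Vectorised binning: <10 -> Micro, 10-50 -> Small, 51-250 -> Medium, >250 -> Large
size = pd.cut(employees,
              bins=[0, 9, 50, 250, np.inf],
              labels=["Micro", "Small", "Medium", "Large"])
print(pd.DataFrame({"Employees": employees, "Size": size}))
```

The R equivalent with the thresholds in the right order would use case_when(Employees < 10 ~ "Micro", Employees <= 50 ~ "Small", Employees <= 250 ~ "Medium", TRUE ~ "Large").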

Keras regression - accuracy of second output is always worse than the first (single loss function, no loss_weights)
I am building an LSTM regression with keras. I have two outputs but am only using a single (custom) loss function, since it is correct for both outputs. I am also not specifying any loss_weights. This is my compile statement:
self.model.compile(optimizer=opt, loss=self.custom_loss)
where the custom loss accepts both output predictions and actuals:
custom_loss(y_true, y_pred)
For every test I make, the second output is always less accurate than the first.
Could this be because I am not specifying any weights?
What are the default loss_weights for a multi-output regression in keras? Without specifying anything, is the loss function still a weighted sum of the losses of all outputs?
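As far as I know, when a single loss and no loss_weights are given, Keras applies the loss to each output separately and minimises their plain sum, i.e. equal weights of 1.0 per output, so the second output is not down-weighted by default. A tiny numpy illustration of that assumed behaviour (mse and the data here are made up; the custom loss in the post plays the role of mse):

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Two outputs, one shared loss, default weights of 1.0 each (assumed Keras behaviour)
y_true = [np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.5, 0.5])]
y_pred = [np.array([1.1, 1.9, 3.2]), np.array([0.0, 1.0, 0.5])]
loss_weights = [1.0, 1.0]

total = sum(w * mse(t, p) for w, t, p in zip(loss_weights, y_true, y_pred))
print(total)  # equal weighting: neither output is favoured by the optimiser
```

If the second output has a larger natural scale, its loss term dominates the sum even with equal weights, which can look like one output being "worse"; rescaling the targets or setting explicit loss_weights are the usual remedies.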

Display the prediction for 10 years using polynomial regression on python
I built this code using polynomial regression based on the table below (a small part of it), and I'm using sklearn regression from degree 1 up to 4 to be able to predict values up to 2020.
Countries    2008      2009      2010      2011      2012      2013      2014      2015      2016      2017      2018
Algeria      0.000000  0.000000  0.009100  0.018119  0.026723  0.028600  0.060000  0.058000  0.245000  0.504000  0.603000
Argentina    0.000144  0.000076  0.000086  0.001614  0.008173  0.015074  0.015944  0.014683  0.014273  0.016417  0.108129
Australia    0.139200  0.290242  0.977648  2.044547  2.412000  3.847400  4.952000  5.958000  7.474595  8.955110  12.081099
Austria      0.030120  0.048914  0.088813  0.174070  0.337483  0.625974  0.785246  0.937098  1.096016  1.268971  1.578641
Azerbaijan   0.000000  0.000000  0.000000  0.000000  0.000000  0.000800  0.002900  0.004600  0.035300  0.037200  0.039260
While searching how to build it, I found this expression that lets me make the prediction for a single year, lin.predict(poly.fit_transform([[2018]])), but is there a way I can make predictions up to 2020 and plot these prediction values on the graph?
#Linear regression for one country
df = Countries_New.transpose()
x = df.index.values.reshape(-1, 1)
#Choose here the country name to make the prediction
Country_Name = 'Austria'
try:
    print(f'The chosen country is {Country_Name}')
    print('')
    y = df[Country_Name].values.reshape(-1, 1)
    for i in range(1, 5):
        #print('The degree of the equation is: ' + str(i))
        #Fit the Poly regression
        poly = PolynomialFeatures(degree=i)
        x_poly = poly.fit_transform(x)
        poly.fit(x_poly, y)
        lin = LinearRegression()
        lin.fit(x_poly, y)
        y_poly_pred = lin.predict(x_poly)
        rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
        r2 = r2_score(y, y_poly_pred)
        print('The RMSE is {} and R2 is {}'.format(rmse, r2))
        #print('')
        #print('The generation (original value) for 2018 is {}'.format(y[-1]))
        #print('The prediction for 2018 is {}'.format(lin.predict(poly.fit_transform([[2018]]))))
        #print('The prediction for 2019 is {}'.format(lin.predict(poly.fit_transform([[2019]]))))
        #print('The prediction for 2020 is {}'.format(lin.predict(poly.fit_transform([[2020]]))))
        #for i in range(2019, 2029):
        #    x_poly.append(i, lin.predict(poly.fit_transform([[2020]])), axis=1)
        #Plot Poly regression
        plt.scatter(x, y, color='blue')
        plt.plot(x, lin.predict(x_poly), color='red')
        plt.title('Polynomial Regression degree ' + str(i))
        plt.xlabel('Year')
        plt.ylabel('Renewable Generation (TWh)')
        plt.show()
except:
    print('Country not found')
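To forecast past the observed years, one option is to build an extended year axis and call predict on its polynomial features, then plot that alongside the data. A minimal sketch with sklearn; the values below are made-up stand-ins for one country's column:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for one country's column (2008-2018)
years = np.arange(2008, 2019).reshape(-1, 1)
values = np.array([0.03, 0.05, 0.09, 0.17, 0.34, 0.63, 0.79, 0.94, 1.10, 1.27, 1.58])

poly = PolynomialFeatures(degree=2)
lin = LinearRegression().fit(poly.fit_transform(years), values)

# Extend the x-axis past the data so the forecast can be plotted with the fit
future = np.arange(2008, 2021).reshape(-1, 1)
forecast = lin.predict(poly.transform(future))
print(forecast[-3:])  # predictions for 2018, 2019, 2020
```

For the plot, `plt.plot(future, forecast, color='red')` after the scatter draws the fitted curve continuing into the forecast years.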

Loop to run regression on multiple dependent variables
I have data such as this. I would like to run a linear regression on multiple variables separately without having to copy paste the regression.
dat <- read_table2("condition school Q2_1 Q2_2 Q2_4 Q2_8 Q2_10 Q2_11 Q2_14 Q2_15
1 A 3 4 3 2 2 4 4 2
0 B 3 3 3 2 1 2 2 1
1 C 4 4 4 3 3 4 3 3
0 D 3 4 3 3 2 2 4 2
1 A 2 4 2 3 3 3 3 3
0 B 2 4 4 2 3 2 2 3
1 C 3 3 3 2 3 3 2 3
0 D 4 4 3 3 3 2 2 3
1 A 3 3 3 3 2 3 3 2
0 B 3 3 3 2 3 3 4 1
1 C 4 4 4 4 3 3 3 3
0 D 3 3 4 3 3 4 4 2
1 A 2 2 2 2 2 2 2 3
0 B 3 3 3 2 2 2 2 3
1 C 3 4 3 1 2 3 3 3
0 D 3 4 2 2 3 4 3 3
1 A 4 4 4 3 3 4 3 2
0 B 4 2 3 2 3 2 2 1
1 C 4 3 3 4 3 4 3 3
0 D 3 3 3 2 2 2 3 2
1 A 4 2 3 3 2 2 2 3
")
I could run this separately and then pull out the coefficients.
model1 <- lmer(Q2_1 ~ (1|school) + condition, data = regression_data, REML = F)
model2 <- lmer(Q2_2 ~ (1|school) + condition, data = regression_data, REML = F)
model3 <- lmer(Q2_4 ~ (1|school) + condition, data = regression_data, REML = F)
summary(model1)$coefficient
summary(model2)$coefficient
summary(model3)$coefficient
But I'd rather do this in one chunk of code. I adjusted some code I found online to come up with this:
# outcome
out_start = 3
out_end = 10
out_nvar = out_end - out_start + 1
out_variable = rep(NA, out_nvar)
out_beta = rep(NA, out_nvar)
out_se = rep(NA, out_nvar)
out_pvalue = rep(NA, out_nvar)

# exposure
exp_start = 1
exp_end = 2
exp_nvar = exp_end - exp_start + 1
exp_variable = rep(NA, exp_nvar)
exp_beta = rep(NA, exp_nvar)
exp_se = rep(NA, out_nvar)
exp_pvalue = rep(NA, exp_nvar)

number = 1
library(lme4)
for (i in out_start:out_end) {
  outcome = colnames(dat)[i]
  for (j in exp_start:exp_end) {
    exposure = colnames(dat)[j]
    model <- lmer(get(outcome) ~ get(exposure) + (1|school) + condition,
                  data = dat, REML = F, na.action = na.exclude)
    Vcov <- vcov(model, useScale = FALSE)
    beta <- fixef(model)
    se <- sqrt(diag(Vcov))
    zval <- beta / se
    pval <- 2 * pnorm(abs(zval), lower.tail = FALSE)
    out_beta[number] = as.numeric(beta[2])
    out_se[number] = as.numeric(se[2])
    out_pvalue[number] = as.numeric(pval[2])
    out_variable[number] = outcome
    number = number + 1
    exp_beta[number] = as.numeric(beta[2])
    exp_se[number] = as.numeric(se[2])
    exp_pvalue[number] = as.numeric(pval[2])
    exp_variable[number] = exposure
    number = number + 1
  }
}
outcome = data.frame(out_variable, out_beta, out_se, out_pvalue)
exposure = data.frame(exp_variable, exp_beta, exp_se, exp_pvalue)

library(tidyverse)
outcome = outcome %>% dplyr::rename(
  variable = out_variable, beta = out_beta, se = out_se, pvalue = out_pvalue
)
exposure = exposure %>% dplyr::rename(
  variable = exp_variable, beta = exp_beta, se = exp_se, pvalue = exp_pvalue
)
all = rbind(outcome, exposure)
all = na.omit(all)
This does not give me quite what I want: it gives me the effect of being in the condition group for each dependent variable (question), but it does not show me the value of the intercept.
For example, for Q2_1 I expect to see:
model_lmer <- lmer(Q2_1 ~ (1|school) + condition, data = dat, REML = F)
summary(model_lmer)$coefficient
             Estimate Std. Error df    t value     Pr(>|t|)
(Intercept) 3.1000000  0.2079585 21 14.9068177 1.213221e-12
condition   0.1727273  0.2873360 21  0.6011334 5.541851e-01
Instead I see:
  var         beta        se    p value
1 Q2_1 1.727273e-01 0.2873360 0.54775114
3 Q2_1 2.559690e-15 0.3737413 1.00000000
Any suggestions are appreciated!
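One likely culprit is that the loop only ever stores beta[2] (the slope) and never beta[1] (the intercept), so keeping the whole fixed-effect vector per outcome is the fix. Here is a minimal sketch of that keep-everything bookkeeping pattern, written in Python with plain linear regressions on made-up data (no school random effect, unlike lmer, so it only illustrates the looping and storage, not the mixed model):

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical stand-in for the survey data: one predictor, several outcomes
dat = pd.DataFrame({"condition": rng.integers(0, 2, 40)})
for q in ["Q2_1", "Q2_2", "Q2_4"]:
    dat[q] = 2 + 0.5 * dat["condition"] + rng.normal(0, 0.5, 40)

results = {}
for outcome in ["Q2_1", "Q2_2", "Q2_4"]:
    m = LinearRegression().fit(dat[["condition"]], dat[outcome])
    # keep both the intercept and the slope, as asked in the post
    results[outcome] = {"intercept": m.intercept_, "condition": m.coef_[0]}

print(pd.DataFrame(results).T)
```

In the R loop, the analogous change is to save beta[1] (and se[1], pval[1]) alongside beta[2] instead of overwriting the same slot.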

Does statsmodels OLS normalise variables?
Does statsmodels.api.OLS first normalise the variables, so that if my design matrix contains variables of very different magnitudes this would help (X'X)**(-1) not be ill-conditioned?
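To my knowledge, statsmodels' OLS fits the design matrix exactly as supplied and does no internal normalisation, so any rescaling is up to you. A numpy sketch (made-up data) of why standardising the columns helps the conditioning of X'X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n) * 1e6   # same information, wildly different magnitude
X = np.column_stack([np.ones(n), x1, x2])

# Standardise the non-constant columns yourself before fitting
Xs = X.copy()
Xs[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

print(np.linalg.cond(X.T @ X))    # enormous
print(np.linalg.cond(Xs.T @ Xs))  # small
```

Statsmodels mitigates the numerical side by solving via a pseudoinverse rather than inverting X'X directly, but that does not remove the interpretational and conditioning benefits of scaling mixed-magnitude regressors yourself.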
Split dataset into vectors (linear regression)
I'm building a linear regression model and I need to split a dataset into vectors. Dataset is represented by an array of data points (X, Y). For example:
dataset = [{x1, y1}, {x2, y2}, ... {xn, yn}]
These data points can represent a single vector or a few of them, so there can potentially be more than one vector. The image below illustrates the dataset: it has two non-parallel vectors, so the input dataset should be split into two datasets.
I'm trying to find an algorithm or math methods that may help to do that. Any help, references, or code examples (preferably python, golang, javascript) are welcome.
I'm a developer, not a data scientist. Please, try to not use heavy math :)
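One light-math approach is segmented (piecewise) linear regression with a brute-force breakpoint search: try every split point along sorted x, fit a straight line to each side, and keep the split with the smallest total squared error. A sketch in Python (the function name and the two-segment assumption are mine; libraries like `pwlf` or `ruptures` generalise this to more segments):

```python
import numpy as np

def split_two_segments(x, y, min_pts=3):
    """Brute-force breakpoint search: fit a line to each side of every
    candidate split and keep the split with the smallest total squared error."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best = (np.inf, None)
    for i in range(min_pts, len(x) - min_pts):
        err = 0.0
        for xs, ys in ((x[:i], y[:i]), (x[i:], y[i:])):
            coef = np.polyfit(xs, ys, 1)              # straight-line fit
            err += float(np.sum((np.polyval(coef, xs) - ys) ** 2))
        if err < best[0]:
            best = (err, i)
    i = best[1]
    return (x[:i], y[:i]), (x[i:], y[i:])

# Two non-parallel pieces: slope 1 up to x = 5, then slope -2
x = np.linspace(0, 10, 40)
y = np.where(x < 5, x, 5 - 2 * (x - 5))
(left_x, left_y), (right_x, right_y) = split_two_segments(x, y)
print(left_x[-1], right_x[0])  # the split lands near x = 5
```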

In PLS regression, how are the components computed?
I have a general grasp of the whole concept.
What I understand so far:
1. Y is projected onto each direction of X and the directions are summed.
2. The parameter theta is calculated as the ratio <y, theta> / <theta, theta>.
3. y-hat is updated with the parameter.
4. The previous X is orthogonalized to get the new X.
5. Steps 1-4 are repeated over and over.
6. After the calculation, the X subset is found by backtracking.
I think this equation comes from the first step of the calculation. I understand the second part, but not the first. I guess it sums up the directions, yet x is a vector and each computed direction is also a vector, so how are they multiplied and then summed? What is happening here?
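If this is the formulation from The Elements of Statistical Learning, the first step forms z = sum_j <x_j, y> x_j: each inner product <x_j, y> is a scalar, that scalar multiplies the vector x_j, and the resulting vectors are summed into one new direction z. A numpy sketch of one such step (my reading of the steps above, not a full PLS implementation):

```python
import numpy as np

def pls_component(X, y):
    """One PLS step: z = sum_j <x_j, y> x_j, theta = <z, y> / <z, z>."""
    # Step 1: <x_j, y> is a SCALAR per column; scalar times vector, summed over j.
    phi = X.T @ y              # the scalars <x_j, y>, one per column of X
    z = X @ phi                # sum_j phi_j * x_j -> the new direction (a vector)
    theta = (z @ y) / (z @ z)  # step 2: regression coefficient of y on z
    # Step 4: orthogonalise every column of X against z before the next round
    X_new = X - np.outer(z, (z @ X) / (z @ z))
    return z, theta, X_new

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)            # PLS assumes centred (often standardised) inputs
y = rng.normal(size=50)
z, theta, X_new = pls_component(X, y)
print(np.abs(z @ X_new).max())  # deflated columns are orthogonal to z
```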

Adding "predict" outcome to the initial dataframe
I am a bit lost on how I can append the predicted probabilities as a new column in the original dataset 'advertising'. (I know there should be a test/validation split, but this is an exercise and we are allowed to skip splitting the data.)
advertising <- read.csv('C:/Users/matpo/Desktop/advertising_1.csv', stringsAsFactors = TRUE)
LogMod <- glm(Clicked.on.Ad ~ Daily.Time.Spent.on.Site + Age + Area.Income,
              data = advertising, family = binomial(link = "logit"))
summary(LogMod)
predicted <- predict(LogMod, advertising, type = "response")
predicted[1:5]
Thank you!
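In R the append itself is one line: advertising$predicted <- predicted (since predict was called on the full data, the lengths match). For comparison, the same pattern in Python with sklearn, using a made-up stand-in for the advertising data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical stand-in for the advertising data
df = pd.DataFrame({
    "Daily.Time.Spent.on.Site": rng.uniform(30, 90, 100),
    "Age": rng.integers(18, 60, 100),
})
df["Clicked.on.Ad"] = (df["Age"] + rng.normal(0, 5, 100) > 40).astype(int)

features = ["Daily.Time.Spent.on.Site", "Age"]
model = LogisticRegression().fit(df[features], df["Clicked.on.Ad"])

# Append the fitted probabilities as a new column of the original frame
df["predicted"] = model.predict_proba(df[features])[:, 1]
print(df.head())
```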

While fitting the Logistic Regression model to the data using the fit function, getting NameError: name 'check_X_y' is not defined
Error:
File "C:\Users\soyhomo\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1526, in fit
    in a logarithmic scale between 1e-4 and 1e4.
NameError: name 'check_X_y' is not defined
Command Run:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
I have the latest scikit and sklearn packages installed as well.
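A NameError raised from inside sklearn's own _logistic.py usually points at a broken or mixed installation (for example the deprecated `sklearn` PyPI shim clashing with `scikit-learn`, or files left over from an interrupted upgrade) rather than at your code; a clean uninstall of both names followed by `pip install scikit-learn` is the usual fix. On a healthy install, this minimal, self-contained version of the same two lines runs without the error (data made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(0, 1, 200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)   # raises no NameError on a healthy install
print(logreg.score(X_test, y_test))
```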

Stata logistic regression power analysis
Does the powerlog command allow for multiple independent variables? If so, how are they entered? The help file indicates entry of only one IV. For instance:
powerlog, p1(.08) p2(.23) alpha(.05)
This example is given in the UCLA guide.
This is meant to estimate the required sample size for a logistic regression with one or more independent variables; however, the guide does not show how to add the additional variables to the command.