R programming for linear model
model2<-lm(formula = Losses.in.Thousands~Age, Years.of.Experience,Gender, Married, data = default)
Error in model.frame.default(formula = Losses.in.Thousands ~ Age, data = default, : object 'Married' not found
See also questions close to this topic
Using GloVe's pretrained glove.6B.50.txt as a basis for word embeddings in R
I'm trying to convert textual data into vectors using GloVe in R. My plan is to average the word vectors of a sentence, but I can't get to the word-vectorization stage. I've downloaded the glove.6B.50.txt file and its parent zip file from https://nlp.stanford.edu/projects/glove/, and I have visited text2vec's website and tried running through their example, where they load Wikipedia data, but I don't think it's what I'm looking for (or perhaps I am not understanding it). I'm trying to load the pretrained embeddings into a model so that, given a sentence (say "I love lamp"), I can iterate through it and turn each word into a vector with a function like vectorize(word), which I can then average (turning unknown words into zeros). How do I load the pretrained embeddings into a GloVe model as my corpus (and is that even what I need to do to accomplish my goal)?
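The question is about R's text2vec, but the parse-and-average step it describes is language-agnostic: each line of a GloVe text file is a word followed by its vector components, space-separated. A minimal Python sketch of that idea, with made-up three-dimensional vectors standing in for the real 50-dimensional ones (the helper names load_glove and vectorize_sentence are hypothetical):

```python
def load_glove(lines):
    """Parse GloVe-format lines ('word v1 v2 ... vN') into a {word: vector} dict."""
    embeddings = {}
    for line in lines:
        parts = line.split()
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings

def vectorize_sentence(sentence, embeddings, dim):
    """Average the word vectors of a sentence; unknown words contribute zeros."""
    zeros = [0.0] * dim
    vectors = [embeddings.get(w.lower(), zeros) for w in sentence.split()]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Toy 3-d stand-ins for the real 50-d vectors in glove.6B.50.txt:
emb = load_glove(["i 1.0 0.0 0.0", "love 0.0 1.0 0.0"])
print(vectorize_sentence("I love lamp", emb, dim=3))  # 'lamp' is unknown -> zeros
```

With a real file you would pass `open("glove.6B.50.txt", encoding="utf-8")` as `lines` and `dim=50`.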
Using dev.copy2pdf(), how can I set the filename of the pdf using paste()?
The dev.copy2pdf function lets you export the currently displayed plot as a PDF file, naming it with dev.copy2pdf(file = "...", ...). Because I am writing a loop to save multiple plots as PDFs, I want to name each new file using an element from my character vector.
Let's say I have a character vector called charactervector and its first element is "MyImage1". I could do NewName <- paste(charactervector), but NewName wouldn't be recognized by file = "", which would simply save the file as MyImage1.pdf. How can I accomplish what I want to do?
Getting duplicate rows when merging two data frames in R
I am trying to get every pitch thrown by Justin Verlander in a specific 2011 game, but when I use the merge function, rows repeat. I should have somewhere around 100+ rows (the total number of pitches he threw during that game), not the seven thousand I get in the output. I merged the two data frames by url as the primary key, but I am not sure that is correct.
library("Lahman")
library("pitchRx")
library("ggplot2")
library("tidyverse")
library("dplyr")

pitching_05_07_2011 <- scrape(start = "2011-05-07", end = "2011-05-07")
atbats <- pitching_05_07_2011$atbat
pitches <- pitching_05_07_2011$pitch
head(atbats)
head(pitches)

verlander_nohitter <- filter(atbats, atbats$pitcher_name == "Justin Verlander")
verlander_nohitter
pitching_atbats <- merge(verlander_nohitter, pitches, by = "url")
pitching_atbats
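The row explosion described here is generic many-to-many join behaviour, not something R-specific: when the merge key is not unique in either table, every matching row on one side is paired with every matching row on the other. A small pandas sketch of the same mechanism (the column names `batter`, `speed`, and `num` are illustrative stand-ins, not the exact pitchRx schema):

```python
import pandas as pd

# Two tables sharing a non-unique key: every at-bat row from one game
# carries the same 'url', and so does every pitch row.
atbats = pd.DataFrame({"url": ["g1"] * 3, "batter": ["A", "B", "C"]})
pitches = pd.DataFrame({"url": ["g1"] * 4, "speed": [90, 91, 92, 93]})

# Merging on 'url' alone pairs every at-bat with every pitch: 3 x 4 = 12 rows.
wide = atbats.merge(pitches, on="url")
print(len(wide))  # 12

# Adding a column identifying the specific at-bat restores a one-to-many join.
atbats["num"] = [1, 2, 3]
pitches["num"] = [1, 1, 2, 3]
narrow = atbats.merge(pitches, on=["url", "num"])
print(len(narrow))  # 4
```

The fix in R's merge() is analogous: pass a vector of keys to `by` that uniquely identifies the at-bat, not just the game.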
KerasRegressor: ValueError: continuous is not supported
I am trying to apply a regression learning method to my data which has 28 dimensions.
import numpy
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# load training dataset
dataframe = pd.read_csv("gold_train_small.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:, 1:29]
Y = dataset[:, 0]

# load test dataset
dataframe = pd.read_csv("gold_test.csv", header=None)
dataset = dataframe.values
X_test = dataset[:, 1:29]
Y_test = dataset[:, 0]

# define base model
def baseline_model():
    model = Sequential()
    model.add(Dense(28, input_dim=28, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# evaluate model
estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, random_state=seed)
results = cross_val_score(estimator, X, Y, cv=kfold)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
# Baseline: 31.64 (26.82) MSE

# evaluate model with standardized dataset
numpy.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=50, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10, random_state=seed)
results = cross_val_score(pipeline, X, Y, cv=kfold)
print("Standardized: %.2f (%.2f) MSE" % (results.mean(), results.std()))

estimator.fit(X, Y)
prediction = estimator.predict(X_test)
accuracy_score(Y_test, prediction)
However, I receive the following error for the last line:
ValueError: continuous is not supported
Should I use other measures?
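For reference, accuracy_score is defined only for class labels, which is why it raises "ValueError: continuous is not supported" on real-valued targets; the usual regression measures are mean squared error, mean absolute error, and R². A minimal sketch with sklearn.metrics, using small stand-in arrays in place of Y_test and the predictions:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # stand-in for Y_test
y_pred = [2.5, 0.0, 2.0, 8.0]    # stand-in for estimator.predict(X_test)

print(mean_squared_error(y_true, y_pred))   # 0.375
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(r2_score(y_true, y_pred))             # ~0.9486
```

Any of these can replace the accuracy_score call on the last line of the question's code.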
How to specify degree argument in ns() in R, for constructing natural spline of degree 5?
library(ISLR)
library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60), degree = 5), data = Wage)
fit <- lm(wage ~ ns(age, knots = c(25, 40, 60), degree = 5), data = Wage)  # ns() has no 'degree' argument
I am able to build a regression spline with a degree-5 polynomial, but how do I build a natural spline of degree 5, given that the ns() function lacks a degree argument?
I am only able to produce a cubic natural spline using ns(). Are there any other functions that could produce, say, quadratic natural splines, etc.?
Wildfly 10.1 MySQL replication: second database is read-only exception
I started using Wildfly 10.1 with MySQL as the database. I have two databases, DB1 and DB2, and Wildfly should connect to DB2 when DB1 disconnects. I have that working so far, but when I connect to DB2 I get a "Connection is read-only" error. I looked at the topic here, Database Fail Over in Jboss Data sources, but
> <connection-property name="readOnly">false</connection-property>
did not resolve it. I'm looking for a solution to this: I want master/master, not master/slave. Here is the MySQL configuration in my standalone.xml:
<datasources>
    <datasource jndi-name="java:jboss/datasources/ExampleDS" pool-name="ExampleDS" enabled="true" use-java-context="true">
        <connection-url>jdbc:h2:mem:test;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE</connection-url>
        <driver>h2</driver>
        <security>
            <user-name>sa</user-name>
            <password>sa</password>
        </security>
    </datasource>
    <datasource jta="true" jndi-name="java:jboss/datasources/RailbaseLabDS" pool-name="RailbaseLabDS" enabled="true" use-java-context="true" use-ccm="true">
        <connection-url>jdbc:mysql://IP1:3306,IP2:3306/DBN?autoreconnect=true</connection-url>
        <driver-class>com.mysql.jdbc.Driver</driver-class>
        <driver>mysql</driver>
        <url-delimiter>|</url-delimiter>
        <security>
            <user-name>DBN</user-name>
            <password>DBN</password>
        </security>
        <validation>
            <valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.mysql.MySQLValidConnectionChecker"/>
            <check-valid-connection-sql>select 1</check-valid-connection-sql>
            <background-validation>true</background-validation>
            <background-validation-millis>5000</background-validation-millis>
        </validation>
    </datasource>
    <drivers>
        <driver name="h2" module="com.h2database.h2">
            <xa-datasource-class>org.h2.jdbcx.JdbcDataSource</xa-datasource-class>
        </driver>
        <driver name="mysql" module="com.mysql">
            <xa-datasource-class>com.mysql.jdbc.jdbc2.optional.MysqlXADataSource</xa-datasource-class>
        </driver>
    </drivers>
</datasources>
Probabilistic classification with Gaussian Bayes Classifier vs Logistic Regression
I have a binary classification problem with a few strong features that can predict almost 100% of the test data, because the problem is relatively simple.
However, the nature of the problem means I cannot afford mistakes, so instead of giving a prediction I am not sure of, I would rather have the output as a probability, set a threshold, and be able to say: "if I am less than 95% sure, I will call this NOT SURE and act accordingly". Saying "I don't know" is better than making a mistake.
So far so good.
For this purpose, I tried a Gaussian Bayes classifier (I have a continuous feature) and logistic regression, both of which provide the probability as well as the predicted class.
Coming to my Problem:
GBC has around a 99% success rate, while logistic regression is lower at around 96%, so I would naturally prefer GBC. However, as successful as GBC is, it is also very sure of itself: the probabilities I get are either 1 or very close to 1, such as 0.9999997, which makes things tough, because in practice GBC no longer provides me usable probabilities.
Logistic regression performs worse, but at least gives better and more sensible probabilities.
By the nature of my problem, the cost of misclassification grows as a power of 2: if I misclassify 4 products, I lose 2^4 times more (it's unitless, but gives an idea anyway).
In the end, I would like to classify with higher success than logistic regression, but also obtain more informative probabilities, so I can set a threshold and flag the cases I am not sure of.
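The reject-option described here is just a threshold rule applied to whatever probability the classifier emits, independent of which classifier produced it. A tiny sketch (the function name and labels are made up):

```python
def decide(p_positive, threshold=0.95):
    """Map a predicted positive-class probability to a decision with a reject option."""
    if p_positive >= threshold:
        return "POSITIVE"
    if p_positive <= 1.0 - threshold:
        return "NEGATIVE"
    return "NOT SURE"

print(decide(0.9999997))  # POSITIVE
print(decide(0.50))       # NOT SURE
print(decide(0.01))       # NEGATIVE
```

Separately, the overconfidence complaint about the Gaussian Bayes classifier is a probability-calibration problem; Platt scaling and isotonic regression are the standard remedies for rescaling such probabilities before thresholding.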
Thanks in advance.
How can I use stepwise regression to remove a specific coefficient in logistic regression within R?
When I run the logistic regression for a cars dataset:
carlogistic.fit4 <- glm(as.factor(Mpg01) ~ Weight+Year+Origin, data=carslogic, family="binomial")
summary(carlogistic.fit4)
I get the output below:

Call:
glm(formula = as.factor(Mpg01) ~ Weight + Year + Origin, family = "binomial", data = carslogic)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.29189  -0.10014  -0.00078   0.19699   2.60606

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    -2.697e+01  5.226e+00  -5.161 2.45e-07 ***
Weight         -6.006e-03  7.763e-04  -7.737 1.02e-14 ***
Year            5.677e-01  8.440e-02   6.726 1.75e-11 ***
OriginGerman    1.256e+00  5.172e-01   2.428   0.0152 *
OriginJapanese  3.250e-01  5.462e-01   0.595   0.5519
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 549.79  on 396  degrees of freedom
Residual deviance: 151.06  on 392  degrees of freedom
AIC: 161.06
However, the p-value for Japanese-origin cars is greater than 0.05 and hence insignificant. I want to remove it from the model, but the column in the data is Origin, as in the code above. How do I exclude the Japanese level specifically from the model?
chi-square goodness-of-fit and R-square measures from the fixed-effect logistic regression using 'feglm' function
I am trying to get the chi-square goodness-of-fit and R-squared measures from the following fixed-effects logistic regression fit with the feglm function, but I can find very little information to even check this.
> regress=feglm(Y ~ X1+X2+X3+X4+X5+X6*X10+X7+X8+X9+X11 | Firm+Time, data=DATA, family=binomial(link="logit"))
> summary(regress)
binomial
Y ~ X1 + X2 + X3 + X4 + X5 + X6 * X10 + X7 + X8 + X9 + X11 | Firm + Time
l= [127, 15], n= 14139, deviance= 9891.112

Structural parameter(s):
         Estimate Std. error z value Pr(> |z|)
X1     -7.006e-02  3.990e-03 -17.560   < 2e-16 ***
X2      1.473e+00  1.047e-01  14.077   < 2e-16 ***
X3     -9.105e-02  2.691e-02  -3.384  0.000715 ***
X4     -2.896e-04  3.294e-05  -8.791   < 2e-16 ***
X5      1.223e-01  4.557e-03  26.848   < 2e-16 ***
X6      1.154e-01  2.267e-01   0.509  0.610699
X10    -6.273e-03  2.387e+00  -0.003  0.997903
X7      2.663e-02  1.192e-02   2.234  0.025453 *
X8      2.940e-01  9.002e-02   3.266  0.001092 **
X9      4.115e+00  1.080e-01  38.103   < 2e-16 ***
X11     1.115e-03  3.442e-01   0.003  0.997415
X6:X10  3.344e-02  2.533e-01   0.132  0.894962
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
( 6244 observation(s) deleted due to missingness )
I suppose I need at least the log-likelihood, residual deviance, etc. to calculate the chi-square and R-squared values, and I cannot find those in the output above.
May I get help on this?
Linear program feasible using linprog but infeasible using Gurobi in Matlab
I have the following very simple linear programming problem to solve in Matlab
clear
%The unknown
%x=[x1,...,x10];
%The constraints
%x2+x8=Phi12
%x3+x7=Phi21
%x5=infvalue;
%x10=infvalue;
%The known parameters
Phi12=-3.3386;
Phi21=3.0722;
infvalue=50;
sizex=10; %size of the unknown
The problem admits a solution.
When I implement this LP using linprog, it finds a solution.
When I implement it using the Gurobi solver, it tells me the problem is infeasible.
What am I doing wrong? Here's my code.
beq=[Phi12; Phi21; infvalue; infvalue];
rAeq=[ 1 1 ...
    2 2 ...
    3 ...
    4];
cAeq=[ 2 8 ...
    3 7 ...
    5 10];
fillAeq=[1 1 ...
    1 1 ...
    ones(1,2)];
Aeq=sparse(rAeq, cAeq, fillAeq, size(beq,1), sizex);
Aeqfull=full(Aeq);

%linprog
f=zeros(sizex,1);
xlinprog = linprog(f,[],[],Aeqfull,beq);

%Gurobi
clear model;
model.A=Aeq;
model.rhs=beq;
model.sense=repmat('=', size(Aeq,1),1);
model.obj=f;
resultgurobi=gurobi(model);
During my attempts to understand what is going on, I noticed that if I put any positive value in place of -3.3386, then Gurobi works perfectly.
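One detail worth checking when comparing solvers: MATLAB's linprog treats variables as unbounded unless lower/upper bounds are given, while some solver interfaces default variables to being nonnegative, which makes a constraint like x2 + x8 = -3.3386 unsatisfiable. As a point of reference (a sketch, not the original MATLAB), the same equality system restated with SciPy, whose linprog also defaults to bounds of [0, ∞):

```python
from scipy.optimize import linprog

Phi12, Phi21, infvalue = -3.3386, 3.0722, 50.0
sizex = 10

# Equality constraints from the question (1-based indices as in the MATLAB code):
# x2+x8 = Phi12, x3+x7 = Phi21, x5 = infvalue, x10 = infvalue.
A_eq = [[0.0] * sizex for _ in range(4)]
for row, cols in enumerate([(2, 8), (3, 7), (5,), (10,)]):
    for col in cols:
        A_eq[row][col - 1] = 1.0
b_eq = [Phi12, Phi21, infvalue, infvalue]

c = [0.0] * sizex  # pure feasibility problem: zero objective

# With SciPy's default bounds of (0, None) per variable this system is
# infeasible, since x2 + x8 = -3.3386 has no nonnegative solution;
# explicitly freeing the bounds makes it feasible.
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(None, None)] * sizex)
print(res.status)  # 0 means an optimal (here: feasible) point was found
```

The analogous thing to verify on the Gurobi side is what lower bound the model's variables receive by default.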
Extracting an LP model from AMPL
I have a very big and complicated LP model in AMPL. I need to extract the Ax <= b form of my LP; that is, I need to extract all my data as the matrix A and the vector b, with all variables concatenated into one large vector x. How can I do that?
Python pulp constraint - Doubling the weight of any one variable which contributes the most
I am trying to use http://www.philipkalinda.com/ds9.html to set up a constrained optimisation.
prob = pulp.LpProblem('FantasyTeam', pulp.LpMaximize)
decision_variables = []
res = self.team_df

# Set up the LP
for rownum, row in res.iterrows():
    variable = str('x' + str(rownum))
    variable = pulp.LpVariable(str(variable), lowBound=0, upBound=1, cat='Integer')  # make variables binary
    decision_variables.append(variable)

print("Total number of decision_variables: " + str(len(decision_variables)))

total_points = ""
for rownum, row in res.iterrows():
    for i, player in enumerate(decision_variables):
        if rownum == i:
            formula = row['TotalPoint'] * player
            total_points += formula

prob += total_points
print("Optimization function: " + str(total_points))
The above creates an optimisation where, if the points earned by x1, x2, ..., xn are X1, X2, ..., Xn, it maximises x1*X1 + x2*X2 + ... + xn*Xn. However, in my case I need to double the points of the variable that earns the most points. How do I set this up?
Maximize OBJ: 38.1 x0 + 52.5 x1 + 31.3 x10 + 7.8 x11 + 42.7 x12 + 42.3 x13 + 4.7 x14 + 49.5 x15 + 21.2 x16 + 11.8 x17 + 1.4 x18 + 3.2 x2 + 20.8 x3 + 1.2 x4 + 24 x5 + 25.9 x6 + 27.8 x7 + 6.2 x8 + 41 x9
When I maximise the plain sum, x1 gets dropped, but when the top player earns double points, x1 should be in the solution.
Here are the constraints I am using:-
Subject To
_C1: 10.5 x0 + 21.5 x1 + 17 x10 + 7.5 x11 + 11.5 x12 + 12 x13 + 7 x14 + 19 x15 + 10.5 x16 + 5.5 x17 + 6.5 x18 + 6.5 x2 + 9.5 x3 + 9 x4 + 12 x5 + 12 x6 + 9.5 x7 + 7 x8 + 14 x9 <= 100
_C10: x12 + x2 + x6 >= 1
_C11: x10 + x11 + x17 + x3 <= 4
_C12: x10 + x11 + x17 + x3 >= 1
_C13: x0 + x10 + x11 + x12 + x13 + x14 + x15 + x18 + x2 <= 5
_C14: x0 + x10 + x11 + x12 + x13 + x14 + x15 + x18 + x2 >= 3
_C15: x1 + x16 + x17 + x3 + x4 + x5 + x6 + x7 + x8 + x9 <= 5
_C16: x1 + x16 + x17 + x3 + x4 + x5 + x6 + x7 + x8 + x9 >= 3
_C2: x0 + x1 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 = 8
_C3: x0 + x14 + x16 + x5 <= 4
_C4: x0 + x14 + x16 + x5 >= 1
_C5: x15 + x18 + x4 + x7 + x8 <= 4
_C6: x15 + x18 + x4 + x7 + x8 >= 1
_C7: x1 + x13 + x9 <= 4
_C8: x1 + x13 + x9 >= 1
_C9: x12 + x2 + x6 <= 4
Naturally, maximising A + B + C + D doesn't maximise max(2A+B+C+D, A+2B+C+D, A+B+2C+D, A+B+C+2D)
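A standard linear reformulation of "double the top scorer among the selected players" is to add a second set of binaries y_i ("this player is the doubled one") with sum(y) = 1 and y_i <= x_i, and maximise sum(p_i x_i) + sum(p_i y_i): at the optimum the solver always assigns the double to the best selected player. A brute-force check on a toy instance (the point values and the pick-exactly-2 rule are made-up stand-ins for the real roster constraints) that the linearised objective agrees with the intended max(...) objective:

```python
from itertools import product

points = [38.1, 52.5, 3.2, 20.8]  # made-up points; the real ones come from row['TotalPoint']
n, pick = len(points), 2          # 'pick exactly 2' stands in for the real constraints

def feasible_x():
    return (x for x in product([0, 1], repeat=n) if sum(x) == pick)

# Intended objective: plain sum, with the best selected player's points doubled.
intended = max(
    sum(p * xi for p, xi in zip(points, x)) + max(p for p, xi in zip(points, x) if xi)
    for x in feasible_x()
)

# Linearised objective: extra binaries y with sum(y) == 1 and y_i <= x_i,
# maximising sum(p_i * x_i) + sum(p_i * y_i).
linearised = max(
    sum(p * xi for p, xi in zip(points, x)) + sum(p * yi for p, yi in zip(points, y))
    for x in feasible_x()
    for y in product([0, 1], repeat=n)
    if sum(y) == 1 and all(yi <= xi for xi, yi in zip(x, y))
)

print(intended == linearised)  # the two formulations agree at the optimum
```

In PuLP terms this would mean a second list of `pulp.LpVariable(..., cat='Binary')` variables y, the constraints `prob += pulp.lpSum(y) == 1` and `prob += y[i] <= x[i]` for each i, and the doubled-points term added to the objective.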