Specifying Reference Category for Outcome Variable in Multinomial Logit Using SKlearn's LogisticRegression
I am trying to fit a multinomial logit model using the LogisticRegression class from scikit-learn.
My outcome (y) has 4 levels. I need to specify one of these levels as the reference category (or baseline). Does LogisticRegression provide a way of specifying this reference category?
1 answer

LogisticRegression for multiple classes in sklearn uses either a one-vs-rest or a softmax (multinomial) parameterization of the problem, depending on how you configure it. In either case it does not compute the solution relative to a reference category; instead it computes a vector of coefficients for each output class. If you use the multinomial specification, you can take the coefficient vector of the class you would like to treat as the reference and subtract it from all the others, which recovers a solution equivalent to the one you seem to want.
See the docs for how to specify multinomial: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
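A minimal sketch of that subtraction, using the iris data (3 classes) as a stand-in for a 4-level outcome; with the default lbfgs solver, recent scikit-learn fits the softmax (multinomial) model for multiclass targets:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# iris stands in for your data: 3 outcome levels instead of 4,
# but the re-parameterization works identically
X, y = load_iris(return_X_y=True)

# the default 'lbfgs' solver fits the softmax (multinomial) model
# when y has more than two classes
clf = LogisticRegression(max_iter=1000).fit(X, y)

ref = 0  # index of the class you want as the reference category
coef_rel = clf.coef_ - clf.coef_[ref]            # log-odds vs. class `ref`
intercept_rel = clf.intercept_ - clf.intercept_[ref]

# the reference row is now all zeros, as in a classical multinomial logit
print(np.allclose(coef_rel[ref], 0.0))
```

The shift leaves predictions untouched: subtracting the reference row changes every class score by the same per-sample amount, and softmax is invariant to such a shift.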
See also questions close to this topic

Training an ML model on two different datasets before using test data?
So I have the task of using a CNN for facial recognition, i.e. classifying faces into different classes of people, with each individual person being its own separate class. The training data I am given is very limited: I only have one image for each class. I have 100 classes (so I have 100 images in total, one image of each person). The approach I am using is transfer learning with the GoogLeNet architecture. However, instead of just training GoogLeNet on the images of the people I have been given, I want to first train it on a separate, larger set of different face images, so that by the time I train it on my given data, the model has already learnt the features it needs to classify faces generally. Does this make sense / will this work?

Using Matlab, as of now, I have changed the fully connected layer and the classification layer to train it on the Yale Face Database, which consists of 15 classes. I achieved 91% validation accuracy using this database. Now I want to retrain this saved model on my provided data (100 classes with one image each). What would I have to do to this saved model to train it on the new dataset without losing the features it has learned from the Yale database? Do I just change the last fully connected and classification layers again and retrain? Would that be pointless and mean I just lose all of the progress from before, i.e. will it make new weights from scratch, or will it use the previously learned weights to train even better on my new dataset? Or should I train the model with my training data and the Yale database all at once?

I have a separate set of test data provided for me which I do not have the labels for, and this is what is used to test the final model and give me my score/grade. Please help me understand if what I'm saying is viable or if it's nonsense; I'm confused, so I would appreciate being pointed in the right direction.

What's the best way to select variable in random forest model?
I am training RF models in R. What is the best way of selecting variables for my models? (The datasets are pretty big; each has around 120 variables in total.) I know that there is a cross-validation way of selecting variables for other classification algorithms such as KNN. Is there a similar approach for variable selection or parameter tuning when training RF models?
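The question is about R, but as a sketch of one common approach (the same idea works via randomForest's importance() in R), scikit-learn can threshold on the forest's built-in feature importances with SelectFromModel; the dataset and parameter choices below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic stand-in for a dataset with ~120 variables
X, y = make_classification(n_samples=300, n_features=120, n_informative=10,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# keep only features whose impurity-based importance reaches the median
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # roughly half of the 120 columns survive
```

To avoid selection bias, the selection step should sit inside the cross-validation loop rather than be run once on the full data.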

How would I put my own dataset into this code?
I have been looking at a TensorFlow tutorial for unsupervised learning, and I'd like to put in my own dataset; the code currently uses the MNIST dataset. I know how to create my own datasets in TensorFlow, but I have trouble adapting the code used here to my own data. I am pretty new to TensorFlow, and the file paths to my dataset in my project are
\data\training
and \data\testval\
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0-preview is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255
X_train, X_valid = X_train_full[:5000], X_train_full[5000:]
y_train, y_valid = y_train_full[:5000], y_train_full[5000:]

def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))

tf.random.set_seed(42)
np.random.seed(42)

conv_encoder = keras.models.Sequential([
    keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
    keras.layers.Conv2D(16, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(64, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2)
])
conv_decoder = keras.models.Sequential([
    keras.layers.Conv2DTranspose(32, kernel_size=3, strides=2, padding="VALID",
                                 activation="selu", input_shape=[3, 3, 64]),
    keras.layers.Conv2DTranspose(16, kernel_size=3, strides=2, padding="SAME",
                                 activation="selu"),
    keras.layers.Conv2DTranspose(1, kernel_size=3, strides=2, padding="SAME",
                                 activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
conv_ae = keras.models.Sequential([conv_encoder, conv_decoder])
conv_ae.compile(loss="binary_crossentropy", optimizer=keras.optimizers.SGD(lr=1.0),
                metrics=[rounded_accuracy])
history = conv_ae.fit(X_train, X_train, epochs=5, validation_data=[X_valid, X_valid])
conv_encoder.summary()
conv_decoder.summary()
conv_ae.save("\models")
Do note that I got this code from another StackOverflow answer.

KeyError when processing pandas dataframe
For a pathway pi, the CNA data of associated genes were extracted from the CNV matrix (C), producing an intermediate matrix B∈Rn×ri, where ri is the number of genes involved in the pathway pi. That is, the matrix B consists of samples in rows and genes for a given pathway in columns. Using principal component analysis (PCA), the matrix B was decomposed into uncorrelated components, yielding Gpi∈Rn×q, where q is the number of principal components (PCs).
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
import csv

def get_kegg_pathways():
    kegg_pathways = []
    with open(directory + "hsa.txt", newline="") as keggfile:
        kegg = pd.read_csv(keggfile, sep="\t")
        for row in kegg:  # for row in kegg.itertuples():
            kegg_pathways.append(row)
    return kegg_pathways

def main():
    # Pathway info
    kegg = get_kegg_pathways()

    # q : number of principal components (PCs)
    # C : CNV matrix
    # G : mRNA expression matrix
    # M : DNA methylation matrix
    q = 5
    C = []
    G = []
    M = []

    # Process common data (denoted as matrix B)
    cna_sample_index = {}
    process_common = True
    if process_common:
        for i, p in enumerate(kegg):
            genes = {}
            first = True
            for s in p:
                if first:
                    first = False
                else:
                    if s != "NA":
                        genes[s] = 1
            # Loop through each sample
            B = []
            pathways = []
            for s in ld:
                B.append([])
                pathways.append(cna_sample_index[p])
            Bi = 0
            for index, row in cna.df.itertuples():
                if row[0].upper() in genes:
                    Bi2 = Bi
                    for c in pathways:
                        B[Bi2].append(cna.df.iloc[index, c])
                        Bi2 = Bi2 + 1
            pca_cna = cna.fit()
            pca_cna.fit(B)
Traceback:
File "/home/melissachua/main.py", line 208, in <module>
    main()
File "/home/melissachua/main.py", line 165, in main
    pathways.append(cna_sample_index[p])
KeyError: 'hsa00010_Glycolysis_/_Gluconeogenesis'
kegg:

                                       0    1
0  hsa00010_Glycolysis_/_Gluconeogenesis  NaN
1     hsa00020_Citrate_cycle_(TCA_cycle)  NaN
2     hsa00030_Pentose_phosphate_pathway  NaN

cna:

   Hugo_Symbol  TCGA02000101  TCGA02000102  TCGA02000103
0        0.001         0.002         0.003         0.004
1        0.005         0.006         0.007         0.008
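As an aside on the error itself (what cna_sample_index should actually contain is not shown in the snippet, so that part has to stay open): a KeyError simply means the dict has no entry for that exact string, and a membership test or .get avoids the exception. A minimal reproduction:

```python
# hypothetical stand-in for the script's pathway -> index mapping
cna_sample_index = {}
key = 'hsa00010_Glycolysis_/_Gluconeogenesis'

# direct indexing raises KeyError when the key is absent
try:
    cna_sample_index[key]
except KeyError:
    print("missing key:", key)

# .get returns None (or a supplied default) instead of raising
print(cna_sample_index.get(key))
```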
Is there a way to use mutual information as part of a pipeline in scikit learn?
I'm creating a model with scikit-learn. The pipeline that seems to work best is:
- mutual_info_classif with a threshold
- PCA
- LogisticRegression
I'd like to do them all using sklearn's pipeline object, but I'm not sure how to get the mutual info classification in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline([
    ('dim_red', pca),
    ('pred', lr)
])
But I don't see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
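For what it's worth, SelectPercentile (or SelectKBest) accepts mutual_info_classif as its score function, so the mutual-information step can live inside the Pipeline; a sketch using a bundled dataset, with the percentile chosen arbitrarily:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

pipe = Pipeline([
    # keep the top 50% of features by mutual information with y
    ('mi', SelectPercentile(mutual_info_classif, percentile=50)),
    ('dim_red', PCA(random_state=100)),
    ('pred', LogisticRegression(random_state=200, max_iter=5000)),
])
pipe.fit(X, y)
print(round(pipe.score(X, y), 3))
```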

Slicing an image into color based layers with Sklearn
I have an image. I applied KMeans color clustering to this image, and now I need to present only the purple clusters in one image and the orange clusters in a different image. How do I do that?
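A sketch of one way to do the slicing, assuming the image is already loaded as an (H, W, 3) array (replaced here by random data): each pixel gets a cluster label, and masking on that label yields one "layer" per color cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# random stand-in for an image loaded as an (H, W, 3) uint8 array
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(40, 40, 3), dtype=np.uint8)

# cluster pixel colors
pixels = img.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
labels = km.labels_.reshape(img.shape[:2])

# build one image per cluster: keep that cluster's pixels, black out the rest
layers = []
for k in range(4):
    layer = np.zeros_like(img)
    mask = labels == k
    layer[mask] = img[mask]
    layers.append(layer)
```

Which cluster is "purple" or "orange" can be read off km.cluster_centers_, where each row is an RGB centroid.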

Logistic regression model with R
I need to build a matrix in R (biostatistics) for 30 patients with random DLT = 0/1 outcomes at 4 doses with initial toxicity prior = c(0.02, 0.05, 0.10, 0.20), plus an age column as a categorical variable, using a logistic model and linear regression.

How can I plot a model with a dichotomous response variable?
My project is to make and compare models where the response factor is binary. The model takes factors and determines whether the outcome would be 0 or 1. The dataset is about a person purchasing or not purchasing a bike based on various factors, i.e. age, cars, commute, income, children.
I've constructed multiple models (logistic model, KNN, and K-means clustering), but now I'm tasked with comparing them. A plot would be nice, but I'm having a hard time plotting it.
Code chunk with output of graph
Python
import seaborn as sns  # needed for sns.barplot below
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X = df.drop(['Bought Bike'], axis=1)
Y = df['Bought Bike']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.20, random_state=101)

lm = LogisticRegression()
lm.fit(X_train, Y_train)
Reg_predictions = lm.predict(X_test)

sns.barplot(Y_test, Reg_predictions)
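A bar plot of raw predictions is hard to read for a 0/1 response; a confusion matrix per model is a more direct comparison. A sketch with synthetic data standing in for the bike dataframe (sns.heatmap(cm, annot=True) can then plot it):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# synthetic stand-in for the bike-purchase data (binary response)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=101)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=101)

lm = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
pred = lm.predict(X_test)

# rows = actual class, columns = predicted class
cm = confusion_matrix(Y_test, pred)
print(cm)
```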

Adjusted Odds Ratio in R grouping by multiple independent variables
Apologies if this has been answered elsewhere; I am new to R and not sure what terms I should be looking for.
I want to fit a multivariable logistic regression to get the crude and adjusted odds for the presence of damage in buildings, given independent variables on housing characteristics. I have a dataset like this:
damage_homes <- data.frame(detached=c(1, 0, 0),
                           apartment=c(0, 1, 1),
                           flat_roof=c(1, 0, 0),
                           pitched_roof=c(0, 1, 1),
                           monopitch_roof=c(0, 0, 1),
                           damage=c(1, 0, 0))
I put just roofs here, but I have lots of independent variables on walls, floors, etc. Some buildings can have multiple characteristic types (e.g. one of the above buildings has both a pitched and a monopitched roof, for example in an extension).
I'm trying to figure out the best way to do a logistic regression to get the odds ratios. If I just do it by the categories I have now, e.g.
reg <- glm(damage ~ detached + apartment + flat_roof + pitched_roof + monopitch_roof,
           data=damage_homes, family=binomial(link="logit"))
exp(cbind(coef(reg), confint(reg)))
then I get the odds for the presence of something vs the absence (?)
detached    1.1274825  (0.9879778, 1.2904948)
apartment   0.7132140  (0.5758667, 0.8803410)
flat_roof   1.05       (1.01868, 1.197979)
I'm not sure this makes sense to me; shouldn't the odds be relative to some 'base' characteristic? I.e. what are the odds for an apartment relative to a detached house, instead of the odds for detached house = 1 vs 0 (something else)?
Expanding on this:
- There are many independent columns, all of which can be grouped into e.g. roof type or wall type, and all of them can have more than one type
- I am not sure I should combine the related independent variables into a single variable. If a building has a flat roof and a pitched roof, I want it to be included in both rather than in a new 'mixed' category
- I have thought about melting the table, but presumably this could lead to duplication of other independent variables, which could impact the results?
I'm sure this is a common issue, I am just not sure what to look for or how to solve it.
Thanks!
JT

Python Scipy multinomial.pmf function different results?
I was working on figuring out an exact multinomial test of goodness-of-fit with Python (3.9) in JupyterLab via Anaconda on Windows 10. To do this, I need the probability mass function of the multinomial distribution. SciPy has a nice function for this. I've set the following:
from scipy.stats import multinomial
multinomial.pmf(x=[16, 17, 22], n=55, p=[1/3, 1/3, 1/3])
The result of this is 0.008700756210332266
Now, note that the expected proportions are equal for each category, so the order of the observed counts should not matter, however:
from scipy.stats import multinomial
multinomial.pmf(x=[22, 17, 16], n=55, p=[1/3, 1/3, 1/3])
This results in: 0.00870075621033202
A very small difference. However, for the exact multinomial test I'm iterating over a lot of combinations and checking whether each probability is less than or equal to the observed one, so these tiny differences quickly add up.
I've solved the issue myself by sorting each option, and another option would of course be to simply round the decimals to 8 or so.
The question: Why does the multinomial.pmf function from Scipy give different results when they should be the same?
Thanks in advance for any comments/suggestions.
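For what it's worth, this is ordinary floating-point behavior rather than a SciPy bug: float addition is not associative, so summing the same log-terms in a different order (which happens when the counts in x are permuted) can change the last few bits of the result. A minimal illustration, with math.isclose as an alternative to sorting or rounding:

```python
import math

# mathematically identical sums, evaluated in a different order
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)       # False: the two results differ in the last bits
print(abs(a - b))   # tiny, on the order of 1e-16

# tolerance-based comparison is robust to evaluation order
print(math.isclose(a, b))  # True
```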

lrtest() not working: post-hoc testing for multinomial vglm() models
I am running a multinomial analysis with vglm(). It all works, but then I try to follow the instructions from the following website (https://rcompanion.org/handbook/H_08.html) to do a pairwise test, because emmeans cannot handle pairwise comparisons for vglm models. The lrtest() part gives me the following error:

Error in lrtest.default(model) : 'list' object cannot be coerced to type 'double'
I cannot figure out what is wrong, I even copy and pasted the exact code that the website used (see below) and get the same error with their own code and dataset. Any ideas?
Their code and suggestion for doing pairwise testing with vglm() is the only pairwise testing option I found for vglm() anywhere on the web.
Here is the code along with all the expected output and extra details from their website (it is simpler than mine but gets the same error anyway).
Input = ("
County       Sex     Result  Count
Bloom        Female  Pass     9
Bloom        Female  Fail     5
Bloom        Male    Pass     7
Bloom        Male    Fail    17
Cobblestone  Female  Pass    11
Cobblestone  Female  Fail     4
Cobblestone  Male    Pass     9
Cobblestone  Male    Fail    21
Dougal       Female  Pass     9
Dougal       Female  Fail     7
Dougal       Male    Pass    19
Dougal       Male    Fail     9
Heimlich     Female  Pass    15
Heimlich     Female  Fail     8
Heimlich     Male    Pass    14
Heimlich     Male    Fail    17
")

Data = read.table(textConnection(Input), header=TRUE)

### Order factors, otherwise R will alphabetize them
Data$County = factor(Data$County, levels=unique(Data$County))
Data$Sex    = factor(Data$Sex, levels=unique(Data$Sex))
Data$Result = factor(Data$Result, levels=unique(Data$Result))

### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)

### Remove unnecessary objects
rm(Input)

### Multinomial regression
library(VGAM)
model = vglm(Result ~ Sex + County + Sex:County,
             family=multinomial(refLevel=1),
             weights=Count,
             data=Data)
summary(model)

library(car)
Anova(model, type="II", test="Chisq")
Analysis of Deviance Table (Type II tests)

Response: Result
            Df  Chisq Pr(>Chisq)
Sex          1 6.7132    0.00957 **
County       3 4.1947    0.24120
Sex:County   3 7.1376    0.06764 .

library(rcompanion)
nagelkerke(model)
$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                            0.0797857
Cox and Snell (ML)                  0.7136520
Nagelkerke (Cragg and Uhler)        0.7136520

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq   p.value
       7     -10.004 20.009 0.0055508
library(lmtest)
lrtest(model)
Likelihood ratio test

Model 1: Result ~ Sex + County + Sex:County
Model 2: Result ~ 1
  #Df  LogLik Df  Chisq Pr(>Chisq)
1   8 -115.39
2  15 -125.39  7 20.009   0.005551 **

Post-hoc analysis
At the time of writing, the lsmeans package cannot be used with vglm models.
One option for post-hoc analysis would be to conduct analyses on reduced models that include only two levels of a factor. For example, if the County × Sex interaction term had been significant, the following code could be used to create a reduced dataset with only Bloom–Female and Bloom–Male, and to analyze this data with vglm.
Data.b = Data[Data$County=="Bloom" & (Data$Sex=="Female" | Data$Sex=="Male"), ]
Data.b$County = factor(Data.b$County)
Data.b$Sex = factor(Data.b$Sex)
summary(Data.b)
   County      Sex     Result     Count
 Bloom:4   Female:2   Pass:2   Min.   : 5.0
           Male  :2   Fail:2   1st Qu.: 6.5
                               Median : 8.0
                               Mean   : 9.5
                               3rd Qu.:11.0
                               Max.   :17.0

library(VGAM)
model.b = vglm(Result ~ Sex,
               family=multinomial(refLevel=1),
               weights=Count,
               data=Data.b)
lrtest(model.b)
Likelihood ratio test

  #Df  LogLik Df  Chisq Pr(>Chisq)
1   2 -23.612
2   3 -25.864  1 4.5041    0.03381 *

Summary table of results

Comparison                              p-value
Bloom–Female - Bloom–Male                 0.034
Cobblestone–Female - Cobblestone–Male    0.0052
Dougal–Female - Dougal–Male                0.44
Heimlich–Female - Heimlich–Male            0.14
p.value = c(0.034, 0.0052, 0.44, 0.14)
p.adj = p.adjust(p.value, method = "fdr")
p.adj = signif(p.adj, 2)
p.adj
[1] 0.068 0.021 0.440 0.190
Comparison                              p-value  p.adj
Bloom–Female - Bloom–Male                 0.034  0.068
Cobblestone–Female - Cobblestone–Male    0.0052  0.021
Dougal–Female - Dougal–Male                0.44   0.44
Heimlich–Female - Heimlich–Male            0.14   0.19