Average precision score too high given the confusion matrix
I am developing a scikit-learn machine learning model on an imbalanced dataset (binary classification). Looking at the confusion matrix and the F1 score, I would expect a lower average precision score, but I get an almost perfect score and I can't figure out why. This is the output I am getting:
Confusion matrix on the test set:
[[6792  199]
 [   0  173]]
F1 score: 0.63
Test AVG precision score: 0.99
I am passing predicted probabilities to scikit-learn's average_precision_score function, which is what the documentation says to use. I was wondering where the problem could be.
1 answer

The confusion matrix and F1 score are based on a hard prediction, which in sklearn is produced by cutting predictions at a probability threshold of 0.5 (for binary classification, and assuming the classifier is really probabilistic to begin with [so not an SVM, for example]). The average precision, in contrast, is computed using all possible probability thresholds; it can be read as the area under the precision-recall curve.
So a high average_precision_score and a low f1_score suggest that your model does extremely well at some threshold that is not 0.5.
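You can see this by sweeping thresholds yourself and comparing the F1 at 0.5 against the best achievable F1. A minimal sketch on synthetic imbalanced data (the dataset and model here are illustrative, not from the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy problem (~5% positives)
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probabilities, as average_precision_score expects

ap = average_precision_score(y_te, proba)                      # threshold-free
f1_default = f1_score(y_te, (proba >= 0.5).astype(int))        # what predict() uses
f1_best = max(f1_score(y_te, (proba >= t).astype(int))
              for t in np.linspace(0.05, 0.95, 19))

print(f"AP={ap:.2f}  F1@0.5={f1_default:.2f}  best F1 over thresholds={f1_best:.2f}")
```

If the best F1 over thresholds is much higher than the F1 at 0.5, that is exactly the situation described above.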
See also questions close to this topic

Training an ML model on two different datasets before using test data?
So I have the task of using a CNN for facial recognition, i.e. classifying faces into different classes of people, each individual person being its own separate class. The training data I am given is very limited: I only have one image per class, and 100 classes (so 100 images in total, one image of each person). The approach I am using is transfer learning of the GoogLeNet architecture. However, instead of just training GoogLeNet on the images of the people I have been given, I want to first train it on a separate, larger set of different face images, so that by the time I train it on my given data, the model has already learnt the features it needs to classify faces in general. Does this make sense / will this work?
Using Matlab, I have so far changed the fully connected layer and the classification layer to train the network on the Yale Face Database, which consists of 15 classes, and achieved a 91% validation accuracy. Now I want to retrain this saved model on my provided data (100 classes with one image each). What would I have to do to this saved model to train it on the new dataset without losing the features it has learned from the Yale database? Do I just change the last fully connected and classification layer again and retrain? Will this be pointless, meaning I just lose all of the progress from before, i.e. will it start from new weights, or will it build on the previously learned weights to fit my new dataset even better? Or should I train the model on my training data and the Yale database all at once?
I have a separate set of test data provided for me, which I do not have the labels for; this is what the final model is tested on to give me my score/grade. Please help me understand if what I'm saying is viable or if it's nonsense; I'm confused, so I would appreciate being pointed in the right direction.

What's the best way to select variable in random forest model?
I am training RF models in R. What is the best way of selecting variables for my models? (The datasets are pretty big; each has around 120 variables in total.) I know that there is a cross-validation way of selecting variables for other classification algorithms such as KNN. Is there a similar approach for variable selection or parameter tuning when training RF models?

How would I put my own dataset into this code?
I have been looking at a TensorFlow tutorial for unsupervised learning, and I'd like to put in my own dataset; the code currently uses the MNIST dataset. I know how to create my own datasets in TensorFlow, but I have trouble adapting the code used here to my own. I am pretty new to TensorFlow, and the filepath to my dataset in my project is
\data\training
and \data\testval\
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0-preview is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255
X_train, X_valid = X_train_full[:5000], X_train_full[5000:]
y_train, y_valid = y_train_full[:5000], y_train_full[5000:]

def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))

tf.random.set_seed(42)
np.random.seed(42)

conv_encoder = keras.models.Sequential([
    keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
    keras.layers.Conv2D(16, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(64, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2)
])
conv_decoder = keras.models.Sequential([
    keras.layers.Conv2DTranspose(32, kernel_size=3, strides=2, padding="VALID",
                                 activation="selu", input_shape=[3, 3, 64]),
    keras.layers.Conv2DTranspose(16, kernel_size=3, strides=2, padding="SAME",
                                 activation="selu"),
    keras.layers.Conv2DTranspose(1, kernel_size=3, strides=2, padding="SAME",
                                 activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
conv_ae = keras.models.Sequential([conv_encoder, conv_decoder])

conv_ae.compile(loss="binary_crossentropy", optimizer=keras.optimizers.SGD(lr=1.0),
                metrics=[rounded_accuracy])
history = conv_ae.fit(X_train, X_train, epochs=5, validation_data=[X_valid, X_valid])
conv_encoder.summary()
conv_decoder.summary()
conv_ae.save("\models")
Do note that I got this code from another StackOverflow answer.

KeyError when processing pandas dataframe
For a pathway p_i, the CNA data of associated genes were extracted from the CNV matrix (C), producing an intermediate matrix B ∈ R^(n×r_i), where r_i is the number of genes involved in the pathway p_i. That is, the matrix B consists of samples in rows and genes for a given pathway in columns. Using principal component analysis (PCA), the matrix B was decomposed into uncorrelated components, yielding G_{p_i} ∈ R^(n×q), where q is the number of principal components (PCs).
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
import csv

def get_kegg_pathways():
    kegg_pathways = []
    with open(directory + "hsa.txt", newline="") as keggfile:
        kegg = pd.read_csv(keggfile, sep="\t")
        for row in kegg:  # for row in kegg.itertuples():
            kegg_pathways.append(row)
    return kegg_pathways

def main():
    # Pathway info
    kegg = get_kegg_pathways()

    # q : number of principal components (PCs)
    # C : CNV matrix
    # G : mRNA expression matrix
    # M : DNA methylation matrix
    q = 5
    C = []
    G = []
    M = []

    # Process common data (denoted as matrix B)
    cna_sample_index = {}
    process_common = True
    if process_common:
        for i, p in enumerate(kegg):
            genes = {}
            first = True
            for s in p:
                if first:
                    first = False
                else:
                    if s != "NA":
                        genes[s] = 1
            # Loop through each sample
            B = []
            pathways = []
            for s in ld:
                B.append([])
                pathways.append(cna_sample_index[p])
            Bi = 0
            for index, row in cna.df.itertuples():
                if row[0].upper() in genes:
                    Bi2 = Bi
                    for c in pathways:
                        B[Bi2].append(cna.df.iloc[index, c])
                        Bi2 = Bi2 + 1
            pca_cna = cna.fit()
            pca_cna.fit(B)
Traceback:
File "/home/melissachua/main.py", line 208, in <module>
    main()
File "/home/melissachua/main.py", line 165, in main
    pathways.append(cna_sample_index[p])
KeyError: 'hsa00010_Glycolysis_/_Gluconeogenesis'
kegg table:

                                       0    1
0  hsa00010_Glycolysis_/_Gluconeogenesis  NaN
1  hsa00020_Citrate_cycle_(TCA_cycle)     NaN
2  hsa00030_Pentose_phosphate_pathway     NaN

cna table:

  Hugo_Symbol  TCGA02000101  TCGA02000102  TCGA02000103
0       0.001         0.002         0.003         0.004
1       0.005         0.006         0.007         0.008
Is there a way to use mutual information as part of a pipeline in scikit learn?
I'm creating a model with scikit-learn. The pipeline that seems to work best is:
- mutual_info_classif with a threshold
- PCA
- LogisticRegression
I'd like to do all of them using sklearn's Pipeline object, but I'm not sure how to get the mutual information classification in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline([
    ('dim_red', pca),
    ('pred', lr)
])
But I don't see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
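To illustrate the kind of thing I'm after, here is a sketch assuming scikit-learn's SelectPercentile can stand in for the thresholding step by wrapping mutual_info_classif (the percentile value and the synthetic data are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder data just to make the pipeline runnable
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    # keep the top 50% of features ranked by mutual information (placeholder value)
    ('mi', SelectPercentile(score_func=mutual_info_classif, percentile=50)),
    ('dim_red', PCA(random_state=100)),
    ('pred', LogisticRegression(random_state=200)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```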

Slicing an image into color based layers with Sklearn
I have an image. I applied KMeans color clustering to this image, and now I need to show only the purple clusters in one image and the orange clusters in a different image. How do I do that?
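The masking step I have in mind, sketched on a tiny synthetic image (the colors, image size, and cluster count are placeholders, not my actual data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny synthetic "image": 4x4 pixels, RGB in [0, 255]; two distinct colors
img = np.array([[[200, 50, 200]] * 2 + [[250, 150, 30]] * 2] * 4, dtype=np.uint8)
h, w, _ = img.shape

# Cluster pixels by color
pixels = img.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
labels = km.labels_.reshape(h, w)

# One output image per cluster: keep that cluster's pixels, black out the rest
layers = []
for k in range(km.n_clusters):
    layer = np.zeros_like(img)
    mask = labels == k
    layer[mask] = img[mask]
    layers.append(layer)
```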

Is nDCG a precision-oriented measurement? Why?
Is nDCG a precision-oriented measurement? Why?

Can class recall be considered as class accuracy?
I find the notion of accuracy-per-class very interesting, but I don't find any known metric for it in the literature.
I was thinking: in a multiclass classification problem, can we consider the recall of a given class to represent the accuracy of our model regarding this class? Basically, the recall of a given class represents the proportion of items of that class that were classified correctly, right? And this is what accuracy is about. If not, is there a metric for accuracy-per-class that I am missing?
PS: By class recall I am referring to the recall computed by sklearn's classification_report for each label of the multiclass classification problem: recall = TP / (TP + FN).
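To make the equivalence I have in mind concrete: per-class recall (as reported by classification_report, or recall_score with average=None) equals plain accuracy computed only on the samples whose true label is that class. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import recall_score

# Made-up multiclass labels for illustration
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 1, 2])

per_class_recall = recall_score(y_true, y_pred, average=None)

# Recall of class k == accuracy restricted to samples whose true label is k
for k in (0, 1, 2):
    mask = y_true == k
    acc_on_class = (y_pred[mask] == y_true[mask]).mean()
    assert np.isclose(per_class_recall[k], acc_on_class)
```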

How do I apply RandomUnderSampling and OverSampling in StratifiedKFold cross-validation?
I am currently working on a classification task where I have to predict customer default using a dataset provided by LendingClub. For my first model I decided to test logistic regression using SGD.
I created this initial pipeline:
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
model = SGDClassifier(loss='log', random_state=42, n_jobs=1, warm_start=True)
pipeline_sgdlogreg = make_pipeline(imputer, scaler, model)
Defined my strategy:
KF = StratifiedKFold(n_splits = 5)
And performed GridSearchCV:
grid_sgdlogreg = GridSearchCV(pipeline_sgdlogreg, param_grid_sgdlogreg, scoring='roc_auc',
                              pre_dispatch=3, n_jobs=1, cv=KF, verbose=5)
search = grid_sgdlogreg.fit(X_train, y_train)
Due to class imbalance, the model is severely lacking in both recall and precision, which does make sense.
I wanted to test out different sampling strategies. Consider this undersampling approach. I made new subsamples using only the training data:
X_train_subsample, y_train_subsample = rus.fit_resample(X_train, y_train)
Pipeline that includes randomundersampler:
pipeline_sgdlogreg_rus = Pipeline([
    ("Rus", RandomUnderSampler(sampling_strategy="majority", random_state=42)),
    ('imputer', SimpleImputer(strategy="median")),
    ('scaler', StandardScaler()),
    ('model', SGDClassifier(loss='log', random_state=42, n_jobs=1, warm_start=True))
])
Performed GridSearchCV again:
grid_sgdlogreg = GridSearchCV(pipeline_sgdlogreg_rus, param_grid_sgdlogreg_rus, scoring='roc_auc',
                              pre_dispatch=3, n_jobs=1, cv=KF, verbose=5)
search = grid_sgdlogreg.fit(X_train_subsample, y_train_subsample)
What I would like to know is: am I doing this correctly?
I have already dealt with outliers and label encoding before the split, and I want to make sure that this is done for every fold.
Do I need to split the data again, or does using RandomUnderSampler() in the pipeline do that automatically for every fold?
Thank You!

Deep Learning Image Detection: help needed deciphering machine learning loss and accuracy graphs and finding solutions to fix my model
I have an imbalanced dataset from Google OpenImages with 6 classes:
Train (starfish=439; Dolphin = 890; Turtle = 1362; Fish = 6216; Jellyfish = 733; Shellfish = 1141 )
Validation (starfish=20; Dolphin = 61; Turtle = 6; Fish = 370; Jellyfish = 38; Shellfish = 42 )
Test (starfish=72; Dolphin = 139; Turtle = 12; Fish = 1028; Jellyfish = 105; Shellfish = 115 )
I have 4 different PyVision pretrained models that I am using. I am using data augmentation and have tried adjusting the weights manually to compensate for the imbalance. I finally got my models to run on Google Colab for 100 epochs with batch size 8.
optimizer = torch.optim.SGD(params, lr=0.05, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
These are my graphs from the training of the pretrained model on my image set.
torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn
My Interpretation:
 Loss: not training and not learning
 Accuracies: Underrepresented
torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn
My Interpretation:
 Loss: not training and not learning
 Accuracies: Underrepresented
torchvision.models.detection.fasterrcnn_resnet50_fpn
TBD - assume similar graph - model still running
torchvision.models.detection.FasterRCNN
TBD - assume similar graph - model still running
QUESTIONS:
Are my interpretations correct? What more does this say and what other graph would you recommend that would tell me what to do to fix this?
What would you do to fix this?
Would you create extra images to pad the uneven classes? i.e. for the Train dataset, make all the classes equal the same amount as the largest class. Then run it again?
I can't think of anything else. Using pretrained models is supposed to make things better and help overcome imbalance, as does data augmentation and assigning weights to the imbalanced classes.
I used weight = [9.36, 1.48, 12.87, 13.98, 21.18, 86.83] for classes = ['Jellyfish', 'Fish', 'Dolphin', 'Starfish', 'Shellfish', 'Turtle']. Did I do that wrong?
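For reference, one common recipe is inverse-frequency weighting. A sketch computed from the Train counts listed above (this is my own recipe for comparison; it is not necessarily how the quoted weights were derived):

```python
# Inverse-frequency class weights from the Train counts quoted above
counts = {'starfish': 439, 'Dolphin': 890, 'Turtle': 1362,
          'Fish': 6216, 'Jellyfish': 733, 'Shellfish': 1141}

total = sum(counts.values())
n_classes = len(counts)

# weight_c = total / (n_classes * count_c): rarer classes get larger weights
weights = {c: total / (n_classes * n) for c, n in counts.items()}
for c, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{c:10s} {w:.2f}")
```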