Found input variables with inconsistent numbers of samples error: [199364, 85443]
I am facing the above error when I try to fit my random forest classifier model. Below is the code:
from sklearn.model_selection import train_test_split
X = df.drop(['Class'], axis=1)
y = df['Class']
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
model = rfc.fit(X_train, y_train)
but I keep getting the error below:
ValueError: Found input variables with inconsistent numbers of samples: [199364, 85443]
What could I be doing wrong? X.shape and y.shape seem to be fine, both with the same number of samples, as seen below:
X.shape
(284807, 30)
y.shape
(284807,)
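(Observation for anyone reading: the code above imports train_test_split but never shows a call to it, yet fit uses X_train and y_train. The two counts in the error, 199364 and 85443, are exactly the 70%/30% splits of 284807 rows, which usually means a train-sized X was paired with a test-sized y, e.g. from unpacking the split in the wrong order. A minimal sketch of the usual pattern, using random stand-in data instead of the asker's df:)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical stand-ins for df.drop(['Class'], axis=1) and df['Class'].
X = np.random.rand(1000, 30)
y = np.random.randint(0, 2, 1000)

# train_test_split returns X_train, X_test, y_train, y_test IN THAT ORDER;
# swapping y_train and y_test is a common cause of this ValueError.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rfc = RandomForestClassifier()
model = rfc.fit(X_train, y_train)  # X_train and y_train now have matching lengths
```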
See also questions close to this topic

Training an ML model on two different datasets before using test data?
So I have the task of using a CNN for facial recognition: classifying faces into different classes of people, each person being their own class. The training data I am given is very limited; I only have one image per class, and 100 classes (so 100 images in total, one image of each person). The approach I am using is transfer learning on the GoogLeNet architecture.

However, instead of training GoogLeNet only on the images of the people I have been given, I want to first train it on a separate, larger set of face images, so that by the time I train it on my given data, the model has already learned the features it needs to classify faces in general. Does this make sense, and will it work?

Using MATLAB, I have so far replaced the fully connected layer and the classification layer and trained the network on the Yale Face Database, which consists of 15 classes; I achieved 91% validation accuracy. Now I want to retrain this saved model on my provided data (100 classes with one image each). What would I have to do to the saved model to train it on this new dataset without losing the features it has learned from the Yale database? Do I just swap the last fully connected and classification layers again and retrain? Would this be pointless, i.e. would it create new weights from scratch, or would it use the previously learned weights to train even better on my new dataset? Or should I train the model on my training data and the Yale database all at once?

I have a separate set of test data, without labels, which is used to test the final model and give me my score/grade. Please help me understand whether what I'm proposing is viable or nonsense; I'm confused, so I would appreciate being pointed in the right direction.

What's the best way to select variable in random forest model?
I am training RF models in R. What is the best way to select variables for my models? The datasets are pretty big, each with around 120 variables in total. I know there is a cross-validation way of selecting variables for other classification algorithms such as KNN. Is there a similar approach for variable selection or parameter tuning when training an RF model?
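(A note on the general idea: cross-validated variable selection with a random forest is often done via recursive feature elimination, dropping the lowest-importance features each round and scoring by cross-validation. A minimal sketch in scikit-learn terms, on synthetic data; the sizes, step, and fold count are arbitrary illustrative choices, not a recommendation:)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for a wide dataset.
X, y = make_classification(n_samples=200, n_features=25, n_informative=5, random_state=0)

# Recursive feature elimination with cross-validation: at each step the
# forest's feature importances rank the variables and the worst 5 are dropped.
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=5, cv=3).fit(X, y)
kept = selector.n_features_  # number of variables the CV search retained
```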

How would I put my own dataset into this code?
I have been looking at a TensorFlow tutorial for unsupervised learning, and I'd like to feed in my own dataset; the code currently uses the MNIST dataset. I know how to create my own datasets in TensorFlow, but I have trouble adapting the code used here to my own data. I am pretty new to TensorFlow, and the file paths to my dataset in my project are
\data\training
and\data\testval\
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
# TensorFlow ≥2.0-preview is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"
# Common imports
import numpy as np
import os

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255
X_train, X_valid = X_train_full[:5000], X_train_full[5000:]
y_train, y_valid = y_train_full[:5000], y_train_full[5000:]

def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))

tf.random.set_seed(42)
np.random.seed(42)

conv_encoder = keras.models.Sequential([
    keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
    keras.layers.Conv2D(16, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(64, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2)
])
conv_decoder = keras.models.Sequential([
    keras.layers.Conv2DTranspose(32, kernel_size=3, strides=2, padding="VALID",
                                 activation="selu", input_shape=[3, 3, 64]),
    keras.layers.Conv2DTranspose(16, kernel_size=3, strides=2, padding="SAME", activation="selu"),
    keras.layers.Conv2DTranspose(1, kernel_size=3, strides=2, padding="SAME", activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
conv_ae = keras.models.Sequential([conv_encoder, conv_decoder])
conv_ae.compile(loss="binary_crossentropy", optimizer=keras.optimizers.SGD(lr=1.0),
                metrics=[rounded_accuracy])
history = conv_ae.fit(X_train, X_train, epochs=5, validation_data=[X_valid, X_valid])
conv_encoder.summary()
conv_decoder.summary()
conv_ae.save("\models")
Do note that I got this code from another StackOverflow answer.
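(General pointer: the only dataset-specific part of that code is the fashion_mnist.load_data() block. To use your own images, you read them into arrays yourself (e.g. with PIL, or a Keras directory loader) and then apply the same normalization and train/validation split. A minimal sketch of that swap, using random arrays as hypothetical stand-ins for images read from the training folder, with made-up counts and a 28x28 size to match the model:)

```python
import numpy as np

# Hypothetical stand-ins: pretend these arrays were built by reading the
# image files under the project's training and testval folders.
X_train_full = np.random.randint(0, 256, size=(600, 28, 28)).astype(np.float32)
X_test = np.random.randint(0, 256, size=(100, 28, 28)).astype(np.float32)

# Same preprocessing the tutorial applies to Fashion-MNIST:
X_train_full /= 255
X_test /= 255

# Hold out the last 100 images for validation (sizes are illustrative).
X_train, X_valid = X_train_full[:-100], X_train_full[-100:]
```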

Keyerror when processing pandas dataframe
For a pathway p_i, the CNA data of associated genes were extracted from the CNV matrix (C), producing an intermediate matrix B ∈ R^(n×r_i), where r_i is the number of genes involved in the pathway p_i. That is, the matrix B consists of samples in rows and genes for a given pathway in columns. Using principal component analysis (PCA), the matrix B was decomposed into uncorrelated components, yielding G_pi ∈ R^(n×q), where q is the number of principal components (PCs).
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
import csv

def get_kegg_pathways():
    kegg_pathways = []
    with open(directory + "hsa.txt", newline="") as keggfile:
        kegg = pd.read_csv(keggfile, sep="\t")
        for row in kegg:  # for row in kegg.itertuples():
            kegg_pathways.append(row)
    return kegg_pathways

def main():
    # Pathway info
    kegg = get_kegg_pathways()
    # q : number of principal components (PCs)
    # C : CNV matrix
    # G : mRNA expression matrix
    # M : DNA methylation matrix
    q = 5
    C = []
    G = []
    M = []
    # Process common data (denoted as matrix B)
    cna_sample_index = {}
    process_common = True
    if process_common:
        for i, p in enumerate(kegg):
            genes = {}
            first = True
            for s in p:
                if first:
                    first = False
                else:
                    if s != "NA":
                        genes[s] = 1
            # Loop through each sample
            B = []
            pathways = []
            for s in ld:
                B.append([])
                pathways.append(cna_sample_index[p])
            Bi = 0
            for index, row in cna.df.itertuples():
                if row[0].upper() in genes:
                    Bi2 = Bi
                    for c in pathways:
                        B[Bi2].append(cna.df.iloc[index, c])
                        Bi2 = Bi2 + 1
            pca_cna = cna.fit()
            pca_cna.fit(B)
Traceback:
File "/home/melissachua/main.py", line 208, in <module>
    main()
File "/home/melissachua/main.py", line 165, in main
    pathways.append(cna_sample_index[p])
KeyError: 'hsa00010_Glycolysis_/_Gluconeogenesis'
kegg table:

                                       0    1
0  hsa00010_Glycolysis_/_Gluconeogenesis  NaN
1     hsa00020_Citrate_cycle_(TCA_cycle)  NaN
2     hsa00030_Pentose_phosphate_pathway  NaN

cna table:

   Hugo_Symbol  TCGA02000101  TCGA02000102  TCGA02000103
0        0.001         0.002         0.003         0.004
1        0.005         0.006         0.007         0.008
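(Two things stand out in the code above: cna_sample_index is initialized as an empty dict and never populated, so any cna_sample_index[p] lookup will raise KeyError; and iterating a DataFrame directly with `for row in kegg` yields column labels, not rows, which the commented-out itertuples line hints at. A minimal sketch of the iteration difference, on a tiny made-up frame:)

```python
import pandas as pd

# Tiny hypothetical frame standing in for the kegg table.
df = pd.DataFrame({"pathway": ["hsa00010", "hsa00020"], "genes": ["A", "B"]})

# Iterating a DataFrame directly yields the COLUMN labels, not the rows:
cols = [x for x in df]  # ['pathway', 'genes']

# To iterate over rows, use itertuples (or iterrows):
rows = [row.pathway for row in df.itertuples(index=False)]  # ['hsa00010', 'hsa00020']
```

For the KeyError itself, either populate cna_sample_index before the lookup or use cna_sample_index.get(p) and handle the missing-key case.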
Is there a way to use mutual information as part of a pipeline in scikit-learn?
I'm creating a model with scikit-learn. The pipeline that seems to work best is:
- mutual_info_classif with a threshold
- PCA
- LogisticRegression
I'd like to do them all using sklearn's pipeline object, but I'm not sure how to get the mutual info classification in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline([
    ('dim_red', pca),
    ('pred', lr)
])
But I don't see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
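(One common way to do this: mutual_info_classif is a score function, so it slots into a selector transformer such as SelectKBest or SelectPercentile, which can then be the first pipeline step. A sketch on synthetic data; k=10 and the PCA/logistic settings are arbitrary illustrative choices:)

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

pipe = Pipeline([
    # Keep the top-10 features ranked by mutual information with y.
    ('mi', SelectKBest(score_func=mutual_info_classif, k=10)),
    ('dim_red', PCA(n_components=5, random_state=100)),
    ('pred', LogisticRegression(random_state=200)),
])
pipe.fit(X, y)
acc = pipe.score(X, y)
```

SelectPercentile(mutual_info_classif, percentile=...) works the same way if a proportion-style threshold is closer to what you want than a fixed k.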

Slicing an image into color based layers with Sklearn
I have an image. I applied KMeans color clustering to this image, and now I need to present only the purple clusters in one image and the orange clusters in another. How do I do that?
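(The usual pattern: cluster the flattened pixels, reshape the labels back to the image grid, then build one image per cluster by masking. A minimal sketch on a tiny synthetic two-color image standing in for the real one; the colors and 8x8 size are made up for illustration:)

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 8x8 RGB image: left half purple-ish, right half orange-ish.
img = np.zeros((8, 8, 3), dtype=np.float64)
img[:, :4] = [128, 0, 128]   # purple
img[:, 4:] = [255, 165, 0]   # orange

# Cluster the pixels by color.
pixels = img.reshape(-1, 3)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
labels = km.labels_.reshape(img.shape[:2])

# One image per cluster: keep that cluster's pixels, black out the rest.
layers = []
for k in range(km.n_clusters):
    layer = np.zeros_like(img)
    layer[labels == k] = img[labels == k]
    layers.append(layer)
```

With a real photo you would identify which cluster center is "purple" and which is "orange" by inspecting km.cluster_centers_, then save the corresponding layers.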

Log-likelihood for random forest models
I'm trying to compare multiple species distribution modeling approaches via k-fold cross-validation. Currently I'm calculating the RMSE and AUC to compare model performance. A friend suggested additionally using the sum of log-likelihoods as a metric to compare models. However, one of the models is a random forest fitted with the ranger package. If it is actually possible, how would I calculate the log-likelihood for a random forest model, and would it actually be a comparable metric to use with the other models (GAM, GLM)?
Thanks for your help.
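(Conceptual note: if the forest outputs class probabilities, a Bernoulli log-likelihood can be computed from them exactly as for a GLM, which is what metrics like log-loss do (log-loss is the negative average log-likelihood). Whether that makes forests fairly comparable to likelihood-based models is debatable, since a forest is not fit by maximizing this likelihood. A minimal sketch of the arithmetic with made-up predicted probabilities:)

```python
import numpy as np

# Hypothetical predicted P(y=1) from a model, and the true 0/1 labels.
p = np.array([0.9, 0.2, 0.7, 0.6])
y = np.array([1, 0, 1, 1])

# Clip to avoid log(0) when e.g. every tree votes the same way.
eps = 1e-15
p = np.clip(p, eps, 1 - eps)

# Sum of Bernoulli log-likelihoods over the observations.
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```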

Using random forest classification, tuning the model on a validation set instead of cross-validation
I separate my dataset into three sets: a train set, a validation set, and a test set. I want to use the random forest method to train the data. But to find the best ntree, mtry, and nnodes values, I want to use the validation set to see which parameters are best, and then use those parameters on my training set. I do not want to use the caret package since it uses cross-validation. I am dealing with a classification problem.
a = as.numeric(2:15)
for (i in 2:15) {
  model2 = randomForest(as.factor(V2) ~ ., data = vset, ntree = 500, mtry = i, importance = TRUE)
  predValid2 = predict(model2, newdata = test, type = "class")
  a[i - 1] = mean(predValid2 == test$V2)
}
n.tree = seq(from = 100, to = 5000, by = 100)
n.mtry = seq(from = 1, to = 15, by = 1)
model3 = randomForest(as.factor(V2) ~ ., data = vset, ntree = n.tree, mtry = n.mtry, importance = TRUE)
I use the code above to write a loop, but I believe it is not correct. I'd appreciate your help finding the best parameters based on the validation set rather than cross-validation.
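(General shape of a validation-set search, independent of language: fit each candidate on the TRAINING set, score it on the validation set, keep the best, and only touch the test set once at the end. Note the R code above fits on vset and scores on test, which mixes those roles. A scikit-learn analogue as a sketch, with synthetic data and an arbitrary small grid; max_features plays the role of mtry:)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best = (None, -1.0)
for n_trees in (100, 300):          # ntree candidates
    for mtry in (2, 4):             # mtry candidates
        rf = RandomForestClassifier(n_estimators=n_trees, max_features=mtry,
                                    random_state=0).fit(X_tr, y_tr)
        acc = rf.score(X_val, y_val)  # validation accuracy, no cross-validation
        if acc > best[1]:
            best = ((n_trees, mtry), acc)
# best[0] now holds the (ntree, mtry) pair to refit on the full training data.
```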