The same RandomForestClassifier returns different results
I have a case where I need to train the same model on two different machines and make sure that the output is exactly the same. The environments are identical, so I wrote some unit tests that compare the predictions on the same dataset. Out of all the predictions, only a few differ, by 0.5-1.5 percentage points.
To verify that the two models are the same, I compared them like this:
model_1 = joblib.load("path for model 1")
model_2 = joblib.load("path for model 2")
When I print both models, I get the same set of parameters:
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200,
n_jobs=None, oob_score=False, random_state=10000,
verbose=0, warm_start=False)
And when I compare their feature importances, I get the same output:
>> np.array_equal(model_1.feature_importances_, model_2.feature_importances_)
>> True
Is there any other way to find out why some of the predictions differ slightly between the two models?
The random_state of each estimator is also the same:
>> [i.random_state for i in model_1.estimators_] == [j.random_state for j in model_2.estimators_]
>> True
See also questions close to this topic
-
Python File Tagging System does not retrieve nested dictionaries in dictionary
I am building a file tagging system using Python. The idea is simple: given a directory of files (including files within subdirectories), I want to filter them using a filter input and tag the matching files with a word or a phrase.
If I have the following contents in my current directory:
data/
    budget.xls
    world_building_budget.txt
a.txt
b.exe
hello_world.dat
world_builder.spec
and I execute the following command in the shell:
py -3 tag_tool.py -filter=world -tag="World-Building Tool"
My output will be:
These files were tagged with "World-Building Tool":

data/world_building_budget.txt
hello_world.dat
world_builder.spec
My current output isn't exactly like this but basically, I am converting all files and files within subdirectories into a single dictionary like this:
def fs_tree_to_dict(path_):
    file_token = ''
    for root, dirs, files in os.walk(path_):
        tree = {d: fs_tree_to_dict(os.path.join(root, d)) for d in dirs}
        tree.update({f: file_token for f in files})
        return tree
Right now, my dictionary looks like this:

key: ''

In the following function, I am turning the empty values '' into empty lists (to hold my tags):

def empty_str_to_list(d):
    for k, v in d.items():
        if v == '':
            d[k] = []
        elif isinstance(v, dict):
            empty_str_to_list(v)
When I run my entire code, this is my output:

hello_world.dat ['World-Building Tool']
world_builder.spec ['World-Building Tool']

But it does not see data/world_building_budget.txt. This is the full dictionary:

{'data': {'world_building_budget.txt': []}, 'a.txt': [], 'hello_world.dat': [], 'b.exe': [], 'world_builder.spec': []}
This is my full code:
import os, argparse

def fs_tree_to_dict(path_):
    file_token = ''
    for root, dirs, files in os.walk(path_):
        tree = {d: fs_tree_to_dict(os.path.join(root, d)) for d in dirs}
        tree.update({f: file_token for f in files})
        return tree

def empty_str_to_list(d):
    for k, v in d.items():
        if v == '':
            d[k] = []
        elif isinstance(v, dict):
            empty_str_to_list(v)

parser = argparse.ArgumentParser(description="Just an example",
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--filter", action="store", help="keyword to filter files")
parser.add_argument("--tag", action="store", help="a tag phrase to attach to a file")
parser.add_argument("--get_tagged", action="store", help="retrieve files matching an existing tag")
args = parser.parse_args()

filter = args.filter
tag = args.tag
get_tagged = args.get_tagged

current_dir = os.getcwd()
files_dict = fs_tree_to_dict(current_dir)
empty_str_to_list(files_dict)

for k, v in files_dict.items():
    if filter in k:
        if v == []:
            v.append(tag)
        print(k, v)
    elif isinstance(v, dict):
        empty_str_to_list(v)
        if get_tagged in v:
            print(k, v)
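It looks like the final loop only tags top-level entries and never descends into the nested dictionaries, which would explain why data/world_building_budget.txt is missed. A hedged sketch of a recursive variant (the helper name tag_files is hypothetical):

import os

def tag_files(d, filter_, tag_, prefix=""):
    # walk the tag dictionary recursively so files inside sub-dictionaries
    # (i.e. subdirectories) are matched as well
    for k, v in d.items():
        path = os.path.join(prefix, k)
        if isinstance(v, dict):
            tag_files(v, filter_, tag_, path)
        elif filter_ in k:
            if tag_ not in v:
                v.append(tag_)
            print(path, v)

# usage: tag_files(files_dict, filter, tag)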
-
Actually, I am working on a project and it is showing "no module named pip_internal". Please help me with this.
File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code exec(code, run_globals) File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\Scripts\pip.exe\__main__.py", line 4, in <module> File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\site-packages\pip\_internal\__init__.py", line 4, in <module> from pip_internal.utils import _log
I am using PyCharm with the conda interpreter.
-
Looping the function if the input is not a string
I'm new to Python (first of all). I have homework to write a function that checks whether an item exists in a dictionary or not.
inventory = {"apple" : 50, "orange" : 50, "pineapple" : 70, "strawberry" : 30}

def check_item():
    x = input("Enter the fruit's name: ")
    if not x.isalpha():
        print("Error! You need to type the name of the fruit")
    elif x in inventory:
        print("Fruit found:", x)
        print("Inventory available:", inventory[x], "KG")
    else:
        print("Fruit not found")

check_item()
I want the function to loop again only if the input is not a string. I've tried adding return under print("Error! You need to type the name of the fruit"), but it didn't work.
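A minimal sketch of one way to do this, assuming the intent is to re-prompt only while the input is not purely alphabetic:

inventory = {"apple": 50, "orange": 50, "pineapple": 70, "strawberry": 30}

def check_item():
    while True:
        x = input("Enter the fruit's name: ")
        if not x.isalpha():
            # not a valid name: print the error and ask again
            print("Error! You need to type the name of the fruit")
            continue
        if x in inventory:
            print("Fruit found:", x)
            print("Inventory available:", inventory[x], "KG")
        else:
            print("Fruit not found")
        break

check_item()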
-
Training an ML model on two different datasets before using test data?
I have the task of using a CNN for facial recognition: classifying faces into different classes, with each individual person being their own separate class. The training data I am given is very limited - I only have one image for each class. I have 100 classes (so 100 images in total, one image of each person). The approach I am using is transfer learning of the GoogLeNet architecture.

However, instead of just training GoogLeNet on the images of the people I have been given, I want to first train it on a separate, larger set of different face images, so that by the time I train it on the data I have been given, the model has already learnt the features it needs to classify faces in general. Does this make sense / will this work?

Using MATLAB, I have so far changed the fully connected layer and the classification layer to train on the Yale Face Database, which consists of 15 classes. I achieved 91% validation accuracy with this database. Now I want to retrain this saved model on my provided data (100 classes with one image each).

What would I have to do to this saved model to train it on the new dataset without losing the features it has learned from the Yale database? Do I just change the last fully connected and classification layer again and retrain? Will this be pointless, meaning I lose all of the progress from before, i.e. will it create new weights from scratch, or will it use the previously learned weights to train even better on my new dataset? Or should I train the model on my training data and the Yale database all at once?

I have a separate set of test data provided for me, which I do not have the labels for, and this is what the final model is tested on to give me my score/grade. Please help me understand whether what I'm proposing is viable or whether it's nonsense; I'm confused, so I would appreciate being pointed in the right direction.
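As a rough illustration of the idea (in Keras rather than MATLAB, with the file name purely hypothetical): keep the network trained on the Yale database, replace only the classification head with one sized for the 100 new classes, and optionally freeze the earlier layers so their learned weights are reused instead of re-initialised.

from tensorflow import keras

# hypothetical: the model previously fine-tuned on the Yale Face database
base = keras.models.load_model("yale_finetuned.h5")

# keep everything up to (but not including) the old 15-class output layer
features = keras.Model(base.input, base.layers[-2].output)  # assumes the last layer is the old head
features.trainable = False  # optional: freeze the learned feature extractor

new_model = keras.Sequential([
    features,
    keras.layers.Dense(100, activation="softmax"),  # new head for the 100 classes
])
new_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# new_model.fit(train_images, train_labels, epochs=...)  # the 100-image training set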
-
What's the best way to select variables in a random forest model?
I am training RF models in R. What is the best way of selecting variables for my models? The datasets are pretty big; each has around 120 variables in total. I know there is a cross-validation way of selecting variables for other classification algorithms such as KNN. Is there a similar approach for variable selection or parameter tuning when training RF models?
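The question is about R, but as a hedged illustration of one common approach (in Python/scikit-learn, with toy data standing in for the real ~120-variable dataset): fit a forest once and keep only the variables whose importance clears a threshold.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# toy stand-in for a dataset with ~120 variables
X, y = make_classification(n_samples=500, n_features=120, n_informative=15,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(rf, threshold="median")  # keep features above median importance
selector.fit(X, y)

X_reduced = selector.transform(X)
print(X_reduced.shape)  # roughly half of the 120 variables retained

A cross-validated alternative along the same lines is recursive feature elimination (RFECV) with the forest as the estimator.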
-
How would I put my own dataset into this code?
I have been looking at a Tensorflow tutorial for unsupervised learning, and I'd like to put in my own dataset; the code currently uses the MNIST dataset. I know how to create my own datasets in Tensorflow, but I have trouble setting the code used here to my own. I am pretty new to Tensorflow, and the filepath to my dataset in my project is
\data\training
and \data\test-val\
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0-preview is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))

tf.random.set_seed(42)
np.random.seed(42)

conv_encoder = keras.models.Sequential([
    keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
    keras.layers.Conv2D(16, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(64, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2)
])
conv_decoder = keras.models.Sequential([
    keras.layers.Conv2DTranspose(32, kernel_size=3, strides=2, padding="VALID", activation="selu",
                                 input_shape=[3, 3, 64]),
    keras.layers.Conv2DTranspose(16, kernel_size=3, strides=2, padding="SAME", activation="selu"),
    keras.layers.Conv2DTranspose(1, kernel_size=3, strides=2, padding="SAME", activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
conv_ae = keras.models.Sequential([conv_encoder, conv_decoder])

conv_ae.compile(loss="binary_crossentropy", optimizer=keras.optimizers.SGD(lr=1.0),
                metrics=[rounded_accuracy])
history = conv_ae.fit(X_train, X_train, epochs=5,
                      validation_data=[X_valid, X_valid])

conv_encoder.summary()
conv_decoder.summary()

conv_ae.save("\models")
Do note that I got this code from another StackOverflow answer.
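A hedged sketch of how the tutorial's Fashion-MNIST arrays could be replaced with images from the folders above, assuming they hold ordinary image files (on older TensorFlow versions the loader lives under keras.preprocessing instead of keras.utils):

import numpy as np
from tensorflow import keras

def load_images(path, size=(28, 28)):
    # load every image as 28x28 greyscale, collect into one array scaled to [0, 1]
    ds = keras.utils.image_dataset_from_directory(
        path, labels=None, color_mode="grayscale",
        image_size=size, shuffle=False)
    batches = list(ds.as_numpy_iterator())
    return np.concatenate(batches).squeeze(-1) / 255.0

X_train = load_images(r"\data\training")
X_valid = load_images(r"\data\test-val")

# then, exactly as in the tutorial code:
# history = conv_ae.fit(X_train, X_train, epochs=5,
#                       validation_data=(X_valid, X_valid))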
-
Keyerror when processing pandas dataframe
For a pathway p_i, the CNA data of associated genes were extracted from the CNV matrix C, producing an intermediate matrix B ∈ R^(n×r_i), where r_i is the number of genes involved in the pathway p_i. That is, the matrix B consists of samples in rows and the genes of a given pathway in columns. Using principal component analysis (PCA), the matrix B was decomposed into uncorrelated components, yielding G_{p_i} ∈ R^(n×q), where q is the number of principal components (PCs).
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
import csv

def get_kegg_pathways():
    kegg_pathways = []
    with open(directory + "hsa.txt", newline="") as keggfile:
        kegg = pd.read_csv(keggfile, sep="\t")
        for row in kegg:
        #for row in kegg.itertuples():
            kegg_pathways.append(row)
    return kegg_pathways

def main():
    # Pathway info
    kegg = get_kegg_pathways()

    # q : Number of Principal Components (PCs)
    # C : CNV matrix
    # G = mRNA expression matrix
    # M : DNA methylation matrix
    q = 5
    C = []
    G = []
    M = []

    # Process common data (denoted as matrix B)
    cna_sample_index = {}
    process_common = True

    if process_common:
        for i, p in enumerate(kegg):
            genes = {}
            first = True
            for s in p:
                if first:
                    first = False
                else:
                    if s != "NA":
                        genes[s] = 1

            # Loop through each sample
            B = []
            pathways = []
            for s in ld:
                B.append([])
                pathways.append(cna_sample_index[p])

            Bi = 0
            for index, row in cna.df.itertuples():
                if row[0].upper() in genes:
                    Bi2 = Bi
                    for c in pathways:
                        B[Bi2].append(cna.df.iloc[index, c])
                        Bi2 = Bi2 + 1

            pca_cna = cna.fit()
            pca_cna.fit(B)
Traceback:
File "/home/melissachua/main.py", line 208, in <module> main() File "/home/melissachua/main.py", line 165, in main pathways.append(cna_sample_index[p]) KeyError: 'hsa00010_Glycolysis_/_Gluconeogenesis'
The kegg table:

                                       0    1
0  hsa00010_Glycolysis_/_Gluconeogenesis  NaN
1     hsa00020_Citrate_cycle_(TCA_cycle)  NaN
2     hsa00030_Pentose_phosphate_pathway  NaN

The cna table:

   Hugo_Symbol  TCGA-02-0001-01  TCGA-02-0001-02  TCGA-02-0001-03
0        0.001            0.002            0.003            0.004
1        0.005            0.006            0.007            0.008
-
Is there a way to use mutual information as part of a pipeline in scikit learn?
I'm creating a model with scikit-learn. The pipeline that seems to be working best is:
- mutual_info_classif with a threshold
- PCA
- LogisticRegression
I'd like to do them all using sklearn's pipeline object, but I'm not sure how to get the mutual info classification in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)
But I don't see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
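One sklearn-native way, sketched under the assumption that a percentile cut-off is an acceptable stand-in for the threshold: wrap mutual_info_classif in SelectPercentile (or SelectKBest) so the filtering becomes an ordinary pipeline step.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

mi_filter = SelectPercentile(score_func=mutual_info_classif, percentile=50)  # keep top 50% of features
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('mi_filter', mi_filter),
        ('dim_red', pca),
        ('pred', lr)
    ]
)
# pipe.fit(X_train, y_train)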
-
Slicing an image into color based layers with Sklearn
I have an image. I applied KMeans color clustering to this image, and now I need to show only the purple clusters in one image and the orange clusters in a different image. How can I do that?
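A hedged sketch of the usual recipe (assuming OpenCV is available and the purple/orange cluster ids are identified by inspecting kmeans.cluster_centers_): cluster the pixel colours, then mask the image once per cluster of interest.

import numpy as np
import cv2
from sklearn.cluster import KMeans

img = cv2.imread("image.jpg")                   # hypothetical file name
pixels = img.reshape(-1, 3).astype(np.float32)  # one row per pixel (B, G, R)

kmeans = KMeans(n_clusters=5, random_state=0).fit(pixels)
labels = kmeans.labels_.reshape(img.shape[:2])

purple_id, orange_id = 0, 1                     # assumption: check kmeans.cluster_centers_
purple_only = np.where(labels[..., None] == purple_id, img, 0)
orange_only = np.where(labels[..., None] == orange_id, img, 0)

cv2.imwrite("purple_clusters.png", purple_only)
cv2.imwrite("orange_clusters.png", orange_only)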
-
Log-Likelihood for Random Forest models
I'm trying to compare multiple species distribution modelling approaches via k-fold cross-validation. Currently I'm calculating the RMSE and AUC to compare model performance. A friend suggested additionally using the sum of log-likelihoods as a metric to compare models. However, one of the models is a random forest fitted with the ranger package. If it is actually possible, how would I calculate the log-likelihood for a random forest model, and would it actually be a comparable metric to use against other models (GAM, GLM)?
Thanks for your help.
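For what it's worth, if the ranger forest is fitted as a probability forest (probability = TRUE), the sum of log-likelihoods is just the sum of the log predicted probabilities of the observed classes. A hedged sketch of that calculation (in Python rather than R, with the observed classes already encoded as column indices into the probability matrix):

import numpy as np

def log_likelihood(y_true_idx, proba, eps=1e-15):
    # proba: (n_samples, n_classes) predicted class probabilities
    # y_true_idx: integer column index of the observed class for each sample
    p = np.clip(proba[np.arange(len(y_true_idx)), y_true_idx], eps, 1.0)
    return np.sum(np.log(p))

Whether this is strictly comparable to the likelihood of a GAM or GLM is debatable, since the forest's probabilities are not derived from an explicit likelihood model.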
-
Using random forest for classification to train my model, tuning the model based on the validation data set (not using cross-validation)
I separate my dataset into three sets: a training set, a validation set, and a test set. I want to use the random forest method to train the data. But to find the best ntree, mtry, and nnodes, I want to use the validation set and see which parameters are best. Then I want to use those parameters on my training set. I do not want to use the caret package, since it uses cross-validation. I am dealing with a classification problem.
a = as.numeric(2:15)
for (i in 2:15){
  model2 = randomForest(as.factor(V2) ~ ., data = vset, ntree = 500, mtry = i, importance = TRUE)
  predValid2 = predict(model2, newdata = test, type = "class")
  a[i-1] = mean(predValid2 == test$V2)
}

n.tree = seq(from = 100, to = 5000, by = 100)
n.mtry = seq(from = 1, to = 15, by = 1)
model3 = randomForest(as.factor(V2) ~ ., data = vset, ntree = n.tree, mtry = n.mtry, importance = TRUE)
I use the above code to write a loop, but I believe it is not correct. I'd appreciate help finding the best parameters based on the validation set rather than cross-validation.
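As a hedged illustration of the idea in Python/scikit-learn (the same pattern carries over to randomForest in R): fit each candidate on the training split, score it on the validation split, and refit the best pair on the training data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

best_score, best_params = -1.0, None
for n_trees in range(100, 1100, 100):                 # candidate ntree values
    for m_try in range(1, 16):                        # candidate mtry values
        rf = RandomForestClassifier(n_estimators=n_trees, max_features=m_try,
                                    random_state=0)
        rf.fit(X_train, y_train)                      # assumed training split
        score = accuracy_score(y_valid, rf.predict(X_valid))  # assumed validation split
        if score > best_score:
            best_score, best_params = score, (n_trees, m_try)

final_rf = RandomForestClassifier(n_estimators=best_params[0],
                                  max_features=best_params[1],
                                  random_state=0).fit(X_train, y_train)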