Same score values for different classifiers in a machine learning project using Python
I am trying to train and test a model using train_test_split. I train the model on one dataset (Train_data) and test it on another (Test_data). The problem is that I am using different classifiers to select the best one, but all of them report the same score values. In fact, these values match the first classifier, which in my case is KNN, followed by DecisionTreeClassifier and so on. What could possibly be wrong?
    from sklearn.neighbors import KNeighborsClassifier as kc
    from sklearn.metrics import f1_score, jaccard_score

    kn = kc(n_neighbors=9).fit(x_train, y_train)
    y_pred1 = kn.predict(x1)

    f1_score(y1, y_pred1)          # 0.851063829787234
    jaccard_score(y1, y_pred1)     # 0.7407407407407407

    from sklearn.tree import DecisionTreeClassifier

    LoanTree = DecisionTreeClassifier(criterion="entropy", max_depth=6)
    LoanTree.fit(x_train, y_train)
    y_pred1 = LoanTree.predict(x1)

    f1_score(y1, y_pred1)          # 0.851063829787234
    jaccard_score(y1, y_pred1)     # 0.7407407407407407
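For what it's worth, one common cause of identical scores in notebooks is a stale y_pred1: if the DecisionTree fit or predict cell never actually runs, the metrics silently reuse KNN's predictions. A minimal sketch that rules this out by scoring each model in a single loop (x_train, y_train, x1, y1 are the names from the question; everything else is illustrative):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import f1_score, jaccard_score

    models = {
        "KNN": KNeighborsClassifier(n_neighbors=9),
        "DecisionTree": DecisionTreeClassifier(criterion="entropy", max_depth=6),
    }

    for name, model in models.items():
        model.fit(x_train, y_train)      # refit each model independently
        y_pred = model.predict(x1)       # fresh predictions for this model only
        print(name,
              "f1:", f1_score(y1, y_pred),
              "jaccard:", jaccard_score(y1, y_pred))

If the loop prints genuinely identical numbers for both models, the two classifiers really do make the same predictions on this test set, which can happen with small or easy datasets.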
See also questions close to this topic
Compact Lists OR Faster lists?
Is there a way to create lists/lists-of-lists (and maybe dicts) that act as lists in Python but take less memory?
Even if access is slower for the in-memory structure.
Or the other way around: faster, but taking more memory.
I suppose using in-memory DBs like Redis is both slower and takes more memory!
One possible usage is NLP and ML tasks where we have to store big chunks of parsed text, or features.
One way, for words, is to create a lexicon/dict and keep an integer list, but that is still a Python list, and I suppose the per-item metadata overhead will be a bigger percentage of the total.
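As an illustration of that integer-list idea (a minimal sketch, standard library only): the array module stores machine-typed integers contiguously, so the per-item cost is a fixed few bytes instead of a full Python object.

    import array
    import sys

    ids = list(range(1_000_000))       # a million separate Python int objects
    packed = array.array("i", ids)     # the same values as contiguous 32-bit C ints

    print(sys.getsizeof(ids))          # container only; each int object adds ~28 bytes more
    print(sys.getsizeof(packed))       # roughly 4 MB total, values included

numpy arrays behave similarly and additionally give fast vectorized access, at the cost of a dependency.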
How can I decode the RSA encryption more efficiently?
For a project I'm decrypting an RSA-encrypted message. My code works perfectly, but the check I can run says it's too slow.
I've tested the algorithm and I've concluded that the bottleneck is in the following code:
    message = (c**d) % n
Without this line, the code runs instantaneously. c is the encrypted message, d is the modular multiplicative inverse, and n = pq. The encrypted message is 783103, so I get that I'm dealing with large numbers, but it now takes around 1 second to run. Is there any way to speed this up?
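One standard speed-up, assuming c, d, and n are ordinary Python integers: c**d materializes the full power before the modulo is applied, while the built-in three-argument pow() reduces modulo n at every squaring step, which is dramatically faster for large exponents. A minimal sketch:

    # (c ** d) % n builds the astronomically large value c**d before reducing.
    # Three-argument pow() reduces modulo n at every squaring step instead.
    message = pow(c, d, n)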
GoogleNet fails to classify images
I built the Keras GoogLeNet from here: https://www.analyticsvidhya.com/blog/2018/10/understanding-inception-network-from-scratch/ The only difference is that I replaced the 1000 classes in the output layer with 3. The data is prepared this way:
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    def grey_preprocessor(xarray):
        xarray = (xarray / 127.5) - 1
        return xarray

    img_resol = (224, 224)

    train_batches = ImageDataGenerator(
        horizontal_flip=True,
        preprocessing_function=grey_preprocessor,
    ).flow_from_directory(
        directory=train_path,
        target_size=img_resol,
        classes=['bacterial', 'healthy', 'viral'],
        batch_size=10,
    )

    valid_batches = ImageDataGenerator(
        horizontal_flip=True,
        preprocessing_function=grey_preprocessor,
    ).flow_from_directory(
        directory=valid_path,
        target_size=img_resol,
        classes=['bacterial', 'healthy', 'viral'],
        batch_size=10,
    )

    test_batches = ImageDataGenerator(
        horizontal_flip=True,
        preprocessing_function=grey_preprocessor,
    ).flow_from_directory(
        directory=test_path,
        target_size=img_resol,
        classes=['bacterial', 'healthy', 'viral'],
        batch_size=10,
        shuffle=False,      # keep order for evaluation
    )

    assert train_batches.n == 4222
    assert valid_batches.n == 300
    assert test_batches.n == 150
    assert train_batches.num_classes == valid_batches.num_classes == test_batches.num_classes == 3
However, the accuracy on every batch is 0.3333, which means the model isn't classifying at all. I understand that the cause could be anything. What is a good way to troubleshoot this?
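A hedged first check (model is an assumed name for the compiled network from the tutorial): print the raw predicted probabilities on one batch. If every row is roughly [0.33, 0.33, 0.33], the softmax has collapsed to the class prior, which tends to implicate the learning rate, the loss/label setup, or the auxiliary heads rather than the data pipeline.

    import numpy as np

    x_batch, y_batch = next(test_batches)     # one batch of 10 images + one-hot labels
    probs = model.predict(x_batch)
    if isinstance(probs, list):               # assumption: with GoogLeNet's auxiliary
        probs = probs[-1]                     # heads, predict returns a list of outputs

    print(np.round(probs, 3))                 # near-uniform rows => predicting the prior
    print("predicted:", probs.argmax(axis=1))
    print("true:     ", y_batch.argmax(axis=1))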
What happens when the L1 regularizer penalty is 0?
I am playing around with deep learning. The final layer of my model is a dense layer, and when I set its L1 regularizer penalty to 0 it actually performs better than with any other value I have tested. I'm just wondering what is going on internally here, since it clearly isn't dividing by the zero penalty and raising an error, as I would have expected.
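For intuition, a minimal tf.keras sketch (assuming that API; the shapes are made up): the regularizer simply adds l1 * sum(|w|) to the loss, so a penalty of 0 contributes exactly 0 and the layer trains as if unregularized; nothing is ever divided by the penalty.

    import tensorflow as tf

    # l1=0.0 adds 0.0 * sum(|w|) = 0.0 to the loss: identical to no regularizer at all
    reg0 = tf.keras.regularizers.l1(0.0)
    reg1 = tf.keras.regularizers.l1(0.01)

    w = tf.ones((4, 10))                      # 40 weights, all equal to 1
    print(reg0(w).numpy())                    # 0.0
    print(reg1(w).numpy())                    # 0.01 * 40 = 0.4

    layer = tf.keras.layers.Dense(10, kernel_regularizer=reg0)   # trains as if unregularized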
MovieLens Dataset: Using the Timestamp for recommendations
I am currently working on a recommender system for the MovieLens dataset. The dataset provides a timestamp for each rating, but I do not know how to deal with it.
What can I do with the timestamps? How can I feed them into the recommendation engine?
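A hedged starting point with pandas (assuming the ratings.csv layout of recent MovieLens releases, where timestamps are Unix seconds): convert them to datetimes, derive simple context features, and/or use them for a chronological train/test split.

    import pandas as pd

    ratings = pd.read_csv("ratings.csv")      # userId, movieId, rating, timestamp (assumed)

    ratings["dt"] = pd.to_datetime(ratings["timestamp"], unit="s")
    ratings["hour"] = ratings["dt"].dt.hour              # time-of-day context feature
    ratings["dayofweek"] = ratings["dt"].dt.dayofweek    # weekday context feature

    # chronological split: train on older ratings, evaluate on the newest 20%
    ratings = ratings.sort_values("timestamp")
    cut = int(len(ratings) * 0.8)
    train, test = ratings.iloc[:cut], ratings.iloc[cut:]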
Why, when I do 4-cluster clustering with K-means, do I get only one inertia and not 4?
I have a dataframe and I clustered it into 4 clusters using sklearn's KMeans:

    from sklearn.cluster import KMeans

    km = KMeans(
        n_clusters=4,
        init='random',
        n_init=10,
        max_iter=10,
        tol=1e-4,
        random_state=10,
        algorithm='full',
    )
    km.fit(df)
So I have 4 clusters, but when I look at km.inertia_, I get only one value.
However, according to its definition, inertia is the sum of squared distances of samples to their closest cluster center. So shouldn't there be 4 inertia values instead of 1, or am I wrong?
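For reference, km.inertia_ is a single number because scikit-learn sums the squared distances over all samples. A sketch of recovering a per-cluster breakdown from the fitted model (assuming df is purely numeric):

    import numpy as np

    X = np.asarray(df)                  # the same data passed to km.fit
    labels = km.labels_
    centers = km.cluster_centers_

    # squared distance of every sample to its own cluster's center
    sq_dist = ((X - centers[labels]) ** 2).sum(axis=1)

    per_cluster = np.array([sq_dist[labels == k].sum() for k in range(4)])
    print(per_cluster)             # four values, one per cluster
    print(per_cluster.sum())       # equals km.inertia_ up to floating-point error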
MSE in sklearn Tree
Can you tell me what the MSE shown in each node describes? I had assumed it was the actual MSE of that node's region and its samples, but that would suggest we are getting worse after the splits: the root MSE is 7.592, while the sum of the MSEs of the terminal nodes is much higher, which should not be possible, right? I suppose I am misunderstanding the MSE here; could someone be kind enough to enlighten me?
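One plausible resolution, with deliberately made-up numbers: node MSEs are per-sample averages, so children must be compared to their parent via a sample-weighted average, not a raw sum, and on the training data the weighted average never exceeds the parent's MSE.

    # hypothetical numbers: parent node with 100 samples and mse = 7.592
    # splits into children with 60 samples at mse 4.0 and 40 samples at mse 9.0
    weighted_child_mse = (60 * 4.0 + 40 * 9.0) / 100
    print(weighted_child_mse)   # 6.0 <= 7.592, although the raw sum 4.0 + 9.0 = 13.0 is larger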
Python, Encoder [sklearn.preprocessing.LabelEncoder] - saving classes [str] and using them in a different module with a smaller class set
When using the code below, I noticed a small problem:

    # creating_encoding.py
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv('dataset.csv')
    # ... preprocessing
    data = df.copy(deep=True)
    feat = data.drop(columns=['predict'])
    label = data["predict"]

    X_train, X_test, y_train, y_test = train_test_split(feat, label, test_size=0.2)

    lbl = LabelEncoder()
    X_train['type'] = lbl.fit_transform(X_train['type'].astype(str))
    X_test['type'] = lbl.fit_transform(X_test['type'].astype(str))

    np.save('classes.npy', lbl.classes_)
Then, when I try to read that back in a different module with the code below, I get different results:

    # testing_encoding.py
    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    encoder.classes_ = np.load('classes.npy', allow_pickle=True)
    predict['type'] = encoder.fit_transform(predict['type'].astype(str))
The code works as intended when the file "dataset.csv" has 4 types in its rows (typeA, typeB, typeC, typeD), which get encoded as 1, 2, 3, 4. And when testing_encoding.py runs on a different file, for example "dataset2.csv" containing typeD, typeA, typeC, typeB, it is changed correctly to 4, 1, 3, 2.
The problem appears when dataset2.csv contains fewer than 4 classes to encode, for example typeD, typeA, typeB: it gets encoded as 1, 2, 3, when it should be encoded as 4, 1, 2.
If you could give me some answer on how to make this work as intended, it would be really appreciated. Thank you for all help and suggestions.
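For what it's worth, a likely culprit based only on the snippets above: fit_transform refits the encoder, rebuilding the class-to-integer mapping from whatever classes happen to be present and discarding the loaded classes_. Using transform keeps the saved mapping. A minimal sketch:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    encoder.classes_ = np.load('classes.npy', allow_pickle=True)

    # transform() reuses the saved class-to-integer mapping; fit_transform()
    # would relearn it from the new data and renumber the smaller class set.
    predict['type'] = encoder.transform(predict['type'].astype(str))

The same applies in creating_encoding.py: X_test should be encoded with lbl.transform rather than lbl.fit_transform, so the test split reuses the mapping fitted on the training split.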
Oversampling in Weka with three-fold classification
I am trying to oversample my data in Weka to avoid imbalanced classification. I have tried Resample and SMOTE, but I seem to be stuck, since my class attribute has three values (low, medium, high). When I apply these filters, I am unsure whether I am using the correct mechanisms.
The instance counts are as follows:
1. low = 6700
2. medium = 1200
3. high = 100
I would like to get the data preprocessing right in order to help build a more predictive model.
Any help is greatly appreciated! Thanks in advance.
String classification, how to encode character-by-character and train?
I am trying to build a classifier to sort some files into 150 categories based on their names. Here are some examples of file names in my dataset (~700k files):

    104932489 - urgent - contract validation for xyz limited.msg
    treatment - an I l - contract n°4934283 received by partner.pdf
    - invoice_8843238_1_europe services_business 8592342sid paris.xls
    140159498736656.txt
    140159498736843.txt
    fsk_000000001090296_sdiselacrusefeyre_2000912.xls
    fsk_000000001091293_lidlsnd1753mdeas_2009316.xls

You can see that the filenames can really be anything, but there is always some pattern that is respected within the same category. It can be in the numbers (which are sometimes close), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one would take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy; 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
- Tokenizing character by character and feeding the tokens to an LSTM model with an embedding layer. However, I wasn't able to implement it and kept getting dimension errors.
- Adapting Word2Vec to convert the characters into vectors. However, it automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates useless dimensions: with a size of 20, my data lives in 20 dimensions, but looking closely there are always the same 150 vectors in those 20 dimensions, so it's really useless. I could use a size of 2, but I still need the numeric data and the special characters.
- Generating n-grams from each path in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and special characters were not dropped, but it gave me around 400,000 features. I am running a dimensionality reduction with UMAP (n_components=5, metric='hellinger'), but the reduction runs for 2 hours and then the kernel crashes. A lighter variant of this approach is sketched below.
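One lighter variant I could try (all parameter values are illustrative, and filenames/labels are placeholder names): let CountVectorizer build the character n-grams itself with analyzer='char', cap the vocabulary with max_features, and feed the sparse matrix straight into a linear classifier, skipping the dense UMAP step entirely.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    # filenames: list of ~700k name strings; labels: their 150 categories (assumed)
    clf = make_pipeline(
        CountVectorizer(analyzer='char', ngram_range=(1, 4),
                        lowercase=False,       # keep case and special characters intact
                        max_features=50_000),  # cap the ~400k features (tunable)
        SGDClassifier(loss='log_loss'),        # linear model that handles sparse input
    )
    clf.fit(filenames, labels)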
MNIST Classifier - using prior to exclude class
I've got a ResNet MNIST classifier that works well at predicting the digits 0-9. In my application I sometimes have prior information telling me that a particular image is not a particular value. For example, I may know that the image being classified is not a "6".
Is there any known approach to using this prior information to restrict the classification, so that the classifier output gives a 0% probability of the image being a 6? I obviously know that I can just zero out the classifier output for label 6 and renormalize the remaining classes (i.e. 0-9 excluding 6). But that is just a linear rescaling of the final-layer output.
I'd like to inject this prior information at the input layer (?) or somewhere inside the network, to take advantage of all the nonlinear processing the network does.
It's not clear that a simple CNN can do this, but I'm willing to play around with architectures or other approaches.
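One pattern I'm considering (a rough tf.keras sketch, with an illustrative stand-in body instead of the real ResNet): feed the 10-element validity mask in as a second input so the network can condition on the prior nonlinearly, and also add log(mask) to the logits before the softmax so excluded classes end up with essentially zero probability.

    import tensorflow as tf

    image_in = tf.keras.Input(shape=(28, 28, 1))
    mask_in = tf.keras.Input(shape=(10,))            # 1.0 = class allowed, 0.0 = excluded

    x = tf.keras.layers.Conv2D(32, 3, activation='relu')(image_in)  # stand-in for the ResNet body
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Concatenate()([x, mask_in])  # the network can condition on the prior
    logits = tf.keras.layers.Dense(10)(x)

    # log of (almost) zero is a large negative number, so excluded classes
    # receive essentially zero probability after the softmax
    masked_logits = logits + tf.math.log(mask_in + 1e-12)
    probs = tf.keras.layers.Softmax()(masked_logits)

    model = tf.keras.Model([image_in, mask_in], probs)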
Any advice is much appreciated.