Scikit-learn Random Forest Classifier: combining numeric values with multi-labels
I have a training dataset where every sample has two numeric values and six (out of a possible twelve) unique categorical labels. What I want to do is train a random forest classifier on each sample's two numeric values while assigning it its six labels, with the aim that, for my test values, I can figure out which of the labels correlate most with the numeric values.
Am I right in thinking that 'forest.fit(features[numeric columns], features[label columns])' is the right approach? When I try to score my data, I get the following error:
ValueError: multiclass-multioutput is not supported
So I'm not passing my labels in correctly.
In score(X, y), X is my two numeric values as floats, and y is a pandas DataFrame containing the labels.
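For reference, RandomForestClassifier will happily fit a 2-D label matrix (multi-output), but score() calls accuracy_score, which raises exactly this error for multiclass-multioutput targets. A minimal sketch with synthetic data of the shapes described above (the shapes and label range are assumptions) that scores each label column separately instead:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                   # two numeric features per sample
y = rng.randint(0, 12, size=(100, 6))  # six label columns, twelve possible values

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)                       # multi-output fit works fine

pred = forest.predict(X)               # shape (100, 6)
# forest.score(X, y) would raise "multiclass-multioutput is not supported",
# so compute a per-label accuracy instead:
per_label_acc = (pred == y).mean(axis=0)
print(per_label_acc.shape)             # one accuracy per label column
```

The point is that fit/predict accept the 2-D y; only the built-in scorer refuses it.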
See also questions close to this topic
Comparing list values and storing new ones in a separate list
import csv

with open("DADSA RESIT CWK JULY 2018.csv", newline='') as f:
    r = csv.reader(f)
    database = list(r)

del database[0]  # drop the header row
names = []

def fillnames(d, n):
    for j in n:
        for i in d:
            if d[i] == n[j] and d[i] == n[j]:
                n[i] = n[i] + 1
            else:
                names.append([d[i], d[i], 0])

fillnames(database, names)

for i in names:
    print(i)
The code I have here is me scanning in a csv file into a list. I then want to count how many entries share the same name, by scanning each new name into a separate list, then incrementing the number found every time I find a new one. Every time I run this code it returns "TypeError: list indices must be integers or slices, not list."
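(The error comes from indexing the lists with the list elements themselves, e.g. d[i] where i is already a row.) A common way to count duplicates is collections.Counter; a sketch with stand-in rows in place of the CSV data, assuming the name is in column 0:

```python
from collections import Counter

# Stand-in for the rows returned by csv.reader; assumes the name is column 0.
rows = [["Alice", 3], ["Bob", 1], ["Alice", 7], ["Cara", 2], ["Bob", 5]]

counts = Counter(row[0] for row in rows)  # name -> number of occurrences
for name, n in counts.items():
    print(name, n)
```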
Changing font of a list python
Say I have some code that prints a 1D list as a 4 by 4 grid:
nlist = [2, 2, 4, 8,
         0, 0, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0]

def drawBoard():
    count = 0
    for i in range(16):
        print(nlist[i], end=' ')
        count += 1
        if count == 4:
            print("")
            count = 0
    print("")

drawBoard()
How can I change all the text printed from this list to font size 26? I tried doing font = 'times 26' but I don't know where to put it, or whether that option needs tkinter.
Get mouse coordinates without clicking in matplotlib
In a matplotlib plot, how can I continuously read the coordinates of the mouse as it moves, without waiting for clicks? This is possible in MATLAB, and there is an mpld3 plugin that does almost exactly what I want, but I can't see how to actually access the coordinates from it. There is also the package mpldatacursor, but that seems to require clicks. Searching for things like "matplotlib mouse coordinates without clicking" did not yield answers.
Answers using additional packages such as mpld3 are fine, but it seems like a pure matplotlib solution should be possible.
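Plain matplotlib does expose this through the 'motion_notify_event'. A sketch (using the non-interactive Agg backend so it runs headless; with an interactive backend you would call plt.show() and move the mouse):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

def on_move(event):
    # event.xdata / event.ydata are None when the cursor is outside the axes
    if event.inaxes is not None:
        print(f"x={event.xdata:.3f}, y={event.ydata:.3f}")

cid = fig.canvas.mpl_connect("motion_notify_event", on_move)
# plt.show()  # with an interactive backend, coordinates print as the mouse moves
```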
'tflite_convert' is not recognized as an internal or external command (in windows)
I'm trying to convert my saved_model.pb (from the Object Detection API) to .tflite for ML Kit, but when I execute this command in cmd:
tflite_convert \
  --output_file=/saved_model/maonani.tflite \
  --saved_model_dir=/saved_model/saved_model
I get a response saying:
C:\Users\LENOVO-PC\tensorflow> tflite_convert \ --output_file=/saved_model/maonani.tflite \ --saved_model_dir=/saved_model/saved_model
'tflite_convert' is not recognized as an internal or external command, operable program or batch file.
What should I do to make this work?
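One thing worth checking (an assumption, since the full setup isn't shown): the trailing \ is the Unix shell's line-continuation character, and Windows cmd does not treat it that way (cmd's continuation character is ^), so the command may not be parsed as intended. tflite_convert also only appears on PATH once a sufficiently recent TensorFlow pip package is installed and its Scripts directory is on PATH. A single-line invocation with the same paths avoids the continuation issue entirely:

```
tflite_convert --output_file=/saved_model/maonani.tflite --saved_model_dir=/saved_model/saved_model
```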
image segmentation on retinal blood vessels in matlab
For classification, there are different supervised methods. Which method is better for segmenting retinal blood vessels from color images: a CNN or a Random Forest Classifier (RFC)?
How to create bi-modal initial weight distribution in tensorflow?
How can I create a bi-modal initial weight distribution in tensorflow?
I want to create a distribution that is composed of 2 normal distributions centered at 0.15 and -0.15 each with stds of 0.1.
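TensorFlow does not ship a bimodal initializer, but one option is to sample the mixture yourself and pass the resulting array as the variable's initial value. A minimal sketch in NumPy (the helper name and shapes are illustrative, not a TensorFlow API):

```python
import numpy as np

def bimodal(shape, mu=0.15, sigma=0.1, rng=None):
    """Sample weights from an equal mixture of N(+mu, sigma) and N(-mu, sigma)."""
    rng = np.random.default_rng(rng)
    n = int(np.prod(shape))
    signs = rng.choice([-1.0, 1.0], size=n)      # pick one of the two modes per weight
    w = rng.normal(loc=signs * mu, scale=sigma)  # sample within the chosen mode
    return w.reshape(shape)

w0 = bimodal((256, 128), rng=0)
# In TensorFlow this could then seed a variable, e.g. tf.Variable(w0, dtype=tf.float32)
print(w0.shape)
```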
Increasing range by 1 gives totally wrong answer in sklearn
I am supposed to solve a regression problem in sklearn, adding polynomial features to a linear regression while increasing the degree from 1 to 10 and finding the training- and test-set R2 score for each value. So my code was:
restr = np.zeros(10)
reste = np.zeros(10)
for i in range(0, 9):
    poly = PolynomialFeatures(degree=i + 1)
    X_poly = poly.fit_transform(X_train.reshape(11, 1))
    X_poly = X_poly.reshape(11, i + 2)
    X_test_poly = poly.fit_transform(X_test.reshape(4, 1))
    X_test_poly = X_test_poly.reshape(4, i + 2)
    linreg = LinearRegression().fit(X_poly, y_train)
    restr[i] = linreg.score(X_poly, y_train)
    reste[i] = linreg.score(X_test_poly, y_test)
return (restr, reste)
and the output is:
(array([ 0.42924578, 0.4510998 , 0.58719954, 0.91941945, 0.97578641, 0.99018233, 0.99352509, 0.99637545, 0.99803706, 0. ]), array([-0.45237104, -0.06856984, 0.00533105, 0.73004943, 0.87708301, 0.9214094 , 0.92021504, 0.63247944, -0.64525447, 0. ]))
I only got nine values, but that can be easily fixed by increasing the range by 1.
Only that it doesn't.
If I increase the range by one, here's what I get:
(array([ 0.42924578, 0.4510998 , 0.58719954, 0.91941945, 0.97578641, 0.99018233, 0.99352509, 0.99637545, 0.99803706, 1. ]), array([ -4.52371042e-01, -6.85698415e-02, 5.33105295e-03, 7.30049428e-01, 8.77083009e-01, 9.21409398e-01, 9.20215041e-01, 6.32479438e-01, -6.45254469e-01, -3.88585250e+01]))
It can be seen that the values in the second array (reste) are basically 10 times what they should be. I don't understand why this is happening. Also, interestingly, even if I divide by 10 like so:
reste[i] = linreg.score(X_test_poly, y_test)/10
It still outputs the same values.
Can someone please explain to me what is going wrong?
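Two things are going on here. range(0, 9) yields i = 0..8, so only nine of the ten slots are ever written and index 9 keeps its np.zeros default; range(10) fills all ten. And the second printout only looks ten times larger because NumPy switched the whole array to scientific notation once it contained a large value: -4.52371042e-01 is the same -0.45237104 as before, which is also why dividing by 10 appears to change nothing. A self-contained version of the loop with made-up data of the same shapes (the sine target is an assumption):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_train = rng.rand(11)
y_train = np.sin(4 * X_train)
X_test = rng.rand(4)
y_test = np.sin(4 * X_test)

restr = np.zeros(10)
reste = np.zeros(10)
for i in range(10):  # range(0, 9) would stop at i == 8, leaving index 9 at 0.0
    poly = PolynomialFeatures(degree=i + 1)
    X_poly = poly.fit_transform(X_train.reshape(-1, 1))
    X_test_poly = poly.transform(X_test.reshape(-1, 1))  # reuse the train-fitted transformer
    linreg = LinearRegression().fit(X_poly, y_train)
    restr[i] = linreg.score(X_poly, y_train)
    reste[i] = linreg.score(X_test_poly, y_test)

print(restr[9])  # now a real score rather than the np.zeros placeholder
```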
Choose the number of samples to average the gradient on in the SGDClassifier of Scikit-learn
I'm aware that the SGDClassifier in Scikit-learn picks one random sample from the training dataset at each step to calculate the gradient and update the model weights.
My question is: among the parameters of the SGDClassifier, there doesn't seem to be an option to select the number of samples to pick each time (instead of just one instance) to average the gradient over. That would give us Mini-batch Gradient Descent.
I've already had a look at the partial_fit() method, which takes a chunk of the training dataset each time to train on, but when using it with the SGDClassifier, doesn't that just boil down to picking a random training instance from the chunk instead of choosing it from the whole dataset?
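For context, a common pattern is to drive partial_fit with explicit mini-batches, reshuffling each epoch. Strictly speaking, within each partial_fit call SGD still takes per-sample steps rather than averaging one gradient per batch, so this approximates mini-batch gradient descent rather than implementing it exactly. A sketch on toy data (the data and batch size are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # toy, roughly linearly separable target

clf = SGDClassifier(random_state=0)        # default hinge loss
classes = np.unique(y)                     # partial_fit needs all classes up front
batch_size = 32

for epoch in range(5):
    order = rng.permutation(len(X))        # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # each call performs SGD updates over just this mini-batch
        clf.partial_fit(X[idx], y[idx], classes=classes)

print(clf.score(X, y))
```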
LSTM multiple features, multiple classes, multiple outputs
I'm trying to use an LSTM classifier to generate music based on some MIDIs that I have.
The LSTM uses two features, the notes' pitch and the notes' duration.
For illustration, let's think we have:
Pitches: ["A", "B", "C"]
Durations: ["0.5", "1", "1.5"]
As you can imagine, a generated note has to have both pitch and duration.
I tried to do it with a MultiLabelBinarizer.
from sklearn.preprocessing import MultiLabelBinarizer

labels = [[x, y] for x in all_pitches for y in all_durations]
mlb = MultiLabelBinarizer()
mlb_value = mlb.fit_transform(labels)
This divides the classes as intended, but the problem I'm having comes at the time of predictions.
prediction = model.predict_proba(prediction_input)
indexes = np.argsort(prediction, axis=None)[::-1]
index1 = indexes[0]
index2 = indexes[1]
result1 = mlb.classes_[index1]
result2 = mlb.classes_[index2]
I need the notes to have both pitch and duration, so this approach doesn't seem to work for me (I just get the same two pitches over and over).
Another thing I thought of was using a MultiOutputClassifier, but I can't seem to understand the difference between the two approaches, or how to actually use this one.
Thanks for the patience, and sorry for the probably stupid question.
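One common alternative (a sketch, not the asker's code) is to encode pitch and duration separately and give the network two output heads, taking an argmax per head; that way every generated note always gets exactly one pitch and one duration. The pitch_probs/duration_probs arrays below are stand-ins for the two softmax outputs a model would produce:

```python
import numpy as np

all_pitches = ["A", "B", "C"]
all_durations = [0.5, 1.0, 1.5]

# Stand-ins for the softmax outputs of two separate prediction heads
pitch_probs = np.array([0.1, 0.7, 0.2])     # one probability per pitch class
duration_probs = np.array([0.2, 0.2, 0.6])  # one probability per duration class

pitch = all_pitches[int(np.argmax(pitch_probs))]        # most likely pitch
duration = all_durations[int(np.argmax(duration_probs))]  # most likely duration
print(pitch, duration)  # B 1.5
```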
Cannot figure out format needed to make predictions on dataset trained with doc2vec and random forest classifier
I am trying to make predictions on a dataset based on some pre-defined data (tweets, and the categories the tweets belong to, labeled 1-16) that I have built a model with in doc2vec and trained on a random forest classifier. I am confused about what format I need to put my data into before I call clf.predict:
import csv
import itertools

import numpy as np
import nltk
import pandas as pd
import gensim
from gensim import utils
from gensim.models import Doc2Vec
from sklearn.ensemble import RandomForestClassifier

# just making the object to put into gensim's doc2vec
class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for t, l in itertools.izip(self.doc_list, self.labels_list):
            # change here
            t = nltk.word_tokenize(t)
            # end of change
            yield gensim.models.doc2vec.LabeledSentence(t, [l])

# predefined
tweets = ["a tweet", "another tweet", ... , "a thousandth tweet"]
labels_list = [1, 1, ... , 16]  # what category the tweet belongs to

training_data = LabeledLineSentence(tweets, labels_list)

# build the doc2vec model
model = Doc2Vec(vector_size=100, min_count=1, dm=1)
model.build_vocab(training_data)
model.train(training_data, total_examples=model.corpus_count, epochs=20)

# put tweets into classifier
train_tweets = []
for i in range(len(tweets)):
    label = labels_list[i]
    train_tweets.append(model[label])

# have to convert to numpy array because that is what clf takes
train_tweets = np.array(train_tweets)
train_labels = np.array(labels_list)

# fit classifier
clf = RandomForestClassifier().fit(train_tweets, train_labels)

# this is the data I am trying to classify into labels
test_data = ["an unseen tweet", "another unseen tweet", ... , "a thousandth unseen tweet"]

# *******change here***************
for t in test_data:
    split = nltk.word_tokenize(t)
    vect = model.infer_vector(split)
    vect = vect.reshape(1, -1)
    print clf.predict(vect)
Toward the end of this code block is where I get confused. I am pretty sure I built the doc2vec model and trained the classifier right, but I am unsure what I need to do to each tweet in the testing data before I call clf.predict on it. I have tried tokenizing the string and using a count vectorizer, but I keep getting errors about how it cannot convert those to a float. Is there some other way I am supposed to process the test data before putting it in for prediction?
I use FeatureHasher from sklearn for a DecisionTreeRegressor. Now how do I decode the predicted results from the regressor?
import copy

import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.tree import DecisionTreeRegressor

# 'h' was not defined in the original snippet; assuming a FeatureHasher like:
h = FeatureHasher(input_type='string')

dataset = pd.read_csv('ll.csv', index_col=0)
dataset = dataset.dropna(axis=0)

# features or independent variables
x = pd.DataFrame()
x['Skills'] = dataset['Skills']
x['Location'] = dataset['Location']
x['Industry'] = dataset['Industry']
x['Experience'] = dataset['Experience']

# applying hashing
x_hash = copy.copy(x)
for i in range(x_hash.shape[1]):
    x_hash.iloc[:, i] = x_hash.iloc[:, i].astype('str')
x_hash = h.transform(x_hash.values)

# dependent variable
y = pd.DataFrame()
y['Functional Area'] = dataset['Functional Area']
y_hash = copy.copy(y)
for i in range(y_hash.shape[1]):
    y_hash.iloc[:, i] = y_hash.iloc[:, i].astype('str')
y_hash = h.transform(y_hash.values)

# regressor
regressor = DecisionTreeRegressor(random_state=0)
ll = regressor.fit(x_hash.toarray(), y_hash.toarray())

# for predicting input features
input_df = pd.DataFrame()
input_df['Skills'] = ['Illustrator']
input_df['Experience'] = ['1-6']
input_df['Industry'] = ['IT - Software Services']
input_df['Location'] = ['Cairo-Egypt']

input_df_hash = copy.copy(input_df)
for i in range(input_df_hash.shape[1]):
    input_df_hash.iloc[:, i] = input_df_hash.iloc[:, i].astype('str')
input_df_hash = h.transform(input_df_hash.values)

sss = regressor.predict(input_df_hash.toarray())
Using gensim doc2vec model with sklearn random forest
I am trying to use a doc2vec model with a random forest classifier to make predictions about lines of text. So far, I have gotten the corpus into a labeled sentence object and created the model. I am confused about how to put this into sklearn's random forest classifier. I am specifically trying to use doc2vec and random forest. Here is what I have tried so far.
import itertools

import gensim
from gensim import utils
from gensim.models import Doc2Vec
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

tweets = ["a tweet", "some other tweet", ... , "tweets"]
labels_list = [1, 1, ... , 16]  # predefined

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for t, l in itertools.izip(self.doc_list, self.labels_list):
            yield gensim.models.doc2vec.LabeledSentence(t, [l])

training_data = LabeledLineSentence(tweets, labels_list)

model = Doc2Vec(vector_size=300, alpha=0.025, min_alpha=0.00025, min_count=0, dm=1)
model.build_vocab(training_data)
model.train(training_data, total_examples=model.corpus_count, epochs=100)
model.alpha -= 0.0002
model.min_alpha = model.alpha

# this is where I am trying to put it into random forest
X_train, X_test, y_train, y_test = train_test_split(tweets, labels_list, random_state=0)
clf = RandomForestClassifier().fit(X_train, y_train)  # I know this is very wrong
I have been googling around and cannot find a good example of this. I am pretty much completely lost on how to feed doc2vec output into a random forest classifier. Thanks in advance for any help or pointers in the right direction!
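The key detail is that the random forest must see fixed-length numeric document vectors, not the raw tweet strings. A minimal sketch of that hand-off, using random stand-in vectors in place of real Doc2Vec output (in the actual pipeline each row would come from the trained model's document vectors or infer_vector):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-ins for Doc2Vec document vectors: 200 "tweets", 100-dim vectors
rng = np.random.RandomState(0)
doc_vectors = rng.rand(200, 100)
labels = rng.randint(1, 17, size=200)   # categories 1..16

X_train, X_test, y_train, y_test = train_test_split(
    doc_vectors, labels, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)               # fit on vectors, not raw strings
pred = clf.predict(X_test)
print(pred.shape)                       # one predicted category per test vector
```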