Optimizing season prediction from categorical purchase data
I am writing an algorithm with the aim of predicting the season of purchase (winter, spring, summer and fall) using the following example data:
df.head(4)

              shop  category  subcategory  season
date
20130904      abc   weddings  shoes        winter
20130904      def   jewelry   watches      summer
20130905      ghi   sports    sneakers     spring
20130905      jkl   jewelry   necklaces    fall
The predictor variables are shop, category and subcategory, and the target variable is season.
I have two questions: 1) best practices for preprocessing, and 2) the best classification models for this type of problem.
1) Preprocessing. Below is my code; however, I'm unsure whether I need one-hot encoding to handle the categorical variables properly:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

le = LabelEncoder()
ss = StandardScaler()
X = pd.get_dummies(store_df.iloc[:, :1], drop_first=True).values.astype('float')
y = le.fit_transform(store_df.iloc[:, 1].values).astype('float')
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)
xtrain = ss.fit_transform(xtrain)
xtest = ss.transform(xtest)
The shape of the data looks correct as follows:
Training set: (67915, 1040), (67915,)
Testing set: (29107, 1040), (29107,)
Would preprocessing benefit from one-hot encoding? What are best practices here?
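One observation: the snippet above one-hot encodes only the first column (`iloc[:, :1]`). A minimal sketch of encoding all three predictors, using a hypothetical `store_df` built from the four sample rows (column names assumed from the table above); note that standardizing 0/1 dummy columns is generally unnecessary:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with the same columns as the sample data.
store_df = pd.DataFrame({
    "shop":        ["abc", "def", "ghi", "jkl"],
    "category":    ["weddings", "jewelry", "sports", "jewelry"],
    "subcategory": ["shoes", "watches", "sneakers", "necklaces"],
    "season":      ["winter", "summer", "spring", "fall"],
})

# One-hot encode ALL three predictors, not just the first column.
X = pd.get_dummies(store_df[["shop", "category", "subcategory"]]).astype(float)
y = LabelEncoder().fit_transform(store_df["season"])

print(X.shape)  # one dummy column per distinct (column, value) pair
```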
2) Model selection. So far I have tried a couple of classifiers, both of which score around 66% (not ideal):
logistic regression:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(C=100000)
lr.fit(xtrain, ytrain)
lr_pred = lr.predict(xtest)
lr_acc = accuracy_score(ytest, lr_pred)
random forest classifier:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, max_features=3)
rfc.fit(xtrain, ytrain)
rfc_pred = rfc.predict(xtest)
rfc_acc = accuracy_score(ytest, rfc_pred)
I would imagine several classification methods should work well, provided the preprocessing is done properly. Any pointers are welcome.
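As a quick sanity check, cross-validation gives a more stable estimate than a single train/test split. A sketch on synthetic stand-in data (the array shapes and `max_features="sqrt"` are assumptions, not the original setup; with ~1000 dummy columns, `max_features=3` is very low):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for the one-hot-encoded matrix: 500 rows, 20 dummies, 4 seasons.
X = rng.integers(0, 2, size=(500, 20)).astype(float)
y = rng.integers(0, 4, size=500)

# "sqrt" is a common max_features choice for classification forests.
rfc = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
scores = cross_val_score(rfc, X, y, cv=5)
print(scores.mean())
```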
See also questions close to this topic

Half of tkinter button border white?
For some odd reason my tkinter button's border is half white and half black. Is this normal/fixable? I want the whole border to be black.
btn_Next1 = tk.Button(self, text="Next", command=lambda: controller.show_frame(PageOne))
btn_Next1.configure(font=buttonfont, fg='#ffffff', background='#00497a',
                    highlightbackground='#3E4149', borderwidth=2)

Error trying to convert from saved model to tflite format
While trying to convert a saved model to tflite file I get the following error:
F tensorflow/contrib/lite/toco/tflite/export.cc:363] Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If you have a custom implementation for them you can disable this error with allow_custom_ops, or by setting allow_custom_ops=True when calling tf.contrib.lite.toco_convert(). Here is a list of operators for which you will need custom implementations: AsString, ParseExample.
Aborted (core dumped)
I am using the pre-made DNN Estimator.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

IRIS_TRAINING = "iris_training.csv"
IRIS_TEST = "iris_test.csv"
INPUT_TENSOR_NAME = 'inputs'

def main():
    training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename=IRIS_TRAINING, target_dtype=np.int, features_dtype=np.float32)
    feature_columns = [tf.feature_column.numeric_column(INPUT_TENSOR_NAME, shape=[4])]
    # Build 3 layer DNN with 10, 20, 10 units respectively.
    classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3,
                                            model_dir="/tmp/iris_model")
    # Define the training inputs
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: np.array(training_set.data)},
        y=np.array(training_set.target),
        num_epochs=None,
        shuffle=True)
    # Train model.
    classifier.train(input_fn=train_input_fn, steps=2000)
    inputs = {'x': tf.placeholder(tf.float32, [4])}
    tf.estimator.export.ServingInputReceiver(inputs, inputs)
    saved_model = classifier.export_savedmodel(export_dir_base="/tmp/iris_model",
                                               serving_input_receiver_fn=serving_input_receiver_fn)
    print(saved_model)
    converter = tf.contrib.lite.TocoConverter.from_saved_model(saved_model)
    tflite_model = converter.convert()

def serving_input_receiver_fn():
    feature_spec = {INPUT_TENSOR_NAME: tf.FixedLenFeature(dtype=tf.float32, shape=[4])}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()

if __name__ == "__main__":
    main()
The Iris files can be downloaded from the following links:
IRIS_TRAINING FILE: "http://download.tensorflow.org/data/iris_training.csv"
IRIS_TEST FILE: "http://download.tensorflow.org/data/iris_test.csv"

Replacing a variable name in text with the value of that variable
Things like this have been answered, but not this in particular. I have a template that uses placeholders for the varying content that will be filled in. Suppose the template has:
"This article was written by AUTHOR, who is solely responsible for its content."
The author's name is stored in the variable: author
So I of course do:
wholeThing = wholeThing.replace('AUTHOR', author)
The problem is that I have 10 of these self-named variables, and it would be more economical if I could do something like this, using only 4 for brevity:
def selfreplace(...):
    ...

wholeThing = wholeThing.selfreplace('AUTHOR', 'ADDR', 'PUBDATE', 'MF_LINK')
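One way to sketch this idea: keep the placeholder-to-value pairs in a dict and loop over it, so ten replacements collapse into one call (`fill_template` is a hypothetical helper name):

```python
def fill_template(template, values):
    """Replace each placeholder key in `template` with its value."""
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template

wholeThing = "This article was written by AUTHOR, who is solely responsible for its content."
result = fill_template(wholeThing, {"AUTHOR": "Jane Doe"})
print(result)
```

The same dict can carry ADDR, PUBDATE, MF_LINK, and the rest, so adding an eleventh placeholder is a one-line change.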

glmnet multinomial logistic regression prediction result
I'm building a penalized multinomial logistic regression, but I'm having trouble coming up with an easy way to get the prediction accuracy. Here's my code:
fit.ridge.cv <- cv.glmnet(train[,-1], train[,1], type.measure="mse", alpha=0,
                          family="multinomial")
fit.ridge.best <- glmnet(train[,-1], train[,1], family="multinomial", alpha=0,
                         lambda=fit.ridge.cv$lambda.min)
fit.ridge.pred <- predict(fit.ridge.best, test[,-1], type="response")
The first column of my test data is the response variable, and it has 4 categories. If I look at the result (fit.ridge.pred), it looks like this:

           1            2           3            4
0.8743061353 0.0122328811 0.004798154 0.1086628297

From what I understand, these are the class probabilities. I want to know if there's an easy way to compute the model accuracy on the test data. For now I'm taking the max of each row and comparing it with the original label. Thanks
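The row-wise argmax idea can be sketched numerically (in Python/NumPy here, for illustration; the probability matrix and labels below are hypothetical stand-ins for `fit.ridge.pred` and the first test column):

```python
import numpy as np

# Hypothetical class-probability matrix (one row per test case) and true labels 0..3.
probs = np.array([
    [0.87, 0.01, 0.00, 0.12],
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.30, 0.20],
])
y_true = np.array([0, 1, 3])

y_pred = probs.argmax(axis=1)          # most probable class per row
accuracy = (y_pred == y_true).mean()   # fraction of matches
print(accuracy)
```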

Non-linear delay model for logic gates using machine learning
I have to develop a non-linear model for cell libraries (inverters, NAND gates, etc.) which will calculate the delay. This normally happens in the SPICE simulator, but I have to do it using machine learning. I am new to machine learning, so I wanted to know how I can proceed with this.
I have ".lib" file which contains something like
cell_rise (delay_template) {
    index_1 ("0.001169, 0.005416, 0.01391, 0.03037, 0.06381, 0.1307, 0.2645, 0.532");
    index_2 ("7.906e-05, 0.007225, 0.02152, 0.04921, 0.1055, 0.218, 0.4431, 0.8933");
    values ( \
        "0.001217, 0.003174, 0.006568, 0.01305, 0.0262, 0.05249, 0.1051, 0.2101", \
        "0.001635, 0.003949, 0.00758, 0.01413, 0.02727, 0.05354, 0.1062, 0.2112", \
        "0.00197, 0.004875, 0.009089, 0.01613, 0.02952, 0.05581, 0.1085, 0.2133", \
        "0.002252, 0.005958, 0.011, 0.01895, 0.03323, 0.06004, 0.1127, 0.2179", \
        "0.002358, 0.007173, 0.01348, 0.02288, 0.03891, 0.06749, 0.1211, 0.2264", \
        "0.002113, 0.008425, 0.0165, 0.02816, 0.04704, 0.07897, 0.1361, 0.2436", \
        "0.0008438, 0.009034, 0.01957, 0.03442, 0.05773, 0.09518, 0.159, 0.2734", \
        "0.002203, 0.008119, 0.02182, 0.04102, 0.07071, 0.1167, 0.1914, 0.3188" \
    );
}
Now in my machine-learning model I have as inputs index_1, index_2, a (5,5) matrix, and gate_name. I have to predict the whole 8*8 matrix (values) and which gate_name it belongs to. I know this is both a regression and a classification problem, but how do I go forward with it? Kindly share your instincts with me.

Get a multiclass confusion matrix sized to the number of class labels
I trained a Random Forest Classifier in sklearn to predict a multiclass classification problem. My dataset has four class labels, but my code creates a 2x2 confusion matrix:
y_predict = rf.predict(X_test)
conf_mat = sklearn.metrics.confusion_matrix(y_test, y_predict)
print(conf_mat)
Output:
[[0, 0]
 [394, 39]]
How can I get a 4x4 confusion matrix to analyze TP, TN, FP, and FN?
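For illustration, `confusion_matrix` accepts a `labels` parameter that fixes the matrix to the full label set, even when some classes never occur in the test split; the arrays below are hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predictions where only 2 of the 4 labels actually occur.
y_test    = np.array([2, 3, 3, 2, 3])
y_predict = np.array([3, 3, 2, 2, 3])

labels = [0, 1, 2, 3]  # full label set, even if some labels are absent here
conf_mat = confusion_matrix(y_test, y_predict, labels=labels)
print(conf_mat.shape)  # (4, 4)
```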

RandomForest regression p-value
Dear smart people of the internet,
I'm currently working on a data set (a regression problem) and comparing the explanatory power of OLS vs. random forest. Working with p-values for those regressions would be nice, due to their easy comparability for everyone not familiar with random forests. But as far as I know, there is no p-value in a random-forest regression that can be easily, or at least credibly, calculated.
I stumbled upon the package "rfUtilities", which provides, via its rf.significance function, a p-value for random-forest regressions. Could someone smarter and/or more versed than me please explain what this function does, and whether it makes any sense?
(Since this question is a general one, I did not provide any sample data, because all available data sets are affected.)
Thank you in advance!

How to select variables from this heatmap?
Here is my problem. In this heatmap I eliminated some variables.
This is after the elimination of some variables.
My question is: are there any correlated variables left in the second image? Is my process of eliminating variables right? Do I still need to eliminate variables from the second image?
Please help me out with this.

Custom scoring function RandomForestRegressor
Using RandomizedSearchCV, I managed to find a RandomForestRegressor with the best hyperparameters. But for this, I used a custom score function matching my specific needs.
Now, I don't know how to use best_estimator_ (a RandomForestRegressor returned by the search) with my custom scoring function.
Is there a way to pass a custom scoring function to a RandomForestRegressor?
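For illustration, two options: call the metric directly on the fitted regressor's predictions, or wrap it with `make_scorer` so it can be called as `scorer(estimator, X, y)` (and reused in further searches). The data and the negative-MAE metric below are hypothetical stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer

def custom_score(y_true, y_pred):
    # Hypothetical metric: negative mean absolute error.
    return -np.abs(y_true - y_pred).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)

# Stand-in for the best_estimator_ returned by the search.
best_estimator = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Option 1: apply the metric directly to the estimator's predictions.
direct = custom_score(y, best_estimator.predict(X))

# Option 2: wrap it as a scorer, callable as scorer(estimator, X, y).
scorer = make_scorer(custom_score)
wrapped = scorer(best_estimator, X, y)

print(direct, wrapped)
```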
Logistic regression error with 2 normally distributed classes
I generate two Gaussian Distributions:
f = 100
mean1 = [5, 3]
cov1 = [[5, 0], [0, 3]]
mean2 = [4, 3]
cov2 = [[3, 0], [0, 2]]
g1 = np.random.multivariate_normal(mean1, cov1, f)
g2 = np.random.multivariate_normal(mean2, cov2, f)
And I am trying to do logistic regression with g1 as class 0 and g2 as class 1:
ft = np.vstack((g1, g2))                   # data stacked on each other
cl = np.hstack((np.zeros(f), np.ones(f)))  # class values in an array
clc = np.reshape(cl, (2*f, 1))             # class values in a column matrix
w = np.zeros((2, 1))                       # weights matrix
for n in range(0, 5000):
    s = np.dot(ft, w)
    prediction = 1 / (1 + np.exp(-s))      # sigmoid function
    gr = np.dot(ft.T, clc - prediction)    # gradient of the loss function
    w += 0.01 * gr
print(w)
I evaluate my result using sklearn logistic regression:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(fit_intercept=False)
lr.fit(ft, cl)
print(lr.coef_)
And I get:

w = [[6.77812323]
     [2.91052504]]

lr.coef_ = [[1.22724506 1.10456893]]
Do you know why the weights do not match? Is there anything wrong with my math?
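One difference worth ruling out before suspecting the math: scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0), so its coefficients generally will not match a plain, unregularized gradient ascent. A sketch comparing the default against an effectively unregularized fit (the data generation mirrors the question; the fixed seed is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)  # assumed seed, for reproducibility
f = 100
g1 = np.random.multivariate_normal([5, 3], [[5, 0], [0, 3]], f)
g2 = np.random.multivariate_normal([4, 3], [[3, 0], [0, 2]], f)
ft = np.vstack((g1, g2))
cl = np.hstack((np.zeros(f), np.ones(f)))

# Default: L2-regularized with C=1.0.
lr_default = LogisticRegression(fit_intercept=False).fit(ft, cl)
# Very large C makes the penalty negligible, approximating no regularization.
lr_unreg = LogisticRegression(fit_intercept=False, C=1e10).fit(ft, cl)

print(lr_default.coef_)
print(lr_unreg.coef_)
```

If the unregularized fit lands close to the hand-rolled weights, the discrepancy was the penalty term, not the math.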

Selecting a specific "level" of my categorical variable when running the logistic regression
I would like to select one level of a categorical variable to include in my logistic regression, as it is the only one that has a significant p-value.
A) Is that possible?
B) If so, how do I code it in R so that I select only that level and not include the rest?
The feature has 3 levels, and I want to select only the 2nd.
Thanks

predict() won't give standard errors for mixed-effects logistic regression
I would like to calculate the predicted probability for a logistic regression. The dependent variable is a dummy variable. The predict() function is not allowing me to get the standard errors, which I need to calculate the confidence interval for the predicted value.

library(lme4)

# Fit the model
cham <- glmer(mentionedpresorpreslastname ~ poly(presvoteshare, 3) + preselyear +
                all_approvalS + samepartyapprovals + samepartyaspotus +
                membervoteshareS + republican + minorityparty +
                congservedattimeofpr + leadership + bachdegree +
                (1 | memberidnum),
              data = bothpres, family = binomial("logit"))

# Create new data for prediction
pp1 <- with(bothpres, data.frame(samepartyaspotus = 1,
                                 presvoteshare = mean(presvoteshare),
                                 preselyear = 0,
                                 all_approvalS = mean(all_approvalS),
                                 samepartyapprovals = mean(samepartyapprovals),
                                 membervoteshareS = mean(membervoteshareS),
                                 republican = 0,
                                 minorityparty = 0,
                                 congservedattimeofpr = mean(congservedattimeofpr),
                                 leadership = 0,
                                 bachdegree = mean(bachdegree)))

predictions1 <- as.data.frame(predict(cham, pp1, se.fit = "TRUE", type = "link",
                                      re.form = ~0))
predictions1 returns a single value and:

Warning message:
In predict.merMod(cham, pp1, se.fit = "TRUE", type = "link", re.form = ~0) :
  unused arguments ignored
Data Standardization vs Normalization vs Robust Scaler
I am working on data preprocessing and wanted to compare the benefits of Data Standardization vs Normalization vs Robust Scaler practically.
In theory, the guidelines are:
Advantages:
- Standardization: scales features such that the distribution is centered around 0, with a standard deviation of 1.
- Normalization: shrinks the range such that it is now between 0 and 1 (or -1 to 1 if there are negative values).
- Robust Scaler: similar to normalization, but it instead uses the interquartile range, so that it is robust to outliers.
Disadvantages:
- Standardization: not good if the data is not normally distributed (i.e., no Gaussian distribution).
- Normalization: gets influenced heavily by outliers (i.e., extreme values).
- Robust Scaler: doesn't take the median into account and only focuses on the parts where the bulk of the data is.
I created 20 random numerical inputs and tried the above-mentioned methods (the numbers in red represent the outliers). I noticed that Normalization indeed got affected negatively by the outliers, and the scale of the new values became tiny (all values almost identical to 6 digits after the decimal point, 0.000000x), even though there are noticeable differences between the original inputs!
My questions are:
- Am I right to say that Standardization also gets affected negatively by the extreme values? If not, why, according to the results provided?
- I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?
P.S. I am imagining a scenario where I want to prepare my dataset for a neural network, and I am concerned about vanishing gradients. Nevertheless, my questions are still general.
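The behavior described above can be checked numerically. A small sketch (the 19-point series plus a single extreme outlier is an assumption, not the 20 inputs from the question): MinMax scaling squeezes the bulk of the data toward 0 because the outlier defines the range, while RobustScaler, centered on the median and scaled by the IQR, keeps the bulk spread out.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# 19 "normal" points plus one extreme outlier.
x = np.append(np.linspace(1, 10, 19), 1000.0).reshape(-1, 1)

std    = StandardScaler().fit_transform(x)
minmax = MinMaxScaler().fit_transform(x)
robust = RobustScaler().fit_transform(x)

# Largest scaled value among the 19 non-outlier points under each method.
print(minmax[:19].max(), robust[:19].max())
```

Standardization is also pulled by the outlier (it shifts the mean and inflates the standard deviation), but less catastrophically than min-max, because the outlier affects the spread rather than defining the entire range.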

How to run multiple optimizers at the same time?
I'm trying to train an ensemble neural network with TensorFlow. For example, I have 5 classifiers, and each of them is a neural network. I get different samples to train them separately, so I can train them at the same time using a distributed method instead of training them one by one.
The problem is this: I deploy those 5 networks on 5 different devices (5 GPUs, for example). For each network, I create one AdamOptimizer (which means I have 5 AdamOptimizers in total).
When I'm training this ensemble, I first get samples for all networks (if one network needs 64, then I get 320 samples).
Method 1: I just run:
sess.run(tf.group(opt1, opt2, opt3, opt4, opt5))
in order to run those 5 optimizers at the same time.
Method 2: Another way is to run the ops in a for loop:
opt = [opt1, opt2, opt3, opt4, opt5]
for i in range(5):
    sess.run(opt[i])
However, I recorded the time cost of both ways and found they are almost the same.
So my question is: if I only use one session and run sess.run(tf.group(opt1, opt2, opt3, opt4, opt5)), will the 5 ops actually be executed one by one instead of at the same time?
And if I want to run them at the same time, what should I do?
By the way, if I just use one session, how fast could it get? Could the time cost of method 1 be 1/5 of that of method 2? If not, what should I do to achieve that? Should I use multiple sessions, multiple processes, or multiple threads?

When should I do feature scaling or normalisation in machine learning?
I have a training feature set consisting of 92 features. Of these, 91 features are boolean values of 1 or 0, but 1 feature is numerical and varies from 3 to 2000.
Will it be better if I do feature scaling on my 92nd feature?
If yes, what are the best possible ways to do it? I am using Python.
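If only that one feature needs scaling, scikit-learn's ColumnTransformer can target a single column and pass the boolean columns through untouched. A sketch on hypothetical random data shaped like the description (91 boolean columns plus one wide-ranging numeric column):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 91 boolean features plus one wide-ranging numeric feature in the last column.
X = np.hstack([rng.integers(0, 2, size=(100, 91)).astype(float),
               rng.uniform(3, 2000, size=(100, 1))])

# Standardize only column 91; leave the boolean columns as-is.
ct = ColumnTransformer(
    [("num", StandardScaler(), [91])],
    remainder="passthrough",
)
X_scaled = ct.fit_transform(X)

# Note: ColumnTransformer moves the transformed column to the front of the output.
print(X_scaled[:, 0].std())  # ~1 after standardization
```

Inside a Pipeline, this also guarantees the scaler is fit on training folds only, avoiding leakage during cross-validation.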