Classification algorithm to predict next purchase?
The problem I am trying to solve has training and test data similar to the sample data frame below. Assume the training data has 10,000 rows and the test data has 2,000 rows. Every user who comes to the website performs one of three activities on a particular day:
UserID Date Activity
123 20130711 Website_Login
123 20130711 Form_Submit
123 20130715 Website_Login
124 20130717 Website_Login
125 20130718 Purchase
126 20130725 Website_Login
126 20130726 Form_Submit
126 20130801 Website_Login
126 20130805 Purchase
126 20130812 Website_Login
Goal: Which activities are most useful in predicting a future user purchase? Predict the next 25 UserIDs who are likely to purchase.
My Understanding 1: To my knowledge this is a classification problem, but one that cannot be solved well with algorithms like Random Forests or Logistic Regression, because the activity types carry no additional attributes of their own. It seems to me the solution could instead come from a more standard technique such as Naive Bayes or a Nearest Neighbor algorithm.
My Understanding 2: Alternatively, is this a type of recommendation problem? If so, I do not think content-based filtering would work, as there are very few features; collaborative filtering might be better, since I could make use of the Date field.
My Understanding 3: Would one-hot encoding the Activity field (making 3 separate features, one per activity) and splitting the Date field into (Day, Month, Year) help?
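For what it's worth, a minimal sketch of that encoding in pandas (column names follow the sample frame above; the data itself is just the first few sample rows):

```python
import pandas as pd

# Sample data in the shape of the question's frame
df = pd.DataFrame({
    "UserID": [123, 123, 123, 124, 125],
    "Date": [20130711, 20130711, 20130715, 20130717, 20130718],
    "Activity": ["Website_Login", "Form_Submit", "Website_Login",
                 "Website_Login", "Purchase"],
})

# One-hot encode Activity: one 0/1 column per activity type
df = pd.concat([df, pd.get_dummies(df["Activity"])], axis=1)

# Split the yyyymmdd integer Date into Year/Month/Day features
dates = pd.to_datetime(df["Date"], format="%Y%m%d")
df["Year"], df["Month"], df["Day"] = dates.dt.year, dates.dt.month, dates.dt.day

print(df[["UserID", "Website_Login", "Form_Submit", "Purchase",
          "Year", "Month", "Day"]])
```

From here the one-hot columns can be aggregated per user (e.g. counts of each activity) to build a per-user feature row for any classifier.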
I would like to know your thoughts on what kind of classification algorithm might best predict which user is likely to purchase next.
Sub-thought: it feels like there are too few features, and that is confusing me.
See also questions close to this topic

How can I compute the tensor in Pytorch efficiently?
I have a tensor x with x.shape = (batch_size, 10), and I want to compute

x[i][0] = x[i][0] * x[i][1] * ... * x[i][9] for i in range(batch_size)
Here is my code:
for i in range(batch_size):
    for k in range(1, 10):
        x[i][0] = x[i][0] * x[i][k]
But when I implement this in forward() and call loss.backward(), backpropagation is very slow. Why is it slow, and is there any way to implement it efficiently?
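A likely remedy, as a sketch: the double Python loop creates one autograd operation per scalar multiply (batch_size * 9 graph nodes), while a single vectorized torch.prod call does the same reduction in one op.

```python
import torch

batch_size = 4
x = torch.randn(batch_size, 10, requires_grad=True)

# Product over the feature dimension in one vectorized op:
# row_prod[i] == x[i][0] * x[i][1] * ... * x[i][9]
row_prod = torch.prod(x, dim=1)

# Loop-equivalent result, computed out-of-place for comparison
loop_prod = torch.stack([x[i].prod() for i in range(batch_size)])
print(torch.allclose(row_prod, loop_prod))
```

Note the sketch also avoids the in-place write into x[i][0], which autograd handles poorly.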
How do I get all Gini indices in my decision tree?
I have made a decision tree using sklearn (scikit-learn), viz. sklearn.tree.DecisionTreeClassifier().fit(x, y). How do I get the Gini indices for all possible nodes at each step? graphviz only gives me the Gini index of the node with the lowest Gini index, i.e. the node used for the split. For example, the image below (from graphviz) tells me the Gini score of the Pclass_lowVMid right index, which is 0.408, but not the Gini index of Pclass_lower or Sex_male at that step. I just know the Gini index of Pclass_lower and Sex_male must be greater than (0.408*0.7 + 0), but that's it.
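One partial answer worth knowing: the impurity of every node that made it into the fitted tree is stored on clf.tree_.impurity, indexed by node id. The Gini values of candidate splits that were rejected are not retained by sklearn, so those would have to be recomputed by hand. A sketch on a toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Gini impurity of every node actually present in the fitted tree
for node_id, gini in enumerate(clf.tree_.impurity):
    print(f"node {node_id}: gini = {gini:.3f}")
```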
Fighting against overfitting in an RNN model
We are currently trying to use an RNN model to build a classifier from text features. Our final accuracy on the training data is 87%, but our accuracy on validation data flattens out at 57%, which is clearly overfitting. We think the reason for the overfitting is the small data size, since we only have about 4000 entries. What can we do to fix the problem? We have also thought about doing data augmentation, but all we can find is replacing words with synonyms, which wouldn't work in our case. Here's our code for the model, and thank you in advance.
model = Sequential()
model.add(Embedding(num_vocab+1, 32))
model.add(SimpleRNN(64))
model.add(Dense(num_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
history = model.fit(f_train, cause_train, epochs=10, batch_size=50, validation_split=0.2)
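Not a guaranteed cure, but a common first step for a ~4000-sample set is regularization plus a smaller model. A sketch of the same architecture with dropout added (the tensorflow.keras import path and the placeholder sizes are assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout

num_vocab, num_classes = 5000, 10  # placeholder sizes; use your own

model = Sequential([
    Embedding(num_vocab + 1, 16),                       # smaller embedding
    SimpleRNN(32, dropout=0.3, recurrent_dropout=0.3),  # dropout inside the RNN
    Dropout(0.5),                                       # dropout before the classifier
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()
```

Early stopping on the validation loss (keras.callbacks.EarlyStopping) is the other cheap lever before reaching for augmentation.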

how to use StratifiedKFold?
I have a problem using StratifiedKFold. I want to do cross-validation. X and Y are numpy.ndarray; when I run the code below I get the following error. I know that train_index and test_index are the indexes of the training and testing splits, but how can I extract, for instance, the data with index 0 in X in order to build the training and testing sets from the indexes that skf.split yields?
skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y):
    print("%s %s" % (train_index, test_index))
    n += 1
    print(n, "n")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, y_train, X_test, y_test, "X_train,y_train,X_test,y_test")
error:
TypeError: only integer scalar arrays can be converted to a scalar index
The printed details of X are shown below:
print(X, "X is")
print(type(X), "tyep X")    # <class 'numpy.ndarray'> tyep X
print(type(x), "type111")   # <class 'numpy.ndarray'> type111
print(type(y), "type122")   # <class 'list'> type122
print(np.prod(X.shape), "array dimensions")  # 24092640 array dimensions
print('Saved dataset to dataset.npz.')
print('X_shape:{}\nY_shape:{}'.format(X.shape, Y.shape))  # X_shape:(30, 156, 156, 11, 3) Y_shape:(30, 3)
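A plausible cause, judging from the type printout: y is a plain Python list, and NumPy-style fancy indexing such as y[train_index] raises exactly this TypeError on a list. Converting y to an array first fixes it; a sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for the question's X and y
X = np.arange(12).reshape(6, 2)
y = [0, 0, 0, 1, 1, 1]   # a plain list, as the print() above suggests

y = np.asarray(y)        # fancy indexing like y[train_index] needs an ndarray

skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(train_index, test_index)
```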

Making Random Forest outputs like Logistic Regression
I am asking dimension-wise, etc. I am trying to implement this amazing work with a random forest: https://www.kaggle.com/allunia/howtoattackamachinelearningmodel/notebook
Both logistic regression and random forest are from sklearn, but when I get weights from the random forest model they are (784,), while logistic regression returns (10, 784).
My problems are mainly dimension errors and "NaN, infinity or a value too large for dtype" errors with the attack methods. The weights using logistic regression are (10, 784), but with Random Forest they are (784,); maybe this caused the problem? Or can you suggest some modifications to the attack methods? I tried Imputer for the NaN-values error, but it wanted me to reshape, so I've got this. I tried applying np.mat for the dimension errors I'm getting, but that didn't work.
def non_targeted_gradient(target, output, w):
    target = target.reshape(1, 1)
    output = output.reshape(1, 1)
    w = w.reshape(1, 1)
    target = imp.fit_transform(target)
    output = imp.fit_transform(output)
    w = imp.fit_transform(w)
    ww = calc_output_weighted_weights(output, w)
    for k in range(len(target)):
        if k == 0:
            gradient = np.mat((1 - target[k])) * np.mat((w[k] - ww))
        else:
            gradient += np.mat((1 - target[k])) * np.mat((w[k] - ww))
    return gradient
I'm probably doing lots of things wrong but the TL;DR is I'm trying to apply Random Forest instead of Logistic regression at the link above.
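On the shape question specifically, the mismatch is expected sklearn behaviour rather than a bug: LogisticRegression exposes one weight vector per class via coef_, while a random forest has no per-class weights at all, only a single feature_importances_ vector, so weight-based attack code cannot be transplanted directly. A toy sketch (data is random, purely to show the shapes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy data shaped like flattened 28x28 digits with 10 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))
y = rng.integers(0, 10, size=200)

lr = LogisticRegression(max_iter=200).fit(X, y)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

print(lr.coef_.shape)                 # one weight vector per class
print(rf.feature_importances_.shape)  # one importance per feature, no per-class sign
```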

The well known turbofan engine degradation dataset from NASA (computing time to failure)
Links to this data set:
https://c3.nasa.gov/dashlink/resources/139/ https://data.nasa.gov/widgets/vrksgjie
I'm trying to figure out whether I need to compute the time to failure for each row in this data set, since I don't see it provided. Is that the case? If it does need to be computed for the training data set, is there a simple way to go about it?
Details: I assume, from what I read, that each row represents an engine at a certain time in the time series, with some sensor readings. It looks like the testing data has a separate file with this number computed, but am I supposed to compute it for the training set too?
I ultimately want to do regression analysis on the time to failure, so I need this for each of the training sets.
Any interpretation of what the data means, and intuition, would be helpful as well, but I'm primarily wondering how, or whether, I need to compute this time to failure for the training data.
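Yes, for the training files the target is conventionally derived from the data itself: each training engine runs to failure, so its last recorded cycle is the failure point. A pandas sketch, assuming columns named unit and cycle (the raw files are unlabelled, so these names are an assumption):

```python
import pandas as pd

# Toy stand-in for the training data: two engines run to failure
df = pd.DataFrame({
    "unit":  [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})

# Time to failure = this engine's last cycle minus the current cycle
df["ttf"] = df.groupby("unit")["cycle"].transform("max") - df["cycle"]
print(df)
```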

extract data from multiple urls stored in a column of dataframe
I want to extract data from multiple URLs, but the URLs are in a column of a data frame.
I tried data extraction with the code below but no luck.
from urllib.request import urlopen, Request

link = data.column1
f = urlopen(link)
myfile = f.read()
print(myfile)
It shows:
AttributeError: 'Series' object has no attribute 'type'.
Please help with the code. Thank you
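The likely issue: urlopen expects a single URL string, but data.column1 is a whole pandas Series, so it must be iterated row by row. A sketch (the column name is taken from the question; the stub opener is an assumption so the example runs without network access):

```python
import io
import pandas as pd
from urllib.request import urlopen

def fetch_all(urls, opener=urlopen):
    """Fetch each URL in an iterable, one at a time."""
    pages = []
    for link in urls:  # urlopen takes one URL string, not a whole Series
        with opener(link) as f:
            pages.append(f.read())
    return pages

# Real usage would be fetch_all(data["column1"]); here a stub opener
# stands in for urlopen so the sketch runs offline.
data = pd.DataFrame({"column1": ["http://example.com/a", "http://example.com/b"]})
pages = fetch_all(data["column1"], opener=lambda u: io.BytesIO(u.encode()))
print(pages)
```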

python detect multiclass imbalanced or balanced dataset
By counting values for each class via pandas, I can see the distribution and the count for each class. However, since I want to write a research paper, I want to be precise in detecting whether a given data set is balanced or imbalanced.
How can I achieve this in Python? Is there a specific formula at all? Or can we tell just by counting (the way I do now)?
P.S.: I know that I can take data sets from papers in this field; however, I have found data sets on Kaggle and UCI that are not so popular, and I don't want to just let them go.
Thanks
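There is no single standard threshold, but two common summary numbers are the imbalance ratio (majority count / minority count) and the normalized Shannon entropy of the class distribution, which is 1.0 for a perfectly balanced set. A sketch on toy labels:

```python
import numpy as np
import pandas as pd

labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])  # toy multiclass labels

counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()   # 1.0 means perfectly balanced

p = counts / counts.sum()
# Normalized entropy of the class distribution, in (0, 1]; 1 = uniform
balance = -(p * np.log(p)).sum() / np.log(len(counts))

print(imbalance_ratio, round(balance, 3))
```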

How to approach Customer Store Recommendation Problem
3 files: user profiles, store profiles, transaction history.
I have user profiles for 100k customers (age, gender, location, salary, loyalty points, etc.) and store profiles for 35 shops (location, revenue, sales per day, etc.). Also, a transaction set (cust_id, store_id, location, revenue, item_count, etc.) that has all the purchases made by customers at any store (550k transactions).
Some new stores have opened; their store profiles are given (in the same form as the other 35 shops), but there are no transactions yet. The objective is to determine whether an existing customer will shop at each of the new stores.
I am trying to do this with a recommendation system with item cold start. I am also looking into implicit-feedback recommendation systems; however, I'm very new to this and cannot figure out how to use all these features and data together.
I need suggestions on how to approach this problem, or where to get started.
p.s. Excuse my naivety, I am also new to StackOverflow.

How neural networks are used in collaborative filtering
I am just a beginner with neural networks. Can someone suggest how neural networks are used in collaborative filtering? I mean, using a userid and an itemid, how can a neural network put weights on the id inputs?
Let's say there are (userid, itemid) usage pairs:
1 12, 1 13, 1 17, 2 12, 1 44, 3 4, 21 32, 1 16
How can a neural network be used for collaborative filtering in this case?
How can you autoencode an itemid/userid?
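The usual trick is embeddings: each id indexes a learned vector, so the network never multiplies the raw integer by anything. A minimal PyTorch sketch of matrix-factorization-style collaborative filtering (sizes are placeholders):

```python
import torch
import torch.nn as nn

n_users, n_items, dim = 30, 50, 8   # placeholder sizes

class MFModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each id selects a trainable row vector: this is how
        # "weights" get attached to bare userid/itemid inputs.
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_ids, item_ids):
        # Predicted affinity = dot product of the two embeddings
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=1)

model = MFModel()
users = torch.tensor([1, 1, 2, 21])
items = torch.tensor([12, 13, 12, 32])
scores = model(users, items)        # one score per (userid, itemid) pair
print(scores.shape)
```

Training would minimize a loss between these scores and observed interactions; the embeddings are what an autoencoder-style approach would also learn.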

Tastebreakers playlist generation
Does anyone have any clue what kind of model or algorithm was used to create the Tastebreakers playlist? Was it the layer6 Spotify RecSys challenge submission?

Does any H2O algorithm support multilabel classification?
Does the deep learning model in H2O, or any other H2O algorithm, support multilabel classification problems?
Original response variable tags: apps, email, mail finance,freelancers,contractors,zen99 genomes gogovan brazil,china,cloudflare hauling,service,moving ferguson,crowdfunding,beacon cms,naytev y,combinator in,store, conversion,logic,ad,attribution
After mapping them onto the keys of the dictionary, the response variable looks like this:
[74] [156, 89] [153, 13, 133, 40] [150] [474, 277, 113] [181, 117] [15, 87, 8, 11]
Thanks

How many classes does the h2o deep learning algorithm accept?
I want to predict the response variable, and it has 700 classes.
Deep learning model parameters:
from h2o.estimators import deeplearning

dl_model = deeplearning.H2ODeepLearningEstimator(
    hidden=[200, 200],
    epochs=10,
    missing_values_handling='MeanImputation',
    max_categorical_features=4,
    distribution='multinomial'
)

# Train the model
dl_model.train(x=Content_vecs.names,
               y='tags',
               training_frame=data_split[0],
               validation_frame=data_split[1])

Response variable tags:
[74] [156, 89] [153, 13, 133, 40] [150] [474, 277, 113] [181, 117] [15, 87, 8, 11]
Error:
OSError: Job with key $03017f00000132d4ffffffff$_8355bcac0e9e98a86257f45c180e4898 failed with an exception: java.lang.UnsupportedOperationException: error cannot be computed: too many classes
stacktrace: java.lang.UnsupportedOperationException: error cannot be computed: too many classes at hex.ConfusionMatrix.err(ConfusionMatrix.java:92)
But in h2o-core/src/main/java/hex/ConfusionMatrix.java it is written that it can compute 1000 classes.

accuracy for multi label text classification
How do I find the accuracy, F1 score, precision and recall for this program? I want to calculate the confusion matrix for this program, and I'm also having trouble finding these using these functions:
metrics.accuracy_score(y_test, predicted)
print(classification_report(y_test, predicted))
Your help will be highly appreciated, because I'm new to text classification and I couldn't find any of these for multi-label classification.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"], ["new york"], ["new york"], ["new york"], ["new york"],
                ["new york"], ["london"], ["berlin"], ["london"], ["london"],
                ["london"], ["london"], ["new york", "london", "berlin"], ["new york", "london"]]
print(y_train_text[13:])

X_test = np.array(['it is raining in britian and nyc'])
target_names = ['New York', 'London']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))
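For multi-label output, the metric functions need the true labels in the same binarized (n_samples, n_labels) form as the predictions, and a per-label averaging mode must be chosen explicitly. A sketch on toy labels (the y_test values here are illustrative, since the question has no held-out labels):

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, multilabel_confusion_matrix)

mlb = MultiLabelBinarizer()
mlb.fit([["berlin", "london", "new york"]])

# Toy true labels and predictions, binarized to (n_samples, n_labels)
Y_true = mlb.transform([["london"], ["new york", "london"], ["berlin"]])
Y_pred = mlb.transform([["london"], ["new york"], ["berlin"]])

print(accuracy_score(Y_true, Y_pred))             # exact-match ratio
print(f1_score(Y_true, Y_pred, average="micro"))  # global F1 over all labels
print(precision_score(Y_true, Y_pred, average="micro"))
print(recall_score(Y_true, Y_pred, average="micro"))
print(multilabel_confusion_matrix(Y_true, Y_pred))  # one 2x2 matrix per label
```

classification_report(Y_true, Y_pred, target_names=mlb.classes_) also works directly on the binarized arrays.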