train/test split with repeated measures
I want to try a random forest on this data where y = happy after x = ate. Some of these people were lucky and got two free meals, while some only got one. Could I use rsample to make sure that the same id (in this case 5) does not appear in both the train and test split? If not, how should I do it?
library(tibble)
library(rsample)
set.seed(123)
dframe <- tibble(id = c(1, 1, 2, 2, 3, 4, 5, 5, 6, 7),
                 ate = sample(c("cookie", "slug"), size = 10, replace = TRUE),
                 happy = sample(c("yes", "no"), size = 10, replace = TRUE))
dframe_split <- initial_split(dframe, strata = "happy")
dframe_train <- training(dframe_split)
dframe_test <- testing(dframe_split)
Created on 2018-10-11 by the reprex package (v0.2.0).
1 answer

As of rsample 0.0.2, the only documented way of performing a split like this using this library seems to be the group_vfold_cv function. Example:

resamples <- group_vfold_cv(dframe, group = "id", v = 3)
lapply(resamples$splits, training)
lapply(resamples$splits, testing)
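If rsample's helpers don't fit, the same guarantee can also be enforced by hand in any language: sample ids, not rows, and then assign each row to the side its id landed on. A toy sketch in Python (the rows are made-up data mirroring the question's dframe):

```python
import random

# (id, ate, happy) rows, mirroring the question's tibble
rows = [(1, "cookie", "yes"), (1, "slug", "no"), (2, "cookie", "no"),
        (2, "slug", "yes"), (3, "cookie", "yes"), (4, "slug", "no"),
        (5, "cookie", "yes"), (5, "slug", "yes"), (6, "cookie", "no"),
        (7, "slug", "yes")]

random.seed(123)
ids = sorted({r[0] for r in rows})
random.shuffle(ids)                      # split at the id level, not the row level
n_test = max(1, len(ids) // 4)           # roughly a 75/25 split of ids
test_ids = set(ids[:n_test])

train = [r for r in rows if r[0] not in test_ids]
test = [r for r in rows if r[0] in test_ids]
```

Because membership is decided per id, both rows for id 5 necessarily end up on the same side of the split.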
See also questions close to this topic

Warnings when restoring graphical parameters
I am writing my first R package and currently working on a function to make a plot using some particular graphical parameters. I want the user defined graphical parameters to get restored after the plot is made but always get same warning messages:
opar <- par()
par(oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
par(opar)
Warning messages:
1: In par(opar) : graphical parameter "cin" cannot be set
2: In par(opar) : graphical parameter "cra" cannot be set
3: In par(opar) : graphical parameter "csi" cannot be set
4: In par(opar) : graphical parameter "cxy" cannot be set
5: In par(opar) : graphical parameter "din" cannot be set
6: In par(opar) : graphical parameter "page" cannot be set

Is there a better way of doing that? I know the suppressWarnings() function, but 1. I don't want the messages to be hidden, and 2. if the function is called two times, a different message appears:

> There were 12 warnings (use warnings() to see them)

Fail to authenticate BigQuery with R under the bigrquery package
I am trying to use set_service_token in the bigrquery package for non-interactive authentication. Here is my code:

library(bigrquery)
set_service_token("client_secret.json")
But it kept showing the error message below:
Error in read_input(file) :
  file must be connection, raw vector or file path

However, when I simply read the JSON file, it works:
lapply(fromJSON("client_secret.json"), names)
$`installed`
[1] "client_id" "project_id" "auth_uri" "token_uri" "auth_provider_x509_cert_url" "client_secret" "redirect_uris"

Can anyone help me with this? Thank you very much!

Plotting multiple graphs in one plot in R using a function
I am trying to plot several curves in one plot only. I have 4 different plots produced by a function. This is my code:

hazard.plot.w2p(beta = beta.spreda, eta = eta.spreda, time = exa1.dat$time, line.colour = "blue")
hazard.plot.w2p(beta = 1.076429, eta = 26.21113, time = exa1.dat$time, line.colour = "blue")
hazard.plot.w2p(beta = 5, eta = 32.97954, time = exa1.dat$time, line.colour = "blue")
hazard.plot.w2p(beta = 2, eta = 32.9795, time = exa1.dat$time, line.colour = "blue")
Here is the function I used to get the output:
hazard.plot.w2p <- function(beta, eta, time, line.colour, nincr = 500) {
  max.time <- max(time, na.rm = F)
  t <- seq(0, max.time, length.out = nincr)
  r <- numeric(length(t))
  for (i in 1:length(t)) {
    r[i] <- failure.rate.w2p(beta, eta, t[i])
  }
  plot(t, r, type = 'l', bty = 'l', col = line.colour, lwd = 2,
       main = "", xlab = "Time", ylab = "Failure rate",
       las = 1, adj = 0.5, cex.axis = 0.85, cex.lab = 1.2)
}
I want to plot all the 4 plots in one plot only.
Here is a sample data set:
fail  time
a     4.55
a     4.65
a     5.21
b     3.21
a     1.21
a     5.65
a     7.12

generating phone numbers using a specific set of rules in python
I want to write a function which generates all possible numbers from a standard phone keypad (figure 1), using the following set of rules:

- phone numbers begin with the digit 2
- phone numbers are 10 digits long
- successive digits in each phone number are chosen as a knight moves in chess

In chess, a knight (sometimes called a horse) moves two steps vertically and one step horizontally, OR two steps horizontally and one step vertically.

Only numerical digits can be used in phone numbers, i.e. the (#) and (*) keys are not allowed.

The function has to take the length of the phone number and the initial position as input, and output the number of unique phone numbers.
I am a newbie and am having difficulty building the logic. I tried the following, which is definitely not the right approach.
def genNumbers(len, initpos):
    numb = list('2xxxxxxxxx')
    # index 1
    numb[1] = 7 or 9
    if numb[1] == 7:
        numb[2] == 2 or 6
    elif numb[1] == 9:
        numb[2] == 2 or 4
    # index 2
    if numb[2] == 2:
        numb[3] == 7 or 9
    elif numb[2] == 4:
        numb[3] == 3 or 9
    elif numb[2] == 6:
        numb[3] == 1 or 7
    # index 3
    if numb[3] == 1:
        numb[4] == 6 or 8
    elif numb[3] == 3:
        numb[4] == 4 or 8
    elif numb[3] == 7:
        numb[4] == 2 or 6
    elif numb[3] == 9:
        numb[4] == 2 or 4
    # index 4
    if numb[4] == 8:
        numb[5] == 1 or 3
    elif numb[4] == 2:
        numb[5] == 7 or 9
    elif numb[4] == 4:
        numb[5] == 3 or 9
    elif numb[4] == 6:
        numb[5] == 1 or 7
    # index 5
    if numb[5] == 1:
        numb[6] == 6 or 8
    elif numb[5] == 3:
        numb[6] == 4 or 8
    elif numb[5] == 7:
        numb[6] == 2 or 6
    elif numb[5] == 9:
        numb[6] == 2 or 4
    # index 6
    if numb[6] == 2:
        numb[7] == 7 or 9
    elif numb[6] == 4:
        numb[7] == 3 or 9
    elif numb[6] == 6:
        numb[7] == 1 or 7
    elif numb[6] == 8:
        numb[7] == 1 or 3
    # index 7
    if numb[7] == 1:
        numb[8] == 6 or 8
    elif numb[7] == 3:
        numb[8] == 4 or 8
    elif numb[7] == 7:
        numb[8] == 2 or 6
    elif numb[7] == 9:
        numb[8] == 2 or 4
    # index 8
    if numb[8] == 6:
        numb[9] == 1 or 7
    elif numb[8] == 8:
        numb[9] == 1 or 3
    elif numb[8] == 4:
        numb[9] == 3 or 9
    elif numb[8] == 2:
        numb[9] == 7 or 9
    return numb
Any help would be highly appreciated!
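One way to count these without hard-coding every branch is dynamic programming over the knight-move adjacency of the keypad. A sketch (the adjacency table and function name are illustrative, not the asker's code):

```python
# Knight-move adjacency on a standard phone keypad; * and # are excluded,
# and no knight move reaches or leaves the 5 key.
MOVES = {
    0: [4, 6], 1: [6, 8], 2: [7, 9], 3: [4, 8], 4: [0, 3, 9],
    5: [],     6: [0, 1, 7], 7: [2, 6], 8: [1, 3], 9: [2, 4],
}

def count_numbers(length, start):
    """Count phone numbers of `length` digits beginning on `start`,
    where each successive digit is a knight move from the previous one."""
    # counts[d] = number of valid numbers of the current length starting at d
    counts = {d: 1 for d in MOVES}  # every single digit is a length-1 number
    for _ in range(length - 1):
        counts = {d: sum(counts[m] for m in MOVES[d]) for d in MOVES}
    return counts[start]
```

For the question's rules, count_numbers(10, 2) then gives the number of 10-digit numbers starting with the digit 2.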

How to Generate a Random Covariance Matrix from the Wishart Distribution
I need to generate an n x n, positive-definite covariance matrix for a project. Drawing from the Wishart distribution was recommended. How do I generate a random covariance matrix in R, ideally using the Wishart distribution? I've tried rwishart() to get values, but need more help. Thanks
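For what it's worth, base R ships stats::rWishart(n, df, Sigma) for exactly this. The construction itself is short enough to sketch in NumPy (a sketch with an identity scale matrix; the function name is mine): if A is an n x dof matrix of iid standard normals, then A @ A.T is a Wishart draw with dof degrees of freedom, and for dof >= n it is positive definite with probability 1.

```python
import numpy as np

def random_covariance(n, dof, seed=None):
    """Draw an n x n matrix from a Wishart distribution with identity scale.

    With dof >= n the result is symmetric positive definite, so it can
    serve as a random covariance matrix.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, dof))   # n x dof matrix of iid N(0, 1) entries
    return A @ A.T
```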

how to generate a random number and pass it as a variable to a Django template?
I want to create a view that generates a random number between 1 and the number of objects in the model. I then want to pass it as context to the template. However, I keep getting the following error:
Reverse for 'random_obj' with keyword arguments '{'ran': ''}' not found. 1 pattern(s) tried: ['detail\/(?P<pk>[0-9]+)\/$']
def random_page(request):
    object_count = MODEL.objects.count()
    ran = random.randint(1, object_count)
    return render(request, 'app/detail.html', {'ran': ran})

urlpatterns = [
    path('', views.home, name='home_page'),
    path('detail/<int:pk>/', views.detail_page, name='detail_page'),
    path('detail/<int:ran>/', views.random_page, name='random_page'),
]
This results in above error:
<a class="anchor_boot" href="{% url 'app:random_page' ran=ran %}">random obj</a> </div>
This works:
<a class="anchor_boot" href="{% url 'app:random_page' 8 %}">random obj</a> </div>
What am I doing wrong?
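One common pattern for this (a sketch against a hypothetical app, not the asker's exact project; MODEL, views, and the URL names are taken from the question): let the random view capture no URL argument, pick the pk server-side, and redirect to the detail view, so no template ever needs a ran value it does not have.

```python
# views.py (sketch) -- pick the pk server-side and redirect
import random
from django.shortcuts import redirect

def random_page(request):
    object_count = MODEL.objects.count()      # MODEL as in the question
    ran = random.randint(1, object_count)
    return redirect('app:detail_page', pk=ran)

# urls.py (sketch) -- random_page no longer captures an int from the URL
urlpatterns = [
    path('', views.home, name='home_page'),
    path('detail/<int:pk>/', views.detail_page, name='detail_page'),
    path('random/', views.random_page, name='random_page'),
]
```

The template link then becomes {% url 'app:random_page' %} with no argument. (This assumes pks are contiguous from 1; with gaps, MODEL.objects.order_by('?').first() is a common alternative.)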

Making Random Forest outputs like Logistic Regression
I am asking dimension-wise, etc. I am trying to implement this amazing work with a random forest: https://www.kaggle.com/allunia/how-to-attack-a-machine-learning-model/notebook
Both logistic regression and random forest are from sklearn, but when I get the weights from the random forest model they have shape (784,), while logistic regression returns (10, 784).

My problems are mainly the dimension mismatches and "NaN, infinity or a value too large for dtype" errors in the attack methods. The weights from logistic regression are (10, 784) but with Random Forest they are (784,); maybe this caused the problem? Or can you suggest some modifications to the attack methods? I tried Imputer for the NaN-value errors, but it wanted me to reshape, so I got this. I tried applying np.mat for the dimension errors I'm getting, but that didn't work.
def non_targeted_gradient(target, output, w):
    target = target.reshape(1, 1)
    output = output.reshape(1, 1)
    w = w.reshape(1, 1)
    target = imp.fit_transform(target)
    output = imp.fit_transform(output)
    w = imp.fit_transform(w)
    ww = calc_output_weighted_weights(output, w)
    for k in range(len(target)):
        if k == 0:
            gradient = np.mat((1 - target[k])) * np.mat((w[k] - ww))
        else:
            gradient += np.mat((1 - target[k])) * np.mat((w[k] - ww))
    return gradient
I'm probably doing lots of things wrong, but the TL;DR is that I'm trying to apply Random Forest instead of logistic regression in the notebook linked above.

how to map each leaf's samples in each tree of a random forest classifier to its X and y after fit?
I am trying to understand how to map leaves to their original X and y. I tried to use "Print the decision path of a specific sample in a random forest classifier", but I can't understand how to map

children_left_ = [t.tree_.children_left for t in estimator.estimators_]
children_right_ = [t.tree_.children_right for t in estimator.estimators_]

to the original X and y.

Predictive models to predict sales with r
I would like to find a good model to predict which clients will buy my product in 2018. I would like opinions on which method fits my data for predicting which client will buy product A in 2018.
I have the following data:
client  buyproductin17  buyproductin16  buyproductin15  all_day  day_cnt  product
22      1               0               1               34       5        2
23      0               1               1               56       11       2
24      1               1               1               122      45       3
client = client ID
buyproductin17 = whether the client bought product A in 2017
buyproductin16 = whether the client bought product A in 2016
buyproductin15 = whether the client bought product A in 2015
all_day = total number of days the client spent with all my products
day_cnt = total number of days the client spent with product A
product = total number of product A the client has
My first thoughts are a logistic regression model or a random forest. But which dependent variable should I use? buyproductin17?
Thanks a lot
Dany

how to use StratifiedKFold?
I have a problem using StratifiedKFold. I want to do cross-validation. X and Y are numpy.ndarray. When I run the code below, I get the following error. I know that train_index and test_index are the indices of the training and testing splits, but how can I extract, for instance, the data with index 0 in X, in order to build the training and testing sets out of the indices skf.split reveals?
skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y):
    print("%s %s" % (train_index, test_index))
    n += 1
    print(n, "n")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, y_train, X_test, y_test, "X_train,y_train,X_test,y_test")
error:
TypeError: only integer scalar arrays can be converted to a scalar index
The printed details of X are shown below:
print(X, "X is")
print(type(X), "tyep X")                      # <class 'numpy.ndarray'> tyep X
print(type(x), "type111")                     # <class 'numpy.ndarray'> type111
print(type(y), "type122")                     # <class 'list'> type122
print(np.prod(X.shape), "array dimensions")   # 24092640 array dimensions
print('Saved dataset to dataset.npz.')
print('X_shape:{}\nY_shape:{}'.format(X.shape, Y.shape))  # X_shape:(30, 156, 156, 11, 3) Y_shape:(30, 3)
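Given that type(y) prints as <class 'list'> above, one likely cause of "only integer scalar arrays can be converted to a scalar index" is indexing a plain Python list with an index array: that works for an ndarray but not for a list. A minimal sketch with made-up data (not the question's actual X and y):

```python
import numpy as np

X = np.arange(12).reshape(6, 2)   # ndarray: fancy indexing works
y = [0, 0, 1, 1, 0, 1]            # plain list: fancy indexing raises TypeError

train_index = np.array([0, 1, 2, 3])

X_train = X[train_index]          # fine, X is an ndarray
y = np.asarray(y)                 # convert the list first...
y_train = y[train_index]          # ...then index arrays work here too
```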

Overfitting due to preprocessing data
I need help: I think my Keras model is overfitting.

I noticed that when my model runs, the loss value decreases while val_loss gets higher.

Maybe I made a mistake in the preprocessing part of the code; could you take a look, please? I would really appreciate it!
def split_into_chunks(data, train, predict, step, scale=True):
    X, Y = [], []
    for i in range(0, len(data), step):
        try:
            x_i = data[i:i+train]
            y_i = data[i+train+predict]
            timeseries = np.array(data[i:i+train])
            mean = np.mean(timeseries)
            std = np.std(timeseries)
            if scale:
                timeseries = preprocessing.scale(timeseries)
                y_i = (y_i - np.mean(x_i)) / np.std(x_i)
            x_i2 = timeseries
            y_i = np.array(y_i)
        except:
            break
        X.append(x_i2)
        Y.append(y_i)
    return X, Y
This is the code for preprocessing the data I'm going to use to train the NN. I'm a little afraid that I'm normalizing the training and test data in different ways. What do you think? Thank you in advance!
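As a general pattern, leakage-free scaling computes the statistics on the training portion only and reuses them for validation and test, so all splits are on the same scale. A generic sketch (made-up data and function names, not the notebook's exact pipeline):

```python
import numpy as np

def fit_scaler(train):
    """Compute normalisation statistics from the training data only."""
    return train.mean(), train.std()

def apply_scaler(x, mean, std):
    """Apply the *training* statistics to any split (train, val, or test)."""
    return (x - mean) / std

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
train, test = data[:4], data[4:]

mean, std = fit_scaler(train)             # stats come from train only
train_n = apply_scaler(train, mean, std)
test_n = apply_scaler(test, mean, std)    # same stats reused for test
```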

The performance of a Random Forest model differs between the same positive set and different negative sets?

I am using a RandomForest model to predict a specific segment in the genome. I have my positive training set from experimental data and select the negative dataset randomly from the whole genome. Imagine my main training set contains 50 positive and 50 negative examples. I trained my RF model on this training set. Then I kept my positive training set and selected a negative dataset randomly from the whole genome 10 times, so I have 10 different datasets with the same positive examples as the original model but different negative examples.

The performance of the main model on the testing set (which is 25% of the original dataset and not seen by the model during training) is ~90%. However, when I applied the model to the 10 other datasets (with different negative examples and the same positive set), the performance gets higher, to ~98%. I am wondering why applying the model to the new datasets gives higher performance?
Thanks M