train/test split with repeated measures
I want to try a random forest on this data where y = happy after x = ate. Some of these people were lucky and got two free meals, while some only got one. Could I use rsample to make sure that the same id (in this case 5) does not appear in both the train and test split? If not, how should I do it?
library(tibble)
library(rsample)
set.seed(123)
dframe <- tibble(id = c(1,1,2,2,3,4,5,5,6,7),
                 ate = sample(c("cookie", "slug"), size = 10, replace = TRUE),
                 happy = sample(c("yes", "no"), size = 10, replace = TRUE))
dframe_split <- initial_split(dframe, strata = "happy")
dframe_train <- training(dframe_split)
dframe_test <- testing(dframe_split)
Created on 2018-10-11 by the reprex package (v0.2.0).
1 answer

As of rsample 0.0.2, the only documented way of performing a split like this using this library seems to be the group_vfold_cv function, for example:
resamples <- group_vfold_cv(dframe, group = 'id', v = 3)
lapply(resamples$splits, training)
lapply(resamples$splits, testing)
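The idea behind a grouped split does not depend on the library: the ids, not the rows, are shuffled and assigned to one side. A minimal sketch in Python using only the standard library is shown here for illustration. It is not the rsample API; the function name `group_train_test_split` and its parameters are assumptions made for this example, and it does not stratify on happy.

```python
import random


def group_train_test_split(rows, group_key, test_fraction=0.25, seed=123):
    """Split rows so that all rows sharing a group value land on the same side."""
    # Work on the distinct group values, not on the individual rows.
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)

    # test_fraction applies to the number of groups, not the number of rows.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])

    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


# Same id layout as the question: ids 1, 2 and 5 appear twice.
ids = [1, 1, 2, 2, 3, 4, 5, 5, 6, 7]
rows = [{"id": i, "obs": n} for n, i in enumerate(ids)]

train, test = group_train_test_split(rows, "id")
train_ids = {row["id"] for row in train}
test_ids = {row["id"] for row in test}
print(sorted(train_ids), sorted(test_ids))
```

Because whole ids are assigned to the test side, the share of rows in the test set can differ from test_fraction when some ids have more rows than others.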
See also questions close to this topic

How to make a function that loops over two lists
I have an event A that is triggered when the majority of coin tosses in a series of tosses come up heads. I have an unfair coin and I'd like to see how the likelihood of A changes as the number of tosses changes and as the probability of heads in each toss changes.
This is my function assuming 3 tosses
n <- 3
# victory requires majority of tosses heads
# tosses only occur in odd intervals
k <- seq(n/2 + .5, n)
victory <- function(n, k, p) {
  for (i in p) {
    x <- 0
    for (i in k) {
      x <- x + choose(n, k) * p^k * (1 - p)^(n - k)
    }
    z <- x
  }
  return(z)
}
p <- seq(0, 1, .1)
victory(n, k, p)
My hope is that the victory() function would:
1 - find the probability of each of the outcomes where the majority of tosses are heads, given a particular value p
2 - sum up those probabilities and add them to a vector z
3 - go back and do the same thing given another probability p
I tested this with n <- 3, k <- c(2,3) and p <- c(.5,.75) and the output was 0.75000, 0.84375. I know that the output should've been 0.625, 0.0984375.

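The three steps described above can be sketched as a sum of binomial terms over every head count that forms a majority, repeated once per probability. The sketch below is in Python rather than R, and the helper name `majority_heads_probability` is an assumption made for this example. Under this formula, n = 3 gives 0.5 at p = 0.5 and 0.84375 at p = 0.75.

```python
from math import comb


def majority_heads_probability(n, p):
    """P(more than half of n tosses are heads) when each toss is heads with probability p."""
    k_min = n // 2 + 1  # smallest head count that is a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def victory(n, probabilities):
    """Return one majority probability per value of p, in the same order."""
    return [majority_heads_probability(n, p) for p in probabilities]


print(victory(3, [0.5, 0.75]))
```

Keeping the sum over k inside one function and the loop over p in another avoids reusing the same loop variable for both, which is easy to do by accident in a nested loop.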
Exponentiation of Log Transformed Values in Mixed Effects Model
I have run a linear mixed-effects model in R using the nlme package in which my response variable (Proximal_Lead_Bowing) was transformed to the log10 scale (Log_Bowing) due to a non-normal distribution of values. The differences in Log_Bowing between different Deep Brain Stimulation electrodes (DBS_Electrode), as estimated by the model using the "glht" function for multiple comparisons of means (Tukey contrasts), are as follows: (View screenshot for full glht() output: https://imgur.com/WVJ9KM6)
Linear Hypothesis:
Medtronic 3389 - Boston Scientific Versice == 0    Estimate: 0.5766*
St. Jude Medical Infinity - Boston Scientific Versice == 0    Estimate: 0.2208
St. Jude Medical Infinity - Medtronic 3389 == 0    Estimate: 0.3558*
*Denotes significance
Exponentiating these values (10^Abs(Estimate)) provides me with the following estimates for true differences in Proximal_Lead_Bowing as estimated by our mixed-effects model:
Linear Hypothesis:
Medtronic 3389 - Boston Scientific Versice == 0    3.77 (in millimeters)
St. Jude Medical Infinity - Boston Scientific Versice == 0    1.66
St. Jude Medical Infinity - Medtronic 3389 == 0    2.27
These values do not make sense considering that the average Proximal_Lead_Bowing ± 95% CI for each DBS_Electrode in the sample is as follows:
Boston Scientific Versice: 2.10 ± 0.67 (in millimeters)
Medtronic 3389: 2.95 ± 0.58
St. Jude Medical Infinity: 2.00 ± 0.35
Thus I would expect true differences in Proximal_Lead_Bowing as estimated by our linear mixed model to be estimated as approximately 1.0 mm between Medtronic 3389 and the other DBS_Electrode models but instead the exponentiated values I have calculated don't seem to make sense. Am I missing something in the process of exponentiation of log10 values and/or use of the "glht" function for multiple comparisons of means? Any feedback would be appreciated.
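One point that may help when reading the back-transformed values: a difference on the log10 scale corresponds to a ratio on the original scale, not to a difference in millimeters. A short derivation follows, where $\mu_A$ and $\mu_B$ are symbols introduced here (not in the original post) for the geometric means of Proximal_Lead_Bowing for two electrode types A and B:

```latex
\hat{\beta}_{A-B}
  = \log_{10}\mu_A - \log_{10}\mu_B
  = \log_{10}\frac{\mu_A}{\mu_B}
\quad\Longrightarrow\quad
10^{\hat{\beta}_{A-B}} = \frac{\mu_A}{\mu_B}
```

Under this reading, 10^0.5766 ≈ 3.77 would be a unitless ratio between two electrode types rather than a difference of 3.77 mm.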

What kind of statistical method should I use to test enrichment or over-representation in a rank-ordered vector with binary status?
I have gene expression data from 1065 different cell lines for a single gene, say "BRAF". The BRAF expression levels are ordered. Most TP53-mutated cell lines have high BRAF expression (see the figure below). So what kind of statistical method should I use to test for enrichment or over-representation of TP53 status (WT vs Mutant) with respect to BRAF expression?

Chrome Selenium IDE random number generator
I saw similar topics but nothing exact.
When I used Firefox and the IDE, I was able to use StoreEval - Math.round(Math.random() * 99999999999) to create a random number of a specific length. I have now moved to Chrome to use the IDE and "StoreEval" is no longer an option. I have tried all the new "store" options available, but I end up with the warning below in the logs and the number is not created:
"Warning implicit locators are deprecated, please change the locator to id=Math.round (Math.random() * 99999999999"
Any ideas on what I need to use/change? I will admit I am not exactly sure what "please change the locator to" means.
Thanks!

Printing random array element in C89
I have the following code that works fine:
#include <stdio.h>
#include <stdlib.h> /* need this for rand() */
#include "random.h"
#include <time.h> /* for time() function */

int main()
{
    int arr[5], a = 0;

    printf("enter 5 array elements\n");
    scanf("%d", &arr[0]);
    scanf("%d", &arr[1]);
    scanf("%d", &arr[2]);
    scanf("%d", &arr[3]);
    scanf("%d", &arr[4]); /* scan all elements */

    srand(time(NULL)); /* set the seed */
    a = arr[rand() % ARR_SIZE(arr)]; /* from .h file */
    printf("%d\n", a); /* print the random element generated above */
    return 0;
}
It picks a random integer for an array of 5 integers.
I need the following modifications to it:
A function should accept two parameters — an array of void pointers and the array length. It should return a void pointer.
The function should pick an element from the array at random and return it.
int main() must seed the random number generator and then call the function. Then finally it should print the random element that was generated.
I don't know how to modify it to meet the above requirements.
Here are the random.h file contents:
#define ARR_SIZE(arr) ( sizeof((arr)) / sizeof((arr[0])) )

One time pad encryptions in java
I've created a one-time pad encryption in Java, but I have two problems:
In encryption, how can I make the key size flexible so that it matches the size of the plaintext and is generated randomly? For example, if the plaintext is 4 letters, the key array must be 32 bits, because each letter has 8 bits.
In decryption, how can I read from two files, both in binary form, XOR them, and then print the result in ASCII form?
public class onetimepad {

    public static void main(String[] args) throws Exception {
        int[] key = generate8BitKey();
        Scanner in = new Scanner(System.in);
        System.out.println(" One Time Pad encryption and decryption ");
        System.out.println(" For encryption Enter 1 ");
        System.out.println(" For decryption Enter 2 ");
        System.out.println(" Exit Enter 3 ");
        int a = in.nextInt();
        switch (a) {
            case 1:
                File input = new File("message.txt");
                Scanner sc = new Scanner(input);
                String msg = sc.nextLine();
                System.out.println("Key: ");
                // Write the Key in file.
                PrintWriter writer2 = new PrintWriter("Output.txt", "UTF-8");
                writer2.println(" Key  ");
                for (int i : key) {
                    System.out.print(key[i]);
                    writer2.print(key[i]);
                }
                writer2.close();
                System.out.println();
                String ciphertext = encrypt(msg, key);
                System.out.println("Encrypted Message: " + ciphertext);
                break;
            case 2:
                File input2 = new File("ciphertext.txt");
                Scanner sc2 = new Scanner(input2);
                String msg2 = sc2.nextLine();
                File input3 = new File("Key.txt");
                Scanner sc3 = new Scanner(input3);
                String msg3 = sc2.nextLine();
                System.out.println("Decrypted Message: " + decrypt(msg3, key));
                break;
            default:
        }
    } // End the main.

    // Methods.
    public static String encrypt(String msg, int[] key) {
        int[] binmsg = stringToBinary(msg);
        int[] result = xor(binmsg, repeatArray(key, msg.length()));
        String r = "";
        for (int i : result) {
            r += (char) (result[i] + '0');
        }
        return r;
    }

    public static String decrypt(String ciphertext, int[] key) {
        int[] bin = new int[ciphertext.length()];
        for (int i = 0; i < ciphertext.length(); i++) {
            bin[i] = ciphertext.charAt(i) - '0';
        }
        int[] result = xor(bin, repeatArray(key, bin.length / 8));
        return binaryToString(result);
    }

    public static int[] stringToBinary(String msg) {
        int[] result = new int[msg.length() * 8];
        for (int i = 0; i < msg.length(); i++) {
            String bin = Integer.toBinaryString((int) msg.charAt(i));
            while (bin.length() < 8) {
                bin = "0" + bin;
            }
            for (int j = 0; j < bin.length(); j++) {
                result[i * 8 + j] = bin.charAt(j) - '0';
            }
        }
        return result;
    }

    public static String binaryToString(int[] bin) {
        String result = "";
        for (int i = 0; i < bin.length / 8; i++) {
            String c = "";
            for (int j = 0; j < 8; j++) {
                c += (char) (bin[i * 8 + j] + '0');
            }
            result += (char) Integer.parseInt(c, 2);
        }
        return result;
    }

    public static int[] generate8BitKey() {
        int[] key = new int[8];
        for (int i = 0; i < 8; i++) {
            SecureRandom sr = new SecureRandom();
            key[i] = sr.nextInt(2);
        }
        return key;
    }

    public static int[] xor(int[] a, int[] b) {
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = a[i] == b[i] ? 0 : 1;
        }
        return result;
    }

    public static int[] repeatArray(int[] a, int n) {
        int[] result = new int[a.length * n];
        for (int i = 0; i < result.length; i++) {
            result[i] = a[i % a.length]; // mod
        }
        return result;
    }
}
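The two requirements in the question (a random key exactly as long as the plaintext, and decryption by XOR-ing two binary files) can be sketched compactly. The example below is in Python rather than Java and is only an illustration under stated assumptions: the function names `generate_key`, `xor_bytes` and `decrypt_files` are invented here, and the file paths passed to `decrypt_files` are hypothetical.

```python
import secrets


def generate_key(length):
    """Return `length` cryptographically random bytes (8 bits per plaintext byte)."""
    return secrets.token_bytes(length)


def xor_bytes(data, key):
    """XOR two equal-length byte strings; the same call encrypts and decrypts."""
    if len(data) != len(key):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


def decrypt_files(ciphertext_path, key_path):
    """Read ciphertext and key as raw bytes, XOR them, and decode the result as ASCII."""
    with open(ciphertext_path, "rb") as f:
        ciphertext = f.read()
    with open(key_path, "rb") as f:
        key = f.read()
    return xor_bytes(ciphertext, key).decode("ascii")


plaintext = "abcd".encode("ascii")   # 4 letters
key = generate_key(len(plaintext))   # 4 bytes = 32 bits
ciphertext = xor_bytes(plaintext, key)
recovered = xor_bytes(ciphertext, key).decode("ascii")
print(len(key) * 8, recovered)
```

Working on bytes instead of arrays of 0/1 integers keeps the key length tied directly to the message length and lets the files be read and written in binary mode without any text conversion.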

How can I keep the structure of an extra-trees regression?
I want to use the sklearn package to do an extra-trees regression, but I need to keep the structure of the trees and only update the values in the leaf nodes on each iteration. How can I keep the structure unchanged while iterating?

Random Forest Accuracy Difference between Test Data & New Data
I am new to machine learning and got very discouraged when I saw a huge difference in my Random Forest model's performance between test data and new data. Any insights would be greatly appreciated.
The objective of my model is to forecast future-period stock returns based on time series data for the same stock. I applied a Random Forest classifier using scikit-learn in Python to predict the return decile instead of the actual price change. I used data up to June 2017, trained the model on 80% of the data, and tested it on the remaining 20%. The results were great: the probability of a misclassification by more than 1 decile in either direction was 3%. That means if the model predicts a return in Decile 5, the actual return falls below Decile 4 or above Decile 6 in only 3% of cases. I was very happy with that.
However, when I applied the model to "new data", from July 1, 2017 to the present, I got horrific results. The probability of a misclassification by more than +/-1 decile jumped to 60%!
I thought the issue was overfitting due to the depth of the trees, but it was not. I set min_samples_leaf to 20 and even 40, and it actually made the forecasts on new data worse.
What else could it be? If the model does so well on historical test data, why does it perform so differently on the new data? The nature of the new data cannot be that different.
Thank you all.

caret::groupKFold and validation/testing
This is unknown territory, so please let me know if the question is not clear.
I'm trying to fit a random forest with caret. I have a dataset of about 160 observations, of which 60 are repeated measures, so I need to make sure the same ids (patients) are not used for both training and validation. Because of this, I've used groupKFold to create 5 folds before training the model.
What I can't understand is at which point I select data for the actual testing/validation AFTER training the model. In other words, where is the "newdata"?
predict(rf_mod, "??")
folds <- groupKFold(rf_data$id, k = 5)
rf_data <- rf_data %>% select(-id)
fitControl <- trainControl(method = "cv", number = 5, index = folds, search = "random")
rf_mod <- train(cancer ~ ., rf_data, method = "rf", trControl = fitControl)

Retraining an existing generalized additive model (GAM) in R
I have a trained GAM model in R using the mgcv library. Now I am trying to retrain the existing GAM model with a new dataset. What is the best way to retrain the existing GAM model in R? I would appreciate any experience or guidance in this regard!

How can I solve this problem without renaming images? ValueError: invalid literal for int() with base 10: '1 (1)'
How can I solve this problem without renaming the images? They are named like "User.1 (1)" because I renamed many photos at once (F2). Thanks for helping!

Best way to change trainset dynamically
I am experimenting with an artificially created training set and I want to extract the best subset from it, that is, the one that will increase my model's performance. I use TensorFlow as my deep learning framework.
The requirement is to add a subset to the training set and, if the model improves, save this subset to a separate file and then keep adding new subsets. Whenever the model does not improve, just ignore the current subset and move on to the next one.
Is this even possible? And what is the most effective way of implementing it?