Train/test split with repeated measures

I want to try a random forest on this data, predicting y = happy from x = ate. Some of these people were lucky and got two free meals, while others got only one. Can I use rsample to make sure that the same id (for example, 5) never appears in both the training and test sets? If not, how should I do it?

library(tibble)
library(rsample)

set.seed(123)
dframe <- tibble(id = c(1,1,2,2,3,4,5,5,6,7), 
                 ate = sample(c("cookie", "slug"), size = 10, replace = TRUE),
                 happy = sample(c("yes", "no"), size = 10, replace = TRUE))


dframe_split <- initial_split(dframe, strata = "happy")
dframe_train <- training(dframe_split)
dframe_test <- testing(dframe_split)

Created on 2018-10-11 by the reprex package (v0.2.0).

1 answer

  • answered 2018-10-12 08:12 liori

    As of rsample 0.0.2, the only documented way to do a split like this with this library seems to be the group_vfold_cv function, which keeps all rows that share an id on the same side of each split. For example:

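    # Group 3-fold CV: all rows with a given id are held out together
    resamples <- group_vfold_cv(dframe, group = "id", v = 3)
    lapply(resamples$splits, training)  # training rows of each split
    lapply(resamples$splits, testing)   # held-out rows of each split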
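
    If a single train/test pair is wanted rather than a full set of folds, one option is to take just one of these splits. The sketch below rebuilds the question's data, uses rsample's analysis() and assessment() accessors on the first split, and then checks that no id ends up on both sides. Using the first fold and v = 3 (roughly a two-thirds/one-third split by id) are assumptions made for illustration, not a recommendation from the rsample documentation.

    ```r
    library(tibble)
    library(rsample)

    # Same toy data as in the question
    set.seed(123)
    dframe <- tibble(id = c(1, 1, 2, 2, 3, 4, 5, 5, 6, 7),
                     ate = sample(c("cookie", "slug"), size = 10, replace = TRUE),
                     happy = sample(c("yes", "no"), size = 10, replace = TRUE))

    # Grouped 3-fold CV: rows sharing an id are always held out together
    resamples <- group_vfold_cv(dframe, group = "id", v = 3)

    # Assumption: use the first fold as the single train/test pair
    first_split  <- resamples$splits[[1]]
    dframe_train <- analysis(first_split)    # rows used for fitting
    dframe_test  <- assessment(first_split)  # rows held out

    # Count ids shared between the two sides of every split (should all be 0)
    overlap <- vapply(
      resamples$splits,
      function(s) length(intersect(analysis(s)$id, assessment(s)$id)),
      integer(1)
    )
    all(overlap == 0L)
    ```

    Note that this split is grouped by id only; unlike the initial_split call in the question, it is not stratified on happy.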