Creating columns for each observed value of a variable

Suppose a respondent (id) is asked to make a choice in five tasks (t=1,2,3,4,5) (a panel dataset with five observations per respondent). Once, a choice is made then an outcome is shown to the respondent. Suppose the data look like below.

+----+---+---------+
| id | t | outcome |
+----+---+---------+
|  1 | 1 |      10 |
|  1 | 2 |      20 |
|  1 | 3 |      30 |
|  1 | 4 |      40 |
|  1 | 5 |      40 |
|  2 | 1 |      20 |
|  2 | 2 |      30 |
|  2 | 3 |      40 |
|  2 | 4 |      40 |
|  2 | 5 |      20 |
|  . | . |       . |
|  . | . |       . |
|  . | . |       . |
+----+---+---------+

Now, I am interested to keep the history of the outcome variable for each t-1 task. I am aiming for the below output.


+----+---+---------+------------+------------+------------+------------+------------+
| id | t | outcome | outcome_t1 | outcome_t2 | outcome_t3 | outcome_t4 | outcome_t5 |
+----+---+---------+------------+------------+------------+------------+------------+
|  1 | 1 |      10 | NA         | NA         | NA         | NA         | NA         |
|  1 | 2 |      20 | 10         | NA         | NA         | NA         | NA         |
|  1 | 3 |      30 | 10         | 20         | NA         | NA         | NA         |
|  1 | 4 |      40 | 10         | 20         | 30         | NA         | NA         |
|  1 | 5 |      40 | 10         | 20         | 30         | 40         | NA         |
|  2 | 1 |      20 | NA         | NA         | NA         | NA         | NA         |
|  2 | 2 |      30 | 20         | NA         | NA         | NA         | NA         |
|  2 | 3 |      40 | 20         | 30         | NA         | NA         | NA         |
|  2 | 4 |      40 | 20         | 30         | 40         | NA         | NA         |
|  2 | 5 |      20 | 20         | 30         | 40         | 40         | NA         |
|  . | . |       . | .          | .          | .          | .          | .          |
|  . | . |       . | .          | .          | .          | .          | .          |
|  . | . |       . | .          | .          | .          | .          | .          |
+----+---+---------+------------+------------+------------+------------+------------+

I went through most of the questions on this forum, but most of them address the lagged columns which are not applicable to this case.

Perhaps there could be an easy and efficient way using mutate with dplyr but I am unable to make it work so far.

4 answers

  • answered 2019-05-15 03:25 akrun

    We can use data.table methods for this. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'id', loop through the 'outcome', replicate the elements with the specifying the sequence from 1:.N and .N:1 with NA as padding, then join with the original dataset on the 'id' and 't' columns

    library(data.table)
    df2 <- setDT(df1)[, Map(function(x, y, z) rep(c(NA, x), 
                 c(y, z)), outcome, 1:.N, .N:1), id][, t := rowid(id)]
    out <- df2[df1, on  = .(id, t)]
    setcolorder(out, c(1, 7, 8, 2:6))
    setnames(out, 4:ncol(out), paste0("outcome_t", 1:5))
    out
    #    id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
    # 1:  1 1      10         NA         NA         NA         NA         NA
    # 2:  1 2      20         10         NA         NA         NA         NA
    # 3:  1 3      30         10         20         NA         NA         NA
    # 4:  1 4      40         10         20         30         NA         NA
    # 5:  1 5      40         10         20         30         40         NA
    # 6:  2 1      20         NA         NA         NA         NA         NA
    # 7:  2 2      30         20         NA         NA         NA         NA
    # 8:  2 3      40         20         30         NA         NA         NA
    # 9:  2 4      40         20         30         40         NA         NA
    #10:  2 5      20         20         30         40         40         NA
    

    Or an option with dcast

    dcast(setDT(df1), id + t ~ paste0("outcome_t", t), 
           value.var = 'outcome')[, na.locf(.SD, na.rm = FALSE), id]
    

    Or we can do this more compactly

    library(zoo)
    nm1 <- paste0("outcome_t", 1:5)
    df1[nm1] <- do.call(rbind, lapply(split(df1$outcome, df1$id), 
                    function(x) head(rbind(NA, na.locf((NA^!diag(x)) * x)), -1)))
    

    Or using colCumsums

    library(matrixStats)
    df1[nm1] <- do.call(rbind, lapply(split(df1$outcome, df1$id), 
              function(x) colCumsums(rbind(0, diag(x)))[-length(x), ]))
    

    data

    df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), 
    t = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), outcome = c(10L, 
    20L, 30L, 40L, 40L, 20L, 30L, 40L, 40L, 20L)),
     class = "data.frame", row.names = c(NA, -10L))
    

  • answered 2019-05-15 03:39 Ronak Shah

    Base R approach, we can split the outcome column based on id and create a dataframe incrementally adding one value at a time in outcome variable and filling rest of them with NA and finally rbind these list of dataframes into one dataframe.

    n <- 5
    df[paste0("outcome_t", seq_len(n))] <- do.call(rbind, 
        lapply(split(df$outcome, df$id), function(x) 
      t(sapply(seq_along(x), function(y) c(x[seq_len(y - 1)], rep(NA, n - (y - 1)))))))
    
    df
    #   id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
    #1   1 1      10         NA         NA         NA         NA         NA
    #2   1 2      20         10         NA         NA         NA         NA
    #3   1 3      30         10         20         NA         NA         NA
    #4   1 4      40         10         20         30         NA         NA
    #5   1 5      40         10         20         30         40         NA
    #6   2 1      20         NA         NA         NA         NA         NA
    #7   2 2      30         20         NA         NA         NA         NA
    #8   2 3      40         20         30         NA         NA         NA
    #9   2 4      40         20         30         40         NA         NA
    #10  2 5      20         20         30         40         40         NA
    

    A tidyverse option using separate

    library(tidyverse)
    
    df %>%
       group_by(id) %>%
       mutate(new = map_chr(seq_along(outcome), 
             ~paste0(outcome[seq_len(. - 1)], collapse = ","))) %>%
       separate(new, into = paste0("outcome_t", seq_len(n)), 
                     sep = ",", fill = "right") %>%
       mutate(outcome_t1 = replace(outcome_t1, outcome_t1 == "", NA))
    

    data

    df <- data.frame(id = rep(c(1, 2), each = 5), t = 1:5, 
         outcome = c(10, 20, 30, 40, 40, 20, 30, 40, 40, 20))
    

  • answered 2019-05-15 03:41 chinsoon12

    Another data.table approach using transpose:

    DT[, paste0("outcome_t", 1:5) := 
            transpose(lapply(t, function(x) replace(outcome, t>=x, NA))), 
        by=.(id)]
    

    output:

        id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
     1:  1 1      10         NA         NA         NA         NA         NA
     2:  1 2      20         10         NA         NA         NA         NA
     3:  1 3      30         10         20         NA         NA         NA
     4:  1 4      40         10         20         30         NA         NA
     5:  1 5      40         10         20         30         40         NA
     6:  2 1      20         NA         NA         NA         NA         NA
     7:  2 2      30         20         NA         NA         NA         NA
     8:  2 3      40         20         30         NA         NA         NA
     9:  2 4      40         20         30         40         NA         NA
    10:  2 5      20         20         30         40         40         NA
    

    data:

    library(data.table)
    DT <- fread("| id | t | outcome |
    |  1 | 1 |      10 |
    |  1 | 2 |      20 |
    |  1 | 3 |      30 |
    |  1 | 4 |      40 |
    |  1 | 5 |      40 |
    |  2 | 1 |      20 |
    |  2 | 2 |      30 |
    |  2 | 3 |      40 |
    |  2 | 4 |      40 |
    |  2 | 5 |      20 |")[, c(-1,-5)]
    

  • answered 2019-05-15 04:01 H 1

    Here is a tidyverse approach.

    library(tidyverse)
    
    df %>% 
      mutate(rn = 1:n(),
             t = paste0("outcome_t", t)) %>%
      group_by(id) %>%
      spread(t, outcome) %>%
      mutate_at(vars(-rn, -id), lag) %>%
      fill(-rn, -id)
    
    # A tibble: 10 x 7
    # Groups:   id [2]
          id    rn outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
       <int> <int>      <int>      <int>      <int>      <int>      <int>
     1     1     1         NA         NA         NA         NA         NA
     2     1     2         10         NA         NA         NA         NA
     3     1     3         10         20         NA         NA         NA
     4     1     4         10         20         30         NA         NA
     5     1     5         10         20         30         40         NA
     6     2     6         NA         NA         NA         NA         NA
     7     2     7         20         NA         NA         NA         NA
     8     2     8         20         30         NA         NA         NA
     9     2     9         20         30         40         NA         NA
    10     2    10         20         30         40         40         NA