Fetching values from one column based on other column keys in long-formatted dataset
I have a long format dataset of 100,000+ individuals, capturing clinic visits at 5 different time points (not chronological). I've included an example dataset below that replicates the formatting of my data:
ID
: participant ID visit_number: order of the clinic visits in the original datasetage_visit
: age at the time of the visitclinic_number
: the identifier for the specific clinic locationage_sorted
: For each ID, age sorted in ascending order across the 5 clinic visitsage_sorted_index
: For each ID, the visit number corresponding to the sorted age
I would like to create a new column (clinic_number_extracted
) that fetches the clinic identifier (clinic_number
) corresponding to each sorted age (age_sorted
value) for each participant. I was thinking that it might be possible to use the age_sorted_index
and visit_number
variables to do so (generating key-value pairs?), but am not quite sure how to do this outside of data.table
. A tidyverse solution would be preferred.
I've looked on R community and stack exchange for clues, but haven't been able to find exactly what I'm looking for (likely not using the correct search terms). I tried to play around with group_by(across())
and with_order(order_by())
functions without much success. I can potentially create a new variable with a few case_when()
conditions but might run into issues if there are repeated age_assessment_sorted values
.
set.seed(42)
# Beginning dataset
das <- data.frame(id = rep(letters[1:3], each = 5),
visit_number = rep(1:5, times = 3),
age_visit = c(50, rep(NA_real_, times = 7), 34, 40, 72, rep(NA_real_, times = 3), 87),
clinic_number = sample(30:50, 15, replace=TRUE),
age_sorted = c(50, rep(NA_real_, times = 4), 34, 40,rep(NA_real_, times = 3), 72, 87, rep(NA_real_, times = 3)),
age_sorted_index = c(rep(1:5), 4, 5, rep(1:3), 1, 5, 2, 3, 4))
# Print out dataset
das
#> id visit_number age_visit clinic_number age_sorted age_sorted_index
#> 1 a 1 50 46 50 1
#> 2 a 2 NA 34 NA 2
#> 3 a 3 NA 30 NA 3
#> 4 a 4 NA 39 NA 4
#> 5 a 5 NA 33 NA 5
#> 6 b 1 NA 47 34 4
#> 7 b 2 NA 46 40 5
#> 8 b 3 NA 44 NA 1
#> 9 b 4 34 36 NA 2
#> 10 b 5 40 33 NA 3
#> 11 c 1 72 34 72 1
#> 12 c 2 NA 43 87 5
#> 13 c 3 NA 49 NA 2
#> 14 c 4 NA 47 NA 3
#> 15 c 5 87 44 NA 4
Desired data:
das_final <- cbind(das,
clinic_number_extracted = c(46, rep(NA_real_, times = 4), 36, 33, rep(NA_real_, times = 3), 34, 44, rep(NA_real_, times = 3)))
# Print out final dataset
das_final
#> id visit_number age_visit clinic_number age_sorted age_sorted_index
#> 1 a 1 50 46 50 1
#> 2 a 2 NA 34 NA 2
#> 3 a 3 NA 30 NA 3
#> 4 a 4 NA 39 NA 4
#> 5 a 5 NA 33 NA 5
#> 6 b 1 NA 47 34 4
#> 7 b 2 NA 46 40 5
#> 8 b 3 NA 44 NA 1
#> 9 b 4 34 36 NA 2
#> 10 b 5 40 33 NA 3
#> 11 c 1 72 34 72 1
#> 12 c 2 NA 43 87 5
#> 13 c 3 NA 49 NA 2
#> 14 c 4 NA 47 NA 3
#> 15 c 5 87 44 NA 4
#> clinic_number_extracted
#> 1 46
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 36
#> 7 33
#> 8 NA
#> 9 NA
#> 10 NA
#> 11 34
#> 12 44
#> 13 NA
#> 14 NA
#> 15 NA
Created on 2022-05-06 by the reprex package (v2.0.1)
2 answers
-
answered 2022-05-06 18:23
akrun
We may use
match
between the 'age_sorted' and 'age_visit', use that index to subset the 'clinic_number' and replace the elements where they are NA in 'age_sorted' toNA
library(dplyr) das_final %>% group_by(id) %>% mutate(clinic_new = clinic_number[match(age_sorted, age_visit)] * NA^is.na(age_sorted)) %>% ungroup
-output
# A tibble: 15 × 8 id visit_number age_visit clinic_number age_sorted age_sorted_index clinic_number_extracted clinic_new <chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> 1 a 1 50 46 50 1 46 46 2 a 2 NA 34 NA 2 NA NA 3 a 3 NA 30 NA 3 NA NA 4 a 4 NA 39 NA 4 NA NA 5 a 5 NA 33 NA 5 NA NA 6 b 1 NA 47 34 4 36 36 7 b 2 NA 46 40 5 33 33 8 b 3 NA 44 NA 1 NA NA 9 b 4 34 36 NA 2 NA NA 10 b 5 40 33 NA 3 NA NA 11 c 1 72 34 72 1 34 34 12 c 2 NA 43 87 5 44 44 13 c 3 NA 49 NA 2 NA NA 14 c 4 NA 47 NA 3 NA NA 15 c 5 87 44 NA 4 NA NA
-
answered 2022-05-06 18:30
jpsmith
Could you just do this without needing to group? It seems to solve the issue in the sample data, but perhaps I am missing something:
das$clinic_number_extracted <- ifelse(!is.na(das$age_sorted), das$clinic_number, NA)
Output:
# id visit_number age_visit clinic_number age_sorted age_sorted_index clinic_number_extracted 1 a 1 50 46 50 1 46 #2 a 2 NA 34 NA 2 NA #3 a 3 NA 30 NA 3 NA #4 a 4 NA 39 NA 4 NA #5 a 5 NA 33 NA 5 NA #6 b 1 NA 47 34 4 47 #7 b 2 NA 46 40 5 46 #8 b 3 NA 44 NA 1 NA #9 b 4 34 36 NA 2 NA #10 b 5 40 33 NA 3 NA #11 c 1 72 34 72 1 34 #12 c 2 NA 43 87 5 43 #13 c 3 NA 49 NA 2 NA #14 c 4 NA 47 NA 3 NA #15 c 5 87 44 NA 4 NA
Testing against desired data provided:
all.equal(das[,"clinic_number_extracted"], das_final[,"clinic_number_extracted"]) #[1] TRUE
do you know?
how many words do you know