How to convert column into date format in R that has extra information

I would like to use some sort of regex function for the rest of the data (~2 million rows) to extract relevant date information (ideally have Day, Month, and time in date format since Year is only 2009 for ~2 million rows).

I have a column that looks like so:

ID | created_at
1    Mon Apr 06 22:19:45 PDT 2009
2    Mon Apr 06 22:19:49 PDT 2009

I applied these two functions to extract "day" and delete 'PDT 2009' from the end but now I would like to make the rest of the columns be date format for plotting purposes.

df$Day <- sub("([A-Za-z]+).*", "\\1", df$created_at) ## Extract first word

df$delete <- gsub("\\PDT.*","", df$created_at) ## Delete everything after PDT starts

Desired outcome:

ID | created_at                    | Month  | Day  | Time
1    Mon Apr 06 22:19:45 PDT 2009    Apr      Mon    22:19:45
2    Mon Apr 06 22:19:49 PDT 2009    Apr      Mon    22:19:49

5 answers

  • answered 2021-09-11 18:28 Vinícius Félix

    I think this might help you

    Libraries

    library(tidyverse)
    library(lubridate)
    

    Data

    df <- 
      tibble(
        created_at = c("Mon Apr 06 22:19:45 PDT 2009","Mon Apr 06 22:19:49 PDT 2009") 
      )
    

    Code

    df %>% 
      separate(
        col = created_at,
        into = c("wday","month","day","time","type","year"),
        sep = " ",
        remove = FALSE
        ) %>% 
      mutate(
        day = as.numeric(day),
        year = as.numeric(year),
        month_num = which(month.abb == month),
        time = hms(time),
        date = lubridate::make_date(year = year,month = month_num,day = day)
      )
    

    Results

    # A tibble: 2 x 9
      created_at                   wday  month   day time        type   year month_num date      
      <chr>                        <chr> <chr> <dbl> <Period>    <chr> <dbl>     <int> <date>    
    1 Mon Apr 06 22:19:45 PDT 2009 Mon   Apr       6 22H 19M 45S PDT    2009         4 2009-04-06
    2 Mon Apr 06 22:19:49 PDT 2009 Mon   Apr       6 22H 19M 49S PDT    2009         4 2009-04-06
    

  • answered 2021-09-11 18:34 mnist

    You do not need to use any regexes, just regular date formatting is sufficient. you can find a nice overview here or in ?strptime(). You just have to adjust for the separators. This should be easier and more efficient than using regexes, splitting, etc...

    Once you have this in native R DateTime formats POSIXlt and POSIXt, you can easily extract all date-related information.

    strptime(x      = "Mon Apr 06 22:19:45 PDT 2009",
             format = "%a %b %d %H:%M:%S PDT %Y")
    #> [1] "2009-04-06 22:19:45 CEST"
    

  • answered 2021-09-11 19:05 TarJae

    Here is an approach using str_split in companion with map_char:

    library(tidyverse)
    df %>% 
        mutate(elements = str_split(created_at, fixed(" "), n=6)) %>% 
        mutate(Month = map_chr(elements, 2),
               Day = map_chr(elements, 1),
               Time = map_chr(elements, 4), .keep="unused"
               )
    

    output:

      ID                   created_at Month Day     Time
    1  1 Mon Apr 06 22:19:45 PDT 2009   Apr Mon 22:19:45
    2  2 Mon Apr 06 22:19:49 PDT 2009   Apr Mon 22:19:49
    

  • answered 2021-09-11 19:44 Anoushiravan R

    You can use the following solution too:

    library(dplyr)
    
    df %>%
      mutate(ID = row_number(), 
             Month = gsub("(?:[A-Za-z]+)\\s([A-Za-z]+).*", "\\1", created_at, perl = TRUE),
             Day = gsub("([A-Za-z]+).*", "\\1", created_at, perl = TRUE), 
             Time = gsub(".*(\\d{2}:\\d{2}:\\d{2}).*", "\\1", created_at, perl = TRUE)) %>%
      relocate(ID)
    
    # A tibble: 2 x 5
         ID created_at                   Month Day   Time    
      <int> <chr>                        <chr> <chr> <chr>   
    1     1 Mon Apr 06 22:19:45 PDT 2009 Apr   Mon   22:19:45
    2     2 Mon Apr 06 22:19:49 PDT 2009 Apr   Mon   22:19:49
    

  • answered 2021-09-11 20:05 Chris Ruehlemann

    If you're just after Month, Day, and Time, why not use extract from the tidyverse:

    library(tidyr)
    df %>%
      extract(col = created_at, 
              into = c('Month', 'Day', 'Time'), 
              regex = "([A-Za-z]+)\\s([A-Za-z]+)\\s\\d{2}\\s([\\d:]+)")
      Month Day     Time
    1   Mon Apr 22:19:45
    2   Mon Apr 22:19:49
    

    Here, we define three capturing groups using the round brackets syntax (...) to identify the substrings we want to extract into the three columns.

    If you also need created_atin its original form, just store the results as, say, df1 and use cbind:

    cbind(df, df1)
                        created_at Month Day     Time
    1 Mon Apr 06 22:19:45 PDT 2009   Mon Apr 22:19:45
    2 Mon Apr 06 22:19:49 PDT 2009   Mon Apr 22:19:49
    

    Data:

    df <- 
      data.frame(
        created_at = c("Mon Apr 06 22:19:45 PDT 2009","Mon Apr 06 22:19:49 PDT 2009") 
      )
    

How many English words
do you know?
Test your English vocabulary size, and measure
how many words do you know
Online Test
Powered by Examplum