How to clean and standardize a large number of words in a column?

I have a column in a data frame that contains 600 words, and I need to standardize the misspelled and inconsistently formatted entries. For example, I want " alpha", "alpha", "alpha school" and "ALPHA" to become "Alpha". I want "bravo", " bravo ", "braVO", and "bravo bravo" to become "Bravo".

Here is a short sample of my data, as output by dput():

structure(list(names = structure(c(2L, 1L, 6L, 5L, 4L, 3L, 7L, 
8L, 9L, 11L, 10L), .Label = c(" alpha", "alpha", "Alpha", "ALPHA", 
"alpha school", "alpha_school", "bravo", "Bravo", "charlie Charlie", 
"DELTA", "delta school"), class = "factor")), class = "data.frame", row.names = c(NA, 

2 answers

  • answered 2019-09-10 02:24 Sada93

    You could use mutate() and if_else() from dplyr together with str_detect() from stringr. Writing the patterns yourself gives you more control over which variants get matched.

    data <- tibble(names = c(" alpha", "alpha", "Alpha", "ALPHA", 
                             "alpha school", "alpha_school", "bravo", "Bravo", "charlie Charlie", 
                             "DELTA", "delta school")) 
    words <- c(Alpha = "^ ?[aA]",Bravo = "^ ?b",Charlie = "^ ?c",Delta = "^ ?[dD]")
    for(i in seq_along(words)){
      data <- data%>%
        mutate(names = if_else(str_detect(names,words[i]),
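
The loop above is cut off in this copy of the answer. What follows is a minimal, self-contained sketch of the same idea, assuming the intended replacement is the name attached to each pattern in `words`. That completion is a guess, not the answerer's confirmed code, and the variables `pattern` and `standard` are introduced here only for illustration.

```r
library(dplyr)
library(stringr)

data <- tibble(names = c(" alpha", "alpha", "Alpha", "ALPHA",
                         "alpha school", "alpha_school", "bravo", "Bravo",
                         "charlie Charlie", "DELTA", "delta school"))

# Named vector: each name is the standard form, each value is the regex
# that identifies its variants (patterns copied from the answer).
words <- c(Alpha = "^ ?[aA]", Bravo = "^ ?b", Charlie = "^ ?c", Delta = "^ ?[dD]")

for (i in seq_along(words)) {
  pattern  <- words[[i]]       # regex for this group
  standard <- names(words)[i]  # assumed replacement: the pattern's name
  # Replace every entry matching the pattern; leave the rest untouched.
  data <- data %>%
    mutate(names = if_else(str_detect(names, pattern), standard, names))
}

data
```

Because each pattern only checks the first letter (after an optional leading space), this works for the sample but would also relabel any unrelated word that starts with the same letter, so tighter patterns may be needed across all 600 values.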

  • answered 2019-09-10 02:25 Bill O'Brien

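    This uses the adist (approximate string distance) function from the utils package that ships with R, so no extra packages are needed. Each value is mapped to the standard name with the smallest edit distance.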

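    vals <- c('Alpha', 'Bravo', 'Charlie', 'Delta')
    # Return the standard name with the smallest edit distance to x.
    # Named closest_val rather than match to avoid masking base::match.
    closest_val <- function(x) {
        vals[which.min(adist(x, vals))]
    }
    df$standardName <- sapply(df$names, closest_val)
    df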
                 names standardName
    1            alpha        Alpha
    2            alpha        Alpha
    3     alpha_school        Alpha
    4     alpha school        Alpha
    5            ALPHA        Alpha
    6            Alpha        Alpha
    7            bravo        Bravo
    8            Bravo        Bravo
    9  charlie Charlie      Charlie
    10    delta school        Delta
    11           DELTA        Delta
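
One possible refinement, offered as a sketch rather than as part of the answer above: `adist()` accepts an `ignore.case` argument and can compare a whole vector against `vals` in a single call, returning a distance matrix with one row per input value. The helper name `standardize_names` and the small sample data frame below are hypothetical.

```r
vals <- c("Alpha", "Bravo", "Charlie", "Delta")

# Hypothetical helper: map each input to the closest entry in `vals`.
standardize_names <- function(x, vals) {
  x <- trimws(as.character(x))             # factor -> character, drop stray spaces
  d <- adist(x, vals, ignore.case = TRUE)  # rows: inputs, columns: candidates
  vals[apply(d, 1, which.min)]             # closest candidate for each row
}

df <- data.frame(names = c(" alpha", "ALPHA", "alpha_school", " bravo ",
                           "charlie Charlie", "delta school"))
df$standardName <- standardize_names(df$names, vals)
df
```

Ignoring case and trimming whitespace first means that variants such as "ALPHA" or " bravo " have distance zero to their target, which leaves more margin for genuine misspellings. Note that the closest match is always returned, even for a word unrelated to every entry in `vals`, so it may be worth inspecting the largest distances by hand.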