Trouble with casefold() due to Non-English letters

All I want to do is change the address column in df to upper case:

df$address <- casefold(df$address, upper = TRUE)

but I keep getting the following error, probably because of the 'Í' (an I with an acute accent):

Error in toupper(x) : 
  invalid input 'POLÍGONO INDUSTRIAL OLASO' in 'utf8towcs'

I know this observation is already upper case, but not all of them are. I don't want to just replace every such character with its English counterpart, mainly because an Eszett (ß) shows up later and I don't know what it would be replaced with.

1 answer

  • answered 2018-10-11 21:12 CT Hall

    casefold() works as expected with the accented Í on my system.

    > casefold('POLÍGONO INDUSTRIAL OLASO')
    [1] "polígono industrial olaso"
    > casefold('POLÍGONO INDUSTRIAL OLASO', upper = TRUE)
    [1] "POLÍGONO INDUSTRIAL OLASO"
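
    Since the same call succeeds here, the 'utf8towcs' in the error message may point to bytes that are not valid UTF-8 rather than to the accented letter itself. A minimal sketch, assuming the column was read as Latin-1 (an assumption; the question does not say how df was loaded) and using a made-up input string:

    ```r
    # Hypothetical input: "POLÍGONO" with Í stored as the single Latin-1 byte 0xCD.
    raw_address <- "POL\xcdGONO INDUSTRIAL OLASO"

    # A lone 0xCD byte followed by "G" is not a valid UTF-8 sequence.
    is_valid_before <- validUTF8(raw_address)

    # Re-encode from the assumed source encoding (Latin-1) to UTF-8.
    fixed_address <- iconv(raw_address, from = "latin1", to = "UTF-8")
    is_valid_after <- validUTF8(fixed_address)

    # Case conversion is then applied to the re-encoded string.
    upper_address <- casefold(fixed_address, upper = TRUE)
    ```

    If validUTF8(df$address) returns FALSE for some rows, re-encoding the column this way (with whatever the real source encoding is) before calling casefold() is worth trying.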
    

    The eszett is left as is.

    > casefold('daß')
    [1] "daß"
    > casefold('daß', upper = TRUE)
    [1] "DAß"
    

    You may want to check out the stringr package, which translates the eszett to SS when converting to upper case.

    > library(stringr)
    > str_to_lower('daß')
    [1] "daß"
    > str_to_upper('daß')
    [1] "DASS"
    

    But it doesn't work the other way around: lower-casing 'DASS' gives 'dass', not 'daß'.

    > str_to_lower('DASS')
    [1] "dass"
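
    If adding stringr is not an option, the ß handling can be approximated in base R by replacing ß with SS before upper-casing. A sketch only: to_upper_ss is a hypothetical helper name, and the data frame is made up to stand in for the one in the question.

    ```r
    # Hypothetical helper: expand eszett to "SS" first, then upper-case,
    # since casefold()/toupper() left "ß" unchanged in the output above.
    to_upper_ss <- function(x) {
      toupper(gsub("\u00df", "SS", x, fixed = TRUE))
    }

    # Made-up data standing in for the address column from the question.
    df <- data.frame(address = c("da\u00df", "Hauptstra\u00dfe 5"),
                     stringsAsFactors = FALSE)
    df$address <- to_upper_ss(df$address)
    ```

    As with str_to_upper(), this is one-way: once ß has become SS, lower-casing cannot tell which SS was originally an eszett.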