Matching code runs at O(n2). Can you suggest alternate code that runs closer to O(n)?

I have large data frames with a column (character) of species names called hit.match. hit.match can contain one species name, or over 20 species names, separated by the pipe character "|". I want fast code to check for matches with a single species in hit.match. My current code scales as O(n2) (n-squared), where n is the number of input records. I need it to scale faster, closer to O(n).

##Function matched looks in hit.match in the input df, and returns one of the prioritized species in the input species list match.list that match hit.match in a priority order. If none of the prioritized species is found, it returns the original hit.match
#Input is a df and a list of species scientific names (genus species) locally called match.list
#Output is a df with hit.match replaced with the prioritized species name or retaining the original hit.match

 matched <- function(df, match.list) {
# Iterate through match.list
for(i in seq_len(length(match.list))) {
match <-grep(match.list[[i]],df$hit.match)
for(j in seq_len(length(match))) {
    df$hit.match[[ match[j] ]] <- match.list[[i]]
    } #end For j
} #end for i
df
}

df$hit.match = cbind("Nomina nudum", " Nomina nudum1 | Nomina nudum2", " Nomina nudum | Nomina nudum1 | Nomina nudum2", " Nomina nudum1", " Nomina nudum2")

match.list = c("Nomina nudum", " Nomina nudum1 ", " Nomina nudum2")

matched(df, match.list)

output should be ("Nomina nudum", " Nomina nudum1 ", " Nomina nudum", " Nomina nudum1", " Nomina nudum2") The function works fine, but is too slow.

1 answer

  • answered 2019-10-08 03:46 Ronak Shah

    We could split string on "|", create a pattern by pasting match.list together. If the pattern matches with any element in hit.match, we return the first match or else return hit.match again.

    pat <- paste0(match.list, collapse = "|")
    
    sapply(strsplit(df$hit.match, "\\|"), function(x) {
         inds <- grep(pat, x)
         if (length(inds) > 0)   trimws(x[inds[1L]]) 
         else paste0(x, collapse = "|")
    })
    #[1] "Nomina nudum"  "Nomina nudum1" "Nomina nudum"  "Nomina nudum1" "Nomina nudum2"
    

    data

    df <- data.frame(hit.match = c("Nomina nudum", " Nomina nudum1 | Nomina nudum2", 
        " Nomina nudum | Nomina nudum1 | Nomina nudum2", " Nomina nudum1", 
        " Nomina nudum2"), stringsAsFactors = FALSE)
    
    match.list <- c("Nomina nudum", " Nomina nudum1 ", " Nomina nudum2")