A faster way to find out if any word in a list is a substring in a longer word

What I have

  • A string s of length m (where m > 3)
  • A huge list L of words (length > 2)

What I want to know

  • Is any word in L a substring of s

Currently i have L split up into files, one file each for each word-length 16.txt, 15.txt ... 04.txt
I then iterate over these files from n.txt --> 04.txt and basically do

cat n.txt | while read w; do if [[ $s =~ $w ]] ; then echo $w; fi; done

It's painfully slow there has to be a better way to do this.

Additional info:

  • L contains about 200k words and is fairly static so I don't mind a complicated time consuming setup if it means greater speed
  • There are several different such lists L but only one of them needs to be searched at a time.

I'm pretty language agnostic (se tags) but pseudo code is also fine

1 answer

  • answered 2021-02-22 22:47 choroba

    You can use grep without splitting the list:

    grep -oFf list.txt <<< "$s"
    • -o will only output the matching substrings
    • -F will interpret the lines in list.txt as fixed strings, not regular expressions
    • -f will tell grep what file to use as the source of the patterns to match
    • <<< in bash takes the following word and makes it input to the preceding command