How do I quickly check whether a string contains concatenated English words?
Is there a fast way to check whether a string without spaces contains a valid (English) sentence?
In the example below I've encrypted the sentence "thisisavalidenglishsentence" using a modern shift cipher with an unknown key/shift. The resulting ciphertext is "aopzpzhchspklunspzozlualujl".
key = 0, plaintext = aopzpzhchspklunspzozlualujl
key = 1, plaintext = znoyoygbgrojktmroynyktzktik
key = 2, plaintext = ymnxnxfafqnijslqnxmxjsyjshj
key = 3, plaintext = xlmwmwezepmhirkpmwlwirxirgi
key = 4, plaintext = wklvlvdydolghqjolvkvhqwhqfh
key = 5, plaintext = vjkukucxcnkfgpinkujugpvgpeg
key = 6, plaintext = uijtjtbwbmjefohmjtitfoufodf
key = 7, plaintext = thisisavalidenglishsentence <- !!
key = 8, plaintext = sghrhrzuzkhcdmfkhrgrdmsdmbd
key = 9, plaintext = rfgqgqytyjgbclejgqfqclrclac
key = 10, plaintext = qefpfpxsxifabkdifpepbkqbkzb
key = 11, plaintext = pdeoeowrwhezajcheodoajpajya
key = 12, plaintext = ocdndnvqvgdyzibgdncnziozixz
key = 13, plaintext = nbcmcmupufcxyhafcmbmyhnyhwy
key = 14, plaintext = mablbltotebwxgzeblalxgmxgvx
key = 15, plaintext = lzakaksnsdavwfydakzkwflwfuw
key = 16, plaintext = kyzjzjrmrczuvexczjyjvekvetv
key = 17, plaintext = jxyiyiqlqbytudwbyixiudjudsu
key = 18, plaintext = iwxhxhpkpaxstcvaxhwhtcitcrt
key = 19, plaintext = hvwgwgojozwrsbuzwgvgsbhsbqs
key = 20, plaintext = guvfvfninyvqratyvfufragrapr
key = 21, plaintext = ftueuemhmxupqzsxueteqzfqzoq
key = 22, plaintext = estdtdlglwtopyrwtdsdpyepynp
key = 23, plaintext = drscsckfkvsnoxqvscrcoxdoxmo
key = 24, plaintext = cqrbrbjejurmnwpurbqbnwcnwln
key = 25, plaintext = bpqaqaiditqlmvotqapamvbmvkm
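For reference, a candidate list like this can be produced with a few lines of Python. This is only a minimal sketch, assuming the text contains nothing but lowercase a-z; the names `shift_decrypt` and `all_candidates` are made up for illustration.

```python
import string

ALPHABET = string.ascii_lowercase


def shift_decrypt(ciphertext: str, key: int) -> str:
    """Shift every letter back by `key` positions, wrapping around mod 26.

    Assumes `ciphertext` contains only lowercase a-z.
    """
    return "".join(ALPHABET[(ALPHABET.index(c) - key) % 26] for c in ciphertext)


def all_candidates(ciphertext: str) -> list[str]:
    """Return the 26 candidate plaintexts, indexed by key."""
    return [shift_decrypt(ciphertext, key) for key in range(26)]


ciphertext = "aopzpzhchspklunspzozlualujl"
for key, plaintext in enumerate(all_candidates(ciphertext)):
    print(f"key = {key}, plaintext = {plaintext}")
```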
If I were to brute-force the decryption of this ciphertext, I could still pick out the correct plaintext by manually going through all the candidates (in this case, shift = 7). Usually, however, the number of candidate decryptions is much, much larger than the 26 in this example.
The ways I know of to detect a valid sentence are to check whether the letter frequency matches that of English, or to split the candidate into n-grams and compare every possible substring against a dictionary of known English words.
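To make the letter-frequency idea concrete, here is a rough sketch that scores a candidate with a chi-squared statistic against English letter frequencies (lower means closer to English). The percentages in `ENGLISH_FREQ` are commonly quoted approximate values, not taken from any particular corpus, and the function name `chi_squared` is just an illustrative choice. It assumes lowercase a-z input.

```python
from collections import Counter

# Approximate relative frequencies of letters in English text, in percent.
# These are commonly quoted figures; treat them as an assumption.
ENGLISH_FREQ = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702,
    "f": 2.228, "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153,
    "k": 0.772, "l": 4.025, "m": 2.406, "n": 6.749, "o": 7.507,
    "p": 1.929, "q": 0.095, "r": 5.987, "s": 6.327, "t": 9.056,
    "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150, "y": 1.974,
    "z": 0.074,
}


def chi_squared(text: str) -> float:
    """Chi-squared distance between the letter counts of `text` and English.

    Lower scores mean the letter distribution looks more like English.
    Assumes `text` contains only lowercase a-z.
    """
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    score = 0.0
    for letter, pct in ENGLISH_FREQ.items():
        expected = n * pct / 100
        score += (counts.get(letter, 0) - expected) ** 2 / expected
    return score


print(chi_squared("thisisavalidenglishsentence"))
print(chi_squared("aopzpzhchspklunspzozlualujl"))
```

Candidates can then be ranked by this score, though on strings this short the ranking is noisy.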
Both approaches have problems. Letter frequency isn't reliable for short sentences: it might be fine with only 26 possibilities, but with thousands of candidates it becomes less helpful. Splitting the text into n-grams and comparing them to a dictionary takes a very long time when there are many possible decryptions.
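For comparison, the dictionary check does not have to enumerate every split explicitly. The sketch below uses a standard dynamic-programming "word break" test instead, which is a different technique from the naive n-gram comparison. `WORDS` is a tiny hypothetical stand-in for a real dictionary, and `can_segment` is an illustrative name.

```python
# Tiny stand-in word list; a real check would load a full English dictionary.
WORDS = {"this", "is", "a", "valid", "english", "sentence"}


def can_segment(text: str, words: set[str]) -> bool:
    """Return True if `text` can be split entirely into words from `words`.

    reachable[i] is True when text[:i] is a concatenation of dictionary words.
    Each position only looks back as far as the longest word, so the work is
    roughly len(text) * max_word_length set lookups.
    """
    max_len = max(map(len, words), default=0)
    reachable = [True] + [False] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            if reachable[start] and text[start:end] in words:
                reachable[end] = True
                break
    return reachable[len(text)]


print(can_segment("thisisavalidenglishsentence", WORDS))
print(can_segment("aopzpzhchspklunspzozlualujl", WORDS))
```

This only answers yes or no for a full segmentation; it says nothing about whether the words form a grammatical sentence, and a large dictionary with many short words will accept some nonsense strings.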
Is there an alternative approach to quickly find whether a string contains valid (English) words? Or a fast way to compare a sentence of concatenated words against an (English) dictionary?