Why doesn't my regex with lookaheads match?

I need (in PHP) to split a sentence by a word that cannot be the first or the last one in the sentence. Say the word is "pression", and here is my regex:

/^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$/i

Live here: https://regex101.com/r/CHAhKj/1/

First, it doesn't match. Second, is it possible at all to split that way? I tried a simplified example:

print_r(preg_split('/^.+pizza.+$/', 'my pizza is cool'));

live here http://sandbox.onlinephpfunctions.com/code/10b674900fc1ef44ec79bfaf80e83fe1f4248d02

and it prints an array of 2 empty strings, when I expect ['my ', ' is cool']

2 answers

  • answered 2021-06-19 17:55 anubhava

    I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence

    You may use this regex:

    (?<=[^\s.?;]\h)pression(?=\h[^\s.?;])
    

    RegEx Demo

    RegEx Details:

    • (?<=[^\s.?;]\h): Lookbehind to assert that immediately before the current position we have a character that is not whitespace, a dot, a ? or a ;, followed by a horizontal whitespace character
    • pression: Match the word pression
    • (?=\h[^\s.?;]): Lookahead to assert that immediately after the current position we have a horizontal whitespace character followed by a character that is not whitespace, a dot, a ? or a ;
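As a minimal sketch of how this pattern behaves with preg_split (the sample sentences and variable names are made up for illustration, and the i flag is an assumption carried over from the question):

```php
<?php
// Lookaround-based split: "pression" is consumed only when a word
// precedes it and another word follows it.
$pattern = '/(?<=[^\s.?;]\h)pression(?=\h[^\s.?;])/i';

$inner = preg_split($pattern, 'my pression is cool'); // word in the middle
$first = preg_split($pattern, 'pression is cool');    // word is first
$last  = preg_split($pattern, 'it is pression');      // word is last

print_r($inner); // two parts: 'my ' and ' is cool'
print_r($first); // no match, so the whole string comes back as one element
print_r($last);  // no match here either
```

Because the lookarounds consume nothing, the surrounding spaces stay in the resulting parts.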

  • answered 2021-06-19 20:00 Wiktor Stribiżew

    First, ^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$ can't match any string at all because the (?=[\s\.\,\:\;])p part requires p to be also either a whitespace char, or a ., ,, : or ;, which invalidates the whole match at once.

    Second, the ^.+pizza.+$ pattern does not ensure that the matched pizza is not the first or last word in the sentence, because . matches whitespace, too. It does not return anything meaningful either: preg_split removes each match and returns the pieces around it, and since this pattern matches the whole string, the two empty values are the empty strings 1) before and 2) after the match.
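Both points can be checked directly. The snippet below is a small sketch; the subject strings are arbitrary and the variable names are only for illustration:

```php
<?php
// The lookahead demands that the next char is whitespace or punctuation,
// while the pattern then requires that same char to be "p": impossible.
$lookaheadHits = preg_match('/(?=[\s.,:;])p/', 'a pression b');

// The pattern consumes the entire subject, so preg_split is left with
// only the empty pieces before and after the single match.
$chunks = preg_split('/^.+pizza.+$/', 'my pizza is cool');

var_dump($lookaheadHits); // int(0)
var_dump($chunks);        // two empty strings
```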

    That said, all you need is:

    preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)
    

    See the regex demo. Details:

    • ^ - start of string
    • (.*?\w\W+) - Capturing group 1: any zero or more chars, as few as possible, then a word char and then one or more non-word chars
    • pression - a word
    • (\W+\w.*) - Capturing group 2: one or more non-word chars, a word char, and then any zero or more chars as many as possible
    • $ - end of string.

    The s flag lets . match line break characters as well, and the i flag makes the pattern case insensitive.

    See the PHP demo:

    $text = "You can use any regular expression pression inside the lookahead ";
    if (preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)) {
        echo $m[1] . " << | >> " . $m[2];
    }
    // => You can use any regular expression  << | >>  inside the lookahead
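The same pattern can be wrapped in a small helper so that the first-word and last-word cases are easy to tell apart. This is only a sketch: splitAroundWord is a hypothetical name, and passing the word through preg_quote is an added assumption so that words containing regex metacharacters are treated literally.

```php
<?php
// Hypothetical helper: returns [before, after] when $word occurs as an
// inner word of $text, or null when it is missing, first, or last.
function splitAroundWord(string $text, string $word): ?array
{
    // preg_quote escapes regex metacharacters and the "~" delimiter.
    $pattern = '~^(.*?\w\W+)' . preg_quote($word, '~') . '(\W+\w.*)$~is';
    if (preg_match($pattern, $text, $m)) {
        return [$m[1], $m[2]];
    }
    return null;
}

var_dump(splitAroundWord('my pizza is cool', 'pizza')); // 'my ' and ' is cool'
var_dump(splitAroundWord('pizza is cool', 'pizza'));    // NULL: first word
var_dump(splitAroundWord('I like pizza.', 'pizza'));    // NULL: last word
```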