Workaround for Impala Regex lookahead and lookbehind

If I use Hive, the below works fine. But if I use Impala, it throws error:

select regexp_replace("foobarbarfoo","bar(?=bar)","<NA>");

WARNINGS: Could not compile regexp pattern: bar(?=bar)
Error: invalid perl operator: (?=

Basically, Impala doesn't support lookahead and lookbehind

https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_incompatible_changes.html#incompatible_changes_200

Is there a workaround for this today? Maybe use UDF?

Thanks.

1 answer

  • answered 2018-07-11 06:55 Wiktor Stribiżew

    Since you are using regexp_replace, match and capture the part of string you want to keep (but want to use as must-have context) and replace with a backreference. See the regexp_replace Impala reference:

    These examples show how you can replace parts of a string matching a pattern with replacement text, which can include backreferences to any () groups in the pattern string. The backreference numbers start at 1, and any \characters must be escaped as \\.

    So, here, you may use

    select regexp_replace("foobarbarfoo","bar(bar)","<NA>\\1");
                                             ^   ^       ^^^   
    

    Note it will not work to replace consecutive matches, however, it will work in the current scenario and foobarbarfoo will turn into foo<NA>barfoo (note that Go regex engine is also RE2, hence this option is chosen at regex101.com).