Regex pattern to extract Hearst patterns

I am new to regex and cannot work out how to extract hyponym-hypernym pairs as a list or as tuples. I tried the following pattern, but I get no matches:

(NP_[\w.]*(, NP_[\w.]*)*,? (and)? other NP_[\w.]*)

I have the following annotated sentences for the 'and other' pattern:

  1. NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges .
  2. The NP_network has asked NP_big_bang_theory_co-creator_bill prady to mastermind the NP_revival , which would see the NP_return of NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear and other NP_old_favorites .

I want to extract a list such as:

[NP_dui,NP_fleeing or NP_evading_police, NP_possible_charges]

OR

(NP_dui,NP_possible_charges)
(NP_fleeing or NP_evading_police,NP_possible_charges)

Similarly, for sentence 2:

[NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear, NP_old_favorites]

or similar tuples.

Any help would be appreciated.

1 answer

  • answered 2021-10-12 21:38 Ryszard Czech

    Use

    NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+
    

    This matches each run of NP_ terms joined by a comma, "or", "and", or "and other". Next, extract the individual entities from each matched run with NP_[\w.]*.

    Python code:

    import re
    
    test_strs = ["NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges.",
    "The NP_network has asked NP_big_bang_theory_co-creator_bill prady to mastermind the NP_revival , which would see the NP_return of NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear and other NP_old_favorites ."]
    p = r'NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+'
    
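    for test_str in test_strs:
        matches = []
        # First pass: find each run of NP_ terms joined by connectors.
        for match in re.findall(p, test_str):
            # Second pass: pull the individual NP_ terms out of the run.
            matches.extend(re.findall(r'NP_[\w.]*\b', match))
        print(matches)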
    

    Results: ['NP_dui', 'NP_fleeing', 'NP_evading_police', 'NP_possible_charges']
    ['NP_frog', 'NP_miss_piggy', 'NP_fozzie_bear', 'NP_old_favorites']
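
The question also asks for (hyponym, hypernym) tuples. A minimal sketch of that step, reusing the pattern above, could look like the following. The function name `hearst_pairs` is made up for illustration, and the code assumes that the last NP_ term in a matched run is the hypernym and that only runs containing the word "other" count as Hearst patterns. As in the results above, "NP_kermit the" is not captured, because "the" is not one of the connectors the pattern allows.

```python
import re

# Same run pattern as in the answer, compiled once.
CHAIN = re.compile(
    r'NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+'
)
TERM = re.compile(r'NP_[\w.]*\b')


def hearst_pairs(sentence):
    """Return (hyponym, hypernym) tuples for each 'and other' run."""
    pairs = []
    for match in CHAIN.finditer(sentence):
        chain = match.group(0)
        # Assumption: a run without the word "other" is a plain list,
        # not a Hearst pattern, so it is skipped.
        if not re.search(r'\bother\b', chain):
            continue
        terms = TERM.findall(chain)
        # Assumption: the term after "other" is last, and is the hypernym.
        hypernym = terms[-1]
        pairs.extend((hyponym, hypernym) for hyponym in terms[:-1])
    return pairs


sentence = ("NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , "
            "and other NP_possible_charges .")
for pair in hearst_pairs(sentence):
    print(pair)
```

Each printed tuple pairs one hyponym with the shared hypernym, so "NP_fleeing" and "NP_evading_police" come out as separate hyponyms rather than as the single phrase "NP_fleeing or NP_evading_police".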

    EXPLANATION

    --------------------------------------------------------------------------------
      NP_                      'NP_'
    --------------------------------------------------------------------------------
      [\w.]*                   any character of: word characters (a-z, A-
                               Z, 0-9, _), '.' (0 or more times (matching
                               the most amount possible))
    --------------------------------------------------------------------------------
      (?:                      group, but do not capture (1 or more times
                               (matching the most amount possible)):
    --------------------------------------------------------------------------------
        \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                                 or more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
        (?:                      group, but do not capture:
    --------------------------------------------------------------------------------
          ,                        ','
    --------------------------------------------------------------------------------
         |                        OR
    --------------------------------------------------------------------------------
          \b                       the boundary between a word char (\w)
                                   and something that is not a word char
    --------------------------------------------------------------------------------
          or                       'or'
    --------------------------------------------------------------------------------
          \b                       the boundary between a word char (\w)
                                   and something that is not a word char
    --------------------------------------------------------------------------------
         |                        OR
    --------------------------------------------------------------------------------
          ,?                       ',' (optional (matching the most
                                   amount possible))
    --------------------------------------------------------------------------------
          \s*                      whitespace (\n, \r, \t, \f, and " ")
                                   (0 or more times (matching the most
                                   amount possible))
    --------------------------------------------------------------------------------
          and                      'and'
    --------------------------------------------------------------------------------
          (?:                      group, but do not capture (optional
                                   (matching the most amount possible)):
    --------------------------------------------------------------------------------
            \s+                      whitespace (\n, \r, \t, \f, and " ")
                                     (1 or more times (matching the most
                                     amount possible))
    --------------------------------------------------------------------------------
            other                    'other'
    --------------------------------------------------------------------------------
          )?                       end of grouping
    --------------------------------------------------------------------------------
          \b                       the boundary between a word char (\w)
                                   and something that is not a word char
    --------------------------------------------------------------------------------
        )                        end of grouping
    --------------------------------------------------------------------------------
        \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                                 or more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
        NP_                      'NP_'
    --------------------------------------------------------------------------------
        [\w.]*                   any character of: word characters (a-z,
                                 A-Z, 0-9, _), '.' (0 or more times
                                 (matching the most amount possible))
    --------------------------------------------------------------------------------
      )+                       end of grouping
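
The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern. This is only a sketch of the same expression in a different layout; the name `CHAIN_VERBOSE` is an illustrative choice.

```python
import re

CHAIN_VERBOSE = re.compile(r'''
    NP_[\w.]*               # first NP_ term
    (?:                     # one or more "connector + term" repetitions
        \s*
        (?:
            ,               # a comma
          | \bor\b          # or the word "or"
          | ,?\s*and        # or "and", optionally preceded by a comma
            (?:\s+other)?   #   and optionally followed by "other"
            \b
        )
        \s*
        NP_[\w.]*           # next NP_ term
    )+
''', re.VERBOSE)

sentence = ("NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , "
            "and other NP_possible_charges .")
print(CHAIN_VERBOSE.findall(sentence))
```

Because every group is non-capturing, `findall` returns the full matched runs, just as it does with the one-line pattern.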
    
