Pandas - extract method not matching anything

I am having a problem with this seemingly easy task. Here's a recreation of my problem:

I have a dataframe called legal of this form:

+----+-----------------+
|    | legal           |
|----+-----------------|
|  0 | gmbh            |
|  1 | kg              |
|  2 | ag              |
|  3 | GmbH & Co. KGaA |
|  4 | LP              |
|  5 | LLP             |
|  6 | LLLP            |
|  7 | LLC             |
|  8 | PLLC            |
|  9 | corp            |
| 10 | corporation     |
| 11 | inc             |
| 12 | cic             |
| 13 | cio             |
| 14 | ltd             |
| 15 | s.a.            |
+----+-----------------+

It contains all the words that can represent a legal term of a given company.

Now I have another dataframe containing a list of raw company names that might also contain some legal terms. My task is to identify such legal terms for each raw company name in the companies dataframe. I want the match to work regardless of case (uppercase, lowercase, or mixed), so I am using a regex with the extract method.

For the sake of the demonstration, my first raw company name is 2&0 Technologies Inc, so for that company I would expect to extract the word inc from my legal dataframe.

This is the simplified version of my code with some comments:

def format_companies(self, legals, locations):
        self.companies['base_name'] = ''
        self.companies['location'] = ''
        self.companies['legal'] = ''
        for i, row in self.companies.iterrows():
            legal_pattern = '/(' + "|".join(row['raw'].split()) + ')/ig'
            legal_pattern = rf'{legal_pattern}'
            print(legal_pattern) # It prints out -> /(2&0|Technologies|Inc)/ig
            legal = legals['legal'].str.extract(legal_pattern)
            print(tabulate(legal, headers='keys', tablefmt='psql')) # Everything is NaN (results are printed below).
            if i >= 0:
                break

The first print statement is just to print out the pattern used in the extract method, which is /(2&0|Technologies|Inc)/ig.

The second print statement shows the result of the extract method and, as noted in the comments, it returns a column of NaNs:

+----+-----+
|    |   0 |
|----+-----|
|  0 | nan |
|  1 | nan |
|  2 | nan |
|  3 | nan |
|  4 | nan |
|  5 | nan |
|  6 | nan |
|  7 | nan |
|  8 | nan |
|  9 | nan |
| 10 | nan |
| 11 | nan |
| 12 | nan |
| 13 | nan |
| 14 | nan |
| 15 | nan |
+----+-----+

I am very confused because if you try out the regular expression /(2&0|Technologies|Inc)/ig on the text 'inc' on https://www.regextester.com/, inc gets selected correctly.

What am I doing wrong?

1 answer

  • answered 2021-09-11 18:33 SeaBean

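    str.extract() uses Python's re syntax, which has no /pattern/flags delimiters. The slashes and the trailing ig in /(...)/ig are therefore matched as literal characters instead of being read as an IGNORECASE flag, so nothing matches. You can fix this in 2 ways: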

    Method 1: Define legal_pattern without the leading / and the trailing /ig:

    legal_pattern = '(' + "|".join(row['raw'].split()) + ')'
    legal_pattern = rf'{legal_pattern}'
    

    Then pass the flag re.IGNORECASE to str.extract(), as follows:

    import re
    legals['legal'].str.extract(legal_pattern, flags=re.IGNORECASE)
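
To check this in isolation, here is a minimal sketch. It assumes a shortened, made-up subset of the legal terms and hard-codes the company name from the question. It shows the slash-delimited pattern matching nothing, and the same alternation matching once the delimiters are dropped and the flag is passed separately.

```python
import re
import pandas as pd

# Shortened, illustrative subset of the legal terms from the question.
legals = pd.DataFrame({"legal": ["gmbh", "corp", "inc", "ltd"]})
tokens = "2&0 Technologies Inc".split()

# Pattern as built in the question: "/" and "/ig" are literal text in Python.
js_style = "/(" + "|".join(tokens) + ")/ig"
# Same alternation without the delimiters.
plain = "(" + "|".join(tokens) + ")"

broken = legals["legal"].str.extract(js_style)
fixed = legals["legal"].str.extract(plain, flags=re.IGNORECASE)

print(broken[0].isna().all())      # True: no row contains a literal "/"
print(fixed[0].dropna().tolist())  # ['inc']
```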
    

    Method 2: Alternatively, put the inline flag (?i) at the start of the regex to indicate IGNORECASE, as follows:

    legal_pattern = '(?i)(' + "|".join(row['raw'].split()) + ')'
    legal_pattern = rf'{legal_pattern}'
    

    Then, you can use str.extract() without specifying re.IGNORECASE:

    legals['legal'].str.extract(legal_pattern)
    

    Result:

          0
    0   NaN
    1   NaN
    2   NaN
    3   NaN
    4   NaN
    5   NaN
    6   NaN
    7   NaN
    8   NaN
    9   NaN
    10  NaN
    11  inc
    12  NaN
    13  NaN
    14  NaN
    15  NaN
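
One caveat with building the pattern from raw name tokens: a token may contain regex metacharacters. The sketch below is an optional hardening, not part of the original answer. It assumes a hypothetical company name, "Acme (UK) Corp", and a made-up subset of the legal terms, and escapes each token with re.escape before joining.

```python
import re
import pandas as pd

# Illustrative subset of the legal terms; the company name is hypothetical.
legals = pd.DataFrame({"legal": ["gmbh", "corp", "inc", "s.a."]})
raw_name = "Acme (UK) Corp"

# Unescaped, the token "(UK)" would be read as an extra capture group.
# re.escape makes every token match as literal text instead.
tokens = [re.escape(tok) for tok in raw_name.split()]
legal_pattern = "(?i)(" + "|".join(tokens) + ")"

# expand=False returns a Series because the pattern has one capture group.
matches = legals["legal"].str.extract(legal_pattern, expand=False)
print(matches.dropna().tolist())  # ['corp']
```

The same escaping works with Method 1; only the way the flag is supplied differs.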
    
