Unable to print expected name using regex in Python

I am trying to print names together with their titles (Mr, Mrs, Ms), but one name is not matched as expected: Mrs. Raj comes back as just Mrs, as shown below.

Python version 3.7.7

import re
string4 = 'Mr. Venkat Mr Raj Mr.RK Mr T Mrs Venkat Mrs. Raj Ms Githa Ms. Seetha'
re.findall(r'[Mm][r-sR-S].?\s?[a-zA-Z]*\w', string4)

Output:

['Mr. Venkat',
 'Mr Raj',
 'Mr.RK',
 'Mr T',
 'Mrs Venkat',
 'Mrs',
 'Ms Githa',
 'Ms. Seetha']

2 answers

  • answered 2021-05-15 07:37 Tim Biegeleisen

    I would use the pattern \bMr?s?\.?\s*\w+\b here:

    import re
    string4 = 'Mr. Venkat Mr Raj Mr.RK Mr T Mrs Venkat Mrs. Raj Ms Githa Ms. Seetha'
    names = re.findall(r'\bMr?s?\.?\s*\w+\b', string4)
    print(names)
    

    This prints:

    ['Mr. Venkat', 'Mr Raj', 'Mr.RK', 'Mr T', 'Mrs Venkat', 'Mrs. Raj', 'Ms Githa', 'Ms. Seetha']
    

    The reason your current pattern

    [Mm][r-sR-S].?\s?[a-zA-Z]*\w
    

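    does not match Mrs. Raj is that the character class [r-sR-S] matches exactly one letter, not two. After M and r, the unescaped .? (which matches any character) consumes the s, and nothing left in the pattern can match the literal period that follows, so the engine backtracks and settles for Mrs.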
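
    To see which part of the pattern consumes what, here is a small sketch that wraps each piece of the original pattern in a capturing group. The second pattern is only one possible minimal repair (an assumption, not something from the question): it escapes the period and lets the character class match one or two letters.

```python
import re

sample = 'Mrs. Raj'

# The asker's pattern with each piece wrapped in a group, to show what it consumes.
original = re.compile(r'([Mm])([r-sR-S])(.?)(\s?)([a-zA-Z]*)(\w)')
m_orig = original.match(sample)
print(m_orig.group())   # Mrs
print(m_orig.groups())  # ('M', 'r', '', '', '', 's')

# Hypothetical minimal repair: escaped period, class may match one or two letters.
repaired = re.compile(r'[Mm][r-sR-S]{1,2}\.?\s?[a-zA-Z]*\w')
m_fix = repaired.match(sample)
print(m_fix.group())    # Mrs. Raj
```

    The groups show that the s ends up in the final \w, which leaves no way to continue past the period. Note that the repaired class is still loose (it would also accept Mss, for example), which is why the pattern above spells out r and s separately.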

  • answered 2021-05-15 07:50 Tuan Chau

    r'\b[Mm][rR]?[sS]?\.?\s*\w+\b'
    

    Bonus: this one also works with Miss:

    r'\b[Mm][rR]?[iI]?[sS]{0,2}\.?\s*\w+\b'
    
    import re
    string4 = 'Mr. Venkat Mr Raj Mr.RK Mr T Mrs Venkat Mrs. Raj Ms Githa Ms. Seetha Miss. A'
    
    names = re.findall(r'\b[Mm][rR]?[iI]?[sS]{0,2}\.?\s*\w+\b', string4)
    print(names)
    

    Result

    ['Mr. Venkat', 'Mr Raj', 'Mr.RK', 'Mr T', 'Mrs Venkat', 'Mrs. Raj', 'Ms Githa', 'Ms. Seetha', 'Miss. A']
    

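    Update, based on the comment from @tripleee: to avoid false positives such as M. Name, or Mris Name with my bonus solution, we should list all the titles explicitly. Mrs has to come before Mr in the alternation; otherwise Mr matches first and \w+ swallows the trailing s, so Mrs Venkat yields only Mrs. For the same reason, a word that merely starts with Mr (such as Mris) still matches on its own, because the separator after the title is optional.

    r'\b(?:Mrs|Mr|Ms|Miss)\.?\s*\w+\b'
    

    For me this is easier to read than the previous regexes, but more alternatives have to be added if the letter case is not known in advance.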
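
    As a rough sketch of how the explicit list could be tightened further, the variant below makes two assumptions that are not part of the answer above: a separator (a period, whitespace, or both) is required after the title, and re.IGNORECASE is used instead of listing every upper/lower-case combination. The name strict is only illustrative.

```python
import re

string4 = 'Mr. Venkat Mr Raj Mr.RK Mr T Mrs Venkat Mrs. Raj Ms Githa Ms. Seetha Miss. A'

# Assumption: the title must be followed by a period (plus optional spaces)
# or by at least one whitespace character before the name starts.
strict = re.compile(r'\b(?:Mrs|Miss|Mr|Ms)(?:\.\s*|\s+)\w+\b', re.IGNORECASE)

names = strict.findall(string4)
print(names)

# 'M. Name' and 'Mris Devi' are skipped; the lower-case 'mrs raj' is still found.
others = strict.findall('M. Name Mris Devi mrs raj')
print(others)
```

    Because a separator is mandatory here, Mr can no longer match the first two letters of Mrs or Mris, so the order of the alternatives stops mattering. The trade-off is that a title glued directly to a name without a period (for example MrRaj) is no longer matched.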