Python regex findall returning empty list after parsing a text file

I'm trying to parse some conversations from an app, stored in a .txt file, with Python's re module. My pattern works on regex101 when tested on a sample of the file, but it doesn't work when I open the file and actually try to parse it.

The structure of the txt file is dd/mm/yyyy hh:mm - Message Author: message text\n, and I'm trying to get only the Name: message \n parts. I'm using the following pattern (?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:.*$). My code looks more or less like the following:

import re

buffer = open(file, 'r', encoding = 'UTF-8').read()
pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)')
matches = re.findall(pattern, buffer)

As the title says, though, findall returns an empty list, and I don't know why. The following sample works as expected on regex101:

20/04/2021 09:54 - Person 1: this is an example text. Will it match?
20/04/2021 09:54 - Person 2: I think it does.

3 answers

  • answered 2021-05-15 17:53 Nikolaos Chatzis

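    Your regex has a small issue: the $ at the end. Without the re.MULTILINE flag, $ matches only at the end of the string (or just before a trailing newline at the end). Note that open(...).read() in your code reads the entire file and puts its content in a single str, so only the last line can match.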

    See:

    >>> buffer = open('test', 'r', encoding = 'UTF-8').read()
    >>> buffer
    '20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n20/04/2021 09:54 - Person 2: I think it does.\n'
    >>>
    >>> pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)')
    >>> matches = pattern.findall(buffer)
    >>> matches
    [('Person 2: ', 'I think it does.')]
    >>>
    >>> # but ...
    >>>
    >>> pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*)')
    >>> matches = pattern.findall(buffer)
    >>> matches
    [('Person 1: ', 'this is an example text. Will it match?'), ('Person 2: ', 'I think it does.')]
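
    To isolate the behaviour of $ from the rest of the pattern, here is a minimal sketch on a made-up two-line string (the text and variable names are illustrative assumptions, not taken from the question's file):

```python
import re

# Hypothetical two-line input, standing in for the contents of a file.
text = "first line\nsecond line\n"

# Without flags, $ matches only at the very end of the string
# (or just before a single trailing newline), so only the last line is found.
default_matches = re.findall(r".+$", text)
print(default_matches)    # ['second line']

# With re.MULTILINE, $ also matches just before every newline.
multiline_matches = re.findall(r".+$", text, re.MULTILINE)
print(multiline_matches)  # ['first line', 'second line']
```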
    

    Note, for completeness, that your regex would work if you read the file line by line, instead of reading it in one go with read():

    >>> pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)')
    >>> with open('test', 'r', encoding = 'UTF-8') as f:
    ...     for line in f:
    ...         m = pattern.search(line)
    ...         if m:
    ...             print(m.groups())
    ... 
    ('Person 1: ', 'this is an example text. Will it match?')
    ('Person 2: ', 'I think it does.')
    

  • answered 2021-05-15 20:50 Ryszard Czech

    KISS: remove the $. Without flags, it matches only at the end of the string. You need to match at the end of each line, and re.M would do that. But removing $ is simply simpler.

    (?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*)
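
    For comparison, the re.M route mentioned above might look like the sketch below. It keeps the original pattern, including the $, and only adds the flag. The sample string is the two lines from the question; the variable names are just for illustration.

```python
import re

# The two sample lines from the question, as a single string (as read() would return).
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# Same pattern as in the question, compiled with re.MULTILINE so that
# $ matches at the end of every line instead of only at the end of the string.
pattern = re.compile(
    r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)',
    re.MULTILINE,
)

matches = pattern.findall(sample)
for author, message in matches:
    print(repr(author), repr(message))
```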
    

    BUT even "kiss"er: you do not need lookbehind or escapes over slashes because re.findall returns captured strings if you use a capturing group in the expression.

    Use

    import re

    pattern = re.compile(r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)')
    with open(file, 'r', encoding = 'UTF-8') as f:
        # read the whole file into one string and collect one dict per matching line
        matches = [match.groupdict() for match in pattern.finditer(f.read())]
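
    One caveat worth checking against your own data: the greedy (?P<name>.*) backtracks to the last colon on the line, so a message that itself contains a colon ends up partly in the name group. The sketch below uses a made-up line to show this. The lazy variant (.*?) is an assumption of mine rather than part of the pattern above.

```python
import re

# Hypothetical line whose message text contains a colon.
line = "20/04/2021 09:54 - Person 1: note: this message has a colon\n"

# Pattern from the answer: the name group is greedy.
greedy = re.compile(
    r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)'
)
# Illustrative variant: a lazy name group stops at the first colon instead.
lazy = re.compile(
    r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*?):\s*(?P<message>.*)'
)

greedy_result = greedy.search(line).groupdict()
lazy_result = lazy.search(line).groupdict()

print(greedy_result)  # name is 'Person 1: note'
print(lazy_result)    # name is 'Person 1'
```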
    


    EXPLANATION

    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      /                        '/'
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      /                        '/'
    --------------------------------------------------------------------------------
      \d{4}                    digits (0-9) (4 times)
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      :                        ':'
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      -                        '-'
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    

  • answered 2021-05-15 21:38 Jan

    Lookarounds are "expensive". Better match what you want and capture the interesting parts.
    That said, you might get along with a simpler expression:

    ^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)
    

    See a demo on regex101.com.
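
    A sketch of how this expression could be used from Python on the sample from the question. Because the pattern starts with ^, it assumes the re.MULTILINE flag when applied to the whole file content at once; without the flag, ^ matches only at the start of the string, so only the first line would be found. The variable names are illustrative.

```python
import re

# The two sample lines from the question, as a single string.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# re.MULTILINE lets ^ match at the start of every line.
pattern = re.compile(
    r'^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)',
    re.MULTILINE,
)

# One dict per message, keyed by the named groups.
messages = [m.groupdict() for m in pattern.finditer(sample)]
for entry in messages:
    print(entry['person'], '->', entry['text'])
```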