Match one of two lookbehinds

I'm trying to populate a column in a pandas DataFrame by extracting the id of a device from a log file. The problem is that the id may be preceded by one of two different patterns, as follows:

Pattern 1:

(?<=cameraId=\')([a-z0-9-]+)

Pattern 2:

(?<=/live/)([a-z0-9-]+)

Note: a line can never contain both patterns.

The problem is that I use the pandas Series.str.findall() method, and I want a match from either pattern to populate the column.

I can successfully achieve the desired outcome as shown in the code below:

import pandas as pd

line_1 = 'INFO:2021-04-19 00:25:10,647:instance_manager.py:MainProcess:1:got event notificationName=\'DETECTION_STARTED\' cameraId=\'ab1c-ab6c-a6f6-a6d6-ab666\' timestamp=\'2021-04-19T00:24:08.192169Z\''

line_2 = 'INFO:2021-04-19 00:25:11,278:instance_manager.py:MainProcess:1:An old record record for the stream rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream was successfully updated in the DB!'

df = pd.DataFrame(columns=['type', 'ts', 'process', 'subprocess', 'line', 'message'])

line_1_parsed = pd.Series([line_1]).str.extract(r'(?P<type>[^:]+):(?P<ts>.+,\d+):(?P<process>[^:]+):(?P<subprocess>[^:]+):(?P<line>[^:]+):(?P<message>[^$]+)')
line_2_parsed = pd.Series([line_2]).str.extract(r'(?P<type>[^:]+):(?P<ts>.+,\d+):(?P<process>[^:]+):(?P<subprocess>[^:]+):(?P<line>[^:]+):(?P<message>[^$]+)')

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, line_1_parsed, line_2_parsed], ignore_index=True)

df.loc[:, 'cam_id'] = df.loc[:, 'message'].str.findall('(?<=cameraId=\')([a-z0-9-]+)|(?<=/live/)([a-z0-9-]+)')
df

However, the matches are returned as tuples of the form (pattern 1, pattern 2), as shown in the current output:

Current Output:

    type    ts  process     subprocess  line    message     cam_id
0   INFO    2021-04-19 00:25:10,647     instance_manager.py     MainProcess     1   got event notificationName='DETECTION_STARTED'...   [(ab1c-ab6c-a6f6-a6d6-ab666, )]
1   INFO    2021-04-19 00:25:11,278     instance_manager.py     MainProcess     1   An old record record for the stream rtsp://127...   [(, a001-a00a-0016-a006-ab606)]

I understand that this happens because both patterns are tried and a result is returned for each of them, but I would like the output to contain only the pattern that actually matched.
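
To make the tuple behaviour concrete, here is a minimal sketch using the standard `re` module, on the assumption that `Series.str.findall` returns the same shapes per row as `re.findall`. The messages are shortened versions of the two sample lines, and the variable names are only illustrative. The shape of each result depends on how many capturing groups the pattern contains, not on which alternative matched.

```python
import re

# Shortened versions of the two sample messages from the question.
message_1 = "got event notificationName='DETECTION_STARTED' cameraId='ab1c-ab6c-a6f6-a6d6-ab666'"
message_2 = "the stream rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream was updated"

# Two capturing groups: every match becomes a 2-tuple, and the group
# that did not take part in the match is reported as an empty string.
two_groups = r"(?<=cameraId=')([a-z0-9-]+)|(?<=/live/)([a-z0-9-]+)"
tuples_1 = re.findall(two_groups, message_1)
tuples_2 = re.findall(two_groups, message_2)
print(tuples_1)  # [('ab1c-ab6c-a6f6-a6d6-ab666', '')]
print(tuples_2)  # [('', 'a001-a00a-0016-a006-ab606')]

# No capturing group at all: every match is returned as a plain string.
no_group = r"(?<=cameraId=')[a-z0-9-]+"
strings_1 = re.findall(no_group, message_1)
print(strings_1)  # ['ab1c-ab6c-a6f6-a6d6-ab666']
```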

Sure, I can extract the value manually, as follows:

df.loc[:, 'cam_id'] = df.loc[:, 'cam_id'].apply(lambda cam_id_tuple: cam_id_tuple[0][0] if cam_id_tuple[0][0] != '' else cam_id_tuple[0][1])
df

but that is a rather cumbersome solution, and it does not extend well if I want to add more patterns.

Desired Output:

    type    ts  process     subprocess  line    message     cam_id
0   INFO    2021-04-19 00:25:10,647     instance_manager.py     MainProcess     1   got event notificationName='DETECTION_STARTED'...   [ab1c-ab6c-a6f6-a6d6-ab666]
1   INFO    2021-04-19 00:25:11,278     instance_manager.py     MainProcess     1   An old record record for the stream rtsp://127...   [a001-a00a-0016-a006-ab606]

Note: the cam_id column contains strings, not tuples.

Thanks in advance.

2 answers

  • answered 2021-04-21 13:56 Shubham Sharma

    We can use str.extract with a regex pattern that has a single capturing group:

    df['message'].str.extract(r'(?:cameraId=\'|/live/)([a-z0-9-]+)', expand=False)
    

    0    ab1c-ab6c-a6f6-a6d6-ab666
    1    a001-a00a-0016-a006-ab606
    Name: message, dtype: object
    

    Regex details:

    • (?:cameraId=\'|/live/): Non capturing group
      • cameraId=\' : First alternative matches the characters cameraId=' literally
      • /live/ : Second alternative matches the characters /live/ literally
    • ([a-z0-9-]+) : First capturing group
      • [a-z0-9-]+ : Matches any character present in the list [a-z0-9-] one or more times

    See the online regex demo
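
Since the question also asks for something that extends to more patterns, here is a rough sketch of building the single-group pattern from a list of prefixes. The `prefixes` list and the variable names are assumptions for illustration and do not come from the answer above. The second variant keeps the lookbehinds from the question. It gives each prefix its own lookbehind, because Python's `re` requires every lookbehind to be fixed-width.

```python
import re

# Hypothetical list of prefixes; append entries to support further log formats.
prefixes = ["cameraId='", "/live/"]

# Non-capturing alternation of the escaped prefixes, then ONE capturing group.
prefix_pattern = "(?:" + "|".join(re.escape(p) for p in prefixes) + ")([a-z0-9-]+)"

# Same idea with lookbehinds: one fixed-width lookbehind per prefix.
lookbehind_pattern = (
    "(?:" + "|".join(f"(?<={re.escape(p)})" for p in prefixes) + ")([a-z0-9-]+)"
)

messages = [
    "got event notificationName='DETECTION_STARTED' cameraId='ab1c-ab6c-a6f6-a6d6-ab666'",
    "the stream rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream was updated",
]

# With a single capturing group, findall returns plain strings, not tuples.
from_prefix = [re.findall(prefix_pattern, m) for m in messages]
from_lookbehind = [re.findall(lookbehind_pattern, m) for m in messages]
print(from_prefix)      # [['ab1c-ab6c-a6f6-a6d6-ab666'], ['a001-a00a-0016-a006-ab606']]
print(from_lookbehind)  # [['ab1c-ab6c-a6f6-a6d6-ab666'], ['a001-a00a-0016-a006-ab606']]
```

Either pattern string can be passed to `df['message'].str.findall(...)` to get a list of strings per row, or to `df['message'].str.extract(..., expand=False)` to get a plain string per row.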

  • answered 2021-04-21 14:52 RavinderSingh13

    With your shown samples, you could also try the following:

    df['message'].str.extract(r'.*(?:live\/|cameraId=\')([^\'.]*)', expand=False)
    

    The output of the above code will be:

    0   ab1c-ab6c-a6f6-a6d6-ab666
    1   a001-a00a-0016-a006-ab606
    

    Here is the Online demo for above code

    Explanation: a detailed breakdown of the pattern above.

    .*(?:live\/|cameraId=\')  ##Match from the start of the string up to live/ OR cameraId=' (in a non-capturing group).
    ([^\'.]*)                 ##Create the 1st capturing group, matching everything up to the next ' OR .
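
The two answers' patterns agree on the sample lines but are not equivalent. As a small sketch, the log lines below are hypothetical and not taken from the question. They show two differences: the greedy leading `.*` makes the second pattern pick the last marker in a line, and `[^\'.]*` accepts characters that `[a-z0-9-]+` rejects.

```python
import re

first_answer = r"(?:cameraId='|/live/)([a-z0-9-]+)"
second_answer = r".*(?:live\/|cameraId=\')([^\'.]*)"

# Hypothetical line containing both markers (the question says this never happens).
both_markers = "cameraId='aaa-111' moved to rtsp://host/live/bbb-222.stream"
first_hit = re.search(first_answer, both_markers).group(1)
last_hit = re.search(second_answer, both_markers).group(1)
print(first_hit)  # aaa-111  (leftmost marker wins)
print(last_hit)   # bbb-222  (greedy .* backtracks from the end, so the last marker wins)

# Hypothetical id with upper-case letters and an underscore.
upper_case = "stream rtsp://host/live/CAM_01.stream opened"
strict = re.search(first_answer, upper_case)
loose = re.search(second_answer, upper_case).group(1)
print(strict)  # None  (the character class [a-z0-9-] rejects 'C')
print(loose)   # CAM_01
```

Which behaviour is preferable depends on how strictly the ids need to be validated.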