Regex Error and Improvement Driving Licence Data Extraction

I am trying to extract the Name, License No., Date Of Issue and Validity from an Image I processed using Pytesseract. I am quite a lot confused with regex but still went through few documentations and codes over the web.

I got till here:

import pytesseract
import cv2
import re

import cv2

from PIL import Image
import numpy as np
import datetime

from dateutil.relativedelta import relativedelta

def driver_license(filename):  
    """
    This function will handle the core OCR processing of images.
    """
    
    i = cv2.imread(filename)
    newdata=pytesseract.image_to_osd(i)
    angle = re.search('(?<=Rotate: )\d+', newdata).group(0)
    angle = int(angle)
    i = Image.open(filename)
    if angle != 0:
       #with Image.open("ro2.jpg") as i:
        rot_angle = 360 - angle
        i = i.rotate(rot_angle, expand="True")
        i.save(filename)
    
    i = cv2.imread(filename)
    # Convert to gray
    i = cv2.cvtColor(i, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    i = cv2.dilate(i, kernel, iterations=1)
    i = cv2.erode(i, kernel, iterations=1)
    
    txt = pytesseract.image_to_string(i)
    print(txt)
        
    text = []
    data = {
        'firstName': None,
        'lastName': None,
        'age': None,
        'documentNumber': None
    }
    
    c = 0
    print(txt)
    
    #Splitting lines
    lines = txt.split('\n')
    
    for lin in lines:
        c = c + 1
        s = lin.strip()
        s = s.replace('\n','')
        if s:
            s = s.rstrip()
            s = s.lstrip()
            text.append(s)

            try:
                if re.match(r".*Name|.*name|.*NAME", s):           
                    name = re.sub('[^a-zA-Z]+', ' ', s)
                    name = name.replace('Name', '')
                    name = name.replace('name', '')
                    name = name.replace('NAME', '')
                    name = name.replace(':', '')
                    name = name.rstrip()
                    name = name.lstrip()
                    nmlt = name.split(" ")
                    data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                    data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]-\d{13}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]-\d{13}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace('-', '')
                    if not data['firstName']:
                        name = lines[c]           
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]\d{2} \d{11}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]\d{2} \d{11}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace(' ', '')
                    if not data['firstName']:
                        name = lines[c]           
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.match(r".*DOB|.*dob|.*Dob", s):         
                    yob = re.sub('[^0-9]+', ' ', s)
                    yob = re.search(r'\d\d\d\d', yob)
                    data['age'] = datetime.datetime.now().year - int(yob.group())
            except:
                pass

    print(data)
    

I need to extract the Validity and Issue Date as well. But not getting anywhere near it. Also, I have seen using regex shortens the code like a lot so is there any better optimal way for it?

My input data is a string somewhat like this:

Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India

Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL

: SHRI DARSHAN SINGH GILL

DOB: 10/05/1966 BG: U
Address :

104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034

  

Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010

(Holder's Sig natu re)

Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR

Or like this:

in

Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India

2

   
    
   

Licence No. : DL-0320170595326 () WN
Name : AZAZ AHAMADSIDDIQUIE
s/w/D : SALAHUDDIN ALI
____... DOB: 26/12/1992 BG: O+
\ \ Address:
—.~J ~—; ROO NO-25 AMK BOYS HOSTEL, J.
— NAGAR, DELHI 110025
Auth to Drive Date of Issue
M.CYL. 12/12/2017
4 wt 4
Iseue Date: 12/12/2017 a
falidity(NT) < 2037
Validity(T) +: NA /
Inv CarrNo : NA te sntian sana

Note: In the second example you wouldn't get the validity, will optimise the OCR for later. Any proper guide which can help me with regex which is a bit simpler would be good.

2 answers

  • answered 2021-09-22 06:18 Alireza

    You can use this pattern: (?<=KEY\s*:\s*)\b[^\n]+ and replace KEY with one of the issues of the date, License No. and others. Also for this pattern, you need to use regex library.

    Code:

    import regex
    
    text1 = """
    Transport Department Government of NCT of Delhi
    Licence to Drive Vehicles Throughout India
    
    Licence No. : DL-0820100052000 (P) R
    N : PARMINDER PAL SINGH GILL
    
    : SHRI DARSHAN SINGH GILL
    
    DOB: 10/05/1966 BG: U
    Address :
    
    104 SHARDA APPTT WEST ENCLAVE
    PITAMPURA DELHI 110034
    
    
    
    Auth to Drive Date of Issue
    M.CYL. 24/02/2010
    LMV-NT 24/02/2010
    
    (Holder's Sig natu re)
    
    Issue Date : 20/05/2016
    Validity(NT) : 19/05/2021 : c
    Validity(T) : NA Issuing Authority
    InvCarrNo : NA NWZ-I, WAZIRPUR
    """
    
    for key in ('Issue Date', 'Licence No\.', 'N', 'Validity\(NT\)'):
        print(regex.findall(fr"(?<={key}\s*:\s*)\b[^\n]+", text1, regex.IGNORECASE))
    
    

    Output:

    ['20/05/2016']
    ['DL-0820100052000 (P) R']
    ['PARMINDER PAL SINGH GILL']
    ['19/05/2021 : c']
    

  • answered 2021-09-22 07:20 Wiktor Stribiżew

    You can also use re with a single regex based on alternation that will capture your keys and values:

    import re
    text = "Transport Department Government of NCT of Delhi\nLicence to Drive Vehicles Throughout India\n\nLicence No. : DL-0820100052000 (P) R\nN : PARMINDER PAL SINGH GILL\n\n: SHRI DARSHAN SINGH GILL\n\nDOB: 10/05/1966 BG: U\nAddress :\n\n104 SHARDA APPTT WEST ENCLAVE\nPITAMPURA DELHI 110034\n\n\n\nAuth to Drive Date of Issue\nM.CYL. 24/02/2010\nLMV-NT 24/02/2010\n\n(Holder's Sig natu re)\n\nIssue Date : 20/05/2016\nValidity(NT) : 19/05/2021 : c\nValidity(T) : NA Issuing Authority\nInvCarrNo : NA NWZ-I, WAZIRPUR"
    search_phrases = ['Issue Date', 'Licence No.', 'N', 'Validity(NT)']
    reg = r"\b({})\s*:\W*(.+)".format( "|".join(sorted(map(re.escape, search_phrases), key=len, reverse=True)) )
    print(re.findall(reg, text, re.IGNORECASE))
    

    Output of this short online Python demo:

    [('Licence No.', 'DL-0820100052000 (P) R'), ('N', 'PARMINDER PAL SINGH GILL'), ('Issue Date', '20/05/2016'), ('Validity(NT)', '19/05/2021 : c')]
    

    The regex is

    \b(Validity\(NT\)|Licence\ No\.|Issue\ Date|N)\s*:\W*(.+)
    

    See its online demo.

    Details:

    • map(re.escape, search_phrases) - escapes all special chars in your search phrases to be used as literal texts in a regex (else, . will match any chars, ? won't match a ? char, etc.)
    • sorted(..., key=len, reverse=True) - sorts the search phrases by length in descending order (to get longer matches first)
    • "|".join(...) - creates an alternation pattern, a|b|c|...
    • r"\b({})\s*:\W*(.+)".format( ... ) - creates the final regex.

    Regex details

    • \b - a word boundary (NOTE: replace with (?m)^ if your matches occur at the beginning of a line)
    • (Validity\(NT\)|Licence\ No\.|Issue\ Date|N) - Group 1: one of the search phrases
    • \s* - zero or more whitespaces
    • : - a colon
    • \W* - zero or more non-word chars
    • (.+) - (capturing) Group 2: one or more chars other than line break chars, as many as possible.

How many English words
do you know?
Test your English vocabulary size, and measure
how many words do you know
Online Test
Powered by Examplum