Extract text from an image using OCR in Python
I want to extract text from a specific area of an image, such as the name and ID number from an identity card. The card from which I want to extract text is in Chinese (a Chinese ID card). I have tried the code below, but it only extracts the address and date of birth, which I don't need; I just need the name and ID number.
import cv2
from PIL import Image
import pytesseract
import argparse
import os
image = cv2.imread("E:/face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename,gray)
text = pytesseract.image_to_string(Image.open(filename), lang='chi_sim')
print(text)
os.remove(filename)
I have also attached the image from which I am trying to extract text. I have tried everything I know but have not succeeded; any help and guidance would be appreciated.
1 answer

I can suggest a preprocessing step prior to extracting the textual information. The code is simple to follow.
Code:
import cv2
import numpy as np

image = cv2.imread(r'C:\Users\Jackson\Desktop\face.jpg')

# dilation on the green channel
dilated_img = cv2.dilate(image[:, :, 1], np.ones((7, 7), np.uint8))
bg_img = cv2.medianBlur(dilated_img, 21)

# finding absolute difference to preserve edges
diff_img = 255 - cv2.absdiff(image[:, :, 1], bg_img)

# normalizing between 0 and 255
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255,
                         norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)
cv2.imshow('norm_img', cv2.resize(norm_img, (0, 0), fx=0.5, fy=0.5))

# Otsu threshold
th = cv2.threshold(norm_img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imshow('th', cv2.resize(th, (0, 0), fx=0.5, fy=0.5))
Use it and let me know if you are able to find the relevant textual information!
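Since only the name and ID number are needed, another option (a suggestion beyond the preprocessing above) is to crop those regions before running Tesseract, so the OCR never sees the address or date-of-birth fields. A minimal sketch using plain numpy slicing; the coordinates in `regions` are hypothetical placeholders that would have to be measured from the actual card layout:

```python
import numpy as np

# Stand-in for the loaded card image (in practice: image = cv2.imread(...)).
card = np.zeros((400, 640, 3), dtype=np.uint8)

# Hypothetical field locations as (y1, y2, x1, x2) pixel coordinates;
# measure these once from the real card, since the ID-card layout is fixed.
regions = {
    "name": (40, 90, 150, 400),
    "id_number": (330, 380, 200, 620),
}

# numpy slicing extracts each rectangular region of interest.
crops = {field: card[y1:y2, x1:x2]
         for field, (y1, y2, x1, x2) in regions.items()}
```

Each crop would then go through the preprocessing above and into pytesseract.image_to_string(..., lang='chi_sim') separately, so the output contains only the field you asked for.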
See also questions close to this topic

Different signs of Principal Components
I have implemented PCA in Python. I used MNIST data and reduced it to 2D; after that I used KNN to classify the data. I repeated the same with scikit-learn. The result is that my own PCA gives a much lower accuracy. I compared the principal components and see that the signs of some components differ from scikit-learn's results. I have absolutely no idea how to fix this. I hope one of you sees my mistake or misunderstanding.
class Dimension_reduction:
    def PCA(self, X, dimensions):
        covariance_matrix = self.find_covariance(X)
        eigenvalues, eigenvectors = self.eigenvalue_decomposition(covariance_matrix)
        eigenpairs = self.sort_eigenvalues(eigenvalues, eigenvectors)
        projection_matrix = self.projection_matrix(eigenpairs, dimensions)
        new_featurespace = self.project_reduced_featurespace(X, projection_matrix)
        return new_featurespace

    def find_covariance(self, X):
        means = np.mean(X, axis=0)
        covariance_matrix = (X - means).T.dot(X - means) / (len(X) - 1)
        return covariance_matrix

    def eigenvalue_decomposition(self, covariance_matrix):
        eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
        return eigenvalues, eigenvectors

    def sort_eigenvalues(self, eigenvalues, eigenvectors):
        eigenpairs = [(np.abs(eigenvalues[i]), eigenvectors[:, i])
                      for i in range(len(eigenvalues))]
        eigenpairs.sort()
        eigenpairs.reverse()
        return eigenpairs

    def projection_matrix(self, eigenpairs, dimensions):
        projection_matrix = np.array(eigenpairs[0][1].reshape(len(eigenpairs[0][1]), 1))
        for i in range(1, dimensions, 1):
            projection_matrix = np.concatenate(
                (projection_matrix,
                 np.array(eigenpairs[i][1].reshape(len(eigenpairs[0][1]), 1))),
                axis=1)
        return projection_matrix

    def project_reduced_featurespace(self, X, projection_matrix):
        return X.dot(projection_matrix)

    def scatter_plot(self, new_X, label_number, labels):
        partial_X = []
        x = []
        y = []
        for i in range(label_number):
            partial_X.append([])
            x.append([])
            y.append([])
        for i in range(len(new_X)):
            label = labels[i]
            partial_X[int(label)].append(new_X[i])
        colors = plt.cm.rainbow(np.linspace(0, 1, label_number))
        for i in range(label_number):
            for j in range(len(partial_X[i])):
                x[i].append(partial_X[i][j][0])
                y[i].append(partial_X[i][j][1])
        i = 0
        for x, y, c in zip(x, y, colors):
            plt.scatter(x, y, c, label=str(i))
        return plt, colors

    def eigenfaces(self, X):
        covariance_matrix = self.find_covariance(X)
        eigenvalues, eigenvectors = self.eigenvalue_decomposition(covariance_matrix)
        eigenpairs = self.sort_eigenvalues(eigenvalues, eigenvectors)
        return eigenpairs
I also used different implementations
def PCA(X):
    PC = []
    cov = np.cov(X.T)
    w, v = np.linalg.eig(cov)
    eig_pairs = [(np.abs(w[i]), v[:, i]) for i in range(len(w))]
    eig_pairs.sort()
    eig_pairs.reverse()
    PC.append(eig_pairs[0][1])
    PC.append(eig_pairs[1][1])
    Y = X.dot(np.array(PC).T)
    return Y

def PCA(data, dims_rescaled_data=2):
    """
    returns: data transformed in 2 dims/columns + regenerated original data
    pass in: data as 2D NumPy array
    """
    import numpy as NP
    from scipy import linalg as LA
    m, n = data.shape
    # mean center the data
    data -= NP.mean(data, axis=0)
    # calculate the covariance matrix
    R = NP.cov(data, rowvar=False)
    # calculate eigenvectors & eigenvalues of the covariance matrix;
    # use 'eigh' rather than 'eig' since R is symmetric,
    # the performance gain is substantial
    evals, evecs = LA.eigh(R)
    # sort eigenvalues in decreasing order
    idx = NP.argsort(evals)[::-1]
    # sort eigenvectors according to the same index
    evecs = evecs[:, idx]
    evals = evals[idx]
    # select the first n eigenvectors (n is the desired dimension
    # of the rescaled data array, i.e. dims_rescaled_data)
    evecs = evecs[:, :dims_rescaled_data]
    # carry out the transformation on the data using the eigenvectors
    # and return the rescaled data, eigenvalues, and eigenvectors
    return NP.dot(evecs.T, data.T).T, evals, evecs

def pca_svd(X, num=2):
    X = X - np.mean(X, axis=0)
    u, s, v = np.linalg.svd(X)
    v = v.T[:, :num]
    return np.dot(X, v)
In sum there are four different implementations. The result of the first is:
# Training array([[1.01031446, 6.71282428], [3.03212724, 1.64169381], [ 2.95288108, 1.44413258], ..., [1.15329784, 2.83978701], [8.02795144, 2.12452378], [ 9.83911408, 3.2389573 ]]) # Testing array([[ 5.15053345, 6.79771421], [ 1.84247302, 0.58932415], [ 1.66957196, 3.89696398], ..., [ 5.22253275, 1.74628625], [8.2209684 , 0.32435677], [11.00041468, 4.62978653]])
For the second:
#train array([[ 4.94267554, 6.22892054], [ 6.96448832, 1.15779007], [ 0.97948 , 1.92803631], ..., [ 5.08565892, 3.32369075], [11.96031252, 1.64062004], [ 5.906753 , 2.75505357]]) #test array([[ 1.39046844, 7.2344142 ], [ 1.91759199, 0.15262416], [ 2.09049305, 4.33366397], ..., [ 1.46246774, 2.18298624], [11.98103342, 0.76105676], [ 7.24034966, 4.19308654]])
For the third:
#train array([[ 4.94267554, 6.22892054], [ 6.96448832, 1.15779007], [ 0.97948 , 1.92803631], ..., [ 5.08565892, 3.32369075], [11.96031252, 1.64062004], [ 5.906753 , 2.75505357]]) #test array([[1.39046844, 7.2344142 ], [ 1.91759199, 0.15262416], [ 2.09049305, 4.33366397], ..., [1.46246774, 2.18298624], [11.98103342, 0.76105676], [7.24034966, 4.19308654]])
for the svd:
#train array([[ 4.94267554, 6.22892054], [ 6.96448832, 1.15779007], [ 0.97948 , 1.92803631], ..., [ 5.08565892, 3.32369075], [11.96031252, 1.64062004], [5.906753 , 2.75505357]]) #test xt2 array([[ 1.39046844, 7.2344142 ], [ 1.91759199, 0.15262416], [ 2.09049305, 4.33366397], ..., [ 1.46246774, 2.18298624], [11.98103342, 0.76105676], [ 7.24034966, 4.19308654]])
At last the PC's from scikit:
#train array([[ 4.9426755 , 6.22891427], [ 6.9644884 , 1.15779638], [ 0.97948002, 1.92802868], ..., [ 5.08565916, 3.32364585], [11.96031245, 1.64060628], [5.90675305, 2.7550444 ]]) #test array([[1.39046854, 7.23440228], [ 1.91759194, 0.1526275 ], [ 2.09049303, 4.33366458], ..., [1.46246766, 2.18299325], [11.98103337, 0.76105147], [7.24034972, 4.19309378]])
One can see that the results differ only in sign, but for the classification this is a problem. Additionally, there must be a mistake in the first implementation, but I cannot find it.
Does anybody have an idea?
PS: If I increase the number of PCs in the projection, the sign problem gets worse.
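A note on the sign issue: eigenvector sign is genuinely arbitrary (if A v = t v, then A (-v) = t (-v) too), so different implementations can legitimately disagree in sign. The usual remedy is to impose a convention, e.g. flipping each eigenvector so that its largest-magnitude entry is positive, similar in spirit to the flip scikit-learn applies internally. A hedged sketch:

```python
import numpy as np

def fix_signs(evecs):
    """Flip each eigenvector (column) so that its largest-magnitude
    entry is positive, making the decomposition deterministic up to sign."""
    idx = np.argmax(np.abs(evecs), axis=0)               # row of max |entry| per column
    signs = np.sign(evecs[idx, np.arange(evecs.shape[1])])
    return evecs * signs                                  # broadcast flip per column

# Toy 2x2 example: both columns get flipped by the convention.
V = np.array([[ 0.6, -0.8],
              [-0.8, -0.6]])
V_fixed = fix_signs(V)
```

That said, a consistent sign flip is just a reflection of the projected axis and preserves Euclidean distances, so sign differences alone should not change KNN accuracy as long as train and test data are projected with the same vectors; the accuracy gap more likely comes from a bug in the first implementation, such as the covariance computation.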

Please recommend a suitable programming / analytical approach for my problem.
I am working on a SQL / python project involving a data set of an operating plant.
If there are 50 contributing factors that impact the overall plant efficiency, can I write code to evaluate the impact of each variable?
Which are the minor variables, and which major variables impact the overall efficiency the most?
Please let me know which analytical approaches I can use to achieve the above (e.g. standardization, rescaling, k-means clustering? I don't know...).
I do not need any written code for now; please advise so that I can take my problem in the right direction.
Thank you!
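One common first step for this kind of question is a simple correlation screen of each factor against efficiency. A hedged sketch on synthetic data (all names and numbers below are made up for illustration; with a real plant you would load your SQL extract instead):

```python
import numpy as np

# Synthetic stand-in for the plant data: 500 observations of 50 factors.
rng = np.random.default_rng(0)
n_samples, n_factors = 500, 50
X = rng.normal(size=(n_samples, n_factors))

# Pretend efficiency is driven mainly by factors 3 and 17, plus noise.
efficiency = 2.0 * X[:, 3] - 1.5 * X[:, 17] + rng.normal(scale=0.1, size=n_samples)

# Absolute Pearson correlation of each factor with efficiency.
corrs = np.array([abs(np.corrcoef(X[:, j], efficiency)[0, 1])
                  for j in range(n_factors)])
ranking = np.argsort(corrs)[::-1]   # factor indices, most correlated first
```

Correlation only captures linear, one-variable-at-a-time effects; for interactions and nonlinearities, tree-based feature importances (e.g. a random forest regressor) or mutual information are common next steps. Standardizing the variables matters for methods like k-means or PCA, though it does not change Pearson correlations.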

Which is faster, saving to a 2D list or CSV for millions of data?
I have over 20 million lines of data, each line with 60-200 integer elements. My present method uses:
with open("file.txt") as f:
    for block in reading_file(f):
        for line in block:
            a = line.split(" ")
            op_on_data(a)
where reading_file() is a function that reads around 1000 lines at a time, and op_on_data() is a function where I do some basic operations:

def op_on_data(a):
    if a[0] == "keyw":
        print 'keyw: ', a[0], a[1]
    else:
        # some operations on arr[]
        for v in arr[splicing_here]:
            if v > 100:
                # more operations here
                two_d_list[particular_list_location].append(v)
        for e in arr[splicing_here]:
            if e < 100:
                two_d_list_2[particular_list_location].append(e)
        sys.stdout.flush()
In the end I save the two_d_list to a Pandas DataFrame in ONE move; I do not save in chunks. For around 40,000 lines of a test dataset I got an initial time of ~10.5 s. But when I process the whole dataset, my system crashes after a few million lines, probably because the list gets too large.
I need to know the best way to save the data after doing the operations. Should I keep using lists, or save directly to a CSV file inside the function itself, line by line? How do I improve the speed and prevent the system from crashing?
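One hedged option for the crash: stop accumulating everything in one giant list and instead flush rows to a CSV file in fixed-size chunks, so memory stays bounded regardless of input size. The sketch below uses hypothetical names (process_lines, CHUNK) and synthetic input in place of the real pipeline:

```python
import csv
import os
import tempfile

CHUNK = 1000  # flush to disk every 1000 processed rows (tune as needed)

def process_lines(lines, out_path):
    """Write processed rows to CSV in fixed-size chunks so memory use
    stays bounded no matter how many input lines there are."""
    buffer = []
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for line in lines:
            row = [int(v) for v in line.split()]  # stand-in for op_on_data
            buffer.append(row)
            if len(buffer) >= CHUNK:
                writer.writerows(buffer)
                buffer.clear()
        if buffer:                      # flush the final partial chunk
            writer.writerows(buffer)

# Tiny demonstration with synthetic data.
path = os.path.join(tempfile.gettempdir(), "demo_chunks.csv")
process_lines(("%d %d" % (i, i * 2) for i in range(2500)), path)
with open(path) as f:
    n_rows = sum(1 for _ in f)
```

The resulting file can still be loaded into Pandas afterwards, optionally with pd.read_csv(..., chunksize=...) so that reading stays memory-bounded too.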

cv2.findContours() acts strangely?
I'm finding contours in an image. For every contour I find, I print its bounding rect and area and then draw it onto the image. Oddly, 5 contours were drawn while only 4 contours were printed. Does anyone know what happened here?
>> contour 1
>> (0, 0, 314, 326)
>> 101538.5
>> contour 2
>> (75, 117, 60, 4)
>> 172.0
>> contour 3
>> (216, 106, 3, 64)
>> 124.0
>> contour 4
>> (62, 18, 138, 9)
>> 383.5
import cv2
import numpy as np

img = cv2.imread('1.png')
imgray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(imgray, 127, 255, 0)
_, contours, hier = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
for i, c in enumerate(contours):
    rect = cv2.boundingRect(c)
    area = cv2.contourArea(c)
    print("contour " + str(i + 1))
    print(rect)
    print(area)
cv2.drawContours(img, contours, -1, (0, 255, 0), 1)
cv2.imshow('img', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

CMake 2.8.11.2 Configuring incomplete, errors occurred
CMake Error at CMakeLists.txt:74 (include):
  include could not find load file:
    cmake/OpenCVUtils.cmake

CMake Error at CMakeLists.txt:76 (ocv_clear_vars):
  Unknown CMake command "ocv_clear_vars".

No result when reading an image for Handwritten Recognition using tesseract
Currently I'm trying to implement a handwriting recognition program that will identify the letters A, B, C, D, and E and the numbers 1-100. What I have tried so far is PyTesseract; I made a simple pytesseract script:
import cv2
import numpy as np
import pytesseract
from PIL import Image
from pytesseract import image_to_string

src_path = "testimg/"

def get_string(img_path):
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    cv2.imwrite(src_path + "sample.jpg", img)
    cv2.imwrite(src_path + "thres.png", img)
    result = pytesseract.image_to_string(Image.open(src_path + "thres.png"))
    return result

print(get_string(src_path + "n.jpg"))
However, whenever I run the program I don't get any result at all.
Can someone help me with this? Is there an alternative, easier way to implement handwriting recognition in Python? Thank you.

pyocr.get_available_tools() gives an empty list, even after adding tesseract to the environment variable
I am trying to do OCR on a PDF using pyocr, and I am stuck at the first step: pyocr does not recognize Tesseract OCR even after I added C:\Program Files (x86)\Tesseract-OCR to the system PATH environment variable. Can anyone please help me fix this issue?

Add priority words for Tesseract 4.0 LSTM
I'm using Tesseract 4.0 and I'm looking for a way to add new words with high priority to the Tesseract dictionary. I already tried with Bazaar, but that only works with oem=0. Then I tried to modify the lstm-word-dawg file with combine_tessdata, but it did not work even though my new words were in the list. I could try to retrain Tesseract with lstmtraining, but that looks difficult just to add some words.
So what do I have to do? Can someone help me?
Thanks for your help.

TesseractNotFoundError() - pytesseract.pytesseract.TesseractNotFoundError
I have installed tesseract using Homebrew (brew install tesseract reports "Warning: tesseract 4.0.0 is already installed and up-to-date") and have downloaded pytesseract. What else could be yielding this error message? (I have a Mac.)

Uncertain Tesseract Output?
I was trying to extract the text from a static image template using pytesseract, but the output seems to be gibberish.
Here is my code:
from PIL import Image
from pytesseract import image_to_string

# test = image_to_string(Image.open('Test.jpg'), lang='eng')
test = image_to_string(Image.open('PhraseTest.jpg'), lang='eng')
print(test)
print("Success!!")
Here is the image that I am using:
And below is the output that I am getting after OCR: Wmmwmm who mllhavaswoss
Is there anything that I am missing that is the reason for this?

This line of code fixed a pytesseract path issue I was having, but I don't understand how it works. Can someone explain?
So I was trying to run some code using pytesseract, and I got this error:
raise TesseractNotFoundError()
TesseractNotFoundError: tesseract is not installed or it's not in your path
There was a post on this site that provided the solution, I added it to my code and it fixed things:
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
However, I want to understand what exactly this line of code is doing. It looks like it is setting something on the pytesseract library? Is this a line of code that can work with any library? And what does the "r" in front of the path string do?
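To the question above: pytesseract.pytesseract.tesseract_cmd is an ordinary module-level variable that pytesseract reads to locate the tesseract executable when it shells out to it. Assigning it is specific to pytesseract rather than a trick that works with any library, although many wrapper libraries expose a similar setting. The r prefix creates a raw string, in which backslashes are literal characters rather than escape introducers. A small self-contained demonstration:

```python
# The r prefix marks a raw string literal: backslashes are kept as-is
# instead of starting escape sequences such as \n (newline).
raw = r"C:\new"      # six characters: C : \ n e w
cooked = "C:\new"    # the \n collapses into a single newline character
escaped = "C:\\new"  # doubling the backslash is the non-raw equivalent
```

This is why Windows paths are usually written as raw strings: without the prefix, path segments like \new or \temp silently turn into control characters and the file lookup fails.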

Python pytesseract incorrect text extraction from image
I'm trying to extract text from an image using pytesseract, but I'm getting incorrect text. For example, when I try to read the following image,

A = pytesseract.image_to_string(Image.open('A.png'), config='--psm 6', lang='eng')
output is:
Shooﬁng
I specified the language as English (lang='eng'), but I am still getting unexpected characters like "ﬁ".
Even though I set lang='eng', Tesseract returns Latin ligature characters:
"ﬁ" (Name: Latin Small Ligature Fi, Unicode code point: U+FB01)
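This is arguably not a language error: U+FB01 is a genuine typographic ligature that Tesseract's English models can emit when the glyphs "f" and "i" are joined in the source image. One way to fold such compatibility characters back to plain ASCII letters is Unicode NFKC normalization:

```python
import unicodedata

raw = "\ufb01ne"                            # the word 'fine' with a U+FB01 ligature
fixed = unicodedata.normalize("NFKC", raw)  # decomposes the ligature to 'f' + 'i'
```

Applied to the OCR output, this turns every "ﬁ" (and similar ligatures like "ﬂ") into separate letters before any downstream matching.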
Storing Tesseract output in list
I am extracting text from an image with the help of pytesseract; below is my Python code:

from PIL import Image
import pytesseract
from pytesseract import image_to_string

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR'
img = Image.open('new2.jpg')
text = image_to_string(img)
print(text)
and I am getting output like this.
However, I want to store my output in a list, one element per node:
Mylist = ['Which car suits you?', 'Fairly low income', 'Fairly high income', 'Very high income']
E.g. my first node, which is "Which car suits you?", will be stored at index 0 of the list, "Fairly low income" at index 1, and so on.
Can anyone suggest how to achieve this output?
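A hedged sketch of one way to do this, assuming image_to_string returns each node on its own line (the string below is a stand-in for the actual OCR output):

```python
# Stand-in for the string returned by image_to_string(img).
text = "Which car suits you?\n\nFairly low income\nFairly high income\nVery high income\n"

# splitlines() cuts on newlines; the filter drops the blank lines and
# stray whitespace that Tesseract tends to emit between text blocks.
my_list = [line.strip() for line in text.splitlines() if line.strip()]
```

my_list[0] is then "Which car suits you?", my_list[1] is "Fairly low income", and so on, matching the desired indexing.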