Python convert color image to black text on white background for OCR

Issue

I have an image that I need to run OCR (Optical Character Recognition) on to extract all of its data.

[image: input image]

First, I want to convert the color image to black text on a white background in order to improve OCR accuracy.

I tried the code below:

from PIL import Image
img = Image.open("data7.png")
img.convert("1").save("result.jpg")

It gave me this unclear image:

[image: unclear result]

I expect to get this image:

[image: expected result]

Then I will use pytesseract to get a dataframe:

import pytesseract as tess
file = Image.open("data7.png")
text = tess.image_to_data(file,lang="eng",output_type='data.frame')
text

Finally, the dataframe I want looks like this:

[image: expected dataframe]

Solution

Here’s a vanilla Pillow solution. Just grayscaling the image gives us okay results, but the green text is too faint.

So, we first scale the green channel up (sure, it might clip, but that’s not a problem here), then grayscale, invert and auto-contrast the image.
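As a rough illustration of why the green boost helps, the arithmetic for a single pixel can be sketched in plain Python. The pixel value is made up for the example (it is not sampled from the image), and `luma` and `boost_green` are hypothetical helpers that only approximate Pillow's documented RGB-to-L weights (299/587/114 per 1000).

```python
# Illustrative only: dim_green is a made-up pixel, not taken from the image.

def luma(r, g, b):
    """Approximate Pillow's RGB -> L conversion (ITU-R 601-2 luma weights)."""
    return (r * 299 + g * 587 + b * 114) // 1000


def boost_green(pixel, factor=3):
    """Scale the green channel and clip it to the 8-bit range."""
    r, g, b = pixel
    return r, min(255, g * factor), b


dim_green = (0, 120, 0)

# Gray value after inversion, without and with the green boost.
plain = 255 - luma(*dim_green)
boosted = 255 - luma(*boost_green(dim_green))

print(plain, boosted)  # 185 106
```

After inversion the boosted pixel is noticeably darker (106 instead of 185), so it stands out more against a light background. The full Pillow script below applies the same idea to every pixel.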

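from PIL import Image, ImageOps

# Load the source image and force RGB so it splits into three channels
img = Image.open('rqDRe.png').convert('RGB')

r, g, b = img.split()

img = Image.merge('RGB', (
    r,
    g.point(lambda i: i * 3),  # brighten green channel (values above 255 clip)
    b,
))

# Grayscale, invert so the text ends up dark on a light background,
# then stretch the contrast (cutoff: ignore 5% at each end of the histogram)
img = ImageOps.autocontrast(ImageOps.invert(ImageOps.grayscale(img)), cutoff=5)

img.save('rqDRe_processed.png')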

Output:

[image: processed result]
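If a strictly two-level (pure black and white) image is wanted, one possible follow-up is a hard threshold. The unclear result from a bare convert("1") is likely due to the dithering Pillow applies by default in that conversion; thresholding first leaves only 0 and 255, so there is nothing left to dither. This is a minimal sketch, not part of the original answer: `binarize` is a hypothetical helper, the threshold of 128 is an assumption that would need tuning per image, and the tiny synthetic image stands in for the processed one.

```python
from PIL import Image


def binarize(gray, threshold=128):
    """Map pixels above the threshold to white and the rest to black."""
    # threshold=128 is an assumed starting point, not a recommended value.
    hard = gray.convert("L").point(lambda p: 255 if p > threshold else 0)
    # Only 0 and 255 remain, so converting to mode "1" cannot introduce dither noise.
    return hard.convert("1")


# Tiny synthetic grayscale strip used in place of the processed image.
sample = Image.new("L", (4, 1))
sample.putdata([10, 100, 200, 250])

result = binarize(sample)
pixels = [result.getpixel((x, 0)) for x in range(result.width)]
print(pixels)  # [0, 0, 255, 255]
```

In practice the helper would be applied to the image produced by the script above before saving it or passing it to pytesseract.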

Answered By – AKX

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
