C# – Using Tesseract to extract identity details from a photo (Passport, Driving License, etc.)

As part of a project I am working on, I did some research on extracting information from a scanned image or photo of an ID document. Accuracy is not critical here, as this is not a commercial project.

To start, install the “Tesseract” NuGet package in your .NET project:

Install-Package Tesseract

You can read about the Tesseract project here.

Go to the GitHub repository at https://github.com/tesseract-ocr/tessdata/, download the engine training data file for your language (eng.traineddata for English), and copy it to “C:\Temp\TesseractEngineData”.
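A missing or misplaced training file is an easy mistake at this step, so it can help to check for it before creating the engine. The helper below is a minimal sketch: `TessDataCheck` and `HasLanguageData` are hypothetical names, and it assumes the file sits directly inside the data folder and is named after the language code (for example eng.traineddata).

```csharp
using System;
using System.IO;

public static class TessDataCheck
{
    // Returns true when "<language>.traineddata" exists directly inside dataPath.
    // Assumption: the training file was copied into the folder itself, not a subfolder.
    public static bool HasLanguageData(string dataPath, string language)
    {
        if (string.IsNullOrWhiteSpace(dataPath) || string.IsNullOrWhiteSpace(language))
        {
            return false;
        }

        return File.Exists(Path.Combine(dataPath, language + ".traineddata"));
    }
}
```

For the setup described here, the call would be `TessDataCheck.HasLanguageData(@"C:\Temp\TesseractEngineData", "eng")`.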

Below is a sample implementation:

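    using Tesseract;

    // Contract implemented by the service below.
    public interface IIdentificationInformationExtractService
    {
        string Extract(string filePath);
    }

    public class IdentificationInformationExtractService : IIdentificationInformationExtractService
    {
        // Folder that contains the downloaded eng.traineddata file.
        private const string TessDataPath = @"C:\Temp\TesseractEngineData";

        // Runs OCR on the image at filePath and returns the raw recognised text.
        public string Extract(string filePath)
        {
            using (var engine = new TesseractEngine(TessDataPath, "eng", EngineMode.Default))
            using (var img = Pix.LoadFromFile(filePath))
            using (var page = engine.Process(img))
            {
                return page.GetText();
            }
        }
    }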

This is not perfect. You will have to clean up the output and map it to a concrete, structured type, but this will get you started.
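As a rough idea of what that cleanup could look like, here is a minimal sketch. The `ExtractedIdentity` and `OcrTextCleaner` names are hypothetical, and the date pattern is only an assumption about how dates appear on the document (day, month and year separated by `/`, `.` or `-`); real documents will need their own rules.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical structured type for the cleaned-up OCR output.
public class ExtractedIdentity
{
    public List<string> Lines { get; set; } = new List<string>();
    public List<string> DateCandidates { get; set; } = new List<string>();
}

public static class OcrTextCleaner
{
    // Assumed date shape: 1-2 digits, separator, 1-2 digits, separator, 2-4 digits.
    private static readonly Regex DatePattern =
        new Regex(@"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b", RegexOptions.Compiled);

    public static ExtractedIdentity Parse(string rawText)
    {
        var result = new ExtractedIdentity();
        if (string.IsNullOrWhiteSpace(rawText))
        {
            return result;
        }

        // Split into lines, collapse runs of whitespace, and drop empty lines.
        result.Lines = rawText
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => Regex.Replace(line, @"\s+", " ").Trim())
            .Where(line => line.Length > 0)
            .ToList();

        // Collect anything that looks like a date, in the order it appears.
        result.DateCandidates = result.Lines
            .SelectMany(line => DatePattern.Matches(line).Cast<Match>())
            .Select(match => match.Value)
            .ToList();

        return result;
    }
}
```

You would pass the string returned by `Extract` straight into `OcrTextCleaner.Parse`, then decide which line or date candidate maps to which field (name, date of birth, expiry date) for the kind of document you are reading.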

I am sure there are many paid tools out there that give more accurate results, but this is FREE. You can use your C# skills to craft the results you are after.
