C# – Using Tesseract to extract identity details from a photo (Passport, Driving License, etc.)
As part of a project I am working on, I did some research on extracting information from a scanned image or photo of an ID document. Accuracy is not critical here, since this is not a commercial project.
To start, install the “Tesseract” NuGet package in your .NET project:
Install-Package Tesseract
You can read about the Tesseract project here.
Go to the GitHub repository at https://github.com/tesseract-ocr/tessdata/, download the engine training file for English (eng.traineddata), and copy it to “C:\Temp\TesseractEngineData”.
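A missing training file is an easy mistake to make at this step, so a quick check before creating the engine can save some head-scratching. The snippet below is only a sketch: the class and method names (TessDataCheck, HasLanguageData) are made up for illustration, and it assumes the file is named after the language code, as with eng.traineddata.

```csharp
using System.IO;

// Hypothetical helper, not part of the Tesseract package.
public static class TessDataCheck
{
    // Returns true when "<language>.traineddata" exists in the given folder.
    public static bool HasLanguageData(string dataPath, string language)
    {
        var file = Path.Combine(dataPath, language + ".traineddata");
        return File.Exists(file);
    }
}
```

For the setup described here, TessDataCheck.HasLanguageData(@"C:\Temp\TesseractEngineData", "eng") should return true once the file has been copied.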
Below is the sample implementation:
using Tesseract;

// Assumed shape of the interface, which the original sample does not show.
public interface IIdentificationInformationExtractService
{
    string Extract(string filePath);
}

public class IdentificationInformationExtractService : IIdentificationInformationExtractService
{
    public string Extract(string filePath)
    {
        // The data path is the folder that holds eng.traineddata.
        using (var engine = new TesseractEngine(@"C:\Temp\TesseractEngineData", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(filePath))
        using (var page = engine.Process(img))
        {
            return page.GetText();
        }
    }
}
This is not perfect. You will have to do some cleanup work on the raw text output and map it to a concrete, structured type, but this will get you started.
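To give an idea of what that cleanup might look like, here is a minimal sketch. It assumes nothing about any particular ID layout: it just normalizes whitespace, drops empty lines, and picks out anything that looks like a dd/mm/yyyy date. The ExtractedIdText and OcrTextCleaner names and the date format are my own assumptions for illustration, so adjust them to the documents you are working with.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical structured type for illustration; not part of Tesseract.
public class ExtractedIdText
{
    public List<string> Lines { get; set; } = new List<string>();
    public List<string> Dates { get; set; } = new List<string>();
}

public static class OcrTextCleaner
{
    // Assumed date layout: two digits, two digits, four digits,
    // separated by '/', '-' or '.'.
    private static readonly Regex DatePattern =
        new Regex(@"\b\d{2}[/.\-]\d{2}[/.\-]\d{4}\b");

    public static ExtractedIdText Clean(string rawText)
    {
        var result = new ExtractedIdText();
        if (string.IsNullOrWhiteSpace(rawText)) return result;

        // Collapse whitespace runs, trim each line, and drop empty lines.
        result.Lines = rawText
            .Split('\n')
            .Select(line => Regex.Replace(line, @"\s+", " ").Trim())
            .Where(line => line.Length > 0)
            .ToList();

        // Collect anything that looks like a date from the cleaned lines.
        result.Dates = result.Lines
            .SelectMany(line => DatePattern.Matches(line).Cast<Match>())
            .Select(match => match.Value)
            .ToList();

        return result;
    }
}
```

You would feed it the string returned by Extract, for example OcrTextCleaner.Clean(service.Extract(filePath)), and then build your own field mapping on top of the cleaned lines.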
I am sure there are many paid, subscription-based tools out there that give more accurate results, but this is FREE. You can use your C# skills to craft the results you are after.