How to Improve OCR Conversion Quality
OCR (Optical Character Recognition) is not an easy task: both the quality of the source PDF and the OCR options you choose affect the accuracy of the output file. One study of 19th- and early 20th-century newspaper pages concluded that character-by-character accuracy for commercial OCR software varied from 71% to 98%.
This tutorial shows you how to improve OCR conversion quality using PDF to Word OCR.
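To make "character-by-character accuracy" concrete, here is a minimal Python sketch of one common way to measure it: count the single-character edits needed to turn the OCR output into a hand-checked reference text, then divide by the reference length. The study mentioned above may have used a different formula, and the function names `levenshtein` and `character_accuracy` are chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # delete char_a
                current[j - 1] + 1,       # insert char_b
                previous[j - 1] + cost,   # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters that were recognized correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# French text recognized without its accents: 2 errors in 15 characters.
print(f"{character_accuracy('café à la carte', 'cafe a la carte'):.1%}")  # 86.7%
```

Comparing this figure before and after changing a setting (language, resolution, rotation) is a simple way to check whether the change actually helped on your own documents.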
Select the appropriate document language before OCR conversion. This is an extremely important step for accurate text recognition.
For example, if your PDF is in French but you choose English as the OCR language, non-English characters such as 'é' and 'à' will not be recognized correctly.
The application supports 10 languages: English, French, German, Italian, Spanish, Portuguese, Polish, Swedish, Russian, and Dutch.
The quality of the conversion depends on the quality of the original PDF. Pages with poor image quality or skewed content may not be converted accurately.
Images in the PDF document should be at least 300 dpi, and 600 dpi is recommended for documents with smaller fonts. Otherwise, characters may run together, making the text hard for OCR to recognize.
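If you are unsure whether a scan meets the 300 dpi guideline, you can estimate it from the image's pixel width and the physical page width. The Python sketch below shows that arithmetic, plus a rough estimate of how many pixels tall a line of text is at a given resolution. The helper names are illustrative and not part of the application, and the text height is based on the nominal font size (1 point = 1/72 inch), so actual letters will be somewhat smaller.

```python
POINTS_PER_INCH = 72   # typographic points in one inch
MINIMUM_DPI = 300      # minimum resolution recommended in this tutorial


def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Resolution of a scanned page: pixels across divided by inches across."""
    return pixel_width / page_width_inches


def nominal_text_height_px(font_size_pt: float, dpi: float) -> float:
    """Approximate height in pixels of text set at the given point size."""
    return font_size_pt * dpi / POINTS_PER_INCH


def meets_minimum(dpi: float, minimum: float = MINIMUM_DPI) -> bool:
    """True if the scan is at or above the recommended minimum resolution."""
    return dpi >= minimum


# A US Letter page (8.5 inches wide) scanned to 2550 pixels across.
dpi = effective_dpi(2550, 8.5)
print(dpi, meets_minimum(dpi))                    # 300.0 True

# 6 pt text is 25 pixels tall at 300 dpi and 50 pixels tall at 600 dpi.
print(nominal_text_height_px(6, 300), nominal_text_height_px(6, 600))  # 25.0 50.0
```

Doubling the resolution doubles the number of pixels available for each character, which is why the higher setting helps with small print.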
Incorrect orientation of the document will result in poor conversion quality.
Move your mouse cursor to the top left of the built-in PDF reader, and the rotate buttons will appear. The rotate operation only affects the current page.
Extracting text is the main purpose of performing OCR. If the scanned PDF contains image elements, select them before the conversion for better formatting preservation and accuracy.
(1) To select an image area, move your mouse cursor over the built-in reader, hold down the left mouse button, and drag to select the area. Then release the mouse button.
(2) To move or adjust an area, click on it and drag the area border to the desired location.
(3) To remove a selected area, select it and press the 'Delete' key on your keyboard, or move your mouse cursor to the top left of the built-in PDF reader, where the 'remove' buttons will appear. You can remove a single selected area or all the selected areas in the document.
A selected area is preserved as an image in the converted Word document, and the app does not perform OCR on selected areas. This helps keep the original layout. If you don't select an image area, text in the image will still be OCRed, but the image will be missing from the output document.
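The effect of selecting image areas can be pictured with a small Python sketch. It is purely conceptual and is not how the application is implemented: the `Rect` type, the `ocr_targets` function, and the coordinates are all hypothetical. It models the rule described above, in which any text region that overlaps a selected image area is skipped by OCR and kept as part of the image.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: float
    top: float
    right: float
    bottom: float

    def intersects(self, other: "Rect") -> bool:
        """True if the two rectangles share any area."""
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


def ocr_targets(text_regions: List[Rect], image_areas: List[Rect]) -> List[Rect]:
    """Keep only the text regions that do not overlap a selected image area."""
    return [
        region for region in text_regions
        if not any(region.intersects(area) for area in image_areas)
    ]


# One area selected by the user, for example around a chart.
selected = [Rect(100, 100, 400, 300)]

# Two detected text regions: a heading above the chart and a label inside it.
heading = Rect(50, 20, 200, 40)
chart_label = Rect(150, 150, 250, 170)

# Only the heading is sent to OCR; the label stays inside the preserved image.
print(ocr_targets([heading, chart_label], selected) == [heading])  # True
```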