When you compare a scanned PDF, some of the changes identified by the comparison may appear illogical or unexpected. This happens because Optical Character Recognition (OCR) has been performed on your PDF.
A regular PDF contains text that can be selected, copied and edited. A scanned PDF contains only images of content; there is no actual text, just images embedded in the PDF file.
To run a comparison on a scanned PDF, the images must first be converted into editable text. This conversion process, known as OCR, is imperfect.
Workshare automatically runs OCR when you choose to compare a scanned PDF and uses the converted version of the document for the comparison. This means that the document Workshare actually compares may not be exactly the same as the document you selected.
In the example shown above, a scanned PDF is selected as the original document. Workshare converts the PDF to a text-based PDF and then runs the comparison using this converted original PDF. You cannot see the converted original PDF. Consequently, the comparison results may not match what you can see in the original and modified documents.
Why this may cause inconsistencies
While the conversion attempts to be as accurate as possible, some content may be converted incorrectly, for example when the scanned PDF is a document that has been photocopied multiple times or includes handwritten notes. The comparison may then indicate that text has been changed when you can see that it has not.
Imagine the original document was a scanned rental agreement where the rent had been filled in by hand as £50.00 and the modified document was a regular PDF with the rent as £50.
The OCR process converts the scanned PDF and mistakenly reads the handwritten £50.00 as 450.00.
The comparison shows a change even though you can see there is none. Clicking the change shows that 450 has been deleted and £50 added, which seems very strange until you remember that the comparison used the converted text and not the scanned image.
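The rental agreement example can be reproduced with a small word-level comparison. This is only an illustrative sketch using Python's standard difflib module, not the algorithm Workshare uses. The sentences and the function name word_changes are hypothetical, and for simplicity both documents are assumed to show the rent as £50.00.

```python
import difflib


def word_changes(original, modified):
    """Return (deleted, inserted) pairs from a word-level comparison."""
    a = original.split()
    b = modified.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Words removed from the original and words added in the modified text.
        changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# What you see when you read the scanned page (hypothetical sentence).
scanned_as_seen = "The monthly rent is £50.00 payable in advance."
# What OCR produced: the handwritten pound sign was misread as the digit 4.
scanned_after_ocr = "The monthly rent is 450.00 payable in advance."
# The modified document, a regular text-based PDF.
modified = "The monthly rent is £50.00 payable in advance."

# Comparing what you can see: no changes.
visible_changes = word_changes(scanned_as_seen, modified)
# Comparing what is actually compared (the OCR output): one change.
reported_changes = word_changes(scanned_after_ocr, modified)

print(visible_changes)
print(reported_changes)
```

The second comparison reports the misread amount as deleted and the correct amount as inserted, although the page you are looking at never contained the misread value.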
How to distinguish a scanned PDF from a regular PDF
One way of knowing whether your PDF is a scanned, image-based PDF is to try to select some text. You cannot select text in a scanned PDF; you can only select an area of the image. In a regular PDF, you can select and copy text.
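If you need to check many files, the same test can be approximated in a script: a PDF from which no text can be extracted is probably image-based. The sketch below is an assumption-laden illustration, not part of Workshare. It relies on the third-party pypdf library for extraction, and the helper names looks_scanned and extract_page_texts are hypothetical. A scanned PDF that already carries a hidden OCR text layer will be reported as a regular PDF, just as its text would be selectable in a viewer.

```python
def looks_scanned(page_texts):
    """Return True if none of the pages yielded any extractable text."""
    return all(not text.strip() for text in page_texts)


def extract_page_texts(path):
    """Extract the text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # extract_text() may return an empty string for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


# Usage with a real file (hypothetical file name):
#     print(looks_scanned(extract_page_texts("rental_agreement.pdf")))

# The decision itself can be tried without any PDF at hand:
image_only_pages = ["", "  \n"]
text_pages = ["", "Rent: £50.00"]
print(looks_scanned(image_only_pages))
print(looks_scanned(text_pages))
```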
How side-by-side comparison helps you deal with scanned PDFs
When Workshare compares a scanned PDF, you are notified across the top of your comparison. For example:
This alerts you to the fact that OCR has been performed prior to the comparison.
You can review your changes in the usual way – hover over a change to learn more about it. Most of your changes will be accurate but there is a link to an explanation if the results are inconsistent. For example:
Remember that OCR is imperfect, so comparisons of scanned documents will need more review time. Side-by-side comparison makes it easy to review both the original and modified documents because they are shown together in one workspace and stay synchronised as you scroll through them.