Scanned PDFs
Background
Does the type of PDF created matter?
Yes. When converting a PDF, the nature of the PDF matters. There are two
types of PDF: native PDFs and scanned PDFs.
Native PDFs
Native PDFs are generated from an electronic source document, for
example:
- Accounts production software
- Word
- Excel
- HTML
- Adobe InDesign
- Computer generated report
- Etc.
Each of these sources has an internal structure that can be read and
interpreted by software.
These "generated" native PDF documents therefore already contain
characters that have an electronic character designation. In most cases, the
PDF creation software takes information from the structure of the source
document (such as character information and word placement information) and
retains it in the created PDF output. This is why you can search for words
in a text-based PDF document.
Scanned PDFs
A scanned PDF arises when a physical paper document needs to be converted
into electronic form (i.e. where it is inefficient or not viable to retype
or recreate the document manually in electronic form and then convert it
into a PDF).
The solution is to scan the document using an electronic scanning device.
The scanner digitally captures the image of the physical document into an
electronic form, creating a “snapshot” picture of the document. (Note: the
scanner does not reconstruct the character of every word when it creates
this scanned image.) This snapshot is then turned into a PDF by using
software integrated with the scanner.
The result is a scanned PDF document.
However, even though the image may be of a document that contains words,
the computer recognizes those words only as “images”, which it displays
without any information structure behind them.
This is why, if you try to text search the document, the PDF search engine
will not return any results.
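The distinction above can be checked programmatically. The following is a minimal sketch, assuming the text of each page has already been extracted by a PDF library of your choice (the extraction step is not shown). The function name `classify_pdf` and the `min_chars_per_page` threshold are illustrative assumptions, not part of any particular tool.

```python
from typing import Sequence


def classify_pdf(page_texts: Sequence[str], min_chars_per_page: int = 20) -> str:
    """Guess whether a PDF is 'native', 'scanned', or 'mixed'.

    page_texts holds the text extracted from each page (hypothetical input;
    how it is extracted is up to the PDF library used). A page counts as
    having a text layer if it yields at least min_chars_per_page
    non-whitespace characters. The threshold is an assumed heuristic.
    """
    if not page_texts:
        raise ValueError("no pages supplied")

    pages_with_text = sum(
        1
        for text in page_texts
        if len("".join(text.split())) >= min_chars_per_page
    )

    if pages_with_text == len(page_texts):
        return "native"   # every page has searchable text
    if pages_with_text == 0:
        return "scanned"  # image-only pages: OCR would be needed
    return "mixed"        # some pages are images, some have text
```

A result of "scanned" or "mixed" suggests that at least part of the document would need OCR before it can be searched or converted.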
OCR solution for scanned PDFs
To convert a scanned PDF into a searchable/editable format, OCR (optical
character recognition) software is required to analyze the “image” of each
character and match it to an electronic character. This process is not
error free, and it may be difficult to determine whether the character
"recognized" by the OCR software is indeed the character on the scanned
document.
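One common way to deal with this uncertainty is to route doubtful words to a human reviewer. The sketch below assumes the OCR software reports a confidence score between 0 and 100 for each recognized word; many OCR engines do, but the `OcrWord` structure, the field names, and the threshold of 90 are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class OcrWord:
    text: str          # the word as recognized by the OCR software
    confidence: float  # assumed scale: 0 (no confidence) to 100 (certain)


def words_needing_review(
    words: Iterable[OcrWord], threshold: float = 90.0
) -> List[OcrWord]:
    """Return the words whose confidence falls below the threshold.

    The threshold is an assumed value; a stricter one sends more words
    to manual review.
    """
    return [word for word in words if word.confidence < threshold]


# Usage: only the doubtful word is flagged for manual checking.
sample = [OcrWord("Turnover", 98.0), OcrWord("1O,5OO", 62.5)]
flagged = words_needing_review(sample)
```

A high confidence score does not guarantee a correct character, so this kind of filter reduces the amount of manual checking rather than replacing it.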
OCR output - quality considerations
Note that the quality of OCR output is affected by factors such as:
- Poor image quality of the scanned document
- The selection and mixture of fonts used in the scanned document,
including italicized and underlined text, which may blur the shape of
individual characters
- Etc.
OCR output - quality required for financial statements
For financial statements, the quality of OCR conversion is of paramount
importance. Accordingly, these files need to be processed very carefully,
followed by manual verification and correction of the OCR output to ensure
the accuracy of the results.
Once this stage is complete, the file is ready for conversion to iXBRL
or XBRL.
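Part of the manual verification described above can be supported by simple automated checks. As one illustrative sketch, the function below flags tokens that look like monetary amounts but contain letters that OCR commonly confuses with digits (for example the letter O for zero). The set of confusable letters and the function name are assumptions made for illustration; the check is a heuristic and will produce some false positives.

```python
import re

# Letters assumed to be commonly misread in place of digits (O/o for 0,
# l/I for 1, S for 5, B for 8). This list is illustrative, not exhaustive.
CONFUSABLE_LETTERS = "OolISB"

# A token is "amount-like" if it consists only of digits, the confusable
# letters, and typical number punctuation.
_AMOUNT_LIKE = re.compile(r"^[\dOolISB,.()\-]+$")


def is_suspicious_amount(token: str) -> bool:
    """Return True if the token looks like a number containing a misread letter."""
    has_digit = any(ch.isdigit() for ch in token)
    has_confusable = any(ch in CONFUSABLE_LETTERS for ch in token)
    return bool(_AMOUNT_LIKE.match(token)) and has_digit and has_confusable


# Usage: collect tokens from OCR output that a reviewer should check first.
ocr_tokens = ["Turnover", "1O,5OO", "10,500", "(2,3l4)"]
to_check = [t for t in ocr_tokens if is_suspicious_amount(t)]
```

Flagged tokens still need to be compared against the scanned image by a person; the check only helps direct attention to likely errors.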
Alternative solution - avoiding the OCR stage
Obtain the source document from which the paper document was
printed
An easy way to avoid the OCR stage is to obtain the source document from
which the paper document was printed (and then scanned). This is likely to
be one of the following, created just before signature of the financial
statements:
- A Word document
- A native PDF file
Note that the actual signature is neither needed nor used for iXBRL/XBRL
conversion.