How to Extract Data From Scanned Documents
With the shift from physical to digital documents, extracting data from scanned documents through OCR and machine learning has become essential. To enable accurate extraction, research labs and corporations have made major advances in computer vision and Natural Language Processing (NLP).
Deep learning now allows us to extract far more than plain text from scans – tables, key-value pairs, and other structures can be captured as well. Many OCR solutions offer products that extract data from scanned documents, meeting the document-processing needs of both individuals and businesses.
This article explores the current technology for extracting data from scanned documents. We'll walk through a Python tutorial and survey popular market solutions that offer top-notch scanned-document data extraction through OCR and machine learning.
What is Data Extraction?
Data extraction is the process of converting unstructured data into information that programs can interpret, which in turn allows humans to process the data further.
Here, we list several of the most common types of data to be extracted from scanned documents.
1. Text Data
The most common and most important task in data extraction from scanned documents is extracting text. This process, while seemingly straightforward, is in fact very difficult, as scanned documents are usually presented as images. In addition, the methods of extraction depend heavily on the type of text.
While text is present in densely printed formats the majority of the time, the ability to extract sparse text from poorly scanned documents or from handwritten letters with drastically varying styles is equally important. This process allows programs to convert images into machine-encoded text, which can then be organized from unstructured data (i.e., without a defined format) into structured data for further analysis.
2. Tables
Tables are among the most popular formats for data storage, as they are easily interpretable at a glance. Extracting tables from scanned documents requires technology beyond character recognition – one must detect the lines and other visual features of a table, then convert that information into structured data for downstream computation.
Computer vision methods (described in detail in the following sections) are heavily used to achieve high-accuracy table extraction.
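To make this concrete, here is a minimal sketch of one such technique – isolating the horizontal ruling lines of a table with OpenCV's morphological operations. The filename and kernel width are assumptions chosen for illustration:
import cv2

# Load a hypothetical scanned table and binarize it (Otsu thresholding).
img = cv2.imread('table.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# A wide, 1-pixel-tall kernel keeps only long horizontal strokes,
# i.e., the ruling lines that delimit table rows.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
horizontal_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

cv2.imwrite('lines.png', horizontal_lines)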
3. Key-Value Pairs
Key-value pairs (KVPs) are a common alternative format used for data storage in documents.
KVPs are essentially two linked data items: a key, which serves as a unique identifier, and the value it points to. A classic KVP example is the dictionary, where words are the keys and their definitions are the values. These pairs, while often unnoticed, appear frequently in documents: survey questions such as name and age, or item prices in invoices, are all implicitly KVPs.
However, unlike tables, KVPs often exist in unknown formats and are sometimes even partially handwritten. For example, keys may be pre-printed in boxes while values are handwritten in when the form is completed. Finding the underlying structure to automatically extract KVPs therefore remains an active research problem, even for the most advanced labs.
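To illustrate, here is a minimal sketch of what the output of a KVP extractor might look like in Python; the field names and values are hypothetical:
# Hypothetical output of a KVP extractor run on a scanned invoice:
# each pre-printed label (key) is linked to the value filled in next
# to it, whether printed or handwritten.
extracted_kvps = {
    "Name": "Jane Doe",         # handwritten value
    "Invoice No.": "INV-0042",  # printed value
    "Total": "129.99",
}

# The key serves as the unique identifier used to retrieve its value.
print(extracted_kvps["Total"])  # 129.99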
4. Figures
Finally, it is also important to capture data from figures within a scanned document. Charts such as pie charts and bar charts often carry crucial information, and a good extraction pipeline should be able to infer values from their legends and numbers. Machine-readable figures such as barcodes and QR codes can be decoded directly for further use.
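For machine-readable figures, decoding is straightforward. Below is a minimal sketch using the pyzbar library (assuming it is installed; the filename is hypothetical):
from PIL import Image
from pyzbar.pyzbar import decode

# Decode any barcodes or QR codes found in a scanned page.
results = decode(Image.open('scanned_page.png'))
for result in results:
    # Each result carries the symbology type and the raw payload.
    print(result.type, result.data.decode('utf-8'))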
Technologies Behind Scanned Document Data Extraction
Data extraction involves Optical Character Recognition (OCR) and Natural Language Processing (NLP).
OCR
OCR converts images of text into machine-encoded text, while NLP analyzes the recognized words to infer their meaning. Other computer vision techniques, such as box and line detection, often accompany OCR to extract the aforementioned data types, such as tables and KVPs, for a more comprehensive extraction.
The core improvements behind the data-extraction pipeline are tightly connected to the advances in deep learning that have contributed greatly to computer vision and natural language processing (NLP).
Deep Learning
Deep learning is a major driver of the current artificial intelligence era and is constantly being pushed to the forefront of numerous applications. In traditional engineering, the goal is to design a system or function that generates an output from a given input; deep learning, by contrast, uses pairs of inputs and outputs to learn an intermediate relationship – encoded in a so-called neural network – that generalizes to new, unseen data.
A neural network, such as the multi-layer perceptron (MLP), is a machine-learning architecture inspired by how human brains learn. The network contains neurons, which mimic biological neurons and "activate" differently depending on the information they receive. Sets of neurons form layers, and multiple layers are stacked to create a network that serves various prediction purposes (e.g., image classification or bounding-box prediction for object detection).
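For intuition, here is a minimal sketch of an MLP in PyTorch; the layer sizes are arbitrary assumptions chosen for illustration:
import torch
import torch.nn as nn

# A three-layer perceptron: 784 input features (e.g., a flattened
# 28x28 image), two hidden layers, and 10 output classes.
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),           # the "activation" of each neuron
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # raw class scores
)

x = torch.randn(1, 784)  # one dummy input
print(mlp(x).shape)      # torch.Size([1, 10])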
In computer vision, one variation of the neural network is heavily applied – the convolutional neural network (CNN). Instead of relying solely on fully connected layers, a CNN adopts convolutional kernels that slide across tensors (high-dimensional arrays) to extract features. Combined with conventional network layers, CNNs are very successful at image-related tasks and form the basis for OCR extraction and other feature detection.
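Continuing the sketch above, a single convolutional layer illustrates the sliding-kernel idea; the channel counts and image size are assumptions:
import torch
import torch.nn as nn

# Sixteen 3x3 kernels slide over a 1-channel image tensor,
# each producing one feature map.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)

image = torch.randn(1, 1, 32, 32)  # a batch of one 32x32 grayscale image
features = conv(image)
print(features.shape)              # torch.Size([1, 16, 32, 32])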
NLP, on the other hand, relies on a different family of networks that focuses on sequential data. Unlike images, where one image is independent of another, text prediction benefits greatly when the words before (and after) a token are also considered. For years, a family of networks known as long short-term memories (LSTMs) fed previous results back in as inputs to predict the current result; bidirectional LSTMs, which consider context both before and after a token, were often adopted to further improve predictions. More recently, however, transformers built on the attention mechanism have risen to prominence thanks to their greater flexibility, producing better results than traditional networks on sequential data.
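As a final sketch, a bidirectional LSTM in PyTorch shows how each output step can draw on context from both directions; the dimensions are again arbitrary:
import torch
import torch.nn as nn

# A bidirectional LSTM over a sequence of 10 feature vectors: each
# output step depends on the context both before and after it.
lstm = nn.LSTM(input_size=32, hidden_size=64,
               bidirectional=True, batch_first=True)

seq = torch.randn(1, 10, 32)  # (batch, time steps, features)
out, _ = lstm(seq)
print(out.shape)              # torch.Size([1, 10, 128]) -- 64 per direction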
Applications of Scanned Document Data Extraction
The main goal of data extraction is to convert data from unstructured documents into structured formats, where the accurate retrieval of text, figures, and data structures enables numerical and contextual analysis.
Business corporations and large organizations deal with thousands of pieces of paperwork in similar formats on a daily basis – big banks receive numerous near-identical applications, and research teams analyze piles of forms to conduct statistical analysis. Automating the initial step of extracting data from scanned documents therefore significantly reduces redundant manual work and allows workers to focus on analyzing data and reviewing applications instead of keying in information.
- Verifying Applications – Companies receive large volumes of applications, whether handwritten or submitted through online forms. These applications are often accompanied by personal IDs for verification purposes, and scanned IDs such as passports or identity cards usually come in batches with similar formats. A well-written data extractor can quickly convert the data (text, tables, figures, KVPs) into machine-readable text, substantially reducing the man-hours spent on these tasks and letting staff focus on application selection instead of extraction.
- Payment Reconciliation – Payment reconciliation is the process of comparing bank statements to ensure that the numbers match between accounts. It relies heavily on data extraction from scanned documents – a challenging task for a company of considerable size with multiple income streams. Data extraction can ease this process and allow employees to focus on spotting faulty data and investigating potentially fraudulent events in the cash flow.
- Statistical Analysis – Corporations and organizations use feedback from customers or experiment participants to improve their products and services, and a comprehensive feedback evaluation usually requires statistical analysis. However, survey data may exist in numerous formats or be hidden within text of varying layouts. Data extraction can ease the process by pulling the relevant data from documents in batches, making useful information easier to find and ultimately increasing efficiency.
- Sharing Past Records – From healthcare to banking, large industries often require customer information that already exists elsewhere. For example, a patient switching hospitals after moving may have pre-existing medical records that would be helpful to the new hospital. In such cases, good data extraction software comes in handy: all that is required is for the individual to bring a scanned history of records, and the new hospital can automatically fill in all the information. Not only is this convenient, it also avoids the serious risk – especially in healthcare – of important patient records being overlooked.
How to Implement Scanned Document Data Extraction?
To provide a clearer view of how to perform data extraction, we show two methods for extracting data from scanned documents.
1. Building from Scratch
One may build a simple data-extracting OCR engine with the PyTesseract library as follows:
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'
# Simple image to string
print(pytesseract.image_to_string(Image.open('test.png')))
# List of available languages
print(pytesseract.get_languages(config=''))
# French text image to string
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
# In order to bypass the image conversions of pytesseract, just use relative or absolute image path
# NOTE: In this case you should provide tesseract supported images or tesseract will return error
print(pytesseract.image_to_string('test.png'))
# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))
# Timeout/terminate the tesseract job after a period of time
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2))   # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass
# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))
# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))
# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))
# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf)  # pdf type is bytes by default
# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')
# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')
For more information about the code, you may check out the official PyTesseract documentation.
In short, the code extracts data such as text and bounding boxes from a given image. While fairly useful, this engine is nowhere near as strong as those offered by advanced solutions, which benefit from substantial computational power for training.
2. Using the Google Cloud Vision API
The following function, adapted from Google's documentation, performs asynchronous OCR on a PDF or TIFF stored in Google Cloud Storage and writes the results back to a bucket as JSON:
def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    import json
    import re
    from google.cloud import vision
    from google.cloud import storage

    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.Feature(
        type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=420)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json.loads(json_string)

    # The actual response for the first page of the input file.
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print('Full text:\n')
    print(annotation['text'])
Ultimately, Google's Vision and Document AI services allow you to extract a wealth of information from documents with high accuracy. The services also cover more specific use cases, including text extraction from both clean documents and in-the-wild images.
Please take a look at Google's official documentation for more.
Current Solutions Offering OCR Data Extraction
Besides large corporations with APIs for document data extraction, several solutions provide highly accurate PDF OCR services. We present several PDF OCR options that specialize in different aspects, as well as some recent research prototypes showing promising results*:
*Side note: multiple OCR services target tasks such as images-in-the-wild. We skip those services here, as we focus solely on PDF document reading.
- Google API -- As one of the biggest online service providers, Google delivers excellent results in document extraction with its pioneering computer vision technology. The services are free at low usage levels, but the price stacks up as API calls increase.
- Deep Reader -- Deep Reader is a research work published at the ACCV 2018 conference. It incorporates multiple state-of-the-art network architectures to perform tasks such as document matching, text retrieval, and image denoising. Additional features, such as table and key-value-pair extraction, allow data to be retrieved and saved in an organized manner.
- Nanonets™ -- Backed by a highly skilled deep learning team, Nanonets™ PDF OCR is completely template- and rule-independent. Not only can Nanonets™ work on specific types of PDFs, it can also be applied to any document type for text retrieval.
Conclusion
Armed with an understanding of the key concepts, tools, and platforms covered in this article, you'll be well-equipped to implement or enhance data extraction capabilities in your own projects and organizations. As scanned document data extraction technologies continue to evolve, staying on top of the latest advancements will be key to maximizing efficiency, insights, and competitive advantage.
FAQs
Can you pull information from a scanned document?
You can pull information from a scanned document using optical character recognition (OCR) services. Nanonets AI-powered OCR can easily automate extracting text and data from scanned PDFs, images, and other document types.
How do I extract text from a scanned document?
Use an OCR service like Nanonets to extract text from scanned documents. Upload the scanned file to the website, and the AI-OCR engine will analyze and convert the document to your preferred format. You can export the output to your business systems for further processing.
How do I extract data from a scanned PDF online?
If you have scanned PDFs and need to extract data from them, you can use Nanonets OCR. You only need to upload your file and choose the data you wish to extract. Its AI-OCR engine will analyze the scanned PDF, extract the text and data, validate it automatically, and make it available for editing. After that, you can export it in the format of your choice.
Can Excel get data from scanned PDFs?
You can import tables from PDFs into Excel using Excel's built-in PDF import feature. Just open Excel, go to the Data tab, click 'Get Data', select 'From PDF', choose your file, select the table(s), and import into Excel. This method of getting data into Excel may not always be accurate and may lead to formatting errors. Use an AI-powered OCR tool like Nanonets for better results.
How do I select data from a scanned PDF?
To select data from a scanned PDF, you can use various methods such as OCR tools, PDF converters, manual extraction, or automated solutions. Online OCR tools and PDF converters can help make PDFs editable and searchable. A copy-and-paste approach can be practical for a small number of simple PDFs. Alternatively, automated PDF data extraction tools like Nanonets use AI and ML to extract, validate, and process data at scale from scanned documents accurately. They also offer automation features that enable seamless export and auto-population of data into your business systems for further processing.
How do I extract data from an image file?
If you need to extract text from image files, use an OCR engine like Nanonets. Select the files you want to convert from your computer or drag and drop them into the upload box. The tool supports PNG, JPG, and PDF files. Once you upload the image, the image-to-text converter tool analyzes the content and converts it into editable text, which you can download as a text file within seconds.