It’s not uncommon for the current generation of desktop or departmental scanners to capture in excess of 200 images per minute, so it has never been easier to capture large numbers of image files in a short amount of time.
There are also huge collections of legacy files stored in outdated or inconsistent file formats. Historic collections may have been scanned in the past, but if the documents aren’t text searchable, users may miss a huge amount of content.
All of these scenarios create a challenge: how do you convert a large collection of files into a useful, consistent format without expending huge amounts of time in the process?
Optical Character Recognition (OCR) is the automated process of identifying text within an image file and outputting it to a format that supports search functions, such as PDF. Alternatively, TXT or HTML might be the preferred output to provide back-end search capability for databases of image files.
Genus has provided file conversion and OCR services for almost 20 years and we have one of the newest and most powerful server architectures in the UK for image file processing. We have invested heavily in server-based technology enabling us to process huge numbers of documents in a very short space of time.
With support for almost 200 languages and many fonts, including gothic fonts, there are very few jobs we cannot handle. We support around 30 output file types, including 11 PDF specifications. We have passed hundreds of thousands of pages of historic newspapers and military diaries through our process to produce high-quality text-searchable image files.
After the OCR process, we can extract data using AI with supervised machine learning to recognise different document types and layouts. In a recent project for a major client, we extracted lines of code from 16 million microfiche images. The final accuracy reading was 97.8% on approximately 230 million characters validated against lookup tables.
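As a rough illustration of how an accuracy figure of this kind could be calculated, the sketch below counts the characters in extracted values that match an entry in a lookup table and divides by the total number of characters extracted. This is only one plausible reading of "validated against lookup tables", not a description of our production process; the function name `character_accuracy` and the sample values are hypothetical.

```python
def character_accuracy(extracted_values, lookup_table):
    """Return the share of extracted characters that belong to validated values.

    Assumption: a value counts as validated only if it matches a lookup
    entry exactly. Real validation rules may be more lenient.
    """
    total_chars = sum(len(value) for value in extracted_values)
    if total_chars == 0:
        return 0.0  # nothing extracted, so nothing can be validated
    valid_chars = sum(
        len(value) for value in extracted_values if value in lookup_table
    )
    return valid_chars / total_chars


# Hypothetical sample: two of the three extracted codes appear in the lookup table.
known_codes = {"ABC123", "XYZ789"}
ocr_output = ["ABC123", "XYZ789", "ABC12E"]
accuracy = character_accuracy(ocr_output, known_codes)
print(f"{accuracy:.1%}")
```

Exact matching is the strictest choice: a single misread character fails the whole value, so a figure computed this way is conservative.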
For hand-written documents where OCR is not capable of recognition, we offer a full transcription service. We have a proven track record of successfully delivering over 2 million hand-written records dating back to 1824 in a single project. We have experience with longhand cursive script in many languages covering a wide timeframe.
For repositories of images where content searching is not appropriate, we can use our infrastructure to consolidate multiple file types into a consistent archive containing just the file format of your choice. All folder structures are preserved during the work and the output is always validated against the input files to ensure accuracy. We have performed major document conversion projects, including the conversion of over 11 million pages of Microsoft Office and text files (DOC, DOCX, XLS, XLSX, TXT and PPT) to consistent PDF files.
We also have the ability to perform bulk file transformations, such as combining individual PDF files into multipage documents on a large scale. To find out more about our document conversion and OCR services, please contact us.
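The idea of validating output against input while preserving folder structure can be sketched in a few lines. The example below is a minimal sketch, not our production tooling: it assumes each input file should produce one non-empty PDF at the same relative path in the output folder, and the function name `missing_outputs` is hypothetical.

```python
from pathlib import Path


def missing_outputs(input_root, output_root, output_suffix=".pdf"):
    """List input files (relative paths) that have no matching converted file.

    Assumption: each input maps to a file with the same relative path and
    stem under output_root, with its extension replaced by output_suffix.
    """
    input_root = Path(input_root)
    output_root = Path(output_root)
    missing = []
    for source in sorted(input_root.rglob("*")):
        if not source.is_file():
            continue  # skip directories; only files are converted
        relative = source.relative_to(input_root)
        expected = (output_root / relative).with_suffix(output_suffix)
        # A missing or zero-byte output is treated as a failed conversion.
        if not expected.is_file() or expected.stat().st_size == 0:
            missing.append(relative)
    return missing
```

An empty list means every input file has a counterpart in the mirrored folder structure. Note that this simple mapping would treat `report.doc` and `report.docx` in the same folder as one output, so a real check would need a naming rule for such collisions, and it confirms only that an output exists, not that its content is correct.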