Our largest project of 2019 was conducted on behalf of one of our closest Print Services & Solutions partners, HP. Genus has worked on several projects with HP as its specialist microfilm advisor, and it was on the strength of these previous projects that HP trusted Genus to take on a project of this scale and complexity for one of its largest clients: the scanning and OCR conversion of over 45 million images on microfiche and 16mm roll film for a major UK retail bank. This project alone allowed us to increase the number of onsite Wicks and Wilson 7750 Production Microfiche Scanners by seven. It also helped us to reach an important milestone: to keep our digitisation studio at full production and to maintain other ongoing projects, a second shift was introduced, running from 17:00 to 22:00 every night. Here you can read about our planning and the processes involved in delivering such a large-scale, high-security project.
Security & Collection
Before the project could commence, the Bank’s internal security department made a site visit to Genus to satisfy themselves that the very highest levels of security could be maintained throughout the project. Genus passed this security audit at all levels. Genus has GPS-tracked vans with no signwriting, and the client insisted on the use of one of these vans for both the collection of the microfilms and the return of the data. The live tracker data was made available to the client. In addition, Genus provided two drivers for every trip, both for extra security and in case of medical emergencies. The vehicle was refuelled before each leg of the journey on which originals or data were going to be on board, ensuring that no refuelling took place while the van contained important originals or data. The microfilms and data were stored in secure, locked transport cases within the van. All data was held on secure encrypted hard drives with PIN code access. The PIN code was not shared with the customer until a pre-approved hard drive delivery procedure had taken place and the client had confirmed safe delivery to a named individual. This whole Secure Data Delivery process was documented and formed part of the contractual agreement, in line with our ISO 27001 Information Security policies.
Given the sensitive nature of the work, security was always going to be of paramount importance. Physical access to the scanning bureau and server room was controlled by staff passes, with all staff movements recorded in specialised software.
A Network Attached Storage (NAS) device was purchased especially for this project, along with 4 dedicated servers. These 5 devices were used exclusively for the project and locked in the server room. A private “air-gapped” network was put in place away from the main bureau network to ensure no connection to the outside world.
Genus has strict rules regarding storage devices, mobile phones and cameras, which are not allowed in the scanning bureau and must be stored in lockers in the breakout areas. Scanning Technicians are constantly reminded of the rules. CCTV cameras are in place and are used to monitor staff activities.
Given the complexity of the project and its interdependencies, Genus put in place a detailed Risk Register to identify risks, rate their probability and impact (low, medium, high), assign risk owners and set out mitigation plans.
During the project one of the minor identified risks did occur, and the mitigating action was carried out successfully, with all parties informed and ready for the resulting change to the process.
Genus developed a change strategy based on stakeholder analysis and its extensive experience in digitisation. The Genus project team met on a regular basis and produced weekly highlight reports to ensure all relevant parties understood each phase of the project. A concise Communication Plan was devised to supplement these actions.
Secure ingest of content
Upon arrival at Genus, the delivery van was driven into the loading bay and the doors closed behind it. The transport cases were moved directly into the secure studio storage area which is only accessible to supervisory staff. The security tabs were removed and transport cases opened, enabling the contents to be checked against the delivery manifest. At this point the client would have been notified of any discrepancies between the manifest and the delivered content, had there been any.
The individual boxes of microfiche were then removed from the transport cases and placed onto storage shelving in readiness for capture.
Capturing the data
Seven high-speed Wicks & Wilson 7750 ScanStations and two 8850 ScanStations were employed on this project to capture 40 million microfiche images and 10 million 16mm roll film images over a period of six months. The main fiche format was Computer Output Microfilm (C.O.M.), which presented 540 images on a 6” x 4” section of film. The fiche were captured at 300dpi at actual size (A4) and a preview of the entire fiche was displayed to the operator. Three of the corner images were selected and positioned within a location grid; the capture software then de-skewed the fiche and aligned each cell image correctly. The operator could then visually check a selection of images at full size for Quality Control purposes before accepting the fiche. While this was happening, the scanner captured the next fiche ready for processing.
Quality assurance is an important stage of any project, especially when dealing with such large quantities of data, so it is vital to streamline the process so that as much data as possible can be checked in a reasonable amount of time. Given the requirements of the customer and the nature of the content being captured, three requirements had to be met. The first was to make sure that all content was captured. To meet this need we put all fiche through one of our InoTec document scanners, which gave us a second count of the fiche to compare against the number captured. The second and third requirements were image quality and grid position. To ensure the fiche were captured correctly, we had to check that the three corner images were aligned accurately. Checking these three images verified not only the general quality of the fiche but also whether the grid was aligned correctly. The three corners we needed to check were always in the same position in the grid and so always had the same names. This allowed us to search for each corner image separately across the entire data set and provided a check on every single fiche captured. It also allowed us to confirm that each fiche was indexed accurately, by checking the first index field number against the indexed fiche file name.
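The corner-image check lends itself to a simple script. The following is a minimal sketch only, not the tooling used on the project: it assumes each captured image is stored as `<fiche id>/<cell name>.tif`, and the three corner cell names are hypothetical placeholders.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical names for the three corner cells; the real naming scheme is not documented here.
CORNER_NAMES = ("top_left", "top_right", "bottom_left")


def fiche_missing_corners(image_paths, corner_names=CORNER_NAMES):
    """Map each fiche id to the corner images that were not found for it.

    Assumes paths of the form '<fiche id>/<cell name>.<ext>'.
    Fiche with all corner images present are left out of the result.
    """
    cells_by_fiche = defaultdict(set)
    for raw in image_paths:
        path = PurePosixPath(raw)
        cells_by_fiche[path.parent.name].add(path.stem)

    required = set(corner_names)
    return {
        fiche: sorted(required - cells)
        for fiche, cells in sorted(cells_by_fiche.items())
        if not required <= cells
    }
```

An empty result means every fiche in the listing has all three corner images available for visual inspection; any fiche reported would be queued for re-capture or manual review.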
Data Extraction with Ephesoft
The customer requirement was to create a searchable PDF (using Optical Character Recognition, OCR) for each document (almost 1.5 million in total) from the scanned microfiche. Data was extracted from each scan and added to a database. The database entries were then used to create Comma Separated Value (CSV) files, allowing the customer to use Microsoft Excel to read the metadata and upload it to their database application.
With over 1.5 million images, data validation was challenging. The first (and easiest) check was to ensure that there was a PDF file for each scan. Because of the large number of files, batch scripts were written to list the folder contents and export the listings as CSV files. Excel could then be used to confirm that the number of scans going into the system matched the number of PDF files coming out. A list of input filenames and output filenames was also created.
The next part was more challenging. Each scan should have its own unique entry in the database, and this was checked in two ways. Firstly, using SQL (Structured Query Language), the number of database entries was counted and listed. Secondly, the database results were exported as CSV files.
The key data fields required for extraction were present in multiple places in each document. For each document, the system extracted the numeric key data field at both the top and the bottom of the document. These numbers should be identical, but the quality of the microfiche was not perfect. Given the sensitivity of capturing a wrong key data number, Ephesoft was used to check that the two numbers matched and that the number passed a Luhn checksum validation. Genus is proud to have achieved a 94.4% success rate with this check, validating 240 million characters.
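The scan-to-PDF reconciliation could equally be scripted in one step rather than via folder listings and Excel. This is a minimal sketch under stated assumptions: scans are `.tif` files, each PDF shares its scan's base filename, and base filenames are unique across subfolders. The folder layout and function names are illustrative, not those used on the project.

```python
from pathlib import Path


def stems(folder: Path, suffix: str) -> set[str]:
    """Base filenames (without extension) of all files under folder with the given suffix."""
    return {p.stem for p in folder.rglob(f"*{suffix}") if p.is_file()}


def reconcile(scan_dir: Path, pdf_dir: Path, scan_suffix: str = ".tif") -> dict:
    """Compare input scans with output PDFs by base filename."""
    scans = stems(scan_dir, scan_suffix)
    pdfs = stems(pdf_dir, ".pdf")
    return {
        "scan_count": len(scans),
        "pdf_count": len(pdfs),
        "missing_pdfs": sorted(scans - pdfs),     # scans with no PDF produced
        "unexpected_pdfs": sorted(pdfs - scans),  # PDFs with no matching scan
    }
```

Comparing names as well as counts catches the case where the totals agree by coincidence, for example one PDF missing and one produced twice under a different name.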
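To illustrate the kind of SQL involved, the sketch below counts the entries and lists any scan that appears more than once. It uses Python's built-in `sqlite3` module purely for demonstration; the table name `documents` and column `scan_name` are hypothetical, and the database actually used on the project is not described here.

```python
import sqlite3


def check_entries(conn: sqlite3.Connection, expected_scans: int) -> dict:
    """Count database entries and find scans with more than one entry.

    Assumes a hypothetical table `documents` with a `scan_name` column.
    """
    total = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    duplicates = conn.execute(
        """
        SELECT scan_name, COUNT(*) AS entries
        FROM documents
        GROUP BY scan_name
        HAVING COUNT(*) > 1
        ORDER BY scan_name
        """
    ).fetchall()
    return {
        "total_entries": total,
        "matches_scan_count": total == expected_scans,
        "duplicates": duplicates,  # list of (scan_name, entries) tuples
    }
```

A matching total with an empty duplicates list indicates one entry per scan; a matching total with duplicates present means some other scan has no entry at all.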
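For readers unfamiliar with the Luhn check, the following standalone sketch shows the logic of the two-part validation: the top and bottom values must match, and the value must pass the Luhn checksum. On the project this was configured within Ephesoft; the function names here are illustrative and this is not the Ephesoft configuration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Working from the rightmost digit, double every second digit;
    # if doubling gives a two-digit result, subtract 9 (same as summing its digits).
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def key_number_ok(top: str, bottom: str) -> bool:
    """Accept a key number only if both extractions agree and the value is Luhn-valid."""
    return top == bottom and luhn_valid(top)
```

The two tests complement each other: the comparison catches an OCR misread in one of the two locations, while the checksum catches the rarer case where both locations are misread in the same way.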
Secure Data Delivery
The completed image files and associated CSVs were packaged for delivery and then audited on our servers. Secure hard drives were selected with built-in AES-XTS 256-bit disk encryption. The disks are also fully ruggedised, with a tamper-proof design. A unique 12-digit password was created for each data delivery and delivered to the client in two parts.
The data was uploaded to the disk using file transfer software and an audit of the disk taken. The server and disk audits were then compared to ensure complete upload of data.
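One way to picture the server-versus-disk comparison is as two manifests that must agree. The sketch below assumes the audit is a mapping of relative file path to SHA-256 digest; the form of the audit actually produced on the project is not specified, so treat the functions and the choice of hash as illustrative.

```python
import hashlib
from pathlib import Path


def audit(root: Path) -> dict[str, str]:
    """Build a manifest of relative path -> SHA-256 hex digest for every file under root."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large image files are not loaded into memory at once.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_audits(server: dict[str, str], disk: dict[str, str]) -> dict:
    """Report files missing from the disk, extra on the disk, or differing in content."""
    return {
        "missing": sorted(set(server) - set(disk)),
        "extra": sorted(set(disk) - set(server)),
        "changed": sorted(k for k in server.keys() & disk.keys() if server[k] != disk[k]),
    }
```

The upload would be considered complete only when all three lists are empty.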
Upon receipt of the data, the client extracted the files from the disk, performed their own audit then re-formatted the disk to ensure no data remained.
The successful completion of this project was a very proud moment for Genus. We were able to provide a secure and auditable digitisation process for a very high-profile customer and we converted and validated a huge amount of data in only 6 months. The integration of complex data extraction software, dedicated servers and a stand-alone secure working area (which was built and commissioned specifically for this Customer) were huge achievements.
The complexity and business critical nature of the information scanned and then extracted for our customer meant that we had to design and implement auditable cross-checking mechanisms with reporting to ensure that the accuracy was maintained. When you consider the reduction ratio of the page images on the microform material (and the fact that we were working from used copies of the fiche and film with scratches and damage) achieving a 94.4% number validation over the entire body of material was significant.
The customer commissioned the project because the analogue fiche and film they held were being accessed manually using reader printers that were becoming unreliable, and for which parts and servicing were becoming impossible to procure. As the information on the media was so important to them (for the provision of their services), this situation represented a significant risk.
The service that Genus provided delivered many benefits: it not only replaced the analogue media with searchable digital images but also provided accurate extracted data for use in a database environment. This effectively removed the need to refer to the images for every enquiry, so the project delivered much more benefit to the customer than was originally envisaged.
Since completing the project, Genus has had numerous discussions with other organisations whose key data is locked in analogue microfilm and fiche media. Our proven track record has enabled our customers to trust us with their sensitive content for scanning. It has also enabled us to propose methodologies for extracting data into a usable format, so that they can derive much more benefit from scanning their materials than simply receiving digital versions of their documents.