Optical Character Recognition

From Intellogist



Introduction

Optical Character Recognition (OCR) is the process by which a computer analyzes a static image of a document (such as a TIFF, JPEG or Adobe PDF) and translates the words within the image into text characters. The text can then be modified, searched, or copied as in a standard text document. OCR technology is now employed in a wide variety of fields to digitize documents normally received or maintained in hard copy.


Accuracy

OCR technology is not without flaws. Inaccurate character identification remains an ongoing issue and can lead to poor or useless renderings of text from the original document. Numerous factors increase the error rate of the digitization process, including unusual, small or complicated fonts in the original, low-quality or damaged original documents, low-quality digital images of the original, and the limitations of the OCR software used.[1] Depending on these factors, the result of the OCR process can be up to 99% accurate.[2] Disciplines in which complete accuracy is paramount should therefore be wary of the residual error rate of at least 1% that this best-case figure implies. Additionally, OCR programs can be stymied by images or complex mathematical formulae placed within or close to text, and OCR should not be relied upon to convert these elements.
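To give a rough sense of what a 1% error rate means for a search term, the sketch below computes the chance that a string of a given length is rendered with no errors at all. It assumes the 99% figure applies per character and that errors are independent of one another. Both are simplifying assumptions made for illustration, not claims from the cited sources, and the function name is hypothetical.

```python
def prob_all_correct(n_chars: int, char_accuracy: float = 0.99) -> float:
    """Chance that every one of n_chars characters is recognized correctly.

    Assumes each character is misread independently with the same
    probability, which is a simplification of real OCR behavior.
    """
    if n_chars < 0:
        raise ValueError("n_chars must be non-negative")
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    return char_accuracy ** n_chars


if __name__ == "__main__":
    # Hypothetical search-term lengths, in characters.
    for length in (7, 14, 50, 200):
        chance = prob_all_correct(length)
        print(f"{length:>4} chars: {chance:.1%} chance of an exact rendering")
```

Under these assumptions, a 7-character string survives intact roughly 93% of the time, while a 50-character string does so only about 60% of the time, so longer exact-match queries are the ones most exposed to OCR errors.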


OCR and Patents

Many sources of patent information provide electronic text from original documents that have been processed with OCR software. (Most notably, the full text Patent Cooperation Treaty (WO/PCT) collections hosted by major commercial vendors are produced this way.) The advantages to users are numerous, and analogous to having a document produced in a digital format initially: the complete text of the documents can be searched and copied to make locating and comparing disparate patent documents easier. Keyword searching within patent specifications and claims can identify minor technical details that bibliographic information cannot convey, and such details are often critical in prior art searching.

Unfortunately, the pitfalls noted in the section above can limit the utility of the OCR text. In searches where an exact match is crucial, a 1% error rate can mean that important results go unfound if additional methods are not employed. Simplified examples follow:

  • In a genetic sequence search, the string “CTTCGTA” could be rendered as “CTTOGTA” if the OCR program mistakes the second “C” for an “O.”
  • A search for the computer code “cout << "Hello World!" << endl; cout <<” could fail because the code was misread as “cout << "He1lo World!" << endI; cout <<”, substituting a 1 for the first lowercase “l” in “Hello” and a capital “I” for the lowercase “l” in “endl.”
  • A search for a chemical compound such as “C8H10N4O2” would not find the mis-rendered compound “CBH10NA02” because the OCR program misread the “8” as a “B,” the “4” as an “A,” and the letter “O” as a zero. Small subscript characters like these are frequently difficult for programs to identify.
  • Even more simply, an error such as rendering “multi-cylinder” as “rnulti-cylinder” will impact results because the OCR program has mistaken a lowercase “m” for an “r” and an “n” next to each other.
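One way to compensate for substitutions like those in the examples above is to expand a query into a pattern that also accepts the likely misreadings. The following is a minimal sketch, assuming a small hand-written confusion table drawn only from the examples above. The table, the constant names, and the function name are all hypothetical, and a real table would need to be much larger.

```python
import re

# Hypothetical groups of characters that OCR may confuse with one another,
# taken from the examples above; far from exhaustive.
CONFUSION_GROUPS = ["CO0", "l1I", "8B", "4A"]

# Hypothetical multi-character misreadings: "m" may come out as "rn".
MULTI_CHAR = {"m": ["rn"]}


def ocr_tolerant_pattern(query: str) -> re.Pattern:
    """Compile a regex matching the query or its likely OCR misreadings."""
    parts = []
    for ch in query:
        alternatives = {ch}
        for group in CONFUSION_GROUPS:
            if ch in group:
                alternatives.update(group)
        alternatives.update(MULTI_CHAR.get(ch, []))
        # Longest alternatives first, then alphabetical, for a stable pattern.
        ordered = sorted(alternatives, key=lambda s: (-len(s), s))
        escaped = [re.escape(alt) for alt in ordered]
        if len(escaped) == 1:
            parts.append(escaped[0])
        else:
            parts.append("(?:" + "|".join(escaped) + ")")
    return re.compile("".join(parts))


if __name__ == "__main__":
    samples = [
        ("CTTCGTA", "sequence CTTOGTA as disclosed"),
        ("C8H10N4O2", "the compound CBH10NA02 is added"),
        ("multi-cylinder", "a rnulti-cylinder engine"),
    ]
    for query, ocr_text in samples:
        found = ocr_tolerant_pattern(query).search(ocr_text)
        print(query, "->", found.group(0) if found else "no match")
```

The trade-off is precision: every added alternative also widens the set of unrelated strings that match, so a pattern like this is better suited to narrowing candidates for manual review than to producing a final result list.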


Here the OCR program (Adobe Acrobat 6.0) has failed to accurately reproduce an equation. The top text is as rendered in the patent; the bottom text is the result of copying that text into Microsoft Word.


In a more extreme example, the same program has failed to reproduce a sentence comprehensibly. The word "metal" is not highlighted because the OCR program failed to recognize it altogether.


The error rate is likely to rise with the age of the document. Older patents were produced with unusual fonts and poorer reproduction technology, and the effects of age and wear on the legibility of text contribute to a higher error rate for OCR.


As an example of the difficulties involved with older patents, Google Patents (presumably using its OCR software, Tesseract) has misread the title of a patent from 1914.


Even with the above caveats, OCR technology is a valuable and necessary aid in searching large volumes of documents. A thorough searcher will take necessary steps to compensate for possible errors when querying collections produced via OCR.
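Approximate matching is one such compensating step. The sketch below uses Python's standard difflib module to flag words in OCR text that are close to, but not necessarily identical to, a search term. The function name, the 0.8 threshold, and the simple whitespace tokenization are assumptions chosen for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher


def fuzzy_find(term: str, text: str, threshold: float = 0.8):
    """Return (word, similarity) pairs for words in text that resemble term.

    Similarity is difflib's ratio, from 0.0 (nothing in common) to 1.0
    (identical). The comparison ignores case.
    """
    hits = []
    for word in text.split():
        # Drop surrounding punctuation so "housing." compares as "housing".
        candidate = word.strip(".,;:()\"'")
        if not candidate:
            continue
        ratio = SequenceMatcher(None, term.lower(), candidate.lower()).ratio()
        if ratio >= threshold:
            hits.append((candidate, ratio))
    return hits


if __name__ == "__main__":
    ocr_text = "A rnulti-cylinder engine with a metal housing."
    for word, ratio in fuzzy_find("multi-cylinder", ocr_text):
        print(f"{word} (similarity {ratio:.2f})")
```

A lower threshold catches more badly damaged words at the cost of more false hits, so the value is best tuned against a sample of the collection being searched.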


User Experience

A user has submitted the following tip on searching OCR'ed patent documents in PDF format by converting the document into Google Doc format:

The technique works only with patents in PDF format downloaded from Google Patents, so it applies only to US patents and published applications. I tried it with PDFs from PAT2PDF and from free.patentfetcher.com, but those did not work. A further limitation is that it works only with patent PDFs smaller than 2 MB and only on the first ten pages, regardless of the length of the patent. Use docs.google.com to upload the PDF file or folder of PDFs, enabling the middle option that permits conversion to Google document formats: "Convert text from PDF and image files to Google documents." Then search within Google Docs, or from within Gmail after selecting the option to "Search Mail and Docs" within Account Settings. I have not yet experimented with breaking a patent PDF into 10-page parts to test whether it is feasible to search a patent longer than 10 pages.


Sources

  1. “About Optical Character Recognition (OCR)” Microsoft Office Online, http://office.microsoft.com/en-us/help/HP030812551033.aspx. Accessed on December 16, 2008.
  2. Rice, Stephen V., Nagy, George, and Nartker, Thomas A. Optical Character Recognition: An Illustrated Guide to the Frontier. New York: Springer-Verlag New York LLC, 1999. Page 2.


Intellogist is brought to you by the patent search experts at Landon IP.