Google Learns To Read Text From Images!

3 11 2008

Google is number one for search. We’ve said this time and time again on Just Google It. Their textual search results are unrivalled. Their image search is being improved day after day by human interaction through the Image Labeler game. But it was rough going when they tried to index both at the same time: text on images.

The main way in which text is carried via an image is through scanned PDFs. In the past these were included in Google’s textual search results, but Google only indexed them based on a small amount of information, such as the page title, the file name or the content of the pages that link to them. However, Google have now made a major breakthrough for this kind of search, and for search altogether.

They have begun using Optical Character Recognition technology to work out which words appear in a PDF that contains a scanned image (normal PDFs have been indexed for a while now), and to include those words in search results. The Official Google Blog says:

Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world’s information accessible and useful.