HuPerText

Reference number: TIN2008-04998/TIN
Duration: 01/2009

Over the last decade, electronic communication of information has become an important part of everyday life. Internet-based applications such as e-mail and the World Wide Web have provided the platforms needed to enable massive information exchange on a daily basis in electronic format. Initially, most communications were based solely on encoded text, but as connection speeds have increased, images are now habitually included in electronic documents alongside the encoded text information transmitted.

Of great importance are digital images that contain textual information, be it advertisements, logos, headers, or other semantically important entities of the document. Although text in images is important for understanding these documents, automatic methods for retrieving, indexing, filtering, and analyzing information have no way to extract text from digital images; image text is therefore simply ignored by all automated processes. The reason is that complex colour digital images are distinctly different from paper documents, and classical Document Image Analysis (DIA) methods, developed primarily for paper documents, are inapplicable to this domain.

The industry has picked up on this issue and has adopted digital text images as the platform of choice for implementing CAPTCHA tests (“Completely Automated Public Turing test to tell Computers and Humans Apart”). CAPTCHAs are used extensively to control the abuse of online resources by robots, by verifying that the user is human. The preferred form of CAPTCHA test displays a slightly obscured text encoded as a digital image. Reading the text in such images is a fairly straightforward task for humans, but computers are unable to deal with even the simplest obscuring techniques. Following the same trend, spammers increasingly use images to transmit unwanted messages to users, and spam filters are struggling to respond.

Locating and extracting the textual content from digital images is therefore a hot topic with a plethora of identified commercial applications. It is also apparent that classical DIA methods have failed to produce reasonable results for these images, and a new approach is needed for the field to progress.

Digital text images (and document images in general) differ from most other images (e.g. medical images, satellite images, or real scenes) in the sense that they are prepared by human beings in order to be viewed and ultimately interpreted by other human beings. Therefore, consciously or subconsciously, the author employs certain design schemes so that the textual content stands out and is easy to read. It is therefore important to devise techniques inspired by human perception to locate and recognise text inside images.

Based on the one established fact about digital text images, namely that they are designed explicitly to be viewed by humans, we intend through this research activity to advance text extraction from colour digital images by drawing inspiration from human perception principles. We will investigate how existing computational models, principles, and mechanisms of human perception involved in the process of reading can be applied to the automatic location and segmentation of text in images. Models of attention, saliency maps, Gestalt grouping laws, and perceptual colour are a few of the concepts we will focus on in a holistic, multidisciplinary approach to the analysis of the complex colour images in our corpus.
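To give a flavour of the bottom-up saliency idea mentioned above, the toy sketch below computes a simple edge-density saliency map over a synthetic image: text-like regions tend to contain dense, high-contrast edges, so pooling gradient magnitude over small windows highlights candidate text areas. This is only an illustrative stand-in for the perceptual models the project will study, not the project's actual method; the window size and the synthetic test image are arbitrary choices for the example.

```python
import numpy as np

def edge_density_saliency(img, win=4):
    """Toy bottom-up saliency map: local edge density.

    Pools gradient magnitude over (2*win+1)-pixel windows, so regions
    with many high-contrast edges (as text regions typically are)
    receive high saliency. Illustrative only.
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                      # per-pixel edge strength
    h, w = mag.shape
    sal = np.zeros_like(mag)
    for y in range(h):
        for x in range(w):
            y0, x0 = max(0, y - win), max(0, x - win)
            sal[y, x] = mag[y0:y + win + 1, x0:x + win + 1].mean()
    return sal / (sal.max() or 1.0)             # normalise to [0, 1]

# Synthetic 32x64 image: flat background plus a high-frequency striped
# band standing in for a line of text.
img = np.full((32, 64), 200.0)
stripes = np.where((np.arange(48) // 2) % 2 == 0, 30.0, 220.0)
img[12:20, 8:56] = stripes

sal = edge_density_saliency(img)
# The striped band should be far more salient than the flat background.
print(sal[16, 30] > 5 * sal[4, 30])
```

A real system would of course replace the synthetic image with a colour document image and the edge-density measure with a perceptually grounded saliency model, but the overall pipeline (feature map, pooling, thresholding of salient regions) stays the same.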

We will initially focus on three specific groups of digital images: Web images, spam advertisements, and CAPTCHA tests. Web images and spam advertisements are commercially prime targets, since the need for their automatic analysis, indexing, and filtering is pressing. Text-based CAPTCHA tests are of interest because their effectiveness depends on the trade-off between keeping the tests pleasant and easy for humans while making them difficult for automatic recognition. This is an area that can directly benefit from a better understanding of how humans read text in images.

Through this activity we aim to approach the problem of text extraction from complex digital images in a holistic manner. The Document Analysis Group (DAG) is a leading group in the field of document image analysis, and its members have long experience in the field (see also section 3.2). Among the project members we share experience in a diverse range of topics, such as colour segmentation, word spotting, performance evaluation, and human perception. For the particular project outlined here we will also establish a collaboration with colleagues from the Cognitive Science and Neuroscience Group, Department of Psychology, University of Liverpool, UK.

Funding bodies

Ministerio de Ciencia e Innovación

Partners

University of Liverpool

Director

Dimosthenis Karatzas 

Members

  • David Fernández 
  • Jon Almazán 
  • Lluís-Pere de las Heras Caballero 
  • Muhammad Muzzamil Luqman 
  • Partha Pratim Roy