Teaching computers to understand handwritten documents can help bring our past to life, giving value to billions of documents currently stored in thousands of archives and libraries all over Europe. These collections contain not only text documents but also drawings, pictures, scientific cards, maps and music scores.
In every single historical building in Europe, there are ancient documents with nominal information that makes it possible to trace the history of our ancestors or the migratory waves in a particular context. As our CVC researchers see it, the goal is to bring a huge amount of information to the surface and thereby understand the evolution of European society through its main actors: ourselves.
XARXES is funded by Recercaixa, the social program of La Caixa dedicated to research, and it is the continuation of another joint project (EINES) between the CVC and the Centre for Demographic Studies (CED) of the Universitat Autònoma de Barcelona. “By using computer vision as an enabling technology, we are going to include computational methods in order to automatically incorporate contextual and semantic information to the original population sources”, states Dr. Alicia Fornés, fellow researcher at the CVC/UAB. This multidisciplinary project is coordinated by Dr. Fornés (CVC) and Dr. Joana Maria Pujadas (CED).
“Secondly, we will use record linkage techniques to connect different sources to be able to construct historical social networks”, adds Dr. Fornés, “and finally, we will actively incorporate the participation of citizens and archivists, both in the extraction of demographic information through gamification, and in the design of new user experiences”. The objective is to facilitate the consumption and dissemination of historical knowledge in an illustrative and pedagogic way. But at the same time, “social and historical researchers will enjoy of having open databases, something not still really common in Spain in comparison with Sweden or United States”, states Dr. Pujadas.
Information extraction from historical documents
“The concept of information extraction can be explained in the following way: it is reading a document image, understanding the meaning of what is written in it, and filling a database. The idea is that you can automatically extract the knowledge of what is contained in the documents, and store it into a database. Consequently, all this information is accessible, ‘searchable’, and available not only for academia, but also open to society”, Dr. Fornés explains.
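The last two steps of that description (giving meaning to what was read, then filling a database) can be illustrated with a minimal sketch. It assumes the handwriting has already been transcribed into a line of the form "Name Surname, occupation, place"; the lexicons, the `extract_record` and `store` helpers and the `person` table are hypothetical and are not taken from the project.

```python
import sqlite3

# Hypothetical lexicons; a real system would learn these categories
# from annotated transcriptions instead of hard-coding them.
OCCUPATIONS = {"farmer", "weaver", "blacksmith"}
PLACES = {"barcelona", "sant feliu"}


def extract_record(transcription: str) -> dict:
    """Turn a transcribed line into labelled fields (name, surname, ...)."""
    parts = [p.strip() for p in transcription.split(",")]
    record = {"name": None, "surname": None, "occupation": None, "place": None}
    if parts and parts[0]:
        tokens = parts[0].split()
        record["name"] = tokens[0]
        record["surname"] = " ".join(tokens[1:]) or None
    # Assign the remaining comma-separated parts by looking them up in the lexicons.
    for part in parts[1:]:
        key = part.lower()
        if key in OCCUPATIONS:
            record["occupation"] = part
        elif key in PLACES:
            record["place"] = part
    return record


def store(conn: sqlite3.Connection, record: dict) -> None:
    """Insert one extracted record into a (hypothetical) person table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person "
        "(name TEXT, surname TEXT, occupation TEXT, place TEXT)"
    )
    conn.execute(
        "INSERT INTO person VALUES (:name, :surname, :occupation, :place)", record
    )


conn = sqlite3.connect(":memory:")
store(conn, extract_record("Joan Puig, farmer, Barcelona"))

# Once stored, the information is searchable with an ordinary query.
rows = conn.execute(
    "SELECT name, surname FROM person WHERE occupation = 'farmer'"
).fetchall()
print(rows)
```

The point of the sketch is the difference between a transcription (a string) and extracted information (labelled fields that can be queried).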
According to Europeana estimates, only 10% of the documents stored in European archives and museums have been digitized. “From this 10%, which is already an incredibly huge number, the amount of documents that are transcribed and stored into datasets is really, really ridiculous”. Therefore, instead of devoting an enormous amount of human resources to reading and extracting this information, the idea is to use document image analysis techniques to process the information contained in these collections as automatically as possible. Rather than producing a mere transcription, the aim is to understand what is being transcribed and thus automatically extract and store the information it contains.
Currently, the recognition of text in historical manuscripts is quite challenging due not only to the physical degradation of paper, but also to the handwriting style (which differs considerably from person to person) and the use of old, regional vocabulary and dialects. “This is the reason we’ve also decided to use word spotting techniques, which are defined as the detection of key words in manuscripts in order to search and index the information contained”, as stated by Dr. Fornés.
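As an illustration of word spotting, the sketch below shows one classic query-by-example approach, which is not necessarily the one used in the project: each word image is reduced to a sequence of numbers (for instance, the amount of ink in each pixel column) and candidates are ranked by dynamic time warping (DTW) distance to a query word. The profiles and word identifiers are made-up values.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def spot(query, candidates, threshold):
    """Return (distance, word_id) pairs within the threshold, best first."""
    hits = [(dtw_distance(query, profile), word_id)
            for word_id, profile in candidates.items()]
    return sorted(hit for hit in hits if hit[0] <= threshold)


# Hypothetical column profiles of a query word and three word images.
query = [0, 2, 5, 2, 0]
candidates = {
    "w1": [0, 2, 2, 5, 2, 0],  # same shape, written slightly wider
    "w2": [5, 5, 0, 0, 5],     # a different word
    "w3": [0, 3, 5, 2, 0],     # same shape with a small variation
}
matches = spot(query, candidates, threshold=2)
print(matches)
```

DTW tolerates the stretching and compression that come from different writers, which is why a wider rendition of the same shape can still match without the word ever being fully transcribed.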
How demographic networks are built
A typical application scenario for information extraction is demographic documents, since they contain people’s names, birthplaces, dates, occupations, etc. In this scenario, extracting the key contents and storing them in databases gives access to those contents and makes it possible to envision innovative services based on genealogical, social or demographic searches.
Once the information is extracted, demographic networks will be built by means of record linkage and videogames. Record linkage refers to the task of combining data about the same person across different data sources (such as databases, books or websites). For example, one individual may appear in several documents, such as censuses and birth, marriage or death certificates.
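A minimal sketch of record linkage between two sources follows. It assumes each record has an identifier, a name and a birth year, and it links two records when their names are similar enough and their birth years are close. The field names, thresholds and sample records are hypothetical, and the string similarity from Python's standard `difflib` module stands in for whatever comparison the project actually uses.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(source_a, source_b, min_similarity=0.85, max_year_gap=2):
    """Pair records from two sources that plausibly describe the same person."""
    links = []
    for ra in source_a:
        for rb in source_b:
            # Cheap check first: birth years must roughly agree.
            if abs(ra["birth_year"] - rb["birth_year"]) > max_year_gap:
                continue
            sim = name_similarity(ra["name"], rb["name"])
            if sim >= min_similarity:
                links.append((ra["id"], rb["id"], round(sim, 2)))
    return links


# Hypothetical census and marriage records with spelling variation.
census = [
    {"id": "c1", "name": "Joan Puig", "birth_year": 1850},
    {"id": "c2", "name": "Maria Ferrer", "birth_year": 1862},
]
marriages = [
    {"id": "m1", "name": "Joan Puigh", "birth_year": 1851},
    {"id": "m2", "name": "Marta Ferran", "birth_year": 1840},
]
links = link_records(census, marriages)
print(links)
```

Tolerating small differences matters because historical sources rarely spell a name or record a date in exactly the same way twice; the linked pairs are what turn isolated records into a network.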
Videogames are the second ingredient. “The idea is to use videogames to motivate people to help in the transcription and information linkage”. Transcription is a key aspect because it is used to teach computers how to read and also to learn that a certain word means something specific (e.g. a surname or a place). The more we transcribe, the more the computer learns, moving towards an almost automatic transcription. With videogame techniques, the more a person plays, the more the transcription improves. In this way, a mechanical and potentially monotonous process is transformed into a motivating activity.
Within this project, the municipalities that will be linked are located in the surroundings of Barcelona, and they will serve as a first experiment. The same model will then be replicable across the rest of Spain or Europe. The final aim would be to interconnect all the information stored in the local archives spread all over the old continent.
In this last step, the knowledge and expertise of the Centre for Demographic Studies of the UAB are crucial. Demographers will put all this information into context and interpret the demographic networks in order to narrate the centuries of history stored in our local libraries.