The Marriage Licenses ground-truth is compiled from the Marriage Licenses Books conserved at the Archives of the Cathedral of Barcelona.
The Marriage Register Books are composed of 244 books with information of approximately 600,000 unions celebrated in 250 parishes between 1451 and 1905. One example can be seen in Figure 1. In addition to the marriage licenses, each book includes an index with all the husband’s family names and the page number where the marriage information appears. In some cases (see Figure 2), the wife’s family name is also included.
Each marriage license (see Figure 3) contains information about the husband’s occupation, husband’s and wife’s former marital status, socioeconomic position signaled by the fee imposed on them, and in some cases, fathers’ occupations, place of residence or geographical origin.
The original documents have been digitized at 300 dpi in true colors.
BH2M: the Barcelona Historical Handwritten Marriages database
The BH2M database is composed of one volume, written by the same writer. It contains 174 pages, 1,740 registers, 5,498 lines. The database consists of annotated images in an XML hierarchical structure (from individual words to blocks of text). The minimum unit of information is a bounding box of an individual word, with the currently transcription. Additional attributes like line or register number, or category, are associated to words. This representation allows to easily retrieve a GT from the meta-data adapted to the tasks being tested (segmentation, word spotting, handwriting recognition, etc.). Three levels in the XML meta-data associated to images can be identified. The first one is designed to evaluate tasks for layout analysis, the second one for text transcription, and the third one for context dependent interpretation.
Each page is composed by different physical bocks: text blocks, paragraph, lines and words. A document has three text blocks: left block or husband’s surname, right block or tax and central block or license. The second level corresponds to segmented lines: an accurate segmentation of the text lines is provided, including the ascenders and descenders of the text line.
The third level consists of text words. The attributes stored for each word are:
- The ID of the page where the word is.
- The bounding box coordinates of the word.
- The text block where the word is located (in case the word is inside a text block).
- The text line where the word is located.
- The appearance order inside the text line.
- The word also stores information about special cases. For example, if the word is part of a title, a comment, a cross-out word, etc.
The handwritten text was literally transcribed by volunteers using a crowd-sourcing platform. Given the complexity of the handwriting style due to many subtle spelling variants and the language itself, a posterior revision by experts demographers was performed to ensure its correctness.
The complete set of transcription rules is described here.
The next table summarizes some basic figures of the BH2M text transcriptions (for the old marriage records, we have proposed 3 partitions):
|Unique Words classes||3,060||1,535||1,757||3,360|
|Words that appear 1 time||1,831||942||1,100||1,710|
|Words that appear 2 times||397||213||234||552|
|Words that appear >2 times||832||380||423||1,098|
|Out of Vocabulary (OOV) words||–||594||1,082||–|
In addition to word transcription and location, the atomic units of this database are also labelled with meta-text information. Next, we describe the mentioned semantic information.
Each marriage license has the purpose of accounting for a prospective marriage. Hence the licenses contain similar information and structure, albeit the structure can vary over centuries. The general sub-parts of a license (L) are, in appearance order, the date-related (D) information, the husband-related (H) information and the wife-related (W) information. These three sub-parts are joined by keywords as follows: date (DI) and husband parts are connected by rebere de (we received from in old Catalan), husband and wife (HI and WI) information are connected by ab (abbreviation of with in old Catalan). Both husband and wife parts can be divided in two sub-parts, being his/her own information (e.g. name, surname, home-town) and the correspondent parents (HPI and WPI) information (e.g. father’s name, deceased parents). These two last parts are connected by the keywords fill de and filla de (son of and daughter of in old Catalan) respectively. Formally, the synthetic rules are:
|D ^ H ^ W||=> L|
|DI ^ rebere ^ de||=> D|
|HI ˅ (fill ^ de ^ HPI)||=> H|
|WI ˅ (filla ^ de ^ WPI)||=> W|
The semantic information is stored in the text word layout. The words are first categorized in several classes: husband, wife, husband family, wife family and other information. Each of these general class tags can be accompanied by more specific sub-class tags in a semantic and ontology-like way.
Getting the ground truth dataset
The following ground truth dataset may only be used for non-commercial and research purposes. For other purposes, please contact us first. Additionally, if you use this ground truth in your scientific work or publications , please cite this work as follows:
- D. Fernández-Mota, J. Almazán, N. Cirera, A. Fornés and J. Lladós. “BH2M: the Barcelona Historical Handwritten Marriages database”. International Conference on Pattern Recognition. pp. 256-261, August 2014.
|Subset 1||4 Mb||A subset of the database that includes the layout of the different records and corresponding lines.|
There is a new version of this database, used in a international competition: link here (http://www.cvc.uab.es/5cofm/competition/)
If you have any questions or suggestions, please contact Alicia Fornes.