Chinese Knowledge and Poetry Medieval Library
Most of the body of texts acquired by the CHI-KNOW-PO team was: 1. digitized as images of ancient books preserved in French libraries (and published as such); 2. transcribed into full text by handwritten text recognition (HTR); 3. converted to XML-TEI format.
All these traditional books are written in columns and read from top to bottom and right to left.
ILLUSTRATION
To develop a model for the HTR step, a sample of XXX images of single or double pages was annotated on the Calfa Vision platform.
This step of the work included the annotation of page layout and the transcription of the text.
Layout analysis includes three layers of annotation: 1. Regions 2. Baselines 3. Polygons.
Regions are defined by their contours – generally a rectangle – and their type.
In the CHI-KNOW-PO Corpus, we tagged each region with one of the following types:
1. MainText (main zone where the text appears)
2. MainTextTableOfContents (when the main zone corresponds to a table of contents)
3. Marginalia (for additional handwritten comments by readers)
4. MarginaliaMetadata (liminal zone on the side of the page where titles and page numbers appear)
5. MarginaliaPageNumber (when libraries added an Arabic numeral to count pages)
6. MarginaliaCaption (when illustrations are present, to identify the corresponding caption)
7. Image (for illustrations)
8. ImageSeal (for Chinese seals)
9. ImageStamp (for library stamps)
Baselines are straight lines oriented from top to bottom. A baseline may correspond to a whole column. However, because commentaries are interspersed within the base text columns as double columns, one column is often divided into several baselines.
ILLUSTRATION
Baselines are also tagged by type, which may be one of the following:
1. Text (for main text), in a MainText region
2. Commentary (for commentaries), in a MainText region
3. Title
4. AuthorName
5. MarginaliaLine, in a Marginalia region (for handwritten commentaries by readers)
6. PageNumber
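As an illustration, the two tag sets can be written down as a small data model. The sketch below is not the Calfa Vision data format: the Python names (`RegionType`, `BaselineType`, `EXPECTED_REGION`, `is_consistent`) are hypothetical, and the check only enforces the three baseline/region pairings stated above, read strictly. All other combinations are left unconstrained.

```python
from enum import Enum


class RegionType(Enum):
    """Region tags used in the corpus (values are the tag names)."""
    MAIN_TEXT = "MainText"
    MAIN_TEXT_TOC = "MainTextTableOfContents"
    MARGINALIA = "Marginalia"
    MARGINALIA_METADATA = "MarginaliaMetadata"
    MARGINALIA_PAGE_NUMBER = "MarginaliaPageNumber"
    MARGINALIA_CAPTION = "MarginaliaCaption"
    IMAGE = "Image"
    IMAGE_SEAL = "ImageSeal"
    IMAGE_STAMP = "ImageStamp"


class BaselineType(Enum):
    """Baseline tags used in the corpus (values are the tag names)."""
    TEXT = "Text"
    COMMENTARY = "Commentary"
    TITLE = "Title"
    AUTHOR_NAME = "AuthorName"
    MARGINALIA_LINE = "MarginaliaLine"
    PAGE_NUMBER = "PageNumber"


# Assumption: only the pairings named in the text are checked.
EXPECTED_REGION = {
    BaselineType.TEXT: RegionType.MAIN_TEXT,
    BaselineType.COMMENTARY: RegionType.MAIN_TEXT,
    BaselineType.MARGINALIA_LINE: RegionType.MARGINALIA,
}


def is_consistent(baseline_type, region_type):
    """Return False only when a stated baseline/region pairing is violated."""
    expected = EXPECTED_REGION.get(baseline_type)
    return expected is None or expected is region_type
```

A check of this kind could be run over exported annotations to flag, for example, a Commentary baseline that was drawn inside a Marginalia region by mistake.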
The tagging of baselines was crucial for retrieving the text in the right order (i.e. the order in which a human reader would read it).
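One possible way to recover this reading order from baseline coordinates is sketched below. It is not the procedure used by the project or by Calfa Vision. It assumes, hypothetically, that each baseline is reduced to a horizontal position `x` and a vertical extent (`y_top`, `y_bottom`), that columns form a regular grid of known `column_width` starting at `right_edge`, and that successive segments within a column do not overlap vertically unless they are the two halves of a double-column commentary.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    label: str      # hypothetical identifier, used here for illustration only
    x: float        # horizontal position of the (vertical) baseline
    y_top: float    # where the baseline starts, at the top
    y_bottom: float # where the baseline ends, at the bottom


def reading_order(baselines, right_edge, column_width):
    """Order baselines column by column (right to left), top to bottom,
    reading the right half of a double-column commentary before the left."""
    # 1. Assign each baseline to a column, counting from the right edge.
    columns = {}
    for b in baselines:
        index = int((right_edge - b.x) // column_width)
        columns.setdefault(index, []).append(b)

    ordered = []
    for index in sorted(columns):  # column 0 is the rightmost one
        column = sorted(columns[index], key=lambda b: b.y_top)
        i = 0
        while i < len(column):
            # 2. Gather a band of vertically overlapping baselines.
            band = [column[i]]
            band_bottom = column[i].y_bottom
            i += 1
            while i < len(column) and column[i].y_top < band_bottom:
                band.append(column[i])
                band_bottom = max(band_bottom, column[i].y_bottom)
                i += 1
            # 3. Inside a band, the right-hand sub-column is read first.
            band.sort(key=lambda b: -b.x)
            ordered.extend(band)
    return ordered


page = [
    Baseline("D", 850, 0, 600),     # second column, plain text
    Baseline("b2", 925, 202, 400),  # commentary, left half
    Baseline("C", 950, 400, 600),   # main text resumes
    Baseline("A", 950, 0, 200),     # first column, main text
    Baseline("b1", 975, 200, 400),  # commentary, right half
]
print([b.label for b in reading_order(page, right_edge=1000, column_width=100)])
# → ['A', 'b1', 'b2', 'C', 'D']
```

Real pages are less regular than this grid assumption, so the baseline types (Text versus Commentary) would remain a useful additional signal for resolving ambiguous cases.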
Polygons were generated by the engine developed by the Calfa team and emended when needed.
Two kinds of polygons were generated: 1. polygons 2. bounding boxes (b-boxes)
The 2010 edition of the Hanyu da zidian 漢語大字典 dictionary identifies more than 60,000 different characters.
This very large number of glyphs represents a challenge for HTR processing.
The situation becomes even more complex due to:
1. Taboo characters: The characters that compose the name of the emperor could not be written or carved on wood during his reign. They might be replaced by another standard character or slightly modified. In the second case, the modified character is not available in Unicode.
2. Writing styles: Some editions were written in a specific style (see for instance the guange ti for the Siku quanshu edition). Although these are stylistic differences, some of the modified characters are available as separate glyphs in the Unicode set, but not all of them are.
3. Variants and simplified glyphs: A similar situation arises with variants and orthographic simplifications, since some characters have several distinct forms in Unicode.
4. Absence of a standard version: No “standard version” is unanimously agreed upon.
5. Absence of some characters in the Unicode set: Finally, some rarely seen characters have no representation in the Unicode set.
The number of glyphs we might encounter is hence far greater than the 60,000 indexed in the reference dictionary mentioned above, and some of them cannot be written in Unicode at all.
Because of this situation and because the aim is to run text mining scripts on the corpus to find echoes within it (meaning that one character needs to be read as one unique character whatever its graphic expression is), it has been decided to standardize characters during the annotation phase.
To decide which version of a character stands as the standard version, we rely on the Dictionary of variants compiled at the Academia Sinica: https://dict.variants.moe.edu.tw/variants/rbt/home.do (last accessed July 25th 2024).
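The effect of this standardization on text mining can be sketched as a simple character mapping. The code below is an illustration only: in the project the standardization is done by annotators during transcription, not by a script. The names `VARIANT_TO_STANDARD` and `normalize` are hypothetical, and the three variant pairs are illustrative placeholders that should be checked against the dictionary rather than taken as its content.

```python
# Hypothetical mapping from variant forms to the form chosen as standard.
# The pairs are illustrative; the real choices come from the reference dictionary.
VARIANT_TO_STANDARD = {
    "爲": "為",
    "羣": "群",
    "峯": "峰",
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(VARIANT_TO_STANDARD)


def normalize(text):
    """Replace each known variant with its standard form; leave the rest as-is."""
    return text.translate(_TABLE)


# Two transcriptions that differ only by variant forms become identical,
# so a search for echoes treats them as the same string.
a = "羣峯爲山"
b = "群峰為山"
print(normalize(a) == normalize(b))
```

Characters that have no Unicode representation at all cannot be handled by a mapping of this kind, which is one reason the standard form is chosen at annotation time.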