In recent years, the Heidelberg Research Architecture project Early Chinese Periodicals Online (ECPO) has evolved from a data silo into an open-access research platform. In the first decade of its existence, the project focused on systematizing digitized early Chinese press material. This resulted in a searchable database of image scans and bilingual metadata with over 435,000 entries, including 300,000 scans, 85,000 records and 50,000 agent names from Republican-era magazines and newspapers.
The ECPO platform was implemented in collaboration with the Institute of Modern History, Academia Sinica, Taiwan, and was made possible by funding from the Chiang Ching-kuo Foundation for International Scholarly Exchange. The platform has since been developed with further support from several institutions at the University of Heidelberg: the Centre for Asian and Transcultural Studies (CATS) Library, the Heidelberg Centre for Transcultural Studies (HCTS), the Institute of Chinese Studies and the Research Council Cultural Dynamics in Globalized Worlds. The Konfuzius-Institut Heidelberg and the University of Erlangen-Nürnberg are affiliated partners.
As the material basis of the database consists mostly of image scans, the project has been running experiments on one Republican newspaper to explore approaches to full-text generation. Computer-aided processing of image scans of historical periodicals remains challenging with the current state of technology, in particular because processing standards for Latin-script newspapers are not applicable to the Chinese context. Only with new approaches in machine learning has it become possible to transform material that was inaccessible just a few years ago. However, many challenges remain. Extremely complex layouts make reliable automatic page segmentation difficult, and this has so far prevented full-text generation for these newspapers, even within China.
The application of artificial intelligence requires a ground truth data set. This error-free, manually corrected text with structural information is used both for evaluating and for training software models for text and layout recognition. In the fall of 2021, the project successfully implemented OCR on a sample from the newspaper 晶報 Jing bao (The Crystal), with a character error rate below 3% (Henke 2021). On that basis, the project is now expanding and generalizing its approach. With additional funding from the Research Council Cultural Dynamics in Globalized Worlds for the first half of 2022, the project is currently producing a new data set. The project’s aim is to offer a solution for automatically producing full text from Republican newspapers using neural networks and machine learning.
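To make the reported figure concrete: character error rate (CER) is commonly defined as the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch of that common definition, not the project's own evaluation code; the function names and the sample strings are hypothetical and chosen only for illustration.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # reference character missing from hypothesis
                curr[j - 1] + 1,     # extra character in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the ground-truth length."""
    if not reference:
        raise ValueError("reference (ground truth) must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a four-character line.
ground_truth = "晶報上海"
ocr_output = "晶报上海"
print(character_error_rate(ground_truth, ocr_output))
```

Under this definition, a CER below 3% means fewer than three character edits per hundred ground-truth characters. In practice, both strings would usually be normalized in the same way (for example with regard to whitespace and character variants) before comparison; how that was handled in the reported experiment is not stated here.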
The project’s current work will not only further its original aims, but will also contribute to the field of research as a whole. Once the project’s network models and data sets are published, its results can be reproduced and evaluated, and its approaches can be adopted by others in the field. Although processing non-Latin scripts is still a challenge in many cases, the project hopes that its work may serve as a good-practice example for such initiatives.
Reference:
Henke, Konstantin. Building and Improving an OCR Classifier for Republican Chinese Newspaper Text. BA thesis. Heidelberg University Library. 2021.
DOI: 10.11588/heidok.00030845