Report on “Computational Methods for Chinese History: A ‘Digging into Data Challenge’ Training Workshop”

The “Computational Methods for Chinese History: A ‘Digging into Data Challenge’ Training Workshop”, organized by the China Biographical Database (CBDB) project, was held at Harvard University on October 17, 2015. The workshop was designed to show researchers of Chinese history how to use and manipulate data of interest, and to showcase projects that make use of computational methods. To provide hands-on training and demonstration, the event was held in a computer lab in Harvard’s Science Center. Over 50 participants from Harvard and beyond, ranging from graduate students to senior scholars, took part in this one-day workshop.

The workshop was part of the Automating Data Extraction from Chinese Texts (DID-ACTE) Project, which aims to provide humanists and social scientists with the means to transform historical Chinese sources into structured data. The project was funded by the Digging into Data Challenge, an international research initiative to develop big data analysis methods for the humanities and social sciences.

The first presentation was given by Michael A. Fuller (UC Irvine), the designer of the structure of CBDB, which is a relational database with biographical information about more than 360,000 individuals, primarily from the 7th through 19th centuries. This data is open to all researchers for statistical, social network, and spatial analysis, and can also serve as a kind of biographical reference. Fuller introduced some concepts in modelling historical data, then explained the advantages of a relational database for storing biographical information. He also guided the workshop participants through the installation and the basic operations of making queries and exporting data in the standalone version of CBDB, which is in MS Access format and downloadable from the project’s website.

In the next session, Lik Hang Tsui (Harvard University) introduced the open-source platform MARKUS, which was developed by the European Research Council-funded project “Communication and Empire: Chinese Empires in Comparative Perspective”. He showed how one could use the platform’s different techniques and reference tools for reading a wide variety of Chinese historical and literary texts, including the tagging of personal names, dates, place names, official titles, etc. Hongsu Wang (Harvard University) further demonstrated methods of extracting and converting such textual information for analysis. These methods allow users to put the tagged data to use in visualization, which was the theme of the next two presentations.

In his presentation, Peter K. Bol (Harvard University) demonstrated the uses of spatial analysis for historical GIS data from China. He outlined the kinds of research questions that could be asked or even answered by applying GIS techniques to data about China, such as data from the open-access China Historical GIS project and ChinaMap. Mapping the results of queries on such data enables researchers to identify further points of interest that are related to locational factors. Song Chen’s (Bucknell University) presentation concerned historical social networks. He gave a concise introduction to concepts in social network analysis, then provided a step-by-step tutorial on how biographical data from CBDB could be visualized in the form of network graphs in the application Gephi.

The workshop concluded with four presentations of case studies that evolved from digital projects. Hang Yin (Peking University), the former project manager of the CBDB editorial team in Beijing, reflected on the workflows by which their team inputs, processes, and cleans up data for CBDB in both manual and semi-automated ways. He reminded researchers to be aware of the possible pitfalls of manual data processing if the goals are not adequately defined. Donald Sturgeon (Harvard University) introduced his study of text reuse based on the Pre-Qin and Han data generated from his Chinese Text Project. By analyzing and visualizing these textual relationships, he identified the clustering of texts according to schools of thought of the time. Xin Wen’s (Harvard University) study was about the military garrison (fubing) system in the early stages of the Tang dynasty. By mapping the locations of those garrisons, which were of crucial importance to the empire’s military strength, he observed that they did not correspond to the population density of the time. Instead, the elites were clustered along the capital corridor, indicating the political significance of that region. The final presentation by Weichu Wang (Harvard University) took a comprehensive look at families that produced multiple jinshi degree holders in Ming China. Using newly available name lists of degree holders in CBDB, she was able to show the geographical distribution and other characteristics of such families as part of her effort to quantify and analyze social mobility in China during that period.

The workshop attracted historians from a wide variety of fields in East Asian studies. Their interest in this workshop is testimony to how digital methods and datasets are transforming the study of Chinese history. Scholars can no longer afford to ignore the potential of these new research approaches.

Lik Hang Tsui

Harvard University

Nov. 2015
