Key Objectives
"Automating Data Extraction from Chinese Texts" is designed as an international and interdisciplinary collaboration that will facilitate and promote research techniques for large-scale structured datasets derived from unstructured corpora of Chinese texts. The platform will allow researchers and students to learn how to harvest comprehensive structured datasets from texts.
The written record of China extends 2200 years into the past, and much of it has already been transformed into searchable text by academic and cultural institutions, commercial vendors, and individuals. However, the user interfaces of current text repositories typically allow only keyword and phrase searches. Without sophisticated methods of extracting data from those texts according to researchers’ needs and interests, the use of extensive data in research will continue to languish. Therefore, the Automating Data Extraction from Chinese Texts project is developing a means of transforming texts written in classical Chinese into highly structured data. We aim to lift the current search constraints by creating a platform which will enable users to apply data-mining tools to the texts of their choosing.
Users will be able to upload texts, query a desired type of information, tag and code it, and extract the resultant data into a spreadsheet or other structured format. In contrast to text analysis techniques designed to identify broad trends, this project aims to capture precise data in its original context, that is, to preserve its position in the text as well as the surrounding information. The platform thus supports an array of research questions, allowing scholars not only to discover how and in what context certain terms are used within a given corpus of texts, but also to analyze groups of related data using geospatial, statistical, and network analysis.
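The workflow described above (query a type of information, keep its position and surrounding text, export to a spreadsheet) can be illustrated with a small sketch. This is not the platform's own implementation: the function names `extract_in_context` and `to_csv`, the sample sentence, the courtesy-name pattern, and the four-character context window are all assumptions made for illustration.

```python
import csv
import io
import re


def extract_in_context(text, pattern, window=4):
    """Return one record per regex match, keeping its offsets and nearby text."""
    records = []
    for m in re.finditer(pattern, text):
        start, end = m.span()
        records.append({
            "match": m.group(0),
            "start": start,  # character offset where the match begins
            "end": end,      # character offset just past the match
            "left_context": text[max(0, start - window):start],
            "right_context": text[end:end + window],
        })
    return records


def to_csv(records):
    """Serialize the records as CSV text that a spreadsheet can open."""
    fields = ["match", "start", "end", "left_context", "right_context"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Illustrative sample in the style of a biographical entry (assumption).
sample = "王安石，字介甫，撫州臨川人。蘇軾，字子瞻，眉州眉山人。"

# Hypothetical pattern: the character 字 followed by a two-character courtesy name.
courtesy_name = r"字[^，。]{2}"

records = extract_in_context(sample, courtesy_name)
csv_text = to_csv(records)
print(csv_text)
```

Because each record stores character offsets alongside the matched string, a tagged item can always be traced back to its exact place in the source text, which is the property that distinguishes this kind of extraction from aggregate trend analysis.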
Key Outcomes
The project runs for 24 months (April 2014 to March 2016), with Year 1 dedicated to system development and data preparation and Year 2 focusing on data extraction and workshops. Within this timeframe, the project aims to produce three major outputs: (1) an open platform for the tagging and extraction of data from Chinese historical documents (MARKUS), (2) biographical data extracted from local gazetteers, and (3) workshops and supporting documentation.