From Documents to Data: A Framework for Total Corpus Quality

As large corpora of digitized text become increasingly available, sociologists are rediscovering the potential of text data for inquiries into social and cultural phenomena. While text data promise to enrich our knowledge of the social world, data quality remains a challenge.

Data quality problems are like a messy kitchen hidden behind a closed door while an impressive dinner is presented to the guests: researchers have largely ignored them even as the body of empirical work using text data to explore the social world continues to grow. Evaluating the quality of a corpus will therefore be pivotal for future social scientific work grounded in text data.

This paper proposes a conceptual framework for total corpus quality, incorporating three key dimensions: total corpus error, corpus comparability, and corpus reproducibility. All three dimensions affect the validity and reliability of inferences drawn from text data. The framework can be used to evaluate and improve studies based on large-scale text analyses.

Figure: The total corpus quality framework.

We apply the framework to a historical corpus covering Sweden’s four national newspapers from 1945 to 2019. We demonstrate how crucial corpus quality dimensions can be quantified and, importantly, discuss common scenarios in which quantification is impossible and in which trade-offs exist between the different dimensions of total corpus error.

Hurtado Bodell, M., Magnusson, M., & Mützel, S. (2022). From Documents to Data: A Framework for Total Corpus Quality. Socius, 8.