From Documents to Data: A Framework for Total Corpus Quality

As large corpora of digitized text become increasingly available, sociologists are rediscovering the potential of text data for inquiries into social and cultural phenomena. While text data promise to enrich our knowledge of the social world, data quality remains a challenge.

Yet even as the body of empirical work using text data to explore the social world continues to grow, researchers have largely ignored potential problems with data quality, much like a messy kitchen kept behind a closed door while an impressive dinner is served to the guests. Evaluating the quality of a corpus will therefore be pivotal for future social scientific work grounded in text data.

This paper proposes a conceptual framework for total corpus quality, incorporating three key dimensions: total corpus error, corpus comparability, and corpus reproducibility. The three dimensions impact the validity and reliability of inferences drawn from text data. The framework permits evaluating and improving studies based on large-scale text analyses.

[Figure: The total corpus quality framework.]

We apply the framework to a historical corpus covering Sweden’s four national newspapers between 1945 and 2019. We demonstrate how crucial corpus quality dimensions can be quantified and, importantly, we discuss common scenarios where quantification is impossible and where trade-offs exist between the different dimensions of total corpus error.

Read or download the article

Hurtado Bodell, M., Magnusson, M., & Mützel, S. (2022). From Documents to Data: A Framework for Total Corpus Quality. Socius, 8. https://doi.org/10.1177/23780231221135523

Researchers

Miriam Hurtado Bodell

Postdoc

Sophie Muetzel

Professor of Sociology

Research

Computational Text Analysis

Computational analysis at IAS uses large text corpora as social sensors to analyze public discourse, trace emerging narratives, and study how politics, media, and online publics shape shared understandings of societal events.

Organisation

The Institute for Analytical Sociology (IAS)

IAS conducts cutting-edge research on important social, political and cultural matters. The research is sociological in the original, broadly conceived sense of the term.