In an article for Sociological Methods & Research, Étienne Ollion, together with Rubing Shen and Salomé Do, shows how social scientists can leverage recent advances in natural language processing to annotate millions of texts automatically, yet finely. They also provide a package and a tutorial so that anyone can use these methods.

On two different tasks, not only did the model trained by the expert (the “Augmented Social Scientist”) perform at least as well as the micro-workers and the research assistants, but it did so after being trained on just a few hundred sentences.

The last decade witnessed a spectacular rise in the volume of available textual data. With this new abundance came the question of how social scientists can make the most of it. Textual analysis has a long tradition in the human and social sciences, but until recently no approach managed to combine the quality of human annotation with the power of statistical analysis.

Combining fine-grained analysis with the ability to annotate millions of texts is the promise of new methods that have attracted a lot of attention. One such method is sequential transfer learning, a class of deep learning models that has driven many innovations in the processing of data (language, images) since its introduction in 2018. But how accurate are these models? If they work, how long does it take to train them? And what role does expertise play?
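The intuition behind sequential transfer learning is to pretrain a model on an abundant source task, then reuse its learned representation and train only a light classification head on a few hundred labeled examples. The sketch below illustrates this two-step workflow with a toy scikit-learn example on synthetic data; it is only an analogy for intuition, not the BERT-style language models the paper actually uses, and all names and dimensions in it are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic world: both tasks depend on the same latent structure W,
# the way many text-annotation tasks depend on the same underlying language.
d, h = 30, 10
W = rng.normal(size=(d, h))   # shared latent features
U = rng.normal(size=(h, 8))   # rich source-task supervision
v = rng.normal(size=h)        # target-task direction

def latent(X):
    return np.maximum(0, X @ W)

# Step 1: pretrain on the abundant source task (5,000 unlabeled-cheap examples).
X_src = rng.normal(size=(5000, d))
mlp = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                   max_iter=1000, random_state=0)
mlp.fit(X_src, latent(X_src) @ U)

def features(X):
    # Frozen hidden-layer activations of the pretrained network:
    # the representation that gets transferred.
    return np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

# Step 2: train only a light head on a few hundred expert-labeled examples.
X_tr = rng.normal(size=(300, d))
y_tr = (latent(X_tr) @ v > 0).astype(int)
head = LogisticRegression(max_iter=1000).fit(features(X_tr), y_tr)

# Held-out evaluation on the target annotation task.
X_te = rng.normal(size=(1000, d))
y_te = (latent(X_te) @ v > 0).astype(int)
print(f"accuracy from 300 labeled examples: {head.score(features(X_te), y_te):.2f}")
```

Because the pretrained representation already encodes the shared structure, the head trained on only 300 labels performs well above chance on a task it never saw during pretraining, which is the mechanism that lets a few hundred annotated sentences go a long way.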

Building upon an experiment carried out on classic annotation tasks, the paper shows that:

  • It is now possible to massively annotate texts with a level of accuracy comparable to that of human coders. The models can also be trained to match any specific need, enabling researchers to craft a wide set of indicators that fit their research. This result stands in sharp contrast with previous methods of textual analysis that, despite their merits, did not manage to reach such a high level of precision.
  • While human coders quickly get tired of annotating, leading to coding errors, machines do not experience this classic “fatigue effect” and are more reliable. They can also annotate much more, and faster.
  • For harder tasks, quality trumps quantity and having an expert is better than multiplying annotations. For easier tasks, it is possible to outsource the annotation activity.

Read a summary on Twitter.

To make it easier for social scientists to use this novel method, we created a package and a short tutorial with nice examples. Check them out!

Read or download the article

Do, S., Ollion, E., & Shen, R. (2022, August 1). The Augmented Social Scientist: Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy. Sociological Methods & Research.
