Adapting text analysis tools to massive corpora

The growing size of many contemporary text corpora makes it increasingly difficult to apply standard techniques of computational text analysis. Two difficulties stand out: comparing Bayesian probabilistic models, and implementing sampling algorithms for the hierarchical Dirichlet process, a popular alternative to standard topic modeling. Måns Magnusson and colleagues address these challenges in two papers presented at influential natural language processing and machine learning conferences.

Image: iStock, koto_feja

The first methodological contribution scales leave-one-out cross-validation for model comparison to very large data.

For probabilistic inference, diagnosing model performance and comparing different models are crucial but often overlooked steps. Leave-one-out cross-validation has become an increasingly popular method for comparing Bayesian models with respect to each model's total expected log predictive density (ELPD). Unfortunately, the approach does not scale well to massive data. To make large-scale model comparison feasible, the authors propose subsampling, combining the difference estimator with approximations of the ELPD of individual observations obtained using truncated importance sampling and the delta method.
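The idea of a difference estimator can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes simple random sampling without replacement, the function name difference_estimator is hypothetical, and the "exact" and "approximate" pointwise values are synthetic numbers made up for illustration. A cheap approximation is summed over all observations, and the exact values on a small subsample are used only to correct the approximation error.

```python
import random


def difference_estimator(approx_all, exact_subsample):
    """Estimate a population total from cheap approximations plus a subsample.

    approx_all: cheap approximations of the pointwise values, one per observation.
    exact_subsample: dict mapping observation index -> exact pointwise value,
        for a simple random sample drawn without replacement (an assumption
        of this sketch).
    Returns (estimated total, estimated variance of that estimate).
    """
    n = len(approx_all)
    m = len(exact_subsample)
    # Approximation error on the subsampled observations.
    diffs = [exact_subsample[i] - approx_all[i] for i in exact_subsample]
    mean_diff = sum(diffs) / m
    # Sum of all approximations, corrected by the scaled-up mean error.
    estimate = sum(approx_all) + n * mean_diff
    if m > 1:
        s2 = sum((d - mean_diff) ** 2 for d in diffs) / (m - 1)
        # Standard variance formula for a total under sampling without replacement.
        variance = n ** 2 * (1 - m / n) * s2 / m
    else:
        variance = float("nan")
    return estimate, variance


# Synthetic illustration: all numbers below are invented for the example.
rng = random.Random(0)
n, m = 10_000, 100
exact = [-1.0 - rng.random() for _ in range(n)]    # stand-in for exact pointwise values
approx = [e + rng.gauss(0, 0.05) for e in exact]   # stand-in for cheap approximations
idx = rng.sample(range(n), m)
estimate, variance = difference_estimator(approx, {i: exact[i] for i in idx})
print(estimate, sum(exact), variance ** 0.5)
```

The better the cheap approximations track the exact values, the smaller the variance of the corrected total, which is why the quality of the approximation matters so much when only a small subsample can be evaluated exactly.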

The paper also studies the performance of different approximations of the ELPD and derives an estimator of each model's data uncertainty. Compared to alternative approaches (e.g., the Hansen-Hurwitz estimator), the difference estimator is better suited to the large-scale model comparison setting, as the authors show both theoretically and empirically. In sum, these results open up the possibility of using leave-one-out cross-validation for comparing models designed for large corpora of social text.

Magnusson, Måns, Michael Riis Andersen, Johan Jonasson, and Aki Vehtari. 2020. Leave-one-out cross-validation for model comparison in large data. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 341–351.


The second methodological contribution enhances the applicability of the hierarchical Dirichlet process (HDP) to very large corpora.

The HDP is a probabilistic topic model that uses a Bayesian non-parametric prior and models textual data with an unknown number of topics. Although the model has been a popular alternative to the standard Latent Dirichlet Allocation model, the sampling algorithms for the HDP have previously not scaled well to the large corpora used in computational social science. To scale these models, the inference algorithms must be parallelizable so that they can run on multicore computers or in distributed systems.
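To give a feel for how a non-parametric prior leaves the number of topics open, the following minimal Python sketch draws global topic weights from a truncated stick-breaking construction, the standard representation of the Dirichlet process prior underlying the HDP. It illustrates the prior only and is not the sampler discussed in this article; the function name stick_breaking, the concentration value gamma=2.0, and the truncation level of 50 are assumptions chosen for illustration.

```python
import random


def stick_breaking(gamma, truncation, rng):
    """Draw truncated stick-breaking weights with concentration gamma.

    Each step breaks off a Beta(1, gamma) fraction of the stick that is left.
    Returns the list of weights and the leftover mass beyond the truncation.
    """
    weights = []
    remaining = 1.0
    for _ in range(truncation):
        v = rng.betavariate(1.0, gamma)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(1)
weights, leftover = stick_breaking(gamma=2.0, truncation=50, rng=rng)
# Count how many topics receive non-negligible prior weight in this draw.
active = sum(1 for w in weights if w > 0.01)
print(f"{active} topics have weight above 0.01; leftover mass {leftover:.2e}")
```

The weights decay quickly, so only a handful of topics carry most of the mass in any one draw, yet no fixed number of topics is imposed in advance. Larger values of gamma spread the mass over more topics.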

This paper proposes a new parallel Markov Chain Monte Carlo sampler for the HDP topic model. The sampler is doubly sparse and data-parallel, which makes inference both fast and parallelizable. Måns Magnusson and colleagues show that the number of topics can be estimated in the PubMed corpus, with 8 million documents and 768 million tokens, on a single multicore machine in under four days. Both methodological contributions will enable broader use of Bayesian non-parametrics for large-scale computational text analysis.

Terenin, Alexander, Måns Magnusson, and Leif Jonsson. 2020. Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2925–2934.