The first methodological contribution scales leave-one-out cross-validation for model comparison to very large datasets.
For probabilistic inference, diagnosing model performance and comparing different models are crucial but often overlooked tasks. Leave-one-out cross-validation has become an increasingly popular method for comparing Bayesian models with respect to each model's total expected log predictive density (ELPD). Unfortunately, the approach does not scale well to massive datasets. The authors propose subsampling for large-scale model comparison, combining the difference estimator with approximations of each observation's ELPD obtained via truncated importance sampling and the delta method.
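The core idea of the subsampling approach can be illustrated with a minimal sketch: use a cheap per-observation ELPD approximation for all n data points, then correct it with accurate values computed on a small random subsample. The function and variable names below are hypothetical, and the variance formula is the standard one for simple random sampling without replacement, not necessarily the exact estimator derived in the paper.

```python
import numpy as np

def elpd_difference_estimator(pi_tilde, subsample_idx, pi_exact):
    """Difference estimator for the total ELPD.

    pi_tilde      : cheap approximate log predictive density for ALL n points
    subsample_idx : indices of the m points in the simple random subsample
    pi_exact      : accurate (e.g., importance-sampling) values for those m points
    """
    n = len(pi_tilde)
    m = len(subsample_idx)
    # Use the approximation everywhere, then correct with the average
    # approximation error observed on the subsample.
    diff = pi_exact - pi_tilde[subsample_idx]
    elpd_hat = np.sum(pi_tilde) + (n / m) * np.sum(diff)
    # Subsampling variance with finite-population correction (SRSWOR).
    var_hat = n**2 * (1 - m / n) * np.var(diff, ddof=1) / m
    return elpd_hat, var_hat

# Hypothetical usage with simulated values:
rng = np.random.default_rng(0)
n, m = 1_000_000, 1_000
pi_tilde = rng.normal(-1.0, 0.3, size=n)           # cheap approximations
idx = rng.choice(n, size=m, replace=False)         # random subsample
pi_exact = pi_tilde[idx] + rng.normal(0, 0.05, m)  # accurate values on subsample
print(elpd_difference_estimator(pi_tilde, idx, pi_exact))
```

The attraction of this construction is that the expensive computation is confined to the m subsampled observations, while the estimator remains unbiased for the total ELPD as long as the subsample is drawn at random.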
The paper also studies the performance of different ELPD approximations and derives an estimator of each model's data uncertainty. Compared to alternative approaches such as the Hansen-Hurwitz estimator, the difference estimator is better suited to the large-scale model comparison setting, as the authors show both theoretically and empirically. In sum, these results open up the possibility of using leave-one-out cross-validation to compare models designed for large corpora of social text.
The second methodological contribution enhances the applicability of the hierarchical Dirichlet process (HDP) to very large corpora.
The HDP is a probabilistic topic model that uses a Bayesian non-parametric prior to model textual data with an unknown number of topics. Although the model has been a popular alternative to the standard Latent Dirichlet Allocation (LDA) model, sampling algorithms for the HDP have previously not scaled to the large corpora used in computational social science. To scale these models, the inference algorithms must be parallelizable so they can run on multicore computers or in distributed systems.
This paper proposes a new parallel Markov chain Monte Carlo sampler for the HDP topic model. The sampler is both doubly sparse and data-parallel, making inference fast and parallelizable across documents. Måns Magnusson and colleagues show that the number of topics can be estimated for the PubMed corpus, with 8 million documents and 768 million tokens, on a single multicore machine in under four days. Both methodological contributions will enable broader use of Bayesian non-parametrics for large-scale computational text analysis.
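The data-parallel structure can be sketched at a high level: conditional on the global topic distributions, the topic indicators of different documents are independent, so each document can be resampled on a separate core. The sketch below shows this structure for a generic partially collapsed topic sampler; it omits the HDP's non-parametric machinery and the doubly sparse data structures of the actual sampler, and all names are hypothetical.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def sample_document(args):
    """Resample topic indicators for one document, holding the global
    topic-word distributions phi fixed. Because phi is fixed, this step
    is embarrassingly parallel across documents."""
    tokens, z, phi, alpha = args
    n_topics = phi.shape[0]
    doc_topic = np.bincount(z, minlength=n_topics).astype(float)
    rng = np.random.default_rng()
    for i, w in enumerate(tokens):
        doc_topic[z[i]] -= 1
        # Conditional posterior over topics for token w.
        p = (doc_topic + alpha) * phi[:, w]
        z[i] = rng.choice(n_topics, p=p / p.sum())
        doc_topic[z[i]] += 1
    return z

def parallel_sweep(docs, z_list, phi, alpha, workers=8):
    """One data-parallel Gibbs sweep: documents are distributed over
    worker processes, then the updated indicators are collected."""
    jobs = [(docs[d], z_list[d], phi, alpha) for d in range(len(docs))]
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(sample_document, jobs))

def rebuild_counts(docs, z_list, n_topics, vocab_size):
    """After a sweep, rebuild the global topic-word counts so the
    topic distributions phi can be resampled (the global, non-parallel step)."""
    counts = np.zeros((n_topics, vocab_size))
    for tokens, z in zip(docs, z_list):
        np.add.at(counts, (z, tokens), 1.0)
    return counts
```

The design choice worth noting is the partially collapsed structure: sampling topic indicators given explicit topic distributions decouples the documents, whereas a fully collapsed sampler would couple them through shared counts and resist parallelization. The published sampler builds on this kind of decoupling while additionally exploiting sparsity in both the document-topic and topic-word counts.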