Sunday, September 23, 2007
Statistical methods of text analysis have become increasingly sophisticated over the years. A good example is automated topic analysis using latent models, two variants of which are Probabilistic latent semantic analysis and Latent Dirichlet Allocation.
Earlier this year, Amit Gruber, a Ph.D. student at the Hebrew University of Jerusalem, presented a technique for analyzing the topical content of text at the Eleventh International Conference on Artificial Intelligence and Statistics in Puerto Rico.
Gruber's approach, dubbed Hidden Topic Markov Models (HTMM), was developed in collaboration with Michal Rosen-Zvi and Yair Weiss. It differs notably from others in that, rather than treat each document as a single "bag of words," it imposes a temporal Markov structure on the document. In this way, it is able to account for shifting topics within a document, and in so doing, provides a topic segmentation within the document, and also seems to effectively distinguish among multiple senses that the same word may have in different contexts within the same document.
Amit is currently a doing graduate internship at Google. As part of his project, he has developed a fresh implementation of his method in C++. We are pleased to release it as the OpenHTMM package to the research community under the Apache 2 license, in the hopes that it will be of general interest and facilitate further research in this area.