Review - Automatic Discovery of Language Models for Text Databases.

Laura M. Haas: Review - Automatic Discovery of Language Models for Text Databases. ACM SIGMOD Digital Review 2: (2000)

Review

This paper looks at how to construct a model of the types of information held in a text database, so that queries can be directed to the appropriate databases. Previous work has largely assumed that the owners of the text database provide the model (typically a list of words or indexing terms and associated frequency information). However, this assumes that the owners will cooperate, will not willfully misrepresent their data, and will provide comparable information across sources (the many different IR techniques in use affect the choice of indexing terms, etc.).

Thus, the authors propose to build a language model by "sampling" the text source. They repeatedly issue single-term queries to the source, retrieve the top N documents, and use them to construct (and then refine) a language model for the source. The authors describe a number of experiments that examine how accurate the resulting model is and how quickly it is built, and conclude that query-based sampling does build reasonably accurate language models while looking at only a few hundred documents (acquired via about 100 queries).
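To make the sampling loop concrete, here is a minimal sketch in Python. It is not the authors' implementation. The `ToyTextDatabase` class, the `sample_language_model` function, and all parameter names are hypothetical, and the sketch rests on two assumptions that this review does not confirm: that each new query term is drawn at random from the model learned so far, and that the model records document frequencies (the number of sampled documents containing each term).

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would also stem and drop stopwords."""
    return text.lower().split()


class ToyTextDatabase:
    """Hypothetical stand-in for an uncooperative source: it only answers queries."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = [tokenize(d) for d in documents]

    def search(self, term, top_n):
        # Rank documents by how often they contain the term; return the top N ids.
        hits = [(tokens.count(term), doc_id)
                for doc_id, tokens in enumerate(self._tokens)
                if term in tokens]
        hits.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in hits[:top_n]]

    def fetch(self, doc_id):
        return self.documents[doc_id]


def sample_language_model(db, seed_term, top_n=4, max_queries=100,
                          max_docs=300, rng=None):
    """Build a term -> document-frequency model using only the query interface."""
    rng = rng or random.Random(0)
    model = Counter()    # term -> number of sampled documents containing it
    seen_docs = set()    # ids of documents already folded into the model
    used_terms = set()   # terms already issued as queries
    term = seed_term
    for _ in range(max_queries):
        used_terms.add(term)
        for doc_id in db.search(term, top_n):
            if doc_id not in seen_docs:
                seen_docs.add(doc_id)
                model.update(set(tokenize(db.fetch(doc_id))))
        if len(seen_docs) >= max_docs:
            break
        # Assumption: the next query term comes from the model learned so far.
        candidates = sorted(set(model) - used_terms)
        if not candidates:
            break
        term = rng.choice(candidates)
    return model, seen_docs


if __name__ == "__main__":
    db = ToyTextDatabase(["apple banana", "banana cherry", "cherry date", "zebra"])
    model, seen = sample_language_model(db, seed_term="apple")
    print(sorted(model.items()), sorted(seen))
```

In this toy run the document containing only "zebra" is never reached, because no sampled document mentions that term. The learned frequencies also describe the sample rather than the whole source, which is the estimation difficulty this review goes on to discuss.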

The paper is well-written and easy to read, and covers interesting new ground. The authors make a good case that query-based sampling offers important potential benefits. However, their experimental results did not convince me that sampling delivers "reasonably accurate" models. The authors admit that it is hard for sampling to estimate the frequency information for a language model well, and they do not know (as, in their defense, they state) how important that information is for accurate database selection. Thus, this paper could be either a rather academic study (should the frequency information prove critical and unobtainable without cooperation) or a very important work (should it some day be shown that frequency information is relatively unimportant, or should a way be found to estimate it via sampling). In any case, it is an interesting direction for research.

References

[1]: James P. Callan, Margaret E. Connell, Aiqun Du: Automatic Discovery of Language Models for Text Databases. SIGMOD Conference 1999: 479-490