Estimation of N-gram Probabilities

Lynellen D.S.P. Smith
Submitted to Dr. Lois Boggess
CS7003: Independent Study in Natural Language Processing
Second Five Week Term, Summer 1994


Introduction

N-grams, or collocations, play a significant role in computational linguistics. A collocation is a "recurrent combination of words that co-occur more often than expected by chance and that correspond to arbitrary word usages" [Smadja, 143], and collocations are typically found by computing the probabilities of bigrams and trigrams. The literature uses the terms "n-gram" and "collocation" almost interchangeably, and so will this paper. N-grams are useful for several tasks:

1) part of speech tagging [Church, Liberman];
2) finding collocations of technical terms for use in machine translation [Daille, Su];
3) constraining language models, aiding disambiguation, retrieving texts from large databases, and compiling lexico-syntactic facts for identifying normal and conventional uses of a given word [Church];
4) aiding disambiguation of prepositional phrase attachment [Hindle];
5) estimating the probability of previously unseen collocations based on "similar" words [Dagan].

Properties of Collocations

Smadja presents an excellent tutorial on collocations at the beginning of his paper, and his points are summarized here. He identifies four properties of collocations that are important to remember in computational linguistics. Collocations are:

1) "Arbitrary": Non-native speakers of a language will have difficulty learning and spotting collocations because they are not word-for-word translations of an idea in their own language.
2) "Domain-dependent": Technical words, jargon, and even previously known words being used in a different sense can be difficult for non-technical people to understand.
3) "Recurrent": "These combinations are not exceptions, but rather that they are very often repeated in a given context" [Smadja, 147].
4) "Cohesive lexical clusters": The words of the collocation belong together (within a certain context) in the mind of the native speaker. "The presence of one or several words of the collocations often implies or suggests the rest of the collocation" [Smadja, 147]. In other words, if a native speaker is given the phrase "in _____ words", they will very likely fill in the blank with the word "other". In terms of computational linguistics, this means that the probability of seeing the bigram "other words" is significantly larger than the probability of "other" multiplied by the probability of "words" (a small numeric sketch of this comparison appears after the next section).

Types of Collocations

Smadja also identifies three types of collocations. The most flexible, yet hardest to identify, type of collocation is the predictive relation. These are made of "two words repeatedly used together as a similar syntactic relation" [Smadja, 184]. The most rigid type of collocation is the rigid noun phrase--a sequence of words that cannot be broken or interrupted without losing the meaning of the phrase. Phrasal templates are long collocations, usually highly domain dependent. A phrasal template may or may not have empty slots; if it does, the template specifies the part of speech of the word that fills each slot. Unlike predictive relations and rigid noun phrases, phrasal templates are useful for language generation [Smadja, 149].
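To make the "cohesive lexical clusters" property concrete, the short Python sketch below compares the observed probability of the bigram "other words" with the probability that would be expected if its two words occurred independently. The counts are invented purely for illustration; they are not drawn from any corpus discussed in the papers reviewed here.

    # Invented counts, assumed only for illustration.
    corpus_size = 1000000          # total number of word tokens in the corpus
    count_other = 1500             # occurrences of "other"
    count_words = 1200             # occurrences of "words"
    count_other_words = 800        # occurrences of the bigram "other words"

    p_other = count_other / corpus_size
    p_words = count_words / corpus_size
    p_bigram = count_other_words / corpus_size

    # If the two words were independent, p("other words") would be roughly
    # p("other") * p("words").
    expected_if_independent = p_other * p_words

    print(p_bigram)                # 0.0008
    print(expected_if_independent) # 1.8e-06

The observed bigram probability is several hundred times larger than the independence estimate, which is exactly the pattern that marks a collocation.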
Finding Collocations

N-grams are computed by what at first appears to be a fairly simple process. First, the researcher must decide what type of collocation she is interested in. Sequences of words are not as meaningful statistically as words that are grammatically related [Joshi]. Examining grammatically related words means looking at verbs and their objects, nouns and their adjectives, and so on. While this will produce collocations that are more significant and interesting, it means that the corpus must be annotated with part of speech information, either automatically or manually. Either approach to tagging requires a fair bit of effort and has a certain error rate attached. In addition, using part of speech information makes the algorithm for finding target words more complicated than finding a simple sequence of words.

Next, all the instances of the target words are counted, and these counts are converted into frequencies by dividing each count by the total number of words in the corpus from which the counts were taken [Liberman]. Here is where the trouble comes. Most of the interesting collocations have a low frequency, so either a large corpus must be used to get an accurate measure of their frequency, or the researcher must estimate the frequency using statistical distributions. Having a large corpus may be impractical or impossible. Yet estimating may be just as impractical if the results are not valid because of assumptions made about the distribution of the words in question. We will return to this point later.

Filtering Collocations

Once a list of candidate collocations has been compiled, it must be screened to filter out false collocations, or even ones that are simply not of interest to the researcher. The list is likely to be large and to have a "tail"--a long list of candidates with a count of one, as predicted by Zipf's law that "frequency is roughly proportional to inverse rank" [Church]. Many methods for filtering exist. Some are better than others, but none has been explored and verified to the point of being a standard in the field. These methods include simple frequencies (as described above) [Hindle], relative frequencies, Shannon diversity, contingency tables [Daille], mutual information [Church], likelihood ratios, z-scores, and Pearson's chi-squared test [Dunning]. Several of these methods are described below.

Contingency Tables

Contingency tables are used to compare counts of n-grams to determine whether an n-gram is a true collocation, meaning that the words that make up the n-gram are highly associated in the text. Below is the table for a bigram "AB"; the counts of bigrams that appear in the corpus go into the contingency table in the following way.

        -------------------------------
        |  k(A B)     |  k(~A B)      |
        -------------------------------
        |  k(A ~B)    |  k(~A ~B)     |
        -------------------------------

k(A B) is the count of the bigram with the first word being A and the second word being B. k(~A B) is the count of bigrams whose second word is B but whose first word is not A, and similarly for k(A ~B) and k(~A ~B). If A and B are not highly associated, then "we would expect p(AB) = p(A)p(B), where p(AB) is the probability of A and B occurring in sequence, p(A) is the probability of A appearing in the first position, and p(B) is the probability of B appearing in the second position" [Dunning, 70]. This method can be extended to n-grams of any length.
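The sketch below shows one straightforward way to fill such a table for a chosen bigram (A, B) from a tokenized corpus. The toy token list and the target words are assumed only for illustration; this is a sketch of the counting step, not a reproduction of any particular author's program.

    # Toy corpus, assumed purely for illustration.
    tokens = "the other words in other words are not the same words".split()
    A, B = "other", "words"

    k_ab = k_anb = k_nab = k_nanb = 0

    # Classify every adjacent pair of tokens into one of the four cells.
    for first, second in zip(tokens, tokens[1:]):
        if first == A and second == B:
            k_ab += 1      # k(A B)
        elif first == A:
            k_anb += 1     # k(A ~B)
        elif second == B:
            k_nab += 1     # k(~A B)
        else:
            k_nanb += 1    # k(~A ~B)

    total = k_ab + k_anb + k_nab + k_nanb     # total number of bigram tokens
    # Under independence we would expect roughly
    #   k_ab / total  ~  ((k_ab + k_anb) / total) * ((k_ab + k_nab) / total)
    print(k_ab, k_anb, k_nab, k_nanb)

The same four cell counts feed directly into the chi-squared and likelihood ratio tests mentioned above.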
Mutual Information

Church and Hanks give a good treatment of the mutual information method. Mutual information compares the probability of two words occurring together with the probability of the words occurring separately (independently). The mutual information of A and B is given by

        I(A,B) = log2 [ P(A,B) / (P(A) P(B)) ]

where log2 is the logarithm base 2. If A and B are highly associated, then I(A,B) will be much greater than zero; Church and Hanks suggest that the threshold I(A,B) > 3 produces the best set of interesting bigrams. If there is no interesting relationship between the two words, the mutual information will be approximately zero. If A and B are in complementary distribution, then I(A,B) will be much less than zero, but researchers hardly ever see this because the corpora are "small" (meaning one million words or fewer, according to Church and Hanks).

In statistics, joint probabilities should be symmetric: P(A,B) should equal P(B,A), and thus I(A,B) = I(B,A) should hold. In real corpora this symmetry does not hold, meaning that we tend to associate words in a particular order. Church and Hanks give the example that we associate doctor->nurse and bread->butter, but not butter->bread. This is similar to the Kraft TV commercial for "cheese and macaroni" versus "macaroni and cheese".

Similarity-based Estimation

Dagan et al. propose that one does not necessarily need the actual count of occurrences of a particular N-gram. They suggest that the probability of a previously unseen collocation can be estimated from the probabilities of other, "similar" words which have been seen. Suppose the following situation: A is seen with B. B has been seen before, so its probability is known, and B is called the "conditioning word". We do not know the probability of A, which is called the "conditioned word". Traditionally, P(A|B) would have to be estimated in this case by estimating the frequency of A in the corpus, which assumes the independence of A and B. Class-based and similarity-based N-gram models do not need to assume independence, and Dagan et al. describe both.

In a class-based N-gram model, known words with similar cooccurrence distributions are clustered into predetermined word classes. The probability of a previously unknown collocation is then estimated by the average cooccurrence probability of the two word classes to which the words belong. The similarity-based model described by Dagan et al. is an update of an earlier similarity model in which each word has its own class of words that are most similar to it. The old version of the model could predict which unseen collocations were more likely than others, but could not provide a probability estimate for a collocation, so it could not be used in an N-gram language model. Dagan et al., however, have developed a way to estimate the probability of the collocation. First, they show that the unmodified Maximum Likelihood Estimator (MLE) for bigrams will not work in a similarity-based model, because the MLE assigns a zero probability to unseen bigrams. The MLE is computed as

        Pmle(B,A) = c(B,A) / N

where c(B,A) is the frequency count of the bigram in question and N is the total number of bigrams in the training corpus. However, the MLE can be adjusted by interpolation or discounting so that the probabilities of the seen bigrams sum to less than one, leaving some probability mass for unseen events. The probability mass thus freed has to be distributed over the unseen events. Dagan et al. use a variant of Katz's back-off model to redistribute the probability mass non-uniformly over unseen collocations, in proportion to the frequency of A.
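As a rough illustration of the discounting idea described above, the toy sketch below reserves part of the probability mass of the seen bigrams and spreads it over unseen bigrams in proportion to the unigram frequency of the conditioned word (here taken to be the second word of each pair). The counts and the fixed discount value are assumed only for the example; this is not Dagan et al.'s similarity-based estimator, nor Katz's actual back-off scheme.

    from collections import Counter

    # Toy counts, assumed only for illustration.
    bigram_counts = Counter({("in", "other"): 4, ("other", "words"): 3,
                             ("other", "hand"): 1})
    unigram_counts = Counter({"in": 5, "other": 4, "words": 3, "hand": 1,
                              "terms": 2})
    N = sum(bigram_counts.values())       # total number of observed bigrams
    discount = 0.5                        # fixed absolute discount (assumed)

    # Probability mass freed by discounting every seen bigram.
    freed_mass = discount * len(bigram_counts) / N
    total_unigrams = sum(unigram_counts.values())

    def p_bigram(pair):
        if pair in bigram_counts:
            # Seen bigram: discounted relative frequency.
            return (bigram_counts[pair] - discount) / N
        # Unseen bigram: a share of the freed mass, proportional to the
        # frequency of the conditioned (second) word.  A full model would
        # normalize this over the set of unseen bigrams only; that step is
        # omitted here for brevity.
        return freed_mass * unigram_counts[pair[1]] / total_unigrams

    print(p_bigram(("other", "words")))   # seen: (3 - 0.5) / 8
    print(p_bigram(("in", "terms")))      # unseen: share of the freed mass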
Problems in Finding Collocations

Despite their usefulness, there are problems with collecting and using N-grams. Dunning argues that care must be taken in the statistical analysis of potential N-grams so that researchers do not overestimate the probability of an N-gram. According to Dunning, there are three groups of computational linguistics researchers: those who

"1. Collect enormous volumes of text in order to make straightforward, statistically based measures work well.
2. Do simple-minded statistical analysis on relatively small volumes of text and either 'correct empirically' for the error or ignore the issue.
3. Perform no statistical analysis whatsoever" [Dunning, 60].

Many of the methods discussed above probably fall into his second category, and he says, "Many of the results from their work are entirely usable, and the measures they use work well for the examples given in their papers. In general, though, their methods lead to problems. For example, mutual information estimates based directly on counts are subject to overestimation when the counts involved are small, and z-scores substantially overestimate the significance of rare events" [Dunning, 62]. The main problem, he argues, is that these measures are based on the assumption of a normal or approximately normal distribution of the variables being sampled. While this assumption is valid in most instances, it is not valid when comparing the rates of occurrence of rare events, and texts are composed mostly of rare events [Dunning, 62].

Instead of the normal distribution, Dunning suggests that the binomial distribution be used. Binomial distributions are appropriate when the "data to be analyzed are derived by counting the number of positive outcomes of repeated identical and independent experiments", such as counting the times that each word in a corpus is identical to the word being counted [Dunning, 64]. The binomial distribution can be approximated by the normal distribution for common events, but not for rare events: the normal distribution overestimates the probabilities of rare events in a way that is not constant and thus cannot simply be corrected for in later calculations. Because of this, Dunning proposes a likelihood ratio test that does not rely on the assumption of a normal distribution, with the added benefit that much smaller corpora are needed for testing. Dunning also presents ways to use the likelihood ratio test with multinomial distributions, and then suggests research into how the Poisson distribution and some distribution-free methods of analysis (such as Fisher's exact method) perform on N-gram calculations.

One method that attempts to find precise N-gram probabilities is discussed by Stolcke and Segal. They present an algorithm for computing N-gram probabilities from stochastic context-free grammars (SCFGs), which are "a set of phrase-structure rules, annotated with probabilities of choosing a certain production given the left-hand side nonterminal" [Stolcke]. The probability of any N-gram can be computed by finding substring probabilities, expressed as two systems of linear equations that can be solved.
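As one concrete reading of the likelihood ratio test that Dunning advocates, the sketch below computes the log-likelihood ratio statistic (often written G-squared) for a 2x2 bigram contingency table of the kind shown earlier. The cell counts in the example are invented for illustration, and the code is a sketch of the standard statistic rather than a reproduction of Dunning's derivation.

    import math

    def log_likelihood_ratio(k_ab, k_nab, k_anb, k_nanb):
        """G-squared statistic for a 2x2 contingency table, laid out as in the
        table shown earlier: row 1 holds k(A B) and k(~A B), row 2 holds
        k(A ~B) and k(~A ~B).  Large values indicate that the observed counts
        are unlikely if the two words occur independently."""
        table = [[k_ab, k_nab], [k_anb, k_nanb]]
        total = sum(sum(row) for row in table)
        row_sums = [sum(row) for row in table]
        col_sums = [table[0][j] + table[1][j] for j in range(2)]
        g2 = 0.0
        for i in range(2):
            for j in range(2):
                observed = table[i][j]
                expected = row_sums[i] * col_sums[j] / total
                if observed > 0:          # the term 0 * log(0) is taken as 0
                    g2 += observed * math.log(observed / expected)
        return 2.0 * g2

    # Invented counts: the bigram occurs 8 times among 10,000 bigrams, A appears
    # as a first word 20 times, and B appears as a second word 30 times.
    print(log_likelihood_ratio(8, 22, 12, 9958))

Unlike z-scores based on the normal approximation, this statistic remains well behaved when the bigram count is small, which is Dunning's central point.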
Conclusion

This paper has reviewed some of the research on the computation of N-gram probabilities, which are useful for several computational linguistics tasks, including finding interesting collocations for use in machine translation and the automatic tagging of text. Many of the methods used to screen the large lists of candidate collocations rest on the assumption of a normal distribution, which does not accurately estimate the probabilities of rare events. This means that the probabilities of the most interesting events (the rare events) are overestimated. Dunning has suggested several other statistical distributions that should be explored instead of the normal distribution, and has also suggested that distribution-free methods be researched.

REFERENCES

Church, Kenneth. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. Proceedings of the Second Conference on Applied Natural Language Processing, Association for Computational Linguistics.

Church, Kenneth, and Patrick Hanks. 1989. Word Association Norms, Mutual Information, and Lexicography. Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics.

Dagan, Ido, Fernando Pereira, and Lillian Lee. 1994. Similarity-based Estimation of Word Cooccurrence Probabilities. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 272-277.

Daille, B., E. Gaussier, and J.M. Lange. Towards Automatic Extraction of Monolingual and Bilingual Terminology.

Dunning, Ted. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19:61-74.

Hindle, Donald, and Mats Rooth. 1991. Structural Ambiguity and Lexical Relations. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics.

Joshi, Aravind K. 1994. Some Recent Trends in Natural Language Processing. Linguistica Computazionale 9-10:491-501.

Liberman, Mark, and Mitch Marcus. 1992. Very Large Text Corpora: What You Can Do With Them, and How To Do It. Tutorial at the 31st Annual Meeting of the Association for Computational Linguistics.

Pereira, Fernando, Naftali Tishby, and Lillian Lee. 1993. Distributional Clustering of English Words. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 183-190.

Smadja, Frank. 1993. Retrieving Collocations from Text: Xtract. Computational Linguistics 19:143-175.

Stolcke, Andreas, and Jonathan Segal. 1994. Precise N-gram Probabilities from Stochastic Context-free Grammars. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 74-79.

Su, Keh-Yih, Ming-Wen Wu, and Jing-Shin Chang. 1994. A Corpus-based Approach to Automatic Compound Extraction. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 242-247.