Estimation of N-gram Probabilities














                           Lynellen D.S.P. Smith
                       Submitted to Dr. Lois Boggess
         CS7003: Independent Study in Natural Language Processing
                    Second Five Week Term, Summer 1994


                               Introduction

     N-grams, or collocations, play a significant role in
computational linguistics.  A collocation is a "recurrent
combination of words that co-occur more often than expected by
chance and that correspond to arbitrary word usages" [Smadja, 143]
and they typically are found by computing the probability of
bigrams and trigrams.  The literature uses the terms "n-gram" and
"collocation" almost interchangeably, and so will this paper.

     N-grams are useful for several tasks:
     1) part of speech tagging [Church, Liberman]
     2) finding collocations of technical terms for use in machine
translation [Daille, Su]
     3) constraining language models, aiding disambiguation,
retrieving texts from large databases, and compiling lexico-
syntactic facts for identifying normal and conventional uses of a
given word [Church]
     4) aiding disambiguation of prepositional phrase attachment
[Hindle}
     5) estimating the probability of previously unseen
collocations based on "similar" words [Dagan].


                        Properties of Collocations

     Smadja presents an excellent tutorial on collocations at the
beginning of his paper, and his points are summarized here.  He
identifies four properties of collocations that are important to
remember in computational linguistics.  Collocations are
     1) "arbitrary": Non-native speakers of a language will have
difficulty learning and spotting collocations because they are not
word-for-word translations of an idea in their own language.
     2) "Domain-dependent": Technical words, jargon, and even
previously known words being used in a different sense can be
difficult for non-technical people to understand.
     3) "Recurrent": "These combinations are not exceptions, but
rather that they are very often repeated in a given context"
[Smadja, 147].
     4) "Cohesive lexical clusters": This means that the words of
the collocation will belong together (within a certain context) in
the mind of the native speaker.  "The presence of one or several
words of the collocations often implies or suggests the rest of the
collocation" [Smadja, 147].  In other words, if a native speaker is
given the phrase "in _____ words", they will very likely fill in
the blank with the word "other".  In terms of computational
linguistics, this means that the probability of seeing the bigram
"other words" is significantly larger than the probability of
"other" multiplied by the probability of "words".

                           Types of Collocations
     Smadja also identifies three types of collocations.  The most
flexible, yet hardest to identify type of collocation is the
predictive relation.  These are made of "two words repeatedly used
together as a similar syntactic relation" [Smadja, 184].  The most
rigid type of collocation is the rigid noun phrase--a sequence of
words that cannot be broken or interrupted without losing the
meaning of the phrase.  Phrasal templates are long collocations,
usually highly domain dependent.  A phrasal template may or may not
have any empty slots.  If there are empty slots, the template will
specify the part of speech for the word that fills that slot. 
Phrasal templates are useful for language generation, unlike
predictive relations and rigid noun phrases [Smadja, 149].

                           Finding Collocations
     N-grams are computed by what at first appears to be a fairly
simple process.  First, the researcher must decide what type of
collocation she is interested in.  Sequences of words are not as
meaningful statistically as words that are grammatically related
[Joshi].  Examining grammatically related words means looking at
verbs and their objects, nouns and their adjectives, etc.

     While this will produce collocations that are more significant
and interesting, it means that the corpus must be annotated with
part of speech information either automatically or manually. 
Either approach to tagging requires a fair bit of effort and has a
certain error rate attached.  In addition, using part of speech
information will make the algorithm for finding target words more
complicated than finding a simple sequence of words.

     Next, all the instances of target words are counted up and
these counts are changed into frequency percentages by dividing the
count by the total number of words in the corpus from which the
counts were taken [Liberman].  Here is where the trouble comes. 
Most of the interesting collocations have a low frequency, so
either a large corpus must be used to get an accurate measure of
their frequency or the researcher must estimate the frequency by
using statistical distributions.  Having a large corpus may be
impractical or impossible.  Yet estimating may be just as
impractical if the results are not valid due to assumptions made
about the distribution of the words in question.  We will return to
this point later.

                          Filtering Collocations
     Once a list of candidate collocations has been compiled, they
must be screened to filter out false collocations or even ones that
are just not of interest to the researcher.  The list is likely to
be large and have a "tail"--a long list of candidates with a count
of one, as asserted in Zipf's law that "frequency is roughly
proportional to inverse rank" [Church].

     Many methods for filtering exist.  Some are better than
others, but none have been explored and verified to the point of
being a standard in the field.  Some of these methods include:
simple frequencies (as described above) [Hindle], relative
frequencies, Shannon diversity, contingency tables [Daille], mutual
information [Church],  2, likelihood ratios, z-score, and Pearson's
Chi-squared test [Dunning].  Several of these methods are described
below.

                            Contingency Tables
     Contingency tables are used to compare counts of n-grams to
determine if the n-gram is a true collocation, meaning that the
words that make up the n-gram are highly associated in the text. 
Below is a table for a bigram "AB".  The counts of bigrams that
appear in the corpus go into the contingency table in the following
way.
     -------------------------------------------------
     |   k(A B)               |     k(~A B)           |
     -------------------------------------------------
     |    k(A ~B)             |    k(~A ~B)           |
     -------------------------------------------------
k(A B) is the count of the bigram with the first word being A and
the second word being B.  k(~A B) is the count of bigrams with the
second word B, but the first word is not A.  Similarly with k(A ~B)
and k(~A ~B).  If A and B are not highly associated, then "we would
expect p(AB) = p(A)p(B) where p(AB) is the probability of A and B
occurring in sequence, p(A is the probability of A appearing in the
first position, and p(B) is the probability of B appearing in the
second position" [Dunning, 70].  This method can be extended to n-
grams of any length.

                            Mutual Information
     Church and Hanks give a good treatment of the mutual
information method.  Mutual information compares the probability of
two words occurring together with the probability of the words
occurring separately (independently).  The mutual information of A
and B is given by 
     I(A,B) = log2[p(A,B)/p(A)p(B)]  where log2 is log base 2.

     If A and B are highly associated, then I(A,B) will be much
greater than zero.  Church and Hanks suggest that I(A,B)>3 produces
the best set of interesting bigrams.  If there is no interesting
relationship between the two words, then the mutual information
will be approximately zero.  If A and B are in a complementary
distribution, then I(A,B) will be much less than zero, but
researchers hardly ever see this because the corpora are "small"
(meaning one million words or less, according to Church and Hanks).

     In statistics, joint probabilities should be symmetric. 
P(A,B) should equal P(B,A) and thus I(A,B) = I(B,A) should hold. 
Real corpora prove this symmetry does not hold, meaning that we
tend to associate words in a particular order.  Church and Hanks
give the example that we associate doctor->nurse, bread->butter,
but not butter->bread.  This is similar to the Kraft TV commercial
for "cheese and macaroni" versus "macaroni and cheese".

                        Similarity-based Estimation
     Dagan et al. propose that one does not necessarily need the
actual count of occurrence of a particular N-gram.  They suggest
that the probability of a previously unseen collocation can be
estimated based on the probability of other "similar" words which
have been seen.  Suppose the following situation: A is seen with B. 
B has been seen before, so that its probability is known, and B is
called the "conditioning word".  We do not know the probability of
A, and it is called the "conditioned word".  In the past, P(A|B)
would have to be estimated in this case, by estimating the
frequency of A in the corpus.  This would assume the independence
of A and B.

     Class-based and similarity-based N-gram models don't need to
assume independence though.  Dagan et al. describe both of these
models.  In a class-based N-gram model, known words with similar
cooccurrence distributions are clustered into predetermined word
classes.  The probability of a previously unknown collocation is
then estimated by the average of the cooccurrence probability of
the two word classes to which these words belong.

     The similarity-based model described by Dagan et al. is an
update of the similarity model.  Each word has its own class of
words that are most similar to it.  The old version of the model
could predict which unseen collocations were more likely than
others, but could not provide a probability estimate for the
collocation.  Therefore, the old version of this model could not be
used in an N-gram language model.  However, Dagan et al. have
developed a way to estimate the probability of the collocation. 

     First, they show that the un-modified Maximum Likelihood
Estimator (MLE) for bigrams will not work in a similarity-based
model because the MLE predicts a zero probability for unseen
bigrams.  MLE is computed by
     Pmle(B,A) = c(B,A)/N
where c(B,A) is the frequency count for the bigram in question, and
N is the total number of bigrams in the candidate list.  However,
MLE can be adjusted by interpolation or discounting such that Pmle
< 1 in order to leave some probability mass for unseen events.  The
probability mass thus freed has to be distributed over the unseen
events.  Dagan et al. use a variant of Katz's Back-Off model to
redistribute the probability mass non-uniformly over unseen
collocations in proportion to the frequency of A.

                     Problems in Finding Collocations
     Despite their usefulness, there are problems with collecting
and using N-grams.  Dunning argues that care must be taken in
statistically analyzing potential N-grams so that researchers don't
overestimate the probability of the N-gram.  There are three groups
of computational linguistics researchers according to Dunning. 
Those who
     "1.  Collect enormous volumes of text in order to make
straightforward, statistically based measures work well.
     2.  Do simple-minded statistical analysis on relatively small
volumes of text and either 'correct empirically' for the error or
ignore the issue.
     3.  Perform no statistical analysis whatsoever" [Dunning, 60].

     Many of the methods discussed above probably fall into his
second category, and he says,
     "Many of the results from their work are entirely usable, and
     the measures they use work well for the examples given in
     their papers.  In general, though, their methods lead to
     problems.  For example, mutual information estimates based
     directly on counts are subject to overestimation when the
     counts involved are small, and z-scores substantially
     overestimate the significance of rare events" [Dunning, 62].

     The main problem, he argues, is that these measures are based
on the assumption of a normal or approximately normal distribution
of the variables being sampled.  While this assumption is valid in
most instances, it is not valid when comparing the rates of
occurrence of rare events, and texts are composed mostly of rare
events [Dunning,62].

     Instead of using the normal distribution, Dunning suggests
that the binomial distribution be used.  Binomial distributions are
good when the "data to be analyzed are derived by counting the
number of positive outcomes of repeated identical and independent
experiment" such as counting the times that each word in a corpus
is identical to the word being counted [Dunning, 64].

     The binomial distribution can be approximated by the normal
distribution for common events, but not for rare events.  The
normal distribution will overestimate probabilities for rare events
in a way that is not constant and thus cannot simply be corrected
for in calculations.  Because of this, Dunning proposes a
likelihood ratio test that does not rely on the assumption of a
normal distribution.  This has the benefit that much smaller
corpora are needed for testing.
     Dunning also presents ways to use the likelihood ratio test
with multinomial distributions, and then suggests that research
should be conducted to see how the Poisson distribution and some
distribution free methods of analysis (such as Fischer's exact
method) perform on N-gram calculations.  

     One method that attempts to find precise N-gram probabilities
is discussed in Stolcke and Segal.  They present and algorithm for
computing N-gram probabilities from stochastic context-free
grammars (SCFG), which are "a set of phrase-structure rules,
annotated with probabilities of choosing a certain production given
the left-hand side nonterminal" [Stolcke].  The probability of any
N-gram can be computer by finding substring probabilities expressed
as two linear systems of equations which can be solved.

                                Conclusion
     This paper has reviewed some of the research about the
computation of N-gram probabilities, which are useful for several
computational linguistics functions including finding interesting
collocations for use in machine translation and automatic tagging
of text.  Many of the methods used to screen the large lists of
candidate collocations are based on the assumption of a normal
distribution, which does not accurately compute probabilities for
rare events.  This means that the probabilities for the most
interesting events (the rare events) are overestimated.  Dunning
has suggested several other statistical distributions that should
be explored instead of the normal distribution, and also that
distribution free methods should be researched.

                                REFERENCES

Church, Kenneth. A Stochastic Parts Program and Noun Phrase Praser
for Unrestricted Text.  Proceedings of the Association for
Computational Linguistics Second Conference on Applied Natural
Language Processing.

Church, Kenneth, and Patrick Hanks. 1989. Word Association Norms,
Mutual Information, and Lexicography.  Proceedings of the 27th
Annual Meeting of the Association for Computational Linguistics.

Dagan, Pereira, and Lee. 1994. Similarity-based Estimation of Word
Cooccurrence Probabilities.  Proceedings of the 32nd Annual Meeting
of the Association for Computational Linguistics 272-277.

Daille, B., E. Gaussier, and J.M. Lange.  Towards Automatic
Extraction of Monolingual and Bilingual Terminalogy.

Dunning, Ted. 1993. Accurate Methods for the Statistics of Surprise
and Coincidence.  Computational Linguistics 3:61-74.

Hindle, Donald, and Mats Rooth. 1991. Structural Ambiguity and
Lexical Relations.  Proceedings of the 29th Annual Meeting of teh
Association for Computational Linguistics.

Joshi, Aravind K. 1994. Some Recent Trends in Natural Language
Processing.  Linguistica Computazionale 9-10:491-501.

Liberman, Mark, and Mitch Marcus. 1992. Very Large Text Corpora:
What You Can Do With Them, and How To Do It. Tutorial at the 31st
Annual Meeting of the Association for Computational Linguistics.

Pereira, Tishby, and Lee. 1993. Distributional Clustering of
English Words. Proceedings of the 31st Annual Meeting of the
Association for Computational Linguistics 183-190.

Smadja, Frank. 1993. Retrieving Collocaionts from Text:Xtract.
Computational Linguistics. 19:143-175.

Stolcke, Andreas, and Jonathan Segal. 1994. Precise N-gram
Probabilites from Stochastic Context-free Grammars.  Proceedings of
the 32nd Annual Meeting of the Association for Computational
Linguistics 74-79.

Su, Keh-Yih, Ming-Wen Wu, and Jing-Shin Chang. A Corpus-based
Approach to Automatic Compound Extraction. Proceedings of the 32nd
Annual Meeting of the Association for Computational Linguistics,
242-247.