PRELIMINARY EXAM

QUESTION FIVE






By

Lynellen D. S. Perry







A Preliminary Exam Response
Submitted to the Faculty of
Mississippi State University
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in Computer Science
in the Department of Computer Science






Mississippi State, Mississippi

February 199

QUESTION 5

You have been doing some work on the problem of constructing a computer program which can 
automatically assign the correct syntactic and semantic tag to words in natural language text.  The 
techniques that you (and your colleagues) are applying to this problem are various types of Machine 
Learning techniques, including artificial neural networks and genetic algorithms.  These approaches all 
involve taking some real-world data set, which consists of complex patterns of high-dimensionality, with 
accompanying noise and redundancy, and attempting to discover (1) the features inherent in the data set 
that are most useful in assigning the correct tags, and (2) the rules for using these features to correctly 
assign tags.  These approaches assume that the categories are already known, and that the task is to 
develop a method of classifying each new case into the appropriate category.
	The “curse of dimensionality” in this problem comes from the number of words that may occur 
in the data set, the fact that many words may be tagged differently in different contexts, and the 
potentially broad extent of the context required to tag a given word.  Your goal is to take the original set 
of N features and derive a smaller subset of relevant features, along with rules for using these features to 
classify words into tag categories.  This subset should preserve the useful information that is present in 
the data set (information that allows us to discriminate members of one class from members of another), 
but should exclude noise and redundant information.

Question 5, Part 1. Briefly discuss the possibility that inappropriate categories have been 
selected, and that feature-extraction techniques might find a more appropriate and 
relevant set of categories in the data. (Do not spend much time on this, as to some extent 
you are stuck with the tag set that is currently in widespread use.)

	First, allow me to address a few small points of misunderstanding in relation to the  
program of research that my colleagues and I have been pursuing.  While it is true that 
previous research for the KUDZU (Knowledge Under Development from Zero 
Understanding) project made use of both syntactic and semantic tags, our current research 
ignores semantic tags because they unmanageably increase the dimensionality of the 
problem space.  Additionally, while we are (in a basic sense) researching the assignment of 
part-of-speech (POS) tags to natural language words, we are not primarily working with 
the initial assignment process.  Instead, we allow the highly acclaimed rule-based tagger 
created by Brill (Brill, 1992; Brill, 1994) to pre-process the text and assign POS tags.  
Following this pass, our goal is to search for places where we feel the Brill tagger has 
made errors.  We do not try to perform the original assignment of tags, but merely to find 
and correct some of the more serious mis-taggings, especially where verbs have been 
tagged incorrectly as something other than verbs.  The reason we focus on this particular 
tagging mistake is reported in the literature (Boggess and Smith, 1996).  
Briefly, the Brill tagger tends to label unknown words as nouns, which is a 
reasonable guess since the noun category is the largest POS class.  However, in the 
domain of physical chemistry journal articles, the “success rate on novel verbs in main verb 
position is in the neighborhood of 50%” (Boggess and Smith, 1996).  When the main verb 
is mis-tagged, the subsequent parse of the sentence can be severely altered, which restricts 
the possibility of successful knowledge extraction.  Since the KUDZU system focuses on 
small parts of each journal article (the abstract, the introduction, and the conclusion), there 
isn’t much room for losing sentences, especially several sentences in a row.  So we 
decided to focus on discovering the contexts in which the Brill tagger makes mistakes in 
tagging main verbs in particular, and noun/verb confusions in general.  This process comes 
after POS tags have been assigned, and while we may correct a tag assignment, we do not 
attempt to perform large-scale original tag assignment.
It is interesting to note that Brill (1992) recorded the fact that his tagger has 
problems with correctly tagging verbs and nouns.  Our focus is to further patch the results 
of the Brill tagger so as to reduce the noun/verb confusion problem he specifies:
For example, when the initial tagger tags the patch corpus, it mis-tags 159 words 
as verbs when they should be nouns.  If the patch change the tag from verb to 
noun if one of the two preceding words is tagged as a determiner is applied, it 
corrects 98 of the 159 errors.  However, it results in an additional 18 errors from 
changing tags which really should have been verb to noun.  This patch results in a 
net decrease of 80 errors on the patch corpus (Brill, 1992).

	So far, our use of genetic algorithms has not been directly related to discovering 
contexts in which nouns and verbs are confused.  Instead, we have been using genetic 
algorithms in an attempt to evolve the weights for a fixed-architecture neural network 
which does the actual work of examining the tagged text for errors.  This approach was 
inspired by Porto and Fogel (1995) who suggested that “instead of training the weight 
connections between the nodes, an evolutionary approach can help to overcome the 
problem of the network’s getting caught in local minimum, in addition to providing 
smaller training times” (Boggess and Smith, 1996).
	As to finding the `features inherent in the data set that are most useful in assigning 
the correct tags’, this we are trying to do in the sense that this is how a neural network 
operates.  Since the neural network encodes these features in the values of the weights on 
the connections, there isn’t any way that we can easily extract or evaluate the features 
themselves.  All we can do is decide if the network is performing adequately given 
whatever features it has encoded during training.  
	Since the task is to examine and correct part-of-speech tags, there is not a lot of 
room to inspect the idea that inappropriate categories have been selected.  The tag set we 
use is not a standard one in the literature, but has been augmented with a few extra tags so 
that subsequent parsing can be a little more accurate.  We could change categories by 
switching to a different tag set, either a less specific tag set from the literature or a tag set 
that is further customized.  Other than that, I can’t envision what other textual features 
would be appropriate or relevant to the task.

Question 5 Part 2. Discuss to what extent your a priori decisions to limit (1) word 
information (last few letters of the word versus entire word), and (2) contextual 
information (number of preceding and/or following words) may be affecting the ability of 
your machine learning techniques to extract the necessary information.

Brill (1992) records the features that are used in the Brill tagger for deciding on 
the appropriate tag for a word.  These features include: the context around the word to be 
tagged, the lexical properties of the target word, and a combination of lexical property and 
the “region” in which the word resides.  Additionally, Brill has a couple of procedures to 
correct some common mis-taggings: “words that were not in the training corpus and are 
capitalized tend to be proper nouns” and “tag words not seen in the training corpus by 
assigning such words the tag most common for words ending in the same three letters” 
(Brill, 1992).  Some tag corrections are accomplished by following eight rules that 
examine the context around the target word: the preceding/following word, the word two 
locations before/after the target, either of the two preceding/following words, one of the 
three preceding/following words, the preceding and following word, and the capitalization 
of the current word or the previous word.
	Our decisions to limit the word information to the last few letters of the word 
instead of using the entire word, and to limit the contextual information are not entirely a 
priori.  We follow Brill’s lead in part, as explained above.  The templates that Brill uses to 
limit the word information examined when trying to tag a word impose a 4 character 
upper limit on both prefixes and suffixes (Brill, 1993).  We currently only use the last two 
letters of the target word.  Perhaps more suffix letters and even some prefix letters would 
be useful to the neural network since this information is useful to a rule-based tagger.  To 
avoid the problem of hard-coding the number of letters to be used, a recurrent neural 
network should receive the letters and dynamically decide how many are useful.
	It is certainly the case that our current word information and context limitations 
may be affecting the ability of the neural network to extract the necessary information for 
the tagging task.  Long range dependencies do occur in syntax, and are not constant for all 
possible constructs.  Some syntactic patterns will require only one word to the left to 
distinguish while others will require information about words a greater distance away.  We 
have run a number of experiments in an attempt to find the correct context size on each 
side of the target word.  However, the context is still a fixed size, which will not be 
appropriate to every instance but merely to a greater proportion of instances.  
One way to overcome this inflexible obstacle is to use recurrent neural networks 
instead of conventional neural networks.  Rodriguez (1995) says that 
it is well established that sentences of natural languages are more than just a linear 
arrangement of words.  Sentences have complex temporal structures that are often 
characterized syntactically, such as phrase structures, subject-verb agreement, and 
relative clauses . . . A recurrent neural network (RNN) architecture contains 
feedback connections which allow the system to have node activation values at one 
time step affect the same node activation values at the next time step, thus 
reflecting temporal processing.

By using a recurrent neural network, the effective context could be flexible and infinite 
because the input is no longer a hard-coded set of context tags but is simply a stream of 
single tags (one tag per time step, similar to how the eye reads one word at a time).  The 
network is then allowed to dynamically use as much or as little of the context as will help 
it achieve good results.

Question 5, Part 3. Discuss to what extent your data structure (the way you have chosen 
to represent words, letters, tags, contexts, etc. to your machine learning algorithm) may 
be affecting the ability of your machine learning techniques to extract the necessary 
information.

In the area of natural language research, there are several ways to represent words, 
letters, and tags to neural networks.  Elman (1995) represents words as a single high bit in 
a vector with enough slots to uniquely represent every word in the lexicon.  Clearly, he 
uses small lexicons (29 words) and this word-based approach is not feasible for real-world 
text.  In our case it doesn’t matter because we do not represent entire words to the neural 
networks.  Neither is this type of approach suitable for representing the letters of the 
target word (we would need a slot for each letter of the alphabet in both small and capital 
letters in addition to punctuation and other text characters that appear in formulas, etc.).  
The above method becomes a more viable way to represent the part-of-speech 
tags, but there are still 36 tags in the Penn Treebank tag-set (Brill, 1993), which we have 
augmented for our research.  So each tag would be represented by a binary string of 36 
locations.  Currently, we represent the tags by a four bit Gray code in which each code 
represents several closely related part-of-speech tags.  We mapped the Penn Treebank tags 
into a Gray code in order to reduce the dimensionality of the problem space when we were 
conducting a language visualization experiment.  We decided to keep the Gray coding 
because of the way it represented part-of-speech tags that were similar in distribution or 
meaning into bit strings that were similar.  Mathias and Whitley (1994) report that “the use 
of Gray coding has been found to enhance the performance of genetic search in some 
cases.  However, Gray coding produces a different function mapping that may have fewer 
local optima and different relative hyperplane relationships.  Therefore, inferences about a 
function will not necessarily hold when transformed to another search space.”  So it is 
entirely possible that our choice to use the Gray codes is a significant factor affecting the 
ability of the neural network to extract the correct information.
In addition, the network might be interpreting the Gray code values in a way that is 
different from what we intended.  For example, the network is not instructed on how to 
view each clump of four bits as one input entity, so it has no idea that there are meaningful 
clusters in the input.  A network architecture that explicitly recognized this fact might have 
quite different results from our current networks.  A different representation of the tag-set 
would very likely change the network’s ability to find meaning in the input stream.
A Gray code tag-set representation means that the context input to the neural 
network is less straight-forward than it could be.  In order to efficiently handle network 
input in the form of a vector with one high bit representing the part-of-speech tag for that 
context location, we would need to move to a recurrent network architecture where the 
inputs are presented one at a time instead of in a hard-coded left-to-right representation of 
the context.  This would also remove the spatial encoding of time so that a more natural 
method of processing could be followed.  “The temporal order of input events (first-to-
last) is represented by the spatial order (left-to-right) of the input vector.  There are a 
number of problems with this approach.  One of the most serious is that the left-to-right 
spatial ordering has no intrinsic significance at the level of computation which is 
meaningful for the network” (Elman, 1995).
In summary, it is probable that our choices for the representations of the letters, 
tags, and context do indeed affect the ability of the neural network to extract the necessary 
information.  Each of these approaches should probably be changed.
Question 5, Part 4. Read the attached article, “No Free Lunch Theorems for Search”.  
What implications does the No Free Lunch theorem have for the kinds of approaches that 
might be expected to yield the best expectations for success with such a complex and 
difficult problem as text tagging?  How does Wolpert and Macready’s thesis apply to 
your situation?  Describe at least one new approach (or changes or additions to your 
current approach) that the No Free Lunch theorem suggests might produce improved 
results over your current approach.

	Quite simply, the No Free Lunch (NFL) theorem implies, for the kinds of 
approaches (such as hill-climbing, simulated annealing, and genetic algorithms) that are 
expected to be successful with a complex and difficult problem, that it doesn’t matter what 
algorithm is used to search unless the search function is customized to take advantage of 
the domain knowledge for the particular problem.  “Blind faith in an algorithm to search 
effectively across a broad class of problems is rarely justified” (Wolpert and Macready, 
1996).  Not only is there the chance that a given search algorithm will perform randomly 
on the particular search space, but “that it may very well perform even worse” than 
random (Wolpert and Macready, 1996).
	In our situation we use a generic neural network to search for the global minimum 
error on the function that relates the input string to the desired output.  The NFL theorem 
says that we are fortunate that the neural network approach has been successful to any 
degree at all.  In addition, according to Kolen and Pollack (1990), we have been lucky that 
we have had initial network conditions that converge to a greater or lesser extent when we 
could have hit upon initial conditions that simply do not allow for convergence.  The NFL 
theorem says that unless we customize our search function to consider the features and 
domain knowledge of the problem space, then we might as well use any of several other 
search algorithms.  In addition, just because neural networks have worked well for other 
classification or prediction problems does not mean it must work well for the part-of-
speech tag classification of prediction problem.
	To improve the results we have achieved with our current approach, the NFL 
theorem says we must customize the search algorithm.  “It is well known that generic 
methods (like simulated annealing and genetic algorithms) are unable to compete with 
carefully hand-crafted solutions for specific search problems” (Wolpert and Macready, 
1996).  To do this, we must “determine salient features of it [our cost function], and then 
construct a search algorithm, a, specifically tailored to match those features” (Wolpert and 
Macready, 1996).  
What a priori knowledge do we have available about the problem space that we 
can apply to the customization of the search algorithm?  I would argue that we are already 
using domain knowledge to restrict and customize our search.  If we were attempting to 
assign part-of-speech tags to a completely untagged text, then we would need to build in 
some knowledge about English grammar rules, lexical hints to the syntax, and context 
rules such as are already encoded in the Brill tagger.  However, our search is already 
constrained quite a bit.  We are not trying to assign a tag without any information about 
the word or its context.  Instead, we already have part-of-speech tags on each side of the 
word which have been assigned by a very good (but still fallible) tagger.  In addition, we 
are not examining every word in the text, but only those which are prospective mistakes 
(the verbs, nouns, adjectives, and adverbs).  Because we feed the context and some lexical 
information into the network, I believe we have customized the search to some extent.  
The neural network algorithm itself hasn’t been customized, but the information used as a 
base for the search has been customized.
Further customization would be possible by creating a neural network architecture 
that explicitly recognized the existence of chunks of information in the input string.  For 
example, the network could be forced to realize that groups of four bits (the Gray code 
symbols) are related to each other and not merely part of a 24 bit context.  We could also 
use a different network architecture such as a recurrent network so that the context 
information was not presented all at once but was presented one tag at a time in a more 
temporally-nature manner.  
Customization of the data structures could also improve our results.  The tag-set 
could be represented by a more straight-forward bit string instead of a Gray code.  The 
lexical information could be presented as a more compact code (such as the numeric value 
of the letter) instead of a 7 bit string for each letter.

REFERENCES

Boggess, Lois and Lynellen D. S. P. Smith. 1996. But `propeller’ is a verb! Automatic 
tagging and noun/verb confusions. In Proceedings of the Ninth Florida Artificial 
Intelligence Research Symposium, Key West, Florida, May 20-22 1996, edited by 
John H. Stewman, by the Florida Artificial Intelligence Research Symposium, 511-
515.

Brill, Eric. 1992. A simple rule-based part of speech tagger. In Proceedings of the Third 
Conference on Applied Natural Language Processing, in Trento, Italy, by The 
Association for Computational Linguistics, 152-155.

Brill, Eric. 1993. A corpus-based approach to language learning. Ph.D. dissertation, 
Department of Computer and Information Science, University of Pennsylvania.

Brill, Eric. 1994. Some advances in transformation-based part of speech tagging. In 
Proceedings of the Twelfth National Conference on Artificial Intelligence, Menlo 
Park: AAAI Press/The MIT Press, 722-727.

Elman, Jeffrey L. 1995. Language as a dynamical system. Mind as Motion: Explorations in 
the Dynamics of Cognition, Robert F. Port and T. van Gelder, eds. Cambridge, MA: 
The MIT Press, 195-223.

Kolen, John F., and Jordan B. Pollack. 1990. Back propagation is sensitive to initial 
conditions. Complex Systems, 4(3):269-280.

Mathias, Keith E. and L. Darrell Whitley. 1994. Transforming the search space with Gray 
coding. In Proceedings of the IEEE Conference on Evolutionary Computation, Vol. 
1, 513-518.

Porto, Vincent W., and David B. Fogel. 1995. Alternative neural network training 
methods. IEEE Expert. June, 16-22.

Rodriguez, Paul. 1995. Representing the structure of a simple context-free language in a 
recurrent neural network: A dynamical systems approach. The newsletter of the 
Center for Research in Language, University of California, San Diego. 10(1).  
Http://crl.ucsd.edu/newsletter/10-1/newsletter.html

Wolpert, David H. and William G. Macready. 1996. “No free lunch theorems for search.” 
Santa Fe Institute Technical Report.SFI-TR-95-02-010.