![]() |
![]() |
![]() |
PRELIMINARY EXAM QUESTION FIVE By Lynellen D. S. Perry A Preliminary Exam Response Submitted to the Faculty of Mississippi State University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science in the Department of Computer Science Mississippi State, Mississippi February 199 QUESTION 5 You have been doing some work on the problem of constructing a computer program which can automatically assign the correct syntactic and semantic tag to words in natural language text. The techniques that you (and your colleagues) are applying to this problem are various types of Machine Learning techniques, including artificial neural networks and genetic algorithms. These approaches all involve taking some real-world data set, which consists of complex patterns of high-dimensionality, with accompanying noise and redundancy, and attempting to discover (1) the features inherent in the data set that are most useful in assigning the correct tags, and (2) the rules for using these features to correctly assign tags. These approaches assume that the categories are already known, and that the task is to develop a method of classifying each new case into the appropriate category. The “curse of dimensionality” in this problem comes from the number of words that may occur in the data set, the fact that many words may be tagged differently in different contexts, and the potentially broad extent of the context required to tag a given word. Your goal is to take the original set of N features and derive a smaller subset of relevant features, along with rules for using these features to classify words into tag categories. This subset should preserve the useful information that is present in the data set (information that allows us to discriminate members of one class from members of another), but should exclude noise and redundant information. Question 5, Part 1. Briefly discuss the possibility that inappropriate categories have been selected, and that feature-extraction techniques might find a more appropriate and relevant set of categories in the data. (Do not spend much time on this, as to some extent you are stuck with the tag set that is currently in widespread use.) First, allow me to address a few small points of misunderstanding in relation to the program of research that my colleagues and I have been pursuing. While it is true that previous research for the KUDZU (Knowledge Under Development from Zero Understanding) project made use of both syntactic and semantic tags, our current research ignores semantic tags because they unmanageably increase the dimensionality of the problem space. Additionally, while we are (in a basic sense) researching the assignment of part-of-speech (POS) tags to natural language words, we are not primarily working with the initial assignment process. Instead, we allow the highly acclaimed rule-based tagger created by Brill (Brill, 1992; Brill, 1994) to pre-process the text and assign POS tags. Following this pass, our goal is to search for places where we feel the Brill tagger has made errors. We do not try to perform the original assignment of tags, but merely to find and correct some of the more serious mis-taggings, especially where verbs have been tagged incorrectly as something other than verbs. The reason we focus on this particular tagging mistake is reported in the literature (Boggess and Smith, 1996). Briefly, the Brill tagger tends to label unknown words as nouns, which is a reasonable guess since the noun category is the largest POS class. However, in the domain of physical chemistry journal articles, the “success rate on novel verbs in main verb position is in the neighborhood of 50%” (Boggess and Smith, 1996). When the main verb is mis-tagged, the subsequent parse of the sentence can be severely altered, which restricts the possibility of successful knowledge extraction. Since the KUDZU system focuses on small parts of each journal article (the abstract, the introduction, and the conclusion), there isn’t much room for losing sentences, especially several sentences in a row. So we decided to focus on discovering the contexts in which the Brill tagger makes mistakes in tagging main verbs in particular, and noun/verb confusions in general. This process comes after POS tags have been assigned, and while we may correct a tag assignment, we do not attempt to perform large-scale original tag assignment. It is interesting to note that Brill (1992) recorded the fact that his tagger has problems with correctly tagging verbs and nouns. Our focus is to further patch the results of the Brill tagger so as to reduce the noun/verb confusion problem he specifies: For example, when the initial tagger tags the patch corpus, it mis-tags 159 words as verbs when they should be nouns. If the patch change the tag from verb to noun if one of the two preceding words is tagged as a determiner is applied, it corrects 98 of the 159 errors. However, it results in an additional 18 errors from changing tags which really should have been verb to noun. This patch results in a net decrease of 80 errors on the patch corpus (Brill, 1992). So far, our use of genetic algorithms has not been directly related to discovering contexts in which nouns and verbs are confused. Instead, we have been using genetic algorithms in an attempt to evolve the weights for a fixed-architecture neural network which does the actual work of examining the tagged text for errors. This approach was inspired by Porto and Fogel (1995) who suggested that “instead of training the weight connections between the nodes, an evolutionary approach can help to overcome the problem of the network’s getting caught in local minimum, in addition to providing smaller training times” (Boggess and Smith, 1996). As to finding the `features inherent in the data set that are most useful in assigning the correct tags’, this we are trying to do in the sense that this is how a neural network operates. Since the neural network encodes these features in the values of the weights on the connections, there isn’t any way that we can easily extract or evaluate the features themselves. All we can do is decide if the network is performing adequately given whatever features it has encoded during training. Since the task is to examine and correct part-of-speech tags, there is not a lot of room to inspect the idea that inappropriate categories have been selected. The tag set we use is not a standard one in the literature, but has been augmented with a few extra tags so that subsequent parsing can be a little more accurate. We could change categories by switching to a different tag set, either a less specific tag set from the literature or a tag set that is further customized. Other than that, I can’t envision what other textual features would be appropriate or relevant to the task. Question 5 Part 2. Discuss to what extent your a priori decisions to limit (1) word information (last few letters of the word versus entire word), and (2) contextual information (number of preceding and/or following words) may be affecting the ability of your machine learning techniques to extract the necessary information. Brill (1992) records the features that are used in the Brill tagger for deciding on the appropriate tag for a word. These features include: the context around the word to be tagged, the lexical properties of the target word, and a combination of lexical property and the “region” in which the word resides. Additionally, Brill has a couple of procedures to correct some common mis-taggings: “words that were not in the training corpus and are capitalized tend to be proper nouns” and “tag words not seen in the training corpus by assigning such words the tag most common for words ending in the same three letters” (Brill, 1992). Some tag corrections are accomplished by following eight rules that examine the context around the target word: the preceding/following word, the word two locations before/after the target, either of the two preceding/following words, one of the three preceding/following words, the preceding and following word, and the capitalization of the current word or the previous word. Our decisions to limit the word information to the last few letters of the word instead of using the entire word, and to limit the contextual information are not entirely a priori. We follow Brill’s lead in part, as explained above. The templates that Brill uses to limit the word information examined when trying to tag a word impose a 4 character upper limit on both prefixes and suffixes (Brill, 1993). We currently only use the last two letters of the target word. Perhaps more suffix letters and even some prefix letters would be useful to the neural network since this information is useful to a rule-based tagger. To avoid the problem of hard-coding the number of letters to be used, a recurrent neural network should receive the letters and dynamically decide how many are useful. It is certainly the case that our current word information and context limitations may be affecting the ability of the neural network to extract the necessary information for the tagging task. Long range dependencies do occur in syntax, and are not constant for all possible constructs. Some syntactic patterns will require only one word to the left to distinguish while others will require information about words a greater distance away. We have run a number of experiments in an attempt to find the correct context size on each side of the target word. However, the context is still a fixed size, which will not be appropriate to every instance but merely to a greater proportion of instances. One way to overcome this inflexible obstacle is to use recurrent neural networks instead of conventional neural networks. Rodriguez (1995) says that it is well established that sentences of natural languages are more than just a linear arrangement of words. Sentences have complex temporal structures that are often characterized syntactically, such as phrase structures, subject-verb agreement, and relative clauses . . . A recurrent neural network (RNN) architecture contains feedback connections which allow the system to have node activation values at one time step affect the same node activation values at the next time step, thus reflecting temporal processing. By using a recurrent neural network, the effective context could be flexible and infinite because the input is no longer a hard-coded set of context tags but is simply a stream of single tags (one tag per time step, similar to how the eye reads one word at a time). The network is then allowed to dynamically use as much or as little of the context as will help it achieve good results. Question 5, Part 3. Discuss to what extent your data structure (the way you have chosen to represent words, letters, tags, contexts, etc. to your machine learning algorithm) may be affecting the ability of your machine learning techniques to extract the necessary information. In the area of natural language research, there are several ways to represent words, letters, and tags to neural networks. Elman (1995) represents words as a single high bit in a vector with enough slots to uniquely represent every word in the lexicon. Clearly, he uses small lexicons (29 words) and this word-based approach is not feasible for real-world text. In our case it doesn’t matter because we do not represent entire words to the neural networks. Neither is this type of approach suitable for representing the letters of the target word (we would need a slot for each letter of the alphabet in both small and capital letters in addition to punctuation and other text characters that appear in formulas, etc.). The above method becomes a more viable way to represent the part-of-speech tags, but there are still 36 tags in the Penn Treebank tag-set (Brill, 1993), which we have augmented for our research. So each tag would be represented by a binary string of 36 locations. Currently, we represent the tags by a four bit Gray code in which each code represents several closely related part-of-speech tags. We mapped the Penn Treebank tags into a Gray code in order to reduce the dimensionality of the problem space when we were conducting a language visualization experiment. We decided to keep the Gray coding because of the way it represented part-of-speech tags that were similar in distribution or meaning into bit strings that were similar. Mathias and Whitley (1994) report that “the use of Gray coding has been found to enhance the performance of genetic search in some cases. However, Gray coding produces a different function mapping that may have fewer local optima and different relative hyperplane relationships. Therefore, inferences about a function will not necessarily hold when transformed to another search space.” So it is entirely possible that our choice to use the Gray codes is a significant factor affecting the ability of the neural network to extract the correct information. In addition, the network might be interpreting the Gray code values in a way that is different from what we intended. For example, the network is not instructed on how to view each clump of four bits as one input entity, so it has no idea that there are meaningful clusters in the input. A network architecture that explicitly recognized this fact might have quite different results from our current networks. A different representation of the tag-set would very likely change the network’s ability to find meaning in the input stream. A Gray code tag-set representation means that the context input to the neural network is less straight-forward than it could be. In order to efficiently handle network input in the form of a vector with one high bit representing the part-of-speech tag for that context location, we would need to move to a recurrent network architecture where the inputs are presented one at a time instead of in a hard-coded left-to-right representation of the context. This would also remove the spatial encoding of time so that a more natural method of processing could be followed. “The temporal order of input events (first-to- last) is represented by the spatial order (left-to-right) of the input vector. There are a number of problems with this approach. One of the most serious is that the left-to-right spatial ordering has no intrinsic significance at the level of computation which is meaningful for the network” (Elman, 1995). In summary, it is probable that our choices for the representations of the letters, tags, and context do indeed affect the ability of the neural network to extract the necessary information. Each of these approaches should probably be changed. Question 5, Part 4. Read the attached article, “No Free Lunch Theorems for Search”. What implications does the No Free Lunch theorem have for the kinds of approaches that might be expected to yield the best expectations for success with such a complex and difficult problem as text tagging? How does Wolpert and Macready’s thesis apply to your situation? Describe at least one new approach (or changes or additions to your current approach) that the No Free Lunch theorem suggests might produce improved results over your current approach. Quite simply, the No Free Lunch (NFL) theorem implies, for the kinds of approaches (such as hill-climbing, simulated annealing, and genetic algorithms) that are expected to be successful with a complex and difficult problem, that it doesn’t matter what algorithm is used to search unless the search function is customized to take advantage of the domain knowledge for the particular problem. “Blind faith in an algorithm to search effectively across a broad class of problems is rarely justified” (Wolpert and Macready, 1996). Not only is there the chance that a given search algorithm will perform randomly on the particular search space, but “that it may very well perform even worse” than random (Wolpert and Macready, 1996). In our situation we use a generic neural network to search for the global minimum error on the function that relates the input string to the desired output. The NFL theorem says that we are fortunate that the neural network approach has been successful to any degree at all. In addition, according to Kolen and Pollack (1990), we have been lucky that we have had initial network conditions that converge to a greater or lesser extent when we could have hit upon initial conditions that simply do not allow for convergence. The NFL theorem says that unless we customize our search function to consider the features and domain knowledge of the problem space, then we might as well use any of several other search algorithms. In addition, just because neural networks have worked well for other classification or prediction problems does not mean it must work well for the part-of- speech tag classification of prediction problem. To improve the results we have achieved with our current approach, the NFL theorem says we must customize the search algorithm. “It is well known that generic methods (like simulated annealing and genetic algorithms) are unable to compete with carefully hand-crafted solutions for specific search problems” (Wolpert and Macready, 1996). To do this, we must “determine salient features of it [our cost function], and then construct a search algorithm, a, specifically tailored to match those features” (Wolpert and Macready, 1996). What a priori knowledge do we have available about the problem space that we can apply to the customization of the search algorithm? I would argue that we are already using domain knowledge to restrict and customize our search. If we were attempting to assign part-of-speech tags to a completely untagged text, then we would need to build in some knowledge about English grammar rules, lexical hints to the syntax, and context rules such as are already encoded in the Brill tagger. However, our search is already constrained quite a bit. We are not trying to assign a tag without any information about the word or its context. Instead, we already have part-of-speech tags on each side of the word which have been assigned by a very good (but still fallible) tagger. In addition, we are not examining every word in the text, but only those which are prospective mistakes (the verbs, nouns, adjectives, and adverbs). Because we feed the context and some lexical information into the network, I believe we have customized the search to some extent. The neural network algorithm itself hasn’t been customized, but the information used as a base for the search has been customized. Further customization would be possible by creating a neural network architecture that explicitly recognized the existence of chunks of information in the input string. For example, the network could be forced to realize that groups of four bits (the Gray code symbols) are related to each other and not merely part of a 24 bit context. We could also use a different network architecture such as a recurrent network so that the context information was not presented all at once but was presented one tag at a time in a more temporally-nature manner. Customization of the data structures could also improve our results. The tag-set could be represented by a more straight-forward bit string instead of a Gray code. The lexical information could be presented as a more compact code (such as the numeric value of the letter) instead of a 7 bit string for each letter. REFERENCES Boggess, Lois and Lynellen D. S. P. Smith. 1996. But `propeller’ is a verb! Automatic tagging and noun/verb confusions. In Proceedings of the Ninth Florida Artificial Intelligence Research Symposium, Key West, Florida, May 20-22 1996, edited by John H. Stewman, by the Florida Artificial Intelligence Research Symposium, 511- 515. Brill, Eric. 1992. A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, in Trento, Italy, by The Association for Computational Linguistics, 152-155. Brill, Eric. 1993. A corpus-based approach to language learning. Ph.D. dissertation, Department of Computer and Information Science, University of Pennsylvania. Brill, Eric. 1994. Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Menlo Park: AAAI Press/The MIT Press, 722-727. Elman, Jeffrey L. 1995. Language as a dynamical system. Mind as Motion: Explorations in the Dynamics of Cognition, Robert F. Port and T. van Gelder, eds. Cambridge, MA: The MIT Press, 195-223. Kolen, John F., and Jordan B. Pollack. 1990. Back propagation is sensitive to initial conditions. Complex Systems, 4(3):269-280. Mathias, Keith E. and L. Darrell Whitley. 1994. Transforming the search space with Gray coding. In Proceedings of the IEEE Conference on Evolutionary Computation, Vol. 1, 513-518. Porto, Vincent W., and David B. Fogel. 1995. Alternative neural network training methods. IEEE Expert. June, 16-22. Rodriguez, Paul. 1995. Representing the structure of a simple context-free language in a recurrent neural network: A dynamical systems approach. The newsletter of the Center for Research in Language, University of California, San Diego. 10(1). Http://crl.ucsd.edu/newsletter/10-1/newsletter.html Wolpert, David H. and William G. Macready. 1996. “No free lunch theorems for search.” Santa Fe Institute Technical Report.SFI-TR-95-02-010. |