![]() |
![]() |
![]() |
PRELIMINARY EXAM QUESTION TWO By Lynellen D. S. Perry A Preliminary Exam Response Submitted to the Faculty of Mississippi State University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science in the Department of Computer Science Mississippi State, Mississippi February 199 QUESTION TWO In the domain of learning, please address the following issues: 1. Name and describe briefly three different paradigms for machine learning, and include in your comments the principle aspects of the training regimen. 2. Do any of the paradigms in 1. above correspond relatively closely to human learning? If so, in what ways? If not, why? 3. What are desirable properties of a learning paradigm? Why do you consider these properties to be desirable? PARADIGMS OF LEARNING Most artificial intelligence researchers would agree that “learning ability is no doubt central to human intelligence” (Michalski, 1987). In attempting to either model human intelligence or to merely mimic the products of human intelligence, researchers have pursued progress in the field of machine learning. Langley (1996) proposes five paradigms of machine learning: neural networks, instance-based (or case-based) learning, genetic algorithms, rule induction, and analytic learning. Briscoe and Caelli (1996) list only four paradigms: symbolic empirical learning which includes inductive learning methods and focuses on knowledge acquisition, explanation-based learning which focuses on knowledge enhancement, the genetic paradigm, and the connectionist paradigm. For this paper, I will explain in more detail the paradigms of neural networks, genetic algorithms, and instance-based or case-based learning. NEURAL NETWORKS Neural networks are very popular in a wide variety of research areas presently. They are viable computational models for a wide variety of problems, including pattern classification, speech synthesis and recognition, adaptive interfaces between humans and complex physical systems, function approximation, image data compression, associative memory, clustering, forecasting and prediction, combinatorial optimization, nonlinear system modeling, and control (Hassoun, 1995). Although neural networks were inspired by the biology of human and animal brains, they are not necessarily “faithful models of biologic neural or cognitive phenomena” (Hassoun, 1995). Hassoun (1995) presents an excellent coverage of a significant portion of neural network research and application. The following summary of neural networks is gleaned from his book unless otherwise noted. A wide variety of neural network architecture categories exists, including: single linear threshold gates, single layer networks, multilayer networks, adaptive multilayer networks, and associative neural memories. Basically, a neural network is a “network of threshold units that spreads activation from input nodes through internal units to output nodes. Weights on the links determine how much activation is passed on” (Langley, 1996). Hassoun (1995) notes that “a very important feature of these networks is their adaptive nature, where `learning by example’ replaces traditional `programming’ in solving problems. This feature makes such computational models very appealing in application domains where one has little or incomplete understanding of the problem to be solved but where training data is readily available”. Learning rules are different for each type of architecture category but can be divided into the basic groups of supervised, unsupervised, and reinforcement learning rules. In supervised learning, each input pattern is associated with the `correct’ output pattern that the network is to learn to produce. Activation is spread through the network after an input pattern is presented, and the output pattern is compared to the expected output. The particular learning rule is applied so that the weights on the links in the network are changed so that the actual output will more closely resemble the expected output for that pattern. Supervised learning rules are grouped into error-correcting rules and gradient-descent-based rules. “Unsupervised learning involves the clustering of (or the detection of similarities among) unlabeled patterns of a given training set. The idea here is to optimize (maximize or minimize) some criterion or performance function defined in terms of the output activity of the units in the network” (Hassoun, 1995). The weights on the links of the network are adjusted so that they cause the output to “converge to representations that capture the statistical regularities of the input data” (Hassoun, 1995). Unsupervised learning rules include principal-component analysis, clustering, vector quantization, and self-organizing feature maps. Reinforcement learning “is a process of trial and error designed to maximize the expected value of a criterion function” (Hassoun, 1995). A three-layer network that uses back-propagation for learning is a fairly standard architecture. Training this type of network follows this algorithm (Hassoun, 1995): 1. Initialize all weights to random values (uniformly distributed, usually between -0.5 and +0.5). 2. Select an input pattern and propagate activation levels through the network. Hidden layer activations are computed by first finding the sum of the input layer nodes’ activation multiplied by the weight on the link. This sum is `squashed’, usually by a sigmoidal function and the result is the activation of the hidden layer node. A similar process is repeated to compute the activation of each output layer node. 3. Compare the activation levels of the output nodes to the expected output. Computer output layer weight changes. 4. Decide the extent to which each connection to the output layer was responsible for creating the output activation and adjust the weights of the links so as to reduce the error between the actual output and the expected output. Repeat this process for the links between the hidden and input layers. 5. Repeat the whole process until the weights converge. One of the problems with neural network learning is that the ability of the network to train is dependent on several factors such as the order in which training pairs are presented (Wiegerinck and Heskes), the proportion of positive and negative examples presented, the initial starting weight values (Kolen and Pollack, 1990), and the parameters used for training (such as the learning rate, and any momentum rate). Neural networks can become stuck in local minima, which is the reason some choose to use a momentum factor that might pop the network out of a shallow local minima. GENETIC ALGORITHMS Genetic algorithms “represent acquired knowledge as a set of Boolean or binary features . . . The standard learning operators in genetic algorithms generate new candidate rules from parents that have high strengths, where strength reflects some measure of performance on training cases. In effect, genetic methods carry out a parallel hill-climbing search” (Langley, 1996). Like neural networks, genetic algorithms were inspired by a biological process. In this case, a solution to a suitably encoded problem evolves “according to the principles of natural selection and `survival of the fittest’” (Beasley et. al, 1993). A genetic algorithm executes as follows (see Whitley, 1993 for more specific coverage, but the algorithm below is summarized from that work). 1. A population of uniform-length chromosomes is created randomly. A chromosome is simply a string of values, typically binary in nature. Each location in the chromosome has some meaning mapped to it in order for the search to find some meaningful result. 2. Each chromosome is evaluated by an appropriate fitness function. 3. Reproductive opportunities are then allocated using some scheme (random, proportional, and others). 4. Two parents are selected and an evolutionary operation is applied. Duplication, crossover, and mutation are typical operations. Duplication simply consists of creating an exact copy of the parent chromosome. Crossover can occur at one or several points in the parent chromosomes. In a single point crossover, a location is randomly selected along the length of the chromosome. Then each parent is cut at that point and the end pieces are traded between the parents. For multi-point crossover, several cuts are made and the segments are extracted and traded. 5. An optional mutation step can be taken at this point. Mutation involves simply inverting the value of a random slot in the chromosome and is usually done rather infrequently. Mutation is similar to the momentum rate for neural networks in that it helps the population jump out of a local fitness minima. 6. The children created in steps 4 and 5 are evaluated by the same fitness function as the parents, and a new generation is chosen by some scheme (such as choosing the N most fit chromosomes). 7. Steps 2 through 6 are repeated until the population of chromosomes converges or some other factor interrupts the search (for example, a time constraint or a limitation on the number of generations to be produced and evaluated). INSTANCE- OR CASE-BASED LEARNING Case-based learning “represents knowledge in terms of specific cases or experiences and relies on flexible matching methods to retrieve these cases and apply them to new situations” (Langley, 1996). Michalski (1987) says that there is a spectrum of case-based strategies ranging from a system that “performs no inference, but directly accepts and uses the information given to it (or built into it)” to the other extreme where “the system performs complex, search-based inductive inference that on occasion leads to the discovery of new knowledge”. Michalski (1987) outlines six strategies that fall along this spectrum. The first strategies require the most effort from the teacher while the last ones listed require the most effort from the learner. Michalski points out that, in humans, this same progression “reflects also an increasing confidence in the acquired knowledge.” First is the “direct implanting of knowledge” which requires little or no inference and includes “rote learning, learning by imitation, learning by being constructed or by being programmed.” Next is “learning from instruction” where “a learner selects and transforms the knowledge from the input language to an internally-usable representation and integrates it with prior knowledge for effective retrieval and use.” Learning by “deductive (truth-preserving) inference” on existing knowledge and new knowledge supplied to the system is also called “analytical or explanation-based learning” by some researchers. Fourth, learning by analogy “involves transforming or extending knowledge (or skill) applicable in one domain to perform a similar task in another domain.” Next is “learning from examples” where the system is “given a set of examples and (optionally) counter-examples of a concept, the learner induces a general concept description.” Finally, “learning by observation and discovery” creates “classifications of given observations, discovering relationships and laws governing a given system, or forming a theory to explain a given phenomenon” (Michalski, 1987). To examine one of these strategies in a little more depth, Briscoe and Caelli (1996) describe the case-based reasoning process as follows. Entire cases are stored in a memory framework, rather than storing any generalized knowledge or rules. “When given a new problem, case-based systems first search their case memory for an existing case that comes closest to the presented problem. The retrieved case is then modified to meet the specifications of the new case (case adaption), and the resulting solution may be added to the case memory for future use” (Briscoe and Caelli, 1996). The main concept is that it “should be more efficient to modify an existing solution than to solve each new problem from the beginning” (Briscoe and Caelli, 1996). Open research questions in this area include: exactly what information should be stored in the cases, how to structure the case memory and how to index the cases for efficient retrieval, what metrics are best for measuring the similarity of cases, and how case adaption can be performed best (Briscoe and Caelli, 1996). Two rather well known case-based reasoning systems are `JUDGE’ (which reasons about criminal sentencing) and `CHEF’ (which generates new recipes based on user input) (Briscoe and Caelli, 1996). HUMAN LEARNING IN COMPARISON TO MACHINE LEARNING Clearly, genetic algorithm learning does not correspond to a strategy that humans use for learning. Instead, genetic algorithms are based on the broader biological process of natural selection and evolution (Beasley et. al, 1993). Neural networks are inspired by the human neural anatomy and physiology (Hassoun, 1995) in that our brains have neurons that activate (or ‘fire’) according to a thresholding function, and which have many interconnections with other neurons. However, “the majority of neural network models . . are more closely related to nonparametric pattern classifiers, clustering algorithms, linear and nonlinear filters, and statistical regression models than they are to neurobiologic models” (Hassoun, 1995). Many of the instance-based learning strategies are closely modeled on human learning strategies. For example, case-based reasoning is based on psychological research that suggests that there are two types of human knowledge: generic and episodic. Generic knowledge concerns generalized experience that is usually expressible in rule form. However, much of our knowledge, if not most, is thought to be episodic, that is, related to experience. These specific experiences, that may be linked (indexed) to other experiences, guide our reasoning and actions, especially in solving more complex problems (Briscoe and Caelli, 1996). I recall learning how to solve algebra problems by first determining what type of problem it was then following a recipe for solving that type of problem. The particulars of the problem were different, but by finding a closely matching case, I could arrive at a solution to the new problem. Returning to the continuum of strategies proposed by Michalski (1987), it is clear that direct implantation of knowledge (such as the knowledge in an expert system) via rote learning, and learning by imitation is a human method of learning employed every day in schools. Learning by instruction is “the most widely used strategy of human learning: it includes learning from teachers, books, publications, exhibits, displays, and similar sources” (Michalski, 1987). Learning by deduction is something we do at a rather subconscious level but to a less formal degree than is used in the machine learning version of this strategy. “The system uses the domain knowledge to determine or explain why a given fact is an example of the concept. This process takes a form of formal proof.” (Michalski, 1987). Learning by analogy is an obvious human learning strategy. Learning from examples is a strategy that is tested on the SAT and GRE where one has to induce a general concept description from a set of examples and counter-examples. Learning by observation and discovery is the human process of original research where we create classifications, record relationships and laws, and develop theories to explain particular phenomenon given only observation. Langley (1996) states that there are four main goals that researchers have for their pursuit of machine learning. The first is to model the mechanisms of human learning. The second, an empirical goal, is to “discover general principles that relate the characteristics of learning algorithms, and of the domain in which they operate, to learning behavior” (Langley, 1996). Third, a mathematical study of machine learning, aims “to formulate and prove theorems about the tractability of entire classes of learning problems and algorithms designed to solve those problems” (Langley, 1996). Finally, an applications approach concerns itself only with utilizing machine learning strategies to solve real-world problems. Since the goals of researchers differ so widely, it is obvious that not every machine learning strategy will closely resemble the methods that humans use to learn, but some of the strategies will be developed specifically to model and resemble human strategies. DESIRABLE PROPERTIES OF LEARNING PARADIGMS First, and most simplistically, any particular learning strategy should strive to meet the goal with which it is associated. By this I mean, for example, that if the strategy is being developed to model an actual human learning strategy, then it should be as faithful to the human method as is possible. The approach should also be scaleable to problems larger than toy-size laboratory experiments because humans live and learn in a huge, complex world not a micro-world. Machine learning paradigms and the specific instances thereof should also be able to generalize, robustly handle noisy input, correctly use prior knowledge, handle complex environments, form new concepts, and actively explore the search space in question (Russell, 1996). Similarly, Langley (1996) has a list of properties that a machine learning (ML) paradigm should have: it should be able to handle both the noise of corrupted supervised feedback and the noise of a corrupted instance description. ML strategies should be able to “represent the experience that leads to learning” in addition to the “knowledge that learning produces” (Langley, 1996). They should also be able to interpret conceptual knowledge and have a good organization scheme for the knowledge contained in the system (Langley, 1996). Briscoe and Caelli (1996) say that ML systems should be able to acquire knowledge from external sources in addition to being able to improve “knowledge representations and structures so that existing knowledge may be better exploited.” Additionally, there must be methods of assessing the validity, effectiveness, and abstraction level of the strategy so that it can be compared to other methods. Validity is the “degree of accuracy with which the representation agrees with reality”, while effectiveness includes the “performance aspects of the system” and the abstraction level is “the explanatory power of the representation, including the scope, detail, and precision of concepts used in the description” (Briscoe and Caelli, 1996). Finally, I think that the foremost property that a machine learning system needs to possess is the ability to pass the Turing test. “If an observer cannot distinguish the responses of a programmed machine from those of a human being” (Gardner, 1985) then it doesn’t matter what method was used. The goal is to have a computer that learns. So if the ML system responds after learning in an indistinguishable manner from the way a human responds after learning the same material, then the ML system has indeed learned, regardless of what process is employed. REFERENCES Beasley, David, David R. Bull, and Ralph R. Martin. 1993. An overview of genetic algorithms: Part 1, Fundamentals. University Computing, 15(2):58-69. Briscoe, Garry, and Terry Caelli. 1996. A compendium of machine learning. Vol. 1, Symbolic machine learning. Norwood, New Jersey: Ablex Publishing Company. Gardner, Howard. 1985. The Mind’s New Science. BasicBooks. Hassoun, Mohamad H. 1995. Fundamentals of artificial neural networks. Cambridge, Massachusetts: The MIT Press. Kolen, John F., and Jordan B. Pollack. 1990. Back propagation is sensitive to initial conditions. Complex Systems, 4(3):269-280. Langley, Pat. 1996. Elements of machine learning. San Francisco: Morgan Kaufmann. Michalski, Ryszard S. 1987. Learning strategies and automated knowledge acquisition: An overview. Computational Models of Learning, Leonard Bolc (ed.). Berlin: Springer- Verlag. Russell, Stuart. 1996. Machine learning. In Handbook of Perception and Cognition series, Artificial Intelligence, 2d ed. Margaret A. Boden (ed.). San Diego: Academic Press. Wiegerinck, Wim, and Tom Heskes. To appear. How dependencies between successive examples affect on-line learning. Neural Computation. Whitley, Darrell. 1993. A genetic algorithm tutorial. Department of Computer Science, Colorado State University. Technical Report CS-93-103. |