PRELIMINARY EXAM

QUESTION TWO






By

Lynellen D. S. Perry







A Preliminary Exam Response
Submitted to the Faculty of
Mississippi State University
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in Computer Science
in the Department of Computer Science






Mississippi State, Mississippi

February 199

QUESTION TWO


In the domain of learning, please address the following issues:

1.  Name and describe briefly three different paradigms for machine learning, and 
include in your comments the principle aspects of the training regimen.
2.  Do any of the paradigms in 1. above correspond relatively closely to human 
learning?  If so, in what ways? If not, why?
3.  What are desirable properties of a learning paradigm?  Why do you consider these 
properties to be desirable?


PARADIGMS OF LEARNING 
Most artificial intelligence researchers would agree that “learning ability is no 
doubt central to human intelligence” (Michalski, 1987).  In attempting to either model 
human intelligence or to merely mimic the products of human intelligence, researchers 
have pursued progress in the field of machine learning.   Langley (1996) proposes five 
paradigms of machine learning: neural networks, instance-based (or case-based) learning, 
genetic algorithms, rule induction, and analytic learning.  Briscoe and Caelli (1996) list 
only four paradigms: symbolic empirical learning which includes inductive learning 
methods and focuses on knowledge acquisition, explanation-based learning which focuses 
on knowledge enhancement, the genetic paradigm, and the connectionist paradigm.  For 
this paper, I will explain in more detail the paradigms of neural networks, genetic 
algorithms, and instance-based or case-based learning.



NEURAL NETWORKS
	Neural networks are very popular in a wide variety of research areas presently.  
They are
viable computational models for a wide variety of problems, including pattern 
classification, speech synthesis and recognition, adaptive interfaces between humans 
and complex physical systems, function approximation, image data compression, 
associative memory, clustering, forecasting and prediction, combinatorial 
optimization, nonlinear system modeling, and control (Hassoun, 1995).

Although neural networks were inspired by the biology of human and animal brains, they 
are not necessarily “faithful models of biologic neural or cognitive phenomena” (Hassoun, 
1995).  Hassoun (1995) presents an excellent coverage of a significant portion of neural 
network research and application.  The following summary of neural networks is gleaned 
from his book unless otherwise noted.  A wide variety of neural network architecture 
categories exists, including: single linear threshold gates, single layer networks, multilayer 
networks, adaptive multilayer networks, and associative neural memories.  Basically, a 
neural network is a “network of threshold units that spreads activation from input nodes 
through internal units to output nodes.  Weights on the links determine how much 
activation is passed on” (Langley, 1996).  Hassoun (1995) notes that “a very important 
feature of these networks is their adaptive nature, where `learning by example’ replaces 
traditional `programming’ in solving problems.  This feature makes such computational 
models very appealing in application domains where one has little or incomplete 
understanding of the problem to be solved but where training data is readily available”.
Learning rules are different for each type of architecture category but can be 
divided into the basic groups of supervised, unsupervised,  and reinforcement learning 
rules.  In supervised learning, each input pattern is associated with the `correct’ output 
pattern that the network is to learn to produce.  Activation is spread through the network 
after an input pattern is presented, and the output pattern is compared to the expected 
output.  The particular learning rule is applied so that the weights on the links in the 
network are changed so that the actual output will more closely resemble the expected 
output for that pattern.  Supervised learning rules are grouped into error-correcting rules 
and gradient-descent-based rules.  
“Unsupervised learning involves the clustering of (or the detection of similarities 
among) unlabeled patterns of a given training set.  The idea here is to optimize (maximize 
or minimize) some criterion or performance function defined in terms of the output 
activity of the units in the network” (Hassoun, 1995).  The weights on the links of the 
network are adjusted so that they cause the output to “converge to representations that 
capture the statistical regularities of the input data” (Hassoun, 1995).  Unsupervised 
learning rules include principal-component analysis, clustering, vector quantization, and 
self-organizing feature maps.  Reinforcement learning “is a process of trial and error 
designed to maximize the expected value of a criterion function” (Hassoun, 1995).  
A three-layer network that uses back-propagation for learning is a fairly standard 
architecture.  Training this type of network follows this algorithm (Hassoun, 1995):
1.  Initialize all weights to random values (uniformly distributed, usually between    
-0.5 and +0.5).
2.  Select an input pattern and propagate activation levels through the network.  
Hidden layer activations are computed by first finding the sum of the input 
layer nodes’ activation multiplied by the weight on the link.  This sum is 
`squashed’, usually by a sigmoidal function and the result is the activation of 
the hidden layer node.  A similar process is repeated to compute the activation 
of each output layer node.
3.  Compare the activation levels of the output nodes to the expected output.  
Computer output layer weight changes.
4.  Decide the extent to which each connection to the output layer was responsible 
for creating the output activation and adjust the weights of the links so as to 
reduce the error between the actual output and the expected output.  Repeat 
this process for the links between the hidden and input layers.
5.  Repeat the whole process until the weights converge.
One of the problems with neural network learning is that the ability of the network 
to train is dependent on several factors such as the order in which training pairs are 
presented (Wiegerinck and Heskes), the proportion of positive and negative examples 
presented, the initial starting weight values (Kolen and Pollack, 1990), and the parameters 
used for training (such as the learning rate, and any momentum rate).  Neural networks 
can become stuck in local minima, which is the reason some choose to use a momentum 
factor that might pop the network out of a shallow local minima.

GENETIC ALGORITHMS
	Genetic algorithms “represent acquired knowledge as a set of Boolean or binary 
features . . . The standard learning operators in genetic algorithms generate new candidate 
rules from parents that have high strengths, where strength reflects some measure of 
performance on training cases.  In effect, genetic methods carry out a parallel hill-climbing 
search” (Langley, 1996).  Like neural networks, genetic algorithms were inspired by a 
biological process.  In this case, a solution to a suitably encoded problem evolves 
“according to the principles of natural selection and `survival of the fittest’” (Beasley et. 
al, 1993).
A genetic algorithm executes as follows (see Whitley, 1993 for more specific 
coverage, but the algorithm below is summarized from that work).  
1.  A population of  uniform-length chromosomes is created randomly.  A 
chromosome is simply a string of values, typically binary in nature.  Each 
location in the chromosome has some meaning mapped to it in order for the 
search to find some meaningful result.  
2.  Each chromosome is evaluated by an appropriate fitness function.  
3.  Reproductive opportunities are then allocated using some scheme (random, 
proportional, and others).  
4.  Two parents are selected and an evolutionary operation is applied.  
Duplication, crossover, and mutation are typical operations.  Duplication 
simply consists of creating an exact copy of the parent chromosome.  
Crossover can occur at one or several points in the parent chromosomes.  In a 
single point crossover, a location is randomly selected along the length of the 
chromosome.  Then each parent is cut at that point and the end pieces are 
traded between the parents.  For multi-point crossover, several cuts are made 
and the segments are extracted and traded.
5.  An optional mutation step can be taken at this point.  Mutation involves simply 
inverting the value of a random slot in the chromosome and is usually done 
rather infrequently.  Mutation is similar to the momentum rate for neural 
networks in that it helps the population jump out of a local fitness minima.
6.  The children created in steps 4 and 5 are evaluated by the same fitness function 
as the parents, and a new generation is chosen by some scheme (such as 
choosing the N most fit chromosomes).
7.  Steps 2 through 6 are repeated until the population of chromosomes converges 
or some other factor interrupts the search (for example, a time constraint or a 
limitation on the number of generations to be produced and evaluated).

INSTANCE- OR CASE-BASED LEARNING
	Case-based learning “represents knowledge in terms of specific cases or 
experiences and relies on flexible matching methods to retrieve these cases and apply them 
to new situations” (Langley, 1996).  Michalski (1987) says that there is a spectrum of 
case-based strategies ranging from a system that “performs no inference, but directly 
accepts and uses the information given to it (or built into it)” to the other extreme where 
“the system performs complex, search-based inductive inference that on occasion leads to 
the discovery of new knowledge”.  Michalski (1987) outlines six strategies that fall along 
this spectrum.  The first strategies require the most effort from the teacher while the last 
ones listed require the most effort from the learner.  Michalski points out that, in humans, 
this same progression “reflects also an increasing confidence in the acquired knowledge.”
	First is the “direct implanting of knowledge” which requires little or no inference 
and includes “rote learning, learning by imitation, learning by being constructed or by 
being programmed.”  Next is “learning from instruction” where “a learner selects and 
transforms the knowledge from the input language to an internally-usable representation 
and integrates it with prior knowledge for effective retrieval and use.”  Learning by 
“deductive (truth-preserving) inference” on existing knowledge and new knowledge 
supplied to the system is also called “analytical or explanation-based learning” by some 
researchers.  Fourth, learning by analogy “involves transforming or extending knowledge 
(or skill) applicable in one domain to perform a similar task in another domain.”  Next is 
“learning from examples” where the system is “given a set of examples and (optionally) 
counter-examples of a concept, the learner induces a general concept description.”  
Finally, “learning by observation and discovery” creates “classifications of given 
observations, discovering relationships and laws governing a given system, or forming a 
theory to explain a given phenomenon” (Michalski, 1987).
	To examine one of these strategies in a little more depth, Briscoe and Caelli (1996) 
describe the case-based reasoning process as follows.  Entire cases are stored in a memory 
framework, rather than storing any generalized knowledge or rules.  “When given a new 
problem, case-based systems first search their case memory for an existing case that comes 
closest to the presented problem.  The retrieved case is then modified to meet the 
specifications of the new case (case adaption), and the resulting solution may be added to 
the case memory for future use” (Briscoe and Caelli, 1996).  The main concept is that it 
“should be more efficient to modify an existing solution than to solve each new problem 
from the beginning” (Briscoe and Caelli, 1996). 
Open research questions in this area include: exactly what information should be 
stored in the cases, how to structure the case memory and how to index the cases for 
efficient retrieval, what metrics are best for measuring the similarity of cases, and how 
case adaption can be performed best (Briscoe and Caelli, 1996).  Two rather well known 
case-based reasoning systems are `JUDGE’ (which reasons about criminal sentencing) and 
`CHEF’ (which generates new recipes based on user input) (Briscoe and Caelli, 1996).

HUMAN LEARNING IN COMPARISON TO MACHINE LEARNING
	Clearly, genetic algorithm learning does not correspond to a strategy that humans 
use for learning.  Instead, genetic algorithms are based on the broader biological process 
of natural selection and evolution (Beasley et. al, 1993).  Neural networks are inspired by 
the human neural anatomy and physiology (Hassoun, 1995) in that our brains have 
neurons that activate (or ‘fire’) according to a thresholding function, and which have many 
interconnections with other neurons.  However, “the majority of neural network models . .  
are more closely related to nonparametric pattern classifiers, clustering algorithms, linear 
and nonlinear filters, and statistical regression models than they are to neurobiologic 
models” (Hassoun, 1995).
	Many of the instance-based learning strategies are closely modeled on human 
learning strategies.  For example, case-based reasoning is 
based on psychological research that suggests that there are two types of human 
knowledge: generic and episodic.  Generic knowledge concerns generalized 
experience that is usually expressible in rule form.  However, much of our 
knowledge, if not most, is thought to be episodic, that is, related to experience.  
These specific experiences, that may be linked (indexed) to other experiences, 
guide our reasoning and actions, especially in solving more complex problems 
(Briscoe and Caelli, 1996).

I recall learning how to solve algebra problems by first determining what type of problem 
it was then following a recipe for solving that type of problem.  The particulars of the 
problem were different, but by finding a closely matching case, I could arrive at a solution 
to the new problem.
	Returning to the continuum of strategies proposed by Michalski (1987), it is clear 
that direct implantation of knowledge (such as the knowledge in an expert system) via rote 
learning, and learning by imitation is a human method of learning employed every day in 
schools.  Learning by instruction is “the most widely used strategy of human learning: it 
includes learning from teachers, books, publications, exhibits, displays, and similar 
sources” (Michalski, 1987).  Learning by deduction is something we do at a rather 
subconscious level but to a less formal degree than is used in the machine learning version 
of this strategy.  “The system uses the domain knowledge to determine or explain why a 
given fact is an example of the concept.  This process takes a form of formal proof.” 
(Michalski, 1987).  Learning by analogy is an obvious human learning strategy.  Learning 
from examples is a strategy that is tested on the SAT and GRE where one has to induce a 
general concept description from a set of examples and counter-examples.  Learning by 
observation and discovery is the human process of original research where we create 
classifications, record relationships and laws, and develop theories to explain particular 
phenomenon given only observation.
	Langley (1996) states that there are four main goals that researchers have for their 
pursuit of machine learning.  The first is to model the mechanisms of human learning.  The 
second, an empirical goal, is to “discover general principles that relate the characteristics 
of learning algorithms, and of the domain in which they operate, to learning behavior” 
(Langley, 1996).  Third, a mathematical study of machine learning, aims “to formulate and 
prove theorems about the tractability of entire classes of learning problems and algorithms 
designed to solve those problems” (Langley, 1996).  Finally, an applications approach 
concerns itself only with utilizing machine learning strategies to solve real-world problems.  
Since the goals of researchers differ so widely, it is obvious that not every machine 
learning strategy will closely resemble the methods that humans use to learn, but some of 
the strategies will be developed specifically to model and resemble human strategies.

DESIRABLE PROPERTIES OF LEARNING PARADIGMS
	First, and most simplistically, any particular learning strategy should strive to meet 
the goal with which it is associated.  By this I mean, for example, that if the strategy is 
being developed to model an actual human learning strategy, then it should be as faithful 
to the human method as is possible.  The approach should also be scaleable to problems 
larger than toy-size laboratory experiments because humans live and learn in a huge, 
complex world not a micro-world.
	Machine learning paradigms and the specific instances thereof should also be able 
to generalize, robustly handle noisy input, correctly use prior knowledge, handle complex 
environments, form new concepts, and actively explore the search space in question 
(Russell, 1996).  Similarly, Langley (1996) has a list of properties that a machine learning 
(ML) paradigm should have: it should be able to handle both the noise of corrupted 
supervised feedback and the noise of a corrupted instance description.  ML strategies 
should be able to “represent the experience that leads to learning” in addition to the 
“knowledge that learning produces” (Langley, 1996).  They should also be able to 
interpret conceptual knowledge and have a good organization scheme for the knowledge 
contained in the system (Langley, 1996).
	Briscoe and Caelli (1996) say that ML systems should be able to acquire 
knowledge from external sources in addition to being able to improve “knowledge 
representations and structures so that existing knowledge may be better exploited.”  
Additionally, there must be methods of assessing the validity, effectiveness, and 
abstraction level of the strategy so that it can be compared to other methods.  Validity is 
the “degree of accuracy with which the representation agrees with reality”, while 
effectiveness includes the “performance aspects of the system” and the abstraction level is 
“the explanatory power of the representation, including the scope, detail, and precision of 
concepts used in the description” (Briscoe and Caelli, 1996).
	Finally, I think that the foremost property that a machine learning system needs to 
possess is the ability to pass the Turing test.  “If an observer cannot distinguish the 
responses of a programmed machine from those of a human being” (Gardner, 1985) then 
it doesn’t matter what method was used.  The goal is to have a computer that learns.  So if 
the ML system responds after learning in an indistinguishable manner from the way a 
human responds after learning the same material, then the ML system has indeed learned, 
regardless of what process is employed. 


REFERENCES
Beasley, David, David R. Bull, and Ralph R. Martin. 1993. An overview of genetic 
algorithms: Part 1, Fundamentals. University Computing, 15(2):58-69.

Briscoe, Garry, and Terry Caelli. 1996. A compendium of machine learning. Vol. 1, 
Symbolic machine learning. Norwood, New Jersey: Ablex Publishing Company.

Gardner, Howard. 1985. The Mind’s New Science. BasicBooks.

Hassoun, Mohamad H. 1995. Fundamentals of artificial neural networks. Cambridge, 
Massachusetts: The MIT Press.

Kolen, John F., and Jordan B. Pollack. 1990. Back propagation is sensitive to initial 
conditions. Complex Systems, 4(3):269-280.

Langley, Pat. 1996. Elements of machine learning. San Francisco: Morgan Kaufmann.

Michalski, Ryszard S. 1987. Learning strategies and automated knowledge acquisition: An 
overview. Computational Models of Learning, Leonard Bolc (ed.). Berlin: Springer-
Verlag.

Russell, Stuart. 1996. Machine learning. In Handbook of Perception and Cognition series, 
Artificial Intelligence, 2d ed. Margaret A. Boden (ed.). San Diego: Academic Press.

Wiegerinck, Wim, and Tom Heskes. To appear. How dependencies between successive 
examples affect on-line learning. Neural Computation.

Whitley, Darrell. 1993. A genetic algorithm tutorial. Department of Computer Science, 
Colorado State University. Technical Report CS-93-103.