PRELIMINARY EXAM

QUESTION ONE






By

Lynellen D. S. Perry







A Preliminary Exam Response
Submitted to the Faculty of
Mississippi State University
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in Computer Science
in the Department of Computer Science






Mississippi State, Mississippi

February 199

QUESTION ONE
“Over the last decade or so, much research has focused on the issue of understandability 
of software and of maintenance difficulty.  A goal has been to discover underlying 
principles that impact software understandability and to build tools that can function as 
an `assistant’ to help software developers build software products with better 
understandability, and therefore better maintainability.  Much of the research has 
addressed the structural design of the product and an analysis of the syntax and symbols 
within each component.  For example, one measure of understandability proposed by two 
or three researchers was entropy (where any new symbol constituted a `surprise’).  
	First, discuss the issue of understandability of software products.  This should 
include measurement issues.
	Second, consider your intended area of research in natural language processing 
and identify potential areas of applicability to software understandability.”

WHAT IS UNDERSTANDABILITY?
	We need to start with a consensus of what we mean by the term 
‘understandability’.  It is one of the ‘ilities’ of software quality (Cioch, 1991), along with 
such concepts as maintainability, reusability, reliability, and readability.  Roger Pressman 
(1992) references Bertrand Meyer’s modularity criteria related to object-oriented design 
when defining understandability as “the ease with which a program component can be understood 
without reference to other information or modules”.  This is, of course, something of a 
circular definition.  What does it mean to ‘understand’ a program component?  The only 
detail we have gained from this definition is that a program component that is 
understandable should not refer to external information or to external modules.
According to Frank Cioch (1991), understandability consists of two components: 
comprehension, and lack of misinterpretation.  “When one wishes to ascertain the 
understandability of a particular software-related product, one is often concerned not only 
with the degree to which, or the ease with which, the information is grasped mentally, but 
also with the degree to which it is misinterpreted by the person examining the product” 
(Cioch, 1991).  So comprehension is one aspect of understandability and entails the person 
being able to mentally grasp the information and to interpret it correctly.  Lack of 
comprehension means the person is unable to mentally grasp the information.  
Misinterpretation means the person is able to mentally grasp the information but interprets 
it incorrectly due to any of several factors.  The person is erroneously confident that they 
have comprehended the information.  Misinterpretation can occur when the information is 
ambiguous (has multiple possible interpretations) or when the presenter and the recipient 
have divergent perspectives on the information (Cioch, 1991).
	Kari Laitinen (1996) argues that software development is, in summary, a 
documentation process, a communication process, and a learning process.  
Understandability then relates to the ability of the software-related items to communicate 
and document the information learned through the design and implementation of the 
software system. 
Documents such as requirements definitions are developed first, and they are 
followed by various design documents which give a more detailed description of 
the system being developed.  The final documents of software development are 
source programs which describe every possible detail of the system, and from 
which the executable system can be generated using compilers and other tools 
(Laitinen, 1996).

One focus of Laitinen’s work is to show how good naming conventions go a long way 
towards improving understandability.  Maintenance personnel will often need to 
understand the requirements definitions and design documents in order to understand the 
source code, so Laitinen rightly includes all software documents in the realm of a need for 
understandability.
	Much literature in the field of software engineering simply uses the term 
understandability without formally defining it.   Instead, the authors tend to present factors 
that affect the level of understandability or factors that in some sense measure the level of 
understandability.  Varney (1976), for example, records some relatively old (but still 
relevant) aspects of the meaning of software understandability while discussing operating 
system design.  Understandability “substantially improves the ease of implementation, 
debugging, maintenance, and development” of the software system (Varney, 1976).  He 
argues that reducing and/or controlling complexity leads to understandability, which in 
turn leads to reliability.  Indeed, Laitinen (1996) boldly states, “Complexity is inversely 
proportional to understandability.”
	While speaking most directly to software specification methods, Williams (1994) 
details three indicators of understandability without defining the term itself.  These can be 
equally applied to other aspects of software development such as requirements 
documentation and source code understandability.  The three indicators are notation, 
organization, and level of abstraction.  Notation should be straightforward, so 
one might need to use several notations in order to convey different types of 
information most naturally.  Information should be well-organized so that it is 
“relatively easy to find individual pieces of information” (Williams, 1994).  Abstraction 
will suppress “irrelevant or distracting details, which lets you focus on the essential parts 
of the problem” (Williams, 1994).
	Using abstraction (or information-hiding) to combat complexity and therefore 
increase understandability is also the focus of work by Rising and Calliss (1994).  They 
report that “maintainers spend at least half their time trying to understand the system and 
user requests. . . our most formidable weapon against complexity is abstraction” (Rising 
and Calliss, 1994).  Likewise, Di Felice (1993) says that the understandability of software 
can be increased by organizing software into units and making sure that “units are self-
contained software components featuring a high degree of information hiding”.  In 
addition, Di Felice suggests all of the following.  Standardized algorithms allow 
“implementation [that] is, to a large extent, independent from the programmer’s skill and 
this facilitates the code comprehension” (Di Felice, 1993).  Documentation which explains 
the software unit’s purpose and external data types, along with hardware, compiler, and 
operating system independence also improves understandability.  “Generality enlarges the 
chances that those who do not know how a software unit was developed can understand 
it” (Di Felice, 1993).  Communication between units should be restricted to the use of 
procedure parameters, disallowing direct data access.
	Finally, Bennett (1994) has a good summary of the issues above when he says 
there are four issues that affect understandability: “1) Cohesion. Can the component be 
understood without reference to other components? 2) Naming. Are the names used 
meaningful? 3) Documentation. Does this make clear the mapping between real-world 
entities and the component? 4) Complexity. How complex are the algorithms used to 
implement the component?”

HOW CAN UNDERSTANDABILITY BE MEASURED?
	In the above section, many indicators of understandability were mentioned. 
Understandability is not a binary concept, but a spectrum from incomprehensible to 
understandable. A measure, or metric, is “the assignment of a value to an entity to 
characterize an attribute of that entity.  Therefore, a measure is not a number but a 
mapping between the entity and attribute” (Rising and Calliss, 1994).  Rising and Calliss 
(1994) say that “most software metrics can be classified as micro- or macro-level” where 
micro-level metrics focus on the internal mechanisms of system components, 
whereas macro-level metrics focus on the interconnections between system 
components.  Micro-level metrics are also called code metrics because they are 
based on implementation details in code.  Macro-level metrics are also called 
structure metrics because they are based on an analysis of the architectural design 
(Rising and Calliss, 1994).

Several code-level or micro-level metrics have been proposed, such as Halstead’s 
complexity metric, and McCabe’s complexity metric (Pressman, 1992).  Macro-level 
metrics include Mohanty’s entropy metric, Henry and Kafura’s information flow, and 
Harrison and Cook’s micro-/macro- measure of complexity (Rising and Calliss, 1994).  
	Many of the authors cited above who proposed indicators of understandability 
imply a relative metric in their articles.  They don’t supply a hard scale of measure for any 
particular document, but they do imply that in evaluating and comparing several software-
related items one can judge the relative level of understandability between the individual 
items.  One source code file may contain more thorough comments than another, or may 
have a superior level of abstraction, and thus be at a higher understandability level than the 
other.
	Laitinen (1996) details a relative method which is “particularly suitable for 
estimating the understandability of source programs”.  First, Laitinen reviews earlier 
research by the same author which shows that “names for variable, constants, tables, functions, etc.” affect 
not only the visual appearance of programs, but their understandability too.  Mnemonic 
names where several “natural words of a natural language, while respecting the 
grammatical rules of the natural language” describe the functionality of the entity are best.  
Abbreviations which must have comments in order to be meaningful are to be avoided 
because the use of “natural naming makes comments superfluous” (Laitinen, 1996).
Laitinen (1996) also describes a numeric metric of understandability, “based on 
identifying distinct symbols in software documents”, which is summarized below.  One key 
conclusion of this metric is that it is “important to use the same symbols both in 
specifications and in implementation documents”.  Avoiding unnecessary symbols (where 
a symbol is an identifier name or other similar token) leads to a higher level of 
understandability.  There are two rules to measure understandability of the documents 
(documents being a generic term for any software-related file).  Rule one says, “smaller 
languages are usually easier to understand than larger languages.  This means that the 
number of symbols of a language affects its understandability.”  Rule two says, “it is easier 
to understand closely related languages than more distantly related languages.”  A 
language consists of a set of symbols, each of which has a meaning.  If one masters the 
meanings of the symbols, then any document in that language can be understood.  
This metric is relative in that it relies on the comparison between two languages.  
So a comparison can be made between two documents to compare their relative 
understandability, or one can compare each document to the native language of the user 
(or some other baseline language).  If both files are in the same programming language 
(like Pascal, C, etc.) then the understandability can be measured by comparing only the 
languages comprised of textual symbols and ignoring the technical symbols that comprise 
the programming language itself.  
	The metric essentially consists of listing all the symbols of the two languages to 
be compared and counting the number of symbols in each.  The two rules identified above 
are then applied.  If the comparison is between two source files, etc., then the metric will 
indicate which is more understandable.  If the comparison is between a source file and a 
known baseline language (such as English), then the metric will determine how 
understandable the language in the file is, relative to the baseline language (which 
obviously assumes that the baseline language represents complete understandability).
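As a rough illustration of the symbol-counting idea, the Python sketch below extracts the textual symbols (identifiers) from two source fragments, ignores the technical symbols of the programming language, and reports the quantities that the two rules refer to.  The tokenizing pattern, the small keyword set, the helper names `symbols` and `compare`, and the use of set overlap as a stand-in for how closely related two languages are, are all assumptions made for this sketch; they are not taken from Laitinen (1996).

```python
import re

# Assumed stand-in for the "technical symbols" of the programming language.
KEYWORDS = {"if", "else", "while", "for", "return", "int", "float", "void"}


def symbols(text):
    """Return the set of distinct textual symbols (identifiers) in text."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", text)
    return {t for t in tokens if t not in KEYWORDS}


def compare(doc_a, doc_b):
    """Report the quantities used by the two rules for a pair of documents."""
    a, b = symbols(doc_a), symbols(doc_b)
    shared, union = a & b, a | b
    return {
        "size_a": len(a),       # rule one: a smaller language is easier
        "size_b": len(b),
        "shared": len(shared),  # rule two: more shared symbols, more closely related
        "overlap": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical specification line and two candidate implementations.
spec = "total_price = unit_price * quantity"
impl_natural = "total_price = unit_price * quantity; return total_price"
impl_cryptic = "tp = up * q; return tp"

print(compare(spec, impl_natural))
print(compare(spec, impl_cryptic))
```

Under these assumptions the implementation that reuses the symbols of the specification shares all of them, while the abbreviated one shares none, which matches the conclusion that the same symbols should be used in specifications and in implementation documents.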
	Cioch (1991) proposes a more numeric metric by which to measure 
understandability.  He gives a numeric value to both the comprehension and the 
misinterpretation levels of the particular software-related item.  “Comprehension of a 
software-related product is typically measured in an experiment as the amount of 
information about the product that is grasped mentally by the person studying it, the 
amount of time it takes to grasp the information, or both” (Cioch, 1991).  
For this metric, a short-answer test instrument is developed for the software-
related item which can “distinguish between not knowing the correct answer and giving a 
wrong answer.”  Test questions consist of statements about the software-related item 
which the subject must judge to be true or false.  The test questions are scored twice, once 
for comprehension and once for misinterpretation.  For example, if a given statement is in 
fact true, then the question would be scored as follows:
Response                                      Comprehension   Misinterpretation
“I am certain this statement is false”              0                 2
“I am fairly sure this statement is false”          0                 1
“I don’t know”                                      0                 0
“I am fairly sure this statement is true”           1                 0
“I am certain this statement is true”               2                 0

Thus the degree of comprehension and of misinterpretation can be measured (Cioch, 
1991).  
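The double scoring can be sketched in a few lines of Python.  The response codes and the function names `score` and `score_test` are hypothetical, and the handling of statements that are in fact false (simply mirroring the scoring shown for true statements) is my assumption rather than something quoted from Cioch (1991).

```python
# Hypothetical codes for the five answer choices.
CERTAIN_FALSE, FAIRLY_FALSE, DONT_KNOW, FAIRLY_TRUE, CERTAIN_TRUE = range(5)

# Confidence points: 2 for "certain", 1 for "fairly sure", 0 for "don't know".
POINTS = {CERTAIN_FALSE: 2, FAIRLY_FALSE: 1, DONT_KNOW: 0,
          FAIRLY_TRUE: 1, CERTAIN_TRUE: 2}
SAYS_TRUE = {FAIRLY_TRUE, CERTAIN_TRUE}


def score(response, statement_is_true):
    """Return (comprehension, misinterpretation) for one test question."""
    if response == DONT_KNOW:
        return (0, 0)
    points = POINTS[response]
    agrees = (response in SAYS_TRUE) == statement_is_true
    # Assumption: a false statement is scored as the mirror image of a true one.
    return (points, 0) if agrees else (0, points)


def score_test(answers):
    """Sum both scores over a list of (response, statement_is_true) pairs."""
    comprehension = misinterpretation = 0
    for response, statement_is_true in answers:
        c, m = score(response, statement_is_true)
        comprehension += c
        misinterpretation += m
    return comprehension, misinterpretation


# One subject's answers to four statements.
answers = [(CERTAIN_TRUE, True), (FAIRLY_FALSE, True),
           (DONT_KNOW, False), (CERTAIN_FALSE, False)]
print(score_test(answers))
```

Keeping the two totals separate is the point of the instrument: a subject who answers “I don’t know” lowers comprehension without raising misinterpretation.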
Clearly, the drawback in this metric is that someone must be an expert on each 
product to be evaluated in order to write the questions for the test.  Also, this metric can 
only measure whether an individual person understands the particular document, not 
whether the document is understandable in general.  Presumably, though, if an `average’ 
programmer (or user, or whoever) can comprehend it with minimal misinterpretation, 
then most other `average’ people should be able to replicate the result.
Since abstraction and information-hiding have been cited as important ways to 
improve understandability, Rising and Calliss (1994) propose two information-hiding 
metrics, one at the module level, and one at the system level.  Information-hiding is when 
“each module hides a single design decision from other modules” (Rising and Calliss, 
1994).  The module level metric is the sum of two ordinal values.  The first value is either 
a zero (if the module hides more than one design decision) or a one (if only one design 
decision is in the module).  The second value is the count of extraneous entities in the 
module, where an extraneous entity is one that is “not required to implement a single 
[design] decision” (Rising and Calliss, 1994) and includes all visible type declarations and 
all visible global variables.
The system level metric considers both the information-hiding capability of each 
module in the system and the “use of the module by the rest of the system” (Rising and 
Calliss, 1994).  The metric is then the median “over all the modules in the system of the 
sum of the information-hiding measure for each module and the sum over the extraneous 
entities in the module of the number of client modules referencing the entity” (Rising and 
Calliss, 1994).
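The two information-hiding metrics can be sketched as follows.  The `Module` record, its field names, and the example modules are hypothetical, and the system-level function reflects my reading of the quoted definition (for each module, add its module-level measure to the total number of client references to its extraneous entities, then take the median); it is not code from Rising and Calliss (1994).

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Module:
    name: str
    design_decisions: int  # number of design decisions the module hides
    # extraneous entity name -> number of client modules referencing it
    extraneous: dict = field(default_factory=dict)


def module_metric(m):
    """First value (1 if exactly one design decision, else 0) plus the
    count of extraneous entities, as described in the text."""
    first = 1 if m.design_decisions == 1 else 0
    return first + len(m.extraneous)


def system_metric(modules):
    """Median over modules of the module metric plus the client references
    to the module's extraneous entities (requires at least one module)."""
    values = [module_metric(m) + sum(m.extraneous.values()) for m in modules]
    return median(values)


# Hypothetical three-module system.
system = [
    Module("stack", 1),
    Module("util", 3, {"g_buf": 2, "Rec": 1}),
    Module("io", 1, {"g_mode": 4}),
]
print([module_metric(m) for m in system])
print(system_metric(system))
```

A median rather than a mean means that one heavily shared global variable in a single module does not dominate the system-level value.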

WHAT TOOLS CAN IMPROVE UNDERSTANDABILITY?
	There are a number of tools and methods that can be used to increase the 
understandability of software-related documents all along the range from requirements 
specification documents to source code and user’s manuals.
	Di Felice (1993) feels that code reusability requires a high degree of code 
understandability.  The same areas that aid in reusability thus aid understandability.  These 
areas are: object-oriented programming languages which promote software organization 
and composition, design methodologies, “software-engineering-based programming 
environments”, and semi-automatic methodologies and tools.
	Williams (1994) shows how reviews, Computer Aided Software Engineering 
(CASE) tools, and formal methods such as theorem proving “can reduce the ambiguity in 
specifications and provide a basis for verification later on”.  When ambiguity is reduced it 
is intuitive that the document will be more understandable.
	If understandability is a part of software quality, then the usual techniques that 
improve the other “ilities” should also help here.  Following a software design 
methodology carefully and ensuring that each step creates quality products should carry 
quality (and thus understandability) through to the final products.  Writing a clear, 
unambiguous, understandable requirements specification document should help the 
process of creating a clear design document.  A good design with modularity, 
requirements matching and traceability, and notation that lends itself to readability and 
unambiguity will aid in creating source code and documentation that is understandable.  
Style guides, reviews, CASE tools, and formal methods can improve the chances that 
understandability (and other quality features) is built into the software product at each 
step of its development.

HOW DOES NATURAL LANGUAGE PROCESSING APPLY TO SOFTWARE 
UNDERSTANDABILITY?
	My particular interest in using fractal, chaotic or dynamic tools to find and exploit 
dynamic patterns in natural language will hopefully lead to more robust part-of-speech 
(POS) tagging.  POS tagging feeds into sentence parsing, and thus to information 
extraction and many other more flashy results.  Given improved capabilities to find the ideas 
and meanings of text, I can imagine the following applications of natural language 
processing (NLP) to the issues of software understandability.
	First, an NLP system could attempt to compare end user and source code 
documentation to the requirements specification.  Such an automatic approach could help 
to link end user documentation to particular modules of code, and link code to particular 
requirements for a legacy system in which good software engineering methods were not 
followed during the time of development.  This reverse-engineering linkage could help 
maintainers understand where and how to modify code to meet new expectations.  This 
application could also evolve into a soft measure of the understandability and the quality 
of newly developed software that did follow good software engineering practices: the 
automatically generated linkage structure could be compared to a linkage structure that was 
generated by the developers or by a CASE tool along the path of development.  The 
degree to which the two structures agree would measure how well the engineering process 
had been followed.
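To make the linkage idea concrete, the sketch below links each document to the requirement whose wording it overlaps most, and then measures how well the generated links agree with a set of links recorded by the developers.  Plain word overlap (Jaccard similarity) is a deliberately crude stand-in for the NLP system imagined above, and the requirement texts, file names, and function names are all hypothetical.

```python
import re


def words(text):
    """Lower-cased set of alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def link(items, requirements):
    """Return a set of (item id, requirement id) pairs, linking each item
    to the requirement with the highest word overlap."""
    links = set()
    for item_id, item_text in items.items():
        best = max(requirements,
                   key=lambda r: jaccard(words(item_text), words(requirements[r])))
        links.add((item_id, best))
    return links


def agreement(links_a, links_b):
    """Degree to which two linkage structures agree (1.0 means identical)."""
    return jaccard(links_a, links_b)


requirements = {
    "R1": "The system shall print a monthly sales report",
    "R2": "The user shall reset a forgotten password",
}
items = {
    "report.c": "format and print the monthly sales report",
    "manual_3": "how to reset your password",
}
generated = link(items, requirements)
recorded = {("report.c", "R1"), ("manual_3", "R1")}  # hypothetical developer links
print(sorted(generated))
print(agreement(generated, recorded))
```

A real system would compare meanings rather than surface words, but the agreement score would be computed over the two linkage structures in much the same way.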
	A simpler application of NLP to understandability would be to measure the 
complexity (or ambiguity) of software documents written in natural language (such as 
requirements specification documents) by determining how many sentences had 
unambiguous parsing structures.  If the grammar of the document is complex (as seen by a 
large number of sentences having ambiguous or faulty parses) then it will be difficult to 
understand and thus difficult to successfully translate into a system that is correct.  A 
straight comparison of the count of badly parsed sentences could be a measure of the 
relative and absolute understandability of the document.  Not only could one say that one 
document is more understandable than another (it has fewer badly parsed sentences), 
but over time one could develop a feel for the range of values that indicates various levels 
of understandability.
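A minimal sketch of this count is given below.  It uses a toy grammar in Chomsky normal form and a chart (CYK-style) parser that counts the number of distinct parses of each sentence; a sentence is treated as badly parsed when it has no parse (faulty) or more than one (ambiguous).  The grammar, lexicon, and example sentences are assumptions chosen only to keep the example small, and a realistic system would need a broad-coverage grammar and a POS tagger.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "NP", "PP"),
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
# Toy lexicon: word -> possible categories.
LEXICON = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}


def count_parses(sentence, start="S"):
    """Count the distinct parse trees of a sentence under the toy grammar."""
    tokens = sentence.lower().split()
    n = len(tokens)
    if n == 0:
        return 0
    # chart[i][k][category] = number of ways tokens[i:k] can form category
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, token in enumerate(tokens):
        for category in LEXICON.get(token, []):
            chart[i][i + 1][category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for parent, left, right in RULES:
                    ways = chart[i][j].get(left, 0) * chart[j][k].get(right, 0)
                    if ways:
                        chart[i][k][parent] += ways
    return chart[0][n].get(start, 0)


def badly_parsed(sentences):
    """Number of sentences that are faulty (0 parses) or ambiguous (>1)."""
    return sum(1 for s in sentences if count_parses(s) != 1)


document = [
    "I saw the man",                     # one parse
    "I saw the man with the telescope",  # attachment ambiguity
    "saw the man I",                     # no parse
]
print([count_parses(s) for s in document])
print(badly_parsed(document))
```

Dividing the count by the number of sentences would give a value that can be compared across documents of different lengths.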
Natural language generation might be used to automatically generate comments for 
code, or user documentation from code comments, or user documentation from 
requirements specifications and other software engineering documents.  Perhaps 
structured English specifications could be automatically generated from free-text 
requirements specifications documents, along with the traceability documentation that 
shows how the two are linked together.  NL generation could also be used to dynamically 
generate explanations (or answers) in response to questions from developers and 
maintainers of how some software item is related to another or to the previous steps of the 
software lifecycle.
	Parsed sentences from design documents could also be translated to a meaning 
representation language (MRL) so that logical reasoning could be done about that 
information.  As such assertions are processed, a consistency check could be run on each 
new item that is being translated.  This check might be able to find places where one 
requirement conflicts with another, or where one design decision impacts on others 
already processed.  Once a requirements document has been so translated, perhaps another 
AI system could attempt some algorithm planning and then generate natural language that 
explains the plan and the reasoning used to reach that plan.
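The incremental consistency check can be illustrated with a very small stand-in for an MRL.  Here each assertion is an (entity, attribute, value) triple and two assertions conflict when they give different values for the same entity and attribute.  This representation, the function names, and the example assertions are assumptions for the sketch; a real MRL would use logical forms and would need genuine inference to detect less direct conflicts.

```python
def add_assertion(kb, assertion):
    """Add an (entity, attribute, value) triple to the knowledge base kb.
    Return the earlier conflicting assertion, or None if consistent."""
    entity, attribute, value = assertion
    key = (entity, attribute)
    if key in kb and kb[key] != value:
        return (entity, attribute, kb[key])
    kb[key] = value
    return None


def check(assertions):
    """Process assertions in order, collecting (earlier, new) conflicts."""
    kb, conflicts = {}, []
    for assertion in assertions:
        earlier = add_assertion(kb, assertion)
        if earlier is not None:
            conflicts.append((earlier, assertion))
    return conflicts


# Hypothetical assertions translated from a requirements document.
translated = [
    ("report", "frequency", "monthly"),
    ("password", "min_length", 8),
    ("report", "frequency", "weekly"),
]
for earlier, new in check(translated):
    print("conflict:", earlier, "vs", new)
```

Because the check runs as each item is translated, a conflict is reported against the earlier assertion it contradicts, which points the reader back to the place in the document where the first statement was made.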
	In most of the above ideas, NLP serves as a pre-processing step that must be 
completed before the other Artificial Intelligence applications perform their work, at which 
point NLP may step back in to generate natural language versions or explanations of the 
results.  Many of these applications were inspired by the general types of capabilities and 
applications of NLP reported in Gazdar and Mellish (1989).


REFERENCES
Bennett, J. P. 1994. Understandability. Software Design I. 
Http://www2.bath.ac.uk/~masjpb/teaching/c11/lectures/subsection2_2_3_3.html, 5 
Feb. 1997.

Cioch, Frank A. 1991. Measuring software misinterpretation. Journal of Systems and 
Software, 14(2):85-95.

Di Felice, Paolino. 1993. Reusability of mathematical software: A contribution. IEEE 
Transactions on Software Engineering, 19(8):835-843.

Gazdar, Gerald, and Chris Mellish. 1989. Natural language processing in LISP: An 
introduction to computational linguistics. Wokingham, England: Addison-Wesley.

Laitinen, Kari. 1996. Estimating understandability of software documents. Software 
Engineering Notes, 21(4):81-93.

Pressman, Roger S. 1992. Software engineering: A practitioner’s approach. 3d ed. New 
York: McGraw-Hill.

Rising, Linda S. and Frank W. Calliss. 1994. An information-hiding metric. Journal of 
Systems and Software, 26(3):211-220.

Varney, R. C. 1976. Toward the understandability of an operating system. The Computer 
Journal, 19(3):213-215.

Williams, Lloyd G. 1994. Assessment of safety-critical specifications. IEEE Software, 
11(1):51-60.