Two ideas are most characteristic of C among languages of its class: the
relationship between arrays and pointers, and how the declaration syntax mimics
the expression syntax. They are also among its most frequently criticized
features, and often serve as stumbling blocks to beginners. In both cases,
historical accidents or errors have exacerbated their difficulties. The most
important of these was the tolerance of C compilers to type errors. As should
be clear from the history above, C evolved from typeless languages. It
did not suddenly appear to its early users and developers as a completely new
language with its own rules; instead, we continually had to adapt existing
programs as the language developed, and make allowance for an existing body
of code. (Later, the ANSI X3J11 C standardization committee would have the same
problem.)
Compilers in 1977, and even well after, did not complain about usages such as
assigning between integers and pointers, or using objects of the wrong type to
refer to structure members. Although the language definition presented in the
first edition of K&R was reasonably (though not entirely) consistent in its
handling of the type rules, the book admitted that existing compilers did not
enforce them. Besides, some of the rules designed to ease early transitions
have contributed to later confusion. For example, the empty square brackets in
the function declaration
int f(a) int a[]; { ... }
are a living fossil, a remnant of NB's way of declaring a pointer; a is, in this
special case only, interpreted in C as a pointer. The notation survived partly
because of compatibility, partly because of the rationalization that it would
allow programmers to communicate to their readers an intent to pass f a
pointer generated from an array, rather than a reference to a single integer.
Unfortunately, it serves as much to confuse the learner as it does to alert the
reader.
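The pointer interpretation of an array parameter can be seen in a minimal sketch. The function names sum and zero_first are hypothetical, and the sketch uses the later prototype form of definition rather than the old-style form shown above, on the assumption that a current compiler may reject the old style.

```c
/* a is declared with empty brackets, but inside the function it is an int *. */
int sum(int a[], int n)
{
    int total = 0;

    while (n-- > 0)
        total += *a++;   /* incrementing a is legal only because it is a pointer */
    return total;
}

/* Stores through the parameter reach the caller's array; nothing is copied. */
void zero_first(int a[])
{
    a[0] = 0;
}
```

A true array could not be incremented as a is in sum, and zero_first changes the array belonging to its caller.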
In K&R C, supplying arguments of the proper type to a function call was the
responsibility of the programmer, and the existing compilers did not check for
type agreement. The failure of the original language to include argument types
in the type signature of a function was a significant weakness, indeed the one
that required the X3J11 committee's boldest and most painful innovation to
repair. The early design is explained (if not justified) by my avoidance of
technological problems, especially cross-checking between separately compiled
source files, and my incomplete assimilation of the implications of moving from
an untyped to a typed language. The lint program mentioned above tried to
alleviate the problem: among its other functions, lint checks the consistency
and coherency of an entire program by scanning a set of source files, comparing
the types of function arguments used in calls with those in their definitions.
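The remedy eventually adopted in ANSI C was the function prototype, which makes parameter types part of a function's declared type. The following is only a sketch with hypothetical names (half, half_of_three): because the parameter type is visible at the call, the compiler converts the integer argument to double. Under the earlier rules, the int would have been passed unconverted and the result would have been meaningless.

```c
/* Prototype-style definition: the parameter type is part of the function's type. */
double half(double x)
{
    return x / 2.0;
}

double half_of_three(void)
{
    /* The int constant 3 is converted to 3.0 because the prototype is in scope. */
    return half(3);
}
```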
An accident of syntax contributed to the perceived complexity of the language. The
indirection operator, spelled * in C, is syntactically a unary prefix operator,
as in BCPL and B. This works well in simple expressions, but in more complex
cases, parentheses are needed to direct parsing.
For example, to distinguish indirection through the value returned by a function from calling a function designated by a pointer, one writes *fp() and (*pf)() respectively. The style used in expressions carries through to declarations, so the names might be declared
int *fp();
int (*pf)();
In more ornate but still realistic cases, things get worse:
int *(*pfp)();
is a pointer to a function that returns a pointer to an integer. Two effects are
occurring. Most importantly, C has a relatively rich set of ways to describe
types (compared, say, with Pascal). Declarations in languages as expressive as C
(Algol 68, for example) describe objects that are equally difficult to
understand, simply because the objects themselves are complex. A second effect
is due to details of the syntax. Declarations in C must be read in an 'inside-out'
style that many find difficult to grasp [Anderson 80]. Sethi [Sethi 81] observed
that many of the nested declarations and expressions would become simpler if
the indirection operator had been taken as a postfix operator instead of a prefix,
but by then it was too late to change. Despite its difficulties, I believe that
the C approach to declarations remains plausible, and I am comfortable with it;
it is a useful unifying principle.
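The way each declaration mirrors its use can be sketched as follows. The names fp, pf, and pfp come from the text; value and answer are hypothetical helpers, and the parameter lists are written as (void) rather than left empty so that the sketch compiles under current rules.

```c
static int value = 42;

/* fp is a function returning a pointer to int; used as *fp(). */
int *fp(void)
{
    return &value;
}

/* A plain function of the type that pf points to. */
int answer(void)
{
    return 7;
}

/* pf is a pointer to a function returning int; used as (*pf)(). */
int (*pf)(void) = answer;

/* pfp is a pointer to a function returning a pointer to int; used as *(*pfp)(). */
int *(*pfp)(void) = fp;

int use_all(void)
{
    /* Each expression has exactly the shape of the matching declaration. */
    return *fp() + (*pf)() + *(*pfp)();
}
```

In each case, reading the declarator as an expression yields an object of the type written on the left, here int.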
The other characteristic feature of C, its treatment of arrays, is more suspect
on practical grounds, although it also has real virtues. Although the
relationship between pointers and arrays is unusual, it can be learned.
Moreover, the language shows considerable power to describe important concepts,
such as vectors whose length varies at run time, with only a few basic rules
and conventions. In particular, character strings are handled by the same
mechanisms as any other array, plus the convention that a null character
terminates a string. It is interesting to compare C's approach with that of two
nearly contemporaneous languages, Algol 68 and Pascal [Jensen 74].
Arrays in Algol 68 either have fixed bounds or are 'flexible': considerable mechanism is required both in the language definition and in compilers to accommodate flexible arrays (and not all compilers fully implement them). Original Pascal had only fixed-size arrays and strings, and this proved confining [Kernighan 81]. Later, this was partially fixed, although the resulting language is not yet universally available.
C treats strings as arrays of characters conventionally terminated by a marker.
Apart from one special rule about initialization by string literals, the
semantics of strings are fully subsumed by more general rules governing all
arrays and, as a result, the language is simpler to describe and to translate
than one that incorporates the string as a unique data type. Some costs arise
from its approach: certain string operations are more expensive than in other
designs because application code or a library routine must occasionally search
for the end of a string, because few built-in operations are available, and
because the burden of storage management for strings falls more heavily on the
user. Nevertheless, C's approach to strings works well.
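Both the convention and its cost can be illustrated with a short sketch. The names my_strlen and literal_storage are hypothetical; the first stands in for the kind of library routine that must search for the end of a string, and the second shows the special initialization rule for string literals.

```c
#include <stddef.h>

/* Finding the length means scanning for the terminating null character. */
size_t my_strlen(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* A character array initialized from a literal also holds the terminator. */
size_t literal_storage(void)
{
    char greeting[] = "hello";

    return sizeof greeting;   /* five letters plus the null character */
}
```

The scan takes time proportional to the length of the string, which is the expense referred to above.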
On the other hand, C's treatment of arrays in general (not just strings) has
unfortunate implications for both optimization and future extensions. The
prevalence of pointers in C programs, whether declared explicitly or derived
from arrays, means that optimizers must be cautious and must use careful
data-flow techniques to
achieve good results. Sophisticated compilers can understand what most pointers
might change, but some important uses remain difficult to analyze. Functions
with pointer arguments derived from arrays, for example, are difficult to
compile into efficient vector machine code, because it is rarely possible to
determine that one argument pointer does not overlap data that is also referred
to by another argument or accessible externally. More fundamentally, the C
definition so specifically describes the semantics of arrays that changes or
extensions that treat arrays as more primitive objects, and allow operations on
them as a whole, are difficult to fit into the existing language. Even
extensions to allow the declaration and use of multidimensional arrays whose
size is dynamically determined are not entirely straightforward [MacDonald 89]
[Ritchie 90], although they would make it much easier to write numerical
libraries in C. Thus, C covers the most important uses of strings and arrays
arising in practice by a uniform and simple mechanism, but leaves problems
for highly efficient implementations and for extensions.
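The overlap problem can be made concrete with a small sketch; the function name add_into is hypothetical. Nothing in its declaration tells the compiler whether dst and src refer to the same storage, so the loop cannot safely be executed as independent vector operations.

```c
/* Adds src[i] into dst[i] for each i; dst and src may overlap. */
void add_into(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += src[i];
}
```

If a is an array of five ones, the call add_into(a + 1, a, 4) leaves it holding 1, 2, 3, 4, 5, because each store feeds the next load. With two separate arrays of ones, every element of the destination simply becomes 2. A compiler that processed several elements at once would get the first case wrong.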
There are, of course, many minor infelicities in the language and its
description besides those discussed above. There are also general criticisms to
be made, which go beyond detailed points. Chief among these is that the
language and its
generally-expected environment are of little help in the writing of very large
systems. The naming structure only provides two main levels, 'external'
(visible everywhere) and 'internal' (within a single procedure). An
intermediate level of visibility (within a single file of data and procedures)
is weakly tied to the language definition. There is therefore little direct
support for modularization, and project designers are forced to create their
own conventions.
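As a rough sketch of the intermediate, file-level visibility, the static keyword applied at file scope is the usual way to keep a name private to one source file. The names next_id, bump, and allocate_id are hypothetical.

```c
/* File-level names: usable anywhere in this source file, hidden from others. */
static int next_id = 1;

static int bump(void)
{
    return next_id++;
}

/* External name: visible to every source file in the program. */
int allocate_id(void)
{
    return bump();
}
```

Anything finer-grained than this, such as grouping related files into a module, is left to project convention.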
Similarly, C itself provides two durations of storage: 'automatic' objects that exist while control resides in or below a procedure, and 'static' objects that exist throughout the execution of a program. Off-stack, dynamically allocated storage is provided only by a library routine and the burden of managing it is placed on the programmer: C is hostile to automatic garbage collection.
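The two storage durations and the library-provided third kind can be sketched side by side. The function names count_calls and make_counter are hypothetical; malloc and free are the standard library routines.

```c
#include <stdlib.h>

int count_calls(void)
{
    static int calls = 0;   /* static: one object for the whole execution */
    int step = 1;           /* automatic: created afresh on every call */

    calls += step;
    return calls;
}

/* Dynamically allocated storage comes from the library, not the language. */
int *make_counter(int start)
{
    int *p = malloc(sizeof *p);

    if (p != NULL)
        *p = start;
    return p;               /* the caller is responsible for calling free */
}
```

The object returned by make_counter outlives the call that created it, and it is reclaimed only when the program passes the pointer to free.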