The Problems With B Language
The machines on which we first used BCPL and then B were word-addressed, and the single data type of these languages, the 'cell,' was comfortably matched to the hardware machine word. The advent of the PDP-11 exposed several inadequacies of the B semantic model. First, its character-handling mechanisms, inherited with few changes from BCPL, were clumsy: using library procedures to spread packed strings to individual cells and then repack them, or to access and replace individual characters, began to feel awkward, even stupid, on a byte-oriented machine.
Second, although the original PDP-11 did not provide for floating-point arithmetic, the manufacturer promised that it would soon be available. Floating-point operations were added to BCPL in our Multics and GCOS compilers by defining special operators, but this mechanism was possible only because, on the machines concerned, a single word was large enough to contain a floating-point number; this was not true for the 16-bit PDP-11.
Finally, the B and BCPL models implied overhead when dealing with pointers: language rules, by defining a pointer as an index in an array of words, forced pointers to be represented as word indices. Each pointer reference generated a conversion of the run-time scale from the pointer to the byte address expected by the hardware.
For all these reasons, it seemed that a typing scheme was needed to deal with characters and byte addresses, and to prepare for the upcoming floating-point hardware. Other issues, particularly the type of security and interface checks, did not seem as important as they did later.
Apart from problems with the language itself, the threaded-code technique of the B compiler produced programs so much slower than their assembly-language counterparts that we discounted the possibility of re-coding the operating system or its central utilities in B.
In 1971, I started to extend the B language by adding a character type and also rewrote its compiler to generate PDP-11 machine instructions instead of threaded code. Thus, the transition from B to C was contemporaneous with the creation of a compiler capable of producing programs quickly and small enough to compete with assembly language. I called the slightly-extended NB language 'new B.'
Embryonic C
NB existed so briefly that there was no full description of it. It provided int and char types, arrays of them, and points to them, declared in a style typified by
int i, j;
char c, d;
int
iarray[10];
int
ipointer[];
char
carray[10];
char cpointer[];
The semantics of the arrays remained exactly as in B and BCPL: iarray and carray statements create dynamically initialized cells with a value pointing to the first of a sequence of 10 integers and characters, respectively. Statements for ipointer and cpointer omit the size to state that no storage should be allocated automatically. Within procedures, the language interpretation of the pointers was the same as that of the array variables: the pointer declaration created a cell that differed from the array declaration only in that the programmer was expected to assign the reference, instead of allowing the compiler to allocate the space and initialize the cell.
The values stored in the cells bound to the array and the names of the pointers were machine addresses, measured in bytes, of the corresponding storage area. Therefore, indirection through a pointer did not imply a run-time overhead to scale the pointer from word to byte offset. On the other hand, the machine code for array subscription and arithmetic pointer now depended on the type of array or pointer: to compute iarray[i] or ipointer+i Implicit scaling of the addend I by the size of the object in question.
These semantics were an easy transition from B, and I've been experimenting with them for a few months. Problems became apparent when I tried to extend the type notation, especially to add structured (record) types. Structures, it seemed, were supposed to map intuitively to the memory in the machine, but in a structure containing an array, there was no good place to stash the pointer containing the base of the array, or any convenient way to get it initialized. For example, the directory entries of early Unix systems could be described as C.
struct {
int inumber;
char name[14];
};
I wanted the structure not only to characterize an abstract object, but also to describe a collection of bits that could be read from a directory. Where could the compiler hide the name pointer requested by the semantics? Even if structures were thought more abstractly, and the pointer space could somehow be hidden, how could I deal with the technical problem of properly initializing these pointers when assigning a complicated object, perhaps one that specified structures containing arrays containing structures to arbitrary depth?
The solution was a crucial jump in the evolutionary chain between typeless BCPL and typed C. It eliminated the materialization of the pointer in storage, and instead caused the pointer to be created when the name of the array is mentioned in the expression. The rule that survives in today's C is that the values of the array type are converted to the first of the objects that make up the array when they appear in expressions.
This invention allowed most of the existing B code to continue to work, despite the underlying shift in language semantics. The few programs that assigned new values to an array name to adjust its origin — possible in B and BCPL, irrelevant in C — were easily repaired. More importantly , the new language retained a coherent and workable (if unusual) explanation of the array semantics while opening the way to a more comprehensive type structure.
The second innovation that most clearly distinguishes C from its predecessors is the fuller type structure and, in particular, its expression in the syntax of declarations. NB offered basic types int and char, along with arrays of them, and pointed to them, but no further compositional methods. Generalization was required: given an object of any kind, it should be possible to describe a new object that gathers several objects into an array, generates them from a function, or is a pointer to it.
There was already a way for each object of such a composite type to mention the underlying object: index the array, call the function, use the indirect operator on the pointer. Analogical reasoning led to a declaration syntax for names that mirrored the syntax of the expression in which names typically appear. So,
int i, *pi,
**ppi;
Declare an integer, a pointer to an integer, a pointer to an integer. The syntax of these statements reflects the observation that I * pi, and * * ppi all give an int type when used in an expression. In the same way,
int f(),
*f(), (*f)();
Declare a function that returns an integer, a function that returns a pointer to an integer, a function that returns an integer;
int
*api[10], (*pai)[10];
Declare an array of pointers to the integers, and a pointer to the integers array. In all these cases, the declaration of a variable is similar to its use in an expression whose type is the one named at the head of the declaration.
The type composition scheme adopted by C owes a considerable debt to Algol 68, although it may not have emerged in the form that Algol's adherents would approve. The central notion that I captured from Algol was a type structure based on atomic types (including structures), composed of arrays, pointers (references), and functions (procedures). The concept of unions and casts of Algol 68 also had an influence that appeared later.
After creating the type system, the syntax associated with it, and the compiler for the new language, I felt it deserved a new name; NB seemed insufficiently distinctive. I decided to follow the single-letter style and called it C, leaving open the question of whether the name represented a progression through the alphabet or the letters in the BCPL.
nice contents
ReplyDelete