A Unit of Analogy

Porting Lemon to Zig: A YACC Shave

Lemon.c: A Critical Review

The source text for Lemon is actually lost to the mists of time. What remains is an amalgamation: a single C file aggregated from several .c and .h files. This is, to my knowledge, the oldest of Richard Hipp’s works available. It has of course been maintained over the years, and updated somewhat, but the bulk of it was completed, quote:

by D. Richard Hipp (also the creator of SQLite) while he was in graduate school1 at Duke University between 1987 and 1992. The original creation date of Lemon has been lost, but was probably sometime around 1990.

What follows is a stylistic discussion of the text of lemon.c. Some will find this keenly interesting, the rest are encouraged to skip ahead.

The next last paragraph of the above source also deserves quoting in full:

The author of Lemon and SQLite (Hipp) reports that his C programming skills were greatly enhanced by studying John Ousterhout’s original source code to Tcl. Hipp discovered and studied Tcl in 1993. Lemon was written before then, and SQLite afterwards. There is a clear difference in the coding styles of these two products, with SQLite seeming to be cleaner, more readable, and easier to maintain.

I am familiar with neither the codebases of Tcl nor SQLite, but am by now quite intimate with both Lemon and Pikchr. Pikchr being the most recent of drh’s works, and Lemon the earliest we have. I can say that both reward close study, and that having done so has taught me a great deal about how to construct low-level systems code, whether in C or in Zig.

It fascinates me as well to see the early work of such an acclaimed and accomplished master of the art, and to have the opportunity to compare it with the latest opus in his canon. The task at hand, however, is to introduce lemon.c. So let us begin.

Style

Lemon is written in ANSI C, the first major revision of the language and the first standard one. This standard was published in 1989, making it brand new when the program was begun.

The code reflects this. There are a grand total of six typedefs, for example, and only two of these hand-written (more about this later). All function declarations are fully prototyped, however, and parameters are typed within the parentheses, not after. All locals, as they must be, are declared at the top of the function2, all comments are in the /* Classic style */, and so on. Perhaps the oddest-seeming idiom, to someone used to more modern dialects of C, is the casting of void pointers returned by malloc and friends.

ANSI C remains the most compatible version of the language, and there are projects which deliberately stay within it for that compatibility. Lua, for an example, is also written in ANSI C. While a code generator such as Lemon might not actually have active users who need a dialect that old, there is little to be gained from ‘modernizing’ a codebase like this one.

In its brace style, it follows K&R, with function braces on a separate line, but all others in the Egyption form. In other respects it differs considerably, and in ways which are remarkably consistent with the style used in Pikchr.

Indentation uses two spaces. Type declarations and function definitions are invariably and clearly commented, and every field of type declarations has a side comment, cleanly lined up to the right of the field. Function bodies are commented as well, extensively so, with nearly every loop or condition annotated as to purpose.

The code itself is dense, visually and otherwise. The use of two spaces does much to create this effect, as does the stylistic choice, which is rare, to connect control structures to parentheses and braces, as if they were function calls.

That is, a for loop appears like this:

for(j=0; j<p->nLookahead; j++){

And else appears thus:

}else{

Conditionals are truly unusual, but in a way which I’ve come to find beautiful, and appreciate the intention behind. An if or while statement has space within the parentheses, but not between compared values, or between the closing parenthesis and the opening brace, like so:

if( rp->precsym==0 ){

This close-vs-far pattern is also used in mixed precedence rules, so we might see 2 + i*5 or the like. The spaces within the parentheses of conditionals, and the yoking of pairs in comparisons, gives them a distinct visual texture, which helps with the other choices leading to the dense feel already mentioned. This sparing use of whitespace gets more code on the screen, which was surely a priority in times where 25x80 was a normal allotment of textual real estate.

It could be argued that this kind of separation compensates for the almost claustrophobic use of indentation, and the closeness of keywords to syntactic separators. But I would say, rather, that it justifies it, allowing the trained eye to find the elements of the code while putting more of it in view. This favors the experienced reader, but why should it not?

The above is actually why I devoted a chapter to explaining these wholly stylistic elements of the program text. I just think it’s neat, drh uses this style consistently in all code of his I’ve read, and I think this variation deserves to be better known.

Nomenclature

C hackers, as a community of practice, value concision. When they err, it is on that side of the question. This is a point I will return to more that once as we get to the translation itself, and it becomes higher stakes; this love of the terse extends to the language itself, sometimes to its detriment.

Lemon, by and large, is willing to spare enough characters to avoid befuddlement. We have the occasional gnomic identifier, probably the extreme of this is the phrase cfp->fplp, which is a config pointer’s forward propagation link pointer. I don’t think anyone has a good handle on how to deal with a name like that, to be frank.

The variables use a sort of lightweight Hungarian, in that sometimes they have a prefix or suffix letter which indicates type information, or type meaning. I say ‘lightweight’ because he doesn’t make a vice of it: variables and field names are allowed to be themselves when that’s clearer. Inasmuch as the convention is used, this is more of an Apps Hungarian, for instance using n for a count and i for a stored index. Both are int, but you do want to keep them separated, and the data-type information isn’t blindly reduplicated in the name, but rather the role. We see a b prefix for a ‘Boolean’, although not as often as in Pikchr where that use is quite consistent. This too is consistent with role-based naming, since they’re merely integers which will be evaluated for their truthyness.

There’s some hefty logic which has a lot of doubled data types, one for Lookahead and one for Action. These make good use of prefixes, we have aLookahead and aAction for the arrays, with mn for minimum, mx for max, n to track the bounds of a, and this kind of thing. This works, I would add the e to creat3 and say min and max but, it’s C, where letters are expensive.

To foreshadow a bit, the pairing of aAction and nAction4, or foo and nfoo more generally, is what I started to call a “synthetic slice”. In C, an array ‘decays’ into a pointer, one which is by convention alone addressed with subscripting: every valid pointer has a value of its type at [0], after that, you’re on your own. The valid bounds of these subscripts must be tracked somehow. In Zig we use a ‘fat pointer’, which contains the bounds on a .len field, although we also have [*] which is exactly equivalent (not in syntax, but in expressive power) to the decayed array. So those pairs are a slice, but synthesized from two parts.

z is an interesting one, it seems to refer to a ‘live’ char * buffer, or data associated with one. String manipulation in C is devilishly tricky stuff, and Lemon is obliged to do quite a bit of it. The z prefix is used when strings are being built out of parts and buffers, but when the pointer is stored for later, it just gets a name like code or destructor, this kind of thing. It’s helpful, I just want to know where z comes from. A conjecture: this is related to the need to keep the buffer zero-terminated, or at least to restore it to that form before any context is lost.

Then there’s p. Pointer? Of course pointer. But nearly everything is a pointer, although they aren’t all postfixed with p, and C makes it very clear when you have a pointer, what with all the ->, &, *, and [n] floating around. I’m not convinced this one pulls its weight. However, p doesn’t just mean pointer, it means singular pointer, it’s never used for a decayed array, those get a prefix a and that only sometimes. In C, perhaps that’s worth keeping track of.

Capitalization is somewhat load-bearing in familiar ways, such as SHOUTY_CASE for macros and enums, lead-lower for identifiers, and so on. This code was not linted for a rigid style, however: there is some signal in whether a function is capitalized or not, but I couldn’t elucidate a rule for it. If pressed, I’d say that capitalized functions are more ‘important’ in some way, with the lowercased ones serving some fragment of the logic contained in the uppercased functions. Whether a variable or function is PascalCased, camelCased, under_scored, or two words merely smooshedtogether, seems to be a bit arbitrary, but coming down to readability. Fields at least are invariably lower case, and either smooshed or pascalCased, the seeming basis being whether it fits to draw attention to the second or subsequent word or words, in which case they get capitalized.

There is a form Noun_verb() which serves as a sort of conventional namespacing, these are often associated with some structure stored in static memory, but not always. This, or something like it, is a frequent resort in C, which has no other mechanism for namespaceing. These stylistic choices have the net effect of making code easier to follow, but do not represent a rule by which one can simply convert the form of a name to something true about what it refers to. Especially now, when we have language servers to show us the type of an identifier on demand, this balance is somewhere in the sweet spot.

Types are all lowercase, but since typedef is only seldom used, this doesn’t become confusing, since struct is present for any declaration of one of these types. I infer that both the lowercase typing and the sparing use of typedef are due to the newness of the latter, and the style of the time, because in Pikchr everything is typedef‘ed, and struct types are invariably capitalized. Worth noting that Linus is firmly opposed to trivial typedefs, and the Linux style does not capitalize types, so it can’t even be said that the more common styling I refer to is universally agreed to be a good one.

C was and is the great battleground of these stylistic trivia, and the scars earned in these great and holy wars have influenced languages which follow to avoid the conflict ever arising. Zig, like Go, has a built-in official formatter, and a blessed convention for the use of capitals and underscores, one widely but not universally adhered to by the community.

I hope this was of some faint interest to some of you, and thank the rest for bearing with me. These were the qualities I was tracking during the initial read-through, the overall schema of meaning being an aid to comprehension for everything which follows.

  1. drh styles his own name D. Richard Hipp, but I not infrequently see this written Dr. Richard Hipp, and have done so myself. This is, in fact, technically correct, or at least, Dr. D. Richard Hipp would be so. I will, for the most part, use drh, which is the hacker-honorific form, in the modern miniscule style.

    1

  2. Block actually, but mostly at the top of the function in point of fact.

    2

  3. Ken Thompson was once asked what he would do differently if he were redesigning the UNIX system. His reply: “I’d spell creat with an e.”

    3

  4. This specific pairing turned out to be a bad example. Foreshadowing!

    4

Part One: An Introduction
Part Three: A Faithful Translation