A Unit of Analogy

Porting Lemon to Zig: A YACC Shave

Porting C to Zig: Structures and Patterns

At the time of writing, I have a fully-functional port of Lemon, missing a few optional features, all of which will exist by the time I publish.

At time of publication, I also have Zitron, the completed port of Lemon to Zig. It’s new, but it’s in active use, and I’m pleased with how it turned out.

The general strategy of instrumenting the internal state of both programs, doing data-dumps to stderr, then diffing, served me quite well. It should be suitable for most ‘straight line’ programs, by which I mean, those which follow the common CLI pattern of starting at the top of main, proceeding to the bottom, and exiting. Suitably modified, it should serve for nearly anything, although truly interactive programs will need to come up with a strategy for running headless, and I would expect the technique to break down somewhat for complex GUI programs.

I expect many of my readers will know this, but it bears saying: you can gate all of the printing logic behind a comptime-known boolean, and when you make it false, that code will not even be semantically analyzed, and no trace of it will appear in your program. This is our replacement for C’s #ifdef statements, and better in at least one way: in addition to toggling true and false, you can delete the option entirely, and the compiler will find every place you’ve used it so you can be sure they’re removed. You can make this a build option if you want, I didn’t bother but it’s easy to convert if you decided to. It’s up to you how long to keep the scaffolding, but remember, it’s impossible to prove that two programs have the same visible outputs, assuming that the input universe is unbounded, which it almost always is. It’s difficult to even have justified confidence of it. If you find an edge case you missed, you’ll be happy to have the prints to turn back on again.

The Zig version of Lemon is very slightly, but consistently, faster. Most of that comes from one change: I use @embedFile to include the template in the program. This is cheating in a sense, but it’s also the case that until C23, embedding text from a separate file into C code was a real pain, while in Zig it’s easy and natural.

But not all of it. lemon.zig is incrementally, but measurably, faster, even when it reads the template from a file. Even performance-wise, it would seem that the differences between Zig and C come down in Zig’s favor.

The binary is noticeably larger, which I chalk up to all the specializations of print, Lemon being a very print-heavy program. At 03 for lemon and ReleaseFast for lemon.zig, the latter is around four times the size. In absolute terms, on my computer, the C program is 130KiB and the Zig one is 541KiB, and this size can be nearly halved to 252KiB with ReleaseSmall if one were inclined. 37KiB of this is the embedded template, which with the numbers we’re talking about is not a minor contribution. Interestingly, -Os has only minor effect on the lemon.c binary; I would like to explore why the -Fast and -Small modes in lemon.zig have a greater impact, but haven’t done so.

This shows that Zig is effective at what C is good at: fast, lean programs. The Zig version is larger, no question, but this isn’t an accident, and if it were necessary to slim it down to match, you can implement a respectable generic printf type of function using the C compatibility tools, as the documentation shows. This would result in comparable code: unsafe, containing latent risks of buffer overruns and runtime misinterpretation of typed values, but small, at one function per program rather than one per use of format. Regardless, the kind of bloat which we see from C++ and Rust is not apparent.

Last time, we did a deep dive on some of the fundamental ways in which Zig differs in approach from C, and I recommended approaches which porting can take to translating from what’s preferred in C to what Zig prefers. Using the exact Zig equivalents of C types will mostly cause more problems than it solves, but there is an ongoing up-front cost to making those decisions and carrying them across the translation.

Remember that every single thing you decide to change, however minor, will impose a contextual burden on the entire porting process. So I suggest rigorously sticking to identifier names in almost all cases, with the exception of some almost mechanical stylistic differences. Two of those are casing for types and enums: feel free to go ahead and capitalize types and lowercase enums, it’s never unclear what’s going on and it won’t slow you down. But fully translating everything into the approved Zig style can wait, possibly forever.

Memory is another one: the default strategy is to stick an allocator field on appropriate struct types, and thread it through accordingly, but you may encounter some identifiable strategies at play, and it can be worthwhile to find a Zig equivalent. I talked a bit about this in Part Five, Lemon uses some freelists and a MemoryPool is a functionally-identical version of this. If the program just allocates memory without freeing it, you could just make an arena and use cleanExit, it amounts to the same thing.

Refactoring once you have the port complete, and a solid regression test in the form of A/B comparison, is straightforward. In particular I suggest deferring any transformation of functions into namespaced member functions, I did one type that way right at the beginning and I’m glad I stopped there. A tip: you don’t have to move functions into a namespace, you can forward them by declaring pub const fnName = fnImpl;. If you happen to care about Git history then this will be preferable, and renaming someFunction to someFunctionImpl will help you find all the call sites and convert them to use member function syntax.

Another area to consider a more abstract translation is data structures, let’s talk about that a bit.

Data Structures, or, the Lack Thereof

C, famously, has no method for defining generic data structures of any sort. This ends up coloring how it’s written: some abstract data types are easier to implement than others, and since any such type must be implemented, or at least copy-pasted, they get heavily used.

Lemon has a few associative arrays, these were handily emulated using an ArrayHashMap. Interestingly, these were auto-generated from a specification, using a now-lost program. Big “we have templates at home” vibes. Just use caution before deciding a hand-rolled hash map can be replaced with a stdlib, because one consequence of implementing these things directly into the program is that the inner workings are more easily made load-bearing. Lemon has a hashing step to deduplicate strings, but the hash fingerprint ends up in the generated code, so that one I needed to port directly, so I could keep using the clean-diff success criterion.

You’re also likely to encounter many more linked lists than you might be used to, or at least, I did. They’re easy to allocate as-needed, C makes it very compact to traverse them, they can be stuck on a freelist and reused, and you don’t have to keep a separate length variable in sync with the amount of data, unlike a decayed-pointer array.

The majority of the linked lists used in Lemon are that way for a reason, but there are a few which surely could be arrays instead. Linked lists can get annoying because it’s easy to lose your data, and tricky to find the spot where you did that to yourself. Still, put off any change as substantial as switching from a linked list to a slice until you have a finished and functioning program, and at that point, do you really need to? Also, to echo the main theme of this section, you may decide that a translation from linked lists to an array is semantics-preserving, and you may find out that’s wrong.

For that matter, be careful translating the pair of an array-type pointer and an integer length directly into a slice. Often it’s entirely correct, but once again, a couple facts about C combine to make this less than guaranteed. Remember that free doesn’t take the length, just the pointer, so there’s no penalty to having a length variable describe only a subset of that array. In Zig if we trim a slice, we have to keep that knowledge in the program to convey it to allocator.free, and so cases where you might want to do things this way get handled in some other way. The majority of synthetic slices I spotted turned out to be exactly that, but not the vast majority.

It’s going to be a case-by-case basis. I made a serious mistake in thinking that I could model a triple of pointer-length-capacity as an ArrayList. Despite those being structurally identical, they were used in a completely different way and I had to back out of all that code. Often you can find a finite-length buffer in a C program, used to build strings, and replace it with an ArrayList writer initialized with a StackFallbackAllocator, and fix the latent risk of a stack-smash inherent in the original. The more local the scope of some potential translation, the more confident you can be that the transformation will be correct.

Be wary of making optimizations based on a substituted data structure. Lemon copies the symbols out of one of the aforementioned associative arrays into an ordinary array, then sorts them. I chose ArrayHashMap because it keeps keys and values in contiguous arrays (hence the name), but I made a mistake: instead of adding a comment about a possible optimization, I returned the value array directly, reasoning that Symbol creation only happens while parsing the input. While that’s true, the associative array is also an interning pool, and one line of code uses the array to retrieve the default symbol. Since I had scrambled the value array, this did not do what it was supposed to. Tracking this bug down lost me a day of my life, and we only get so many of those.

The optimization turned out to be both possible and easy, and would have taken me ten minutes once I had things working. It’s only an optimization in some Platonic and abstract sense, one @memcpy of at most a few hundred pointers is utterly irrelevant to any realistic accounting. We know what Dr. Knuth would say.

Generic Functions

Yes, C code has generic functions, and you might well find some. If you are looking at casts back and forth from (char *), and possibly some distressing offset-pointer calculations, maybe tidied up behind a macro, you’ve found a generic.

Guess what? Zig can translate this stuff directly, as-written, although the conventions are slightly different. Probably you don’t want to do that, and will want to write a comptime function which takes the needed type information and returns a function specialized to your type.

But both options are available, and idiomatic Zig code does have both runtime-generic interfaces and functions using type erasure. Comptime gives the advantage that the result is type-safe, so it will catch more transcription and translation errors. The other approach has the advantage that it makes the translation more mechanical.

A compiler is unlikely1 to be able to coalesce several comptime- specialized functions into a single call with some hidden parameters, even though the nature of this task of translation means that this is possible by construction. Conversely, the compiler will have more opportunities to optimize the type-safe variations. Chances are you should favor the latter set of tradeoffs.

(Un) Tagged Unions

Much like C code will frequently have ‘synthetic’ slices, it’s also fairly common to combine a union with a differentiating tag, canonically an enum.

I recommend defaulting to implementing these the way the code does, as a separate tag with a bare union. Zig’s tagged unions demand a very different control flow, which makes translation less straightforward, and once again, the lack of connection between the parts means the code is under no obligation to keep them in sync. In particular the C code might have one or more values which invalidate the union entirely, and it only needs to change the tag to express that.

Zig includes a safety tag with bare unions in safe modes, so the compiler will assist you with identifying mistaken use of such a construct. A later refactor to use a genuine-article tagged union is likely to be fairly straightforward, although you may or may not want to: if the pairing is embedded in a larger struct, you may find that separating the tag and union, at least in unsafe modes, leads to smaller data, for alignment reasons.

Then again, if it’s clear that you’re looking at a tagged union with extra steps, then sure, maybe make the change up front. But do look around and check if it’s as clear as you think it is.

Miscellaneous gotchas

Take care with for loop to while loop translation. The initial case goes outside the statement, instead of within it, and I found that makes it easy to switch things up, especially combined with the not-infrequent translation to a capturing while which optionals (linked lists, remember) require.

Similarly, take a second to think about the base case when turning a do/while into a while. You do want it to execute the same amount of times rather than one less, this is never a headache in and of itself but it’s not a fun bug to track down if you miss a step here.

You’re going to have a lot of optionals, especially if you have linked lists, and you probably will. That means making up a bunch of variable names which aren’t in the original, and a choice between giving the optional a different name, or giving the captured no-longer-optional the different name. Easy: it’s the optional which gets the new name, and the new name is just the old name with an m_ prefix. I’ve been using maybe_value, and it says too much for what it is, I’ll be switching to m_value in normal Zig code as well. Much like i for a local index, it’s enough to convey the distinction without drawing undue attention to itself.

Sorting: C code generally uses qsort, since it’s in the standard library. By the same token you’ll want to use Zig’s mem.sort, which as of v0.14 is a block sort, although there are a few more in std.sort. There’s a challenge here, more subtle than it appears: qsort uses a cmp function (negative, zero, positive) while all of Zig’s sorts expect a less-than function. This rewrite is generally straightforward, just bear in mind that your sort function needs to return false for an equality, because N is not less than N. Doesn’t matter if the data is unique, because it can be compared to itself2.

Also: the C standard does not impose an implementation for qsort, despite the name, and it is not required to be stable, which block sort is. If you’re unlucky you might encounter a sort which triggers instability, which will preclude printing out the sorted contents as one of the validity checks. Not the end of the world, since if subsequent use of the sorted data is robust to instability, a stable sort will work just as well.

goto: generally you can solve this with either a defer/errder or a labeled switch-continue. Case fall-through is always a labeled switch-continue. If, like me, you’re in the habit of labeling anything which gets a break or continue, watch out: Zig and C have the same break semantics for unlabeled breaks, but with labels in the mix you’ve given yourself the opportunity to break out of the wrong level of nesting. Might want to add labels after porting a function, rather than during.

For something like setjmp / longjmp, you’re just going to have to figure out what it’s doing and do it in some other way. There are very few C-isms which don’t translate, and that’s one of them. Normal gotos might mean cooking up an enum and adding a switch which isn’t in the original, but this is no big deal. The most common longjmp case is an isolated cleanup routine, and you can often solve that with an errdefer in the main function, or avoid it entirely by scope-specific errdefers which clean up as they unwind. Lemon doesn’t use setjmp/longjmp, for the record, this is the general-advice section.

Let global state be global state. If you want to make it local later, you can just delete it, stick it on something which should be able to transfer it where it’s needed, and clean up compile errors until it is. Doing it up front, and then discovering it’s needed somewhere which doesn’t happen to receive the structure you put it on, is less fun. Personally I always use threadlocal, as a principle-of-least-power decision which also makes the ‘global’ state easy to find.

Don’t shy away from simple style transformations such as factoring out very long functions into some subroutines. That’s unlikely to get you in trouble. Don’t rename stuff, though, you’ll regret it when you start jumping around both source files looking for things. With the already- noted exceptions of style changes for the likes of enums, which are purely mechanical and don’t interfere.

C will often declare and initialize variables separately, especially ANSI C where it’s required. You can emulate this with undefined, but it’s often worthwhile to move the variable where it’s used, and sometimes use things like labeled blocks to make it so the variable is never in an undefined state. Those kinds of purely local transformations rarely get you in trouble, if you find out that you were wrong about the actual scope of something it’s not so hard to back it out a level or two.

You may want to un-reuse variables, since C code, especially the older kind, tends to be fairly profligate in that respect. This may involve adding blocks which only serve to create scope. So much the better, I say: when you come back to the code the blocks serve as load-bearing documentation of a discrete step in some larger operation. They also make plausible chunks to refactor into separate functions, when that makes sense.

Keep an eye out for boolean type-punning. It’s not uncommon to use a zero value as no-errors, call something in a loop while adding return values to an int, and then branch on non-zero. Usually this can become a boolean, but not always. If you turn the err += ... into err = err and ..., you’ve short-circuited something which wasn’t a short circuit, so do your anding or oring on the next line. Making the branch a comparison against zero is the conservative option here.

Oddball one: it’s perfectly valid to index a pointer in C with a negative number, as long as that memory is defined. Generally you’re going to be doing something else entirely, but it’s worth keeping in mind that you can use a multipointer mp: [*]T and emulate a negative index like this: (mp - 2)[0]. Mostly you want to use an index instead of doing the kind of pointer-bumping which tends to lead to this, but say you have a stack represented as two pointers, you can make the root pointer []T and the top pointer [*]T and do top-of-stack manipulation this way. Respectable in greenfield Zig code as well, in my opinion, if less likely to arise. It means you can pass one pointer instead of a slice and an offset, with less register pressure, better prefetch, the negative offset can be baked into the instruction instead of calculated off the index. Might come in handy some day.

Conclusion: Try It, You Might Like It

Beginners to a new programming language, and/or a new style of programming, are often at a loss as to how to gain proficiency. Those with more experience in programming generally often have a specific sort of test program they fall back on, for me that’s been a basic terminal manipulation library. That sort of thing serves to get the basic syntax sorted, how code is built and run, all the 101 stuff.

The next project is often something greenfield and open-ended, which can be fine, but which can also peter off. It means combining the acquisition of intermediate proficiency with everything which goes into producing a quality library or application, from scratch.

Translating a program or library into your new language, in our case, Zig, is underrated as a journeyman project. Especially if you’re new to systems programming, or memory-managed languages. Choosing this approach means that your success criterion is built in, and, you have a way to determine when you’ve reached it.

I think this is a great approach, even if you don’t know C either. Perhaps especially if you don’t know C. Expertise in Zig practically demands a good understanding of C. Among other things, the language defines the ABI which we need to target to share code outside of the Zig ecosystem, which is one of the major value propositions of the language. As is using C code from within Zig.

It’s not a hard requirement to pick a library or application written in C, but I recommend it, or at least a language without memory management. Early on in my Zig journey, I completed a port of diff-match-patch. Although one of the official implementations is in C++, the Zig library I was starting from was based on the C# implementation, so I chose that as my source text for the port. Figuring out how to add correct memory management was pretty brutal and definitely imposed some strain on the process.

There are thousands of command line tools and programs written in C, and some enormous number of libraries. Applications have an edge here, because they have an exterior surface which can be used for testing the fidelity of the translation. But most of what I discuss here applies to libraries as well. If they have a good test suite (as diff-match-patch does), you may be able to complete the job by translating that with the library and getting it to pass. If not, you’ll likely want to write an application which exercises the library’s API. Zig does a great job of integrating with a C library, so you may well be able to write one application, in Zig, which uses both libraries.

This series of posts is about porting an application however, and I can heartily recommend giving it a try. Wouldn’t it be cool if there was a full collection of the canonical Unix tools written in Zig? Why not add one of these programs to that collection? We’ll have it done quicker than you might think!

A Word About Licenses

About that: a translation of some FOSS program is going to need to be licensed in a compatible way with the source program. It’s a derivative work, or at least you would have your work cut out for you arguing otherwise, and I strongly recommend not ever putting yourself in the position where you might have to so argue. That means if the source material is GPL, the translation must be as well.

I don’t use the GPL(s) as a matter of course. Perhaps you do3, in which case you do have more options. If you prefer a more permissive license, the various BSDs do as well, so, at least where core Unix utilities are concerned, there’s always a version with a compatible license which you can base yours on.

It’s worth observing that if you wanted to do something like add a GNU-only extension to some program translated from a BSD source, black-box testing does not oblige you to the copyright of that program. This would mean running the GPL version on the same input as your port and confirming that visible behavior is identical, and not reading the source and translating the feature directly. That’s a tough way to implement an entire program, that isn’t even a port, it’s a whole other art form. But for one or two flags added to an otherwise-complete port, it may be practical. The following is not legal advice, but the FSF in particular is well aware of how black-boxing works, and doing so will create visible evidence in your repo in the form of the testing scripts. So in the unlikely event that you’re contacted by the FSF, specifically, they’re likely to accept that what you did is licit and does not require you to GPL license the result.

That may not generalize, however. Lawsuits surrounding FOSS license issues are rare, but not impossible, and one reason I avoid GPLed code, and don’t care much to write code with that license, is because the terms carry more of a risk of creating situations which can end up in court if you’re unlucky enough to find out that the copyright holder is trigger-happy about that kind of thing. Permissive licenses don’t have this issue, just do include a copy of the original license in your repo (for most permissive licenses, not all) and you’re fine. A GPLed port of a GPLed program is also fine. I can’t make any confident statement about Affero, because most legal opinion is that it’s totally unclear what is and isn’t compliant with that license.

Lemon is dedicated to the public domain, like much of drh‘s works. In my IANAL opinion, that means lemon.zig is not eligible for copyright, but I also disclaimed any copyright on it I might have. The program I derived from it, Zitron, is licensed BSD-0, which is equivalent to the public domain in all practical ways. That’s enough about copyright, the central point is that a faithful port done from source is a derivative, not a transformative work. So the license should match the original.

Finale

So what are you waiting for, give it a shot! A lot of the advice in this series is going to be applicable to any program, no matter how it works. Just figure out up front how you’re going to generate textual dumps which you can diff to confirm success, start with whatever part of your target will get you dumping said test fastest, and keep it up until you have a feature complete and compliant port of the original.

If you do get to that point, drop me a DM on Ziggit, I definitely want to hear about it. Or post it in the Showcase so we can all congratulate you.

Even if you don’t want to take up the Port It To Zig Challenge, I hope you’ve enjoyed my long and rambling exposition of one such port. If you have any feedback, questions, or comments, get in touch, I’m easy to find.

  1. Less likely, perhaps. Reasoning too closely about what the compiler will or won’t do is a bad habit, albeit one I indulge in. If you’re not checking Godbolt, or otherwise dumping LLVM IR or object code, then you don’t actually care as much as you think.

    1

  2. Interesting Hyrum’s Law observation here: the C standard, as mentioned, does not require basically anything internal about a qsort implementation. But Quicksort itself will never compare an element against itself. This can pretty easily become load-bearing, for example, one might take the opportunity while sorting data to assert that each element is unique. So in practice, no variation on qsort will make such a comparison. They simply wouldn’t dare.

    2

  3. I certainly don’t mind. Unless you’ve picked up some bad habits from Stallman and feel the need to sneer about “pushover licenses” or whatever, in which case, I’m sneering right back. It’s the principle of the thing, solidary must go both ways. Don’t be a free software bully. People notice.

    3

Part Six: Nulls and Slices and Signs