Sidestepping an Issue to Make My LL(1) C Parser Work

Sometimes the works that give you real satisfaction are the ones that are incomplete and imperfect. Full of workarounds and sidestepping, not only do they work, but most importantly they save you from burning yourself out digging in a dead-end. A couple of days ago, I had this thrilling moment of satisfaction, control and relief, resulting from a quick decision to sidestep something.

I was writing a program to scan C header files and convert the function prototypes to nguigen, a language that I designed. These converted prototypes would serve as interfaces to existing C libraries from nguigen. Writing a C parser shouldn't be hard if you use flex and bison. But being obsessed with reinventing the wheel, I chose to use my own lexer and parser generators. Also, this seemed like a nice opportunity to test their usability for tasks other than the development of the nguigen compiler itself.

npargen, my parser generator, is able to handle LL(1) grammars only. This was sufficient for nguigen, but not for C. I won't claim C can never have an LL(1) representation, but any attempt to write one is really hard, based on my experience. Despite the limitations, I could make some real progress in converting parts of the C grammar to LL(1), until I hit the dead-end.

Before explaining what was problematic, let me first make it clear what were not problematic and why. Since I was writing this converter to auto-generate interfaces for C libraries, all I needed to get right were function prototypes. I still have to make it through function bodies since header files can have inline function definitions, and it is hard. But at least I don't have to get the semantics right. For example, I can afford to mess with the associativity rules. Basic transformations like left factoring weren't a problem either, since npargen2 can handle that.

The problem was with the production of parameters in function headers, which had declaration | abstract-declaration in the RHS. It can't be thrown away or truncated. Both declarations (e.g.: int *arr) and abstract declarations (e.g.: int *) are to be expected in function declarations. Unfortunately, both declaration and abstract-declaration have common prefixes, which means an LL(1) parser cannot decide which one to pick. Although I could easily resolve conflicts due to common prefixes in many other parts of the C grammar either manually or computationally, this one remained a real pain.

I spent at least two days on this. It felt like a dead end and I was about to save the challenge for later.

Then it hit.

My original goal was to produce a converter that does the job, not to create a museum-piece grammar specification. Can I sidestep the issue to achieve my primary goal? Yes. The solution was to simply fuse declaration and abstract-declaration together, incorrectly allowing both in wherever only one of them is expected, and to deal with the error later (for example, this spec would correctly allow both int *arr and int * as function parameters, at the cost of incorrectly allowing both as declaration statements).

There is no need for a parser specification to be 100% truthful to the final grammar. It can afford to accept programs with incorrect syntax or semantics as long as it doesn't reject any correct ones and there is handwritten code to catch the incorrect ones. This can make things easier. In fact, even the grammar snippets given in the ISO document require additional work in the semantic analysis phase to filter out erroneous input.

This is not just about parsers auto-generated from specifications. Even a 100% handwritten recursive descent parser can opt to accept incorrect syntax and semantics first in order to reduce the complexity of the initial phase of the parser. Such a phase will have less number of functions, reduced branches and shallow nesting. This makes maintenance, testing and debugging easier.

So what was the end result?

Once the workaround was in place, the converter could parse all 71k lines from preprocessed gtk.h. Of course it'd accept incorrect inputs too, but the important thing is, it accepts all correct inputs and stores info regarding each and every function declaration for the next phase of the converter to work on.

Was I being lazy? I believe not. I have spent enough time doing things one way instead of another without any reasonable benefit (not even the learning experience sometimes). Not this time. Let me sidestep in favour of the primary goal instead of spending eternity on a secondary goal. Of course I'll continue to do seemingly useless things in seemingly unproductive ways because I know there'll be surprises, but at the same time I want at least a fraction of my work to yield guaranteed output.

Links

Read more from Nandakumar at nandakumar.org/blog/