Is it not the case that most serious parser implementations are hand-written? In part because it makes it so much easier to provide good diagnostic messages. I feel like every other project moves from parser generators to hand-rolled parsers and lexers.
Community wisdom has long held that good error recovery is impossible for automatically generated parsers and I unquestioningly accepted that wisdom for years. When I was building a Yacc system in Rust, I eventually investigated further, and found that there's loads of work in this area over several decades. None of it ever found its way into real parsers, probably because it was too slow. I extended some of that work a bit, made it more efficient, and took advantage of the speed of modern machines and now have an approach to automatic error recovery that you have to do quite a lot of work to beat with a hand-rolled parser. More at https://softdevteam.github.io/grmtools/master/book/errorreco... if you're interested!
Both Go and OCaml use LR parser generators, precisely because they make it easy to give excellent error messages.
Basically the key insight is that you can use the parsing automaton state to classify syntax errors, which makes it as easy as falling off a log to produce good messages. The paper originating this technique is Clinton Jeffery's TOPLAS 2003 paper "Generating LR Syntax Error Messages from Examples." This is what Go uses, and OCaml uses a more advanced version of this technique introduced by Francois Pottier in his CC 2016 paper "Reachability and Error Diagnosis in LR(1) Parsers".
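To make the idea concrete, here is a rough sketch of what "classify errors by automaton state" boils down to: the generator pairs each example error with the state the parser was in when it failed, and the error routine looks the current state up in that table. The state numbers and messages below are invented for illustration; this is the gist of the technique, not Go's or OCaml's actual machinery.

    /* Sketch of state-based syntax error messages. The table would be
     * derived from example inputs that each fail in a known automaton
     * state; the numbers and messages here are made up. */
    #include <stdio.h>

    struct err_entry {
        int state;         /* LR automaton state in which the error was detected */
        const char *msg;   /* message paired with that state via an example */
    };

    static const struct err_entry err_table[] = {
        { 42, "missing ')' to close this expression" },
        { 57, "expected ';' after statement" },
        { -1, NULL }
    };

    /* Used in place of a generic "syntax error" report. */
    static const char *classify_error(int state)
    {
        for (const struct err_entry *e = err_table; e->msg != NULL; e++)
            if (e->state == state)
                return e->msg;
        return "syntax error";
    }

    int main(void)
    {
        printf("%s\n", classify_error(57));  /* prints the tailored message */
        return 0;
    }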
If you look closely at the Go code you'll see it was once C code generated by some parser generator (which I assume was Bison). I believe it was Russ Cox who wrote a program that transformed the generated C code into Go.
Edit: I forked the compiler a while ago to add maybe types (a la Rust result). Looks like the compiler code has changed quite a bit from when I was playing w/ it.
That's a long time ago. You can see vestiges of it here and there (mostly in deeper areas of the compiler, which goes all the way back to Plan 9), but the parser now looks very clean.
FTR: The go/parser package was _never_ used by the Go compiler. It used to be a yacc-generated one when the compiler itself was in C. Then it was all mechanically translated to Go. And finally, a hand-written lexer and parser were adopted, but different from the ones in the stdlib.
Guy Steele said this in the Dynamic Languages Wizards series, in the panel on language design [1]:
Be sure that your language will parse. It seems stupid to sit down and start designing constructs and not worry how they will fit together. You can get a language that's difficult if not impossible to parse, not only for a computer, but for a person. I use YACC constantly as a check of all my language designs, but I very seldom use YACC in the implementation. I use it as a tester, to be sure that it's LR(1) ... because if a language is LR(1) it's more likely that a person can deal with it.
I've had great success with a parser generator known as "Marpa" [1]. One of the things I particularly liked about it is how simple it made it to emit accurate and useful error messages when parsing complex languages. In addition, it can handle anything that can be expressed in BNF, and it's quite fast.
For general-purpose languages, yes, but there are some exceptions. Ruby uses yacc, and Python uses a custom LL(1)-ish parser generator, which is in the process of being replaced by a custom PEG parser generator [0].
* See: "What do you think are the problems people will try to solve with ANTLR4?" question
I've used several different parser generators in the past.
But I've also transitioned to hand-rolled recursive descent parsers. Being lazy, I've created a library to assist in hand-rolling a recursive descent parser: https://github.com/SAP/chevrotain
This is actually pretty true based upon what I've personally observed too.
While Apache Spark [0] and Presto [1] use ANTLR to implement their parsers, when I looked into TypeScript's parser I noticed that they implement their own [2] (in TypeScript itself); the reason I was able to find was that the nuanced error messages a language like TypeScript has to provide are only feasible with a hand-written parser.
That’s what I expected this article to be about - it looks like it’s actually a long-form sales pitch for ANTLR training (but still worth a read). I’ve never been fortunate (or unfortunate) enough to be called on to put together a real compiler since school, but I’ve spent a lot of time working with “lite” parsers like JAXB and Hibernate (or variants thereof), and I’ve come to the same conclusion: I’m better off just hand-rolling my unmarshalling code than dealing with the gaps left by code generators.
In my professional experience I build parsers, at least 10 per year, and 99% are written using ANTLR. So far I have not found a language that I was not able to parse with it.
I think it does make sense to write parsers manually for performance and error messages, but it should be clear that this means raising the cost by ~10 times. It is worth the effort if you are building a compiler for Java, for example, but probably not if you want to process some DSL you developed.
Disclaimer: I am the brother of the author of this article
You can check out how CockroachDB uses special error symbols with goyacc to provide good error messages for SQL parsing. It’s a very interesting read really.
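For anyone who hasn't seen the mechanism: goyacc inherits the yacc family's special `error` symbol, and rules that mention it give you a place to report a targeted message and resynchronize. A generic yacc/bison sketch (not CockroachDB's actual grammar) looks roughly like this:

    /* Illustrative yacc/bison fragment, not CockroachDB's real grammar:
     * when a statement fails to parse, the parser shifts the special
     * `error` token, the action reports it, and parsing resumes after
     * the next ';'. */
    stmt_list:
          /* empty */
        | stmt_list stmt ';'
        | stmt_list error ';'   { fprintf(stderr, "could not parse statement\n"); yyerrok; }
        ;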
The reasoning is pretty simple. Writing a parser for a sane language won't take significantly more than a month for your average developer. If you have a whole team working on the same compiler for decades then the payoff becomes pretty obvious.
One of the subtle things hidden in your statement (which I do somewhat agree with) is “sane language”.
One of the more memorable parsers I’ve worked on was a parser for SPICE netlists. I started out believing that it wasn’t going to be too big of a deal, and ended up sinking a ton of time into it. Ultimately the company (as far as I know) ended up just buying a $40k license to some obscure company that had one, because there was a constant cat-and-mouse game of getting it working right and then discovering a customer who did even more strange stuff that was somehow accepted by the 3rd-party SPICE sim, but wouldn’t be accepted by ours.
I’m assuming that PHP has evolved a ton since I encountered this, but IIRC at one point you couldn’t do `$foo[$baz]($zap)` to call an anonymous function stored in an array, rather you had to `$t = $foo[$baz]; $t($zap)`. At the time (young and naive), I just couldn’t comprehend how a sane parser wouldn’t parse the first form, and then I started looking at how it was implemented...
I think that hand-rolled parsers are good if:
1) You have a team that knows well what they are doing
2) You have a lot of resources
3) Your language is stable
I would argue that parser generators are what made it possible for those projects to be started and to prosper in the first place. Once one is successful, a custom solution can make sense, but in my experience it is more expensive and potentially less maintainable, unless you know your way around parsers and language tooling very well.
For some definition of "serious" I think the gains from a readable implementation in your working language far outweigh performance or correctness concerns.
To put it another way, I'm rarely parsing data for it to be directly optimized into a machine language. I'm parsing data to extract parts I care about and then work with those parts. The more consistent this process is, the easier it is to debug and fix. If my whole parsing/using-the-parsed-data pipeline is in the same language, say Javascript, then I am still only ever debugging the same language and runtime environment (eg: node v12 on whatever *nix).
In a practical example, YARA[0] is (confusingly) used as both a format[1] and specific implementation[2] for sharing malware detection rules. ClamAV[3] is a popular open source antivirus engine that added YARA-the-format support a few years ago[4]. If you look at their grammar files[5] as well, you can see that one uses GNU Bison 3.0.4 and the other uses 3.0.5. One is 3754 lines long, the other is 1849. We can expect these to behave differently. At a certain point, say because of how regular expressions are handled[6], it becomes easier to maintain your own parser than to deal with the quirks/whims of someone else's implementation (generated or not).
[6]: "In previous versions of YARA, external libraries like PCRE and RE2 were used to perform regular expression matching, but starting with version 2.0 YARA uses its own regular expression engine. This new engine implements most features found in PCRE, except a few of them" from https://yara.readthedocs.io/en/latest/writingrules.html#regu... ; if the regular expression grammar and/or symbol set isn't consistent the things parsing the files won't necessarily be either
...wow, this is a bit strong. I would encourage you to be more respectful of the people who work in this very specific field and share their ideas. There are always people behind the work you insult so easily.
You know, there are over 2M people who read our articles. A few hundred also bought a book or a video course from us, but the vast majority just got some information for free, and we like it that way. I do not think we got so many people interested in what we do by lying.
If you have a different professional experience I would be happy to learn from it.
"Seibel: And are there development tools that just make you happy to program?
Thompson: I love yacc. I just love yacc. It just does exactly what you want done. Its complement, lex, is horrible. It does nothing you want done.
Seibel: Do you use it anyway or do you write your lexers by hand?
Thompson: I write my lexers by hand. Much easier."
I happen to like both bison and flex, which are relatively easy to use and bug-free in my experience. Yet your article spreads hundreds of lines of FUD about these tools, a strategy that many ANTLR people use.
I have used ANTLR. It is not intuitive, the documentation is horrible, if you happen to find some advice on Stackoverflow it is likely to be for another one of the incompatible versions.
I suppose if you use ANTLR long enough, these problems go away. But bison or Menhir don't have these problems in the first place.
It's been a while, but I had some fun using the "lemon" parser generator[1] used by SQLite. Worked alright for something where regexes weren't enough and I couldn't be bothered to write a custom parser. Probably has the same cons that lex/yacc would get, according to the post.
I do have the ANTLR book, but to be honest, if I had the task of doing something more involved, where integration isn't as important (i.e. it could be an isolated command, not a module for a huge Java/C# app), I'd probably be more inclined to get deeper into SML/OCaml than ANTLR for this.
I have seen multiple projects run into maintenance problems due to large grammars with thousands of lines and the ANTLR 3/4 incompatibility.
Generally a parser generator adds a layer of complexity/constraints, which may be significant if you need full control of lexing, parsing, semantic processing, and error recovery for your language.
After working on a few language projects (including a core language and a web-based editor with syntax coloring and auto-completion) myself, I would strongly suggest a hand-rolled parser for any serious language endeavour (general purpose or DSL).
I am a bit surprised by this, as I had the opposite experience. I find ANTLR grammars much more maintainable than the hand-written parsers I encountered. Indeed I was asked to port hand-written parsers to ANTLR for maintainability. Also, ANTLR4 seems to me to produce grammars which are clearer than ANTLR3.
The weak point is error recovery, in my opinion. While ANTLR offers a sort-of-decent error recovery strategy for free, one has to customize it to get great error messages.
I think that many maintainability issues are due to poor usage of ANTLR. What we do is to:
1) Limit semantic actions
2) For complex languages, do tree transformations after parsing
With this approach we got pretty decent results.
Personally I find hand-rolled parsers too costly to build and harder to maintain, but I have limited experience with them. I guess it also depends on the context: if you are designing a DSL while writing the parser, you want to be able to evolve it very quickly, and in that context I think that a parser generator is very useful. If instead I needed to build an industrial-grade parser for a general purpose language, having a large budget, then I would go for a hand-rolled parser.
My experience has been similar to yours. The company I work at has very complex business logic written as a custom DSL parsed/executed by a crazy pile of Perl code.
Recently we had requirements to come up with a way of providing on-the-fly validation in a web editor. This was only possible by ditching the old implementation and re-writing the grammar using ANTLR. While the old implementation is unmaintainable and (probably, who knows?) buggy, the ANTLR implementation is trivial to work on, test and add new features to.
If you're working with a limited time budget, which is common when your main job isn't to maintain the language, then parser generators such as ANTLR are a godsend. ANTLR even enables you to generate parsers in different languages depending on where it needs to be executed. Need something to run client-side, in the user's browser? Generate a JavaScript parser and you're done. Need it to run in the Java backend? Generate it in Java and call it a day.
While it's true that error handling isn't the best, it's already better than nothing, and, as you say, can probably be improved with a bit of customization.
Second this. Recursive descent parsers are much easier in the long run, even if you can't immediately see your grammar after they have grown for a while.
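For anyone who hasn't written one, the basic shape is small: one function per grammar rule, each consuming tokens and delegating to the rules below it. A toy sketch (single-digit numbers, + and * only, no error handling), just to show the pattern:

    /* Minimal recursive descent evaluator for expressions like "1+2*(3+4)".
     * One function per rule:
     *   expr   := term ('+' term)*
     *   term   := factor ('*' factor)*
     *   factor := digit | '(' expr ')'
     */
    #include <stdio.h>

    static const char *p;            /* cursor into the input string */

    static int expr(void);
    static int term(void);
    static int factor(void);

    static int factor(void)
    {
        if (*p == '(') {
            p++;                     /* consume '(' */
            int v = expr();
            if (*p == ')') p++;      /* consume ')' */
            return v;
        }
        return *p++ - '0';           /* single decimal digit */
    }

    static int term(void)
    {
        int v = factor();
        while (*p == '*') { p++; v *= factor(); }
        return v;
    }

    static int expr(void)
    {
        int v = term();
        while (*p == '+') { p++; v += term(); }
        return v;
    }

    int main(void)
    {
        p = "1+2*(3+4)";
        printf("%d\n", expr());      /* prints 15 */
        return 0;
    }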
From trying to understand parsing and RDPs I think I don't have the part of the brain required to understand them.
Not that it isn't simple, it is. But it seems examples (as usual) overexplain the simple things then overlook something that seems obvious but isn't.
The only time I managed to write a parser for simple math that wasn't an example was through the use of 'reverse production' parsing. Yes, it's the worst way, but it worked for me (this was a long time ago though).
Indeed PostgreSQL is a good example of fairly old software (>20 years) that is highly reliable and well used. How many people needing a database complain that they have to have something that has been written, and re-written, in recent years?
I first learned ANTLR way back, about 20 years ago, and found it fairly impressive. So when I go to write a new parser or lexer I always go take a look at it again...
But every time I take a new look at ANTLR there's a huge number of issues related to the fact that they have a huge new rewrite deprecating old versions while the new version is not ready: entirely new ways of doing things with missing documentation and examples, incomplete or missing language backends, and missing packaging for various Linux distributions.
What I get from it is: ANTLR is a very powerful project... if you're in the Java ecosystem and if you're willing to use old, no longer supported versions.
Your experience mirrors my experience quite a bit. I remember a few years ago playing with ANTLR precisely as an alternative to flex/bison for a project I was working on. Just recently I had an embedded project (quite resource constrained, like 4kB of RAM) that needed to parse some data coming in a serial port, and I thought “hey, I should look at ANTLR again”.
I fire it up, play around a bit, and did a calculator example or something like that. The next step, I figure, is to get it to generate some C code so I can test it on the device. Turns out ANTLR4 has dropped the C backend entirely! C++ is, currently, a no-go for the project, so... I guess I’m out of luck there.
Pleasantly, in a discussion with a friend, he asked the silly question: “couldn’t you parse those strings with sscanf()?”. I blinked in disbelief, wrote the tiniest parser to split the input on newlines, and sscanf() did the trick.
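For what it's worth, once the input is line-oriented the whole thing really can collapse to a few lines. A toy version of the pattern (the "SENSOR id value" line format is made up for illustration, not the actual serial protocol from the project above):

    #include <stdio.h>
    #include <string.h>

    /* Parse one line of a hypothetical "SENSOR <id> <value>" protocol. */
    static void handle_line(const char *line)
    {
        int id;
        float value;
        if (sscanf(line, "SENSOR %d %f", &id, &value) == 2)
            printf("sensor %d -> %.2f\n", id, value);
        else
            printf("unrecognized line: %s\n", line);
    }

    int main(void)
    {
        /* Stand-in for data read from the serial port. */
        char buf[] = "SENSOR 3 21.5\nSENSOR 7 19.0\nGARBAGE\n";
        for (char *line = strtok(buf, "\n"); line; line = strtok(NULL, "\n"))
            handle_line(line);
        return 0;
    }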
Yeah sscanf can do quite a bit. But it doesn't have any informative error messages, can't recover (skip to the next semicolon or whatever), doesn't keep a symbol table, can't do any parsing more complex than "the next line looks exactly like this pattern" etc.
So after years of rolling my own, I finally decided I would never do that again. All the off-by-one errors, difficult-to-diagnose failures and endless fiddling is out of my life now. I just use a parser tool.
Rewriting perhaps not, but being able to fix the EBNF flavor used, to make it more powerful, is a great thing. If you have to stay 100% compatible with all the mistakes accumulated over 30 years, your hands are tied.
Last I checked, the interface between lex and yacc (resp flex and bison) is by mutating static global variables. Is that still the case? It always struck me as exceptionally bad design.
I don't think you are particularly restricted to doing this, but lex/yacc was conceived and designed at a time (the 1970s) when memory was limited, and the speed and size of code that could be parsed were valued much more highly than structured code and being able to provide good error messages. You simply can't do a recursive descent parser with an in-memory AST in 64KiB of RAM while being able to parse any meaningfully sized source file. You had to serialize the problem and output a sequential AST/intermediate opcodes on the fly. FSA-based lexers and parsers are beautiful in how little RAM they actually require and how fast they are, albeit clunky and very low-level by modern standards.
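For readers who haven't seen it, the coupling looks roughly like this: the lexer and parser talk through globals such as yylval and yytext rather than through arguments or a context struct. A stripped-down, hand-written stand-in (the token codes are invented; real ones come from the generated header):

    #include <ctype.h>
    #include <stdio.h>

    enum { TOK_EOF = 0, TOK_NUMBER = 258 };  /* illustrative token codes */

    int yylval;            /* semantic value of the current token (global) */
    const char *yyinput;   /* input cursor, standing in for the global input buffer */

    /* yyparse() calls yylex() repeatedly; the token's value travels through
     * the global yylval instead of being returned alongside the token code. */
    static int yylex(void)
    {
        while (isspace((unsigned char)*yyinput)) yyinput++;
        if (*yyinput == '\0') return TOK_EOF;
        if (isdigit((unsigned char)*yyinput)) {
            yylval = 0;
            while (isdigit((unsigned char)*yyinput))
                yylval = yylval * 10 + (*yyinput++ - '0');
            return TOK_NUMBER;
        }
        return *yyinput++;  /* single-character tokens are their own code */
    }

    int main(void)
    {
        yyinput = "12 + 7";
        for (int tok = yylex(); tok != TOK_EOF; tok = yylex())
            if (tok == TOK_NUMBER) printf("NUMBER %d\n", yylval);
            else                   printf("'%c'\n", tok);
        return 0;
    }

(Newer flex and bison do offer reentrant modes -- flex's %option reentrant and bison's %define api.pure -- but the default interface is still the global one.)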
I tried to read this article, but unfortunately the lack of grammar was too distracting and I just couldn't get through it.
A lot of these are things that Google Docs or Microsoft Word will notice and ask you to change. I highly recommend using one of those to write (especially if English is not your first language) and trying to understand the suggestions it makes. Your articles will come out much more readable to others.
Counterpoint: I used to "and I stopped reading" when I ran into grammar issues. I gradually realized that it's worth the effort to focus on the underlying ideas and evaluate that, rather than someone's 2nd language ability. People are smart, even when they talk funny, y'all.
(The worth/validity of the underlying ideas is a separate matter.)
Counterpoint 2: Automated grammar corrections are risky unless you're a native speaker. Ironic.
Most English-speaking people everywhere speak with some variation on "formal" English. Even the received pronunciation in south England and London is relatively modern.
If anything, some American accents are closer to the English of the late 18th century than many British ones are, and the "official" received pronunciation is not without its fair share of critics.
Yeah I can't see your average C/C++ programmer pulling in a parser generator written in Java to get their job done in place of flex/byacc, I suspect most would rather roll their own parser ahead of that. Maybe it happens, but I can't picture it.
Same with Python, the performance was horrifying. Didn't seem usable for production to me, but seems better suited for prototyping a language without worrying about the parser until later.
I would have to check, but the resources on lex/yacc that I remember would make a point of their usability at scale. Generally it would be unlikely that you could write a faster lexer than what lex could produce, but very likely that you could write a better parser than yacc. However, writing a small lexer and/or parser _quickly_ is easier with those tools than hand-rolling one, almost every time. It's great that parser generators are still being developed and improved, and it's fine to point out issues, but this article misses the point by a wide margin.
Just a few points that I'd like to answer to this article:
- One of the primary concerns of Bison is the speed of the generated parsers. In the section about performance, they say that ALL(*) is 135x faster than GLR. Digging into the original paper, we find that:
- 135x is from their best case against the worst GLR they could find
- they compare different tools, not different algorithms in the same tool
- the grammar they use is optimized for their tool
- the best GLR tool they're testing against isn't even running on the same OS
- moreover, they mention that the biggest chunk of their performance comes from the fact that they cache temporary results, so theoretically including a cache in a GLR parser could make it even faster.
- since there is no grammar analysis, we get back to the classic "static vs dynamic" language debate, where Bison tells you at build time if you have conflicts, and ANTLR relies on your good test coverage (see the small example after this list).
- They only have a passing mention of the fact that their grammar class is strictly less powerful than GLR, since they don't handle indirect left recursion.
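On the build-time point above: feed bison a deliberately ambiguous toy rule like the one below and it reports a shift/reduce conflict when the parser is generated, before any input exists; a parser that analyzes the grammar adaptively at parse time only surfaces the ambiguity on inputs that happen to exercise it.

    /* Toy bison grammar: without precedence declarations this rule is
     * ambiguous ("1+2+3" can group either way), and bison flags it as a
     * shift/reduce conflict at generation time. */
    %token NUM
    %%
    expr: expr '+' expr
        | NUM
        ;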
I have used ANTLR to write a parser for a simple language and found it kind of confusing, but put it down to being new to ANTLR and to parser generators in general.
The reason I chose it was for its ability to output parsers in multiple implementation languages for a single grammar - lemon et al seem to be tied to C. Are there any other systems the HN crowd would recommend for this use case?
I've been poking at coco/r recently and it seems to have a nice collection of backends.
Not sure how maintained it is, spent an afternoon getting the Python version ported over to py3 (which wasn't really all that hard) to learn how it works.
As TFA states, you can't reuse grammars for multiple languages because the actions are declared inline, but a few of the other complaints aren't an issue due to the way the generated parser does its thing -- quite well designed IMHO.
I am sorry we gave this impression. Sure, we use ANTLR a lot, for commercial and open-source projects, and we have had to work with parsers written with flex, yacc, and bison, so we shared our experiences as people who work all day long with parsers. We offer a lot of free resources on ANTLR and we have no specific interest in advertising ANTLR. For us it is a tool we use and love, just that.
Almost anything fares better than flex/bison, this is shooting the proverbial fish in a barrel.
It comes across very strangely that your brother writes as if y'all are unaware of the existence of parser generators other than ANTLR. This makes me sad. In a fair comparison, ANTLR really is not a great tool; its capabilities are eclipsed by its marketing.
We have several articles comparing tens of other tools. That said, of all the tools we worked with, ANTLR was the one we were most efficient with. It is true there is a learning curve, but in our experience learning it was a great investment. I can understand that others could have different preferences.
I think they are learning tools for people to start with and to play around with. It is hard for an application developer with no background knowledge to understand what a parser does because it is quite abstract, for example lexers, tokens and expression trees.
However, to build a realistic and comprehensive compiler, I believe building it manually is a much better option because it is less error-prone, more flexible to tweak, and favours unit testing. Perhaps a parser generator is better? I am investigating these kinds of tools for work because we need to make our own unique formula implementation. I did use yacc for a school project, and I know its limitations. Since I am busy with my own stuff, I let my colleagues make their own decision, and they decided to use jison. It turns out the product owners want to see much better error messages when a formula fails validation, and we also need to define our own functions, so we have to switch to another implementation in the next phase.
It has the important detail of why it scored well in benchmarks against other parsers:
"7.3 Effect of lookahead DFA on performance
The lookahead DFA cache is critical to ALL(*) performance.
To demonstrate the cache’s effect on parsing speed, we disabled
the DFA and repeated our Java experiments. Consider the 3.73s
parse time from Figure 9 to reparse the Java corpus with pure
cache hits. With the lookahead DFA cache disabled completely,
the parser took 12 minutes (717.6s)."
I'd still most probably hand-roll, at least when making something for a long-term project and not some "demo" -- there, stability and ease of maintenance are more important than fast availability of initial results.
What I don't like about any of these parsers, including ANTLR, is that for real languages you get not a clear EBNF grammar, but a terrible mix of declarative and imperative statements. I was pretty sure PEG was the modern way to go, but it looks like it's not. So, what is, then?
I recollect using lex/bison for writing a compiler for COBOL on a mainframe using SAS/C, to list variable names in COBOL programs (remember Y2K!). Later in life I used ANTLR to do something fun in Java. Never thought of comparing them. They both have their best parts. If you are coming from the lex/bison world, you will understand the philosophy behind ANTLR. But there is a bit more of a learning curve with ANTLR; it felt slightly complicated.
If the new thing is actually better, you should not have to spend the bulk of an article tearing down the old thing to make the new thing look better in comparison.
That ANTLR example with the sum does little to convince me that the code should be separate from the grammar. It being separate means that you need to look at the grammar to know what tokens you have access to anyway.