CS143: Compilers (2011) (keithschwarz.com)
139 points by rspivak on March 13, 2016 | hide | past | favorite | 62 comments


Also, take a look at http://www.keithschwarz.com/interesting/

It is an amazing archive of all kinds of neat algorithms and data structures.


I'm pleasantly surprised. For academic work, this code is amazingly clean.


I took that class on the Stanford campus in the Summer of 2011. Keith was one of the most engaging instructors I've ever had. His enthusiasm and knowledge were boundless.

The course was video-recorded and I downloaded and saved most of the lectures (and I gave Keith a DVD with a copy of them.) I still have them, but I don't know if it's okay to put them on YouTube or not.


Of course it is not okay unless you have permission.


Currently taking CS 103 (Mathematical Foundations of Computing) under Keith at Stanford. Truly one of the best taught classes I've ever taken - the man has a love for teaching Computer Science that really makes all the difference.


Out of curiosity - the Mid-term exam paper says:

This midterm exam is open-book, open-note, open-computer, but closed-network. This means that if you want to have your laptop with you when you take the exam, that's perfectly fine, but you must not use a network connection. You should only use your computer to look at notes you've downloaded from online in advance. If you are an SCPD student, you should take this exam in any two-hour period between 11:00 AM Pacific time, July 20 and 11:00 AM Pacific time, July 21 . If you have any questions, please feel free to call me at _ any time in this window except between midnight and 8AM on July 21, Pacific time. You may submit the exam either by faxing it to _ or by scanning and emailing it

I'm not familiar with how these things work in the US - are the students trusted to take the exam away and not use the internet to get help?


At Stanford, faculty are prohibited from proctoring. So there is no professor or grad student that sits in the exam hall with the students while they are taking the test:

"The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty mentioned above."

https://communitystandards.stanford.edu/student-conduct-proc...

A professor or grad student walks into the room every 30 minutes or so just to make sure everything looks okay. They otherwise sit outside the exam hall so you can ask them questions, ask to go to the bathroom, etc. But it's up to the students to police each other and ultimately themselves.

There are definitely other universities in the US that have this policy, but I would say it's more the exception than the norm.


That's really interesting. I did a CS degree in the UK and all exams were strictly supervised and we were never allowed computers.

The exception to this was programming exams, for which the department would boot all the lab workstations into a locked-down mode with no internet and a fresh homedir containing only the exam files. At the end of the exam everything got collected and marked by an automated test suite.

They went to quite some effort to ensure that no cheating could take place - http://pubs.doc.ic.ac.uk/Lexis/Lexis.pdf


I can't speak for other schools in the US but I am an undergraduate in the CS department at Stanford, and it's fairly common to have an open-computer test to refer to notes and the textbook. Other students sit around you and it's most likely fear of another student approaching a TA that makes this enforceable.


I've got a small collection of compiler resources on Github:

https://github.com/melling/ComputerLanguages/blob/master/com...

I'll add more as they are posted.


I'm taking my university's offering of this course in my final semester. It's fantastic. I'm using all of the separate skills I've developed over the last four years: mathematical thinking (via parsing), algorithms and data structures (ASTs), programming languages, software design. It's also the first course where big design decisions are non-trivial and really affect things later down the pipeline, an experience that is rare in school but, from what I can tell, ubiquitous in the field. I really recommend it to any students out there.


I'm going to be taking a Compilers course next quarter and I was speaking to the prof, who told me that she doesn't plan on using the Dragon book; instead she plans to use the Louden book, which seems a lot more approachable in a single quarter.

I was lucky enough to buy a used copy on amazon for $8.

http://www.amazon.com/Compiler-Construction-Principles-Kenne...


Also, take Niklaus Wirth's PDF and follow how to write an Oberon-0 compiler end to end.

http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

It is a very instructive process.


This is a really neat book. The fact that it builds a complete compiler and virtual machine in ~100 pages is really amazing.


I remember, while learning data structures and implementing them, I landed on http://www.keithschwarz.com/interesting/. It was enlightenment for me. By browsing through these examples I learned about good programming practices. Most importantly, how commenting your code can help you even when you return after a few years to fix something in it.


Took an algorithms course with Keith a few years ago. Really patient, engaging lecturer!




There's self-reported data that some Stanford students scraped from CourseRank before it got shut down. http://www.stanfordrank.com/cs143


And again, 90% about parsing, and a couple of words about "code optimisation". That pitiful so-called "Dragon Book" really keeps taking its toll of souls.

I'm terribly sorry to break the news for you, but no, compilers are not about parsing. Nowhere near that.


Compilers are complicated things -- too much to attempt to cover in a single quarter.

There is a followup course, CS243 which focuses on the optimization aspect.

http://suif.stanford.edu/~courses/cs243/


Right, but while parsing is important, it's the least interesting aspect of implementing a compiler. It would be better if more time were spent on starting with an AST and generating code. Then, at the end of the course, implement a parser to generate the AST.


That's exactly how I used to teach compilers. But the weaker students had a really hard time understanding the purpose of ASTs ("why don't you give the program to the code generator as a string?") no matter how often I pointed out why ASTs are a better data structure than strings from the POV of efficient code generation. And some students complained about this "unnatural" sequence of explaining compilers in the course evaluations. So I changed to teaching compilers in the standard order, lexing -> parsing -> type-checking -> code gen -> optimisation. No complaints any more.
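To make the "ASTs are a better data structure than strings" point concrete, here is a minimal sketch (all names and the stack-machine instruction set are invented for this illustration): code generation from an AST is a short structural recursion, whereas generating from a raw string would force the generator to re-discover the program's structure first.

```python
# Hypothetical example: emit stack-machine code from an expression AST.
# Integers are literals; tuples are (operator, lhs, rhs) nodes.
def codegen(node, out):
    """Emit stack-machine code for an expression AST, post-order."""
    if isinstance(node, int):
        out.append(("PUSH", node))
    else:
        op, lhs, rhs = node
        codegen(lhs, out)           # structure is explicit in the tree,
        codegen(rhs, out)           # so traversal order is trivial
        out.append(("ADD",) if op == "+" else ("MUL",))
    return out

# (1 + 2) * 3 as an AST:
print(codegen(("*", ("+", 1, 2), 3), []))
# [('PUSH', 1), ('PUSH', 2), ('ADD',), ('PUSH', 3), ('MUL',)]
```

Handing the same program to the code generator as the string `"(1 + 2) * 3"` would require re-parsing inside the generator, which is exactly the confusion the tree avoids.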

There is another dimension to this. While code gen and optimisations are more interesting than lexing/parsing, almost no student will ever again write a code generator, let alone optimisations. OTOH, most programmers need to write little lexers and little parsers for little languages all the time. That's why a bit of facility with lexing/parsing and associated tools is highly useful.
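The kind of "little lexer and little parser" meant here fits in a screenful. A hedged sketch for a toy expression language (the token format and function names are illustrative, not from any real tool):

```python
import re

# Lexer: numbers and single-character operators, whitespace-separated.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def lex(src):
    """Split source into ('NUM', int) and ('OP', char) tokens."""
    tokens = []
    for num, op in TOKEN_RE.findall(src):
        tokens.append(("NUM", int(num)) if num else ("OP", op))
    return tokens

def parse(tokens):
    """Recursive descent for: expr := term (('+'|'-') term)* ; term := NUM"""
    pos = 0
    def term():
        nonlocal pos
        kind, value = tokens[pos]
        assert kind == "NUM", "expected a number"
        pos += 1
        return ("num", value)
    node = term()
    while pos < len(tokens) and tokens[pos][1] in "+-":
        op = tokens[pos][1]
        pos += 1
        node = (op, node, term())   # build an AST node, not a string
    return node

print(parse(lex("1 + 2 - 3")))
# ('-', ('+', ('num', 1), ('num', 2)), ('num', 3))
```

The same shape (a token stream plus one function per grammar rule) carries over to config formats, query languages, and the other "little languages" working programmers keep meeting.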


You should have done it in a homoiconic language like Scheme, where nobody would be puzzled by a notion of an AST.


Then my students would have been puzzled by Scheme, which they don't know.

Incidentally, Scheme has a syntax too, so you need lexing and parsing to handle Scheme. The only advantage is that Scheme's syntax is simple.

Moreover, the problem of language representation doesn't go away. You still need to represent programs. The real problem is to understand that there are two levels: programs and representations of programs. Not confusing the two is the real challenge, and it remains a challenge in Scheme.


Sorry, I'm under a slow-ban here (currently like 2 posts in 6 hours allowance), so I'll answer to all your posts at once here.

> Then my students would have been puzzled by Scheme, which they don't know.

Maybe they're not ready for a compilers course then? You could have saved a lot of time and effort for them this way.

> Incidentally, Scheme has a syntax too, so you need lexing and parsing to handle Scheme.

It can have as much syntax as you want. Nobody cares - there is already a parser, ready to use, and you don't have to roll out your own.

> The real problem is to understand that there are two level: programs and representations of programs.

This is the most important takeaway from a compilers course. It makes sense to get there as soon as possible.

Also, I'm a bit skeptical about your assessment of your students, from your words it sounds like you're teaching in a lunatics asylum.

> https://news.ycombinator.com/reply?id=11281976

> you need a platform independent memory model

Sorry, this is wishful thinking.

A truly efficient concurrent language is always very much platform-specific (think of Occam on one side of the spectrum and OpenCL and Cuda on another).

> nobody really knows what exactly the memory model is

Really? I've seen a number of Coq/HOL/ACL2 precise memory models of certain architectures.

> https://news.ycombinator.com/reply?id=11282014

> idealised compilers have a beautiful pipeline structure

A very practical compiler can also be built as a beautiful and simple pipeline.

> By this reasoning, any computable problem is trivially simple, because I can translate it into ARM/x86/MIPS/... machine code, and machine code is just a linear list of trivial commands, i.e. I can "turn it into a laughably simple chain".

You did not get it at all.

You cannot translate "any code" into a linear trivial sequence with each step being nothing but some primitive form of term rewriting of a single input tree into a single output tree.

A trivial proof for how badly wrong you are: you do not need a Turing-complete language to write a compiler. Now try to apply this to your "any computable problem" reasoning.

> if you chain enough simple things, the result stops being simple.

Incorrect. Such a chain is as simple as its most complex building block. Complexity does not add up.

> Look for example at modern JIT compilers, tracing or otherwise.

Firstly, we're talking about proper compilers here. JIT is a totally different beast; it has a very severe performance constraint.

Secondly, even JIT can be structured this way, and most of the existing JITs are idiotically, needlessly overcomplicated.

> They are not simple by any stretch of the imagination.

Sorry, but no, they are.

Yes, the non-Turing-complete property breaks there, as well as with any other interpreter, but a Turing-complete subset of a JIT compiler can be very easily isolated.

> https://news.ycombinator.com/reply?id=11279390

> Optimisations are easy only if you can accept the optimisation process to be slow, and/or the resulting code buggy.

And this is nothing but FUD.

Firstly, nanopass compilers are not any slower than fused ones. In some cases they're faster, due to some of the nice properties of immutability. Also, you can fuse some passes automatically, while still keeping your code trivial.

Secondly, the simpler your passes are, the easier it is to reason about their correctness. Compare it to the notoriously shitty instcombine pass in LLVM, as it is an example of the opposite approach (the one you're apparently advocating).


> slow-ban

I, too, can't post as many messages as I'd like. Not sure why, or how to change it, or how even to see if there's some limit imposed on me. Maybe < 1000 rep users are throttled?

> Maybe they're not ready for a compilers course then?

We've been teaching compilers to 2nd years ever since the department was founded as far as I'm aware. The 2nd year is the right time to see how the programmer's most important tool works.

> you don't have to roll out your own.

I want students to write their own lexer and parser, and familiarise themselves with lexer/parser generators. This is an important part of a programmer's education. First, it lets them see the purpose of all that formal language theory they had to do in the first year. Second, and more important, working programmers have to design languages and write lexers/parsers all the time. A lot of commercial programming is about transforming data from one representation to another. In contrast, few programmers will ever write a real code generator.

> you're teaching in a lunatics asylum

I teach at a large, well-known and decidedly mediocre university. The kind of place that most working programmers graduate from. Needless to say, this is perfectly compatible with being a lunatics asylum.

> truly efficient concurrent language is always very much platform-specific

Sure, but so what? That's like saying for truly efficient concurrent programming you need to program in assembly, because that's the only way to exploit all platform specific trickery.

The purpose of high-level languages is to abstract away from the target platform. The additional goal of languages like C is to abstract as little as possible, so as to be platform independent while still allowing fast executables.

In practice, compilers for high-performance languages like C/C++ and Rust need to balance 3 contradictory goals:

- Platform independence.

- Fast executables.

- Low cost of compiler development and evolution.

Note that optimisations are by far the most expensive part of a compiler. Memory models are used in this balance: you want MMs that are platform agnostic, while still enabling powerful compiler optimisations. The CPU vendor has two more objectives (which are closely related to platform agnosticism):

- Not giving away too many CPU internals.

- Allowing CPU evolution.

> Coq/HOL/ACL2 precise memory models

Are they low-level models of CPUs that can also be used to reason about read/write reordering? Or are they genuine memory models, coming from the CPU manufacturers?

As far as I'm aware, the construction of memory models these days is an empirical science like zoology, where people run experiments to confirm hypotheses about CPU behaviour. See e.g. [1].

> A very practical compiler can also be built as a beautiful and simple pipeline.

Yes, and GHC is probably an example of this. But GHC is not simple by any stretch of the imagination.

M. Odersky once gave a talk (sorry, no reference handy) where he said that scalac, the Scala compiler, is no longer a pipeline, but has a database structure. That's because scalac's many stages cannot just live on the data fed by the immediately preceding stage, and sometimes need to go back several stages. Interesting.

> you do not need a Turing-complete language to write a compiler.

Type checking / inference in Haskell, Scala and C++ (among others) is Turing complete. Haskell's compile-time meta-programming extension Template Haskell, or scala.meta, lets you perform arbitrary computations as part of the code generation process, by recursively invoking the compiler itself.

On top of this, the control of a Turing machine is a finite state automaton. So it's very simple too.

Finally, I don't see the significance of Turing completeness here. You can write highly complicated programs in small fragments of Turing complete languages (e.g. the Calculus of Constructions). This is the point of "Total Functional Programming" [2]. Indeed, almost everything the typical working programmer writes is primitive recursive. Why do we have Turing complete programming languages? Because using just primitive recursion is really inconvenient. Try writing a trivial 4-line program like the ancient algorithm computing the greatest common divisor using just primitive recursion [3]. It's already annoying. So unrestricted recursion is a convenience, making programs more readable and shorter. Imagine you had to write type inferencers, graph-colouring based register allocators or program analyses based on fix-point iteration using just primitive recursion. Ugh!
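The GCD contrast above can be sketched in a few lines (a hedged illustration; the bounded variant is just one way to simulate primitive recursion's requirement that the number of iterations be fixed in advance):

```python
# Euclid's algorithm with unrestricted (general) recursion: one line of logic.
def gcd(a, b):
    return a if b == 0 else gcd(b, a % b)

# A primitive-recursion-style version must bound the number of steps up
# front, since the recursion cannot be justified by "b gets smaller" alone.
# The bound obscures the algorithm, which is exactly the inconvenience
# being described.
def gcd_bounded(a, b):
    for _ in range(a + b + 1):   # a safe upper bound on Euclid's steps
        if b == 0:
            return a
        a, b = b, a % b
    return a

print(gcd(12, 18), gcd_bounded(12, 18))   # 6 6
```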

> some primitive form of term rewriting of a single input tree into a single output tree.

Term rewriting is Turing complete. So yes, you can transform any compiler into a term rewriting system. So what? If you transform a sophisticated, industrial-strength compiler like GHC or scalac into a TRS, you end up with a very complex TRS that is difficult to understand. You cannot hide intrinsic complexity. You can only kick the can down the road, and change the form in which the complexity manifests itself.

> Such a chain is as simple as its most complex building block.

Consider the Turing complete language {S, K} of combinators. By your reasoning any program written as SK combinators should be simple, because the building blocks are trivial.

> JIT is a totally different beast

Not really. In a way, JITs are simpler than AOTs in that the optimisations they run are simpler (e.g. register allocation by linear scan rather than graph colouring), and tracing JIT optimisations are simpler still, since the traces they work on have linear control flow, which drastically simplifies the optimisation. Indeed, that's the point of traces. The main complication of JITs is the seamless switching between interpretation and execution of compiled fragments.

> existing JITs are idiotically, needlessly overcomplicated.

Extraordinary claims require extraordinary evidence.

The JIT teams at Oracle, Google, or Microsoft would like to hear from you if you can lower the cost of JIT compiler construction.

> nanopass compilers are ...

I have nothing against nano-pass compilers, and using a lot of immutable data structures is fine. IIRC ocamlopt is heavily based on immutability, and it's one of the fastest compilers around. Whether nanopass is better for formal reasoning about compiler correctness remains to be seen. In formal reasoning it doesn't actually matter that much whether information is passed explicitly by argument or kept in global state, since either way your logic has to keep track of it.

But whether nano- or otherwise, the complexity is intrinsic in the problem of compilation, which is about bridging the semantic gap between two languages. And the bigger the semantic gap, the more complicated the task.

[1] J. Alglave et al., Herding cats: Modelling, Simulation, Testing, and Data-mining for Weak Memory. http://www0.cs.ucl.ac.uk/staff/J.Alglave/papers/toplas14.pdf

[2] D. A. Turner, Total Functional Programming. http://sblp2004.ic.uff.br/papers/turner.pdf

[3] R. Harper, Old neglected theorems are still theorems. https://existentialtype.wordpress.com/2014/03/20/old-neglect...


> Needless to say that this is perfectly compatible with being a lunatics asylum.

I see... Yes, I've been exposed to the quality of the intake of the so-called "new" universities (in the UK). I just did not think they were learning anything beyond Excel.

> I want students to write their own lexer and parser, and familiarise themselves with lexer/parser generators. This is an important part of a programmer's education.

I agree. It should follow the compilers course, once everything of importance there is properly absorbed.

> In contrast, few programmers will ever write a real code generator.

And this is where I cannot agree at all. I believe that everyone must build multiple compilers, routinely, on a daily basis.

The reason to think so is:

1) DSLs are the ultimate answer to the complexity problem. And compilation is the best way to implement them easily.

2) Compilers are often present where you won't even expect to find anything like compilation: protocols, UIs, business logic, databases, whatever. And, given that a compiler can be reduced to trivial complexity (we'll get back to it later), it is extremely important to be able to spot such patterns early and convert your architecture to a compiler.

And, of course, all such incidental compilers have absolutely nothing to do with any kind of parsing. Their source languages are already data structures (even graphs), almost never a plain boring stream of characters.

> The purpose of high-level languages is to abstract away from the target platform.

Impossible if your target platforms' concurrency models are incompatible. You cannot fit message passing into SPMD, and vice versa.

> Note that optimisations are by far the most expensive part of a compiler.

Not necessarily. For C++, for example, all the optimisations, even with polyhedral vectorisation, abstract interpretation, partial specialisation and all the bells and whistles, are nothing compared to mere template expansion. See the clang+llvm profiling data.

> Or are they genuine memory models, coming from the CPU manufacturers?

Not "coming" from anywhere, unfortunately. They're all internal models, not available even under an NDA.

> But GHC is not simple by any stretch of the imagination.

It has quite a few gotchas, to be honest. I'd definitely design it differently. And Haskell is not the best language for this sort of thing - see the "expression problem". The Scrap Your Boilerplate library goes some way towards fixing these issues, but it is not used inside GHC.

> That's because scalac's many stages cannot just live on the data fed by the immediately preceding stage, and sometimes need to go back several stages. Interesting.

It is exceptionally over-engineered. But, yes, there is a reason to have backtracking in a compiler pipeline, while still keeping it linear (at the end of the day it is linear, but it may branch and go back to try out certain hypotheses). I've been experimenting with this approach a lot. It does not increase complexity in any way.

> Type checking / inference in Haskell, Scala and C++ (among others) is Turing complete.

Firstly, for most of your DSLs you don't need such a type system. Secondly, all the implementations I've seen are extremely overengineered, and I've got no idea why.

In fact, any Hindley-Milner-like type system is laughably trivial, and the Turing-complete part of it is very well isolated (and you don't have to code it, just use an existing library). Hindley-Milner unification is no different from Prolog unification, so my way of implementing these type systems is to simply compile the AST into a flat list of Prolog equations (see the recent history of my posts for this method explained in depth).

> So it's very simple too.

It's not. You cannot reason about it.

> This is the point of "Total Functional Programming"

Exactly. You don't need anything beyond a simple total language for your entire compiler (or, if you're implementing some metaprogramming, abstract interpretation or a Turing-complete type system, for 99.99% of your compiler, with the rest being well-isolated and reused over and over again).

> Term rewriting is Turing complete.

You only need a subset of term rewriting, which you can express in a total language.

> In a way, JITs are simpler than AOTs in that the optimisations they run are simpler

Yet, they're interpreters, especially the tracing ones. And interpreters are inherently orders of magnitude more complex than compilers.

> By your reasoning any program written as SK combinators should be simple, because the building block are trivial.

Not really. You're combining them in a deep, horrible tree, not a linear pipeline of isolated things.

> Extraordinary claims require extraordinary evidence.

I do not see anything extraordinary in the obvious notion that corporate coding is thoroughly broken. It's common knowledge.

> In formal reasoning it doesn't actually matter that much if information is passed explicitly by argument, or is global state, since either way, your logic has to keep track of it.

It matters. If all you have is a language as input and another language as output, without any global state, then every chunk of your pipeline is isolated and you can comprehend it without thinking about the rest. That's exactly why the complexity of the passes does not add up, and the total complexity is only the complexity of the worst of the passes.

> the complexity is intrinsic in the problem of compilation

I've never seen anything complex there. After over 20 years in the field.

> which is about bridging the semantic gap between two languages.

It's not a chasm, you don't have to cross it in one step. Just build a hundred trivial passes in between, one step at a time.

> And the bigger the semantic gap, the more complicated the task.

No, complexity does not grow. Mind giving an example of the opposite?

P.S. Also, one of the nice properties of the compilers is that you can abstract out and reuse a lot of routine stuff. Register allocation? Write it once and reuse everywhere. SSA? Implement it once and reuse for any imaginable IR. Typing? See above. Constant folding, ADCE, inlining, etc? Write it once and reuse for all the possible IRs.


My point is exactly the opposite - compilers are easy, probably easier than pretty much anything else. And courses like this do not help by pretending this is a complex topic and by obscuring what matters with unimportant minutiae (like parsing and such).


   compilers are easy
No.

Modern, industrial-strength compilers are some of the most complicated pieces of technology we have.

Indeed, almost every part of the compiler pipeline, in particular parsing, type inference, and optimisation, is an active area of research.


None of what you said means writing compilers is hard: just that the first compilers and techniques were hard to develop. Once that's done, it's another building block that can be chained together to solve similar problems. Using an ML or a LISP with the right primitives makes that even easier. Designing the right language and implementing it modularly makes it easier still.

So, I'd say the industrial ones are as tough as they are because they support hard-to-compile languages with many dark corners, while being implemented in those same languages themselves. Make the problem hard for yourself and it's hard. Make it easier and it's... still work, but only a tiny fraction of what you alluded to.


It's far from trivial, but it is presented as more complex than it should be. The parsing point is valid: once in AST land, the rewrite/transform/optimise part, which could be seen as the most important one, seems more enjoyable and composable than the parsing stage, which is often covered in great detail (courses making you write BNF state matrices?).


Compilers are not really complicated compared to plenty of other hard topics. They're large, but they always separate out into different passes that you just run one after the other. You don't have to manage the program as a whole.

And optimizations don't even matter anyway. x86 runs your -O1 program as fast as your -O3, and nobody has put a real accurate model of an x86 into a production compiler for the last decade.


    nobody has put a real accurate 
    model of an x86
That's a pretty good indication that there is a big problem. Deep at the heart of it is that processor designers don't quite understand the performance characteristics of their processors any more, so they cannot produce e.g. an abstract memory model that compiler writers could base their optimisations on.

Optimisations for parallel processing, e.g. auto-parallelisation are not well developed either.


Oh, I think Intel knows it, but they consider it too much of a trade secret to even tell the compiler team about it.

There are actually numerous performance counters exposed by the CPU too, which are good for dynamic benchmarking, but hardly anyone cares to use them.

BTW, even though the ALU part of the CPU is extremely wide and dynamic, you still have to fit your instructions through the x86 decoder at the front, and it's worth writing a scheduler just to model that part.


Why would they keep it secret?

I understand that they need to keep the low-level details of their CPU secret, but a memory model? All it says is when/how reads and writes commute. If a CPU manufacturer were to provide a nice, high-level memory model, this would make the CPU more sellable. Why? Because compiler writers could optimise more aggressively when compiling for the CPU.

I suspect that the reason they don't give out memory models is that they don't know how to do it. That's also the vibe I've been getting when I talked to people who would know.


Tbh, it is really hard to have an "accurate" model of a very wide OoO core. It is far too dynamic; your instruction scheduler cannot predict the occupancy of the pipelines.


In order to determine the meaning of a programming language, you need a platform independent memory model. This is increasingly important as CPUs are more and more concurrent and programming languages are exposing some of that concurrency to the programmer.

Of course the language designer can always play it safe and choose a ridiculous memory model like sequential consistency, but that prevents many useful compiler optimisations. At the other extreme you could explicitly state that the memory model is a specific processor, but then

(a) the language becomes too platform dependent and program semantics will change with every processor upgrade.

(b) nobody really knows what exactly the memory model is in any meaningful sense because even the manufacturers don't know (other than saying: it is what the CPU does). That means the semantics of the language is unclear.

What you want is something between the Scylla of inefficiency and the Charybdis of platform dependence. This is a difficult problem, and unsolved even for as simple a language as C, see e.g. [1]. Indeed it is unclear if memory models possessing the required attributes even exist, maybe it's contradictory to assume that they can.

[1] M. Batty et al., The Problem of Programming Language Concurrency Semantics. https://www.cl.cam.ac.uk/~jp622/the_problem_of_programming_l...


Again, compilers are easy, easier than anything else.

If the entire thing is "complicated" (although it should not be), it can still be broken into a chain of laughably trivial passes. This is the most beautiful property of compilers - the ability to break complexity all the way down to trivial.

If you do not agree for some reason just name a part of a compiler you perceive as inherently complex, and I will show you how to turn it into a laughably simple chain.
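As a hedged illustration of the "chain of trivial passes" claim: each pass is a small tree-to-tree rewrite, and the compiler is just their composition. The IR (ints for constants, strings for variables, 3-tuples for operations) and the pass names are invented for this sketch, not taken from any real compiler.

```python
def fold_constants(node):
    """Replace (op, const, const) subtrees with their computed value."""
    if isinstance(node, tuple):
        op, l, r = node[0], fold_constants(node[1]), fold_constants(node[2])
        if isinstance(l, int) and isinstance(r, int):
            return {"+": l + r, "*": l * r}[op]
        return (op, l, r)
    return node

def strength_reduce(node):
    """Rewrite x * 2 into x + x (a classic trivially small pass)."""
    if isinstance(node, tuple):
        op, l, r = node[0], strength_reduce(node[1]), strength_reduce(node[2])
        if op == "*" and r == 2:
            return ("+", l, l)
        return (op, l, r)
    return node

def compile_expr(ast, passes=(fold_constants, strength_reduce)):
    for p in passes:          # the whole "compiler" is a linear pipeline
        ast = p(ast)
    return ast

print(compile_expr(("*", "x", ("+", 1, 1))))   # ('+', 'x', 'x')
```

Each pass can be read, tested, and reasoned about in isolation, which is the property the comment is arguing for; whether that keeps total complexity bounded is exactly what the thread disputes.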


I agree with you that idealised compilers have a beautiful pipeline structure. Finding this structure is one of the great achievements of the early computing pioneers.

   turn it into a laughably simple chain.
By this reasoning, any computable problem is trivially simple, because I can translate it into ARM/x86/MIPS/... machine code, and machine code is just a linear list of trivial commands, i.e. I can "turn it into a laughably simple chain".

My point is: if you chain enough simple things, the result stops being simple.

Look for example at modern JIT compilers, tracing or otherwise. They are not simple by any stretch of the imagination. To see that just look at a program, and predict how long the warm-up will take. Whoops ... unknown [1].

[1] E. Barrett et al., Virtual Machine Warmup Blows Hot and Cold. http://arxiv.org/abs/1602.00602


True, but parsing is still a fascinating and active research subject. It should really be its own course, perhaps taught using Grune & Jacobs [1].

[1] http://www.springer.com/fr/book/9780387202488


That's true. There continues to be interesting work there. Most people don't want to improve parsers, though. They just want to translate one programming language to another form. So our focus should be on that stuff by default, instead of the part that is mostly solved for most use cases, even with automated tools.




I forgot how beautiful this was.


Yeah, and apparently the teachers still haven't heard about ANTLR, JavaCC/JJTree, SableCC, parsing expression grammars or attribute grammars.

How come they are still teaching only yacc/lex as parser generators in 2016?!


I agree that it's also better to keep the parsing part small and focus on the more "interesting" things like code generation and optimisation. Recursive descent/precedence climbing seems to be the preferred parsing technique in a lot of production-quality, modern compilers, and I'd argue that knowing about RD is more than sufficient for most of what anyone wanting to play around with compilers needs.

Niklaus Wirth would probably agree.
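For anyone unfamiliar with the technique being discussed, a minimal recursive descent parser looks something like this (Python; the grammar here is hypothetical): one function per grammar rule, each consuming tokens and returning an AST node.

```python
# Grammar (illustrative):
#   expr   -> term ('+' term)*
#   term   -> factor ('*' factor)*
#   factor -> NUMBER | '(' expr ')'

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        t = tokens[pos]
        if expected is not None and t != expected:
            raise SyntaxError(f"expected {expected}, got {t}")
        pos += 1
        return t

    def expr():
        node = term()
        while peek() == "+":
            eat("+")
            node = ("+", node, term())
        return node

    def term():
        node = factor()
        while peek() == "*":
            eat("*")
            node = ("*", node, factor())
        return node

    def factor():
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        return int(eat())

    return expr()

print(parse(["2", "*", "(", "3", "+", "4", ")"]))  # ('*', 2, ('+', 3, 4))
```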


(I've never written an optimizing compiler).

I actually think many useful languages are mostly syntax. If you compile to another source language (like C or SQL) you might not need much optimization.

Parsing some syntaxes is difficult but that doesn't imply we should just ignore it and let libraries handle everything. Thinking about parsing is an opportunity to learn about tradeoffs and to develop a good taste designing syntaxes which are easy to parse (and, possibly correlated, manipulate) both for computers and humans.


Most languages are transforming one layer of semantics into another. Optimisations are pretty much irrelevant in this notion.

Parsing in most cases should either be trivial, or can be skipped altogether.


You have to start somewhere. Code optimization is utterly useless if you can't parse code.


> You have to start somewhere. Code optimization is utterly useless if you can't parse code.

It's absolutely not uncommon for architecture experts to plug into the optimization stage of an existing compiler; expecting them to understand the first stages is not useful at all. They need to understand the AST / IR, but not how to arrive at it.


So write up a grammar, feed it to a parser generator like GOLD, run the result through a premade parser engine, and BOOM, you're done. Then you can start on the stuff in the middle that really needs more people working on it.


The problem is that writing a good grammar requires understanding of parsers.

(Also, real parsers tend to be hand-written, judging from open source language implementations (some have actually migrated from generated parsers to recursive descent in the past). There must be a reason).


The reason basically boils down to generated parsers being very brittle. If you need to tweak it slightly (for example, to handle errors on missing semicolons better), it's difficult or maybe even impossible to get the generated parser to do it well.


The historical reason was error recovery and error messages. It is a solved problem now, you can have it all with generated PEG parsers.


Alarm bells ring in my head whenever someone claims "it's a solved problem" in the context of an engineering (as opposed to mathsy) problem. Unless the context is really narrow (and the problem becomes basically a math problem) that claim is usually wrong.

Thanks for pointing to PEGs. I will definitely try them out. But judging from a quick web search, it's obvious that there are perfectly valid reasons not to use a PEG grammar today -- unless someone already designed the grammar for you (and it's PEG!). http://lambda-the-ultimate.org/node/3039#comment-44356 http://stackoverflow.com/questions/1857022/limitations-of-pe...

Those of us who care about syntax (that is optimized for the user, not the programmer), will have to keep thinking about parsers for the foreseeable future.


PEG is basically a thin abstraction over recursive descent parsing. Everything you can do in a handwritten parser can be expressed mostly declaratively in PEG.

And, no, there is no single reason not to use PEG. Especially when the choice is between a handwritten parser and a PEG. There is a lot of stupid FUD about PEGs, you should ignore it. Your links are pointing to the ignorant, incompetent mumbling of those who never tried implementing a proper PEG.

And I never met a language for which writing a PEG parser was not totally trivial. A pro tip: use PEG alongside a Pratt parser for the binary expressions.
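To illustrate the Pratt half of that tip, here's a rough sketch (Python; binding powers are illustrative, not canonical): a single function climbs precedence levels instead of needing one grammar rule per level.

```python
# Binding powers per operator; higher binds tighter.
BP = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse_expr(tokens, min_bp=0):
    # assume the expression starts with a number (no unary/parens here)
    lhs = int(tokens.pop(0))
    while tokens and tokens[0] in BP and BP[tokens[0]] >= min_bp:
        op = tokens.pop(0)
        # +1 makes equal-precedence operators left-associative
        rhs = parse_expr(tokens, BP[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs

print(parse_expr("1 - 2 + 3 * 4".split()))
# ('+', ('-', 1, 2), ('*', 3, 4))
```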


Quite a few very usable languages are just fine without any dedicated syntax at all. Think of Lisp and Forth, for example.

And the backend is not just about optimisations; they're mostly insignificant. The backend is about gradually translating one semantics into another. An example of such is compiling a regular expression into a DFA and then into a C array and a switch.
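The regex-to-table step can be sketched like this (Python standing in for the C array-and-switch; the table is hand-written here, whereas a real backend would derive it from an NFA): a DFA for a(b|c)* as a transition table plus a trivial driver loop.

```python
# Transition table for the regex a(b|c)*: state -> {input char -> next state}
DFA = {
    0: {"a": 1},
    1: {"b": 1, "c": 1},
}
ACCEPTING = {1}

def matches(s):
    state = 0
    for ch in s:
        state = DFA.get(state, {}).get(ch)
        if state is None:      # no transition: reject
            return False
    return state in ACCEPTING

print(matches("abcb"), matches("ba"))  # True False
```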


I hope not all future languages will have regular expression, or Forth, syntax...


I am talking of the languages that are built on top of Lisp/Forth. I.e., you do not really need any syntax if your underlying implementation language is homoiconic.


I believe the most difficult part of a compiler/language is actually the runtime environment, particularly when it contains a concurrent garbage collector. Code optimisation is easy compared to that, especially if done a priori.


Optimisations are easy only if you can accept the optimisation process being slow, and/or the resulting code being buggy.


The second half of the 2nd edition of the Dragon Book is pretty good, especially if you think of it as learning math.



