Lexers, parsers, intermediate representations, optimisation, code generation. The Dragon Book. LLVM. JIT. From PCC to GCC to Clang — the long apprenticeship of translating one language into another.
A compiler is a program that translates a program. The input is source code in one language; the output is an executable program — usually in machine code, sometimes in another high-level language, sometimes in bytecode for a virtual machine.
The translation is not literal. A good compiler restructures the program along the way: laying out memory, allocating registers, eliminating dead code, unrolling loops, fusing operations into vector instructions. The output is faster than a literal transcription would be — sometimes orders of magnitude faster.
Compilers are the unsung infrastructure of computing. Every operating system, every browser, every game engine and database, was produced by a compiler. The history of programming languages is, in large part, the history of compilers learning new tricks.
The classical pipeline, the one taught in every undergraduate course:
The lexer (or scanner) reads characters and emits tokens — keywords, identifiers, literals, punctuation. The parser reads tokens and builds an abstract syntax tree. The semantic analyser resolves names, checks types, and decorates the tree with attributes. The intermediate representation (IR) is a language-neutral and target-neutral form on which optimisation is performed. The code generator walks the optimised IR and emits assembly or machine code for a specific architecture.
Real compilers add stages: macro expansion, monomorphisation, register allocation, instruction scheduling, link-time optimisation. The five-phase model is a teaching abstraction. The substance is in the details of each stage.
The first compiler — by anyone's loose definition of the word — was Grace Hopper's A-0 System for the UNIVAC I, written in 1951–52. A-0 was really a linker: it concatenated assembly subroutines from a tape library according to a pseudocode-driven recipe. Translation in the modern sense came later.
Hopper's larger contribution was FLOW-MATIC (1955), a business-oriented language whose syntax was deliberately English-like. It directly inspired COBOL, of which Hopper was a principal architect. Her insistence that programmers should not have to think in machine code is the founding ideology of high-level languages.
Hopper rose to the rank of Rear Admiral and remained an active proselytiser into the 1980s. Her famous nanosecond — a foot of wire, the distance light travels in a billionth of a second — was the most memorable visual aid in the history of computing.
The first real compiler — translating a high-level language into machine code competitive with hand-written assembly — was FORTRAN I, released by IBM in 1957. The team, led by John Backus, spent eighteen man-years on the project. It nearly killed them.
Backus's bet was that scientists would tolerate a high-level language only if the generated code was fast. So the FORTRAN I compiler was built around a then-radical optimiser: flow analysis, register allocation, basic loop optimisation. The result ran within ten percent of hand-tuned assembly on the IBM 704 — performance that converted skeptics overnight.
FORTRAN was an instant hit and made the case for compiler-based language design irreversibly. Backus would later receive the 1977 Turing Award for it, and (separately) for inventing BNF, the standard notation for context-free grammars.
The lexer reduces a stream of characters to a stream of tokens. if (x <= 10) becomes [IF, LPAREN, IDENT("x"), LE, INTLIT(10), RPAREN]. The lexer also discards whitespace and comments and tracks line numbers for error reporting.
The theory is regular languages. Each token class is defined by a regular expression; the union of these regexes is compiled to a deterministic finite automaton (DFA). The DFA reads the input one character at a time, transitioning between states; on accepting states, it emits a token.
lex (Mike Lesk, 1975) and its modern descendant flex automate this. You write the regexes; the tool emits a C function. The technique is so well understood that lexers are now hand-written mainly for performance, not correctness — Clang's lexer is hand-written, and is among the fastest in any major compiler.
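The same idea, sketched in Python with the standard re module standing in for a generated DFA. The token classes mirror the if (x <= 10) example above and are illustrative only; a real lexer would also track line and column for diagnostics.

```python
import re

# One named group per token class; order matters (keywords before identifiers).
TOKEN_SPEC = [
    ("IF",     r"\bif\b"),
    ("INTLIT", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("LE",     r"<="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"[ \t\n]+"),   # whitespace: matched, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(src):
    """Yield (kind, text) pairs; reject anything the token spec does not cover."""
    pos = 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character {src[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(lex("if (x <= 10)")))
# [('IF', 'if'), ('LPAREN', '('), ('IDENT', 'x'), ('LE', '<='), ('INTLIT', '10'), ('RPAREN', ')')]
```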
A grammar is a finite set of rewriting rules. Backus–Naur Form — invented for ALGOL 60 by John Backus and Peter Naur — is the standard notation:
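A representative fragment (the textbook expression grammar, not drawn from any real language's standard):

```
<expr>   ::= <expr> "+" <term>   | <term>
<term>   ::= <term> "*" <factor> | <factor>
<factor> ::= "(" <expr> ")"      | <number>
```

Left-recursive rules like these encode left associativity; a top-down parser has to rewrite them first, while a bottom-up parser consumes them directly.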
The grammar generates a language: the set of all strings derivable by repeated rewriting from the start symbol. Parsing inverts this: given a string, find the (unique, ideally) derivation tree.
Context-free grammars are strictly more powerful than regular expressions but strictly weaker than context-sensitive grammars. They cover virtually every programming language's syntax. General CFG parsing runs in O(n³); the LL and LR subclasses, to which almost every real language's grammar belongs, parse in linear time.
Two parsing strategies dominate. Top-down (LL): start from the start symbol, predict productions one at a time. Bottom-up (LR): start from the input, reduce by productions until the start symbol is reached. Both run in linear time on suitable grammars.
LR(1) parsing — invented by Donald Knuth in 1965 — handles essentially all useful programming languages. The tables are large; LALR(1) (Frank DeRemer, 1969) compresses them at the cost of slightly more grammar restrictions, and is what yacc (Steve Johnson, 1975) and its descendant bison generate.
Modern compilers often use hand-written recursive descent (a top-down style) for control over error messages. GCC replaced its yacc-generated C++ parser with recursive descent in 2004 and its C parser soon after; Clang has used recursive descent from the start. The gain in error recovery and IDE-friendliness usually outweighs the loss of grammar formality.
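A hand-written recursive-descent parser for the expression grammar sketched earlier, in Python: one parsing function per nonterminal, with the left-recursive rules turned into loops. The token and node shapes are illustrative; real parsers carry source locations for error reporting.

```python
def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        node = ("+", node, rhs)            # left-associative by construction
    return node, pos

def parse_term(tokens, pos):
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        rhs, pos = parse_factor(tokens, pos + 1)
        node = ("*", node, rhs)
    return node, pos

def parse_factor(tokens, pos):
    if tokens[pos] == "(":
        node, pos = parse_expr(tokens, pos + 1)
        assert tokens[pos] == ")", f"expected ')' at token {pos}"
        return node, pos + 1
    return ("num", int(tokens[pos])), pos + 1

ast, _ = parse_expr("2 + 3 * 4".split())
print(ast)   # ('+', ('num', 2), ('*', ('num', 3), ('num', 4)))
```

Precedence falls out of the grammar's shape: expr handles +, term handles *, factor handles parentheses and literals.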
The parser's output is rarely the parse tree — it is the abstract syntax tree, a stripped-down version with the irrelevant punctuation discarded. 2 + 3 * 4 becomes a tree with + at the root, 2 as left child, and a * subtree on the right. Parentheses do not appear; precedence is captured by tree shape.
The AST is the substrate for everything downstream. Type checkers walk it; optimisers transform it; code generators emit instructions from it. Most modern compilers represent ASTs as object graphs in their implementation language: a class hierarchy in C++ or Java, an algebraic data type in OCaml, Haskell, or Rust.
The visitor pattern — a class with one method per AST node type — is the standard way to walk an AST in object-oriented compilers. Functional compilers use pattern matching, which is the same idea expressed more concisely.
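Both styles, sketched in Python: dataclasses as the node hierarchy, structural pattern matching as the concise equivalent of a visitor. The node shapes are illustrative, with evaluation standing in for whatever walk a real compiler performs.

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def evaluate(node):
    # One case per node type: the pattern-matching analogue of a visitor's visit_* methods.
    match node:
        case Num(value):
            return value
        case BinOp("+", left, right):
            return evaluate(left) + evaluate(right)
        case BinOp("*", left, right):
            return evaluate(left) * evaluate(right)
    raise TypeError(f"unknown node: {node!r}")

# 2 + 3 * 4, with precedence captured by tree shape rather than parentheses
tree = BinOp("+", Num(2), BinOp("*", Num(3), Num(4)))
print(evaluate(tree))   # 14
```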
The semantic analyser checks that the program is well-typed: that x + y doesn't add an integer to a string, that f(a, b) calls a function with arguments of the right types, that the value returned from a function matches its declared return type.
For statically-typed languages with explicit annotations (C, Java), type checking is a tree walk. For languages with type inference (ML, Haskell, Rust, Swift), the analyser must derive types that the programmer left implicit. The classical algorithm is Hindley–Milner, published independently by Roger Hindley (1969) and Robin Milner (1978); Milner's 1991 Turing Award cited ML and its polymorphic type inference among his achievements.
Hindley–Milner gives a most-general type to every well-typed expression in time linear in program size for typical inputs. Most modern type-inferencing languages descend from it. Polymorphism, generics, and traits or type-classes are all extensions of the Hindley–Milner core.
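The engine inside Hindley–Milner is unification: solving type equations by binding type variables. A minimal sketch of just that step in Python, with lowercase names as type variables and capitalised names as base types; generalisation, let-polymorphism, and the occurs check are all omitted.

```python
# Types: "a", "b" are type variables; "Int" is a base type; ("->", arg, res) is a function type.
def resolve(t, subst):
    """Follow substitution chains until t is no longer a bound type variable."""
    while isinstance(t, str) and t.islower() and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst):
    """Extend subst so that t1 and t2 become equal, or fail."""
    t1, t2 = resolve(t1, subst), resolve(t2, subst)
    if t1 == t2:
        return subst
    if isinstance(t1, str) and t1.islower():      # t1 is a free type variable
        return {**subst, t1: t2}
    if isinstance(t2, str) and t2.islower():
        return {**subst, t2: t1}
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):
            subst = unify(a, b, subst)
        return subst
    raise TypeError(f"cannot unify {t1} with {t2}")

# Applying f : a -> a to a literal of type Int forces the result type b to be Int.
subst = unify(("->", "a", "a"), ("->", "Int", "b"), {})
print(resolve("b", subst))   # Int
```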
Optimisation works best on a representation that is close to machine code but free of target-specific details. Three-address code — every instruction has at most three operands — is the canonical pedagogical IR.
SSA — Static Single Assignment — augments three-address code with the rule that every variable is assigned exactly once. Where the same variable would have been re-assigned, a fresh name is introduced; control-flow merges use phi-functions to select the right name. SSA was developed at IBM by Cytron, Ferrante, Rosen, Wegman, and Zadeck in 1991.
SSA makes data-flow analyses simpler and more powerful. Almost every modern industrial compiler — GCC, LLVM, V8, HotSpot, Cranelift — uses SSA as its primary IR.
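What the rule looks like on a small example, in schematic three-address form (the syntax is illustrative, not LLVM's or GCC's):

```
# source
x = a
if p:
    x = x + 1
use(x)

# after conversion to SSA: every name is assigned exactly once, and the
# join point selects between the two definitions that can reach it
x1 = a
if not p goto L2
x2 = x1 + 1
L2: x3 = phi(x1, x2)
use(x3)
```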
An optimiser is a pipeline of passes, each a small transformation that preserves meaning while improving some metric (usually runtime or code size).
Constant folding: evaluate 2 + 3 at compile time. Constant propagation: replace uses of a variable that holds a constant. Dead code elimination: remove computations whose results are never used. Common subexpression elimination: if a*b is computed twice with the same operands, compute it once.
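A sketch of the first of these, constant folding, as a pass over the toy AST nodes from earlier. A production pass would also worry about overflow, floating-point semantics, and side effects.

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def fold(node):
    """Rebuild the tree bottom-up, folding operators whose operands are both literals."""
    match node:
        case BinOp(op, left, right):
            left, right = fold(left), fold(right)
            if isinstance(left, Num) and isinstance(right, Num):
                return Num(OPS[op](left.value, right.value))
            return BinOp(op, left, right)
        case _:
            return node                     # literals and variables pass through unchanged

# (2 + 3) * x  becomes  5 * x   (the variable "x" is just a placeholder leaf here)
tree = BinOp("*", BinOp("+", Num(2), Num(3)), "x")
print(fold(tree))   # BinOp(op='*', left=Num(value=5), right='x')
```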
Loop-invariant code motion: hoist computations out of loops if they don't depend on the loop variable. Strength reduction: replace expensive operations with cheap ones (x*2 → x+x or x<<1). Loop unrolling, function inlining, tail-call elimination, vectorisation: each is its own pass, each compounds with the others.
The art is in the ordering. Inlining exposes more opportunity for constant propagation, which exposes dead code, which exposes more inlining. Compilers run their pass pipelines twice or thrice for this reason.
The back end emits target machine code. Three classical sub-problems: instruction selection, instruction scheduling, register allocation.
Instruction selection: rewrite IR operations as target instructions. The naive approach gives one instruction per IR node; the good approach matches whole subtrees against a hand-written or generated table of target idioms (e.g. on x86, an array access can become a single mov rax, [rbx + 8*rcx]).
Instruction scheduling reorders independent instructions to keep the pipeline full and avoid stalls. Register allocation assigns each virtual register to a physical register or, when registers run out, to a stack slot — typically by graph colouring (Chaitin et al., 1981) or by linear scan (Poletto and Sarkar, 1999).
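A minimal linear scan in the spirit of Poletto and Sarkar, sketched in Python: walk live intervals in order of start point, expire the ones that have ended, and spill by furthest end point when the register file is full. Interval construction, move insertion, and the many refinements production allocators rely on are omitted.

```python
def linear_scan(intervals, num_regs):
    """intervals: (name, start, end) live ranges, end exclusive.
    Returns a mapping from name to a register number or 'spill'."""
    allocation, active, free = {}, [], list(range(num_regs))
    for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
        # Expire intervals that ended before this one starts, releasing their registers.
        for old in [iv for iv in active if iv[2] <= start]:
            active.remove(old)
            free.append(allocation[old[0]])
        if free:
            allocation[name] = free.pop()
            active.append((name, start, end))
        else:
            # Spill heuristic: evict whichever live interval ends furthest in the future.
            victim = max(active + [(name, start, end)], key=lambda iv: iv[2])
            if victim[0] != name:
                active.remove(victim)
                allocation[name] = allocation[victim[0]]
                active.append((name, start, end))
            allocation[victim[0]] = "spill"
    return allocation

# Three values competing for two registers: 'a' lives longest, so it gets spilled.
print(linear_scan([("a", 0, 8), ("b", 1, 3), ("c", 2, 6)], num_regs=2))
# {'a': 'spill', 'b': 0, 'c': 1}
```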
Register allocation is NP-hard in the general case; production allocators rely on heuristics that work well in practice.
Compilers: Principles, Techniques, and Tools — by Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman — has been the canonical compiler textbook for nearly four decades. The first edition appeared in 1986; the second (with Lam added) in 2006. The cover features a knight battling a dragon labeled "complexity of compiler design," which gave the book its nickname.
The Dragon Book pioneered the canonical pipeline view of compilation and codified the names by which we still know the parts: lexer, parser, syntax-directed translation, type checking, code generation, optimisation. Subsequent textbooks — Appel's Modern Compiler Implementation, Cooper and Torczon's Engineering a Compiler, Muchnick's Advanced Compiler Design and Implementation — extend or contest the framework, but every one of them frames itself in relation to the Dragon Book.
Aho and Ullman shared the 2020 Turing Award.
The first C compiler was written by Dennis Ritchie at Bell Labs in 1972 to bootstrap Unix. It targeted only the PDP-11. Stephen C. Johnson's Portable C Compiler (PCC, 1978) was the next leap: a retargetable compiler whose front-end and back-end were cleanly separated, and which could be ported to a new architecture by writing a few hundred lines of machine description.
PCC was the standard Unix C compiler through the 1980s. It shipped with every Unix from AT&T and the Berkeley distributions. Its design — front-end emits a tree-shaped IR, back-end pattern-matches against target idioms — set the template for two generations of compilers.
By 1990 PCC had been overtaken by GCC for portability and code quality, and by vendor compilers (Sun, IBM, DEC) for native performance. But the architectural ideas survived: every modern retargetable compiler is, structurally, PCC's grandchild.
The GNU Compiler Collection began as Richard Stallman's GCC 1.0 (1987), a free C compiler for the GNU project. Within a decade it had become the de-facto compiler for Linux and most BSDs and a serious cross-platform engineering project — multiple front ends (C, C++, Fortran, Ada, Objective-C, later D and Go), multiple back ends (every architecture under the sun, from Intel x86 to obscure embedded controllers).
The EGCS fork (1997) — pushed by Cygnus Solutions and others frustrated with the slow pace of mainline GCC — accelerated development and was eventually merged back as GCC 2.95 (1999). The episode was a milestone in open-source governance.
GCC's internal representation is layered. The original RTL (Register Transfer Language) survives as the low-level IR; GCC 4.0 (2005) added GIMPLE, a tree-based SSA form on which most optimisation now happens, with RTL retained for the back end. By 2024, GCC remained among the most widely deployed compilers on the planet.
LLVM began in 2000 as Chris Lattner's master's project at the University of Illinois under Vikram Adve. The animating idea was a clean, well-specified intermediate representation — LLVM IR — that could be the lingua franca of multiple front-ends and back-ends, with optimisation as a separate, reusable library.
Apple hired Lattner in 2005 and bet the company's compiler stack on the project. Clang, the C/C++/Objective-C front-end for LLVM, became the default compiler in Xcode 4 (2011). By 2015, Apple had retired GCC entirely from macOS in favour of Clang. Google, Sony, ARM, AMD, NVIDIA, and a long tail of others adopted LLVM as their backend of choice.
LLVM's modular design made it the platform for a generation of new languages: Rust, Swift (also Lattner's creation), Julia, Zig. Without LLVM, none of these would have matured as quickly or arrived as polished. It is the most influential compiler infrastructure since GCC, and arguably more influential than GCC itself.
Clang is the LLVM project's C, C++, and Objective-C front-end. Its design priorities — a hand-written recursive-descent parser, a careful AST that preserves source-level information, fast compilation, and extraordinary error messages — made it a darling of tooling developers as much as of end users.
The error-message improvement deserves particular credit. Where GCC would say error: expected ';', Clang would point at the exact column with a fix-it hint and a caret. The bar moved permanently. By 2015, GCC had spent years catching up on diagnostics quality, partly in response.
Clang's library-first design enabled libclang, used by IDEs (Xcode, CLion, VS Code with clangd) for syntax highlighting, refactoring, and code completion. The language server protocol ecosystem around clangd is now the standard way to bring C++ tooling to any editor.
JIT compilers translate code at runtime, usually starting from a bytecode interpreter and promoting hot methods to native code. The technique was anticipated by Smalltalk in the 1980s but reached maturity with Self (David Ungar and Craig Chambers, 1987–95), whose dynamic-dispatch optimisations directly inspired the Java HotSpot JVM.
HotSpot (1999) brought JIT into the mainstream. It was followed by V8 (Google's JavaScript engine, 2008), which made browser JavaScript 10–100× faster overnight; JavaScriptCore in WebKit; SpiderMonkey in Firefox; PyPy for Python. .NET's CLR has been JIT-based since its 2002 launch.
JIT trades startup latency for steady-state speed, and works best on programs that run long enough to amortise the compilation cost. The tiered compilation trick — interpret first, JIT-compile hot methods at progressively higher optimisation levels — is now standard in every major managed runtime.
JavaScript was untyped, dynamic, and reputedly hopeless to compile efficiently. V8's 2008 launch in Chrome — led by Lars Bak, fresh from HotSpot and Self — disproved that. Within a few releases, hot JavaScript code ran within a small factor of equivalent C on favourable benchmarks.
The trick was aggressive type speculation. The compiler observes the actual types passed at runtime, generates code specialised to those types, and inserts cheap guards that deoptimise back to the interpreter if the assumptions fail. Hidden classes (an idea borrowed from Self) make object-property lookups predictable.
The other engines followed: SpiderMonkey's IonMonkey, JavaScriptCore's FTL/B3, and a steady arms race. The discipline of writing fast JavaScript today — monomorphic call sites, stable object shapes — is essentially the discipline of staying on the JIT's fast path.
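A toy rendering of the speculate, guard, deoptimise loop in Python. Real engines emit machine code specialised against hidden classes and use multi-entry inline caches; here int.__add__ and str.__add__ merely stand in for type-specialised code.

```python
def generic_add(a, b):
    # The interpreter's fully general path: dynamic dispatch on every operation.
    return a + b

def speculate(fast_paths):
    """One-entry inline cache: specialise on the first observed argument types,
    guard on them, and fall back to the generic path when the guard fails."""
    state = {"seen": None}

    def call(a, b):
        if state["seen"] is None:
            state["seen"] = (type(a), type(b))        # profile: record observed types
        ta, tb = state["seen"]
        if type(a) is ta and type(b) is tb:           # cheap guard on the recorded types
            return fast_paths.get((ta, tb), generic_add)(a, b)
        state["seen"] = None                          # guard failed: deoptimise, reset the cache
        return generic_add(a, b)
    return call

add = speculate({(int, int): int.__add__, (str, str): str.__add__})
print(add(1, 2))       # records (int, int); subsequent int calls take the specialised path
print(add("a", "b"))   # guard fails, falls back to the generic path and resets the cache
```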
WebAssembly — Wasm — became a W3C standard in December 2019, after development at Mozilla, Google, Microsoft, and Apple. It is a portable binary instruction format with strict structural validation and a precisely specified semantics. Wasm code can be compiled to native code by the host browser and runs at near-native speed.
Wasm's compile pipeline is a story in itself. C, C++, and Rust compile to Wasm via LLVM's Wasm back-end; the browser parses the binary, validates it, and re-compiles it to native — twice-compiled code that nonetheless runs faster than equivalent JavaScript.
The implications go beyond the web. Wasm runtimes (Wasmtime, Wasmer, WAMR) run Wasm binaries server-side as a sandboxed execution format — a successor, in some sense, to the JVM and CLR, with no language ownership. Edge platforms (Cloudflare Workers, Fastly Compute@Edge) make Wasm their native execution model.
The deepest theory in modern loop optimisation. The polyhedral model represents a loop nest as an integer polyhedron (the iteration domain), maps statements to points in it (the schedule), and treats loop transformations — interchange, tiling, fusion, parallelisation — as affine transformations of the polyhedron.
The mathematics goes back to Paul Feautrier's 1991 work and was developed by Alain Darte, Cédric Bastoul, Albert Cohen, and others through the 1990s and 2000s. Production implementations include Polly (Tobias Grosser, on top of LLVM) and isl (Sven Verdoolaege).
The model handles the affine fragment of a program — loop bounds and array indices that are linear in loop variables — exactly. Outside that fragment, fallback heuristics apply. The polyhedral model is what makes today's high-performance computing compilers extract parallelism that hand-tuning rarely catches.
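The flavour of one such transformation, loop tiling, sketched in Python. The polyhedral machinery is what proves the rewrite legal from the dependence structure and picks the tile size; both are simply assumed here.

```python
N, TILE = 1024, 64

# Original loop nest: the iteration domain is the square {(i, j) : 0 <= i, j < N}.
def scale(a, b):
    for i in range(N):
        for j in range(N):
            b[i][j] = a[i][j] * 0.5

# Tiled loop nest: the same points, visited tile by tile for cache locality.
# The transformation is an affine change of schedule; the loop body is untouched.
def scale_tiled(a, b):
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for i in range(ii, min(ii + TILE, N)):
                for j in range(jj, min(jj + TILE, N)):
                    b[i][j] = a[i][j] * 0.5
```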
Modern CPUs have SIMD units — SSE, AVX2, AVX-512, ARM Neon, SVE — that operate on 4, 8, 16, or 32 lanes of data at once. The compiler's job is to recognise loops that can be rewritten to use these units, and to do it without programmer intervention.
The classical algorithm is loop vectorisation: take a loop iterating n times over scalars, transform it into a loop iterating n/k times over vectors of width k, plus a scalar epilogue for the remainder. Dependence analysis decides whether the rewrite preserves meaning.
LLVM's LoopVectorize pass and GCC's tree-vect pass do this for non-trivial loops. SLP vectorisation (Superword-Level Parallelism, Larsen and Amarasinghe 2000) catches straight-line code that vectorises without a loop. Both passes have improved year over year; vectorisation now extracts SIMD speedups out of code its authors never thought about.
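The shape of the rewrite, sketched in Python with a vector width of 4. In real output the wide body becomes a handful of SIMD instructions; dependence analysis must first prove the iterations independent.

```python
K = 4   # vector width: e.g. four 32-bit lanes of a 128-bit SSE or Neon register

# Scalar loop: n iterations, one element each.
def saxpy_scalar(n, alpha, x, y):
    for i in range(n):
        y[i] = alpha * x[i] + y[i]

# Vectorised loop: n // K iterations of a K-wide body, plus a scalar epilogue
# for the n % K leftover elements.
def saxpy_vectorised(n, alpha, x, y):
    main = n - n % K
    for i in range(0, main, K):
        for lane in range(K):              # one K-wide SIMD operation in real output
            y[i + lane] = alpha * x[i + lane] + y[i + lane]
    for i in range(main, n):               # epilogue
        y[i] = alpha * x[i] + y[i]
```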
NVIDIA's CUDA (2007) brought C-like programming to the GPU. Behind it sits nvcc, a compiler that splits a CUDA source file into host code (compiled by the system C++ compiler) and device code (compiled to PTX, NVIDIA's virtual ISA, then JIT-compiled to a specific GPU's binary at runtime).
PTX is to GPUs what LLVM IR is to CPUs: a target-neutral representation that decouples the language frontend from the rapidly-evolving hardware. OpenCL, HIP (AMD), SYCL, and oneAPI (Intel) chase the same model with different vendor flavours.
The compiler challenges are unusual. GPU programs run on thousands of threads in lockstep within a warp; divergent branches serialise. Memory has many levels (registers, shared memory, L1, L2, global, host) with vast bandwidth differences. Kernels are launched at runtime with shape parameters known only then. Modern GPU compilers do aggressive inlining, loop unrolling, and shared-memory layout optimisation to extract performance from this regime.
The 2010s machine-learning explosion produced a wave of tensor-computation frameworks (TensorFlow, PyTorch, JAX), each with its own compiler. MLIR — Multi-Level Intermediate Representation — was Lattner's response, started at Google in 2018 and contributed to LLVM in 2019.
MLIR generalises LLVM's IR concept by allowing multiple coexisting dialects in a single module. A graph of TensorFlow ops can be lowered through a chain of dialects — high-level tensor ops, then linear-algebra ops, then loop nests, then LLVM IR — with each lowering step a well-defined pass.
The framework is now the substrate of XLA, IREE, Mojo, and parts of TensorFlow and PyTorch. It also powers experimental compilers for hardware accelerators (TPUs, Cerebras, Graphcore). The pattern of "one IR, many dialects" looks likely to be the pattern of the next decade of compiler infrastructure.
A compiler written in language X compiling itself: the rite of passage of every serious language. NELIAC is usually credited with the first bootstrap, in 1958, and Hart and Levin's Lisp compiler followed in 1962. The Unix C compiler bootstrapped in 1973, and the discipline of self-hosting has been routine ever since.
Self-hosting forces the language designers to use the language they are designing. The benefits — sharpened taste, exposed ergonomic weaknesses, accumulated dogfood — usually outweigh the bootstrap cost. Modern languages (Rust, Go, TypeScript) typically self-host within their first few years.
Ken Thompson's Turing Award lecture, Reflections on Trusting Trust (1984), exposed a darker corollary: a malicious modification to a self-hosting compiler can persist invisibly across rebuilds, recreating itself in each new compiler binary even after the source has been cleaned. It remains a foundational result in software supply-chain security.
Compiler error messages are a user interface. The 2010s saw a wave of attention to this fact. Clang set the bar with caret-anchored diagnostics, fix-it hints, and template-instantiation backtraces. Elm (Evan Czaplicki, 2012) made friendly errors a feature: each error includes a textual explanation of what went wrong and what the programmer probably meant.
Rust (created by Graydon Hoare, then shaped by a long line of contributors, Aaron Turon and Niko Matsakis among them) inherited and extended the tradition. Rust's borrow-checker errors are some of the most thoughtfully-designed diagnostics in any compiler — they include code spans, lifetime annotations, suggested edits, and pointers to the relevant section of the Rust book.
The compiler-as-IDE trend is part of the same evolution. rust-analyzer, clangd, gopls, TypeScript Server, HLS (for Haskell): every modern language ships a language-server-protocol implementation. The compiler is no longer just a build-step program; it is the brain of the editor.
Can a compiler be proved correct? CompCert — by Xavier Leroy and collaborators at INRIA, started in 2003 — is a C compiler whose entire pipeline is proved, in the Coq proof assistant, to preserve program semantics from source to assembly.
CompCert is mainly used in safety-critical settings (avionics, medical devices). Its generated code is slower than GCC's -O2 but faster than GCC's -O0, and uniquely it carries a machine-checkable certificate of correctness. The 2011 paper Finding and Understanding Bugs in C Compilers (Yang et al.) found bugs in every major production C compiler; CompCert was the only one in which the only bugs found were in the unverified front-end's parsing, not in the verified middle-end.
The lesson generalises. Verified compilation is rare in mainstream practice but increasingly important for cryptography (where mistranslation is a security failure), and is a steady source of theoretical advances in semantics, type theory, and proof engineering.
By 2025, three compiler families dominate. GCC remains the default Linux toolchain and the canonical free Fortran/Ada compiler. LLVM/Clang is the platform of choice for new languages and for Apple, Google, and most of the embedded and game-development worlds. MSVC remains the default for Windows-native development, with its own SSA-based middle-end (UTC).
Beyond the C family: HotSpot and OpenJ9 for the JVM; V8, JavaScriptCore, and SpiderMonkey for JavaScript; RyuJIT for .NET; Cranelift as a fast non-LLVM Wasm and Rust backend; Cython, Numba, PyPy, and now Mojo for Python and its successors.
The frontier is in AI compilers: XLA, IREE, TVM, Triton, MLX. The frontier is in verified compilation: CompCert, CakeML, RustBelt. The frontier is in the compiler as tooling: rust-analyzer, clangd, the compiler-as-IDE. There has never been a more interesting decade for the field.
Aho, Lam, Sethi, Ullman, Compilers: Principles, Techniques, and Tools (2006) is still the right textbook to start with. Then Appel's Modern Compiler Implementation. For LLVM, Lattner's original master's thesis (2002) is short and lucid. Bob Nystrom's Crafting Interpreters (free online) is the best practical introduction written this century.
Compilers — Volume VII, Deck 11 of The Deck Catalog. Set in Iowan Old Style with monospaced accents in SF Mono and Menlo. Slate background #0b0d10; mint accent #7df9c4; warm amber for headings.
Thirty-two leaves on the long apprenticeship of compilation — from Hopper's A-0 in 1952 to MLIR in 2019, by way of FORTRAN, the Dragon Book, GCC, LLVM, V8, and the verified compilers of the present.
Vol. VII · Technology · Deck 11