Vol. VII · Deck 13 · The Deck Catalog

CPUs & Architecture.

Von Neumann to Apple Silicon. RISC versus CISC. Pipelining, branch prediction, caches, SIMD. The x86 dynasty, the ARM ascension, and the GPU's quiet conquest. The Spectre and Meltdown era. The chip you are reading this on.


Year zero · 1945
Transistors today · ~10¹¹
Pages · 32

Opening · What a CPU is.

A central processing unit fetches an instruction from memory, decodes it, executes it, and writes the result. Then it does the same with the next instruction. It does this several billion times a second, on a die smaller than a fingernail, dissipating less power than a light bulb.

The conceptual model — fetch, decode, execute, write back — has not changed since the 1940s. Almost everything else has. A modern CPU keeps hundreds of instructions in flight simultaneously, predicts which way each branch will go, speculatively executes down the predicted path, discards the work whenever the guess is wrong, and exposes none of this complexity to the program.
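
The loop is simple enough to sketch in full. Below, a toy stored-program machine in C makes the fetch/decode/execute/write-back cycle concrete; the three-instruction ISA is invented for illustration and belongs to no real chip.

    #include <stdio.h>

    /* A toy von Neumann machine: code and data share one memory.
       Hypothetical ISA: opcode in the high byte, operand address
       in the low byte of a 16-bit word. */
    enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

    int main(void) {
        unsigned short mem[256] = {
            [0] = LOAD  << 8 | 100,    /* acc = mem[100]   */
            [1] = ADD   << 8 | 101,    /* acc += mem[101]  */
            [2] = STORE << 8 | 102,    /* mem[102] = acc   */
            [3] = HALT  << 8,
            [100] = 40, [101] = 2,
        };
        unsigned short pc = 0, acc = 0;

        for (;;) {
            unsigned short inst = mem[pc++];              /* fetch  */
            unsigned op = inst >> 8, addr = inst & 0xff;  /* decode */
            if (op == HALT) break;
            switch (op) {                    /* execute + write back */
            case LOAD:  acc = mem[addr];                       break;
            case ADD:   acc = (unsigned short)(acc + mem[addr]); break;
            case STORE: mem[addr] = acc;                       break;
            }
        }
        printf("mem[102] = %d\n", mem[102]);    /* prints 42 */
        return 0;
    }

Everything in the rest of this deck, pipelines and predictors and caches alike, exists to make this loop faster without changing what it means.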

This deck is the architectural genealogy of that machine — from von Neumann's draft to the wafer in your phone.


Chapter I · Von Neumann.

The June 1945 First Draft of a Report on the EDVAC — circulated under John von Neumann's name, though it drew on collaboration with the Moore School team (Eckert, Mauchly, Goldstine) — described an architecture in which programs and data lived in the same memory, were accessed through the same word-addressable interface, and were processed by a single arithmetic-logic unit under the direction of a control unit fetching instructions sequentially.

The architecture became canon. Almost every general-purpose computer built since 1948 — from the Manchester Baby to the chip in your phone — is a von Neumann machine. The alternative Harvard architecture, which gives instructions and data separate memories and buses, survives only at the cache level (the L1 cache is split into instruction and data halves on most modern CPUs) and in some embedded controllers.

The von Neumann bottleneck — the inability to fetch instructions and data simultaneously — was the original limit on processor throughput, and remains a structural force in modern CPU design.


Chapter II · The transistor and Moore's law.

The transistor was invented at Bell Labs in December 1947 by John Bardeen, Walter Brattain, and William Shockley. Jack Kilby (Texas Instruments, 1958) and Robert Noyce (Fairchild, 1959) independently invented the integrated circuit: many transistors on a single piece of silicon.

Gordon Moore's 1965 paper Cramming More Components onto Integrated Circuits observed that the number of transistors per chip had been doubling every year, and predicted the trend would continue. The doubling period later settled at roughly two years; Moore's law has been the most consequential empirical regularity in twentieth-century industry.
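
The compounding is easy to check against the endpoints of this deck. Starting from the 4004's 2,300 transistors in 1971 and doubling every two years: 2,300 × 2^((2023−1971)/2) = 2,300 × 2²⁶ ≈ 1.5 × 10¹¹, the same order of magnitude as the ~9.2 × 10¹⁰ transistors of a 2023 flagship chip.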

The 2 nm node arrives at mid-decade, with TSMC and Samsung both slated for volume production in 2025. Densities continue to improve, but the simple "shrink the transistor" recipe is exhausted. Today's gains come from FinFET and GAAFET three-dimensional gate structures, advanced packaging (chiplets, 3D stacking), and EUV lithography. Moore's law is not so much dead as transmuted.


Chapter III · The Intel 4004.

The first commercial microprocessor — a complete CPU on a single integrated circuit — was the Intel 4004, released in November 1971. Federico Faggin, Ted Hoff, and Stanley Mazor designed it for a Japanese calculator company, Busicom; Intel bought back the rights and sold it as a general-purpose part.

The 4004 was a 4-bit chip with 2,300 transistors clocked at 740 kHz. It executed about 92,000 instructions per second. By contrast the Apple M3 Max (2023) has roughly 92 billion transistors. The intervening five decades represent something on the order of a 10⁷ transistor-count increase and a 10⁹ performance increase per chip.

The 4004 was followed by the 8-bit Intel 8008 (1972), the 8080 (1974), the Zilog Z80 (1976), the Motorola 6800, and the MOS 6502 — then the 16/32-bit Motorola 68000 — each playing a role in the personal-computer explosion of the late 1970s and early 1980s.


Chapter IV · The x86 dynasty.

The Intel 8086 (1978) was a 16-bit successor to the 8080, designed under a tight deadline by Stephen Morse. The instruction set was a hasty extension of the 8080's; the segmented memory model was an awkward expedient. None of this would have mattered if IBM hadn't picked the cheaper 8088 variant for the IBM PC in 1981.

That decision — by an IBM committee under Don Estridge — locked the world's dominant computing platform to a particular instruction set. Every successive Intel chip (286, 386, 486, Pentium, Core, Xeon) extended the x86 ISA backward-compatibly. AMD's Opteron and Athlon 64 (2003) extended it again, to 64 bits; Intel adopted AMD's design, and x86-64 became the desktop and server standard.

x86 is famously baroque. Modern CPUs translate x86 instructions into internal RISC-like micro-ops at decode time. The ISA's complexity now lives almost entirely in the decoder; the rest of the processor is a streamlined high-IPC engine that happens to wear an x86 face.


Chapter V · RISC.

By the late 1970s, ISAs had grown elaborate. The DEC VAX could execute a single instruction that ran an entire while loop; IBM 370 microcode was hundreds of thousands of lines. John Cocke at IBM (the 801 project, 1975), David Patterson at Berkeley, and John Hennessy at Stanford argued the opposite: a small, regular instruction set executed at high frequency would beat a large, complex one.

The Berkeley project produced RISC-I (1981) and RISC-II (1983), which became the basis of SPARC. Stanford's MIPS spun out as a company. IBM's 801 became POWER, then PowerPC. ARM (originally Acorn RISC Machine, 1985) and DEC Alpha (1992) followed.

The purist RISC position lost some ground in the 1990s as x86 implementations adopted the same micro-architectural techniques internally. But the broader RISC argument — load/store architecture, fixed-length encodings, large regular register files — has won. Virtually every ISA designed since the mid-1980s (ARM, MIPS, RISC-V, even WebAssembly) is recognisably RISC-influenced.

[Image: a 300 mm silicon wafer]
A 300 mm silicon wafer, fresh from a foundry. Each tile on its surface is a complete die — sometimes a whole chip, sometimes a chiplet that will later be assembled into a multi-die package.

Chapter VI · Pipelining.

The fundamental performance trick. Treat instruction execution as a multi-stage assembly line: while one instruction executes, its successor is being decoded and the one behind that is being fetched. Throughput rises, in the ideal case, by a factor equal to the pipeline depth.

The classical five-stage pipeline — IF (fetch), ID (decode), EX (execute), MEM (memory), WB (writeback) — became canonical via MIPS and is the textbook model. Modern CPUs have far deeper pipelines: the Intel Pentium 4's NetBurst was 20 stages; today's high-performance cores typically run 14–18.
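
The ideal case is easy to draw. With the classical five stages, the pipeline fills during the first four cycles and then completes one instruction per cycle:

    cycle   1    2    3    4    5    6    7
    i1      IF   ID   EX   MEM  WB
    i2           IF   ID   EX   MEM  WB
    i3                IF   ID   EX   MEM  WB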

Pipelining isn't free. Hazards arise: data hazards (an instruction needs a result the previous one hasn't computed yet), structural hazards (two instructions want the same resource), control hazards (a branch's outcome isn't known yet). Forwarding, stalling, and prediction handle these — but the housekeeping eats a measurable fraction of the chip's transistor budget.


Chapter VII · Superscalar & out-of-order.

A superscalar processor issues more than one instruction per clock cycle. The first commercial superscalar microprocessor was the Intel i960CA (1989), followed quickly by the IBM RS/6000 (1990) and DEC Alpha 21064 (1992).

Out-of-order execution goes further: instructions are issued, executed, and completed in whatever order their operands become available, then committed in program order. Tomasulo's algorithm (Robert Tomasulo, IBM 360/91, 1967) is the foundational technique — register renaming and reservation stations resolve dependencies dynamically.

Modern high-performance cores fetch 4–8 instructions per cycle, decode them into micro-ops, rename registers, schedule the micro-ops onto multiple execution units, and retire 4–6 micro-ops per cycle. The reorder buffer holds hundreds of in-flight instructions. The fact that program order is preserved at retirement, despite ferocious internal reordering, is one of computer architecture's quiet triumphs.
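
One way to see the out-of-order machinery from ordinary C is to compare a single dependency chain with four independent ones. A minimal sketch, assuming a modern superscalar core and a compiler that keeps the loops as written (no -ffast-math reassociation):

    #include <stddef.h>

    /* One accumulator: every add depends on the previous one (a
       read-after-write chain), so throughput is bounded by the
       latency of a single floating-point add. */
    double sum1(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four accumulators: four independent chains the out-of-order
       scheduler can keep in flight at once. */
    double sum4(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }

sum4 is typically several times faster, not because it does less work but because it exposes more independent work per cycle to the scheduler.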


Chapter VIII · Branch prediction.

Roughly one in five instructions is a branch. The pipeline cannot wait until a branch's outcome is known — by then it would have stalled for a dozen cycles. So the CPU predicts which way the branch will go and speculatively fetches and executes that path. If the prediction is wrong, the speculative work is discarded and the pipeline restarts.

The first predictors were static (always taken, always not taken). The two-bit saturating counter — proposed by James E. Smith in 1981 and carried into the mainstream by the Pentium (1993) — was the first widely deployed dynamic scheme. Yeh and Patt's 1991 Two-Level Adaptive Branch Prediction introduced predictors that condition on recent branch history; TAGE (Seznec, 2006) and perceptron predictors (Jiménez and Lin, 2001) push the accuracy above 97% on most workloads.
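
The two-bit scheme is small enough to state exactly. A sketch in C; the table size and the PC-indexed lookup are illustrative choices, not any particular CPU's:

    #include <stdbool.h>
    #include <stdint.h>

    /* One 2-bit counter per entry: 0,1 predict not-taken,
       2,3 predict taken. Two wrong predictions are needed to
       flip a strong state, so a loop branch mispredicts only
       once, at loop exit. */
    #define TABLE_BITS 12
    static uint8_t table[1 << TABLE_BITS];   /* all start at 0 */

    static bool predict(uint64_t pc) {
        return table[pc & ((1 << TABLE_BITS) - 1)] >= 2;
    }

    static void train(uint64_t pc, bool taken) {
        uint8_t *c = &table[pc & ((1 << TABLE_BITS) - 1)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }

Real predictors layer history registers and tagged tables on top of this cell, but the saturating counter remains the storage element underneath.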

Modern branch predictors are themselves miniature machine-learning models, trained online by the CPU. They are also, as Spectre revealed, an enormous and previously unsuspected attack surface.


Chapter IX · Cache hierarchy.

DRAM is hundreds of cycles away from the CPU. Caches — small, fast SRAM memories — bridge the gap. Modern CPUs run a multi-level hierarchy: L1 (32–64 KB per core, ~4 cycle access), L2 (256 KB – 2 MB per core, ~12 cycles), L3 (shared, tens of MB, ~40 cycles), then DRAM at hundreds of cycles.

Caches exploit locality. Temporal locality: a recently-accessed address is likely to be accessed again. Spatial locality: addresses near a recently-accessed address are likely to be accessed soon. Cache lines (64 bytes on x86 and mainstream ARM) are the unit of transfer; load one byte and the other 63 arrive with it.
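
The 64-byte line is why traversal order matters so much. A minimal sketch, assuming C's row-major array layout:

    #define N 4096
    static double m[N][N];

    /* Row-major traversal: consecutive iterations touch
       consecutive addresses, so each 64-byte line (eight
       doubles) is fully used after one fetch. */
    double sum_rows(void) {
        double s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Column-major traversal: consecutive iterations are
       N * 8 bytes apart, so nearly every access touches a new
       line; typically several times slower once the matrix
       exceeds the caches. */
    double sum_cols(void) {
        double s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }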

Cache-aware algorithm design is a deep art. Cache-oblivious algorithms (Frigo, Leiserson, Prokop, Ramachandran 1999) achieve near-optimal cache behaviour without knowing the cache parameters at compile time, by recursively dividing problems. Most modern numerical libraries — BLAS, FFTW — use cache-aware or cache-oblivious approaches internally.
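
The recursive-division idea fits in a dozen lines. A sketch of a cache-oblivious matrix transpose; the base-case threshold is an illustrative constant, and the point is that no cache parameter appears anywhere:

    /* Transpose the nr-by-nc block of src at (row, col) into dst.
       Splitting the longer axis recursively guarantees that at
       some depth the two working blocks fit in cache together,
       whatever the cache sizes happen to be. */
    void transpose(double *dst, const double *src, int stride,
                   int row, int col, int nr, int nc) {
        if (nr <= 16 && nc <= 16) {          /* illustrative base case */
            for (int i = 0; i < nr; i++)
                for (int j = 0; j < nc; j++)
                    dst[(col + j) * stride + (row + i)] =
                        src[(row + i) * stride + (col + j)];
        } else if (nr >= nc) {
            int h = nr / 2;
            transpose(dst, src, stride, row,     col, h,      nc);
            transpose(dst, src, stride, row + h, col, nr - h, nc);
        } else {
            int h = nc / 2;
            transpose(dst, src, stride, row, col,     nr, h);
            transpose(dst, src, stride, row, col + h, nr, nc - h);
        }
    }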


Chapter X · Cache coherence.

When multiple cores cache the same memory location, the system must guarantee that they all see a consistent view. The MESI protocol — Modified, Exclusive, Shared, Invalid — is the foundational cache-coherence scheme. Each line in each core's cache is in one of the four states; cores snoop a shared bus, or consult a directory, to observe one another's reads and writes and maintain the invariants.

Modern multicore systems use elaborate variations: MESIF (Intel), MOESI (AMD), MESI plus directory state for many-core systems. The coherence protocol is the largest hidden complexity in shared-memory parallelism, and the source of much of its performance pathology.
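
The pathology is easy to reproduce. A minimal sketch with POSIX threads (sizes and iteration counts illustrative), in which two logically independent counters interact only through the coherence protocol:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Each counter is padded to its own 64-byte line. Delete the
       padding and the two counters collapse onto one line: every
       increment then bounces the line between the cores' caches
       in the Modified state ("false sharing"), typically slowing
       the loop several-fold with no change to the logic. */
    struct counter { long v; char pad[56]; };
    static struct counter counters[2];

    static void *work(void *arg) {
        long *c = &counters[(intptr_t)arg].v;
        for (long i = 0; i < 100000000L; i++)
            (*c)++;
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        pthread_create(&t[0], NULL, work, (void *)(intptr_t)0);
        pthread_create(&t[1], NULL, work, (void *)(intptr_t)1);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        printf("%ld %ld\n", counters[0].v, counters[1].v);
        return 0;
    }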

Above coherence sits the memory model: the rules about what orderings of memory operations a program can observe. Sequential consistency is the textbook ideal; real CPUs implement weaker models (TSO on x86, weak on ARM and POWER). Programmers reason at the level of memory fences and atomic operations when correctness depends on ordering.
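
In C, the model surfaces through <stdatomic.h>. A sketch of the classic publish/consume pattern; what matters is the release/acquire pairing, not the atomicity itself:

    #include <stdatomic.h>
    #include <stdbool.h>

    int data;                       /* plain payload           */
    atomic_bool ready = false;      /* flag that publishes it  */

    void producer(void) {
        data = 42;
        /* Release: no earlier write may be reordered past this
           store. On x86 (TSO) this is nearly free; on ARM it
           emits a barrier instruction. */
        atomic_store_explicit(&ready, true, memory_order_release);
    }

    int consumer(void) {
        /* Acquire: no later read may be reordered before this
           load. Once ready is seen true, data is guaranteed
           visible as 42. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                       /* spin */
        return data;
    }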


Chapter XI · SIMD.

Single Instruction, Multiple Data — one instruction operates on a vector of values in parallel. Intel's MMX (1997) was the first mainstream consumer SIMD extension; SSE (1999) widened the registers from 64 to 128 bits. AVX (2011), AVX2 (2013), and AVX-512 (2017) extended them to 256 and then 512 bits. Neon (2005) is ARM's equivalent; SVE (2017) introduces vector-length-agnostic semantics.

SIMD throughput is enormous. A 512-bit AVX-512 vector multiply does sixteen 32-bit floating-point multiplies in one instruction; a fused-multiply-add does sixteen multiplies and sixteen adds. Modern AI inference and high-performance scientific computing live and die by SIMD.
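
What hand-vectorised code looks like: a minimal AVX2 sketch (256-bit registers, eight floats per operation), assuming an x86 compiler invoked with -mavx2 -mfma and taking n as a multiple of 8 for brevity:

    #include <immintrin.h>

    /* y[i] += a * x[i] (saxpy), eight lanes at a time. Each
       fused multiply-add retires 8 multiplies and 8 adds. */
    void saxpy_avx2(float a, const float *x, float *y, int n) {
        __m256 va = _mm256_set1_ps(a);           /* broadcast a */
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            vy = _mm256_fmadd_ps(va, vx, vy);    /* a*x + y     */
            _mm256_storeu_ps(y + i, vy);
        }
    }

The loadu/storeu forms tolerate unaligned pointers; an AVX-512 version would look the same with __m512 types and sixteen lanes.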

Auto-vectorising compilers (LLVM's LoopVectorize, GCC's tree vectoriser) extract SIMD parallelism from many loops without programmer intervention. Hand vectorisation, via intrinsics or assembly, remains common in the inner loops of BLAS, video codecs, cryptographic primitives, and game engines.


Chapter XII · The multicore turn.

Around 2005 single-core performance scaling stalled. Power dissipation grew faster than performance gains; the power wall had arrived. Manufacturers responded by shipping multiple cores per die. The Intel Core 2 Duo (2006) and AMD Athlon 64 X2 (2005) began the era of mainstream multicore.

The transition was painful. Programmers, used to free single-thread improvements year over year, suddenly had to write parallel code. Operating-system schedulers, language runtimes, lock-free data structures, threading libraries — all had to grow up fast. Much of the 2010s was consumed by the adjustment.

Modern desktop chips have 8–24 cores; servers reach 96 (AMD EPYC Genoa) and beyond. Chiplet architectures (AMD's Infinity Fabric, Intel's tile-based Sapphire Rapids and Meteor Lake) split a logical CPU across multiple dies bonded onto a single package, working around lithography limits and yield economics.


Chapter XIII · ARM.

Acorn Computers' tiny CPU-design team — Sophie Wilson on the ISA, Steve Furber on the silicon — designed the original ARM in 1983–85; it shipped in the Acorn Archimedes (1987). The chip was so power-efficient that an early prototype, running on a test rig with its power supply disconnected, kept working on residual leakage current through the I/O pins.

That accidental discovery foreshadowed the architecture's destiny. ARM (the company) was spun out in 1990 as a joint venture with Apple and VLSI Technology to design the Newton's CPU. The licensing model — selling core designs rather than chips — let ARM saturate the embedded and mobile markets through the 2000s.

The modern ARMv8 ISA (2011) added 64-bit support and is what powers Apple Silicon, Snapdragon, AWS Graviton, and the Fugaku supercomputer (the world's fastest in 2020). By 2025 ARM-based chips ship in the billions per year — vastly more than x86 — and have decisively entered the laptop and server markets that Intel once owned.


Chapter XIV · Apple Silicon.

Apple began designing its own ARM-based system-on-chip with the A4 (2010), shipping in the iPhone 4 and original iPad. By the A7 (2013), Apple was first to ship a 64-bit ARM smartphone CPU — surprising the industry and forcing Qualcomm and Samsung to scramble.

The 2020 transition of the Mac line to Apple Silicon was a bigger turning point. The M1 matched or beat Intel's contemporary i7 in single-thread performance at a fraction of the power. The M1 Pro, M1 Max, M1 Ultra, then the M2 and M3 generations cemented the lead; by 2023, x86 had effectively lost the high-end laptop performance-per-watt race.

The architectural reasons are not magic, just well executed: a wide front end (8-wide decode), a deep reorder buffer (~600 entries), unified memory (CPU and GPU share the same DRAM), and aggressive heterogeneous integration of CPU, GPU, Neural Engine, and other accelerators on one die. The package as a whole is a small computer, not a CPU with peripherals.

[Image: Apple silicon]
The Apple M-series, the most consequential ARM-based system-on-chip of the 2020s — the moment performance-per-watt became the central design metric for the desktop computer.

Chapter XV · RISC-V.

RISC-V began in 2010 at UC Berkeley as a teaching ISA (Krste Asanović, Andrew Waterman, Yunsup Lee, David Patterson). Patterson's history with the original Berkeley RISC made the project's pedigree unimpeachable. Within five years it had become the rallying flag for a movement: a free, open ISA that anyone could implement, modify, or extend.

Adoption has been steady. SiFive (founded 2015) sells RISC-V cores commercially. Western Digital and Nvidia ship RISC-V cores inside their products as auxiliary controllers. China has made RISC-V a strategic priority. The first major mobile chips with RISC-V application cores appeared in 2024.

The bet: that an open ISA, freed from the per-core licensing model of ARM and the proprietary lock-in of x86, will be where new processor design happens — especially in domains (machine learning, cryptography, embedded) where customisation matters and the ARM/Intel duopoly is least entrenched. Whether the bet pays off in the consumer and server mainstream is the open architecture story of the late 2020s.


Chapter XVI · The GPU.

The graphics processing unit began as a dedicated rasteriser. The Nvidia GeForce 256 (1999) was the first chip Nvidia called a "GPU"; it offloaded transform and lighting from the CPU. Programmable shaders (DirectX 8, 2001) made GPUs flexible enough to run general-purpose code on graphics primitives.

The general-purpose phase began with CUDA (Nvidia, 2007), which exposed GPUs as massively parallel compute engines for non-graphics workloads. Bitcoin mining (2010), scientific computing, and crucially deep learning (Alex Krizhevsky's 2012 ImageNet win on two GTX 580s) made GPUs central to the modern compute economy.

GPUs are not little CPUs. They have thousands of small in-order execution lanes, with threads grouped into warps that execute in lockstep. Memory bandwidth (HBM2/HBM3) is enormous — terabytes per second per package. The architectural dichotomy — CPU as latency engine, GPU as throughput engine — is now the central organising principle of modern systems.


Chapter XVII · Nvidia.

Founded in 1993 by Jensen Huang, Chris Malachowsky, and Curtis Priem, Nvidia spent its first decade as a GPU vendor competing with ATI. The decisive bets came later: CUDA (2007) made the GPU a general-purpose compute platform; the Tesla data-centre product line (2007) pursued HPC; and the cuDNN library, plus the courtship of the deep-learning framework community (2014–), made Nvidia GPUs the de-facto AI training hardware.

By 2024 Nvidia was the world's most valuable company. The H100 (2022) and B100/B200 (2024) chips dominate large-model training; data centres buy them in tens of thousands at a unit cost approaching that of a luxury sedan.

The lesson, if there is one, is about ecosystems. CUDA's incumbency, refined over 17 years, is a moat as deep as the hardware itself. Competitors (AMD's ROCm, Intel's oneAPI) close the gap on hardware specs but cannot match the depth of CUDA's libraries, tools, training, and developer base. The processor war of the 2020s is, more than ever, a software war.


Chapter XVIII · Spectre and Meltdown.

In January 2018, two related families of vulnerabilities — Spectre and Meltdown — were disclosed simultaneously by researchers at Google Project Zero, Graz University of Technology, and elsewhere. Both exploited speculative execution: the very mechanism by which modern CPUs achieve high IPC.

Meltdown exploited the fact that affected Intel CPUs performed the privilege check on a load only after speculatively forwarding its result, letting user code leak kernel memory. Spectre tricked the branch predictor into speculatively executing code that loaded secret-dependent data into the cache, after which a separate timing measurement read the cache and recovered the data. Spectre, in particular, was architectural rather than an implementation bug: every CPU with speculative execution was, in some form, susceptible.
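
The channel that makes both attacks observable is just a timer pointed at the cache. A minimal x86 sketch of the measurement primitive, assuming a compiler that provides __rdtscp and _mm_clflush (x86intrin.h); a real exploit wraps this around speculatively leaked state:

    #include <stdint.h>
    #include <x86intrin.h>

    /* Time one load. A cache hit costs tens of cycles; a miss
       to DRAM costs hundreds. Spectre-class attacks encode a
       secret byte as *which* of 256 probe lines got cached,
       then time all 256 to read the byte back. */
    static uint64_t probe(const volatile uint8_t *p) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);   /* serialising timestamp */
        (void)*p;                       /* the load being timed  */
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }

    static void flush(const void *p) {
        _mm_clflush(p);                 /* evict the line first  */
    }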

The mitigations cost real performance. KPTI (kernel page-table isolation) imposed a measurable overhead on system calls. Microcode updates changed branch-prediction behaviour. The 2018–2020 period saw a steady drumbeat of follow-on disclosures (Foreshadow, ZombieLoad, MDS, Retbleed). The era ended a long-standing assumption that micro-architectural state was invisible to software.


Chapter XIX · Power and thermals.

A modern CPU's performance is bounded not by transistor count but by power and heat. Dennard scaling — the rule that as transistors shrink, their power density stays constant — broke around 2006. Since then, every new process node delivers more transistors per square mm but the chip cannot run them all at full speed simultaneously: this is the dark silicon problem.

The mitigation: turbo boost, per-core frequency scaling, dynamic voltage and frequency scaling (DVFS), and heterogeneous cores (big.LITTLE on ARM, P-cores and E-cores on Intel). Different parts of the chip run at different speeds; idle units gate their power; the workload dictates the configuration.

Power is also why ARM took the laptop market. A core can be designed for performance-per-watt or for raw performance, but not both equally; ARM's mobile-first design heritage gave it years of head start on the metric that turned out to matter most.


Chapter XX · Memory and DRAM.

DRAM (dynamic random-access memory) was invented at IBM in 1966 by Robert Dennard. A modern DRAM cell is a single transistor and capacitor; it stores one bit, and the capacitor must be refreshed every few tens of milliseconds before the charge leaks away.

The DRAM hierarchy: DDR4 (2014), DDR5 (2020) for general-purpose memory; HBM2/HBM3 (high-bandwidth memory) for GPUs and HPC, stacked vertically with through-silicon vias on a silicon interposer. LPDDR for mobile and laptop. 3D XPoint (Optane, 2017–2022, since cancelled) attempted persistent memory at near-DRAM speeds.

The bandwidth gap remains the central system-level challenge. CPU compute throughput grows faster than DRAM bandwidth; the gap is bridged by ever-larger caches and ever-cleverer prefetchers. CXL (Compute Express Link, 2019) is the emerging standard for memory pooling and disaggregation, promising to let a CPU access memory on a different chip coherently and at low latency.


Chapter XXIThe SoC.

A system on a chip integrates a CPU, GPU, memory controller, peripheral interfaces, and special-purpose accelerators on a single die. Mobile chips were SoCs from the start; laptop and desktop CPUs became increasingly SoC-like through the 2010s; Apple Silicon completed the trajectory.

The accelerators are increasingly the story. The Apple Neural Engine runs ML inference at low power. The Image Signal Processor handles camera RAW pipelines. The Secure Enclave isolates cryptographic key material. The Media Engine encodes and decodes H.265, AV1, ProRes. Each is a tiny custom processor, hundreds of millions of transistors, doing one job vastly better than a general-purpose core could.

This is the post-Dennard frontier. With dark silicon abundant and general-purpose performance scaling slow, every chip is gradually becoming a constellation of specialised engines, with the CPU as orchestrator rather than centrepiece. Domain-specific architecture — Patterson and Hennessy's 2017 Turing lecture topic — is the design philosophy of the 2020s.

[Image: a microprocessor die]
An IC die from the 1970s — a few thousand transistors. Today's leading-edge package contains a hundred billion. The micrograph aesthetic has not changed; the densities, by a factor of ten million, have.

Chapter XXII · TSMC and the foundries.

Taiwan Semiconductor Manufacturing Company was founded in 1987 by Morris Chang on a deceptively simple premise: separate chip design from chip manufacturing. Customers (the "fabless" semiconductor companies) would design the chips; TSMC would build them.

The model proved transformative. Nvidia, AMD (after spinning off GlobalFoundries in 2009), Apple, Qualcomm, MediaTek, Broadcom — most of the top chip designers of the 2020s do not own a fab. TSMC, Samsung Foundry, and Intel Foundry Services compete for their business at the leading edge.

By 2025, TSMC manufactured something like 60% of the world's logic chips by foundry revenue and the vast majority of leading-edge (3 nm, 2 nm) production. The geopolitics of this concentration — TSMC is in Taiwan, which the People's Republic of China claims — has become a first-order issue in international relations. The CHIPS Act (2022) and analogous programs in the EU and Japan have flooded the industry with subsidies for diversification, with effects that will play out over the coming decade.


Chapter XXIII · Quantum and neuromorphic.

Two architectures pursue computation outside the classical CMOS framework. Quantum processors — superconducting qubits (IBM, Google), trapped ions (IonQ, Quantinuum), photonic systems (PsiQuantum), neutral atoms (QuEra) — exploit superposition and entanglement. They are excellent at a small number of problems (Shor's factoring, Grover's search, certain quantum simulations) and currently terrible at most others.

Neuromorphic chips — Intel's Loihi, IBM's TrueNorth, BrainChip's Akida — emulate the spike-based dynamics of biological neurons. They are exquisitely energy-efficient for certain pattern-recognition workloads but lack the universal applicability of conventional silicon.

Neither is yet a general-purpose threat to classical CPUs. The next decade will probably see both grow into useful niches — quantum for cryptanalysis and material simulation, neuromorphic for sensor edge computing — without displacing the von Neumann mainstream. Computer architecture's pluralistic future is becoming visible.


Chapter XXIVAI accelerators.

The last decade has been the great age of AI accelerators. Google's TPU (deployed 2015, described in Jouppi et al.'s 2017 paper In-Datacenter Performance Analysis of a Tensor Processing Unit) was the first inference accelerator publicly disclosed at hyperscaler scale. Cerebras's wafer-scale engine (2019) put 400,000 cores on a single 215 mm × 215 mm die, the largest chip ever fabricated; its 2021 successor raised the count to 850,000.

The architectural pattern: matrix engines that do Y = XW + b at extraordinarily high throughput, with most of the die area devoted to a systolic array or its descendants, and most of the off-die budget devoted to memory bandwidth (HBM3, HBM3e). Nvidia's H100 has a 4th-generation Tensor Core; AMD's MI300 is its rival; Intel's Gaudi 3, Google's TPU v5p, and a Cambrian explosion of startups (Tenstorrent, Groq, Graphcore, SambaNova) are the rest of the field.
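
The operation at the centre of all this silicon is compact to state. A plain-C reference for Y = XW + b; this loop nest is the specification of what a systolic array computes in hardware:

    /* Y = X*W + b: X is m-by-k, W is k-by-n, b has n entries,
       Y is m-by-n. A TPU-style systolic array streams X and W
       through a fixed grid of multiply-accumulate cells; the
       grid computes exactly this, thousands of cells at a time. */
    void dense(const float *X, const float *W, const float *b,
               float *Y, int m, int k, int n) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                float acc = b[j];
                for (int p = 0; p < k; p++)
                    acc += X[i * k + p] * W[p * n + j];
                Y[i * n + j] = acc;
            }
    }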

The economics are fierce. A single H100 retails for $30,000+; a training cluster costs hundreds of millions of dollars. The architectural innovations are happening at the intersection of memory hierarchy, networking (NVLink, InfiniBand), and packaging — not so much in the cores themselves.


Chapter XXV · Hennessy and Patterson.

Two computer architects share the field's intellectual gravitational pull. John L. Hennessy (Stanford, MIPS) and David A. Patterson (Berkeley, RISC-I, RAID, RISC-V) co-authored Computer Organization and Design and Computer Architecture: A Quantitative Approach — the textbooks that have trained two generations of computer engineers.

The "quantitative approach" itself was a methodological contribution: design decisions should be justified by measurement and benchmark, not architectural taste. The SPEC benchmark suite, the formalisation of Amdahl's law as a design constraint, and the discipline of reading micro-architecture at the level of cycles-per-instruction breakdowns — all bear the Hennessy-Patterson stamp.

The 2017 Turing Award went jointly to both, for "pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with lasting impact on the microprocessor industry." Their joint Turing lecture, A New Golden Age for Computer Architecture, articulated the domain-specific architecture thesis that has dominated the field since.


Chapter XXVI · The next decade.

Where is the field going? Several visible vectors:

Heterogeneous integration. Chiplets, 3D stacking, silicon photonics, advanced packaging. The package, not the die, is the new unit of design. TSMC's CoWoS and Intel's Foveros are leading examples.

Domain-specific accelerators. AI is the obvious case; codecs, cryptography, ray tracing, networking, and database query engines are following. The CPU shrinks to coordinator-of-accelerators in domain after domain.

Memory disaggregation and CXL. Pooled memory across nodes, accessible coherently. Memory is becoming a tier-able, swappable resource rather than a fixed per-server quantity.

Open ISAs. RISC-V's growth; the loosening of ARM's licensing model in response. The age of one or two ISAs ruling everything is probably ending.

The von Neumann model survives at the centre. Around it, the architecture of computation is more diverse, more specialised, and more interesting than at any time since the late 1980s.


Chapter XXVII · Twenty essentials.


Chapter XXVIII · Watch & read.

Watch · CPU Architecture Explained — the canonical introduction

Watch · 9.2.3 The von Neumann Model
Watch · CPU Pipeline (Computerphile)

And read

Hennessy and Patterson's Computer Architecture: A Quantitative Approach is unavoidable; it is the book on the subject. Bryant and O'Hallaron's Computer Systems: A Programmer's Perspective bridges architecture and systems-level programming better than any rival. For the modern industrial story, Chris Miller's Chip War (2022) is the best historical narrative.


Chapter XXIX · Names to know.

John von Neumann for the architecture itself. Maurice Wilkes for EDSAC and microcode. Gene Amdahl for the IBM 360 and his eponymous law. Seymour Cray for the supercomputer as design philosophy. Gordon Moore and Robert Noyce for Intel and the integrated circuit. Federico Faggin for the 4004 and silicon-gate technology.

John Hennessy and David Patterson for RISC and the textbooks. John Cocke for the IBM 801 and modern compiler-architecture co-design. Sophie Wilson and Steve Furber for ARM. Jim Keller — architect on AMD K8 and K12, Apple A4/A5, Tesla FSD, and AMD Zen — for the through-line running through so many of the major CPU projects since 2000.

And the company-builders: Andy Grove, Jensen Huang, Lisa Su, Morris Chang. The names that turn architecture into industry.


The end of the deck.

CPUs and Architecture — Volume VII, Deck 13 of The Deck Catalog. Set in IBM Plex Sans and IBM Plex Serif on a fine-grid background, with crimson accents (#b91c4b). The grid is a nod to the schematic and the floorplan — the underlying texture of the discipline.

Thirty-two leaves on the most engineered object on Earth. The von Neumann draft is eighty years old; the chip is still being designed.

FINIS

Vol. VII · Technology · Deck 13
