Vol. III · Deck 12 · The Deck Catalog

Databases.

Sixty years of how we store and retrieve structured data. From IMS hierarchies through Codd's relational model and SQL, the NoSQL turn, and the vector databases of the AI moment.


IMS · 1968
Codd's paper · 1970
Pages · 32
Lede · II
┌── opening ──

/* opening */What a database is.

A system for organising structured data so that it can be stored durably, queried flexibly, and modified safely by many programs at once.

Each of those four words — durable, flexible, safe, concurrent — names a problem that took decades of work to solve well. The history of databases is the history of those problems being attacked, partially solved, and then re-opened by changes in the underlying hardware, the workloads, and the kinds of data being stored.

This deck moves through the sequence: pre-database file systems, the hierarchical and network models, Codd's relational revolution, SQL's commercial dominance, the NoSQL response to web scale, the CAP theorem and its consequences, and the vector and AI-aware databases of the contemporary moment.

Vol. III— ii —
Pre-databaseIII
┌── 1960s ──

/* chapter i */Files, records, and pain.

Before databases there were files. Mainframe applications of the 1950s and early 1960s — payroll, inventory, banking — read and wrote sequential records, typically in COBOL, on tape or on early disk.

Each application defined its own record format. A customer's address might live in three different layouts in three different systems, with three different update procedures. Duplication was rampant; a customer's name change required coordinated edits across files; consistency was a matter of programmer discipline rather than system guarantee.

The structural problems — data redundancy, update anomalies, the absence of any general query mechanism, the entanglement of physical layout with application logic — were the pressure that produced database management systems as a category. The first systems addressed the problems by stepping a level above raw files; the price was that programmers had to learn a new layer of abstraction.

Databases · Pre-DB— iii —
IMSIV
┌── 1968 ──

/* chapter ii */Hierarchies and IBM IMS.

IMS (Information Management System) was developed by IBM and Rockwell for the Apollo programme — the bill of materials for a Saturn V is a deeply nested hierarchy — and released to commercial customers in 1968. It was the first widely deployed commercial database management system.

The data model was hierarchical: each record except the root had exactly one parent and any number of children, forming trees. A query traversed the tree using a procedural language (DL/I). Performance on a single dominant access path could be excellent; the cost was that any non-hierarchical access pattern (a query that crossed branches) was awkward.

The competing network model — pioneered by Charles Bachman at General Electric in the Integrated Data Store (IDS, 1964) and codified by the CODASYL committee in 1971 — generalised the hierarchy to a directed graph: a record could have multiple parents. Bachman won the 1973 Turing Award for the work. The network model fixed the hierarchy's most painful limitations but inherited its central problem: the application programmer had to know the physical structure of the database to write queries against it.

Databases · IMS— iv —
Codd 1970V
┌── 1970 ──

/* chapter iii */Codd's paper.

"A Relational Model of Data for Large Shared Data Banks," Edgar F. Codd, Communications of the ACM, June 1970. Eleven pages. The most consequential paper in the history of data management.

Codd, an English-born IBM researcher, proposed that data should be represented as a collection of relations — what users would later call tables — with a precisely defined mathematical foundation in set theory and predicate logic. Queries would be expressed declaratively, in terms of what data was wanted, not how to fetch it. The system would be free to choose any execution strategy that produced the right answer.

The radical move was the separation of logical schema from physical storage. Application code could be written against the relational model without knowing how the data was laid out on disk. The DBMS could change its storage strategy without breaking applications. This is the foundation of every relational database since.

IBM was slow to adopt Codd's ideas. The company had a substantial existing investment in IMS. Codd's model would be commercialised first by others.

Databases · Codd— v —
Relational algebraVI
┌── theory ──

/* chapter iv */The relational algebra.

Codd's 1972 follow-up paper formalised the operations on relations. The core algebra has six primitive operators: selection (rows where a predicate holds), projection (a chosen subset of columns), cartesian product, union, difference, and rename. From these, every other operator — join, intersection, division — can be derived.

The companion relational calculus, in two flavours (tuple and domain), gave a logic-based specification of what queries could be expressed. Codd's completeness theorem (1972) showed that the algebra and the calculus had the same expressive power. Any query expressible in the calculus could be evaluated by an algebraic plan.

This is why query optimisers work. The optimiser takes a declarative query (calculus or SQL), converts it to an algebraic expression, and then transforms the expression — using algebraic equivalences — into a faster equivalent plan. Fifty years of database research has produced increasingly sophisticated optimisers operating on this foundation. The algebra is the bedrock.
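The primitives fit in a few lines of Python. A toy sketch, with relations modelled as lists of dicts and all table and attribute names invented; note how join falls out as a selection over the product:

```python
# A toy relational algebra over relations modelled as lists of dicts.
# Illustrative sketch only; relation and attribute names are invented.

def select(rel, pred):            # sigma: rows where a predicate holds
    return [row for row in rel if pred(row)]

def project(rel, attrs):          # pi: a chosen subset of columns, deduplicated
    seen, out = set(), []
    for row in rel:
        t = tuple((a, row[a]) for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(t))
    return out

def product(r, s):                # cartesian product
    return [{**a, **b} for a in r for b in s]

def join(r, s, pred):             # derived: a selection over the product
    return select(product(r, s), pred)

emp = [{"name": "Ada", "dept_id": 1}, {"name": "Boris", "dept_id": 2}]
dept = [{"id": 1, "dept": "Research"}, {"id": 2, "dept": "Ops"}]

matched = join(emp, dept, lambda row: row["dept_id"] == row["id"])
print(project(matched, ["name", "dept"]))
# [{'name': 'Ada', 'dept': 'Research'}, {'name': 'Boris', 'dept': 'Ops'}]
```

A real engine would of course operate on indexed storage and choose among join algorithms; the algebraic shape is the part that survives.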

Databases · Algebra— vi —
SQLVII
┌── 1974 ──

/* chapter v */SQL.

SQL began life as SEQUEL (Structured English Query Language), designed by Donald Chamberlin and Raymond Boyce at IBM's San Jose lab in 1974, as the query language for IBM's relational research prototype System R. The "English" part was deliberate: the language was designed to be written by analysts, not just programmers.

The trademark SELECT col FROM table WHERE pred shape — declarative, set-oriented, close to natural language — has been remarkably durable. The 1986 ANSI standard (SQL-86) and the 1992 standard (SQL-92) defined the language at the level most developers know. SQL:1999 added recursive queries (the WITH RECURSIVE clause), object-relational extensions, and triggers. SQL:2003 added window functions; SQL:2016 added JSON support; SQL:2023 added property-graph queries.
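Two of the later standard features can be demonstrated against SQLite, which ships with Python and supports recursive CTEs (since 3.8.3) and window functions (since 3.25). A sketch with invented table and column names:

```python
import sqlite3

# Post-SQL-92 features in a runnable snippet; table names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES ('north', 10), ('north', 30), ('south', 20);
""")

# SQL:1999 recursive query: generate the integers 1..5.
nums = [r[0] for r in conn.execute("""
WITH RECURSIVE n(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM n WHERE x < 5)
SELECT x FROM n
""")]
print(nums)  # [1, 2, 3, 4, 5]

# SQL:2003 window function: running total per region.
rows = conn.execute("""
SELECT region, amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running
FROM sales ORDER BY region, amount
""").fetchall()
print(rows)  # [('north', 10, 10), ('north', 30, 40), ('south', 20, 20)]
```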

SQL outlasted every alternative — QUEL, OQL, the various NoSQL query languages — even as the underlying database engines diversified. Most modern non-relational systems eventually grow a SQL dialect. The language's apparent ugliness conceals an extraordinary track record.

Databases · SQL— vii —
OracleVIII
┌── 1977 ──

/* chapter vi */Oracle.

Larry Ellison, with Bob Miner and Ed Oates, founded Software Development Laboratories (later Relational Software, later Oracle Corporation) in 1977. Reading Codd's papers and observing IBM's slow commercialisation, the founders moved to ship a relational database before IBM did. Oracle V2 shipped in 1979 — the first commercial SQL relational database.

The early product was famously buggy (Ellison's autobiographical comments on the early years are unsparing) but the strategy was correct. By the time IBM shipped SQL/DS in 1981 and DB2 in 1983, Oracle had a head start in the rapidly growing market.

Oracle's commercial dominance through the 1990s was built on hard engineering — write-ahead logging, multi-version concurrency control, parallel query execution, the cost-based optimiser — combined with aggressive sales and a near-religious commitment to backward compatibility. The 1992 release of Oracle 7, the 2001 release of Oracle 9i (the "internet" release), and the 2007 release of Oracle 11g were each commercially decisive.

By 2024, Oracle remained the most-used relational database for high-end OLTP workloads worldwide, despite four decades of would-be replacements.

Databases · Oracle— viii —
DB2IX
┌── 1983 ──

/* chapter vii */IBM's response.

IBM's System R research prototype (San Jose, 1974–1979) had pioneered most of the techniques that became standard in commercial relational databases — query optimisation by cost estimation, two-phase locking, write-ahead logging, the SQL language itself. The corporate question through the late 1970s was whether to ship it.

The answer came in stages. SQL/DS for the DOS/VSE and VM operating systems shipped in 1981. DB2 for the dominant MVS mainframe environment shipped in June 1983. By the late 1980s DB2 was the leading mainframe relational database. The Unix and Windows ports — DB2 UDB — followed in the 1990s.

The technical legacy was profound. The System R papers (Selinger's 1979 query-optimisation paper; Gray's 1981 transaction paper) became the foundational reading for every database course. Most subsequent relational engines either descended from System R's design or had to explain why they did not.

IBM rebranded the product line in 2017 to "Db2." The original Codd-derived engine remains in active use in banking and insurance backends worldwide.

Databases · DB2— ix —
FIG. 1
Edgar F. Codd.
Codd (1923-2003) — invented the relational model (1970). The theoretical foundation of all major business databases.
PostgresX
┌── 1986 ──

/* chapter viii */Stonebraker and Berkeley.

Michael Stonebraker at Berkeley led the INGRES research project (1973–1985) — a competing relational implementation to System R, with its own query language QUEL and its own optimisation strategies. The project produced a generation of database researchers and several commercial spinoffs.

The successor project, POSTGRES (1986), aimed at a more ambitious "object-relational" model — user-defined types, inheritance, rules, extensibility. The 1995 effort to port the system from QUEL to SQL produced Postgres95, soon renamed PostgreSQL. The Berkeley source release was taken over by a global volunteer community, which has shipped a major version annually for almost three decades.

By the 2020s, Postgres had become the open-source relational database of choice for a substantial fraction of new development. Its extensibility — particularly the pgvector extension for vector search, the PostGIS spatial extensions, and the foreign-data-wrapper system — let it absorb workloads that had moved to specialised stores during the NoSQL years.

Stonebraker's 2014 Turing Award recognised the work, citing his "fundamental contributions to the concepts and practices underlying modern database systems."

Databases · Postgres— x —
MySQLXI
┌── 1995 ──

/* chapter ix */MySQL.

MySQL began as a Swedish project — Michael "Monty" Widenius, David Axmark, Allan Larsson, MySQL AB, 1995 — explicitly aimed at small-to-medium web applications that could not afford Oracle licences. The combination of GPL licensing, easy installation, and adequate performance for read-heavy web workloads made MySQL the database half of the early-2000s "LAMP" stack (Linux, Apache, MySQL, PHP/Perl/Python).

The technical compromises were significant. Early MySQL versions lacked transactions, foreign keys, and many features standard in serious relational databases. The default MyISAM storage engine prioritised read performance over consistency. The pluggable-engine architecture eventually allowed InnoDB (whose developer, Innobase Oy, was acquired by Oracle in 2005) to provide ACID transactions. By the late 2000s MySQL was a serious database; by the 2020s, the default engine and feature set were comparable to Postgres on most metrics.

Sun acquired MySQL AB in 2008 for $1B. Oracle acquired Sun in 2010, an acquisition that made the corporate steward of MySQL the same company as the leading commercial-database vendor. The community fork MariaDB emerged in response and remains a meaningful alternative.

Databases · MySQL— xi —
OLTP & OLAPXII
┌── 1990s ──

/* chapter x */The OLTP/OLAP split.

By the early 1990s the same database engine was being asked to run two very different workloads. OLTP (online transaction processing) — orders, payments, inventory updates — meant short transactions touching small amounts of data, with a hard requirement for consistency and durability. OLAP (online analytical processing) — sales reports, customer segmentation, executive dashboards — meant long-running queries scanning large amounts of historical data.

The two workloads conflict. OLTP wants row-oriented storage, narrow indexes, and fast point lookups. OLAP wants column-oriented storage, wide scans, and aggregate optimisations. Running both on the same database produced contention; the analytical query and the order-entry transaction got in each other's way.
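The storage split can be sketched in plain Python: the same invented table held row-wise and column-wise, with the point lookup natural in one representation and the single-column aggregate natural in the other:

```python
# Row-store vs column-store, sketched in plain Python. The point: an OLAP
# aggregate over one column touches every record object in a row store, but
# only one contiguous array in a column store. Field names are invented.

rows = [  # row-oriented: one record per entity (OLTP-friendly point lookups)
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 25.0},
    {"order_id": 3, "customer": "a", "amount": 5.0},
]

cols = {  # column-oriented: one array per attribute (OLAP-friendly scans)
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 25.0, 5.0],
}

# Point lookup: natural in the row store.
order_2 = next(r for r in rows if r["order_id"] == 2)

# Aggregate scan: in the column store, only the 'amount' array is read.
total = sum(cols["amount"])
print(order_2["customer"], total)  # b 40.0
```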

The 1990s solution was the data warehouse: a separate analytical database, periodically populated from the OLTP system via ETL (extract-transform-load) pipelines. Bill Inmon's 1992 Building the Data Warehouse and Ralph Kimball's 1996 The Data Warehouse Toolkit defined the discipline. The architectural split between operational and analytical stores has persisted, with refinements, into the present.

Databases · OLTP/OLAP— xii —
SchemasXIII
┌── modelling ──

/* chapter xi */Star and snowflake.

Kimball's dimensional modelling was the dominant schema design pattern for analytical databases through the 2000s. A fact table at the centre held the granular events being analysed (sales, web visits, claims) with foreign keys out to dimension tables (customer, product, time, store). The radial geometry gave the pattern its name: the star schema.

The snowflake schema normalises the dimensions further, splitting each into multiple related tables. Storage is more compact; queries require more joins. Most warehouses chose star for query simplicity and accepted the storage overhead.
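A minimal star schema and its canonical query shape, sketched against SQLite with invented fact and dimension tables:

```python
import sqlite3

# One fact table with foreign keys out to two dimensions. Names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER, amount INTEGER);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_store   VALUES (10, 'west'), (11, 'east');
INSERT INTO fact_sales  VALUES (1, 10, 100), (1, 11, 50), (2, 10, 70);
""")

# The canonical star-schema query: join out to dimensions, group by attributes.
result = conn.execute("""
SELECT p.category, s.region, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_store   s ON f.store_id   = s.store_id
GROUP BY p.category, s.region
ORDER BY p.category, s.region
""").fetchall()
print(result)
# [('books', 'east', 50), ('books', 'west', 100), ('games', 'west', 70)]
```

A snowflake variant would split dim_product further (category into its own table), adding one join per level of normalisation.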

The 2010s saw a partial retreat from rigid dimensional modelling toward data lakes (raw data stored in object stores; schemas applied at query time) and the medallion architecture (bronze raw, silver cleansed, gold curated layers). The 2020s have seen the pendulum swing back toward more structured approaches with the rise of the lakehouse — Databricks's Delta Lake, Apache Iceberg, Apache Hudi — which combines lake-style storage with warehouse-style query semantics and ACID guarantees.

Databases · Schemas— xiii —
NoSQLXIV
┌── 2009 ──

/* chapter xii */The NoSQL explosion.

The term NoSQL was popularised by a 2009 San Francisco meetup organised by Johan Oskarsson. The label covered a heterogeneous group of new databases — MongoDB, CouchDB, Cassandra, Riak, Redis — united more by what they were against than by any shared technical design.

The motivation was web scale. Companies like Google, Amazon, Facebook, and Twitter were running workloads that exceeded what a single relational database could handle, with availability requirements that the strict consistency of traditional RDBMSs made awkward. The papers of the late 2000s — Google's MapReduce (2004), Google's Bigtable (2006), Amazon's Dynamo (2007), Facebook's Cassandra paper (2009) — articulated a different design space.

The trade-off was sharp. NoSQL systems gave up some combination of strong consistency, ACID transactions, the relational model, and SQL itself in exchange for horizontal scalability and high availability. The early sales pitch ("schema-less," "infinitely scalable") sometimes outran the engineering reality. By the mid-2010s the field had matured into a stable taxonomy of four main families.

Databases · NoSQL— xiv —
DocumentXV
┌── document ──

/* chapter xiii */Document databases.

The dominant document database is MongoDB, developed by 10gen (later renamed MongoDB, Inc.) from 2007 and released as open source in 2009. The data model is a collection of JSON-like documents (BSON internally), each typically representing one logical entity, with no enforced schema across documents.

The appeal to web developers was immediate. Storing a JSON object directly, without first defining a tabular schema and writing object-relational mapping code, matched the way they were already thinking about data. Mongo's defaults favoured developer ergonomics over operational rigour — early versions had silent write loss in some failure modes; the 2015 adoption of the WiredTiger storage engine (acquired the previous year) and subsequent work on consistency models substantially improved this.

The IPO in 2017 confirmed Mongo as a major commercial database vendor. By 2024 the system supported multi-document ACID transactions, time-series collections, vector search, and substantial relational query capability. The category line between "document database" and "relational database with rich JSON support" has become indistinct.

Databases · Document— xv —
Key–valueXVI
┌── kv ──

/* chapter xiv */Key–value stores.

The simplest non-relational data model: a hash map persisted to disk. Redis (Salvatore Sanfilippo, 2009) was originally an in-memory key–value store with rich value types — strings, lists, sets, sorted sets, hashes, streams. The simplicity hides the impact: Redis has become the default cache and ephemeral data layer for a substantial fraction of contemporary web applications.

DynamoDB (Amazon, 2012) is the commercial descendant of the 2007 Dynamo paper. It is a managed, single-digit-millisecond key–value and document store designed for predictable performance at scale. AWS uses it internally for shopping carts, sessions, and other latency-critical workloads. Its pricing model — per-request rather than per-server — was influential across the industry.

The trade-off of key–value stores is what they cannot do. There is no general query language; access is by key or, in the document variants, by secondary index. Joins must be done in application code. The systems are excellent at the workloads they are good at and unsuitable for everything else.
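The core idea, a hash map persisted to disk, fits in a short sketch: a Python dict backed by an append-only log that is replayed on open. Illustrative only; the names and the log format are invented, and real engines add compaction, checksums, and indexes over the log:

```python
import json, os, tempfile

# A minimal key-value store: in-memory dict, durability via append-only log.
class TinyKV:
    def __init__(self, path):
        self.path, self.data = path, {}
        if os.path.exists(path):                 # recover state by replay
            with open(path) as f:
                for line in f:
                    k, v = json.loads(line)
                    self.data[k] = v

    def put(self, key, value):
        with open(self.path, "a") as f:          # durable: append before ack
            f.write(json.dumps([key, value]) + "\n")
        self.data[key] = value

    def get(self, key):                          # reads served from memory
        return self.data.get(key)

path = os.path.join(tempfile.mkdtemp(), "kv.log")
db = TinyKV(path)
db.put("session:42", {"user": "ada"})
db.put("session:42", {"user": "ada", "cart": 3})   # last write wins on replay
print(TinyKV(path).get("session:42"))  # {'user': 'ada', 'cart': 3}
```

There is no query language here at all: access is by key, and anything else is the application's problem — which is exactly the trade-off described above.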

Databases · KV— xvi —
Wide-columnXVII
┌── wide-column ──

/* chapter xv */Bigtable's children.

The Bigtable paper (Chang et al., OSDI 2006) described the storage system Google used internally for indexing the web, Google Earth, and dozens of other workloads. The data model: a sparse, distributed, multi-dimensional sorted map indexed by row key, column key, and timestamp. The implementation: an LSM-tree-based storage layer running on top of the Google File System.

Cassandra (Facebook, 2008; Apache project from 2009) combined the Bigtable data model with Dynamo's eventually consistent replication. HBase (open-source Bigtable clone, 2008) ran on top of HDFS as part of the Hadoop ecosystem. Google Cloud Bigtable (2015) made the original system commercially available.

The wide-column model excels at write-heavy time-series and event-stream workloads. Reads by primary key are fast; analytical queries crossing many rows are awkward. The category has stabilised around a smaller set of operational stores — Cassandra/ScyllaDB and Bigtable's commercial descendants — with the analytical workloads having migrated to columnar warehouse engines.

Databases · Wide-column— xvii —
FIG. 2
SQL query.
SQL — the universal database language since the 1980s. Despite many alternatives proposed, the relational-SQL combination dominates.
GraphXVIII
┌── graph ──

/* chapter xvi */Graph databases.

For data whose primary structure is a network of relationships — social graphs, knowledge graphs, fraud rings, supply chains, recommendation systems — the relational model can be awkward. Multi-hop traversals translate to chains of joins; performance degrades with graph depth.
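The contrast can be sketched directly: over an adjacency list, a traversal of any depth is one loop, where SQL would need a self-join per hop. A breadth-first sketch with an invented follower graph:

```python
from collections import deque

# Multi-hop traversal over an adjacency list. Names are invented.
follows = {
    "ada":   ["boris", "chen"],
    "boris": ["dana"],
    "chen":  ["dana", "erik"],
    "dana":  ["erik"],
    "erik":  [],
}

def within_hops(graph, start, max_hops):
    """All nodes reachable from `start` in at most `max_hops` edges."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return sorted(seen)

print(within_hops(follows, "ada", 2))  # ['boris', 'chen', 'dana', 'erik']
```

A graph database stores the adjacency structure natively and executes exactly this kind of walk, rather than re-deriving the edges through joins on every query.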

Neo4j (Swedish company Neo Technology, 2007) was the first widely deployed property-graph database. Nodes and edges are first-class citizens, each with properties; the Cypher query language expresses traversals concisely. The property-graph query extensions in the SQL:2023 standard (SQL/PGQ) reflect the maturity of the model.

The competitive landscape includes Amazon Neptune, Microsoft Azure Cosmos DB (multi-model, with graph as one mode), TigerGraph, and the open-source JanusGraph. The W3C's RDF triple-store tradition is a separate but related lineage focused on semantic-web and linked-data applications.

The category remains specialised. Most workloads do not need it. The ones that do — Panama Papers analysis, real-time fraud detection, large-scale knowledge graphs — get dramatic gains from purpose-built storage and traversal engines.

Databases · Graph— xviii —
CAPXIX
┌── 2000 ──

/* chapter xvii */The CAP theorem.

Eric Brewer's keynote at the 2000 Symposium on Principles of Distributed Computing articulated a conjecture: a distributed data system cannot simultaneously provide consistency, availability, and partition tolerance. In the presence of a network partition, the system must choose between consistency (refuse to serve stale data) and availability (serve what it has, possibly stale).

Seth Gilbert and Nancy Lynch at MIT proved the formal version in 2002. The theorem became the most-cited piece of distributed-systems folklore of the NoSQL years. It also became the most-misunderstood. The C, A, and P of CAP have specific technical meanings, narrower than their English-language counterparts. Real-world systems sit on a spectrum, not at the corners of a triangle.

Brewer's own 2012 essay "CAP Twelve Years Later" emphasised that the theorem describes behaviour during partitions; in normal operation, all three can be provided. The practical question is what the system does when a partition occurs — and how partitions are detected. The theorem has aged well; the design space has widened around it.

Databases · CAP— xix —
ConsistencyXX
┌── consistency ──

/* chapter xviii */Eventually vs. strongly consistent.

The early NoSQL systems often defaulted to eventual consistency: writes propagate to replicas asynchronously, and reads may see stale values, but in the absence of further writes the system converges. Werner Vogels's 2009 essay "Eventually Consistent" gave the category its canonical statement.

The model is sufficient for many workloads — shopping carts that reconcile at checkout, social-media feeds that tolerate brief inconsistency, analytics that summarise across windows. It is insufficient for others — financial transactions, inventory at low stock, allocation of unique identifiers. The application developer's job became deciding which workloads needed which guarantees.

Strong consistency (linearizability) requires that operations appear to take effect in a single global order consistent with their real-time order. Achieving this across a distributed system requires coordination — Paxos, Raft, or related consensus protocols — at the cost of latency and availability during partitions.

The contemporary picture: most serious systems offer multiple consistency levels per operation, letting the developer choose the trade-off explicitly. The default has trended back toward stronger consistency as engineering capability has improved.

Databases · Consistency— xx —
NewSQLXXI
┌── 2012+ ──

/* chapter xix */The return of consistency.

By 2012, a counter-movement was visible. Google Spanner (Corbett et al., OSDI 2012) demonstrated that a globally distributed database could provide externally consistent transactions across data centres, using GPS- and atomic-clock-based TrueTime to bound clock uncertainty. The paper made it respectable to want strong consistency at scale again.

The NewSQL category collected the systems that aimed at relational semantics, ACID guarantees, and SQL — at the kind of scale NoSQL had previously dominated. CockroachDB (founded 2015 by ex-Google engineers; named for the insect's proverbial resilience), YugabyteDB, TiDB (PingCAP, China), and VoltDB are the leading examples. Each takes a different path to the same goal.

By the 2020s, the category line between NewSQL and traditional relational had blurred. PostgreSQL extensions such as Citus (acquired by Microsoft) provided horizontal scaling for Postgres without leaving the SQL ecosystem. The pendulum, having swung from relational to NoSQL in the late 2000s, swung back to relational-with-distribution in the late 2010s.

Databases · NewSQL— xxi —
ACID/BASEXXII
┌── transactions ──

/* chapter xx */ACID and BASE.

Jim Gray's 1981 paper "The Transaction Concept" defined the durable abstractions that became the relational orthodoxy: Atomicity (a transaction is all or nothing), Consistency (it preserves invariants), Isolation (concurrent transactions appear serial), Durability (committed work survives crashes). The 1983 ACID acronym was Theo Härder and Andreas Reuter's coinage.

The opposing acronym BASE (Basically Available, Soft state, Eventual consistency), articulated by Dan Pritchett at eBay around 2008, framed the NoSQL trade-offs. Where ACID systems prioritised correctness even at the cost of availability under failure, BASE systems chose availability and accepted weaker guarantees.

The 2020s synthesis is that ACID and BASE are not opposing camps but a spectrum of design points. Modern distributed databases offer per-transaction consistency levels, snapshot isolation as a sane default, multi-region deployments with explicit latency/consistency trade-offs, and serializable isolation for the workloads that demand it. Gray's 1981 abstractions have outlasted every system that tried to abandon them.
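Atomicity is easy to demonstrate with SQLite from Python's standard library. A sketch with invented accounts: the transaction fails between the two legs of a transfer, and both legs are rolled back together:

```python
import sqlite3

# Atomicity in practice: a failed transfer rolls back both legs, or neither.
# Account names and balances are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # context manager commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 80 "
                     "WHERE name = 'alice'")
        raise RuntimeError("crash between the two legs of the transfer")
        # never reached: the matching credit to bob
except RuntimeError:
    pass

# The partial debit was rolled back: all-or-nothing.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 50}
```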

Databases · ACID— xxii —
DistributedXXIII
┌── distributed ──

/* chapter xxi */Distributed databases.

The central problem of a distributed database is replication. Data must exist on multiple machines for durability and availability; those copies must be kept in sync. The two main approaches are leader-based replication (one node accepts writes; followers replicate) and leaderless replication (any replica can accept any write; conflicts are reconciled later).

Leader-based systems that need strong consistency use consensus protocols — Paxos (Lamport, circulated 1989, published 1998) or its more comprehensible cousin Raft (Ongaro and Ousterhout, 2014) — to agree on the order of writes. Most modern strongly consistent distributed databases use Raft or a Paxos variant: CockroachDB, etcd, Consul, TiDB.

Leaderless systems — Cassandra, Riak — use quorum-based reads and writes (R + W > N), with vector clocks or last-write-wins to resolve concurrent updates. The model trades strong consistency for higher availability.
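The quorum condition can be checked by brute force: when R + W > N, every possible write set shares at least one replica with every possible read set, so a read quorum always includes the latest acknowledged write. A sketch:

```python
from itertools import combinations

# Quorum reads and writes over N replicas: verify by enumeration that any
# W-replica write set intersects any R-replica read set iff R + W > N.
def quorums_always_intersect(n, r, w):
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

print(quorums_always_intersect(3, 2, 2))  # True:  2 + 2 > 3
print(quorums_always_intersect(3, 1, 1))  # False: 1 + 1 <= 3, stale reads possible
```

The intersection guarantees visibility, not ordering: the reconciliation machinery (vector clocks, last-write-wins) is still needed to decide which of the overlapping values is newest.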

The full design space includes partitioning (sharding strategies, consistent hashing), failure detection, repair protocols, and the recovery semantics for partial failures. Forty years of distributed-systems research underlies every modern multi-node database.

Databases · Distributed— xxiii —
VectorXXIV
┌── 2023+ ──

/* chapter xxii */The AI moment.

The 2022–2024 large-language-model boom created a new database category. Embeddings — high-dimensional vectors representing text, images, or other content — needed efficient nearest-neighbour search at scale. Vector databases were the answer.

The technical problem is approximate nearest-neighbour (ANN) search in high-dimensional spaces. The core algorithms — HNSW (Hierarchical Navigable Small World; Malkov and Yashunin, 2018), IVF (inverted file with product quantisation), DiskANN — predate the AI boom but found explosive new application.
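The baseline that the ANN indexes approximate is exact brute-force search. A sketch over invented three-dimensional "embeddings", scored by cosine similarity; real embeddings have hundreds or thousands of dimensions, which is why the brute-force scan stops scaling:

```python
import math

# Exact nearest-neighbour search by cosine similarity: the brute-force
# baseline that HNSW and friends approximate at scale. Data is invented.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

corpus = {
    "doc_cats":  [0.9, 0.1, 0.0],
    "doc_dogs":  [0.8, 0.2, 0.1],
    "doc_stock": [0.0, 0.1, 0.95],
}

def top_k(query, docs, k=2):
    # Score every vector, keep the k most similar: O(n) per query.
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]

print(top_k([1.0, 0.0, 0.0], corpus))  # ['doc_cats', 'doc_dogs']
```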

The major systems: Pinecone (managed service, 2019), Weaviate, Qdrant, Chroma, Milvus. The relational counter-move was rapid. The pgvector extension for PostgreSQL (2021, with HNSW support added in 2023) made vector search a feature of an existing database rather than a category requiring its own system. Most major relational and NoSQL databases now ship vector-search capabilities.

The 2024–2026 question — whether vector databases remain a separate category or get absorbed into general-purpose databases — appears to be settling toward absorption. A large share of production retrieval-augmented-generation systems run on Postgres with pgvector.

Databases · Vector— xxiv —
FIG. 3
NoSQL.
NoSQL databases — MongoDB, Cassandra, Redis. The 2010s alternative to relational; both ecosystems coexist in modern architecture.
Time-seriesXXV
┌── time-series ──

/* chapter xxiii */Time-series databases.

Time-series workloads — sensor data, financial market data, application metrics, observability telemetry — share a structure that general-purpose databases handle inefficiently: very high write rates of timestamped, mostly append-only data, with queries that aggregate over time windows.
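The characteristic query shape, bucketing timestamps into fixed windows and aggregating each window, takes a few lines. A sketch with invented sensor readings:

```python
from collections import defaultdict

# Downsampling: bucket timestamped points into fixed windows, aggregate each.
# Readings are invented (unix_seconds, value) samples.
readings = [
    (100, 5.0), (130, 7.0), (170, 6.0), (210, 9.0), (250, 8.0),
]

def window_avg(points, width):
    buckets = defaultdict(list)
    for ts, val in points:
        buckets[ts - ts % width].append(val)   # floor to window start
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

print(window_avg(readings, 60))  # {60: 5.0, 120: 6.5, 180: 9.0, 240: 8.0}
```

The dedicated engines win by exploiting exactly this shape: time-partitioned storage makes the bucketing free, and precomputed continuous aggregates make repeated window queries cheap.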

The dedicated systems exploit the structure. InfluxDB (2013, InfluxData) was the first widely deployed time-series database with a SQL-adjacent query language. TimescaleDB (2017) is a Postgres extension that adds automatic partitioning by time and continuous aggregates while staying inside the Postgres ecosystem. Prometheus (SoundCloud, 2012) is the dominant open-source observability time-series database, embedded in the Kubernetes/CNCF stack.

The commercial monitoring-focused vendors (Datadog, New Relic, Splunk) run substantial proprietary time-series engines internally. QuestDB, VictoriaMetrics, and ClickHouse (originally a Yandex columnar warehouse, now widely used for time-series and observability) round out the contemporary landscape.

The category has consolidated. Most new deployments choose either TimescaleDB (for SQL workloads) or a Prometheus-compatible engine (for observability) unless specific scale or feature requirements push elsewhere.

Databases · Time-series— xxv —
InternalsXXVI
┌── internals ──

/* chapter xxiv */B-trees and LSM trees.

Two on-disk data structures dominate database storage. The B-tree, invented by Rudolf Bayer and Edward McCreight at Boeing in 1972, has been the default for traditional relational databases. Pages are kept sorted by key; updates are made in place; reads are O(log n); the structure handles range queries efficiently. The variant B+ tree stores all data in leaf pages and links them for efficient sequential scans.

The LSM tree (Log-Structured Merge tree; Patrick O'Neil et al., 1996) inverts the trade-off. Writes go to an in-memory buffer, which is periodically flushed as a sorted run to disk; older runs are merged in the background. Writes are very fast (sequential); reads may have to consult multiple runs (slower); space amplification depends on the merge strategy.
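An LSM tree in miniature, as a hedged Python sketch: a bounded in-memory buffer flushed as immutable sorted runs, with reads consulting the memtable and then the runs newest-first. Background merging/compaction, the part that controls read and space amplification, is omitted:

```python
import bisect

# LSM sketch: fast sequential writes, reads that may consult several runs.
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:       # flush: one sequential write
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):            # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)      # flush 1: [(a,1),(b,2)]
db.put("a", 3); db.put("c", 4)      # flush 2: [(a,3),(c,4)] shadows a=1
print(db.get("a"), db.get("b"), db.get("c"))  # 3 2 4
```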

LSM trees power most modern wide-column and key-value stores: Bigtable, Cassandra, HBase, RocksDB, LevelDB, ScyllaDB. B-trees power Postgres, MySQL/InnoDB, SQLite, Oracle, DB2. The choice reflects the workload: write-heavy → LSM; read-heavy with mixed updates → B-tree. Modern engines often use combinations.

Databases · Internals— xxvi —
OptimisationXXVII
┌── optimisation ──

/* chapter xxv */Query optimisation.

The query optimiser is the part of a database that decides how to execute a SQL query. Given a declarative query, the optimiser enumerates equivalent plans (different join orders, different index choices, different physical operators), estimates the cost of each, and picks one to run.

The foundational paper is Pat Selinger et al.'s "Access Path Selection in a Relational Database Management System" (System R, SIGMOD 1979). The Selinger-style cost-based optimiser — bottom-up dynamic programming over join orders, with cardinality estimates from histograms — is the basis of most production optimisers. Postgres, DB2, SQL Server, and Oracle all descend from it.
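Why join order matters can be shown with an exhaustive toy version of the idea; real optimisers use bottom-up dynamic programming and statistics, and the cardinalities and selectivities here are invented:

```python
from itertools import permutations

# Cost the same three-way join under every join order, where cost is the sum
# of intermediate result sizes. All numbers are made up for illustration.
card = {"orders": 1_000_000, "customers": 10_000, "vip": 100}
sel = {frozenset(["orders", "customers"]): 1e-4,
       frozenset(["customers", "vip"]): 1e-4,
       frozenset(["orders", "vip"]): 1.0}     # no direct join predicate

def plan_cost(order):
    joined, size, cost = {order[0]}, card[order[0]], 0
    for rel in order[1:]:
        s = min((sel[frozenset([rel, j])] for j in joined
                 if frozenset([rel, j]) in sel), default=1.0)
        size = size * card[rel] * s           # independence assumption
        cost += size                           # pay for the intermediate
        joined.add(rel)
    return cost

best = min(permutations(card), key=plan_cost)
print(best)  # the big table ends up joined last
```

Even in this toy, starting from the small selective tables beats starting from the million-row table by orders of magnitude; the optimiser's whole job is to find that ordering without enumerating every permutation.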

The hard problem is cardinality estimation: predicting how many rows each operator in a plan will produce. Bad estimates produce bad plans. Forty years of research has produced histograms, sketches, sampling-based estimators, and recently learned cost models — none of which solves the problem completely. The 2015 "How Good Are Query Optimizers, Really?" paper (Leis et al.) documented that even mature optimisers produce dramatically wrong estimates for complex queries.

Modern adaptive systems (Microsoft SQL Server's adaptive query processing, Oracle's adaptive plans) re-plan during execution when estimates prove wrong.

Databases · Optimisation— xxvii —
Indexes & shardingXXVIII
┌── operations ──

/* chapter xxvi */Indexes, sharding, replication.

Indexes are auxiliary data structures that speed up reads at the cost of slowed writes and additional storage. The standard kinds: B-tree indexes for range and equality queries; hash indexes for equality; bitmap indexes for low-cardinality columns; inverted indexes for full-text and array search; spatial indexes (R-trees, geohash) for geographic queries; vector indexes (HNSW, IVF) for similarity search.
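The read/write trade is easy to watch in SQLite, which ships in Python's standard library. A small demonstration (table, column, and index names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (email, age) VALUES (?, ?)",
                 [(f"u{i}@example.com", i % 80) for i in range(1000)])

query = "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'u42@example.com'"

# No index on email yet: the planner falls back to a full table scan.
plan_before = conn.execute(query).fetchall()

# A B-tree index serves both equality and range predicates on email.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = conn.execute(query).fetchall()

print(plan_before)   # detail column reads like a SCAN of users
print(plan_after)    # a SEARCH using idx_users_email instead
```

The same `CREATE INDEX` that turned the scan into a search also made every future insert pay for a second B-tree update — the trade stated above, visible in two lines of planner output.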

Sharding — partitioning data across multiple machines by some key — is the standard approach to scaling beyond a single machine's capacity. Hash sharding distributes evenly but loses range-query locality; range sharding preserves locality but can produce hot shards; consistent hashing reduces the cost of adding or removing nodes.
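Consistent hashing can be sketched as a ring of virtual nodes. An illustrative toy (class and node names invented; production systems layer replication and explicit rebalancing on top):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: keys map to the next node clockwise on a ring."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                     # sorted (position, node) pairs
        for node in nodes:
            self.add(node, vnodes)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node, vnodes=100):
        # Virtual nodes spread each physical node around the ring,
        # smoothing the load and the rebalancing on membership changes.
        for i in range(vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))   # first vnode clockwise
        return self.ring[i % len(self.ring)][1]
```

Removing a node reassigns only the keys that lived on it — every other key's clockwise successor is unchanged. That is precisely the "reduced cost of adding or removing nodes" claimed above, and the property naive `hash(key) % n` sharding lacks.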

Replication exists on top of sharding. Each shard typically has multiple replicas for durability and availability. The replication protocol — synchronous or asynchronous, single-leader or multi-leader — determines the consistency semantics.
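One classic point in that design space is quorum replication: write to w of n replicas, read from r, and require w + r > n so that every read set overlaps the latest write set. A toy sketch of the arithmetic (invented class name; it ignores failures, concurrent-write conflicts, and read repair):

```python
import random

class QuorumKV:
    """Toy quorum-replicated store over n replicas.

    Writes land on w replicas with a monotonically increasing version;
    reads query r replicas and return the highest-version value seen.
    """

    def __init__(self, n, w, r):
        assert w + r > n, "strict quorum requires w + r > n"
        self.replicas = [dict() for _ in range(n)]
        self.n, self.w, self.r = n, w, r
        self.version = 0

    def write(self, key, value):
        self.version += 1
        for i in random.sample(range(self.n), self.w):
            self.replicas[i][key] = (self.version, value)

    def read(self, key):
        hits = [self.replicas[i].get(key)
                for i in random.sample(range(self.n), self.r)]
        hits = [h for h in hits if h is not None]
        return max(hits)[1] if hits else None    # highest version wins
```

Because any r replicas intersect the w that took the latest write, the read always sees it regardless of which replicas are sampled; drop to w = r = 1 with n = 3 — the asynchronous, eventually consistent end of the spectrum — and the assertion in the constructor is exactly what you give up.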

The combinations get intricate. Production systems are stacks of choices about indexes, partitions, replication, and isolation, each with operational consequences.

Databases · Operations— xxviii —
OLAP at scaleXXIX
┌── 2010s+ ──

/* chapter xxvii */Snowflake, BigQuery, Databricks.

The 2010s reshaped the analytical-database market. Three vendors emerged as the dominant cloud-era warehouses, each with a distinct architectural bet.

Snowflake (founded 2012 by ex-Oracle engineers Benoit Dageville and Thierry Cruanes; IPO 2020) separated storage from compute, allowed multiple isolated compute clusters to access shared cloud-object-storage data, and made warehousing feel like a managed service in a way Teradata and Vertica never did. The "data cloud" framing and zero-copy data sharing were architectural innovations as well as marketing.

Google BigQuery (2010, descended from Google's internal Dremel system; Dremel paper 2010) was the first widely available serverless analytical database. The user submits SQL; Google runs it; the user pays per byte scanned. The simplicity hides extraordinary engineering.

Databricks (founded 2013 by the creators of Apache Spark; preparing an IPO at the time of writing) evolved from a Spark-as-a-service offering into the leading lakehouse vendor. The 2020 acquisition of Redash and the 2022 launch of Delta Live Tables, combined with strong ML/AI integration, made Databricks the principal alternative to Snowflake for organisations that wanted to treat their data warehouse and ML platform as a single system.

Databases · OLAP at scale— xxix —
State of the fieldXXX
┌── 2024+ ──

/* chapter xxviii */Where things stand.

The contemporary landscape, briefly: Postgres has emerged as the closest thing to a default operational database, absorbing vector search, time-series workloads, and JSON-document use cases. Snowflake, BigQuery, and Databricks are the dominant analytical platforms. SQLite is the world's most-deployed database (it ships in every smartphone). Redis is the default cache. Cassandra, MongoDB, and DynamoDB have stable shares of the operational NoSQL market.

The architectural trends: separation of storage and compute (cloud warehouses, lakehouses, even some operational stores); table formats as a layer (Iceberg, Delta, Hudi); SQL re-asserting itself as the universal query language across all categories; hybrid transactional-analytical processing (HTAP) approaches blurring the OLTP/OLAP line; AI-aware features (vector search, generative-AI integration) becoming standard rather than specialised.

The unfashionable observation: most of what makes a serious database serious — the optimiser, the transaction manager, the storage engine, the replication protocol — was substantially worked out by 1990. The 2020s have reshaped deployment, scale, and ergonomics. The intellectual core is mostly Codd, Gray, and Selinger.

Databases · State of the field— xxx —
Reading listXXXI
┌── reading list ──

/* chapter xxix */Twenty-five works.

Databases · Reading list— xxxi —
Watch & ReadXXXII
┌── watch ──

/* chapter xxx */Watch & read.

↑ MIT 6.824 · Cloud Replicated DB, Aurora

More on YouTube

Watch · PostgreSQL in 100 Seconds
Watch · SQL vs. NoSQL: What's the difference?

If you read three books

Kleppmann's Designing Data-Intensive Applications (the modern bible), Gray and Reuter's Transaction Processing (the classic foundations), and Petrov's Database Internals (the engine-builder's view). Then re-read Codd's 1970 paper. It is eleven pages and remains the most consequential thing ever written about data.

Databases · Watch & Read— xxxii —
ColophonXXXIII

// end of deck

Databases — Volume III, Deck 12 of The Deck Catalog. Set in ui-monospace headings with Source Serif Pro body. Dark #0e0e14 with lime #7adb6e and cyan #5fd3ff accents.

Thirty-two leaves on six decades of how we store and query structured data. The intellectual core is older than the labels make it look.

[ EOF ]
