AI / HISTORY
01 / 13
A 13-slide history

AI
The road to general intelligence.

From the first artificial neuron sketched in 1943 to large language models trained on most of the public internet — eight decades of one idea, scaled.

Press to begin  ·  F for fullscreen
Origin
1943

McCulloch & Pitts: the artificial neuron.

Neurophysiologist Warren McCulloch and logician Walter Pitts publish "A Logical Calculus of the Ideas Immanent in Nervous Activity." They prove that simple threshold units, wired together, can compute any logical proposition.

  • Binary inputs → weighted sum → threshold → binary output.
  • Networks of these units, given unbounded memory, can simulate any Turing machine.
  • The conceptual seed of every neural net that follows.

A single threshold unit

If Σ w·x ≥ θ, fire. Otherwise, stay silent.
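The threshold rule above fits in a few lines. A minimal sketch — the weights and thresholds are illustrative choices, not values from the 1943 paper:

```python
# A McCulloch-Pitts unit: binary inputs, fixed weights, hard threshold.

def mp_neuron(inputs, weights, threshold):
    """Fire (1) if the weighted sum of inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# With weights (1, 1) and threshold 2, the unit computes logical AND.
assert mp_neuron((1, 1), (1, 1), 2) == 1
assert mp_neuron((1, 0), (1, 1), 2) == 0

# Lower the threshold to 1 and the same weights give logical OR.
assert mp_neuron((0, 1), (1, 1), 1) == 1
```

Different logical gates fall out of the same unit just by changing the weights and threshold — that interchangeability is what made the 1943 result feel like a calculus.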

Naming the field
1956

The Dartmouth Workshop coins "artificial intelligence."

A summer gathering at Dartmouth College, organized by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon, declared a new field. The proposal claimed that "every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it."

Premise

Intelligence as computation.

Attendees

10 researchers, 8 weeks, no guarantees.

Outcome

A name, a community, and 70 years of follow-up.

Learning
1957

Rosenblatt's Perceptron: a learning machine.

Frank Rosenblatt, at Cornell Aeronautical Laboratory, builds the Mark I Perceptron — a physical machine of motors, potentiometers, and a 20×20 photocell array. Crucially, it adjusts its own weights from examples.

  • The first algorithm with a convergence proof for linearly separable data.
  • The New York Times reported a machine that would "walk, talk, see, write, reproduce itself."
  • Reality was more modest. The idea endured.

Perceptron update rule

w ← w + η(y − ŷ)x
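The update rule above, applied to a linearly separable toy problem (logical OR). The learning rate and epoch count are illustrative, not Rosenblatt's:

```python
# Perceptron learning: nudge the weights toward each misclassified example.

def train_perceptron(data, eta=0.1, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0
            # The rule from the slide: w <- w + eta * (y - y_hat) * x
            err = y - y_hat
            w = [w[0] + eta * err * x[0], w[1] + eta * err * x[1]]
            b += eta * err
    return w, b

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(or_data)
predictions = [1 if w[0]*x[0] + w[1]*x[1] + b >= 0 else 0
               for x, _ in or_data]
assert predictions == [0, 1, 1, 1]  # converges, as the proof guarantees
```

Because OR is linearly separable, the convergence theorem applies: after finitely many mistakes the weights stop moving.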

Setback
1969

Minsky & Papert show perceptron limits.
The first AI winter follows.

The book Perceptrons proves that a single-layer perceptron cannot represent XOR — or any function that is not linearly separable. The technical result was narrow. Its cultural effect was not.

  • Funding shifts toward symbolic AI and expert systems.
  • Connectionism enters a long fallow period.
  • Multilayer networks could fix this — but no one yet knew how to train them.
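The XOR limit is easy to see empirically. The sketch below brute-forces a coarse grid of weights and biases (an illustrative grid, nothing from the book): AND finds a separating line immediately, XOR never does:

```python
# No single linear threshold unit computes XOR: scan a grid of
# (w1, w2, b) triples and check all four input points.

import itertools

xor_fn = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
and_fn = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def solvable(table, grid):
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 + b >= 0 else 0) == y
               for (x1, x2), y in table.items()):
            return True
    return False

grid = [i / 4 for i in range(-8, 9)]  # -2.0 to 2.0 in steps of 0.25
assert solvable(and_fn, grid)   # linearly separable: a solution exists
assert not solvable(xor_fn, grid)  # XOR: no linear boundary works
```

The grid search only rules out this finite set of weights, but Minsky and Papert's proof rules out all of them: no line can put (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.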
Revival
1986

Backpropagation re-emerges.

Rumelhart, Hinton, and Williams publish "Learning representations by back-propagating errors" in Nature. Multilayer networks become trainable: the chain rule, applied recursively, sends a useful gradient back through every layer.

  • Hidden units learn distributed representations.
  • XOR, suddenly, is trivial.
  • The path to deep learning is now technically open — but compute is the bottleneck.

A 3-layer network
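A minimal backprop sketch, in the spirit of the 3-layer network above: a 2-3-1 sigmoid network trained on XOR. Layer sizes, learning rate, seed, and epoch count are illustrative choices:

```python
# Backpropagation by hand: the chain rule sends the error gradient
# from the output back through the hidden layer.

import math
import random

random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]  # [w1, w2, b] per hidden unit
W2 = [random.uniform(-1, 1) for _ in range(4)]                      # [v1, v2, v3, b]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
lr = 0.5

def forward(x):
    h = [sigmoid(w[0]*x[0] + w[1]*x[1] + w[2]) for w in W1]
    y = sigmoid(sum(W2[i] * h[i] for i in range(3)) + W2[3])
    return h, y

def epoch_loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in data)

before = epoch_loss()
for _ in range(5000):
    for x, y in data:
        h, y_hat = forward(x)
        # Deltas via the chain rule: output first, then hidden units.
        d_out = (y_hat - y) * y_hat * (1 - y_hat)
        d_hid = [d_out * W2[i] * h[i] * (1 - h[i]) for i in range(3)]
        # Gradient step on every weight.
        for i in range(3):
            W2[i] -= lr * d_out * h[i]
            W1[i][0] -= lr * d_hid[i] * x[0]
            W1[i][1] -= lr * d_hid[i] * x[1]
            W1[i][2] -= lr * d_hid[i]
        W2[3] -= lr * d_out

assert epoch_loss() < before  # the gradient actually reduces the error
```

The hidden layer is what the 1969 critique said was missing, and the recursive chain rule is what 1986 supplied: the same update, applied layer by layer.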

Public moment
1997

Deep Blue beats Garry Kasparov.

IBM's Deep Blue defeats the reigning world chess champion 3.5 to 2.5 in a six-game match. It is a triumph of brute-force search — 200 million positions per second — paired with hand-crafted evaluation. Symbolic AI's last great stand.

200M

positions evaluated per second.

3.5 – 2.5

match score, May 11, 1997.

0

neural networks involved.
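The Deep Blue recipe in miniature: exhaustive game-tree search down to a depth limit, with a hand-crafted evaluation at the leaves. The toy tree and leaf scores below are invented for illustration:

```python
# Minimax: I pick the move that maximizes my score, assuming the
# opponent then picks the reply that minimizes it.

def minimax(node, maximizing):
    if isinstance(node, int):       # leaf: the hand-crafted evaluation
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# A depth-2 tree: three of my moves, each with two opponent replies.
tree = [[3, 12], [2, 4], [14, 1]]
assert minimax(tree, True) == 3     # best guaranteed outcome: max of mins
```

Deep Blue's version added alpha-beta pruning, specialized hardware, and an evaluation function tuned by chess grandmasters — but the skeleton is this loop, run 200 million times a second.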

Inflection
2012

AlexNet wins ImageNet.
The deep learning era begins.

Krizhevsky, Sutskever, and Hinton train an 8-layer convolutional network on two consumer GPUs. Top-5 error on ImageNet drops from 26% to 15.3%. Within five years, every serious computer-vision system is a deep neural network.

  • GPUs turn out to be embarrassingly well suited to matrix multiplication.
  • ReLU activations and dropout fix old training pathologies.
  • The "scale + data + compute" recipe is now legible.
Architecture
2017

"Attention Is All You Need."

Vaswani et al. at Google publish the Transformer: a sequence model built entirely from self-attention, with no recurrence and no convolution. It parallelizes beautifully on GPUs, scales gracefully with data, and quietly becomes the backbone of nearly every modern AI system.

  • Self-attention: every token can look at every other token.
  • Positional encodings replace recurrence.
  • Within five years: language, vision, audio, code, biology — all Transformers.

Attention matrix (toy)

Each row: how much one token attends to every other.
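A toy version of that matrix: scaled dot-product self-attention over three 2-dimensional tokens. To keep the arithmetic visible, queries, keys, and values are the raw embeddings themselves — no learned projection matrices, which a real Transformer would have:

```python
# Self-attention: every token scores every other token, softmaxes the
# scores into weights, and takes a weighted sum of values.

import math

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # illustrative embeddings
d = len(tokens[0])

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(seq):
    """Return (attention matrix, outputs) for one self-attention pass."""
    matrix, outputs = [], []
    for q in seq:  # each token's query...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]                  # ...scored against every key
        weights = softmax(scores)                # one row of the matrix
        matrix.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, seq))
                        for j in range(d)])      # weighted sum of values
    return matrix, outputs

A, out = attention(tokens)
# Each row of A sums to 1: a distribution over which tokens to attend to.
```

Note that every token's output depends on every other token in one step — no recurrence — which is exactly what makes the computation parallelize.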

Scale
2020

GPT-3: 175B parameters and the scaling laws.

OpenAI shows that an autoregressive Transformer, trained on much of the public internet, becomes a capable few-shot learner without any task-specific fine-tuning. Kaplan et al. quantify the trend: loss falls as a clean power law in compute, data, and parameters.

  • Capability emerges from scale, not from clever architecture tweaks.
  • "Just train a bigger one" becomes a research program.
  • The economics of AI flip from labs to data centers.

Loss vs. compute (log-log)
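The plot's shape is easy to reproduce: if loss follows a power law L = a·C^(−b), it is a straight line in log-log space with slope −b. The data below is synthetic, generated from assumed constants — not Kaplan et al.'s measurements:

```python
# Fit a line to (log C, log L) and recover the power-law exponent.

import math

a, b = 10.0, 0.05                       # illustrative constants
compute = [10 ** k for k in range(3, 10)]
loss = [a * c ** (-b) for c in compute]

xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in loss]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))

assert abs(slope + b) < 1e-6  # slope ≈ -b: a clean power law
```

The practical force of the scaling laws was exactly this predictability: fit the line on small runs, then read off what a much larger run should cost and achieve.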

Public moment
2022

ChatGPT launches.
100 million users in two months.

Released as a "research preview" on November 30, 2022, ChatGPT becomes the fastest-growing consumer application in history. The breakthrough was not raw capability — GPT-3.5 had existed for months — but interface: a chat box, free, with a model tuned by RLHF to be useful and to refuse less.

5 days

to 1 million users.

2 months

to 100 million users.

$0

cost to try, at launch.

Now

The open questions.

Eight decades in, the field is louder than ever and less certain than it sounds. Three threads worth watching:

Alignment

How do you make a system that pursues the goal you actually meant — including in situations its training data did not cover?

Generality

Are LLMs a stepping stone to AGI, or a powerful but bounded technology that needs new ingredients (memory, planning, embodiment)?

Takeoff speed

Does capability accelerate smoothly, or does recursive self-improvement create a discontinuous jump? The answer changes everything else.

Further reading

Further reading and watching.

A short, opinionated list. Each YouTube link is a search query — the top results stay reasonably current.

  • "Attention Is All You Need" (Vaswani et al., 2017)  ·  arxiv.org/abs/1706.03762  ·  Watch: youtube.com / transformer attention is all you need
  • Geoffrey Hinton on the history and future of neural nets  ·  the man who refused to give up on backprop, then on deep learning, then on scale  ·  Watch: youtube.com / geoffrey hinton lecture
  • Russell & Norvig, Artificial Intelligence: A Modern Approach  ·  the standard textbook; read chapters 1, 18, and 21 first.
  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020)  ·  arxiv.org/abs/2001.08361  ·  the empirical curve that turned a research bet into a strategy.

Home to start over  ·  thank you.