
The long path from neuron to network.

A 70-year arc that stalled twice, then accelerated past most of its critics. Below: dates, names, and the equations that built modern AI.

Figure 1. Procedurally seeded image, picsum.photos. Decorative.

1943–1958: The neuron, formalized.

In 1943, Warren McCulloch and Walter Pitts proposed a binary threshold model of the neuron — a logic gate with weighted inputs. Fifteen years later Frank Rosenblatt built the Mark I Perceptron at the Cornell Aeronautical Laboratory, a 400-photocell machine that could learn to distinguish marked cards. The New York Times announced an "embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself."

Figure 2. The Rosenblatt perceptron: y = step(Σ wᵢxᵢ + b).

The two AI winters.

In 1969 Marvin Minsky and Seymour Papert published Perceptrons, proving the single-layer model could not learn XOR. Funding dried up. A second winter followed in the late 1980s and early 1990s when expert systems failed to scale economically.

"There is no reason to suppose that any of these virtues carry over to the many-layered version." — Minsky & Papert, Perceptrons, 1969 (later revised)

1986: Backpropagation, popularized.

Rumelhart, Hinton, and Williams' Nature paper "Learning representations by back-propagating errors" showed that gradient descent, with errors propagated backward through the chain rule, could train multi-layer networks. The math had been derived by Seppo Linnainmaa in 1970 and applied to neural networks by Paul Werbos in 1974 — but the 1986 paper made it stick.

# a tiny pure-python sketch: one weight, one sample
x, y, w, lr = 2.0, 10.0, 0.0, 0.05
for epoch in range(100):
    y_hat = w * x                  # forward pass
    loss  = (y_hat - y) ** 2       # squared error
    grad  = 2 * (y_hat - y) * x    # chain rule: dloss/dw
    w    -= lr * grad              # gradient descent step
# w converges to y / x = 5.0
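As a concrete instance of that training loop, a two-layer network trained with hand-derived backprop can learn XOR, the very function that defeated the single-layer perceptron. Hidden width, learning rate, and step count below are arbitrary choices — a sketch, not anyone's reference implementation:

```python
import numpy as np

# Two-layer network + hand-derived backprop learning XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(5000):
    h = np.tanh(X @ W1 + b1)                   # forward: hidden layer
    y_hat = sigmoid(h @ W2 + b2)               # forward: output layer
    d_out = (y_hat - y) * y_hat * (1 - y_hat)  # error * sigmoid' (consts folded into lr)
    d_h = (d_out @ W2.T) * (1 - h ** 2)        # chain rule back through tanh
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(0)   # gradient descent
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(0)

print((y_hat > 0.5).astype(int).ravel())       # thresholded predictions
```

The four lines of the backward pass are the whole trick: each layer's gradient is the next layer's gradient pushed through that layer's local derivative.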

Convolutions and the GPU.

Yann LeCun's LeNet-5 (1998) read postal codes with convolutional layers — local receptive fields, weight sharing, pooling. The technique waited for hardware: in 2009 Raina, Madhavan, and Ng showed GPUs could train deep networks 70× faster than CPUs.

Figure 3. The classical convolutional pipeline (LeNet-5 family): 32×32 input → 5×5 conv → 2×2 pool → conv → pool → fully connected → softmax.
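Both ingredients — local receptive fields and weight sharing — are visible in a naive implementation of the core operation (written as cross-correlation, the way deep-learning libraries actually compute it). The example input and kernel are illustrative:

```python
import numpy as np

# Naive valid-mode 2D "convolution" (cross-correlation):
# one shared kernel slides over the whole input.
def conv2d(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # same weights at every location
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])   # horizontal difference kernel
print(conv2d(img, edge))         # every entry is -1.0: constant gradient
```

Weight sharing is why a conv layer has a handful of kernel parameters instead of one weight per pixel pair — the same economy that let LeNet-5 train on 1998 hardware.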

2012: AlexNet and the spark.

Krizhevsky, Sutskever, and Hinton's AlexNet cut the ImageNet top-5 error rate to 15.3%, against 26.2% for the runner-up. Two NVIDIA GTX 580s, ReLU activations, dropout, and 60M parameters. The result was so far ahead of the field that the deep-learning revolution effectively dates from this paper.

Year    Top-5 error    Model
2010    28.2%          NEC-UIUC (SIFT + SVM)
2011    25.8%          Xerox
2012    15.3%          AlexNet
2014    6.7%           GoogLeNet
2015    3.6%           ResNet-152

2017: Attention is all you need.

Vaswani et al. dropped recurrence entirely. Self-attention computes a weighted average of values, with weights from scaled dot-products of queries and keys.

Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V

Parallelizable across sequence positions, the transformer scaled to GPT-3's 175B parameters by 2020 and beyond. Every modern frontier model — GPT, Gemini, Claude, Llama — is a transformer or close descendant.
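The formula transcribes into a few lines of NumPy; the shapes here (four positions, model width 8) are arbitrary illustrative choices:

```python
import numpy as np

# Scaled dot-product attention, transcribing the formula above.
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted average of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Every position's output is computed from the same two matrix multiplies, which is what makes the operation parallel across the sequence — no recurrence to unroll.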

FIG. 2
Neural network.
Neural networks — the dominant ML architecture since ~2010. Deep-learning milestones: AlexNet (2012), transformers (2017), ChatGPT (2022).

Scaling laws.

Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") found loss falls as a power law in parameters, data, and compute. The Chinchilla update: for a fixed compute budget, you want roughly equal scaling of parameters and tokens (~20 tokens per parameter).
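Those ratios turn into simple arithmetic under the standard approximation that training costs about 6 FLOPs per parameter per token. A sketch — the helper name and the budget figure are illustrative:

```python
# Chinchilla-style allocation sketch, using the common approximation
# C = 6 * N * D for training FLOPs.
def chinchilla_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens) at the given ratio."""
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    n = (flops / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

# Roughly Chinchilla's own budget (~5.8e23 FLOPs):
n, d = chinchilla_optimal(5.8e23)
print(f"{n:.2e} params, {d:.2e} tokens")  # ~7.0e10 params, ~1.4e12 tokens
```

The output lands near Chinchilla's actual configuration (70B parameters, 1.4T tokens) — the model that showed GPT-3-era networks were oversized and undertrained for their compute.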

Figure 4. Stylized loss vs. log-compute (Kaplan/Hoffmann scaling); GPT-2, GPT-3, PaLM, and '24 frontier models sit along the curve.

The modern LLM stack.

Pretraining

Self-supervised next-token prediction on web text, code, books, and licensed corpora. Trillions of tokens.

SFT

Supervised fine-tuning on curated demonstrations. Teaches the model the desired output format and tone.

RLHF / RLAIF

Reinforcement learning from human or AI preferences. PPO, DPO, or constitutional methods.

Inference

KV-cache, speculative decoding, quantization, MoE routing. The serving layer is now a research field of its own.
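Of those serving tricks, the KV-cache is the simplest to sketch: keys and values for past tokens are computed once and appended, so each decode step projects only the newest token instead of re-encoding the whole prefix. The single-head setup, width, and random projections below are illustrative:

```python
import numpy as np

# KV-cache sketch for autoregressive decoding.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x):
    """Attend the newest token (embedding x) over the cached history."""
    K_cache.append(x @ Wk)       # project the new token once, keep it
    V_cache.append(x @ Wv)
    q = x @ Wq
    K, V = np.stack(K_cache), np.stack(V_cache)
    s = K @ q / np.sqrt(d)       # scores against all cached keys
    w = np.exp(s - s.max())
    w /= w.sum()                 # softmax
    return w @ V                 # attention output for this token

for _ in range(5):
    out = decode_step(rng.normal(size=d))
print(out.shape)  # (8,)
```

Without the cache, every step would redo the key/value projections for the entire prefix; with it, per-step projection cost is constant and only the attention itself grows with context length.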

FIG. 3
OpenAI.
OpenAI — the lab that brought LLMs to mass attention. ChatGPT (2022) reframed AI's cultural and economic position.

Multimodal & tool use.

CLIP (2021) tied images and text into a shared embedding space. By 2024 frontier models were natively multimodal: text, images, and audio in; text, images, audio, and video out. Tool use — function calling, browsing, code execution — turned chatbots into agents that can act.

CLIP · DALL-E · Sora · Gemini · Claude

Agents.

An agent is a model in a loop with tools and memory. The 2025–2026 wave — Claude with computer use, OpenAI Operator, Devin, AutoGPT descendants — pushed reliability past the threshold for real work: software engineering, research, customer support, ops.

history, done = [], False
while not done:
    obs = env.observe()
    thought, action = model(obs, history)    # reason, then choose an action
    result, done = env.act(action)           # tool call, browse, run code
    history.append((obs, thought, action, result))

Alignment & safety.

The technical problem: train a system whose behavior matches human intent across distribution shift. Key concepts include reward hacking, deceptive alignment, eval-gaming, and scalable oversight. The field draws from RL, mechanistic interpretability, and formal verification.

"The genie does what you ask, not what you want."— folk maxim of the alignment community


Open problems