The long path from neuron to network.
An 80-year arc that stalled twice, then accelerated past most of its critics. Below: dates, names, and the equations that built modern AI.
1943–1958: The neuron, formalized.
In 1943, Warren McCulloch and Walter Pitts proposed a binary threshold model of the neuron — a logic gate with weighted inputs. Fifteen years later Frank Rosenblatt built the Mark I Perceptron at the Cornell Aeronautical Laboratory, a 400-photocell machine that could learn to distinguish marked cards. The New York Times announced an "embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself."
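The model itself fits in a few lines. A minimal sketch (the weights and thresholds here are illustrative choices, not values from the 1943 paper): a unit fires iff the weighted sum of its inputs reaches a threshold, which is enough to implement basic logic gates.

def mcp_neuron(inputs, weights, threshold):
    # McCulloch-Pitts unit: fire (1) iff the weighted input sum
    # reaches the threshold, else stay silent (0)
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

# AND and OR as threshold units (illustrative parameters)
AND = lambda a, b: mcp_neuron((a, b), (1, 1), 2)
OR = lambda a, b: mcp_neuron((a, b), (1, 1), 1)
assert [AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 0, 0, 1]
assert [OR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 1]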
The two AI winters.
In 1969 Marvin Minsky and Seymour Papert published Perceptrons, proving that the single-layer model cannot even represent XOR, or any other function that is not linearly separable. Funding dried up. A second winter followed in the late 1980s and early 1990s, when expert systems failed to scale economically.
"There is no reason to suppose that any of these virtues carry over to the many-layered version." — Minsky & Papert, Perceptrons, 1969 (later revised)
1986: Backpropagation, popularized.
Rumelhart, Hinton, and Williams' Nature paper "Learning representations by back-propagating errors" showed that gradient descent, with errors propagated backward through the layers via the chain rule, could train multi-layer networks. The math had been derived by Seppo Linnainmaa in 1970 and applied to neural networks by Paul Werbos in 1974, but the 1986 paper made it stick.
# a tiny numpy sketch: a two-layer net trained by backpropagation
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2)); y = X[:, :1] * X[:, 1:]      # toy target: x1 * x2
W1, W2 = 0.5 * rng.normal(size=(2, 8)), 0.5 * rng.normal(size=(8, 1))
for epoch in range(500):
    h = np.tanh(X @ W1)                       # forward: hidden layer
    y_hat = h @ W2                            # forward: output layer
    d_out = 2 * (y_hat - y) / len(X)          # dL/dy_hat for MSE loss
    g2 = h.T @ d_out                          # chain rule into W2
    g1 = X.T @ ((d_out @ W2.T) * (1 - h**2))  # ...and through tanh into W1
    W1 -= 0.05 * g1; W2 -= 0.05 * g2          # gradient descent step
Convolutions and the GPU.
Yann LeCun's LeNet-5 (1998) read handwritten digits on mail and bank checks using convolutional layers built from local receptive fields, weight sharing, and pooling. The technique waited for hardware: in 2009, Raina, Madhavan, and Ng showed GPUs could train deep networks up to 70× faster than CPUs.
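The core operation in miniature (valid-mode cross-correlation, which is what deep-learning "convolution" layers actually compute): one small kernel slides over every image location, giving local receptive fields and weight sharing for free. A sketch:

import numpy as np

def conv2d(image, kernel):
    # slide one shared kernel over every (i, j) location: weight sharing
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output pixel sees only a kh x kw local receptive field
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

edges = conv2d(np.eye(8), np.array([[1.0, -1.0]]))  # toy edge-detecting filter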
2012: AlexNet and the spark.
Krizhevsky, Sutskever, and Hinton's AlexNet dropped the ImageNet top-5 error rate to 15.3%, more than ten points below the previous year's best. Two NVIDIA GTX 580s, ReLU activations, dropout, and 60M parameters (both tricks are sketched after the table below). The result was so far ahead of the field that the deep-learning revolution effectively dates from this paper.
| Year | Top-5 error | Model |
|---|---|---|
| 2010 | 28.2% | NEC-UIUC (SIFT + SVM) |
| 2011 | 25.8% | Xerox |
| 2012 | 15.3% | AlexNet |
| 2014 | 6.7% | GoogLeNet |
| 2015 | 3.6% | ResNet-152 |
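The two AlexNet ingredients named above, as a minimal numpy sketch. The 0.5 drop rate matches AlexNet's fully connected layers; the inverted-dropout rescaling shown here is the modern convention, not AlexNet's exact formulation:

import numpy as np
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)          # max(0, x): cheap and non-saturating

def dropout(x, p=0.5, training=True):
    if not training:
        return x                       # no-op at inference time
    mask = rng.random(x.shape) >= p    # drop each unit with probability p
    return x * mask / (1.0 - p)        # inverted dropout: rescale survivors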
2017: Attention is all you need.
Vaswani et al. dropped recurrence entirely. Self-attention computes a weighted average of values, with weights from scaled dot-products of queries and keys.
Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V
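That formula maps directly to a few lines of numpy (single head, no masking or learned projections; a sketch, not a full transformer layer):

import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot-products
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)              # softmax over keys
    return w @ V                                    # weighted average of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))  # 5 positions, d_k = 8
out = attention(Q, K, V)                               # shape (5, 8)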
Parallelizable across sequence positions, the transformer scaled fast: GPT-3 reached 175B parameters by 2020. Every modern frontier model (GPT, Gemini, Claude, Llama) is a transformer or a close descendant.
Scaling laws.
Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") found loss falls as a power law in parameters, data, and compute. The Chinchilla update: for a fixed compute budget, you want roughly equal scaling of parameters and tokens (~20 tokens per parameter).
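Concretely, using the standard C ≈ 6ND approximation for training FLOPs (a rule of thumb, not the papers' full fitted laws) together with the ~20:1 token-to-parameter ratio, a fixed compute budget pins down both N and D:

import math

C = 5.9e23                    # training FLOPs budget (illustrative value)
# C ≈ 6 * N * D and D ≈ 20 * N  =>  C ≈ 120 * N^2
N = math.sqrt(C / 120)        # compute-optimal parameter count
D = 20 * N                    # compute-optimal token count
print(f"{N:.2e} params, {D:.2e} tokens")  # ~7.0e10 and ~1.4e12:
# roughly Chinchilla's own 70B-parameter, 1.4T-token configuration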
The modern LLM stack.
Pretraining: self-supervised next-token prediction on web text, code, books, and licensed corpora. Trillions of tokens.
SFT: supervised fine-tuning on curated demonstrations. Teaches the model the desired output format and tone.
RLHF / RLAIF: reinforcement learning from human or AI preferences, via PPO, DPO, or constitutional methods.
Inference: KV-cache, speculative decoding, quantization, MoE routing. The serving layer is now a research field of its own (see the sketch below).
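Of those serving tricks, the KV-cache is the simplest to sketch: each decode step projects only the newest token's key and value and appends them to a cache, so attending over the whole history never recomputes old projections. A single-head toy with made-up random projections, not any production implementation:

import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
K_cache, V_cache = [], []        # grows by one entry per generated token

def decode_step(x_t):
    # project only the NEW token; past keys/values come from the cache
    K_cache.append(x_t @ Wk)
    V_cache.append(x_t @ Wv)
    q, K, V = x_t @ Wq, np.stack(K_cache), np.stack(V_cache)
    s = q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()   # softmax over all cached keys
    return w @ V                            # attention output for this step

for _ in range(5):                          # 5 decode steps; cache grows to 5
    out = decode_step(rng.normal(size=d))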
Multimodal & tool use.
CLIP (2021) tied images and text into a shared embedding space. By 2024 frontier models were natively multimodal: text in, text-image-audio-video out. Tool use — function calling, browsing, code execution — turned chatbots into agents that can act.
CLIP · DALL-E · Sora · Gemini · Claude
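The shared-space idea in miniature: embed both modalities, L2-normalize, and train so that matched pairs dominate. A sketch of CLIP-style contrastive similarity with random stand-in features rather than real encoders (the 0.07 temperature is CLIP's initial value):

import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 32                                # batch of 4 image-text pairs
img = rng.normal(size=(B, d))               # stand-ins for encoder outputs
txt = rng.normal(size=(B, d))
img /= np.linalg.norm(img, axis=1, keepdims=True)   # L2-normalize rows
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
logits = img @ txt.T / 0.07                 # cosine similarity / temperature
# contrastive target: logits[i, i] (the matched pair) should dominate row i
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
loss = -np.log(np.diag(probs)).mean()       # image-to-text cross-entropy half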
Agents.
An agent is a model in a loop with tools and memory. The 2025–2026 wave — Claude with computer use, OpenAI Operator, Devin, AutoGPT descendants — pushed reliability past the threshold for real work: software engineering, research, customer support, ops.
obs, history, done = env.reset(), [], False
while not done:
    thought, action = model(obs, history)   # reason, then choose a tool call
    obs, done = env.act(action)             # execute the action, observe result
    history.append((thought, action, obs))  # memory for the next step
Alignment & safety.
The technical problem: train a system whose behavior matches human intent across distribution shift. Key concepts include reward hacking (an RL boat-racing agent famously learned to circle and collect bonus targets instead of finishing the race), deceptive alignment, eval-gaming, and scalable oversight. The field draws from RL, mechanistic interpretability, and formal verification.
"The genie does what you ask, not what you want."— folk maxim of the alignment community
Watch this.
Open problems:
- Sample-efficient continual learning without catastrophic forgetting.
- Robust mechanistic interpretability of large transformers.
- Scalable oversight of superhuman models.
- Energy and water cost of inference at planetary scale.