Computational linguistics — from Shannon's bigrams and Chomsky's hierarchy to the GPU age. The strange convergence of statistics, syntax, and silicon that produced machines which now write, translate, and converse.
Computational linguistics is the discipline of making language tractable for machines — and, almost accidentally, of making machines that have changed what language is.
The field began with Warren Weaver's 1949 memorandum on machine translation, in which he proposed that translating Russian into English might be approached as a cryptanalytic problem: "When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
That framing — language as code, decoding as inference — set the tone for seventy-five years. Whether the model is symbolic (Chomsky's generative grammar), statistical (IBM's noisy-channel translation), or neural (the transformer), the working assumption is the same: there is a function from string to meaning, and our job is to learn it.
This deck traces the long arc. From Shannon's information theory and Chomsky's hierarchy through finite-state morphology, treebanks, statistical MT, recurrent nets, attention, and the large language model era. It ends with the question now hanging over the field: what happens when a model trained only on form starts behaving as though it understands.
Claude Shannon's A Mathematical Theory of Communication (Bell Labs, 1948) is the document at the root of every line of work that followed. Shannon defined entropy of a source, channel capacity, and the noisy-channel framework. Crucially, he treated English as a stochastic process and estimated its per-letter entropy at roughly 1.0–1.5 bits.
The 1951 follow-up — Shannon's Prediction and Entropy of Printed English — used human guessing experiments to measure the redundancy of natural language. The technique was elementary: subjects predicted each letter of a covered passage. The implication was profound: language is highly redundant, which means it is highly predictable, which means a machine that learns to predict the next token has, in some statistical sense, learned the language.
Every modern language model — n-gram, RNN, transformer — is a direct descendant of this framing. Next-token prediction is the field's foundational task, and Shannon's bigram tables are the great-grandparent of GPT-4.
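The mechanics are simple enough to fit in a few lines. A sketch of the bigram-level conditional entropy estimate (Shannon's second-order approximation), run over a placeholder string; a meaningful estimate needs megabytes of real corpus:

```python
import math
from collections import Counter

def bigram_entropy(text: str) -> float:
    """Per-letter conditional entropy H(X_n | X_{n-1}) in bits:
    Shannon's second-order approximation of printed English."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    total = sum(pairs.values())
    # H = -sum over (a,b) of P(a,b) * log2 P(b|a)
    return -sum((n / total) * math.log2(n / firsts[a])
                for (a, b), n in pairs.items())

# Placeholder text, for demonstration only.
sample = "the cat sat on the mat and the dog sat on the log " * 100
print(f"{bigram_entropy(sample):.2f} bits per letter")
```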
The early symbolic linguists viewed this with suspicion. Chomsky's 1957 Syntactic Structures argued explicitly that probability cannot capture grammaticality: colorless green ideas sleep furiously is fully grammatical and statistically vanishing. The argument shaped a generation. It also turned out, when the data scaled, to be wrong about what mattered.
Noam Chomsky's 1956 paper Three Models for the Description of Language established the formal-language hierarchy that bears his name. Four classes, nested:
Type 3 — regular languages. Recognised by finite-state automata. Match patterns; cannot count. The grammar of a single morpheme, or of phone numbers.
Type 2 — context-free languages. Recognised by pushdown automata. Match nested brackets. Most of programming-language syntax. Most (not all) of natural-language syntax. (A toy recogniser sketch follows the hierarchy.)
Type 1 — context-sensitive languages. Recognised by linear-bounded automata. Cross-serial dependencies, as in Swiss German subordinate clauses, exceed context-free power.
Type 0 — recursively enumerable languages. Recognised by Turing machines. Anything computable.
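The gap between the bottom two levels can be made concrete in a few lines of Python, with the regular language (ab)* standing in for Type 3 and aⁿbⁿ, which no finite-state machine can recognise, for Type 2:

```python
import re

# Type 3: (ab)* is regular -- a finite-state machine (here, a compiled
# regex with no counting) recognises it.
regular = re.compile(r"^(ab)*$")

# Type 2: a^n b^n requires unbounded counting, beyond any finite-state
# machine; a single counter (a degenerate stack) is enough.
def a_n_b_n(s: str) -> bool:
    depth, seen_b = 0, False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False      # an 'a' after a 'b' is out of order
            depth += 1
        elif ch == "b":
            seen_b = True
            depth -= 1
            if depth < 0:
                return False      # more b's than a's so far
        else:
            return False
    return depth == 0

print(bool(regular.match("ababab")), a_n_b_n("aaabbb"))   # True True
print(bool(regular.match("aabb")), a_n_b_n("aabb"))       # False True
```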
The hierarchy is the lingua franca for asking what kind of computation a given phenomenon requires. It still organises modern questions about transformer expressivity — recent work (Hahn 2020; Merrill 2023) suggests transformers without chain-of-thought sit within TC⁰ and cannot reliably recognise even some context-free patterns, making them formally weaker than RNNs despite their empirical superiority.
Chomsky's later programs — Government and Binding (1981), the Minimalist Program (1995) — moved further from anything implementable. But the hierarchy itself survived. It is the durable contribution.
On January 7, 1954, IBM and Georgetown University publicly demonstrated automatic Russian-to-English translation of 60 sentences using a 250-word vocabulary and six grammar rules. The press conference promised mature MT within five years.
It did not arrive. By 1966 the Automatic Language Processing Advisory Committee (ALPAC) report — commissioned by the National Academy of Sciences — concluded that machine translation was slower, less accurate, and twice as expensive as human translation, and recommended cutting funding. US MT research effectively collapsed for a decade.
The lesson, in retrospect, was not that MT was impossible. It was that purely rule-based MT was impossible at the resource level of the 1960s, and that the field had massively over-promised. Both halves repeated themselves in the 1970s symbolic AI wave and the 1980s expert-systems boom.
The next serious progress came from a different direction entirely. In 1988 the IBM Candide group — Brown, Cocke, Della Pietra, Della Pietra, Jelinek, Lafferty, Mercer, Roossin — proposed translating French to English by training statistical alignment models on the bilingual proceedings of the Canadian parliament. The Mathematics of Statistical Machine Translation (1993) is the founding paper of the statistical NLP era.
Frederick Jelinek, who led the IBM speech and MT work, said the field's most quoted line: "Every time I fire a linguist, the performance of the speech recognizer goes up." He later qualified it. The implication stood.
Mitchell Marcus and colleagues at the University of Pennsylvania released the Penn Treebank in 1993: roughly 4.5 million words of American English (its core the Wall Street Journal section), hand-annotated with part-of-speech tags and syntactic phrase-structure trees. It was the first large gold-standard dataset for English parsing.
The effect was field-shaping. Statistical parsers — Collins (1996), Charniak (1997), Klein and Manning (2003) — trained on the treebank and posted accuracy numbers that no symbolic parser had matched. By 2005 statistical parsing was standard. Rule-based parsing survived only as a teaching tool.
The pattern repeated across every NLP task. WordNet (Miller, Princeton, 1985 onward) — the lexical database. FrameNet (Fillmore, Berkeley, 1997) — frame semantics. PropBank (Palmer, 2005) — predicate-argument structure. OntoNotes (2007, multilingual). Universal Dependencies (Nivre et al., 2014) — a single dependency annotation scheme across now 100+ languages.
Every annotated dataset is also a theory of language compressed into a tagging convention. The Penn Treebank's choices — flat NP structure, traces for movement, particular treatment of coordination — embedded a dialect of generative grammar into the empirical foundation of the field. Decades of subsequent results were partly results about the Penn Treebank.
Speech recognition, more than any other NLP task, drove the statistical turn. The hidden Markov model — a finite-state machine whose states emit probability distributions over observations — is the workhorse model of 1980s–2000s speech recognition.
The Baum–Welch algorithm (1970) trains an HMM by expectation-maximisation. The Viterbi algorithm (1967, originally for convolutional decoding) finds the most-likely state sequence given an observation sequence. Together they make the HMM tractable.
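A NumPy sketch of Viterbi, with hypothetical toy parameters; log-space arithmetic avoids underflow, as real systems do:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most-likely state path for an HMM.
    obs: observation indices, length T
    pi:  initial state probabilities, shape (S,)
    A:   A[i, j] = P(next state j | state i), shape (S, S)
    B:   B[i, k] = P(observation k | state i), shape (S, V)"""
    S, T = len(pi), len(obs)
    logp = np.zeros((T, S))            # best log-prob ending in state s at time t
    back = np.zeros((T, S), dtype=int)
    logp[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = logp[t - 1][:, None] + np.log(A)   # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        logp[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logp[-1].argmax())]
    for t in range(T - 1, 0, -1):      # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 2-state, 3-symbol HMM.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))
```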
By 1990, CMU's Sphinx system, Dragon's DragonDictate, and BBN's Byblos all ran HMMs over acoustic features (typically Mel-frequency cepstral coefficients) for the acoustic model, with n-gram language models on top. The architecture survived, with refinements, until deep neural acoustic models replaced it around 2012 (Hinton et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition).
HMMs also became the standard model for part-of-speech tagging — Church (1988), Brants's TnT tagger (2000) — achieving over 96% accuracy on Penn Treebank tags before any deep model was tried. The MaxEnt taggers (Ratnaparkhi 1996) and conditional random fields (Lafferty, McCallum, Pereira 2001) extended the framework to sequence labelling generally.
The lesson of the HMM era was that probabilistic graphical models, trained from data, beat hand-written rules at almost every NLP task. The rest was implementation detail — until the implementation detail became neural.
"You shall know a word by the company it keeps." J. R. Firth's 1957 dictum — actually Wittgenstein-adjacent and earlier than Firth in spirit — became the foundation of all vector-space approaches to meaning.
The technical path from slogan to system runs through Latent Semantic Analysis (Deerwester et al., 1990), which factorised a term-document matrix using SVD to produce dense word vectors that captured rough semantic similarity. Topic models — pLSA (Hofmann 1999), Latent Dirichlet Allocation (Blei, Ng, Jordan 2003) — gave the same idea a probabilistic foundation.
The breakthrough was word2vec (Mikolov, Chen, Corrado, Dean, Google 2013). Two simple architectures — Skip-gram and CBOW — trained on the Google News corpus produced word vectors with the now-famous arithmetic property: king − man + woman ≈ queen. The vectors were not just similarity; they encoded analogy, gender, number, capital-of-country, and a long list of regularities.
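The arithmetic property is easy to demonstrate. A toy sketch with hand-set 3-dimensional vectors (real word2vec vectors are 300-dimensional and learned from billions of tokens):

```python
import numpy as np

# Hypothetical toy vectors, chosen so the analogy works exactly.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.3, 0.8]),
}

def nearest(target, exclude):
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

# king - man + woman ~= queen
target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # -> queen
```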
GloVe (Pennington, Socher, Manning, Stanford 2014) reformulated the same insight as a global matrix factorisation, slightly cleaner mathematically. Together word2vec and GloVe became the input layer of a generation of NLP systems.
The deeper claim was philosophical: that meaning, or at least an enormous amount of practical meaning, is nothing more than the structure of co-occurrence. The 2010s vindicated it. The 2020s — with contextualised embeddings — refined it. The basic insight is now field consensus.
The simple recurrent network (Elman, 1990) was the first neural architecture proposed seriously for language. A hidden state, updated at each timestep, would in principle remember arbitrary context. In practice it forgot quickly.
The vanishing gradient problem — Hochreiter's 1991 diploma thesis identified it; Bengio, Simard, and Frasconi 1994 published it — meant that gradients propagated back through long sequences shrank exponentially. The network couldn't learn long-distance dependencies.
Hochreiter and Schmidhuber's Long Short-Term Memory (1997) solved it with gating — a gated cell that could carry information across many timesteps without vanishing. LSTMs became the dominant sequence model for fifteen years. Cho et al.'s GRU (2014) was a simpler variant.
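The cell update itself is small. A NumPy sketch of one timestep (the stacking order of the gate parameters is our convention, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM timestep. W: (4H, D), U: (4H, H), b: (4H,), stacked in the
    order input gate, forget gate, output gate, candidate cell."""
    H = len(h)
    z = W @ x + U @ h + b
    i, f, o = (sigmoid(z[k*H:(k+1)*H]) for k in range(3))
    g = np.tanh(z[3*H:4*H])
    c = f * c + i * g        # additive cell path: gradients can cross
                             # many timesteps without vanishing
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)      # (4,) (4,)
```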
The 2014 sequence-to-sequence paper (Sutskever, Vinyals, Le, Google) proposed an LSTM encoder reading the source sentence into a fixed vector, then an LSTM decoder generating the target. End-to-end neural machine translation, with no phrase tables, no alignment, no separate language model. By 2016 Google had deployed it (Wu et al., GNMT) — replacing a statistical MT pipeline that had taken twenty years to build.
The fixed-vector bottleneck was the obvious flaw. Bahdanau, Cho, and Bengio's 2014 Neural Machine Translation by Jointly Learning to Align and Translate introduced attention: at each decoding step, look back at all encoder states with learned weights. The attention mechanism survived; the recurrence around it did not.
June 2017. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin — Google Brain and Google Research — submitted a paper to NIPS (as NeurIPS was then called) titled Attention Is All You Need. It described an architecture, the Transformer, with no recurrence and no convolution. Just attention, layer norm, and feedforward blocks, stacked.
The core mechanism is multi-head self-attention. Every token attends to every other token in parallel; the attention pattern is learned. Computation is O(n²) in sequence length but trivially parallelisable on GPU — unlike LSTMs, which serialise across timesteps.
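The mechanism fits in a dozen lines. A single-head NumPy sketch, with masks, dropout, and output projections omitted:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (Vaswani et al. 2017).
    X: (n, d) token vectors. One matrix multiply lets every token attend
    to every other -- the O(n^2) but fully parallel core."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (n, n) attention logits
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                      # 5 tokens, d = 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (5, 16)
```

Multi-head attention runs several such maps in parallel with separate projections and concatenates the results; causal masking for decoding simply sets the upper triangle of the score matrix to minus infinity before the softmax.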
The original transformer was an encoder-decoder for machine translation, beating the previous best on WMT 2014 English-to-German. Within two years the architecture had absorbed essentially all of NLP:
BERT (Devlin, Chang, Lee, Toutanova, Google, 2018). Encoder-only. Pre-trained on masked language modelling and next-sentence prediction; fine-tuned for downstream tasks. SOTA on the GLUE benchmark on day one.
GPT (Radford et al., OpenAI, 2018). Decoder-only. Pre-trained on next-token prediction; fine-tuned for tasks. The unified-architecture argument.
GPT-2 (2019). 1.5B parameters. Demonstrated zero-shot task performance — completing a prompt was, in effect, doing the task.
GPT-3 (Brown et al., OpenAI, May 2020). 175B parameters, in-context learning. The paper that turned the field upside down.
The transformer was the architecture that closed the gap between symbolic and statistical NLP, then made the symbolic side largely irrelevant.
The single most important methodological change of the 2018–2020 period was the pre-train, then fine-tune paradigm. Train a large model once, on enormous unlabelled text, with a self-supervised objective. Then specialise for any downstream task with comparatively tiny labelled data.
The shift was visible already in word2vec (pre-trained word vectors used in downstream classifiers) and consolidated by ELMo (Peters et al., 2018, contextualised embeddings from a bidirectional LSTM language model). BERT made it the standard.
The fine-tuning era lasted roughly 2018 to 2020. Then GPT-3's in-context-learning result — that a sufficiently large pre-trained model could perform new tasks from a few examples in the prompt, without any gradient updates — opened a different mode. By 2023, with instruction-tuned and RLHF-aligned models, much of NLP had moved away from task-specific fine-tuning entirely. You wrote a prompt.
The 2024–2026 picture is more layered. Pre-training on web-scale text. Mid-training (continued training on curated and synthetic data, instruction tuning). Post-training (RLHF, DPO, RLAIF, constitutional AI, online reward modelling). The pipeline has become elaborate; the core insight — pre-train on a generative task at scale, then steer — has held.
Before a model sees text, it sees tokens. The tokenisation choice — which unit the model gets — turns out to matter enormously.
Word-level tokenisation has a vocabulary problem (millions of words, plus rare-word issues). Character-level has a sequence-length problem (everything is 4–5× longer). The compromise is subword tokenisation.
The dominant algorithm is Byte Pair Encoding (Sennrich, Haddow, Birch, 2016, for NMT — adapted from Gage 1994 compression). Start with characters; iteratively merge the most frequent pair into a new token. After ~32k merges you have a vocabulary that splits common words into one piece ("the") and rare words into several ("unhappiness" → "un", "happi", "ness").
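The merge loop fits on a page. A toy version of the Sennrich-style training procedure (real tokenisers add byte fallback, pre-tokenisation, and speed tricks):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges (Sennrich et al. 2016, toy version).
    words: Counter mapping whitespace-tokenised words to frequencies."""
    vocab = {tuple(w): n for w, n in words.items()}   # word as symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, n in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, n in vocab.items():              # apply the merge
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i+1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = n
        vocab = new_vocab
    return merges

corpus = Counter({"low": 5, "lower": 2, "newest": 6, "widest": 3})
print(bpe_merges(corpus, 5))
```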
SentencePiece (Kudo and Richardson, Google, 2018) is one of the most widely used implementations. Modern frontier models use BPE variants with vocabularies of 50k–200k tokens.
The unit problem has consequences. Tokenisation is responsible for surprising failures: arithmetic errors on numbers that get split awkwardly, character-level reasoning failures (the model can't easily count letters in a word it sees as a single token), and dramatic differences in efficiency across languages — English costs ~1.3 tokens per word; Chinese ~1.0; some low-resource languages 5+ tokens per word.
Recent work (ByT5, MEGABYTE, byte-level transformers) has revisited tokenisation entirely. The field is not yet settled. How a model sees text is still partly an open question.
In January 2020, Jared Kaplan and colleagues at OpenAI published Scaling Laws for Neural Language Models. They showed that test loss falls as a power law in three quantities: parameters, dataset size, and compute. The relationship is smooth across seven orders of magnitude.
The Kaplan paper implied (with the data they had) that you should train relatively large models on relatively little data. GPT-3 was built on this prescription: 175B parameters, 300B tokens.
In March 2022, Hoffmann et al. (DeepMind) published Training Compute-Optimal Large Language Models — the Chinchilla paper. Re-doing the scaling experiments more carefully, they showed Kaplan had under-weighted data: for a given compute budget, you should roughly equally scale parameters and tokens. Their 70B Chinchilla model trained on 1.4T tokens beat the 280B Gopher trained on 300B tokens.
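The arithmetic is compact. A back-of-envelope sketch using the common approximations C ≈ 6ND FLOPs and the Chinchilla rule of thumb of ~20 tokens per parameter (the paper fits exact coefficients; these are the folk numbers):

```python
def chinchilla_allocation(compute_flops):
    """Compute-optimal split under C ~ 6*N*D and D ~ 20*N
    (Hoffmann et al. 2022, rule-of-thumb form)."""
    n_params = (compute_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla itself: ~5.8e23 FLOPs -> ~70B params, ~1.4T tokens
n, d = chinchilla_allocation(5.8e23)
print(f"{n:.2e} params, {d:.2e} tokens")
```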
Every subsequent frontier model (LLaMA, Mistral, Claude, Gemini, GPT-4, GPT-5) has been Chinchilla-pilled. Most are trained well past Chinchilla-optimal — over-trained on data — because inference cost matters more than training cost in deployment.
The scaling laws have a strange status. They are empirical regularities, not theory. They predict average loss extremely well. They predict the emergence of specific capabilities — arithmetic, code, reasoning — much more poorly. The capability landscape inside a smoothly-falling loss curve is the field's most active open question.
Wei et al.'s 2022 paper Emergent Abilities of Large Language Models documented dozens of tasks where performance was near-zero up to some scale threshold, then jumped sharply. Multi-digit arithmetic. Word unscrambling. Some forms of logical inference. The pattern was clean enough that "emergence" became a working concept.
Schaeffer, Miranda, and Koyejo's 2023 rebuttal — Are Emergent Abilities of Large Language Models a Mirage? — argued that emergence is partly an artefact of metric choice. Discrete metrics (exact match) hide gradual underlying improvement; smooth metrics (per-token log-likelihood) reveal it.
Both papers are right about something. There are sharp thresholds in many useful metrics. The underlying improvement is often smooth. The user-visible capability — does the model do the task or not — has discontinuities that the smooth loss curve does not.
The 2023–2026 picture has added scaling-law extrapolation as a public concern. Frontier labs publish projected capabilities at GPT-5, GPT-6 scale. Policy bodies — the EU AI Act, the UK AI Safety Institute, the Bletchley Park process — increasingly take these projections as planning assumptions. The field has moved from describing what its models do to predicting what they will do.
An entire subfield — BERTology, then more broadly interpretability — emerged to ask what the parameters of a trained language model represent.
The probing literature (Tenney, Das, Pavlick 2019; Hewitt and Manning 2019) trained simple classifiers on top of frozen BERT representations to recover linguistic properties. Result: BERT's middle layers encode part-of-speech, syntactic dependency, named-entity span, semantic role labels — without ever being trained on any of these tasks.
The structural probe (Hewitt and Manning, 2019) showed that distances between BERT vectors approximate the dependency-tree distances in a sentence. Syntax, in some sense, falls out of next-token prediction at scale.
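The probing method itself is almost embarrassingly simple. A minimal sketch with synthetic stand-in data; a real probe swaps in frozen BERT activations and gold linguistic labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: in a real probe, `vectors` would be frozen BERT
# activations (e.g. layer 8, 768-dim) and `labels` gold POS tags.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 64))
labels = (vectors[:, :8].sum(axis=1) > 0).astype(int)  # a linearly encoded "property"

X_tr, X_te, y_tr, y_te = train_test_split(vectors, labels, test_size=0.2,
                                          random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# High accuracy means the property is linearly recoverable from the
# representation, even though the model was never trained on it.
```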
Mechanistic interpretability — Anthropic's 2021 A Mathematical Framework for Transformer Circuits (Elhage et al.) and the subsequent induction heads result — went further. Induction heads are a specific attention pattern (head A attends to the previous occurrence of the current token; head B copies what came after) that performs in-context pattern completion. The capability appears suddenly during training, in a sharp phase change observable even in two-layer models.
Sparse autoencoders (2023–2024) decomposed transformer activations into thousands of monosemantic features — for the Golden Gate Bridge, for code-switching, for "concept of a wedding." The subfield is moving fast. The black box is starting to be transparent in places.
Most early NLP was English NLP. The treebank was English; the benchmarks were English; the published results were English.
The multilingual turn came in stages. mBERT (Devlin, 2019) was BERT trained on the Wikipedia of 104 languages with no parallel data. It transferred surprisingly well across languages — indicating that representations were partly language-agnostic.
XLM-R (Conneau et al., Facebook 2019) and mT5 (Xue et al., Google 2020) scaled this further. The NLLB project (Meta 2022) trained a single model to translate among 200 languages including many never previously machine-translated.
The structural problem is data inequality. English has hundreds of billions of tokens of high-quality text. Mandarin, Spanish, Hindi, French, Arabic — each has tens of billions. The next 50 languages have low-billions. Below that, the data falls off a cliff.
Of the world's ~7,000 languages, perhaps 100 have substantial digital corpora. The other 6,900 are in a permanent low-resource regime. The Masakhane project (African languages, 2019 onward), the AmericasNLP shared tasks (Indigenous American languages), and similar community efforts are working against the gap. Progress is real but uneven.
The largest 2026 frontier models are usefully fluent in 30–40 languages and partially competent in many more. The languages they don't speak well — many of them with millions of speakers — remain second-class citizens of the new linguistic infrastructure.
By 2026, machine translation between major language pairs is, for most practical purposes, solved at the sentence level. Google Translate, DeepL, Microsoft Translator, and the open-source NLLB-200 produce output that for English↔French, English↔Spanish, English↔Chinese is often indistinguishable from competent human translation on news and technical text.
The remaining problems are document-level coherence (consistent terminology, correct anaphora, register matching across long passages), low-resource pairs, and creative-text translation (poetry, literary prose, marketing copy with cultural register). Human translators still own the high end of the latter category.
The progression of MT metrics tells the story. BLEU (Papineni et al., 2002) — n-gram overlap with reference. Used for two decades; correlates with human judgement only loosely. METEOR (2005), chrF (2015) — refinements. BLEURT (2020), COMET (2020) — learned metrics that fine-tune BERT-class models on human ratings, correlating much more strongly. GEMBA (2023) — directly prompt a frontier LLM as the metric.
The metric problem matters because it gates research progress. For a long time MT optimised BLEU; for the last five years it has optimised something closer to actual translation quality.
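For concreteness, BLEU's core computation — clipped n-gram precision times a brevity penalty — fits in a few lines. A simplified single-reference, unsmoothed sketch:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference (Papineni et al. 2002),
    simplified: no smoothing, single reference."""
    c, r = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(c[i:i+n]) for i in range(len(c) - n + 1))
        ref = Counter(tuple(r[i:i+n]) for i in range(len(r) - n + 1))
        overlap = sum(min(k, ref[g]) for g, k in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    bp = min(1.0, math.exp(1 - len(r) / len(c)))   # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

print(f"{bleu('the cat sat on the mat', 'the cat sat on a mat'):.3f}")
```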
Speech recognition tracked NLP's trajectory with a few years' lag. Deep neural acoustic models (2012). End-to-end attention-based ASR (Chan et al. 2016, Listen Attend and Spell). Connectionist temporal classification (Graves et al. 2006) for alignment-free training.
The 2022 Whisper paper (Radford et al., OpenAI) was a marker. A 1.5B-parameter encoder-decoder transformer trained on 680,000 hours of weakly-supervised multilingual web audio matched supervised ASR systems on standard benchmarks while transferring zero-shot to new languages and accents. Open-sourced, it became the default ASR almost overnight.
Speech synthesis has gone the same way. Tacotron (Wang et al., 2017), WaveNet (van den Oord et al., 2016), and now neural-codec language models (VALL-E, AudioLM, Suno, ElevenLabs' models, Sesame's CSM-1B in 2025) treat audio as a token stream and language-model it directly. The audio token is now a unit alongside the text token.
The 2024–2026 generation of multimodal models — GPT-4o, Gemini 2, Claude with voice — handle text, audio, and image as a unified token stream. The historical separation of speech and text NLP has effectively ended.
Wei et al.'s 2022 paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models showed that prompting a sufficiently large model with worked-example reasoning steps caused it to produce its own reasoning, dramatically improving performance on math and logical tasks.
The follow-up was the scratchpad family of techniques: tree-of-thought (Yao et al. 2023), self-consistency, ReAct, reflection. The pattern was that giving a model more tokens to think — at inference time — was a separate axis of improvement orthogonal to model size.
OpenAI's o1 (September 2024) and o3 (December 2024) productionised this. The model is trained with reinforcement learning to use a long internal "reasoning" trace before producing the final answer. On math and coding benchmarks the result is dramatic: AIME, Codeforces, FrontierMath move into ranges that earlier non-reasoning models did not approach.
DeepSeek-R1 (January 2025) replicated the architecture publicly. Anthropic's Claude 3.7 Sonnet (early 2025) with extended thinking offered another implementation.
The reasoning-model paradigm has changed the cost economics — frontier reasoning is expensive in tokens — and changed the research agenda. Inference-time compute is now a primary scaling axis, alongside training compute.
A pre-trained language model is not, by default, useful or polite. It will produce continuations of the prompt distribution, which on web-scale text includes toxic, false, and unhelpful continuations. Making the model behave is a separate stage.
The dominant technique is Reinforcement Learning from Human Feedback (Christiano et al., 2017; Stiennon et al., OpenAI 2020 for summarisation; Ouyang et al., InstructGPT, 2022).
The loop, in compressed form: collect prompts; have the model generate responses; have humans rank them; train a reward model on the rankings; use the reward model to fine-tune the LLM via PPO (proximal policy optimisation). InstructGPT showed that a 1.3B-parameter RLHF model was preferred over the 175B base GPT-3 by humans on most prompts.
Direct Preference Optimisation (Rafailov et al., 2023) bypasses the explicit reward model — train directly on the preference data with a contrastive loss. RLAIF (Bai et al., Anthropic 2022, Constitutional AI) uses an AI rather than humans for some preference judgements.
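A minimal PyTorch sketch of the DPO objective, assuming the per-response log-probabilities have already been summed over tokens (variable names are ours):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimisation loss (Rafailov et al. 2023).
    logp_*: summed log-probs of the chosen (w) and rejected (l) responses
    under the policy; ref_logp_*: same under the frozen reference model."""
    chosen_margin = logp_w - ref_logp_w      # how much the policy upweights the winner
    rejected_margin = logp_l - ref_logp_l    # ... and the loser
    # Push the winner's margin above the loser's, scaled by beta.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of two preference pairs.
loss = dpo_loss(torch.tensor([-10.0, -8.0]), torch.tensor([-12.0, -9.0]),
                torch.tensor([-11.0, -8.5]), torch.tensor([-11.0, -8.5]))
print(loss.item())
```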
Alignment is now a multi-stage pipeline: instruction-tune on demonstrations, RLHF on preferences, iterate with red-teaming. Constitutional principles (Anthropic), spec-based training (OpenAI's Model Spec), and explicit reasoning about model behaviour have moved the field from "make the model nice" to a more architectural problem.
A pure language model is a function from text to text. Wrap it with the ability to call external tools — a calculator, a search engine, a code interpreter, a database — and you have an agent.
The seed work was Toolformer (Schick et al., Meta 2023), which trained a model to insert API calls in its own outputs. ReAct (Yao et al., 2022) interleaved reasoning steps with tool calls. The OpenAI function-calling API (June 2023) made tool use a first-class feature.
By 2024 the picture was: every frontier LLM ships with structured tool use. Anthropic's Model Context Protocol (November 2024) standardised the connection between LLMs and external tools. The "agent" — a loop of model-thinks, model-calls-tool, tool-returns, model-continues — has become the default deployment pattern.
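The loop itself is short. A self-contained sketch with a stubbed model call and a toy calculator tool; the message shapes follow the common function-calling pattern, not any one vendor's API:

```python
import json

def call_model(messages):
    """Stand-in for a real LLM API call. This stub asks for the
    calculator once, then answers from the tool result."""
    if any(m["role"] == "tool" for m in messages):
        return {"content": "The answer is " + messages[-1]["content"],
                "tool_call": None}
    return {"content": None,
            "tool_call": {"name": "calculator", "arguments": "2 + 2 * 10"}}

TOOLS = {"calculator": lambda expr: str(eval(expr))}  # toy only: never eval untrusted input

def run_agent(user_message, max_steps=10):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["tool_call"] is None:
            return reply["content"]              # model chose to answer
        call = reply["tool_call"]
        result = TOOLS[call["name"]](call["arguments"])
        messages.append({"role": "assistant", "content": json.dumps(call)})
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(run_agent("What is 2 + 2 * 10?"))
```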
What this looks like in practice: code-writing agents (Cursor, GitHub Copilot Workspace, Claude Code, OpenAI's Codex, Devin). Browser-using agents (Anthropic's computer use, OpenAI's Operator). Research agents (OpenAI Deep Research, Google Gemini's research mode). Each is a thin scaffolding around a frontier LLM with a specific tool palette.
The research questions the agent paradigm raises — long-horizon planning, error recovery, credit assignment over thousand-step trajectories — are now central to NLP. The field has shifted from "what does the model say next?" to "what does the model do over the next hour?"
The most striking demonstration that LLMs handle structured language is in code. GitHub Copilot (2021), powered by a transformer fine-tuned on public GitHub code, was the first widely deployed AI programming assistant. Codex (Chen et al., OpenAI 2021) was its underlying model.
By 2026, frontier general-purpose models (Claude 3.5/3.7 Sonnet, GPT-4o/o3, Gemini 2.5) outperform specialist code models on most benchmarks. The HumanEval benchmark — pass@1 on Python function completions — has gone from ~30% (Codex 2021) to >90% (frontier models 2024–2025). SWE-bench, which evaluates whole-repository pull-request resolution, has gone from 1.96% (Claude 2 with retrieval, late 2023) to over 70% (Claude Sonnet 4.5, late 2025).
What this means for computational linguistics specifically is that the same architecture that learns French and English is learning Python and Rust. The sub-field of programming-language NLP — the AST parsers, the type-inference systems, the syntax-aware models — has been substantially absorbed by general-purpose LLMs.
The remaining gap is reasoning at the architectural and design level. Models can write working functions; they are still partial substitutes for the engineering judgement that decides which functions to write and how the system should be put together.
A trained large language model is a strange entity, considered linguistically. It has no body, no perceptual grounding, no speech community. It has read more text than any human ever has — perhaps 10–100 trillion tokens — and produced text that humans consistently judge as fluent and often as correct.
For linguistics this raises several questions of foundational interest. Is grammaticality learnable from form alone? Chomsky and the poverty-of-stimulus tradition said no; the LLM result is a serious empirical challenge. What does it know? Probing shows representations of syntax, semantics, factual world knowledge. Does it know what it knows?
The "octopus paper" (Bender and Koller, 2020) argued that a model trained only on form cannot learn meaning, because meaning is grounded in the world. The argument has been variously contested and refined. Recent work on emergent world models (Othello-GPT, Li et al. 2023) shows linear probes can recover board state from a model trained only on game transcripts — suggesting that some grounding emerges from form alone.
The LLM is not a model of how humans learn language. Children are exposed to ~10–50 million words; LLMs to ~10 trillion. But the LLM is now a real linguistic object — something that produces and processes language, that the field has to reckon with as data, not just as tool.
It is perhaps the most consequential thing computational linguistics has produced.
NLP advances on benchmarks. Each generation builds a benchmark, reports SOTA, watches the field saturate it within a year or two, then builds a harder benchmark.
WSJ part-of-speech tagging (1990s) — saturated by 2005 at ~97%.
Penn Treebank parsing (1990s–2000s) — saturated late 2010s.
SQuAD (Stanford Question Answering Dataset, 2016). Saturated by 2018 with BERT-class models.
GLUE and SuperGLUE (Wang et al., 2018, 2019). GLUE saturated within months; SuperGLUE within a year.
BIG-Bench (Srivastava et al., 2022, 200+ collaborators). 204 tasks, designed to be harder. Mostly saturated by 2024.
MMLU (Hendrycks et al., 2021) — 57 academic subjects, exam-style. From ~25% (random) at GPT-2 scale to ~90% (Claude 3.5 Sonnet, 2024).
HellaSwag, ARC, TruthfulQA, GSM8K, MATH, HumanEval, MBPP — all introduced as hard, all substantially saturated.
FrontierMath (Glazer et al., 2024) — Olympiad-level mathematics, designed to be unsaturable. As of late 2025, frontier reasoning models are approaching 30%.
The pattern has produced a cottage industry of benchmark design and a chronic worry: benchmarks measure what they measure, and what they measure increasingly diverges from what we want to know. The 2026 frontier of evaluation is in process-based and open-ended evaluation: asking what the model does in a complex environment, not what it scores on a multiple-choice test.
Until 2020, frontier NLP models were largely open. The Penn Treebank, mBERT, GPT-2 (released in stages out of stated safety concern, then fully), word2vec — all available to academic researchers.
GPT-3 marked the shift. OpenAI released a paper but not weights or training data. The frontier became closed.
The open response came from Meta. LLaMA (February 2023) was a 7B–65B model family, leaked then officially released; LLaMA 2 (July 2023) was openly licensed; LLaMA 3 (April 2024) and 3.1 (July 2024, including a 405B model) closed much of the frontier gap. Mistral (France, 2023 onward) and Qwen (Alibaba, 2023 onward) added strong open competitors.
DeepSeek (China) has been the 2025 surprise. DeepSeek-V3 (December 2024) and R1 (January 2025) released open-weight models with performance comparable to closed frontier models, at a fraction of reported training cost.
The 2026 picture: closed frontier (OpenAI, Anthropic, Google DeepMind, xAI) leads on the absolute frontier and on tooling. Open frontier (Meta, Mistral, Qwen, DeepSeek) is roughly 6–12 months behind on capability and dramatically ahead on customisability and on-prem deployability.
For computational linguistics as a research field, the open frontier matters enormously: it is the only LLM substrate that academic research can investigate at the parameter level. Without open weights, interpretability, mechanistic study, and many forms of safety research are impossible.
The current generation has a recognisable failure profile.
Long-horizon planning. Models that can solve a single hard problem often fail at sequencing many medium problems. This is being attacked by reasoning models and agent loops; partial progress.
Calibrated uncertainty. Frontier models hallucinate confidently. Calibration training helps marginally; deeper fixes remain open. The model's "I don't know" is unreliable.
Continual learning. Today's LLMs are static after training. There is no good architecture for adding new factual knowledge without retraining. RAG (retrieval-augmented generation) papers over the gap; it does not close it.
Genuine novelty. Models interpolate brilliantly within their training distribution. Producing genuinely novel mathematical conjectures, novel scientific hypotheses, novel literary forms — the evidence is thin. This may change with reasoning models; it has not yet.
Robustness to adversarial input. Prompt injection attacks, jailbreaks, and adversarial examples remain effective. Defence is slowly improving; the offence-defence asymmetry favours the offence.
True grounding. Despite multimodal training, today's models do not know what red looks like in the way a sighted child does, do not know what gravity feels like, do not know what hunger is. The grounding gap is partial, possibly closeable, possibly not.
None of these is necessarily permanent. All are real now.
Founders. Claude Shannon (information theory). Warren Weaver (machine translation). Noam Chomsky (formal language theory). Joseph Weizenbaum (ELIZA, 1966). Terry Winograd (SHRDLU, 1972).
The statistical generation. Frederick Jelinek (IBM speech, statistical MT). Mitch Marcus (Penn Treebank). Ken Church (statistical NLP foundations). Eugene Charniak (statistical parsing). Lillian Lee. Christopher Manning (Stanford). Dan Jurafsky (Stanford, co-author of the textbook). Yoshua Bengio (neural language models).
The deep-learning generation. Geoffrey Hinton (deep nets). Yann LeCun. Tomas Mikolov (word2vec). Ilya Sutskever (seq2seq, GPT). Quoc Le. Dzmitry Bahdanau, Kyunghyun Cho (attention). Ashish Vaswani et al. (transformer authors). Jacob Devlin (BERT). Alec Radford (GPT). Dario Amodei (OpenAI then Anthropic). Demis Hassabis (DeepMind).
Linguists in the loop. Emily Bender (the Bender Rule, the Stochastic Parrots paper, the octopus paper). Joyce Chai. Yejin Choi (commonsense reasoning). Mona Diab. Rachael Tatman. Christopher Potts. Kyle Mahowald.
Interpretability. Chris Olah (Distill, then Anthropic). Catherine Olsson. Jacob Steinhardt. Neel Nanda.
The field is unusually small for its impact. Most of the people listed know each other; many co-author across institutions; the major labs are perhaps 3,000 researchers globally. The work that defines the 2020s was done by a network you could fit in a stadium.
↑ Crash Course Linguistics — what computational linguistics actually does
Watch · What is NLP (Natural Language Processing)?
Watch · Transformers Step-by-Step (Attention Is All You Need)
Vaswani et al., Attention Is All You Need (2017). Brown et al., Language Models are Few-Shot Learners (2020). Jurafsky and Martin's Speech and Language Processing 3rd edition (free online drafts) for the textbook treatment.
Three trajectories visible from 2026.
Continued capability scaling. Frontier labs project another 1–2 orders of magnitude of compute through 2028. Whether capabilities continue to scale smoothly, plateau, or develop new emergence is the largest open empirical question in computer science.
The interpretability race. Sparse autoencoders, mechanistic interpretability, automated circuit discovery — the field is finally producing tools to understand its models. Whether interpretability scales to frontier models faster than capabilities do is partly a research question and partly a policy one.
The integration of NLP into everything else. The boundary between "natural language processing" and "computing" is dissolving. Code, data analysis, scientific writing, programming environments, search — all are being reorganised around language-model substrates. The field that began with Shannon's bigram counts is now the substrate for most of digital infrastructure.
The traditional research questions of computational linguistics — how to parse, how to translate, how to summarise, how to generate — are mostly engineering problems now. The new questions — what does the model understand, how do we steer it, what should it do — are partly empirical, partly philosophical, and partly political.
The field is in an extraordinary moment. The next ten years will decide what it has produced.
Computational Linguistics — Volume III, Deck 8 of The Deck Catalog. Set in IBM Plex on a terminal-dark grid; mint accent and magenta highlights; monospaced metadata throughout.
Twenty-eight chapters from Shannon's bigrams to GPT's terabytes. The strangest thing about the field is that the question it began with — what is language, computationally — has been overtaken by the artefact it produced.
↑ Vol. III · Lang. · Deck 8