Vol. XII · Deck 10 · The Deck Catalog

Psychometrics.

The measurement of psychological constructs. IQ, the Big Five, the Flynn effect, the bell-curve controversy, and why the MBTI is not a serious measurement instrument.


Founded · Galton, c. 1880
α target · ≥ 0.80
Pages · 26

Opening · What psychometrics is.

The discipline that asks: when we say someone "has high anxiety" or "an IQ of 120," what does that statement mean, and how would we know if we were wrong?

Psychological constructs — intelligence, personality, anxiety, depression, self-esteem — are not directly observable. We measure them through proxies: questionnaire responses, behavioural tasks, ratings by others. Psychometrics is the theory and practice of doing this rigorously.

The deck covers the founding (Galton through Binet), the development of intelligence testing, the Big Five revolution in personality assessment, the technical core (reliability, validity, factor analysis, IRT), the major controversies (the bell-curve debate, the MBTI critique, the Flynn effect), and the contemporary frontier of computational and digital-footprint measurement.


Chapter I · The dark origins.

The discipline began in Victorian England with Francis Galton, who attempted to measure individual differences in mental ability through proxy measures: reaction time, head circumference, sensory acuity. Galton's anthropometric laboratory measured roughly 9,000 people for a small fee. The premise was that mental ability could be inferred from simple physical and sensory measures.

The premise was wrong. The simple sensory and reaction-time measures correlated weakly at best with what we would now call intelligence. But Galton invented the statistical tools — correlation, regression, the use of percentiles — that subsequent psychometrics depended on.

Galton was also the founder of eugenics, a project that sought to improve the heritable qualities of the human race through selective reproduction. The eugenics movement, including its later coercive expressions in the US sterilisation laws of the 1907–1939 period and Nazi German racial hygiene, was directly influenced by Galton's writings. The discipline of psychometrics has, since its founding, carried this legacy. Contemporary practitioners are aware of it.


Chapter II · The first practical test.

Alfred Binet's 1905 test was the first that worked. Its premise differed from Galton's: rather than measuring elementary processes, Binet asked children directly to perform tasks of increasing complexity — naming objects, repeating digit sequences, completing sentences, defining words. The tasks were ranked by difficulty across age groups; a child's performance was scored against age-typical performance.

Mental age and IQ

Binet's mental age concept: a 7-year-old performing at the level typical of 9-year-olds had a mental age of 9. William Stern proposed the intelligence quotient in 1912: IQ = (mental age / chronological age) × 100. The ratio IQ has since been replaced by the deviation IQ, which expresses a score's distance from the age-group mean in standard-deviation units on a scale with mean 100 and SD 15, but the basic measurement concept persists.
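A minimal sketch of the two scoring conventions, assuming the standard mean-100, SD-15 scale for the deviation IQ and invented norm values:

```python
# Ratio IQ (Stern, 1912) versus deviation IQ (the modern convention).
# Illustrative only: real tests derive the deviation score from large age-group norm tables.

def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """IQ = (mental age / chronological age) x 100."""
    return mental_age / chronological_age * 100

def deviation_iq(raw_score: float, norm_mean: float, norm_sd: float) -> float:
    """Distance from the age-group mean, rescaled to mean 100 and SD 15."""
    z = (raw_score - norm_mean) / norm_sd
    return 100 + 15 * z

print(ratio_iq(9, 7))            # ~128.6: the 7-year-old with a mental age of 9
print(deviation_iq(34, 28, 6))   # 115.0: one SD above the hypothetical age norms
```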

Binet's caution

Binet himself warned against treating intelligence test scores as fixed measures of innate ability. He saw the scores as identifying children who needed help, not as ranking children by inherent worth. Subsequent users of his test — particularly the American hereditarians of the 1910s–20s — ignored the caveat.


Chapter III · Stanford-Binet and Wechsler.

Lewis Terman at Stanford translated and adapted Binet's test for American use, producing the Stanford-Binet in 1916. The test became the dominant US intelligence measure for decades. Terman also launched the Genetic Studies of Genius longitudinal study (1921–) — 1,528 high-IQ children followed across their lives. Subjects were called "Termites." Follow-ups continued for decades, tracking surviving participants into old age.

Wechsler

David Wechsler (1896–1981), at Bellevue Hospital in New York, developed the Wechsler-Bellevue Intelligence Scale in 1939. He argued that intelligence testing should yield not a single score but profiles across multiple cognitive domains. The Wechsler scales are now the dominant clinical instruments. The current adult version, the WAIS-IV (2008), produces four index scores — Verbal Comprehension, Perceptual Reasoning, Working Memory, Processing Speed — plus a Full Scale IQ.

The CHC model

The Cattell-Horn-Carroll model is the current dominant theoretical framework for intelligence testing. It posits a hierarchy: a general factor (g), broad abilities (fluid reasoning, crystallised intelligence, processing speed, working memory, visual processing, etc.), and narrow abilities. Modern test batteries are designed to assess the broad abilities.


Chapter IV · The general factor.

Charles Spearman observed that performance on different cognitive tests was always positively correlated — people who do well on vocabulary tests also do well on spatial-rotation tests, on arithmetic, on memory tasks. The correlations are not perfect (each test has unique variance) but the positive manifold is robust. Spearman called the common factor g (general intelligence).

Spearman's two-factor theory: any cognitive test reflects g plus an s (specific) component for that particular task. The model has been refined many times since but g has held up as a reliable empirical regularity.
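A small simulation of the two-factor idea, with invented loadings: each test score is a weighted dose of a shared g plus test-specific variance, and every pairwise correlation then comes out positive.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
g = rng.normal(size=n)                                  # the shared general factor
loadings = {"vocabulary": 0.8, "matrices": 0.7, "arithmetic": 0.6, "digit_span": 0.5}

# score = (g loading) * g + specific variance, scaled to unit variance
tests = np.column_stack([w * g + np.sqrt(1 - w**2) * rng.normal(size=n)
                         for w in loadings.values()])

corr = np.corrcoef(tests, rowvar=False)
print(np.round(corr, 2))                       # the positive manifold: all off-diagonals > 0
print(round(np.linalg.eigvalsh(corr)[-1], 2))  # one dominant eigenvalue, the statistical trace of g
```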

Thurstone's challenge

L. L. Thurstone (1938) argued for primary mental abilities rather than a single g — verbal comprehension, word fluency, number, space, memory, perceptual speed, reasoning. Thurstone's analysis suggested intelligence was multidimensional and that g was an artefact of how the abilities were measured.

The contemporary CHC synthesis incorporates both: g exists as a higher-order factor; the broad abilities Thurstone identified exist as lower-order factors. Both are real; both matter.


Chapter V · IQ has been rising.

James Flynn's 1984 paper documented a massive secular gain in IQ scores across 20th-century cohorts. Average IQ scores in the developed world rose by approximately 3 points per decade — about 30 points over the 20th century. The gains were largest on tests of fluid reasoning (Raven's Progressive Matrices); smaller on tests of crystallised intelligence and vocabulary.

The implications were striking. Either people were genuinely getting smarter at an unprecedented rate, or IQ tests were measuring something other than fixed innate ability, or both.

Causes

Improved nutrition. Reduced disease burden. Compulsory schooling. Increased educational complexity. The cognitive demands of an information-rich environment. Test sophistication and a "scientific spectacles" effect (Flynn's term — modern people are more comfortable with abstract classification, which test items reward).

Reverse Flynn?

Some recent samples (Norway, the Netherlands, Finland, France) have shown stagnation or modest reversal of the Flynn gains since the 1990s. Whether this reflects real cognitive change, sampling shifts, or test-content issues is debated. Bratsberg & Rogeberg's 2018 paper (Norwegian male conscripts) was the most-cited reverse-Flynn evidence.


Chapter VI · The bell-curve controversy.

Richard Herrnstein and Charles Murray's The Bell Curve (1994) made several empirical claims, some uncontroversial in psychometrics and some highly contested.

What was uncontroversial

That IQ tests are reliable. That they have predictive validity for educational and occupational outcomes. That cognitive ability became more important to economic mobility over the second half of the 20th century. That measured average IQ varies between population groups.

What was contested

The strong claim that mean IQ differences between racial groups (specifically the historically observed roughly 1-SD gap between US Black and White samples on most IQ tests) have partly genetic causes. The empirical and methodological pushback was substantial. The 1995 American Psychological Association report ("Intelligence: Knowns and Unknowns") concluded that the available evidence did not support a genetic explanation for the Black-White gap.

What has happened since

The gap has narrowed (Dickens & Flynn 2006). Behaviour-genetic studies suggest within-group heritability of intelligence is high (~50–80% in adulthood), but within-group heritability does not license claims about the causes of between-group differences. The current consensus is more cautious than either side of the 1990s debate.


Chapter VII · Stereotype threat.

Claude Steele and Joshua Aronson's 1995 paper showed that the experience of being evaluated through the lens of a negative stereotype could itself depress performance. Black participants performed worse on a verbal test framed as a measure of intellectual ability than on the same test framed as a problem-solving exercise. White participants showed no such effect.

The phenomenon — stereotype threat — has been shown for women in math, white students in athletic tasks, older adults in memory tests, and many other groups. It is a real psychological phenomenon that affects measured performance.

Effect size

The effect has been replicated many times but with smaller effect sizes than the original studies suggested. Recent meta-analyses (Picho, Rodriguez, Finnie 2013; Flore & Wicherts 2015) find effect sizes ranging from small to moderate, with substantial publication bias in the literature.

The implication for psychometrics: test performance is not a pure measure of underlying ability; it is also affected by the social and motivational context of the testing. Real-world decisions based on test scores carry the consequences of these context effects.


Chapter VIII · Reliability.

A test is reliable if it produces consistent results. Three forms:

Test-retest reliability. The correlation between two administrations of the same test to the same people across time. For stable traits (intelligence, Big Five), test-retest r should be ≥ 0.7 across reasonable intervals. By contrast, roughly half of MBTI test-takers receive a different four-letter type on a five-week retest, which for a categorical instrument is close to noise.

Internal consistency. The degree to which the items of a test all measure the same construct. Cronbach's alpha is the standard index; McDonald's omega is technically superior in many situations and increasingly preferred in modern psychometrics (a computation sketch follows the three forms).

Inter-rater reliability. The degree to which different scorers produce consistent ratings. Measured by Cohen's kappa (categorical) or intraclass correlation (continuous). Critical for clinical interview-based instruments and behavioural observation.
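A minimal computation sketch for internal consistency, using simulated responses to a hypothetical five-item scale:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# 200 simulated respondents answering five 1-5 Likert items driven by one trait.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = np.clip(np.round(3 + trait + rng.normal(scale=0.8, size=(200, 5))), 1, 5)
print(round(cronbach_alpha(items), 2))   # lands around 0.8-0.9 for this simulated scale
```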

The reliability ceiling

A test's validity is bounded by its reliability: the observed correlation between a test and a criterion cannot exceed the square root of the product of their reliabilities. You cannot predict more accurately than your instruments are consistent. The classical bound: r_xy ≤ √(r_xx · r_yy).


Chapter IX · Validity.

A test is valid to the extent that it measures what it purports to measure. The classical taxonomy:

Content validity. Does the test cover the relevant content domain? An achievement test should sample the curriculum it claims to measure.

Criterion validity. Does the test predict the relevant outcome? Two subtypes: concurrent (relationship to a current outcome — does the depression scale correlate with clinician ratings now?) and predictive (relationship to a future outcome — do SAT scores predict college GPA?).

Construct validity. Does the test measure the underlying psychological construct it is named for? Demonstrated through multiple converging lines of evidence: convergent validity (correlation with other measures of the same construct), discriminant validity (lack of correlation with measures of different constructs), and the broader nomological network in which the construct sits.

Campbell & Fiske, 1959

The multitrait-multimethod matrix formalised the convergent/discriminant distinction. Multiple traits should be measured by multiple methods; the construct-validity case is supported when same-trait correlations across methods exceed different-trait correlations within methods.
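A toy version of that logic, with two hypothetical traits each measured by self-report and informant report; the same-trait, cross-method correlation should beat the different-trait correlations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
trait_a, trait_b = rng.normal(size=(2, n))

measures = {
    "A_self":      trait_a + rng.normal(scale=0.6, size=n),
    "A_informant": trait_a + rng.normal(scale=0.8, size=n),
    "B_self":      trait_b + rng.normal(scale=0.6, size=n),
    "B_informant": trait_b + rng.normal(scale=0.8, size=n),
}

def r(x: str, y: str) -> float:
    return np.corrcoef(measures[x], measures[y])[0, 1]

print("convergent   A_self x A_informant:", round(r("A_self", "A_informant"), 2))  # high
print("discriminant A_self x B_self:     ", round(r("A_self", "B_self"), 2))       # near zero
```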


Chapter X · Factor analysis.

The statistical workhorse of psychometrics. Factor analysis identifies latent dimensions that explain patterns of correlation among observed variables. If many test items correlate strongly with each other but weakly with another set of items, factor analysis can identify the underlying factors.

Spearman's 1904 g paper was the first factor-analytic study. Cattell's 16PF, Eysenck's PEN model, the Big Five, the structure of psychopathology (Krueger's HiTOP), the structure of cognitive abilities (CHC) — all are factor-analytic results.

The fit problem

Factor analysis does not yield a unique solution. Multiple rotations of the same data produce different factor structures. The choice between solutions involves both statistical fit (eigenvalues, scree plots, parallel analysis, fit indices like CFI, RMSEA, SRMR) and substantive interpretability. Different fields have settled on different conventional structures, and reasonable people can disagree about whether 3, 5, or 6 factors best capture personality variation.
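One of the retention heuristics mentioned above, Horn's parallel analysis, fits in a few lines. This is a minimal sketch (mean-eigenvalue criterion) rather than a production implementation:

```python
import numpy as np

def parallel_analysis(data: np.ndarray, n_sims: int = 200, seed: int = 0) -> int:
    """Retain factors whose observed eigenvalues exceed the mean eigenvalues
    obtained from random data of the same shape (Horn's criterion)."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs_eigs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    sim_eigs = np.zeros((n_sims, p))
    for i in range(n_sims):
        noise = rng.normal(size=(n, p))
        sim_eigs[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    return int((obs_eigs > sim_eigs.mean(axis=0)).sum())

# Hypothetical usage, with `items` an (n_people, n_items) response matrix:
# print(parallel_analysis(items))
```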

The 5-factor model has substantial support across languages and methods, but the case is empirical and partial, not deductive.


Chapter XI · Item Response Theory.

Classical Test Theory (the framework underlying Cronbach's alpha and traditional test scoring) treats a test score as the sum of true score plus error. Item Response Theory (IRT) models the probability that a person of a given ability will answer a particular item correctly, as a function of item-level parameters (difficulty, discrimination, guessing).

The most common IRT models: 1PL (Rasch) — items vary only in difficulty; 2PL — items vary in difficulty and discrimination; 3PL — adds a guessing parameter. For polytomous items (Likert ratings), graded response and partial credit models extend the framework.
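The three binary models share one response function; a sketch with invented item parameters:

```python
import numpy as np

def p_correct(theta: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """3PL item response function: P(correct | ability theta).
    a = discrimination, b = difficulty, c = guessing floor.
    Set c = 0 for the 2PL; additionally fix a = 1 for the Rasch/1PL model."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# A hard, discriminating multiple-choice item (a = 2.0, b = 1.5, c = 0.25):
for theta in (-1, 0, 1, 2):
    print(theta, round(p_correct(theta, a=2.0, b=1.5, c=0.25), 2))
```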

Computer-adaptive testing

IRT enables tests that adapt to the test-taker. The GRE, GMAT, and many state-level standardised tests are now computer-adaptive: each item is selected based on the test-taker's performance on previous items. The result: more measurement information per minute of testing time, particularly at the extremes of the ability distribution where fixed-form tests have less precision.
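The selection step can be sketched directly: under a 2PL model, pick the unused item with the greatest Fisher information at the current ability estimate. The item bank below is invented.

```python
import numpy as np

bank = {"item_1": (1.2, -1.0), "item_2": (0.8, 0.0),   # (discrimination a, difficulty b)
        "item_3": (1.5, 0.5),  "item_4": (2.0, 1.2)}

def information(theta: float, a: float, b: float) -> float:
    p = 1 / (1 + np.exp(-a * (theta - b)))   # 2PL response probability
    return a ** 2 * p * (1 - p)              # Fisher information at theta

def next_item(theta: float, administered: set) -> str:
    remaining = {name: information(theta, *params)
                 for name, params in bank.items() if name not in administered}
    return max(remaining, key=remaining.get)

print(next_item(theta=0.8, administered={"item_1"}))   # the most informative remaining item
```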

Differential Item Functioning

IRT also provides the framework for testing whether items function differently across demographic groups. DIF analysis identifies items that produce different probability of correct response for different groups at the same underlying ability level — a major fairness check for high-stakes tests.
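One common screen is logistic-regression DIF (in the style of Swaminathan & Rogers): model item correctness on total score, group, and their interaction, and flag items where the group terms matter at matched ability. A stripped-down sketch with hypothetical arrays; real screens compare nested models with likelihood-ratio tests.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dif_coefficients(item_correct: np.ndarray, total_score: np.ndarray,
                     group: np.ndarray) -> dict:
    """Coefficients for [total score, group, score x group] predicting one item.
    A noticeable group effect (uniform DIF) or interaction (non-uniform DIF)
    at the same underlying ability is the warning sign."""
    X = np.column_stack([total_score, group, total_score * group])
    model = LogisticRegression().fit(X, item_correct)
    return dict(zip(["score", "group", "score_x_group"], model.coef_[0]))
```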


Chapter XII · The MMPI.

The Minnesota Multiphasic Personality Inventory is the most-used clinical personality assessment. Developed in 1943 at the University of Minnesota by Starke Hathaway (psychologist) and J. Charnley McKinley (psychiatrist), originally to support psychiatric diagnosis.

The original MMPI used empirical keying: items were not selected for face validity but for their ability to differentiate clinical groups (depressed patients vs controls, schizophrenia patients vs controls, etc.) regardless of whether the items obviously related to the disorder. The approach minimised respondent strategising — you couldn't easily fake a depression score if the depression-keyed items included "I sometimes enjoy reading."

The MMPI-2 (1989; 567 items) produces 10 clinical scales (Depression, Hysteria, Hypochondriasis, Psychopathic Deviate, Masculinity-Femininity, Paranoia, Psychasthenia, Schizophrenia, Hypomania, Social Introversion) plus validity scales designed to detect inconsistent or strategic responding (L, F, K, VRIN, TRIN).

The MMPI-3 (2020) modernises the norms (a 1,620-person normative sample reflecting current US demographics) and integrates the dimensional restructured form approach. It remains the dominant US clinical personality assessment.


Chapter XIII · NEO and Big Five inventories.

The major Big Five instruments:

NEO-PI-R (240 items, Costa & McCrae 1992; revised as NEO-PI-3 in 2010). 6 facets per trait, gold-standard for research use.

BFI-2 (60 items, Soto & John 2017). Public-domain. Three facets per trait. Briefer than NEO; broader use in survey research.

IPIP-NEO (Goldberg). Public-domain alternatives to the proprietary NEO. Versions of various lengths (50, 100, 120, 300 items). Available free online.

HEXACO-PI-R (Lee & Ashton). Six factors including Honesty-Humility.

TIPI (Gosling, Rentfrow & Swann 2003) and Mini-IPIP (Donnellan et al. 2006): very short scales (10-item TIPI, 20-item Mini-IPIP) for use when time is limited. Lower reliability than longer scales, but acceptable for some research purposes.

Differences in coverage

The Big Five inventories differ in their facet structures. Different inventories operationalise the Openness facet differently — some include intellectual openness, others aesthetic sensitivity, others both. Researchers should be cautious about comparing scores across instruments.


Chapter XIV · Why the MBTI fails as measurement.

The Myers-Briggs Type Indicator's psychometric problems are by now well documented and widely acknowledged in academic psychology. Four major issues:

1. Bimodality assumption is wrong. The MBTI classifies people as I or E, S or N, T or F, J or P. Empirical distributions on the underlying dimensions are normal, not bimodal. Most people score near the middle and are classified arbitrarily.

2. Test-retest reliability is poor. About 50% of test-takers receive a different four-letter type on retest within five weeks. For a categorical measure, that is essentially noise (see the simulation sketch after this list).

3. Predictive validity is weak. MBTI types do not predict job performance, leadership, relationship satisfaction, or other outcomes better than chance. Big Five traits do.

4. Construct validity is poor. The MBTI dimensions largely re-measure four of the Big Five traits (E/I with extraversion, S/N with openness, T/F with agreeableness, J/P with conscientiousness) while omitting neuroticism entirely, and the J/P dimension in particular is not a coherent psychological construct in its own right.
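Issues 1 and 2 are two faces of the same arithmetic, as this simulation shows under assumed values: a normally distributed dimension measured with reliability of roughly 0.8, then cut at the midpoint the way the MBTI assigns a letter.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
trait = rng.normal(size=n)                        # underlying continuous dimension
test = trait + rng.normal(scale=0.5, size=n)      # session 1 (reliability ~ 0.8)
retest = trait + rng.normal(scale=0.5, size=n)    # session 2

letter_1, letter_2 = test > 0, retest > 0         # dichotomise into "E" vs "I"
flip = (letter_1 != letter_2).mean()
print(round(flip, 2))              # ~0.20: one dimension in five flips on retest
print(round((1 - flip) ** 4, 2))   # ~0.40: only about 40% keep the same four-letter type
```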

The MBTI persists for non-scientific reasons: it feels meaningful; it produces flattering descriptions; it has a robust commercial ecosystem (CPP/The Myers-Briggs Company sells the official version; many derivatives are free). The empirical case against using it for any high-stakes decision is overwhelming.


Chapter XV · The Implicit Association Test.

The IAT was introduced by Anthony Greenwald and colleagues in 1998 and developed, with Mahzarin Banaji and Brian Nosek, into a large research programme. It measures the strength of automatic associations through differences in reaction time on classification tasks. Subjects are asked to classify words and images using two response keys; the keys are paired with different concepts. If "Black" and "bad" share a key, and "White" and "good" share a key, subjects with stronger automatic White-good and Black-bad associations respond faster than when the pairings are reversed.
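Scoring rests on that latency difference. A stripped-down sketch of the idea (the published D-score algorithm adds trial-level trimming, error penalties, and block structure); the latency values below are invented.

```python
import numpy as np

def d_score(compatible_ms: np.ndarray, incompatible_ms: np.ndarray) -> float:
    """Mean latency difference between pairing conditions, scaled by the pooled SD."""
    pooled_sd = np.concatenate([compatible_ms, incompatible_ms]).std(ddof=1)
    return (incompatible_ms.mean() - compatible_ms.mean()) / pooled_sd

rng = np.random.default_rng(3)
compatible = rng.normal(750, 150, size=40)       # hypothetical latencies in milliseconds
incompatible = rng.normal(830, 170, size=40)
print(round(d_score(compatible, incompatible), 2))   # positive: faster in the "compatible" pairing
```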

The Project Implicit website has administered the IAT to millions of people since 2002. The aggregate finding: most people, regardless of self-reported attitudes, show evidence of automatic biases favouring socially dominant groups.

The measurement debate

The IAT has substantial measurement-property concerns. Test-retest reliability is poor (r ≈ 0.4–0.5 for the race IAT). The relationship between IAT scores and actual discriminatory behaviour is weak (meta-analytic r ≈ 0.1, Oswald et al. 2013; Forscher et al. 2019). The construct of "implicit attitude" is contested.

The IAT's policy use — particularly in implicit-bias training programmes — has been criticised for outrunning the empirical foundation. The phenomenon of automatic associations is real; the IAT's specific psychometric properties are weaker than its widespread use suggests.


Chapter XVI · The polygraph problem.

The polygraph is a useful case study in failed psychometrics. The instrument descends from William Marston's systolic blood-pressure deception test of the 1910s and John Larson's 1921 polygraph (Marston later co-created the Wonder Woman comic, drawing on the lie detector as inspiration for the Lasso of Truth). It became widespread in US law enforcement, intelligence, and employment screening through the 20th century.

The fundamental measurement problem: physiological arousal is not specific to deception. Anxiety, anger, fear of the test itself, illness, medication effects, and many other conditions produce the same patterns the polygraph detects. The validity of the polygraph as a lie-detector has never been demonstrated to a scientific standard.

The 2003 National Research Council report The Polygraph and Lie Detection reviewed the available evidence and concluded that the polygraph's accuracy was poor, particularly for the high-stakes screening applications (employment, security clearances) for which it was widely used. The Employee Polygraph Protection Act (1988) had already banned most private-sector employment use; the federal-government use continues despite the negative scientific verdict.

The contemporary alternatives — fMRI-based deception detection (Langleben), guilty knowledge tests — have their own substantial measurement problems.


Chapter XVII · Test ethics.

The use of psychological tests carries ethical and legal obligations that the discipline has formalised over decades. The Standards for Educational and Psychological Testing (most recent edition 2014) is the joint statement of the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education.

Core principles: tests should be used only for purposes for which their validity has been demonstrated; users should have appropriate training; test-takers should be informed of the test's purpose; results should be communicated with appropriate caveats; the use of tests in high-stakes decisions (admissions, employment, custody) carries elevated obligations for fairness and due-process review.

Legal framework

In US employment contexts, the Uniform Guidelines on Employee Selection Procedures (1978) regulate test use. Tests with disparate impact on protected groups must be shown to be job-related and consistent with business necessity. The legal-psychometric interface has produced substantial literature on test fairness, group-level equating, and the validity of selection procedures.

Most contemporary high-stakes psychological testing operates under elaborate quality-control regimes. The history of test misuse — particularly in the early 20th-century US immigration and IQ-testing era, where tests in English were administered to non-English speakers and the results used to argue against immigration — is a permanent cautionary tale.


Chapter XVIII · Test fairness across groups.

A test that is reliable and valid in one population may not be in another. The technical concept is measurement invariance: the property that a test measures the same underlying construct in the same way across groups (cultures, languages, age cohorts, genders).

Three levels of invariance: configural (same factor structure across groups); metric (same factor loadings); scalar (same intercepts). Scalar invariance is required for meaningful cross-group comparison of mean scores.

Many widely-used psychological tests do not achieve scalar invariance across cultures, which means cross-cultural mean comparisons are technically problematic even when the tests are reliable within each culture. The Big Five structure has held up reasonably well across many languages; cross-cultural mean comparisons of Big Five traits are nonetheless contested.

Differential Item Functioning

DIF analysis (above) identifies specific items that function differently across groups at the same underlying ability level. Tests used in high-stakes contexts (SAT, GRE, employment screening) routinely undergo DIF screening, and items that show meaningful DIF are removed or revised.


Chapter XIX · Measurement from digital traces.

Michal Kosinski and David Stillwell's myPersonality project (2007–2018) collected Facebook data and personality measures from millions of consenting users. The studies that followed established that digital footprints could predict personality traits with substantial accuracy: Facebook Likes (Kosinski, Stillwell, Graepel 2013), social-media language (Park et al. 2015), smartphone sensor data (Stachl et al. 2020).

Stachl et al. (2020) used six months of smartphone-sensor data (call logs, app usage, music listening, movement patterns from location data) and predicted Big Five traits at correlations approaching r = 0.4 with self-report, comparable to the accuracy of informant reports.
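The prediction pipeline itself is ordinary supervised learning. A sketch under assumed data (a matrix of behavioural features per user and self-reported trait scores), with out-of-sample correlation as the accuracy metric such papers report:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

def prediction_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """Pearson r between cross-validated predictions and observed trait scores."""
    model = RidgeCV(alphas=np.logspace(-3, 3, 13))
    y_hat = cross_val_predict(model, X, y, cv=10)
    return float(np.corrcoef(y, y_hat)[0, 1])

# Hypothetical usage: X = (n_users, n_behavioural_features), y = self-reported extraversion
# print(prediction_accuracy(X, y))
```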

The implication: stable individual differences leave detectable traces in digital behaviour. The privacy and autonomy implications are substantial. The 2018 Cambridge Analytica scandal — using Kosinski-style methods, allegedly to target political advertising — accelerated public scrutiny of psychometric inference from digital data.

What this means for measurement

The traditional self-report inventory may be supplemented or partially replaced by passive measurement from digital traces. The validity may be comparable; the implications for consent, manipulation, and surveillance are more troubling than those of self-report. The field's ethical apparatus has not yet caught up with the technical capabilities.


Chapter XX · What has held up.

Findings strongly supported by replication and continuous accumulation: the existence and predictive validity of g; the Big Five factor structure across languages; the validity of the Wechsler intelligence batteries; the heritability of intelligence and Big Five traits; the basic Flynn effect; the reliability characteristics of well-designed inventories; the value of the multitrait-multimethod approach.

Findings substantially weakened: the strong-form Bell Curve claims about between-group genetic causes; the stereotype-threat literature in its strongest form (effects are real but smaller than originally reported); the IAT's predictive validity for actual discrimination; the polygraph as a lie-detection device; the MBTI as a serious measurement instrument.

The discipline's empirical core — that psychological constructs can be measured with care, that some constructs are well-supported and some are not, that the validity of any test depends on the use to which it is put — is robust. The applied claims have been more variable.


Chapter XXI · Twenty-five works.


Chapter XXII · Watch & read.

Watch · Russell T. Warne · What is the Flynn Effect?

More on YouTube

Watch · Richard Haier · The Bell Curve controversy
Watch · 16 Personalities, the Big 5, and MBTI

Read

For an entry-level treatment: Earl Hunt's Human Intelligence (2011). For depth: Lord & Novick's Statistical Theories of Mental Test Scores (1968) — the classical-test-theory bible. For the political controversies: Stephen Jay Gould's The Mismeasure of Man (1981, revised 1996) — the major popular critique of IQ-testing practice; not without its own technical errors but the historical-critical case is still important. For practical test design: Test Theory by Suen, or de Ayala's Theory and Practice of Item Response Theory (2009). The Standards document (2014) is the working reference.


Chapter XXIII · What's next.

Three frontiers shape the discipline in the late 2020s.

Machine-learning measurement

Predictive models trained on large naturalistic data (smartphone sensors, social-media language, voice recordings, video of facial expression) are increasingly competing with self-report inventories. The technical performance is reaching parity for some constructs. The privacy and consent implications are unsolved.

Network and dimensional psychopathology

The HiTOP framework (Kotov, Krueger, and colleagues, 2017) reorganises psychopathology dimensionally rather than categorically, with empirical factor structure replacing DSM categorical boundaries. Network-analysis approaches (Borsboom and colleagues) treat disorders not as latent constructs but as networks of mutually reinforcing symptoms. Both frameworks may eventually replace classical psychometric measurement of psychopathology.

Cross-cultural standardisation

The major personality and intelligence inventories were standardised on Western samples and then exported. The reverse process — building inventories from non-Western lexical and behavioural data — is slowly underway. The result will be a more globally calibrated measurement infrastructure and, possibly, a richer typology of constructs that the Western tradition has missed.


The end of the deck.

Psychometrics — Volume XII, Deck 10 of The Deck Catalog. Set in Inter and Tiempos Text. Off-white #f6f6f4; navy ink with scientific orange and steel-blue accents. Mathematical notation in monospace.

Twenty-four leaves on the science of measuring psychological constructs. The framework is rigorous; the misuse is constant; the discipline carries its founder's eugenic legacy and works around it. Numbers are not neutral. Measurement is a moral act.

FINIS
