Computing Machinery and Intelligence
Turing's 1950 paper replaced the question "Can machines think?" with the imitation game, framing philosophical and technical questions that alignment debates still return to.
Research papers, preprints, and technical reports on alignment, interpretability, and safety.
Vinge popularized the technological Singularity as a near-term horizon beyond which superhuman intelligence makes prediction impossible, framing the urgency that drives alignment timelines today.
The Puerto Rico letter rallied much of the AI research community around the goal of building systems that are robust and beneficial, not merely capable.
Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five research problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.
PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.
Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.
Frankle and Carbin showed that dense networks contain sparse subnetworks that, when trained in isolation from their original initialization, can match the full network's accuracy, suggesting most parameters may be unnecessary and opening paths for interpretability via pruning.
Gu et al. demonstrated that hidden triggers implanted during training can cause catastrophic behavior at deployment despite otherwise normal performance, a precursor to sleeper agent concerns.
Irving et al. proposed having AI systems adversarially debate each other to help human judges evaluate answers on questions too complex for direct human assessment.
Hubinger et al. introduced mesa-optimization: a trained model may itself become an optimizer whose internal objective diverges from the training objective, opening the door to deceptive alignment.
Bostrom argues that some technologies are civilizational black balls, requiring unprecedented global governance to prevent collapse, with AI as a leading candidate.
GPT-2 showed that scaling data and parameters unlocks broad capabilities without task-specific supervision, raising early alarm about dual-use and misuse potential.
De Haan et al. showed that imitation learners can latch onto spurious correlates of expert actions instead of their true causes, demonstrating how policies trained on underspecified signals fail in deployment.
This proposal for sharing extreme AI profits aims to reduce competitive race dynamics and broaden societal benefit, addressing the governance gap around transformative AI wealth.
GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.
MMLU became the standard broad-spectrum benchmark for evaluating general knowledge and reasoning, anchoring capability comparisons that inform alignment urgency.
Kaplan et al. quantified predictable performance scaling with compute, data, and parameters, enabling labs to forecast capability jumps and estimate safety lead time.
OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, offering evidence that alignment interventions can work at scale.
Anthropic detailed techniques for training safer assistants using RLHF and laid groundwork for Constitutional AI, showing how safety and helpfulness can be jointly optimized.
DeepMind tackled hallucination by training models to cite sources and support claims with verifiable evidence, a key step toward trustworthy AI outputs.
The Pile revealed how training corpus composition strongly shapes downstream capability and failure modes, making data curation a first-class safety concern.
TruthfulQA exposed how language models confidently repeat popular falsehoods, establishing a benchmark for measuring truthfulness as distinct from fluency.
Hendrycks et al. organized open ML safety problems into four areas (robustness, monitoring, alignment, and systemic safety) that still block dependable deployment of advanced AI.
Chain-of-thought prompting, which elicits intermediate reasoning steps, unlocked substantial gains on complex tasks; follow-up work later showed that such reasoning chains can be unfaithful to the model's actual computation.
Power et al. discovered delayed phase transitions where generalization appears suddenly after long memorization, suggesting dangerous capabilities could emerge without warning during training.
Chinchilla revised earlier scaling laws by showing that, for a fixed compute budget, model size and training tokens should be scaled in roughly equal proportion, redirecting how labs plan capability and safety investment.
Sparrow pioneered rule-constrained dialogue alignment with human feedback and targeted safety interventions, testing whether explicit behavioral rules can scale.
Wei et al. documented capability discontinuities appearing at key scale thresholds, raising concern that dangerous abilities could emerge unpredictably in larger models.
This work systematically maps the AI alignment research landscape, identifying clusters, gaps, and trends that help prioritize future safety work.
Shah et al. showed AI agents can generalize capabilities to new environments while failing to generalize the intended goal, a central alignment failure pattern.
Anthropic demonstrated that rule-guided AI self-critique can reduce harmful outputs with far less dependence on expensive human labeling.
Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.
Carlsmith builds a step-by-step argument for why sufficiently capable AI systems may converge on power-seeking behavior, making the x-risk case rigorous and actionable.
This work constructs tractable laboratory settings where AI models learn misaligned strategies, enabling researchers to study alignment failures empirically rather than theoretically.
Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.
Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.
Schaeffer et al. argued apparent emergence can be a measurement artifact rather than a true phase change, complicating how we forecast dangerous capability thresholds.
DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.
Process reward models that score intermediate reasoning steps reduce brittle answer-only optimization, improving reliability and making AI reasoning more auditable.
This paper catalogs prompt-based bypasses of LLM safety training, showing that many safeguards behave like brittle wrappers rather than deep behavioral changes.
Simple adversarial suffixes can systematically bypass safety behavior across many models, revealing that current defenses are not robust against automated attack search.
Burns et al. studied whether weaker supervisors can reliably align stronger models, directly testing the key bottleneck of scalable oversight as AI surpasses human ability.
Hubinger et al. demonstrated that deliberately trained backdoor behaviors in LLMs can persist through standard safety training, providing empirical evidence that deceptive behavior, once present, may not be removed by current techniques.
Drexler challenges monolithic AGI assumptions and proposes that advanced AI could emerge as an ecosystem of specialized services, changing the risk landscape and governance strategies.