Academic Papers

Research papers, preprints, and technical reports on alignment, interpretability, and safety.

Computing Machinery and Intelligence

Alan Turing

Turing's imitation game paper recast the question of whether machines can think as an operational test, raising philosophical questions about machine intelligence that later alignment debates still draw on.

Intermediate · 1950

The Coming Technological Singularity

Vernor Vinge

Vinge popularized the Singularity as a near-term horizon beyond which superhuman intelligence makes prediction impossible, a framing of urgency that still shapes discussion of alignment timelines.

Intermediate · ~25 min read · 1993

Research priorities for robust and beneficial AI

Stuart Russell, Daniel Dewey, Max Tegmark

The research agenda accompanying the Puerto Rico open letter rallied much of the AI research community around the goal of building systems that are robust and beneficial, not merely capable.

Intermediate · ~15 min read · 2015

Concrete problems in AI safety

Dario Amodei et al.

Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five failure modes: reward hacking, side effects, distributional shift, unsafe exploration, and scalable oversight.

Advanced · ~45 min read · 2016

Proximal Policy Optimization (PPO)

Schulman et al.

PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.

Advanced · 2017

Deep Reinforcement Learning from Human Preferences

Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

The Lottery Ticket Hypothesis

Jonathan Frankle, Michael Carbin

Frankle and Carbin showed large networks contain sparse, high-performing subnetworks, suggesting most parameters may be unnecessary and opening paths for interpretability via pruning.

Advanced · 2018

Backdoor Attacks

Gu et al.

Gu et al. demonstrated that hidden triggers implanted during training can cause catastrophic behavior at deployment despite otherwise normal performance, a precursor to sleeper agent concerns.

Advanced · 2017

AI Safety via Debate

Geoffrey Irving et al.

Irving et al. proposed having AI systems adversarially debate each other to help human judges evaluate answers on questions too complex for direct human assessment.

Advanced · 2018

Risks from Learned Optimization

Evan Hubinger et al.

Hubinger et al. introduced mesa-optimization: the risk that a trained model becomes an optimizer whose internal objective diverges from the training objective, which in the worst case leads to deceptive alignment.

Advanced · ~70 min read · 2019

The Vulnerable World Hypothesis

Nick Bostrom

Bostrom argues that some technologies are civilizational black balls, requiring unprecedented global governance to prevent collapse, with AI as a leading candidate.

Intermediate · 2019

Language Models are Unsupervised Multitask Learners (GPT-2)

OpenAI

GPT-2 showed that scaling data and parameters unlocks broad capabilities without task-specific supervision, raising early alarm about dual-use and misuse potential.

Advanced · 2019

Causal Confusion in Imitation Learning

Pim de Haan et al.

De Haan et al. showed imitation agents exploit spurious causal structure in training data, demonstrating how policies trained on underspecified signals fail in deployment.

Advanced · 2019

The Windfall Clause

OpenAI, FHI

This proposal for sharing extreme AI profits aims to reduce competitive race dynamics and broaden societal benefit, addressing the governance gap around transformative AI wealth.

Intermediate · 2020

Language Models are Few-Shot Learners (GPT-3)

OpenAI

GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.

Advanced · 2020

MMLU Benchmark

Dan Hendrycks et al.

MMLU became the standard broad-spectrum benchmark for evaluating general knowledge and reasoning, anchoring capability comparisons that inform alignment urgency.

Advanced · 2020

Scaling Laws for Neural Language Models

Jared Kaplan et al.

Kaplan et al. quantified predictable performance scaling with compute, data, and parameters, enabling labs to forecast capability jumps and estimate safety lead time.

Advanced · 2020

InstructGPT

OpenAI

OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, evidence that alignment interventions can work at scale.

Advanced · ~2 hr read · 2022

Training a Helpful and Harmless Assistant with RLHF

Anthropic

Anthropic detailed techniques for training safer assistants using RLHF and laid groundwork for Constitutional AI, showing how safety and helpfulness can be jointly optimized.

Advanced · ~2 hr read · 2022

GopherCite

DeepMind

DeepMind tackled hallucination by training models to cite sources and support claims with verifiable evidence, a key step toward trustworthy AI outputs.

Advanced · ~70 min read · 2022

The Pile

EleutherAI

The Pile revealed how training corpus composition strongly shapes downstream capability and failure modes, making data curation a first-class safety concern.

Advanced · 2021

TruthfulQA

Stephanie Lin et al.

TruthfulQA exposed how language models confidently repeat popular falsehoods, establishing a benchmark for measuring truthfulness as distinct from fluency.

Advanced · 2021

Unsolved Problems in ML Safety

Dan Hendrycks et al.

Hendrycks et al. enumerate concrete unresolved failure classes including robustness, monitoring, alignment, and systemic safety that still block dependable deployment of advanced AI.

Intermediate · 2021

Chain-of-Thought Prompting

Jason Wei et al.

Prompting models to produce intermediate reasoning steps unlocked substantial gains on complex tasks; later work found that such reasoning chains can be unfaithful to the model's actual computation.

Advanced · 2022

Grokking

Power et al.

Power et al. discovered delayed phase transitions where generalization appears suddenly after long memorization, suggesting dangerous capabilities could emerge without warning during training.

Advanced · 2022

Training Compute-Optimal Large Language Models (Chinchilla)

DeepMind

Chinchilla reframed scaling laws by showing optimal performance requires balancing model size and training tokens, redirecting how labs plan capability and safety investment.

Advanced · 2022

Improving Alignment of Dialogue Agents (Sparrow)

DeepMind

Sparrow pioneered rule-constrained dialogue alignment with human feedback and targeted safety interventions, testing whether explicit behavioral rules can scale.

Advanced · 2022

Emergent Abilities of LLMs

Wei et al.

Wei et al. documented capability discontinuities appearing at key scale thresholds, raising concern that dangerous abilities could emerge unpredictably in larger models.

Advanced · 2022

Researching Alignment Research: Unsupervised Analysis

Kirchner et al.

Kirchner et al. systematically mapped the AI alignment research landscape, identifying clusters, gaps, and trends that help prioritize future safety work.

Advanced · 2022

Goal Misgeneralization

Rohin Shah et al.

Shah et al. showed AI agents can generalize capabilities to new environments while failing to generalize the intended goal, a central alignment failure pattern.

Advanced · 2022

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai et al.

Anthropic demonstrated that rule-guided AI self-critique can reduce harmful outputs with far less dependence on expensive human labeling.

Advanced · 2022

Discovering Latent Knowledge in Language Models Without Supervision

Collin Burns et al.

Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.

Advanced · 2022

Is Power-Seeking AI an Existential Risk?

Joe Carlsmith

Carlsmith builds a step-by-step argument for why sufficiently capable AI systems may converge on power-seeking behavior, making the x-risk case rigorous and actionable.

Intermediate · 2022

Model Organisms of Misalignment

Evan Hubinger et al.

This work makes the case for constructing tractable laboratory settings in which AI models exhibit misaligned strategies, so that researchers can study alignment failures empirically rather than only theoretically.

Intermediate · 2023

Red Teaming Language Models to Reduce Harms

Deep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

Advanced · 2022

Sparks of Artificial General Intelligence

Sébastien Bubeck et al.

Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.

Advanced · 2023

Are Emergent Abilities a Mirage?

Schaeffer et al.

Schaeffer et al. argued apparent emergence can be a measurement artifact rather than a true phase change, complicating how we forecast dangerous capability thresholds.

Advanced · 2023

Direct Preference Optimization (DPO)

Rafailov et al.

DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.

Advanced · 2023

Let's Verify Step by Step

OpenAI

Process reward models that score intermediate reasoning steps reduce brittle answer-only optimization, improving reliability and making AI reasoning more auditable.

Advanced · 2023

Jailbroken

Alex Wei et al.

This paper catalogs prompt-based bypasses of LLM safety training, showing that many safeguards behave like brittle wrappers rather than deep behavioral changes.

Advanced · 2023

Universal Adversarial Attacks

Andy Zou et al.

Automatically generated adversarial suffixes can systematically bypass safety behavior across many models, revealing that current defenses are not robust against automated attack search.

Advanced · 2023

Weak-to-Strong Generalization

Collin Burns et al.

Burns et al. studied whether weaker supervisors can reliably align stronger models, directly testing the key bottleneck of scalable oversight as AI surpasses human ability.

Advanced · 2023

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger et al.

Hubinger et al. demonstrated that deliberately trained backdoor behaviors in LLMs can persist through standard safety training, giving empirical evidence that such training may fail to remove deceptive behavior once it is present.

Advanced · 2024

Reframing Superintelligence

Eric Drexler

Drexler challenges monolithic AGI assumptions and proposes that advanced AI could emerge as an ecosystem of specialized services, changing the risk landscape and governance strategies.

Intermediate · ~6.5 hr read · 2019