Large language models

Key papers and explainers on large language models—how they work, what they can do, and why that matters for safety.

Deep Reinforcement Learning from Human Preferences

Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

Language Models are Unsupervised Multitask Learners (GPT-2)

OpenAI

GPT-2 showed that scaling data and parameters unlocks broad capabilities without task-specific supervision, raising early alarm about dual-use and misuse potential.

Advanced · 2019

Language Models are Few-Shot Learners (GPT-3)

OpenAI

GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.

Advanced · 2020

Scaling Laws for Neural Language Models

Jared Kaplan et al.

Kaplan et al. showed that language-model loss falls predictably as a power law in compute, data, and parameters, letting labs forecast the performance of larger models and estimate safety lead time.

Advanced · 2020

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

OpenAI

OpenAI showed that instruction tuning with RLHF can turn a raw next-token predictor into a more helpful, more controllable assistant, offering evidence that alignment interventions can work at scale.

Advanced · ~2 hr read · 2022

TruthfulQA

Stephanie Lin et al.

TruthfulQA exposed how language models confidently repeat popular falsehoods, establishing a benchmark for measuring truthfulness as distinct from fluency.

Advanced · 2021

Chain-of-Thought Prompting

Jason Wei et al.

Prompting models to produce intermediate reasoning steps yielded substantial gains on complex tasks; later work showed that such reasoning chains can be unfaithful to the model's actual computation.

Advanced · 2022

Training Compute-Optimal Large Language Models (Chinchilla)

DeepMind

Chinchilla revised earlier scaling laws by showing that, for a fixed compute budget, optimal performance requires scaling model size and training tokens together, redirecting how labs plan capability and safety investment.

Advanced · 2022

Emergent Abilities of Large Language Models

Jason Wei et al.

Wei et al. documented capability discontinuities appearing at key scale thresholds, raising concern that dangerous abilities could emerge unpredictably in larger models.

Advanced · 2022

Discovering Latent Knowledge in Language Models Without Supervision

Collin Burns et al.

Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.

Advanced · 2022

Red Teaming Language Models to Reduce Harms

Deep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

Advanced · 2022

Sparks of Artificial General Intelligence

Sébastien Bubeck et al.

Bubeck et al. documented GPT-4's broad capabilities across many domains in early experiments, sharpening debate over alignment timelines and over whether current safety evaluations are sufficient.

Advanced · 2023

Jailbroken

Alex Wei et al.

This paper catalogs prompt-based bypasses of LLM safety training, showing that many safeguards behave like brittle wrappers rather than deep behavioral changes.

Advanced · 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou et al.

Automatically generated adversarial suffixes can systematically bypass safety behavior across many models, revealing that current defenses are not robust against automated attack search.

Advanced · 2023

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger et al.

Hubinger et al. demonstrated that deliberately inserted backdoor behaviors in LLMs can persist through standard safety training, providing early empirical evidence that deceptive behavior, once present, may be hard to remove.

Advanced · 2024

ARENA (Alignment Research Engineer Accelerator)

ARENA

Hands-on technical curriculum for skilling up in AI alignment research engineering, freely available online and covering deep learning fundamentals, transformer mechanistic interpretability, reinforcement learning, and LLM evaluations.

Advanced

Co-Intelligence

Ethan Mollick

Mollick offers a practical guide for working alongside current LLMs while understanding their jagged capability frontiers and failure modes.

Beginner · 2024

The Diamond Age

Neal Stephenson

Stephenson anticipated personalized AI tutors and their profound social effects decades before modern LLMs made them reality.

Beginner · 1995

Robot & Frank

Jake Schreier

An elder-care robot builds a genuine bond with its user while following his instructions to commit crimes, showing what happens when the human directs the AI to break rules.

Beginner · 2012

Eternal You

Hans Block, Moritz Riesewieck

Startups use AI to resurrect the dead as chatbots and avatars, raising unsettling questions about consent, grief, and the consequences of deploying generative systems on the most vulnerable human moments.

Beginner · 2024

Transformer Circuits

Anthropic / community

The home of mechanistic interpretability research, publishing detailed analyses of how transformer models represent and process information internally.

Advanced

generative.ink

Essays on AI, alignment, and the philosophical implications of language models and generative systems.

Advanced

EleutherAI Blog

EleutherAI

Open-source ML research covering language model training, evaluation, and the safety considerations of making powerful models widely available.

Advanced

Why AI Is Incredibly Smart and Shockingly Stupid | Yejin Choi | TED

Yejin Choi

Choi demystifies large language models by showing where they fail at basic reasoning and common sense, and argues for smaller systems trained on human norms and values.

Beginner · 2023

[1hr Talk] Intro to Large Language Models

Andrej Karpathy

A widely praised technical primer on how LLMs work, ending with a clear tour of the security challenges—jailbreaks, prompt injection, and data poisoning—that make these systems hard to secure.

Intermediate · 2023