AI safety for technical people

The technical route into AI safety, for engineers and ML practitioners alike: the failure modes that motivate the field, the training techniques behind today's safety pipelines, and the hands-on work—interpretability, red teaming, evals—happening now.

  1. [1hr Talk] Intro to Large Language Models (YouTube, Intermediate)

    Karpathy's one-hour grounding in how the systems you'll be studying actually work.

  2. Concrete Problems in AI Safety (Academic Papers, Advanced, ~45 min read)

    The agenda that made safety a concrete engineering problem—and the failure modes that still frame it.

  3. Unsolved Problems in ML Safety (Academic Papers, Intermediate)

    The updated research agenda: robustness, monitoring, alignment, and systemic safety.

  4. Scaling Laws for Neural Language Models (Academic Papers, Advanced)

    Why capabilities keep improving predictably—the trend line safety has to reckon with.

  5. Risks from Learned Optimization (Academic Papers, Advanced, ~70 min read)

    Mesa-optimization and deceptive alignment, the core inner-alignment worry.

  6. Goal Misgeneralization (Academic Papers, Advanced)

    How a capable model can pursue the wrong goal even with a correct training signal.

  7. Deep Reinforcement Learning from Human Preferences (Academic Papers, Advanced)

    The preference-learning method RLHF is built on.

  8. Training a Helpful and Harmless Assistant with RLHF (Academic Papers, Advanced, ~2 hr read)

    The engineering of an RLHF safety pipeline, end to end.

  9. Direct Preference Optimization (DPO) (Academic Papers, Advanced)

    The simpler alternative to RLHF that reframes what preference training is doing.

  10. Constitutional AI: Harmlessness from AI Feedback (Academic Papers, Advanced)

    A current, deployed approach to scalable oversight.

  11. Weak-to-Strong Generalization (Academic Papers, Advanced)

    The core question of superalignment: can weaker supervisors align stronger models?

  12. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Academic Papers, Advanced)

    Empirical evidence that deceptive behavior can survive standard safety training.

  13. Red Teaming Language Models to Reduce Harms (Academic Papers, Advanced)

    A repeatable methodology for finding model failures.

  14. Jailbroken (Academic Papers, Advanced)

    Why safety training fails: the two failure modes behind most jailbreaks.

  15. Universal Adversarial Attacks (Academic Papers, Advanced)

    Automatically generated attack suffixes that transfer across models.

  16. TruthfulQA (Academic Papers, Advanced)

    A benchmark that shows measuring truthfulness is harder than it looks.

  17. Discovering Latent Knowledge in Language Models Without Supervision (Academic Papers, Advanced)

    An interpretability method aimed at detecting what a model 'believes'.

  18. Transformer Circuits (Websites, Advanced)

    The running research thread reverse-engineering what transformers compute.

  19. ARENA (Alignment Research Engineer Accelerator) (Courses, Advanced)

    Hands-on engineering curriculum—implement the methods instead of just reading about them.

  20. LessWrong (Websites, Intermediate)

    Where much of the technical alignment discussion happens in the open.
