CU + ARL Seminar Series on Human-Guided Machine Learning

Mar 04 2024


About Pavel Izmailov, New York University

Pavel is currently a Research Scientist at OpenAI, working on reasoning in language models. He previously worked on the superalignment team under Jeff Wu, Jan Leike, and Ilya Sutskever. Starting in Fall 2024, he will join NYU as an Assistant Professor in the Tandon CSE department and, by courtesy, the Courant CS department. He is also a member of the NYU CILVR Group. Pavel is broadly excited about reasoning, out-of-distribution generalization, interpretability, and probabilistic deep learning. He defended his PhD in Computer Science at NYU in 2023, advised by Andrew Gordon Wilson, after transferring from Cornell University, where he studied Operations Research and Information Engineering (2017–2019) and earned an MSc. He holds a BSc in applied mathematics and computer science from Lomonosov Moscow State University, where he worked with the Bayesian Methods Research Group under the supervision of Dmitry Vetrov and Dmitry Kropotov. His research internships include Amazon AWS (2019) with Bernie Wang and Alex Smola, Google AI (2020) with Matt Hoffman, Google (June 2021–February 2022) with Alex Alemi and Ben Poole, and Google Brain (summer 2022) with Lucas Beyer and Simon Kornblith.

Seminar Abstract

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
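The auxiliary confidence loss mentioned in the abstract can be sketched concretely. A minimal, illustrative version (not the talk's exact implementation): mix the cross-entropy against the weak supervisor's labels with a cross-entropy against the strong model's own hardened (argmax) predictions, so the strong model is encouraged to stay confident in its own beliefs rather than imitate the weak supervisor's mistakes. The function names, the hardening scheme, and the mixing weight `alpha` below are assumptions for illustration.

```python
import math

def cross_entropy(target, pred):
    # CE between a target distribution and a predicted distribution.
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def confidence_aux_loss(strong_probs, weak_probs, alpha=0.5):
    """Illustrative weak-to-strong training loss:
    (1 - alpha) * CE(weak labels, strong prediction)
        + alpha * CE(hardened strong prediction, strong prediction).
    The hardened term reinforces the strong model's own most-confident
    answer; alpha trades off imitating the weak supervisor vs. self-trust.
    """
    # Harden the strong model's prediction into a one-hot distribution.
    k = max(range(len(strong_probs)), key=lambda i: strong_probs[i])
    hardened = [1.0 if i == k else 0.0 for i in range(len(strong_probs))]
    return ((1 - alpha) * cross_entropy(weak_probs, strong_probs)
            + alpha * cross_entropy(hardened, strong_probs))
```

With `alpha=0` this reduces to plain finetuning on weak labels; increasing `alpha` pulls the strong model toward its own confident predictions even where the weak supervisor disagrees.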



The Human-Guided Machine Learning seminar series is part of the Columbia AI Initiative, particularly the Columbia Program for Human-Guided Machine Adaptation led by the Laboratory for Intelligent Imaging and Neural Computing (LIINC), in collaboration with the U.S. Army Research Laboratory. This research program broadly aims to merge human and machine intelligence effectively. The seminar series brings together researchers, academics, and professionals in machine learning to explore state-of-the-art advancements and challenges in incorporating human guidance into the learning process of machines. Through talks by researchers in AI/ML/robotics, human cognition, social interaction, cognitive neuroscience, and decision making, we seek to understand principles that facilitate mutual adaptation between humans and intelligent systems.

Click Here to Register for March 4
