  1. Proximal Policy Optimization (PPO) - an RL algorithm that, according to OpenAI, performs comparably to or better than state-of-the-art approaches while being much simpler to implement and tune; it became the default reinforcement learning algorithm at OpenAI.
  2. Learning from human preferences (human in the loop) - a method that infers what humans want from being told which of two proposed behaviors is better.
  3. InstructGPT - models that are arguably better at following user intentions than GPT-3 while also being more truthful and less toxic, trained with humans in the loop.
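
To make the "which of two behaviors is better" idea from the preference-learning item concrete, here is a minimal sketch of one common formulation, a Bradley-Terry style comparison between two scalar reward estimates. The function names and the example reward values are assumptions made for illustration; they are not taken from the linked resources.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Probability that behavior A is preferred over behavior B.

    Bradley-Terry style model: a logistic function of the reward difference.
    """
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def preference_loss(reward_a: float, reward_b: float, human_prefers_a: bool) -> float:
    """Cross-entropy loss for a single human comparison label."""
    p_a = preference_probability(reward_a, reward_b)
    return -math.log(p_a if human_prefers_a else 1.0 - p_a)


# Hypothetical reward estimates for two proposed behaviors.
loss_agree = preference_loss(2.0, 0.0, human_prefers_a=True)
loss_disagree = preference_loss(2.0, 0.0, human_prefers_a=False)
print(loss_agree < loss_disagree)
```

The loss is small when the reward estimates agree with the human label and large when they disagree, so minimizing it over many comparisons pushes a reward model toward what people actually prefer.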


  1. What is ChatGPT doing and why does it work? - an explanation of next-word prediction in detail.
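
As a rough illustration of next-word prediction, the sketch below turns a set of scores into probabilities with a softmax and picks the most likely word. The candidate words and their scores are invented for the example; a real language model computes such scores over its entire vocabulary and usually samples from the distribution instead of always taking the top word.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


# Hypothetical scores for words that might follow "The cat sat on the".
scores = {"mat": 3.0, "sofa": 2.0, "moon": 0.1}

probs = softmax(scores)
next_word = max(probs, key=probs.get)  # greedy choice: highest probability
print(next_word)
```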



  1. Sentence Embeddings