    Proximal Policy Optimization (PPO) - an RL algorithm that performs comparably to or better than state-of-the-art approaches while being much simpler to implement and tune; it is the default reinforcement learning algorithm at OpenAI.
    Learning from human preferences (human in the loop) - a method that infers what humans want from being told which of two proposed behaviors is better.
    InstructGPT - models that are arguably better at following user intentions than GPT-3 while also being more truthful and less toxic, trained with humans in the loop.
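
To make the "which of two behaviors is better" idea concrete, here is a minimal sketch of how a pairwise comparison can be turned into a training signal for a reward model. It assumes a Bradley-Terry style preference model (the probability that A is preferred is the sigmoid of the reward difference); the function names and the scalar rewards are hypothetical and chosen only for illustration, not taken from any of the linked resources.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Modelled probability that behavior A is preferred over behavior B.

    Assumption: a Bradley-Terry style model, i.e. sigmoid(reward_a - reward_b).
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_a: float, reward_b: float, human_prefers_a: bool) -> float:
    """Cross-entropy loss for a single human comparison.

    The loss is small when the reward model ranks the two behaviors
    the same way the human did, and large when it disagrees.
    """
    p_a = preference_probability(reward_a, reward_b)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_chosen)


if __name__ == "__main__":
    # Hypothetical reward-model scores for two candidate behaviors.
    agree = preference_loss(reward_a=2.0, reward_b=0.0, human_prefers_a=True)
    disagree = preference_loss(reward_a=0.0, reward_b=2.0, human_prefers_a=True)
    print(f"loss when model agrees with human:    {agree:.3f}")
    print(f"loss when model disagrees with human: {disagree:.3f}")
```

Minimizing this loss over many comparisons pushes the reward model toward the human ranking; a policy can then be optimized against that learned reward, for example with PPO.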


    What is ChatGPT doing and why does it work? - explains next-word prediction in detail.



    Sentence Embeddings