1. Proximal Policy Optimization (PPO) - a reinforcement learning algorithm that matches or beats state-of-the-art approaches while being much simpler to implement and tune; it is the default reinforcement learning algorithm at OpenAI.

  2. Learning from human preferences (human in the loop) - a method for inferring what humans want by asking which of two proposed behaviors is better.

  3. InstructGPT - models trained with human feedback (human in the loop) that are arguably better at following user intentions than GPT-3 while also being more truthful and less toxic.
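The core of PPO mentioned in item 1 is its clipped surrogate objective, which caps how far a policy update can move in a single step. A minimal sketch of that clipping rule (function name and the sample numbers are illustrative, not from any of the linked sources):

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate:
    L = min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the new/old policy probability ratio and A the advantage."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage and a ratio above 1 + eps, the clipped term
# caps the incentive: the objective stops growing past (1 + eps) * A.
print(ppo_clip_objective(1.5, 2.0))   # capped near 1.2 * 2.0
# With a negative advantage, taking the min keeps the pessimistic bound.
print(ppo_clip_objective(0.5, -1.0))  # about 0.8 * -1.0
```

This clipping is what makes PPO "simpler to implement and tune" than trust-region methods: it replaces a constrained optimization with a plain min over two terms.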


  1. What Is ChatGPT Doing ... and Why Does It Work? - explains next-word prediction in detail.

  2. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study - "PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions."
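The next-word prediction discussed in item 1 can be illustrated with a toy bigram model: count which word follows which, then greedily predict the most frequent continuation. This is a deliberately tiny stand-in for the same objective (function names and the corpus are made up for illustration):

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count word-pair frequencies: a toy stand-in for the
    next-word-prediction objective behind ChatGPT-style models."""
    words = text.split()
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def predict_next(model, word):
    """Greedily pick the most frequent continuation seen in training."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often here
```

Real models replace the count table with a neural network over long contexts, but the prediction loop - condition on what came before, emit the next token - is the same idea.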



Virtual assistants

  1. FlowGPT - hosts many bots and prompts.
