GPT
- 1. Proximal Policy Optimization (PPO) - an RL algorithm that performs comparably to or better than state-of-the-art approaches while being much simpler to implement and tune; OpenAI adopted it as its default reinforcement learning algorithm.
- 2. Learning from human preferences (human in the loop) - a method that infers what humans want from feedback on which of two proposed behaviors is better.
- 3. InstructGPT - models that are arguably better than GPT-3 at following user intentions while also being more truthful and less toxic, trained with humans in the loop.
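
To make the first two items concrete, here is a minimal sketch of the two quantities usually associated with them: PPO's clipped surrogate objective and a pairwise (Bradley-Terry style) preference loss for a reward model. The function names, the `epsilon=0.2` default, and the plain-Python list inputs are assumptions made for illustration; they are not taken from the linked pages or from any particular library.

```python
import math


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    ratios: new_policy_prob / old_policy_prob for each sampled action.
    advantages: advantage estimate for each sampled action.
    epsilon: clip range (0.2 is an assumed, commonly used default).
    """
    terms = []
    for r, a in zip(ratios, advantages):
        # Keep the probability ratio inside [1 - epsilon, 1 + epsilon].
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)


def preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(reward_preferred - reward_rejected)), computed stably.

    Small when the reward model scores the human-preferred behavior higher.
    """
    diff = reward_preferred - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The clipping removes any incentive to push the probability ratio far from 1 in a single update, which is a large part of why PPO is easy to tune. The preference loss only needs to know which of two behaviors a human picked, not an absolute score.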
- 1. Sentence Embeddings