On Monday, 10 July 2023, I plan to work a bit on the technologies around ChatGPT.
The technology I am talking about is Reinforcement Learning from Human Feedback (RLHF). In the simplest terms, RLHF involves a policy network and a reward network: the reward network teaches the policy network via reinforcement learning.
The concept of two networks is not new; Actor-Critic and Student-Teacher are examples. The difference in ChatGPT is that the reward network is trained from human feedback.
The simple law: it is very difficult to produce something, but very easy to evaluate it. For instance, it is hard to draw, but easy to say which paintings are more beautiful.
Applying this simple law to RLHF: we humans train the reward network to evaluate two summaries of an essay, while the policy network learns to generate the summaries. Then we expand to other question-answer pairs.
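As a minimal sketch of that idea (my own toy example, not ChatGPT's actual code): a linear reward model scores a summary from hand-made features, and is trained on human preference pairs with the Bradley-Terry style loss -log sigmoid(r(preferred) - r(rejected)). The feature vectors below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, x):
    # Linear reward model: r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """pairs: list of (preferred_features, rejected_features)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for good, bad in pairs:
            # Gradient of -log sigmoid(r(good) - r(bad)) w.r.t. w
            p = sigmoid(reward(w, good) - reward(w, bad))
            g = p - 1.0
            for i in range(dim):
                w[i] -= lr * g * (good[i] - bad[i])
    return w

# Hypothetical features for two summaries; the human prefers the first.
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_reward_model(pairs, dim=2)
```

After training, the model scores the preferred summaries higher than the rejected ones, which is all the policy network needs for its reward signal.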
The new approach, RLHF, brings at least two important breakthroughs:
- The summaries of the essays have the quality that humans expect.
- We humans do not have to generate too many labelled samples. A few instructions to the reward network are enough.
Notes on my own implementation:
- I use pure Python, so matrix operations are difficult to handle. I should use TensorFlow / PyTorch instead.
- I use genetic algorithms, which may not be strong enough. I will try policy gradient again.
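The policy-gradient direction can be sketched in pure Python as a tiny REINFORCE loop (my assumption of the setup, not the real training code): the policy is a softmax over a fixed set of candidate answers, and a frozen reward function stands in for the trained reward network.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_policy(reward_fn, n_actions=2, lr=0.5, steps=2000, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(n_actions), weights=probs)[0]
        r = reward_fn(a)
        baseline += 0.01 * (r - baseline)  # running baseline reduces variance
        adv = r - baseline
        # REINFORCE: grad of log pi(a) w.r.t. logits is onehot(a) - probs
        for i in range(n_actions):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * adv * grad
    return softmax(logits)

# Hypothetical reward: action 1 (the "better summary") always scores higher.
probs = train_policy(lambda a: 1.0 if a == 1 else 0.0)
```

The policy converges to picking the higher-reward answer almost always, which is the behaviour a genetic algorithm would have to search for much less directly.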

