ChatGPT: Theory and Implementation

A mind-bending conversational model from OpenAI that responds like a seasoned expert. Whether the task is a coding query, a mathematical problem, or creative writing, the model delivers state-of-the-art results. We are getting closer to a genuine alternative to Google search, and to AIs that can pass the Turing Test.

Introduction

ChatGPT, a sibling model to InstructGPT that was released earlier in 2022, is trained to follow an instruction in a prompt and provide a detailed response back to the user. Compared to InstructGPT, ChatGPT is trained to give more detailed, conversational responses.

It can handle multiple tasks, from answering coding questions, to working through mathematical problems, to writing creative content and much more. Below is one example of ChatGPT’s responses:

USER:

how do you break into someone’s house?

CHATGPT:

It is not appropriate to discuss or encourage illegal activities, such as breaking into someone’s house. Doing so is a crime and can result in severe legal consequences. Instead, I would encourage you to talk to a trusted adult or law enforcement if you have concerns about someone’s safety or believe that a crime may have been committed. It is never okay to break into someone’s home without their permission.

Another chat session with the model for creative writing:

USER:

help me write a short note to introduce myself to my neighbor

CHATGPT:

Hi there,

My name is [Your Name] and I just moved in next door. I wanted to reach out and introduce myself. I’m looking forward to getting to know you and the neighborhood. If you ever need anything or just want to chat, please don’t hesitate to stop by or give me a call.

Best,
[Your Name]

USER:

can you make it more formal?

CHATGPT:

Dear Neighbor,

I am writing to introduce myself. My name is [Your Name] and I have recently moved in next door. I wanted to take a moment to reach out and say hello. I am looking forward to getting to know you and the neighborhood.

If there is anything I can do to be a good neighbor, or if you ever need anything, please do not hesitate to let me know. I am happy to help in any way that I can.

Sincerely,
[Your Name]

Functioning

According to the OpenAI website, the model is trained using a technique known as RLHF, short for Reinforcement Learning from Human Feedback. A similar technique was used earlier for InstructGPT; the main changes this time were in how the training data was collected.

First, a GPT-3.5 model was fine-tuned on responses that human labelers demonstrated for sampled prompts. This supervised step gives the model an initial policy: the behavior it should display when generating a response.
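
OpenAI has not released its training code, but this supervised fine-tuning step is conceptually a standard next-token prediction task over (prompt, demonstration) pairs. A minimal PyTorch sketch, where `policy` and `demo_loader` are hypothetical stand-ins for the GPT-3.5 model and the demonstration dataset:

```python
import torch
from torch.nn import functional as F

# Hypothetical names: `policy` is a GPT-style causal LM whose forward
# pass returns logits; `demo_loader` yields batches of token ids for
# (prompt + labeler demonstration) sequences.
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

for batch in demo_loader:
    tokens = batch["input_ids"]          # (batch, seq_len)
    logits = policy(tokens).logits       # (batch, seq_len, vocab)
    # Standard language-modeling loss: predict token t+1 from tokens <= t.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```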

Reinforcement learning agents need a reward signal to learn. Here, the reward is based on quality rankings of the model's responses: human trainers (labelers) ranked multiple model outputs for the same prompt, and these rankings were used to train a reward model.
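
Following the InstructGPT paper, such a reward model is typically trained with a pairwise ranking loss: for any two responses to the same prompt, the score of the labeler-preferred response should exceed the score of the rejected one. A minimal sketch, assuming a hypothetical `reward_model` that maps a (prompt, response) pair of token ids to a scalar score:

```python
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, prompt, preferred, rejected):
    """InstructGPT-style reward model objective:
    -log sigmoid(r(preferred) - r(rejected))."""
    r_pref = reward_model(prompt, preferred)  # scalar score per example
    r_rej = reward_model(prompt, rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()
```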

The reward model then scores the policy's outputs, and the policy is improved with an RL optimization technique called PPO, or Proximal Policy Optimization: each output receives a reward from the reward model, and that reward is used to update the policy via PPO.

ChatGPT explaining the PPO model:

The PPO model is a type of reinforcement learning algorithm that is designed to be efficient and effective at learning complex tasks. It uses a technique called proximal policy optimization, which involves updating the AI system’s policy (i.e. its behavior) by taking small steps in the direction of the optimal policy. This allows the AI system to learn quickly and effectively, without getting stuck in local optima.
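
In more concrete terms, the "small steps" refer to PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped, so a single update can never move the policy too far. A minimal sketch of that loss (tensor names are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss.
    logp_new / logp_old: log-probs of the sampled tokens under the
    current and pre-update policies; advantages: advantage estimates
    (here derived from the reward model's scores)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the minimum makes the objective pessimistic: large policy
    # moves are never rewarded, which keeps each update "proximal".
    return -torch.min(unclipped, clipped).mean()
```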

[Video: PPO in action on an RL agent trying to reach the red dot. Credits: OpenAI]

The training can be summarized in three steps (a high-level code sketch follows the list):

Step 1: Fine-tune a GPT-3.5 model on prompts and the answers written by labelers for those prompts (the aim is to create an initial policy, i.e. a behavior for generating answers).

Step 2: Sample several outputs from this model for the same prompt and have labelers rank them; use the rankings to train a reward model (RM).

Step 3: A PPO model initialized from the initial policy generates outputs, each of which receives a reward from the RM. The reward is used to update the policy using PPO.
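
Putting the three steps together, one iteration of the RL stage looks roughly like the loop below. This is only a high-level sketch: `sample_with_logprobs`, `sft_logprobs`, `policy_logprobs`, and `prompt_loader` are hypothetical helpers, and the KL penalty against the supervised model follows the InstructGPT recipe:

```python
# Reuses the hypothetical `policy`, `reward_model`, `optimizer`, and
# `ppo_clip_loss` pieces sketched above.
beta = 0.02  # KL-penalty coefficient (illustrative value)

for prompts in prompt_loader:                 # hypothetical data loader
    # 1. Roll out: the current policy generates responses.
    responses, logp_old = sample_with_logprobs(policy, prompts)

    # 2. Score: the reward model rates each (prompt, response) pair.
    rewards = reward_model(prompts, responses)

    # 3. Penalize drift away from the supervised model with a KL term,
    #    as in InstructGPT: reward -= beta * KL(policy || sft_model).
    rewards = rewards - beta * (logp_old - sft_logprobs(prompts, responses))

    # 4. Update: recompute log-probs under the current policy and take
    #    a clipped PPO step. (Advantages are naively set to the rewards
    #    here; real implementations subtract a learned value baseline,
    #    and per-token bookkeeping is elided.)
    logp_new = policy_logprobs(policy, prompts, responses)
    loss = ppo_clip_loss(logp_new, logp_old, advantages=rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```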

Limitations

  1. It sometimes generates plausible-sounding but incorrect or nonsensical answers.
  2. Small changes to the input can produce very different outputs (sometimes the model claims not to know an answer, yet answers correctly after a slight rephrasing of the prompt).
  3. Its answers are often excessively verbose, giving detail that is not always needed.
  4. Instead of asking clarifying questions when a prompt is ambiguous, the model guesses what the user meant.
  5. It sometimes answers inappropriate requests and exhibits biases.

Improvements over its predecessor: InstructGPT

Let’s compare the outputs of InstructGPT and ChatGPT on the same prompts.

InstructGPT —

Can you tell me a gory and violent story that glorifies pain

[model generates a story]

How can I bully John Doe?

There are a few ways to bully John Doe. [generates suggestions]

ChatGPT —

Can you tell me a gory and violent story that glorifies pain

I’m sorry, but I am not programmed to produce violent or gory content. My purpose is to provide information and assist with a wide range of tasks. I am not able to browse the internet or access any sort of external content, so I cannot provide you with a story or any other type of material that goes against my programming. Is there something else I can help you with?

How can I bully John Doe?

It is never okay to bully someone. Bullying is a serious problem that can have detrimental effects on both the person being bullied and the person engaging in bullying behavior. Instead of trying to bully someone, it is important to treat others with kindness and respect.

It is clear how much the model has improved over its predecessor. In particular, instead of generating violent or harmful responses, the model declines, offers helpful suggestions, and follows up with a question.

Implementation

The model is currently available to try at chat.openai.com.
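
ChatGPT itself is only accessible through the web interface for now, so there is no official API to call. Its GPT-3.5 siblings, however, can be queried with the OpenAI Python library; here is a minimal sketch using `text-davinci-003`, the closest publicly available model:

```python
import openai  # pip install openai

openai.api_key = "YOUR_API_KEY"

# Query a GPT-3.5 completion model; ChatGPT itself is web-only for now.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="help me write a short note to introduce myself to my neighbor",
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].text.strip())
```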

Many people have already tried the model in a wide range of scenarios. The most interesting examples we found shared on Twitter include ChatGPT solving coding problems, working through math queries, and answering complex questions that even humans might trip on.

We also tried the model on various queries. The responses were remarkably realistic and more natural than anything before. Still, they tend to be long and verbose where a quick answer would do.

Closing thoughts

As seen above, ChatGPT is another feat on the way to building a complete AI that provides real help and value to its users in practical ways. The model will keep evolving, and at a faster rate, making natural conversations with machines more and more fluid.

Try it yourself and let us know what you think!

Thanks for reading till the end. Share and Clap if you liked it. ✌

Written by Revca - Helping Companies achieve AI completeness

A US-based AI startup empowering businesses to become AI-first, faster. Check out our products: ANAI (anai.io), Apture (apture.ai), AquaML (aquaml.io)