# GPT-NeoXT-Chat-Base-20B

OpenChatKit includes an instruction-tuned 20-billion-parameter language model called GPT-NeoXT-Chat-Base-20B, a 6-billion-parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, a collaboration between [Together](https://www.together.xyz/), [LAION](https://laion.ai), and [Ontocord.ai](https://ontocord.ai). Much more than a model release, this is the beginning of an open source project. We are releasing a set of tools and processes for ongoing improvement with community contributions.

In this doc, you'll find steps for:
- Training an OpenChatKit model
- Testing inference using the model
- Augmenting the model with additional context from a retrieval index

# Contents

- [Requirements](#requirements)
- [Pre-trained Weights](#pre-trained-weights)
- [Datasets](#datasets)
  * [Data Contributions](#data-contributions)
- [Pretrained Base Model](#pretrained-base-model)
- [Training and Finetuning](#training-and-finetuning)
  * [(Optional) 8bit Adam](#optional-8bit-adam)
  * [Train GPT-NeoX-Chat-Base-20B](#train-gpt-neox-chat-base-20b)
- [Converting Weights to Huggingface Format](#converting-weights-to-huggingface-format)
- [Inference](#inference)
- [Monitoring](#monitoring)
  * [Loguru](#loguru)
  * [Weights & Biases](#weights--biases)
- [Experimental: Retrieval-Augmented Models](#experimental-retrieval-augmented-models)
- [Acknowledgements](#acknowledgements)

# Requirements

Before you begin, you need to install PyTorch and other dependencies.

1. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html) from their website.

2. Install [Git LFS](https://git-lfs.com/) from their website.

3. Install the Git LFS hooks.

```shell
git lfs install
```

4. Install mamba in the `base` environment so it's available in all environments.

```shell
conda install mamba -n base -c conda-forge
```

5. Create an environment called OpenChatKit using the `environment.yml` file at the root of this repo.

```shell
mamba env create -f environment.yml
```

6. Activate the new conda environment.

```shell
conda activate OpenChatKit
```
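
After activating the environment, a quick sanity check can confirm that PyTorch was installed and can see your GPU (this assumes `environment.yml` installs PyTorch, per the requirements above):

```shell
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```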

# Pre-trained Weights

GPT-NeoXT-Chat-Base-20B is a 20B-parameter variant of GPT-NeoX, fine-tuned on conversational datasets. We are releasing pre-trained weights for this model as [togethercomputer/GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on Huggingface.

More details can be found on the model card for [GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on Huggingface.

# Datasets

The chat model was trained on the [OIG](https://huggingface.co/datasets/laion/OIG) dataset built by [LAION](https://laion.ai/), [Together](https://www.together.xyz/), and [Ontocord.ai](https://www.ontocord.ai/). To download the dataset from Huggingface, run the command below from the root of the repo.

```shell
python data/OIG/prepare.py
```

Once the command completes, the data will be in the `data/OIG/files` directory.
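
To spot-check the download, you can list the directory and peek at the start of the first file (the file names vary with dataset releases, so none are hard-coded here):

```shell
ls -lh data/OIG/files | head                     # list the downloaded files
FIRST_FILE=$(ls data/OIG/files | head -n 1)
head -c 300 "data/OIG/files/$FIRST_FILE"; echo   # peek at the first record
```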

## Data Contributions

You can help make this chat model better by contributing data! See the [OpenDataHub](https://github.com/togethercomputer/OpenDataHub) repo for more details.

# Pretrained Base Model

As mentioned above, the chat model is a fine-tuned variant of GPT-NeoX-20B from Eleuther AI. To download GPT-NeoX-20B and prepare it for fine-tuning, run this command from the root of the repo.

```shell
python pretrained/GPT-NeoX-20B/prepare.py
```

The weights for this model will be in the `pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b` directory.

If you want to fine-tune other GPT-NeoX-family models, e.g. [the Pythia model suite](https://huggingface.co/models?sort=downloads&search=pythia), you can specify the Huggingface model name, for example:

```shell
python pretrained/GPT-NeoX-20B/prepare.py --model-name EleutherAI/pythia-6.9b-deduped
```

The weights for this model will then be in the `pretrained/GPT-NeoX-20B/EleutherAI_pythia-6.9b-deduped` directory.
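
You can verify the download by checking the directory's size and contents (the output will vary; the files are whatever Huggingface ships for the model):

```shell
du -sh pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b
ls pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b
```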

# Training and Finetuning

## (Optional) 8bit Adam

To use 8bit-adam during training, install the `bitsandbytes` package.

```shell
pip install bitsandbytes # optional, to use 8bit-adam
```

## Train GPT-NeoX-Chat-Base-20B

The `training/finetune_GPT-NeoXT-Chat-Base-20B.sh` script configures and runs the training loop. After downloading the dataset and the base model, run:

```shell
bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
```

The script launches 8 processes with a pipeline-parallel degree of 8 and a data-parallel degree of 1.

As the training loop runs, checkpoints are saved to the `model_ckpts` directory at the root of the repo.

Please see [the training README](training/README.md) for more details about customizing the training run.

The `training/finetune_Pythia-Chat-Base-7B.sh` script is another example, fine-tuning a 7B Pythia (GPT-NeoX-family) model. It launches 8 processes with a pipeline-parallel degree of 4 and a data-parallel degree of 2; the relationship between these numbers is sketched below.
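
In both configurations, the number of launched processes equals the pipeline-parallel degree times the data-parallel degree. A minimal sketch of that arithmetic (the variable names here are illustrative, not ones the scripts actually use):

```shell
# total processes = pipeline-parallel degree * data-parallel degree
PP_DEGREE=4   # stages the model is split across (pipeline parallelism)
DP_DEGREE=2   # model replicas fed different data shards (data parallelism)
echo "processes to launch: $((PP_DEGREE * DP_DEGREE))"   # prints 8
```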

# Converting Weights to Huggingface Format

Before you can use this model to perform inference, it must be converted to the Huggingface format. Run this command from the root of the repo to do so.

```shell
mkdir huggingface_models \
  && python tools/convert_to_hf_gptneox.py \
  --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_100 \
  --save-path huggingface_models/GPT-NeoXT-Chat-Base-20B \
  --n-stages 8 \
  --n-layer-per-stage 6 \
  --fp16
```

The `--fp16` flag loads and stores the model weights in fp16.

Make sure to replace `model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_100` with the latest checkpoint in the `model_ckpts/GPT-Neo-XT-Chat-Base-20B` directory.
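
One way to find the most recent checkpoint is a shell one-liner (assuming checkpoint directories are named `checkpoint_<step>` as above):

```shell
ls -dt model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_* | head -n 1
```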

If you need to convert checkpoints of other GPT-NeoX variants, make sure to specify the correct config name for your variant.
For example, to convert a checkpoint fine-tuned from `EleutherAI/pythia-6.9b-deduped`, pass it as the config name:

```shell
python tools/convert_to_hf_gptneox.py \
  --config-name EleutherAI/pythia-6.9b-deduped \
  --ckpt-path model_ckpts/Pythia-Chat-Base-7B/checkpoint_100 \
  --save-path huggingface_models/Pythia-Chat-Base-7B \
  --n-stages 4 \
  --n-layer-per-stage 8 \
  --fp16
```
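
After conversion, a lightweight check that the output directory is a loadable Huggingface model is to load just the config and tokenizer rather than the full weights (this assumes the conversion script writes tokenizer files alongside the weights):

```shell
python -c "from transformers import AutoConfig, AutoTokenizer; \
p = 'huggingface_models/Pythia-Chat-Base-7B'; \
print(AutoConfig.from_pretrained(p)); AutoTokenizer.from_pretrained(p)"
```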

# Inference

To help you test the model, we provide a simple command-line test harness for interacting with the bot.

```shell
python inference/bot.py
```

By default, the script loads the model named GPT-NeoXT-Chat-Base-20B from the `huggingface_models` directory, but you can override that behavior by specifying `--model`.

For example, if you want to load the base model from our Huggingface repo, you can run the following command, which downloads the weights from Huggingface.

```shell
python inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B
```
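
If you have several GPUs and want to pin the bot to a particular one, the standard CUDA environment variable works (shown as an example; see the inference README linked below for the script's own GPU options):

```shell
CUDA_VISIBLE_DEVICES=0 python inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B
```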

Once the model has loaded, enter text at the prompt and the model will reply.

```shell
$ python inference/bot.py
Loading /home/csris/src/github.com/togethercomputer/OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:1...
Welcome to OpenChatKit shell. Type /help or /? to list commands.

>>> Hello.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Hello human.

>>>
```

Commands are prefixed with a `/`, and the `/quit` command exits.

Please see [the inference README](inference/README.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.

# Monitoring

By default, the training script simply prints the loss as training proceeds, but it can also output metrics to a file using [loguru](https://github.com/Delgan/loguru) or report them to Weights & Biases.

## Loguru

Add the flag `--train-log-backend loguru` to your training script to log to `./logs/file_{time}.log`.
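
You can then follow training progress from another terminal by tailing the newest log file:

```shell
tail -f "$(ls -t logs/file_*.log | head -n 1)"
```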

## Weights & Biases

To use Weights & Biases, first log in with your Weights & Biases token.

```shell
wandb login
```

Then set `--train-log-backend wandb` in the training script to enable logging to Weights & Biases.
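
If the training script does not set a wandb project explicitly, wandb also reads standard environment variables, which can help group runs. For example (`openchatkit-finetune` is just a placeholder name):

```shell
export WANDB_PROJECT=openchatkit-finetune   # placeholder project name
bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
```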

# Experimental: Retrieval-Augmented Models

*Note: Retrieval is still experimental.*

The code in `/retrieval` implements a Python package for querying a Faiss index of Wikipedia. The following steps explain how to use this index to augment queries in the test harness with context from the retriever. A quick download sanity check follows the list.

1. Download the Wikipedia index.

```shell
python data/wikipedia-3sentence-level-retrieval-index/prepare.py
```

2. Run the bot with the `--retrieval` flag.

```shell
python inference/bot.py --retrieval
```
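
As a quick sanity check that step 1 completed, you can list what the prepare script downloaded (the exact layout under this directory is an assumption; adjust the path if the script stores files elsewhere):

```shell
ls -lh data/wikipedia-3sentence-level-retrieval-index/
```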

After starting, the bot will load both the chat model and the retrieval index, which takes a long time. Once the model and the index are loaded, all queries will be augmented with extra context.

```shell
$ python inference/bot.py --retrieval
Loading /OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:0...
Loading retrieval index...
Welcome to OpenChatKit shell. Type /help or /? to list commands.

>>> Where is Zurich?
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Where is Zurich?
Zurich is located in Switzerland.

>>>
```

# Acknowledgements

Our model is a fine-tuned version of [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b), a large language model trained by [Eleuther AI](https://www.eleuther.ai). We evaluated our model on [HELM](https://crfm.stanford.edu/helm/latest/), provided by the [Center for Research on Foundation Models](https://crfm.stanford.edu), and we collaborated with both [CRFM](https://crfm.stanford.edu) and [HazyResearch](http://hazyresearch.stanford.edu) at Stanford to build this model.

We collaborated with [LAION](https://laion.ai/) and [Ontocord.ai](https://www.ontocord.ai/) to build the training data used to fine-tune this model.