Commit 5180a70

Merge pull request #78 from csris/csris/pythia-support
Add Pythia Support
2 parents a71963d + 1ed1d6e commit 5180a70

5 files changed: 136 additions & 69 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -133,6 +133,7 @@ dmypy.json
 /data/OIG/files/
 /data/wikipedia-3sentence-level-retrieval-index/files/
 /pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b/
+/pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/
 
 # ignore training output
 /model_ckpts/

README.md

Lines changed: 77 additions & 67 deletions
@@ -1,6 +1,6 @@
 # OpenChatKit
 
-OpenChatKit provides a powerful, open-source base to create both specialized and general purpose chatbots for various applications. The kit includes an instruction-tuned 20 billion parameter language model, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between [Together](https://www.together.xyz/), [LAION](https://laion.ai), and [Ontocord.ai](https://ontocord.ai). Much more than a model release, this is the beginning of an open source project. We are releasing a set of tools and processes for ongoing improvement with community contributions.
+OpenChatKit provides a powerful, open-source base to create both specialized and general purpose chatbots for various applications. The kit includes instruction-tuned language models, a moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. OpenChatKit models were trained on the OIG-43M training dataset, which was a collaboration between [Together](https://www.together.xyz/), [LAION](https://laion.ai), and [Ontocord.ai](https://ontocord.ai).
 
 In this repo, you'll find code for:
 - Training an OpenChatKit model
@@ -9,16 +9,15 @@ In this repo, you'll find code for:
 
 # Contents
 
-- [Requirements](#requirements)
-- [Pre-trained Weights](#pre-trained-weights)
-- [Datasets](#datasets)
-  * [Data Contributions](#data-contributions)
-- [Pretrained Base Model](#pretrained-base-model)
-- [Training and Finetuning](#training-and-finetuning)
+- [Getting Started](#getting-started)
+  * [Requirements](#requirements)
+  * [Chatting with Pythia-Chat-Base-7B](#chatting-with-pythia-chat-base-7b)
+- [Reproducing Pythia-Chat-Base-7B](#reproducing-pythia-chat-base-7b)
+  * [Downloading training data and the base model](#downloading-training-data-and-the-base-model)
   * [(Optional) 8bit Adam](#optional-8bit-adam)
-  * [Train GPT-NeoX-Chat-Base-20B](#train-gpt-neox-chat-base-20b)
-- [Converting Weights to Huggingface Format](#converting-weights-to-huggingface-format)
-- [Inference](#inference)
+  * [Training the model](#training-the-model)
+  * [Converting weights to Huggingface format](#converting-weights-to-huggingface-format)
+  * [Testing the new model](#testing-the-new-model)
 - [Monitoring](#monitoring)
   * [Loguru](#loguru)
   * [Weights & Biases](#weights--biases)
@@ -27,7 +26,15 @@ In this repo, you'll find code for:
 - [Citing OpenChatKit](#citing-openchatkit)
 - [Acknowledgements](#acknowledgements)
 
-# Requirements
+# Getting Started
+
+In this tutorial, you will download Pythia-Chat-Base-7B, an instruction-tuned language model, and run some inference requests against it using a command-line tool.
+
+Pythia-Chat-Base-7B is a 7B-parameter fine-tuned variant of Pythia-6.9B-deduped from Eleuther AI. Pre-trained weights for this model are available on Huggingface as [togethercomputer/Pythia-Chat-Base-7B](https://huggingface.co/togethercomputer/Pythia-Chat-Base-7B) under an Apache 2.0 license.
+
+More details can be found on the model card for [Pythia-Chat-Base-7B](https://huggingface.co/togethercomputer/Pythia-Chat-Base-7B) on Huggingface.
+
+## Requirements
 
 Before you begin, you need to install PyTorch and other dependencies.
 
@@ -49,6 +56,9 @@ conda install mamba -n base -c conda-forge
 
 5. Create an environment called OpenChatKit using the `environment.yml` file at the root of this repo.
 
+> **Note**
+> Use `mamba` to create the environment. It's **much** faster than using `conda`.
+
 ```shell
 mamba env create -f environment.yml
 ```
@@ -59,46 +69,62 @@ mamba env create -f environment.yml
 conda activate OpenChatKit
 ```
 
-# Pre-trained Weights
+## Chatting with Pythia-Chat-Base-7B
 
-GPT-NeoXT-Chat-Base-20B is a 20B-parameter variant of GPT-NeoX, fine-tuned on conversational datasets. We are releasing pre-trained weights for this model as [togethercomputer/GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on Huggingface.
+To help you try the model, [`inference/bot.py`](inference/bot.py) is a simple command-line test harness that provides a shell interface enabling you to chat with the model. Simply enter text at the prompt and the model replies. The test harness also maintains conversation history to provide the model with context.
 
-More details can be found on the model card for [GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on Huggingface.
 
-# Datasets
+Start the bot by calling `bot.py` from the root of the repo.
 
-The chat model was trained on the [OIG](https://huggingface.co/datasets/laion/OIG) dataset built by [LAION](https://laion.ai/), [Together](https://www.together.xyz/), and [Ontocord.ai](https://www.ontocord.ai/). To download the dataset from Huggingface run the command below from the root of the repo.
+```shell
+python inference/bot.py --model togethercomputer/Pythia-Chat-Base-7B
+```
+
+Loading the model can take some time, but once it's loaded, you are greeted with a prompt. Say hello.
 
 ```shell
-python data/OIG/prepare.py
+$ python inference/bot.py
+Loading /home/csris/src/github.com/togethercomputer/OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:1...
+Welcome to OpenChatKit shell. Type /help or /? to list commands.
+
+>>> Hello.
+Hello human.
+
+>>>
 ```
 
-Once the command completes, the data will be in the `data/OIG/files` directory.
+Enter additional queries at the prompt, and the model replies. Under the covers, the shell forms a prompt from all previous queries and passes it to the model to generate more text.
 
-## Data Contributions
+The shell also supports additional commands to inspect hyperparameters, the full prompt, and more. Commands are prefixed with a `/`.
 
-You can help make this chat model better by contributing data! See the [OpenDataHub](https://github.com/togethercomputer/OpenDataHub) repo for more details.
+> **Note**
+> The `/quit` command exits the shell.
 
-# Pretrained Base Model
+Please see [the inference README](inference/README.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.
 
-As mentioned above, the chat model is a fine-tuned variant of GPT-NeoX-20B from Eleuther AI. To download GPT-NeoX-20B and prepare it for fine tuning, run this command from the root of the repo.
+# Reproducing Pythia-Chat-Base-7B
 
-```shell
-python pretrained/GPT-NeoX-20B/prepare.py
-```
+This tutorial walks through reproducing the Pythia-Chat-Base-7B model by fine-tuning Eleuther AI's Pythia-6.9B-deduped model using the OIG dataset.
 
-The weights for this model will be in the `pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b`.
+## Downloading training data and the base model
 
-In case you want to fine-tune other gpt-neox models, e.g. [the Pythia model suite](https://huggingface.co/models?sort=downloads&search=pythia), you can specify the HF model name, for example:
+The chat model was trained on the [OIG](https://huggingface.co/datasets/laion/OIG) dataset built by [LAION](https://laion.ai/), [Together](https://www.together.xyz/), and [Ontocord.ai](https://www.ontocord.ai/). To download the dataset from Huggingface, run the command below from the root of the repo.
 
 ```shell
-python pretrained/GPT-NeoX-20B/prepare.py --model-name EleutherAI/pythia-6.9b-deduped
+python data/OIG/prepare.py
 ```
+> **Note**
+> You can help make this chat model better by contributing data! See the [OpenDataHub](https://github.com/togethercomputer/OpenDataHub) repo for more details.
+
+Once the command completes, the data will be in the `data/OIG/files` directory.
 
-And the weights for this model will be in the `pretrained/GPT-NeoX-20B/EleutherAI_pythia-6.9b-deduped`.
+Pythia-Chat-Base-7B is a fine-tuned variant of Pythia-6.9B-deduped from Eleuther AI. To download the model and prepare it for fine tuning, run this command from the root of the repo.
 
+```shell
+python pretrained/Pythia-6.9B-deduped/prepare.py
+```
 
-# Training and Finetuning
+The weights for this model will be in the `pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped` directory.
 
 ## (Optional) 8bit Adam
 
@@ -108,66 +134,48 @@ To use 8bit-adam during training, install the `bitsandbytes` package.
 pip install bitsandbytes # optional, to use 8bit-adam
 ```
 
-## Train GPT-NeoX-Chat-Base-20B
+## Training the model
 
-The `training/finetune_GPT-NeoXT-Chat-Base-20B.sh` script configures and runs the training loop. After downloading the dataset and the base model, run:
+The `training/finetune_Pythia-Chat-Base-7B.sh` script configures and runs the training loop. After downloading the dataset and the base model, run:
 
 ```shell
-bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
+bash training/finetune_Pythia-Chat-Base-7B.sh
 ```
 
-The script launches 8 processes with a pipeline-parallel degree of 8 and a data-parallel degree of 1.
-
 As the training loop runs, checkpoints are saved to the `model_ckpts` directory at the root of the repo.
 
 Please see [the training README](training/README.md) for more details about customizing the training run.
 
-The `training/finetune_Pythia-Chat-Base-7B.sh` script is another example to fine-tune a 7B pythia (gpt-neox) model. The script launches 8 processes with a pipeline-parallel degree of 4 and a data-parallel degree of 2.
-
-# Converting Weights to Huggingface Format
+## Converting weights to Huggingface format
 
 Before you can use this model to perform inference, it must be converted to the Huggingface format. Run this command from the root of the repo to do so.
 
 ```shell
 mkdir huggingface_models \
 && python tools/convert_to_hf_gptneox.py \
---ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_100 \
---save-path huggingface_models/GPT-NeoXT-Chat-Base-20B \
---n-stages 8 \
---n-layer-per-stage 6 \
+--config-name EleutherAI/pythia-6.9b-deduped \
+--ckpt-path model_ckpts/Pythia-Chat-Base-7B/checkpoint_100 \
+--save-path huggingface_models/Pythia-Chat-Base-7B \
+--n-stages 4 \
+--n-layer-per-stage 8 \
 --fp16
 ```
 where the `--fp16` flag will load and store models in fp16.
 
-Make sure to replace `model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_100` with the latest checkpoint in the `model_ckpts/GPT-Neo-XT-Chat-Base-20B` directory.
+Make sure to replace `model_ckpts/Pythia-Chat-Base-7B/checkpoint_100` with the latest checkpoint in the `model_ckpts/Pythia-Chat-Base-7B` directory.
 
-If you need to convert ckpts of other gpt-neox variants, make sure to specify the correct config name for your variant.
-For example, if you want to convert a checkpoint fine-tuned from `EleutherAI/pythia-6.9b-deduped`, you should indicate this as a config name:
-```shell
-python tools/convert_to_hf_gptneox.py \
---config-name EleutherAI/pythia-6.9b-deduped \
---ckpt-path model_ckpts/Pythia-Chat-Base-7B/checkpoint_100 \
---save-path huggingface_models/Pythia-Chat-Base-7B \
---n-stages 4 \
---n-layer-per-stage 8 \
---fp16
-```
+## Testing the new model
 
-
-# Inference
-
-To help you test the model, we provide a simple test command line test harness to interact with the bot.
+You can use the OpenChatKit Shell test harness to chat with the new model. From the root of the repo, run
 
 ```shell
 python inference/bot.py
 ```
 
-By default the script will load the model named GPT-NeoXT-Chat-Base-20B model under the `huggingface_models` directory, but you can override that behavior by specifying `--model`.
-
-For example, if you want to load the base model from our Huggingface, repo, you can run the following command which downloads the weights from HuggingFace.
+By default the script will load the model named Pythia-Chat-Base-7B under the `huggingface_models` directory, but you can override that behavior by specifying `--model`.
 
 ```shell
-python inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B
+python inference/bot.py --model ./huggingface_models/GPT-NeoXT-Chat-Base-20B
 ```
 
 Once the model has loaded, enter text at the prompt and the model will reply.
@@ -178,13 +186,15 @@ Loading /home/csris/src/github.com/togethercomputer/OpenChatKit/inference/../hug
 Welcome to OpenChatKit shell. Type /help or /? to list commands.
 
 >>> Hello.
-Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 Hello human.
 
 >>>
 ```
 
-Commands are prefixed with a `/`, and the `/quit` command exits.
+The shell also supports additional commands to inspect hyperparameters, the full prompt, and more. Commands are prefixed with a `/`.
+
+> **Note**
+> The `/quit` command exits the shell.
 
 Please see [the inference README](inference/README.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.
 
@@ -208,7 +218,8 @@ And set `--train-log-backend wandb` in the training script to enable logging to
 
 # Experimental: Retrieval-Augmented Models
 
-*Note: Retrieval is still experimental.*
+> **Warning**
+> Retrieval support is experimental.
 
 The code in `/retrieval` implements a python package for querying a Faiss index of Wikipedia. The following steps explain how to use this index to augment queries in the test harness with context from the retriever.
 
@@ -234,7 +245,6 @@ Loading retrieval index...
 Welcome to OpenChatKit shell. Type /help or /? to list commands.
 
 >>> Where is Zurich?
-Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 Where is Zurich?
 Zurich is located in Switzerland.
 
@@ -281,6 +291,6 @@ For full terms, see the LICENSE file. If you have any questions, comments, or co
 
 # Acknowledgements
 
-Our model is a fine-tuned version of [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b), a large language model trained by [Eleuther AI](https://www.eleuther.ai). We evaluated our model on [HELM](https://crfm.stanford.edu/helm/latest/) provided by the [Center for Research on Foundation Models](https://crfm.stanford.edu). And we collaborated with both [CRFM](https://crfm.stanford.edu) and [HazyResearch](http://hazyresearch.stanford.edu) at Stanford to build this model.
+Our models are fine-tuned versions of large language models trained by [Eleuther AI](https://www.eleuther.ai). We evaluated our models on [HELM](https://crfm.stanford.edu/helm/latest/) provided by the [Center for Research on Foundation Models](https://crfm.stanford.edu). We also collaborated with both [CRFM](https://crfm.stanford.edu) and [HazyResearch](http://hazyresearch.stanford.edu) at Stanford to build this model.
 
 We collaborated with [LAION](https://laion.ai/) and [Ontocord.ai](https://www.ontocord.ai/) to build the training data used to fine tune this model.
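
For a quick test of the released checkpoint outside the OpenChatKit shell, here is a minimal, illustrative sketch using the Huggingface `transformers` API. The model name comes from the README changes above; the `<human>:`/`<bot>:` prompt format and the generation settings are assumptions for demonstration, not something specified in this diff.

```python
# Minimal sketch: chat with the released checkpoint via transformers.
# Assumes enough GPU/CPU memory for a 7B model and the `accelerate` package for device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "togethercomputer/Pythia-Chat-Base-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# The OpenChatKit shell keeps the whole conversation in the prompt;
# a single turn is enough for a smoke test. The tag format here is an assumption.
prompt = "<human>: Hello.\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)

# Print only the newly generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```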

inference/bot.py

Lines changed: 1 addition & 1 deletion
@@ -202,7 +202,7 @@ def main():
     )
     parser.add_argument(
         '--model',
-        default=f"{INFERENCE_DIR}/../huggingface_models/GPT-NeoXT-Chat-Base-20B",
+        default=f"{INFERENCE_DIR}/../huggingface_models/Pythia-Chat-Base-7B",
         help='name/path of the model'
     )
     parser.add_argument(
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+import os
+import argparse
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
+
+DIR = os.path.dirname(os.path.abspath(__file__))
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Convert HF checkpoints')
+    parser.add_argument('--model-name', type=str, default='EleutherAI/pythia-6.9b-deduped',
+                        help='model-name')
+    parser.add_argument('--save-dir', type=str, default=DIR,
+                        help='model-name')
+    parser.add_argument('--offload-dir', type=str, default=None,
+                        help='directory to offload from memory')
+    args = parser.parse_args()
+
+    if not os.path.exists(args.save_dir):
+        os.mkdir(args.save_dir)
+    save_path = os.path.join(args.save_dir, args.model_name.replace('/', '_'))
+    if not os.path.exists(save_path):
+        os.mkdir(save_path)
+
+    print('loading model from HF...')
+    config = AutoConfig.from_pretrained(args.model_name)
+    config.save_pretrained(save_path)
+    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
+    tokenizer.save_pretrained(save_path)
+    # offload model from memory to disk if offload-dir is specified
+    if args.offload_dir is not None:
+        if not os.path.exists(args.offload_dir):
+            os.mkdir(args.offload_dir)
+        model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder=args.offload_dir)
+    else:
+        model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16)
+    print('loaded model from HF...')
+
+    print('converting the embedding layer...')
+    item = {}
+    item['embed_in.weight'] = model.gpt_neox.embed_in.weight
+    torch.save(item, os.path.join(save_path, 'pytorch_embs.pt'))
+    print('converted the embedding layer.')
+
+    for i in range(len(model.gpt_neox.layers)):
+        print(f'converting the {i}-th transformer layer...')
+        torch.save(model.gpt_neox.layers[i].state_dict(), os.path.join(save_path, f'pytorch_{i}.pt'))
+        print(f'converted the {i}-th transformer layer.')
+
+    print('converting the lm_head layer...')
+    item = {}
+    item['embed_out.weight'] = model.embed_out.weight
+    item['final_layer_norm.weight'] = model.gpt_neox.final_layer_norm.weight
+    item['final_layer_norm.bias'] = model.gpt_neox.final_layer_norm.bias
+    torch.save(item, os.path.join(save_path, 'pytorch_lm_head.pt'))
+    print('converted the lm_head layer.')
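
As a rough sanity check of the layout the new conversion script writes, the sketch below reloads the saved shards and prints a few tensor shapes. The `save_path` is an assumption based on the README and `.gitignore` changes in this commit (the script's default `--save-dir`, assuming the script lives under `pretrained/Pythia-6.9B-deduped/`).

```python
# Sketch: inspect the shards produced by the conversion script above.
# The save_path below is an assumption (default --save-dir next to the script).
import os

import torch

save_path = "pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped"

# Embedding shard: a dict holding 'embed_in.weight'.
embs = torch.load(os.path.join(save_path, "pytorch_embs.pt"), map_location="cpu")
print("embed_in.weight:", tuple(embs["embed_in.weight"].shape))

# One shard per transformer layer: that layer's full state_dict.
layer0 = torch.load(os.path.join(save_path, "pytorch_0.pt"), map_location="cpu")
print("layer 0 keys:", sorted(layer0.keys())[:4], "...")

# LM head shard: output embedding plus the final layer norm.
head = torch.load(os.path.join(save_path, "pytorch_lm_head.pt"), map_location="cpu")
print("embed_out.weight:", tuple(head["embed_out.weight"].shape))
```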

training/finetune_Pythia-Chat-Base-7B.sh

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ export MODEL_NAME=Pythia-Chat-Base-7B
 
 export SHOW_DATA=0
 
-BASE_MODEL="${DIR}/../pretrained/GPT-NeoX-20B/EleutherAI_pythia-6.9b-deduped/"
+BASE_MODEL="${DIR}/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/"
 
 CHECKPOINT_STEPS=100
 