Commit aa09ce7

Merge pull request #70 from orangetin/inference-py
Update inference to support multi-gpu and offloading
2 parents 5834060 + deeaa2d commit aa09ce7

3 files changed

Lines changed: 144 additions & 10 deletions

README.md

Lines changed: 2 additions & 0 deletions
@@ -186,6 +186,8 @@ Hello human.
 
 Commands are prefixed with a `/`, and the `/quit` command exits.
 
+Please see [the inference README](inference/README.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.
+
 # Monitoring
 
 By default, the training script simply prints the loss as training proceeds, but it can also output metrics to a file using [loguru](https://github.com/Delgan/loguru) or report them to Weights & Biases.

inference/README.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# OpenChatKit Inference

This directory contains the inference code for OpenChatKit.

## Arguments

- `--gpu-id`: primary GPU device to load inputs onto for inference. Default: `0`
- `--model`: name/path of the model. Default: `../huggingface_models/GPT-NeoXT-Chat-Base-20B`
- `--max-tokens`: the maximum number of tokens to generate. Default: `128`
- `--sample`: indicates whether to sample. Default: `True`
- `--temperature`: temperature for the LM. Default: `0.6`
- `--top-k`: top-k for the LM. Default: `40`
- `--retrieval`: augment queries with context from the retrieval index. Default: `False`
- `-g`, `--gpu-vram`: GPU ID and the VRAM to allocate to it for loading the model, separated by a `:` in the format `ID:RAM`, where ID is the CUDA ID and RAM is in GiB. `--gpu-id` must be present in this list to avoid errors. Accepts multiple values, for example `-g ID_0:RAM_0 ID_1:RAM_1 ID_N:RAM_N` (see the sketch after this list).
- `-r`, `--cpu-ram`: CPU RAM overflow allocation for loading the model. Optional; only used if the model does not fit onto the GPUs given.
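For reference, `bot.py` (see the diff below) turns the `-g`/`-r` values into a `max_memory` dictionary that is later handed to Hugging Face `accelerate`. A minimal sketch of that mapping, using made-up device IDs and sizes:

```python
# Sketch of how bot.py builds max_memory from -g / -r (illustrative values only).
gpu_vram = ["0:10", "1:12"]   # as passed via: -g 0:10 1:12
cpu_ram = 20                  # as passed via: -r 20 (optional, may be None)

max_memory = {}
for spec in gpu_vram:
    cuda_id, gib = spec.split(":")
    max_memory[int(cuda_id)] = f"{gib}GiB"      # CUDA ID -> "<RAM>GiB" budget
if cpu_ram is not None:
    max_memory["cpu"] = f"{int(cpu_ram)}GiB"    # optional CPU overflow budget

print(max_memory)  # {0: '10GiB', 1: '12GiB', 'cpu': '20GiB'}
```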
## Hardware requirements for inference

The GPT-NeoXT-Chat-Base-20B model requires at least 41 GB of free VRAM. Used VRAM also goes up by ~100-200 MB per prompt.

- A **minimum of 80 GB is recommended**.
- A **minimum of 48 GB of VRAM is recommended** for fast responses.

If you'd like to run inference on a GPU with less than 48 GB of VRAM, refer to the section on [running on consumer hardware](#running-on-consumer-hardware).

By default, inference uses only CUDA device 0.

**NOTE: Inference currently requires at least one GPU.**
## Running on multiple GPUs

Add the argument

```-g ID0:MAX_VRAM ID1:MAX_VRAM ID2:MAX_VRAM ...```

where IDx is the CUDA ID of the device and MAX_VRAM is the amount of VRAM you'd like to allocate to that device.

For example, if you are running this on 4x 48 GB GPUs and want to distribute the model across all devices, add ```-g 0:10 1:12 2:12 3:12```. In this example, the first device is loaded with at most 10 GiB while each of the others is loaded with at most 12 GiB.

How it works: the model fills up the maximum available VRAM on the first device passed, then overflows into the next, until the whole model is loaded.

**IMPORTANT: MAX_VRAM only limits how much of the model is loaded onto each device. It does not account for the additional inputs that are placed on the device. It is recommended to set MAX_VRAM at least 1-2 GiB below the maximum available VRAM on each device, and at least 3 GiB below the maximum available VRAM on the primary device (set by `--gpu-id`, default `0`).**

**Decrease MAX_VRAM if you run into CUDA OOM errors; each input takes up additional space on the device.**

**NOTE: If the total MAX_VRAM across all devices is smaller than the size of the model in GB, `bot.py` automatically offloads the rest of the model to RAM and disk, and it will use all available RAM. To cap the amount of RAM used, [refer to the section on running on consumer hardware](#running-on-consumer-hardware).**
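Under the hood, `bot.py` (see the diff below) asks Hugging Face `accelerate` to plan a per-layer device map within these budgets, which is where the fill-then-overflow behaviour comes from. A minimal sketch, where the model name and VRAM figures are placeholders:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder model name and budgets; bot.py builds max_memory from the -g flag.
model_name = "togethercomputer/GPT-NeoXT-Chat-Base-20B"
max_memory = {0: "10GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB"}

config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)  # no weights allocated yet
empty_model.tie_weights()

# Plan an assignment that fills device 0 up to its budget, then overflows to 1, 2, ...
device_map = infer_auto_device_map(
    empty_model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer"],  # keep each transformer block on one device
    dtype="float16",
)
print(device_map)  # e.g. {'gpt_neox.embed_in': 0, 'gpt_neox.layers.0': 0, ...}
```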
## Running on specific GPUs
46+
If you have multiple GPUs but would only like to use a specific device(s), [use the same steps as in this section on running on multiple devices](#running-on-multiple-gpus) and only specify the devices you'd like to use.
47+
48+
Also, if needed, add the argument `--gpu-id ID` where ID is the CUDA ID of the device you'd like to make the primary device. NOTE: The device specified in `--gpu-id` must be present as one of the ID in the argument `-g` to avoid errors.
49+
50+
- **Example #1**: to run inference on devices 2 and 5 with a max of 25 GiB on each, and make device 5 the primary device, add: `--gpu-id 5 -g 2:25 5:25`. In this example, not adding `--gpu-id 5` will give you an error.
51+
- **Example #2**: to run inference on devices 0 and 3 with a max of 10GiB on 0 and 40GiB on 3, with device 0 as the primary device, add: `-g 0:10 3:40`. In this example, `--gpu-id` is not required because device 0 is specified in `-g`.
52+
- **Example #3**: to run inference only on device 1 with a max of 75 GiB, add: `--gpu-id 1 -g 1:75`
53+
54+
55+
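In terms of the `max_memory` dictionary sketched earlier, Example #1 corresponds to (illustrative only):

```python
# --gpu-id 5 -g 2:25 5:25  ->  the per-device budget bot.py builds
max_memory = {2: "25GiB", 5: "25GiB"}  # device 0 is absent, so --gpu-id 5 is required
```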
## Running on consumer hardware

If you have multiple GPUs, each with less than 48 GB of VRAM, [the steps in the section on running on multiple GPUs](#running-on-multiple-gpus) still apply, unless any of the following is true:

- you are running on just one GPU with less than 48 GB of VRAM,
- you have less than 48 GB of VRAM combined across multiple GPUs, or
- you are running into Out-Of-Memory (OOM) issues.

In any of those cases, add the flag `-r CPU_RAM`, where CPU_RAM is the maximum amount of RAM (in GiB) you'd like to allocate to loading the model. Note: this significantly reduces inference speed.

The model will load without `-r`, but this is not recommended because it will then allocate all available RAM to the model. To limit how much RAM the model can use, add `-r`.

If the total VRAM + CPU_RAM is smaller than the size of the model in GiB, the rest of the model is offloaded to a folder named "offload" at the root of the directory. Note: this also significantly reduces inference speed.

- Example: `-g 0:12 -r 20` will first load up to 12 GiB of the model onto CUDA device 0, then load up to 20 GiB into RAM, and offload the rest into the "offload" directory.
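Roughly what this corresponds to inside `bot.py` (see the diff below): the `-r` budget becomes a `'cpu'` entry in `max_memory`, and whatever still does not fit is spilled to the `offload` folder. A minimal sketch with a placeholder model name and budgets, using `device_map="auto"` for brevity where `bot.py` computes an explicit device map first:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder name and budgets; bot.py derives these from `-g 0:12 -r 20`.
model_name = "togethercomputer/GPT-NeoXT-Chat-Base-20B"
max_memory = {0: "12GiB", "cpu": "20GiB"}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",            # let accelerate plan GPU -> CPU -> disk placement
    max_memory=max_memory,
    offload_folder="offload",     # layers that fit neither in VRAM nor RAM land here
    offload_state_dict=True,
    torch_dtype=torch.float16,
)
```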
How it works:

- https://github.com/huggingface/blog/blob/main/accelerate-large-models.md
- https://www.youtube.com/embed/MWCSGj9jEAo

inference/bot.py

Lines changed: 71 additions & 10 deletions
@@ -11,17 +11,47 @@
 import argparse
 import conversation as convo
 import retrieval.wikipedia as wp
-from transformers import AutoTokenizer, AutoModelForCausalLM
+from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
+from accelerate import infer_auto_device_map, init_empty_weights
 
 
 class ChatModel:
     human_id = "<human>"
     bot_id = "<bot>"
 
-    def __init__(self, model_name, gpu_id):
-        device = torch.device('cuda', gpu_id)
-        self._model = AutoModelForCausalLM.from_pretrained(
-            model_name).half()
-        self._model.to(device)
+    def __init__(self, model_name, gpu_id, max_memory):
+        device = torch.device('cuda', gpu_id)  # TODO: allow sending to cpu
+
+        # recommended default for devices with > 40 GB VRAM:
+        # load the whole model onto one device
+        if max_memory is None:
+            self._model = AutoModelForCausalLM.from_pretrained(
+                model_name, torch_dtype=torch.float16, device_map="auto")
+            self._model.to(device)
+        # otherwise load the model with the given max_memory config
+        # (for devices with insufficient VRAM, or multi-GPU setups)
+        else:
+            config = AutoConfig.from_pretrained(model_name)
+            # instantiate the model with empty weights (no memory allocated)
+            with init_empty_weights():
+                model_from_conf = AutoModelForCausalLM.from_config(config)
+
+            model_from_conf.tie_weights()
+
+            # create a device_map from max_memory
+            device_map = infer_auto_device_map(
+                model_from_conf,
+                max_memory=max_memory,
+                no_split_module_classes=["GPTNeoXLayer"],
+                dtype="float16"
+            )
+            # load the model with the device_map computed above
+            self._model = AutoModelForCausalLM.from_pretrained(
+                model_name,
+                device_map=device_map,
+                offload_folder="offload",  # optional offload-to-disk overflow directory (auto-created)
+                offload_state_dict=True,
+                torch_dtype=torch.float16
+            )
         self._tokenizer = AutoTokenizer.from_pretrained(model_name)
 
     def do_inference(self, prompt, max_new_tokens, do_sample, temperature, top_k):
@@ -49,7 +79,7 @@ class OpenChatKitShell(cmd.Cmd):
     intro = "Welcome to OpenChatKit shell. Type /help or /? to list commands.\n"
     prompt = ">>> "
 
-    def __init__(self, gpu_id, model_name_or_path, max_tokens, sample, temperature, top_k, retrieval):
+    def __init__(self, gpu_id, model_name_or_path, max_tokens, sample, temperature, top_k, retrieval, max_memory):
         super().__init__()
         self._gpu_id = int(gpu_id)
         self._model_name_or_path = model_name_or_path
@@ -58,10 +88,11 @@ def __init__(self, gpu_id, model_name_or_path, max_tokens, sample, temperature,
         self._temperature = temperature
         self._top_k = top_k
         self._retrieval = retrieval
+        self._max_memory = max_memory
 
     def preloop(self):
         print(f"Loading {self._model_name_or_path} to cuda:{self._gpu_id}...")
-        self._model = ChatModel(self._model_name_or_path, self._gpu_id)
+        self._model = ChatModel(self._model_name_or_path, self._gpu_id, self._max_memory)
 
         if self._retrieval:
             print(f"Loading retrieval index...")
@@ -139,7 +170,7 @@ def main():
     parser.add_argument(
         '--model',
         default=f"{INFERENCE_DIR}/../huggingface_models/GPT-NeoXT-Chat-Base-20B",
-        help='the ID of the GPU to run on'
+        help='name/path of the model'
     )
     parser.add_argument(
         '--max-tokens',
@@ -168,16 +199,46 @@ def main():
         action='store_true',
         help='augment queries with context from the retrieval index'
     )
+    parser.add_argument(
+        '-g',
+        '--gpu-vram',
+        action='store',
+        help='max VRAM to allocate per GPU',
+        nargs='+',
+        required=False,
+    )
+    parser.add_argument(
+        '-r',
+        '--cpu-ram',
+        default=None,
+        type=int,
+        help='max CPU RAM to allocate',
+        required=False
+    )
     args = parser.parse_args()
 
+    # set max_memory dictionary if given
+    if args.gpu_vram is None:
+        max_memory = None
+    else:
+        max_memory = {}
+        for i in range(len(args.gpu_vram)):
+            # assign CUDA ID as label and XGiB as value
+            max_memory[int(args.gpu_vram[i].split(':')[0])] = f"{args.gpu_vram[i].split(':')[1]}GiB"
+
+        if args.cpu_ram is not None:
+            # add cpu to max-memory if given
+            max_memory['cpu'] = f"{int(args.cpu_ram)}GiB"
+
     OpenChatKitShell(
         args.gpu_id,
         args.model,
         args.max_tokens,
         args.sample,
         args.temperature,
         args.top_k,
-        args.retrieval
+        args.retrieval,
+        max_memory
     ).cmdloop()
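Putting the pieces together, the `ChatModel` interface added in this commit can be exercised roughly as follows; the import path, model path, prompt format, and memory figures are placeholders rather than values taken from the commit:

```python
# Illustrative driver for the ChatModel class added in this commit.
from bot import ChatModel  # assumes this runs from the inference/ directory

# What bot.py would build from: -g 0:10 1:12 -r 20
max_memory = {0: "10GiB", 1: "12GiB", "cpu": "20GiB"}

model = ChatModel(
    "../huggingface_models/GPT-NeoXT-Chat-Base-20B",  # the --model default path
    gpu_id=0,              # primary device; must appear in max_memory
    max_memory=max_memory,
)
# Illustrative prompt using the class's human/bot markers.
print(model.do_inference("<human>: Hello!\n<bot>:", max_new_tokens=64,
                         do_sample=True, temperature=0.6, top_k=40))
```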