
Commit deeaa2d

Fixed typo
1 parent d7ee550 commit deeaa2d

2 files changed: 17 additions & 17 deletions


inference/README.md

Lines changed: 14 additions & 14 deletions
@@ -9,17 +9,17 @@ This directory contains code for OpenChatKit's inference.
 - `--temperature`: temperature for the LM. Default: `0.6`
 - `--top-k`: top-k for the LM. Default: `40`
 - `--retrieval`: augment queries with context from the retrieval index. Default `False`
-- `-g` `--gpu-vram`: GPU ID and vRAM to allocate to loading the model, separated by a `:` in the format `ID:RAM` where ID is the CUDA ID and RAM is in GiB. `gpu-id` must be present in this list to avoid errors. Accepts multiple values, for example, `-g ID_0:RAM_0 ID_1:RAM_1 ID_N:RAM_N`
+- `-g` `--gpu-vram`: GPU ID and VRAM to allocate to loading the model, separated by a `:` in the format `ID:RAM` where ID is the CUDA ID and RAM is in GiB. `gpu-id` must be present in this list to avoid errors. Accepts multiple values, for example, `-g ID_0:RAM_0 ID_1:RAM_1 ID_N:RAM_N`
 - `-r` `--cpu-ram`: CPU RAM overflow allocation for loading the model. Optional, and only used if the model does not fit onto the GPUs given.

 ## Hardware requirements for inference
-The GPT-NeoXT-Chat-Base-20B model requires at least 41GB of free vRAM. Used vRAM also goes up by ~100-200 MB per prompt.
+The GPT-NeoXT-Chat-Base-20B model requires at least 41GB of free VRAM. Used VRAM also goes up by ~100-200 MB per prompt.

 - A **minimum of 80 GB is recommended**

-- A **minimum of 48 GB in vRAM is recommended** for fast responses.
+- A **minimum of 48 GB in VRAM is recommended** for fast responses.

-If you'd like to run inference on a GPU with <48 GB vRAM, refer to this section on [running on consumer hardware](#running-on-consumer-hardware).
+If you'd like to run inference on a GPU with <48 GB VRAM, refer to this section on [running on consumer hardware](#running-on-consumer-hardware).

 By default, inference uses only CUDA Device 0.

@@ -28,19 +28,19 @@ By default, inference uses only CUDA Device 0.
 ## Running on multiple GPUs
 Add the argument

-```-g ID0:MAX_vRAM ID1:MAX_vRAM ID2:MAX_vRAM ...```
+```-g ID0:MAX_VRAM ID1:MAX_VRAM ID2:MAX_VRAM ...```

-where IDx is the CUDA ID of the device and MAX_vRAM is the amount of vRAM you'd like to allocate to the device.
+where IDx is the CUDA ID of the device and MAX_VRAM is the amount of VRAM you'd like to allocate to the device.

 For example, if you are running this on 4x 48 GB GPUs and want to distribute the model across all devices, add ```-g 0:10 1:12 2:12 3:12 4:12```. In this example, the first device gets loaded to a max of 10 GiB while the others are loaded with a max of 12 GiB.

-How it works: The model fills up the max available vRAM on the first device passed and then overflows into the next until the whole model is loaded.
+How it works: The model fills up the max available VRAM on the first device passed and then overflows into the next until the whole model is loaded.

-**IMPORTANT: This MAX_vRAM is only for loading the model. It does not account for the additional inputs that are added to the device. It is recommended to set the MAX_vRAM to be at least 1 or 2 GiB less than the max available vRAM on each device, and at least 3GiB less than the max available vRAM on the primary device (set by `gpu-id` default=0).**
+**IMPORTANT: This MAX_VRAM is only for loading the model. It does not account for the additional inputs that are added to the device. It is recommended to set the MAX_VRAM to be at least 1 or 2 GiB less than the max available VRAM on each device, and at least 3GiB less than the max available VRAM on the primary device (set by `gpu-id` default=0).**

-**Decrease MAX_vRAM if you run into CUDA OOM. This happens because each input takes up additional space on the device.**
+**Decrease MAX_VRAM if you run into CUDA OOM. This happens because each input takes up additional space on the device.**

-**NOTE: Total MAX_vRAM across all devices must be > size of the model in GB. If not, `bot.py` automatically offloads the rest of the model to RAM and disk. It will use up all available RAM. To allocate a specified amount of RAM: [refer to this section on running on consumer hardware](#running-on-consumer-hardware).**
+**NOTE: Total MAX_VRAM across all devices must be > size of the model in GB. If not, `bot.py` automatically offloads the rest of the model to RAM and disk. It will use up all available RAM. To allocate a specified amount of RAM: [refer to this section on running on consumer hardware](#running-on-consumer-hardware).**

 ## Running on specific GPUs
 If you have multiple GPUs but would only like to use a specific device(s), [use the same steps as in this section on running on multiple devices](#running-on-multiple-gpus) and only specify the devices you'd like to use.
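The `ID:RAM` pairs described in the hunk above are just per-device caps. A minimal sketch of turning them into a `{cuda_id: "NGiB"}` mapping, using a hypothetical `parse_gpu_vram` helper rather than any parsing code shown in this diff:

```python
# Hypothetical helper (not from this commit): convert "-g 0:10 1:12 2:12" style
# values into a per-device cap mapping. Devices fill in the order given and
# overflow into the next one, matching the behaviour described in the README.
def parse_gpu_vram(values):
    """["0:10", "1:12"] -> {0: "10GiB", 1: "12GiB"}"""
    max_memory = {}
    for pair in values:
        cuda_id, gib = pair.split(":")
        max_memory[int(cuda_id)] = f"{int(gib)}GiB"
    return max_memory

print(parse_gpu_vram(["0:10", "1:12", "2:12", "3:12"]))
# {0: '10GiB', 1: '12GiB', 2: '12GiB', 3: '12GiB'}
```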
@@ -53,16 +53,16 @@ Also, if needed, add the argument `--gpu-id ID` where ID is the CUDA ID of the d
 

 ## Running on consumer hardware
-If you have multiple GPUs, each <48 GB vRAM, [the steps mentioned in this section on running on multiple GPUs](#running-on-multiple-gpus) still apply, unless, any of these apply:
-- Running on just 1x GPU with <48 GB vRAM,
-- <48 GB vRAM combined across multiple GPUs
+If you have multiple GPUs, each <48 GB VRAM, [the steps mentioned in this section on running on multiple GPUs](#running-on-multiple-gpus) still apply, unless, any of these apply:
+- Running on just 1x GPU with <48 GB VRAM,
+- <48 GB VRAM combined across multiple GPUs
 - Running into Out-Of-Memory (OOM) issues

 In which case, add the flag `-r CPU_RAM` where CPU_RAM is the maximum amount of RAM you'd like to allocate to loading model. Note: This significantly reduces inference speeds.

 The model will load without specifying `-r`, however, it is not recommended because it will allocate all available RAM to the model. To limit how much RAM the model can use, add `-r`.

-If the total vRAM + CPU_RAM < the size of the model in GiB, the rest of the model will be offloaded to a folder "offload" at the root of the directory. Note: This significantly reduces inference speeds.
+If the total VRAM + CPU_RAM < the size of the model in GiB, the rest of the model will be offloaded to a folder "offload" at the root of the directory. Note: This significantly reduces inference speeds.

 - Example: `-g 0:12 -r 20` will first load up to 12 GiB of the model into the CUDA device 0, then load up to 20 GiB into RAM, and load the rest into the "offload" directory.

inference/bot.py

Lines changed: 3 additions & 3 deletions
@@ -22,13 +22,13 @@ class ChatModel:
     def __init__(self, model_name, gpu_id, max_memory):
         device = torch.device('cuda', gpu_id)  # TODO: allow sending to cpu

-        # recommended default for devices with > 40 GB vRAM
+        # recommended default for devices with > 40 GB VRAM
         # load model onto one device
         if max_memory is None:
             self._model = AutoModelForCausalLM.from_pretrained(
                 model_name, torch_dtype=torch.float16, device_map="auto")
             self._model.to(device)
-        # load the model with the given max_memory config (for devices with insufficient vRAM or multi-gpu)
+        # load the model with the given max_memory config (for devices with insufficient VRAM or multi-gpu)
         else:
             config = AutoConfig.from_pretrained(model_name)
             # load empty weights
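This hunk cuts off at the `# load empty weights` comment, so the rest of the `else` branch is not visible here. A hedged sketch of the usual Accelerate empty-weights pattern that comment points at; this is an assumption about the continuation, not code confirmed by the diff:

```python
# Sketch of the common empty-weights + dispatch pattern (assumption, not the file's
# verified continuation): build the model skeleton without allocating weights, infer
# a device map under the max_memory caps, then load the real weights accordingly.
import torch
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "togethercomputer/GPT-NeoXT-Chat-Base-20B"  # illustrative
max_memory = {0: "10GiB", 1: "12GiB"}                     # illustrative caps

config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    skeleton = AutoModelForCausalLM.from_config(config)  # meta tensors only
device_map = infer_auto_device_map(
    skeleton, max_memory=max_memory, dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map=device_map)
```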
@@ -203,7 +203,7 @@ def main():
         '-g',
         '--gpu-vram',
         action='store',
-        help='max vRAM to allocate per GPU',
+        help='max VRAM to allocate per GPU',
         nargs='+',
         required=False,
     )
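Because of `nargs='+'`, the `--gpu-vram` values arrive as a list of `ID:RAM` strings. A small, self-contained illustration of that behaviour (not the repository's own argument wiring):

```python
# Minimal illustration (not the repository's code) of how nargs='+' delivers the
# -g values: each ID:RAM pair arrives as its own string in args.gpu_vram.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-g', '--gpu-vram', action='store', nargs='+', required=False,
                    help='max VRAM to allocate per GPU')
args = parser.parse_args(['-g', '0:10', '1:12', '2:12'])
print(args.gpu_vram)  # ['0:10', '1:12', '2:12']
```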
