Commit 5f5f63b

Update README.md
1 parent e0dc623 commit 5f5f63b

1 file changed: inference/README.md (5 additions & 3 deletions)
@@ -40,7 +40,7 @@ How it works: The model fills up the max available vRAM on the first device pass

  **Decrease MAX_vRAM if you run into CUDA OOM. This happens because each input takes up additional space on the device.**

- **NOTE: Total MAX_vRAM across all devices must be > the size of the model in GB. If not, you'll need to offload parts of the model to CPU: [refer to this section on running on consumer hardware](#running-on-consumer-hardware).**
+ **NOTE: Total MAX_vRAM across all devices must be > the size of the model in GB. If not, `bot.py` automatically offloads the rest of the model to RAM and disk, and it will use all available RAM. To allocate a specific amount of RAM, [refer to this section on running on consumer hardware](#running-on-consumer-hardware).**

  ## Running on specific GPUs

  If you have multiple GPUs but would like to use only specific devices, [use the same steps as in this section on running on multiple devices](#running-on-multiple-gpus) and specify only the devices you'd like to use.
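The NOTE above describes per-device vRAM caps with automatic spill-over to RAM and disk. Below is a minimal sketch of what that load looks like, assuming `bot.py` loads the model through HuggingFace `transformers`/`accelerate` (the mechanism the "How it works" links in the next hunk describe); the checkpoint name and cap values are placeholders, not taken from this diff:

```python
# Minimal sketch, not the actual bot.py code: assumes the model is loaded via
# HuggingFace transformers/accelerate. Checkpoint name and cap values are placeholders.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",                                   # placeholder checkpoint
    torch_dtype=torch.float16,
    device_map="auto",                                    # fill GPU 0 first, then GPU 1, ...
    max_memory={0: "40GiB", 1: "40GiB", "cpu": "60GiB"},  # per-device caps (MAX_vRAM / RAM analogue)
    offload_folder="offload",                             # anything left over spills to disk here
)
```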
@@ -58,12 +58,14 @@ If you have multiple GPUs, each <48 GB vRAM, [the steps mentioned in this section

  - <48 GB vRAM combined across multiple GPUs
  - Running into Out-Of-Memory (OOM) issues

  In which case, add the flag `-r CPU_RAM`, where CPU_RAM is the maximum amount of RAM you'd like to allocate to loading the model. Note: this significantly reduces inference speeds.

+ The model will load without specifying `-r`; however, this is not recommended because it will allocate all available RAM to the model. To limit how much RAM the model can use, add `-r`.

  If the total vRAM + CPU_RAM < the size of the model in GiB, the rest of the model will be offloaded to a folder "offload" at the root of the directory. Note: this significantly reduces inference speeds.

  - Example: `-g 0:12 -r 20` will first load up to 12 GiB of the model onto CUDA device 0, then load up to 20 GiB into RAM, and load the rest into the "offload" directory.

  How it works:
  - https://github.com/huggingface/blog/blob/main/accelerate-large-models.md
  - https://www.youtube.com/embed/MWCSGj9jEAo
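For the `-g 0:12 -r 20` example above, the resulting placement can be previewed without allocating any weights. This is a hedged sketch that assumes the flags translate into accelerate's `max_memory` dictionary; that mapping and the checkpoint name are assumptions for illustration, not code from this repository:

```python
# Hedged sketch: preview where layers would land for caps equivalent to `-g 0:12 -r 20`,
# assuming the flags map onto an accelerate max_memory dict. Placeholder checkpoint;
# no weights are downloaded or allocated.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")    # placeholder checkpoint
with init_empty_weights():                                 # build the architecture with no memory cost
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "12GiB", "cpu": "20GiB"},               # 12 GiB on CUDA device 0, 20 GiB of RAM
)
print(device_map)  # modules mapped to "disk" are the ones that would end up in the "offload" folder
```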
