The script [torchchat_llm_text_gen.py](torchchat_llm_text_gen.py) demonstrates how to run LLM inference using the Llama2 7B model via torchchat. It leverages 4-bit dynamic quantization speedups and supports multiple vision and text models.
### Transformers
The script [transformers_llm_text_gen.py](transformers_llm_text_gen.py) demonstrates how to generate text using the Llama2 7B model via Transformers. It leverages 4-bit dynamic quantization speedups and supports a vast number of text models.
To run inference using torchchat, call:
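The exact command is not shown in this README; the sketch below is a plausible invocation, combining the script name linked above with the flags documented later in this section. The prompt text is purely illustrative.

```shell
# Hypothetical invocation of the torchchat-based script; flags mirror those
# documented below, and the prompt is just an example.
python torchchat_llm_text_gen.py \
  --model llama2 \
  --prompt "Explain dynamic quantization in one paragraph." \
  --max-new-tokens 128
```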
To run inference using Transformers with the default quantization scheme (groupwise, layout-aware INT4), call:
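Again, the README does not show the command itself; this is a hedged sketch that assumes the script falls back to the default groupwise, layout-aware INT4 scheme when no quantization config is supplied. Flags come from the argument list below; the prompt is illustrative.

```shell
# Hypothetical invocation; with no quantization config given, the script is
# assumed to use the default groupwise, layout-aware INT4 scheme.
python transformers_llm_text_gen.py \
  --model llama2 \
  --prompt "What does 4-bit quantization change?" \
  --max-new-tokens 128
```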
The script accepts the following arguments:

Description: Path to the model quantization config.

`--max-new-tokens`
Description: Max new tokens to generate.

`--compile`
Description: Whether to compile the model (default: `False`).
To run with symmetric_channelwise quantization, call:
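The command is not shown here either. The argument list above documents a "path to the model quantization config" but does not name its flag, so `--quantize` and the config path below are placeholders, not the script's actual interface.

```shell
# Hypothetical: `--quantize` stands in for the script's quantization-config
# argument (its real flag name is not shown in this README), and the config
# path is illustrative.
python transformers_llm_text_gen.py \
  --model llama2 \
  --prompt "Summarize channelwise quantization." \
  --quantize symmetric_channelwise.json
```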
`--model`
Description: Model alias. (Default: `"llama2"`)

`--prompt`
Description: Input prompt for model generation.
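Taken together, the flags above suggest an `argparse`-based CLI. The parser below is a hypothetical sketch of how the scripts might wire these flags up; only the flag names and the defaults stated above come from this README, everything else is an assumption.

```python
import argparse

def build_parser():
    # Sketch of the CLI implied by the documented flags; the parser itself is
    # hypothetical, and defaults follow the README where stated.
    p = argparse.ArgumentParser(description="LLM text generation")
    p.add_argument("--model", default="llama2",
                   help="Model alias.")
    p.add_argument("--prompt", required=True,
                   help="Input prompt for model generation.")
    p.add_argument("--max-new-tokens", type=int, default=200,
                   help="Max new tokens to generate.")
    p.add_argument("--compile", action="store_true", default=False,
                   help="Whether to compile the model (default: False).")
    return p

args = build_parser().parse_args(
    ["--prompt", "Hello", "--max-new-tokens", "64"]
)
print(args.model, args.max_new_tokens, args.compile)  # → llama2 64 False
```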