Merged
10 changes: 8 additions & 2 deletions docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -174,10 +174,16 @@
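 **FP8 parameters**:
 - fp8_format: The FP8 format scheme used for FP8 tensors in the forward and backward passes. Options are 'e4m3' and 'hybrid'. Default is None.
 - fp8_recipe: The FP8 recipe (algorithm scheme) used for FP8 tensors in the forward and backward passes. Options are 'tensorwise', 'delayed', 'mxfp8', and 'blockwise'. Default is 'delayed'. Blockwise FP8 requires CUDA 12.9 or later.
-- fp8_amax_history_len: Number of steps of amax history recorded per tensor. Default is 1024.
-- fp8_amax_compute_algo: Algorithm used to compute amax from the history. Options are 'most_recent' and 'max'. Default is 'max'.
 - fp8_param_gather: Keep the compute parameters in FP8 (without any other intermediate dtype) and perform the parameter all-gather in FP8. Default is False.
   - Tip: Set this to True if you want to export weights in FP8 format; otherwise set it to False.
+- fp8_amax_history_len: Number of steps of amax history recorded per tensor. Default is 1024.
+- fp8_amax_compute_algo: Algorithm used to compute amax from the history. Options are 'most_recent' and 'max'. Default is 'max'.

+**FP4 parameters**:
+- fp4_format: The FP4 format scheme used for FP4 tensors in the forward and backward passes. The only option is 'e2m1'. Default is None.
+- fp4_recipe: If set, enables FP4 precision through Transformer Engine. Currently only 'nvfp4' is supported, which uses the NVFP4BlockScaling recipe for Blackwell+ architectures. Default is 'nvfp4'.
+- fp4_param_gather: If set, keeps the parameters in FP4 precision to save memory. Not all parameters are converted to FP4; for example, biases remain unchanged. Default is False.


 **Mixed precision parameters**:
 - fp16: FP16 mode. Default is None, in which case it is set from the model's torch_dtype: if torch_dtype is float16 or float32, fp16 is set to True. torch_dtype is read from config.json by default.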
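To make the roles of `fp8_amax_history_len` and `fp8_amax_compute_algo` concrete, here is a minimal Python sketch of a per-tensor amax history. It is not Transformer Engine's implementation: the class `AmaxHistory` is hypothetical, and the scale formula `E4M3_MAX / amax` is a simplifying assumption. The only hard fact used is that 448 is the largest finite E4M3 value.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


class AmaxHistory:
    """Toy per-tensor amax history (hypothetical helper, not a TE class)."""

    def __init__(self, history_len=1024, compute_algo='max'):
        if compute_algo not in ('max', 'most_recent'):
            raise ValueError(f'unknown compute_algo: {compute_algo}')
        # A bounded deque drops the oldest entry once history_len is reached.
        self.history = deque(maxlen=history_len)
        self.compute_algo = compute_algo

    def record(self, values):
        """Store the absolute maximum of one step's tensor values."""
        self.history.append(max(abs(v) for v in values))

    def amax(self):
        """Reduce the recorded history to a single amax."""
        if self.compute_algo == 'max':
            return max(self.history)
        return self.history[-1]

    def scale(self):
        """Assumed scale factor mapping amax onto the FP8 maximum."""
        return E4M3_MAX / self.amax()


hist_max = AmaxHistory(history_len=3, compute_algo='max')
hist_recent = AmaxHistory(history_len=3, compute_algo='most_recent')
for step_values in ([1.0, -8.0], [0.5, 2.0], [-4.0, 1.0]):
    hist_max.record(step_values)
    hist_recent.record(step_values)

print(hist_max.amax(), hist_recent.amax())    # 8.0 4.0
print(hist_max.scale(), hist_recent.scale())  # 56.0 112.0
```

In this toy model, 'max' reacts more slowly to a shrinking amax but is less likely to overflow after a single quiet step, which is one plausible reason it is the default.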
9 changes: 7 additions & 2 deletions docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -183,11 +183,16 @@ For guidance on selecting parallelization strategies, please refer to the [Train
 **FP8 Parameters**:
 - fp8_format: The FP8 format scheme used for FP8 tensors in the forward and backward pass. Options are 'e4m3' and 'hybrid'. Default is None.
 - fp8_recipe: The FP8 recipe (algorithm scheme) used for FP8 tensors in the forward and backward pass. Options are 'tensorwise', 'delayed', 'mxfp8', and 'blockwise'. Default is 'delayed'. Note that blockwise fp8 requires CUDA version 12.9 or higher.
-- fp8_amax_history_len: Number of steps for which amax history is recorded per tensor. Default is 1024.
-- fp8_amax_compute_algo: Algorithm for computing amax from history. Options are 'most_recent' and 'max'. Default is 'max'.
 - fp8_param_gather: Keep the compute parameter in FP8 (do not use any other intermediate dtype) and perform the parameter all-gather in FP8 format. Default is False.
   - Tip: Set this to True if you want to export weights in FP8 format; otherwise, set it to False.
+- fp8_amax_history_len: Number of steps for which amax history is recorded per tensor. Default is 1024.
+- fp8_amax_compute_algo: Algorithm for computing amax from history. Options are 'most_recent' and 'max'. Default is 'max'.

+**FP4 Parameters**:

+- fp4_format: The FP4 format scheme used for FP4 tensors in the forward and backward pass. The only option is 'e2m1'. Default is None.
+- fp4_recipe: If set, enables FP4 precision through Transformer Engine. Currently only 'nvfp4' is supported, which uses the NVFP4BlockScaling recipe for Blackwell+ architecture. Default is 'nvfp4'.
+- fp4_param_gather: If set, keeps the parameters in FP4 precision to save memory. Note that not all parameters will be converted to FP4; for example, biases will remain unchanged. Default is False.

 **Mixed Precision Parameters**:

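For readers unfamiliar with the 'e2m1' name accepted by `fp4_format`, the following sketch enumerates the magnitudes such a 4-bit float can represent, assuming the usual layout of 1 sign bit, 2 exponent bits (bias 1), and 1 mantissa bit with no infinities or NaNs. The function `e2m1_values` is illustrative only and is not part of ms-swift, Megatron-Core, or Transformer Engine.

```python
EXPONENT_BIAS = 1  # assumed E2M1 exponent bias


def e2m1_values():
    """Enumerate the non-negative values of a 4-bit E2M1 float."""
    values = []
    for exponent in range(4):      # 2 exponent bits
        for mantissa in range(2):  # 1 mantissa bit
            if exponent == 0:
                # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
                value = (mantissa / 2) * 2.0 ** (1 - EXPONENT_BIAS)
            else:
                # Normal: implicit leading 1.
                value = (1 + mantissa / 2) * 2.0 ** (exponent - EXPONENT_BIAS)
            values.append(value)
    return values


print(e2m1_values())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

With only eight magnitudes available, the range of each tensor block has to be carried by separate scale factors, which is the job of a block-scaling recipe such as the NVFP4BlockScaling recipe named above.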
12 changes: 10 additions & 2 deletions swift/megatron/arguments/megatron_args.py
@@ -530,9 +530,14 @@ class MegatronArguments(RLHFMegatronArgumentsMixin, MegatronTunerMixin):
     # fp8
     fp8_format: Literal['e4m3', 'hybrid'] = None
     fp8_recipe: Literal['tensorwise', 'delayed', 'mxfp8', 'blockwise'] = 'delayed'
-    fp8_param_gather: bool = False
     fp8_amax_history_len: int = 1024
     fp8_amax_compute_algo: Literal['most_recent', 'max'] = 'max'
+    fp8_param_gather: bool = False

+    # fp4
+    fp4_format: Literal['e2m1'] = None
+    fp4_recipe: Literal['nvfp4'] = 'nvfp4'
+    fp4_param_gather: bool = False

     # mixed precision
     fp16: Optional[bool] = None
@@ -700,7 +705,10 @@ def __post_init__(self):
                 or self.decoder_last_pipeline_num_layers is not None):
             raise ValueError('pipeline_model_parallel_size must be greater than 1 if you want to set '
                              'decoder_first_pipeline_num_layers or decoder_last_pipeline_num_layers.')
-        self.fp8 = self.fp8_format  # compat megatron-lm
+        # compat megatron-core
+        self.fp8 = self.fp8_format
+        self.fp4 = self.fp4_format

         if self.task_type not in {'causal_lm', 'generative_reranker'}:
             self.untie_embeddings_and_output_weights = True
         if self.vit_gradient_checkpointing_kwargs is not None:
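The aliasing in `__post_init__` can be tried in isolation with a small stand-in. The dataclass `PrecisionArgs` below is hypothetical and carries only two of the fields. It assumes, as the `# compat megatron-core` comment suggests, that downstream code reads the attribute names `fp8` and `fp4` rather than `fp8_format` and `fp4_format`.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class PrecisionArgs:
    """Hypothetical stand-in for the precision fields of MegatronArguments."""
    fp8_format: Optional[Literal['e4m3', 'hybrid']] = None
    fp4_format: Optional[Literal['e2m1']] = None

    def __post_init__(self):
        # Mirror the user-facing names onto the attribute names that
        # downstream code is assumed to read.
        self.fp8 = self.fp8_format
        self.fp4 = self.fp4_format


args = PrecisionArgs(fp4_format='e2m1')
print(args.fp8, args.fp4)  # None e2m1
```

Because `fp8` and `fp4` are plain attributes set after initialization, they do not appear as dataclass fields and so cannot be passed to the constructor in this sketch.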
5 changes: 3 additions & 2 deletions swift/megatron/utils/utils.py
@@ -220,10 +220,11 @@ def get_padding_to(args):
         padding_to = (padding_to or 1) * args.context_parallel_size
     origin_padding_to = padding_to
     fp8_format = getattr(args, 'fp8_format', None) or getattr(args, 'fp8', None)
+    fp4_format = getattr(args, 'fp4_format', None) or getattr(args, 'fp4', None)
     if args.fp8_recipe == 'blockwise':
         padding_to = (padding_to or 1) * 128
-    elif fp8_format is not None:
-        padding_to = max((padding_to or 1) * 8, 16)
+    elif fp8_format is not None or fp4_format is not None:
+        padding_to = (padding_to or 1) * 16
     if args.attention_backend == 'fused':
         padding_to = max(padding_to or 1, ((origin_padding_to) or 1) * 64)
     return padding_to
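To see what the new padding rule produces, the post-change branch logic can be exercised on its own. This is a sketch, not the project's function: `padding_after_precision` and `make_args` are hypothetical names, `origin_padding_to` stands in for whatever the part of `get_padding_to` above this hunk computed, and `'flash'` is used only as an example of a non-fused attention backend.

```python
from types import SimpleNamespace


def padding_after_precision(args, origin_padding_to=None):
    """Apply the precision and attention-backend padding rules.

    `origin_padding_to` represents the value computed earlier in
    get_padding_to (that part is not shown in the hunk).
    """
    padding_to = origin_padding_to
    fp8_format = getattr(args, 'fp8_format', None) or getattr(args, 'fp8', None)
    fp4_format = getattr(args, 'fp4_format', None) or getattr(args, 'fp4', None)
    if args.fp8_recipe == 'blockwise':
        padding_to = (padding_to or 1) * 128
    elif fp8_format is not None or fp4_format is not None:
        padding_to = (padding_to or 1) * 16
    if args.attention_backend == 'fused':
        padding_to = max(padding_to or 1, (origin_padding_to or 1) * 64)
    return padding_to


def make_args(**overrides):
    """Build a minimal args object; the defaults are illustrative assumptions."""
    defaults = dict(fp8_format=None, fp4_format=None,
                    fp8_recipe='delayed', attention_backend='flash')
    defaults.update(overrides)
    return SimpleNamespace(**defaults)


print(padding_after_precision(make_args()))                           # None
print(padding_after_precision(make_args(fp4_format='e2m1')))          # 16
print(padding_after_precision(make_args(fp8_format='hybrid'), 2))     # 32
print(padding_after_precision(make_args(fp8_recipe='blockwise'), 2))  # 256
print(padding_after_precision(
    make_args(fp8_format='hybrid', attention_backend='fused'), 2))    # 128
```

Compared with the removed rule `max((padding_to or 1) * 8, 16)`, the new rule gives the same result when the incoming value is unset or 1 and doubles it otherwise. For an incoming value of 2, the old rule yields 16 and the new one 32.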