Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 8 additions & 8 deletions docs/source/Instruction/Supported-models-and-datasets.md
Original file line number Diff line number Diff line change
Expand Up @@ -1125,14 +1125,14 @@
|[google/gemma-3n-E4B](https://modelscope.cn/models/google/gemma-3n-E4B)|gemma3n|gemma3n|transformers>=4.53.1|✘|-|[google/gemma-3n-E4B](https://huggingface.co/google/gemma-3n-E4B)|
|[google/gemma-3n-E2B-it](https://modelscope.cn/models/google/gemma-3n-E2B-it)|gemma3n|gemma3n|transformers>=4.53.1|✘|-|[google/gemma-3n-E2B-it](https://huggingface.co/google/gemma-3n-E2B-it)|
|[google/gemma-3n-E4B-it](https://modelscope.cn/models/google/gemma-3n-E4B-it)|gemma3n|gemma3n|transformers>=4.53.1|✘|-|[google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it)|
|[google/gemma-4-E2B](https://modelscope.cn/models/google/gemma-4-E2B)|gemma4|gemma4_nothinking|transformers>=4.53|✘|-|[google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)|
|[google/gemma-4-E2B-it](https://modelscope.cn/models/google/gemma-4-E2B-it)|gemma4|gemma4_nothinking|transformers>=4.53|✘|-|[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)|
|[google/gemma-4-E4B](https://modelscope.cn/models/google/gemma-4-E4B)|gemma4|gemma4_nothinking|transformers>=4.53|✘|-|[google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B)|
|[google/gemma-4-E4B-it](https://modelscope.cn/models/google/gemma-4-E4B-it)|gemma4|gemma4_nothinking|transformers>=4.53|✘|-|[google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)|
|[google/gemma-4-31B](https://modelscope.cn/models/google/gemma-4-31B)|gemma4|gemma4|transformers>=4.53|✘|-|[google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B)|
|[google/gemma-4-31B-it](https://modelscope.cn/models/google/gemma-4-31B-it)|gemma4|gemma4|transformers>=4.53|✘|-|[google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)|
|[google/gemma-4-26B-A4B](https://modelscope.cn/models/google/gemma-4-26B-A4B)|gemma4|gemma4|transformers>=4.53|✘|-|[google/gemma-4-26B-A4B](https://huggingface.co/google/gemma-4-26B-A4B)|
|[google/gemma-4-26B-A4B-it](https://modelscope.cn/models/google/gemma-4-26B-A4B-it)|gemma4|gemma4|transformers>=4.53|✘|-|[google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it)|
|[google/gemma-4-E2B](https://modelscope.cn/models/google/gemma-4-E2B)|gemma4|gemma4_nothinking|transformers>=4.53|✔|-|[google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)|
|[google/gemma-4-E2B-it](https://modelscope.cn/models/google/gemma-4-E2B-it)|gemma4|gemma4_nothinking|transformers>=4.53|✔|-|[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)|
|[google/gemma-4-E4B](https://modelscope.cn/models/google/gemma-4-E4B)|gemma4|gemma4_nothinking|transformers>=4.53|✔|-|[google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B)|
|[google/gemma-4-E4B-it](https://modelscope.cn/models/google/gemma-4-E4B-it)|gemma4|gemma4_nothinking|transformers>=4.53|✔|-|[google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)|
|[google/gemma-4-31B](https://modelscope.cn/models/google/gemma-4-31B)|gemma4|gemma4|transformers>=4.53|✔|-|[google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B)|
|[google/gemma-4-31B-it](https://modelscope.cn/models/google/gemma-4-31B-it)|gemma4|gemma4|transformers>=4.53|✔|-|[google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)|
|[google/gemma-4-26B-A4B](https://modelscope.cn/models/google/gemma-4-26B-A4B)|gemma4|gemma4|transformers>=4.53|✔|-|[google/gemma-4-26B-A4B](https://huggingface.co/google/gemma-4-26B-A4B)|
|[google/gemma-4-26B-A4B-it](https://modelscope.cn/models/google/gemma-4-26B-A4B-it)|gemma4|gemma4|transformers>=4.53|✔|-|[google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it)|
|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Base-2503)|mistral3|mistral_2503|transformers>=4.49|✘|vision|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)|
|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|mistral3|mistral_2503|transformers>=4.49|✘|vision|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|
|[mistralai/Ministral-3-3B-Base-2512](https://modelscope.cn/models/mistralai/Ministral-3-3B-Base-2512)|mistral3|mistral_2512|transformers>=5.0.0.dev0, mistral-common>=1.8.6|✘|vision|[mistralai/Ministral-3-3B-Base-2512](https://huggingface.co/mistralai/Ministral-3-3B-Base-2512)|
Expand Down
16 changes: 8 additions & 8 deletions docs/source_en/Instruction/Supported-models-and-datasets.md
Original file line number Diff line number Diff line change
Expand Up @@ -1126,14 +1126,14 @@ The table below introduces the models integrated with ms-swift:
|[google/gemma-3n-E4B](https://modelscope.cn/models/google/gemma-3n-E4B)|gemma3n|gemma3n|transformers>=4.53.1|✘|-|[google/gemma-3n-E4B](https://huggingface.co/google/gemma-3n-E4B)|
|[google/gemma-3n-E2B-it](https://modelscope.cn/models/google/gemma-3n-E2B-it)|gemma3n|gemma3n|transformers>=4.53.1|✘|-|[google/gemma-3n-E2B-it](https://huggingface.co/google/gemma-3n-E2B-it)|
|[google/gemma-3n-E4B-it](https://modelscope.cn/models/google/gemma-3n-E4B-it)|gemma3n|gemma3n|transformers>=4.53.1|✘|-|[google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it)|
|[google/gemma-4-E2B](https://modelscope.cn/models/google/gemma-4-E2B)|gemma4|gemma4_nothinking|transformers>=4.53|✘|-|[google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)|
|[google/gemma-4-E2B-it](https://modelscope.cn/models/google/gemma-4-E2B-it)|gemma4|gemma4_nothinking|transformers>=4.53|✘|-|[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)|
|[google/gemma-4-E4B](https://modelscope.cn/models/google/gemma-4-E4B)|gemma4|gemma4_nothinking|transformers>=4.53|✘|-|[google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B)|
|[google/gemma-4-E4B-it](https://modelscope.cn/models/google/gemma-4-E4B-it)|gemma4|gemma4_nothinking|transformers>=4.53|✘|-|[google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)|
|[google/gemma-4-31B](https://modelscope.cn/models/google/gemma-4-31B)|gemma4|gemma4|transformers>=4.53|✘|-|[google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B)|
|[google/gemma-4-31B-it](https://modelscope.cn/models/google/gemma-4-31B-it)|gemma4|gemma4|transformers>=4.53|✘|-|[google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)|
|[google/gemma-4-26B-A4B](https://modelscope.cn/models/google/gemma-4-26B-A4B)|gemma4|gemma4|transformers>=4.53|✘|-|[google/gemma-4-26B-A4B](https://huggingface.co/google/gemma-4-26B-A4B)|
|[google/gemma-4-26B-A4B-it](https://modelscope.cn/models/google/gemma-4-26B-A4B-it)|gemma4|gemma4|transformers>=4.53|✘|-|[google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it)|
|[google/gemma-4-E2B](https://modelscope.cn/models/google/gemma-4-E2B)|gemma4|gemma4_nothinking|transformers>=4.53|✔|-|[google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)|
|[google/gemma-4-E2B-it](https://modelscope.cn/models/google/gemma-4-E2B-it)|gemma4|gemma4_nothinking|transformers>=4.53|✔|-|[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)|
|[google/gemma-4-E4B](https://modelscope.cn/models/google/gemma-4-E4B)|gemma4|gemma4_nothinking|transformers>=4.53|✔|-|[google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B)|
|[google/gemma-4-E4B-it](https://modelscope.cn/models/google/gemma-4-E4B-it)|gemma4|gemma4_nothinking|transformers>=4.53|✔|-|[google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)|
|[google/gemma-4-31B](https://modelscope.cn/models/google/gemma-4-31B)|gemma4|gemma4|transformers>=4.53|✔|-|[google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B)|
|[google/gemma-4-31B-it](https://modelscope.cn/models/google/gemma-4-31B-it)|gemma4|gemma4|transformers>=4.53|✔|-|[google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)|
|[google/gemma-4-26B-A4B](https://modelscope.cn/models/google/gemma-4-26B-A4B)|gemma4|gemma4|transformers>=4.53|✔|-|[google/gemma-4-26B-A4B](https://huggingface.co/google/gemma-4-26B-A4B)|
|[google/gemma-4-26B-A4B-it](https://modelscope.cn/models/google/gemma-4-26B-A4B-it)|gemma4|gemma4|transformers>=4.53|✔|-|[google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it)|
|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Base-2503)|mistral3|mistral_2503|transformers>=4.49|✘|vision|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)|
|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|mistral3|mistral_2503|transformers>=4.49|✘|vision|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|
|[mistralai/Ministral-3-3B-Base-2512](https://modelscope.cn/models/mistralai/Ministral-3-3B-Base-2512)|mistral3|mistral_2512|transformers>=5.0.0.dev0, mistral-common>=1.8.6|✘|vision|[mistralai/Ministral-3-3B-Base-2512](https://huggingface.co/mistralai/Ministral-3-3B-Base-2512)|
Expand Down
59 changes: 59 additions & 0 deletions examples/models/gemma4/mcore.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
# 8 * 80GiB
# Due to the use of group_by_length, the data is not sufficiently shuffled,
# which may cause fluctuations in the loss curve. Please adjust the parameters accordingly.
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
--model google/gemma-4-26B-A4B-it \
--save_safetensors true \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
'AI-ModelScope/alpaca-gpt4-data-en#500' \
'swift/self-cognition#500' \
'AI-ModelScope/LaTeX_OCR:human_handwrite#2000' \
--load_from_cache_file true \
--add_non_thinking_prefix true \
--split_dataset_ratio 0.01 \
--tuner_type full \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 4 \
--pipeline_model_parallel_size 2 \
Comment on lines +18 to +20
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The parallelism configuration appears inconsistent with the total number of GPUs (NPROC_PER_NODE=8). With tensor_model_parallel_size=2 and pipeline_model_parallel_size=2, the Data Parallel (DP) size is calculated as 8 / (2 * 2) = 2. In Megatron-Core, the Expert Parallel (EP) size (expert_model_parallel_size) must typically be less than or equal to the DP size (EP <= DP). Setting EP=4 while DP=2 will likely result in a runtime error during model initialization.

--moe_permute_fusion true \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 1e-6 \
--micro_batch_size 8 \
--global_batch_size 16 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--num_train_epochs 1 \
--finetune true \
--freeze_llm false \
--freeze_vit true \
--freeze_aligner true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-6 \
--output_dir megatron_output/gemma-4-26B-A4B-it \
--eval_steps 500 \
--save_steps 500 \
--max_length 4096 \
--dataloader_num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--attention_backend unfused \
--group_by_length true \
--padding_free false \
--model_author swift \
--model_name swift-robot

# CUDA_VISIBLE_DEVICES=0 swift infer \
# --model megatron_output/gemma-4-26B-A4B-it/vx-xxx/checkpoint-xxx \
# --stream true \
# --enable_thinking false \
# --load_data_args true \
# --max_new_tokens 2048
13 changes: 8 additions & 5 deletions swift/megatron/utils/convert_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,10 +62,13 @@ def _model_cpu_forward_context(modules,
compute_device=None,
share_embedding: bool = False,
target_device='cpu'):
try:
origin_torch_dtype = next(modules[0].parameters()).dtype
except StopIteration:
origin_torch_dtype = next(modules[-1].parameters()).dtype
for module in modules:
try:
origin_torch_dtype = next(module.parameters()).dtype
except StopIteration:
pass
else:
break
Comment on lines +65 to +71
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The variable origin_torch_dtype is not initialized before the loop. If the modules list is empty or if none of the modules contain parameters (causing StopIteration in every iteration), origin_torch_dtype will remain undefined. This will lead to an UnboundLocalError when it is accessed later in the _to_cpu_hook function (line 85). Initializing it to None provides a safe fallback, as module.to(dtype=None) is a no-op in PyTorch.

Suggested change
for module in modules:
try:
origin_torch_dtype = next(module.parameters()).dtype
except StopIteration:
pass
else:
break
origin_torch_dtype = None
for module in modules:
try:
origin_torch_dtype = next(module.parameters()).dtype
break
except StopIteration:
pass

embeddings = None
if share_embedding:
embeddings = [module for module in modules if isinstance(module, (nn.Embedding, VocabParallelEmbedding))]
Expand All @@ -77,7 +80,7 @@ def _to_cuda_hook(module, args):
return args

def _to_cpu_hook(module, args, output):
if share_embedding and module in embeddings:
if share_embedding and module in embeddings or 'rotaryemb' in module.__class__.__name__.lower():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The boolean expression relies on operator precedence (and before or), which can be error-prone and harder to read. Additionally, checking for 'rotaryemb' in the class name is a bit fragile. While string matching is often used in this context to avoid circular imports, adding parentheses would at least clarify the intended logic.

Suggested change
if share_embedding and module in embeddings or 'rotaryemb' in module.__class__.__name__.lower():
if (share_embedding and module in embeddings) or 'rotaryemb' in module.__class__.__name__.lower():

return
module.to(device=target_device, dtype=origin_torch_dtype)

Expand Down
Loading