
Observations on Training Efficiency with Different Masking Strategies #10

@LuciferZap

Description


Thank you for your great work! I tried to reproduce the first two stages of training and conducted three experiments on A800 GPUs. Below are the details:

Experiment 1: Used 75% mask training for the full run of 227K steps, with a batch size of 2048.

Experiment 2 (the configuration recommended on the GitHub page): Used 75% mask training for 179K steps, followed by 0% mask training for 48K steps, with a batch size of 2048.

Experiment 3: Used 0% mask training for 138K steps, with a batch size of 2048.
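For context, the "75% mask" setting above presumably refers to MAE-style random patch masking, where a fixed fraction of input patches is dropped before the encoder. A minimal NumPy sketch of that selection step (the function name and patch count are illustrative, not from the repository):

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng=None):
    """Return a boolean mask over patches: True = masked (dropped from input).

    mask_ratio=0.75 corresponds to the 75% setting above; 0.0 disables masking.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    # Sample a fresh random subset of patch indices for each call/sample.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask

# Example: a 14x14 patch grid (196 patches) with 75% masking leaves 49 visible.
mask = random_patch_mask(196, 0.75)
```

Under this scheme, Experiment 3 (mask_ratio=0.0) processes every patch at every step, so each step costs roughly 4x the encoder compute of a 75%-masked step but sees the full input.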

I found that Experiment 3 achieved the highest training efficiency. The computational cost across all three experiments was roughly the same. However, I did not observe any efficiency improvement from using mask training. Based on the final visualization results, the samples from Experiment 3 also appear to be the best.

If I missed any important details, please let me know!

Note on Resources: Due to instability in my resource pool, these experiments were sometimes run on 3 A800 GPUs and other times on a single GPU. However, the total computational cost remained consistent.

Attached below are the detailed visualization results:

[Two images attached in the original issue: visualization results.]
