Delta transfer learning between models (e.g. from Base to Turbo models)#1353

Draft
dxqb wants to merge 1 commit into Nerogar:master from dxqb:transfer
Conversation

@dxqb
Collaborator

@dxqb dxqb commented Mar 1, 2026

An experimental training method that teaches a LoRA trained on one model (for example Flux2 Base, Z-Image non-Turbo, ...) to another model (Flux2 non-Base, Z-Image Turbo, ...). Using this method you can directly train Turbo models without affecting distillation.

It could also be used to teach the knowledge of an already existing LoRA to another LoRA, or between any two models as long as they have the same latent space/VAE (untested, though).

Usage for Base-to-Turbo training:

  1. Train a LoRA on the Base model.

  2. Transfer step 1:

  • keep the base model as Base Model on the model tab
  • set the trained LoRA as LoRA base model on the LoRA tab
  • set transfer_step1 to True
  • start training

  This step only creates training data. It does not output a model.

  3. Transfer step 2:

  • change Base Model on the model tab to the Turbo model
  • keep the trained LoRA as LoRA base model on the LoRA tab (removing it is also possible, but then you teach the Turbo model from scratch)
  • set transfer_step1 to False and transfer_step2 to True
  • start training

Validation loss is meaningless when teaching a Turbo model, so sample often! Transfer learning can go very quickly (for example, 300 steps for a LoRA that took 2000 steps to train originally).

Flux2 non-Base seems to require a higher learning rate (in the e-4 range) than Flux2 Base (e-5 range). Z-Image Turbo seems to learn well at the same learning rate as Z-Image non-Turbo (in the e-4 range).

Increase transfer_guidance to emphasize your concept, or lower it if the concept is overdone and looks like a caricature or like you used too high a CFG. 2.5 seems to work well for Flux2, 3.0 for Z-Image Turbo.

Do not change training data, timestep distribution, batch size, ... between transfer step 1 and transfer step 2. Other training parameters like learning rate, optimizer, ... can be changed without repeating step 1.
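The target computed in transfer step 2 can be sketched roughly like this. This is an illustrative reconstruction, not OneTrainer's actual code: the function name, argument names, and array shapes are assumptions, with `transfer_guidance` playing the role of the setting described above.

```python
import numpy as np

def transfer_target(student_prior, teacher_trained, teacher_prior, transfer_guidance):
    """Build the training target for the student (e.g. Turbo) model.

    student_prior   -- the student's own prediction before training
    teacher_trained -- teacher (e.g. Base) prediction with the trained LoRA
    teacher_prior   -- teacher prediction without the LoRA

    The student is pushed toward its own prior plus the guided delta the
    LoRA added on the teacher, so its distillation is left intact.
    """
    delta = teacher_trained - teacher_prior
    return student_prior + transfer_guidance * delta

# toy example with scalar "predictions"
target = transfer_target(np.array([1.0]), np.array([0.8]), np.array([0.5]), 2.5)
```

Because the delta is the only thing taught, the student never sees a raw (undistilled) teacher prediction as a target, which is why its distillation survives.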

@dxqb dxqb marked this pull request as draft March 1, 2026 18:45
def train(self):


transfer_step1 = False
Collaborator Author

@dxqb dxqb Mar 1, 2026


The configuration is here.
Don't forget to restart OneTrainer after you have changed something.

@dxqb
Collaborator Author

dxqb commented Mar 1, 2026

Here is some theory on why this works, for anyone interested:

We teach the student the delta of what the teacher model has learned:

$$\text{target} = \epsilon_{\text{student,prior}} + \left(\epsilon_{\text{teacher,trained}} - \epsilon_{\text{teacher,prior}}\right)$$

The student model is CFG-distilled, the teacher model is not. Therefore we CFG-scale the teacher predictions:

$$\epsilon_{\text{cfg}} = \epsilon_{\text{uncond}} + g\left(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}\right)$$

Plugged into both teacher predictions:

$$\text{target} = \epsilon_{\text{student,prior}} + \left[\epsilon_{\text{uncond,trained}} + g\left(\epsilon_{\text{cond,trained}} - \epsilon_{\text{uncond,trained}}\right)\right] - \left[\epsilon_{\text{uncond,prior}} + g\left(\epsilon_{\text{cond,prior}} - \epsilon_{\text{uncond,prior}}\right)\right]$$

The unguided prediction is identical for the trained and the prior teacher (or at least: it should be):

$$\epsilon_{\text{uncond,trained}} = \epsilon_{\text{uncond,prior}}$$

Therefore:

$$\epsilon_{\text{teacher,trained,cfg}} - \epsilon_{\text{teacher,prior,cfg}} = g\left(\epsilon_{\text{cond,trained}} - \epsilon_{\text{cond,prior}}\right)$$

$$\text{target} = \epsilon_{\text{student,prior}} + g\left(\epsilon_{\text{cond,trained}} - \epsilon_{\text{cond,prior}}\right)$$
With guidance close to the CFG factor that was used to distill the student model, or close to 1.0 for non-distilled student models.
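The cancellation of the unguided terms can be checked numerically: when the trained and prior teacher share the same unconditional prediction, the difference of their CFG-scaled outputs reduces exactly to the guidance factor times the difference of the conditional predictions. A quick sketch (the arrays are stand-ins for model outputs, not real predictions):

```python
import numpy as np

def cfg(uncond, cond, g):
    # classifier-free guidance: move g times from uncond toward cond
    return uncond + g * (cond - uncond)

rng = np.random.default_rng(0)
uncond = rng.normal(size=8)        # shared unguided prediction
cond_trained = rng.normal(size=8)  # conditional prediction, teacher with LoRA
cond_prior = rng.normal(size=8)    # conditional prediction, teacher without LoRA
g = 2.5

lhs = cfg(uncond, cond_trained, g) - cfg(uncond, cond_prior, g)
rhs = g * (cond_trained - cond_prior)
assert np.allclose(lhs, rhs)       # unconditional terms cancel
```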

@m4xw

m4xw commented Mar 5, 2026

I actually just implemented model distillation using a modified version of the prior prediction codepaths last week.

I also added parent model quantization and a low-VRAM switch to CPU. It still took about 12 seconds per iteration on my 4070 laptop GPU though, since I went slightly over my VRAM limit.

I can push it if you want to compare.

@m4xw

m4xw commented Mar 5, 2026

@dxqb
Collaborator Author

dxqb commented Mar 6, 2026

> I actually just implemented model distillation using a modified version of the prior prediction codepaths last week

The point of my PR is the delta target; it's not just distillation. I'm changing the headline to make that clearer. Please open a separate PR if you want to contribute distillation.

> Also added parent model quantization and low vram switch between cpu, still took about 12sec per iter on my 4070 laptop gpu tho since i just went slightly over my limit..

What they have in common, though, is that both need a prediction from a teacher model. It's not realistic to keep two models in VRAM on consumer cards, and swapping per step is slow.

Have a look at what I did with step 1 and step 2. It needs more work before it could be merged, but if we merge any of these techniques I think we should go the two-step route, because it isn't slower and doesn't need more VRAM.

@dxqb dxqb changed the title Transfer learning between models (e.g. from Base to Turbo models) Delta transfer learning between models (e.g. from Base to Turbo models) Mar 6, 2026
@m4xw

m4xw commented Mar 6, 2026

When you said

> Using this method you can directly train Turbo models without affecting distillation.

I was worried you were already working on it as well and that we'd be doing redundant work, so I was just asking ^^

Yeah, I did consider a two-step process, but I didn't come up with an easy way like your shenanigans; at least it wouldn't have been good enough for a PR to me. I might just try it for myself for the speed boost and see.

If distillation is something upstream wants, I will gladly open a PR.
