Delta transfer learning between models (e.g. from Base to Turbo models)#1353
dxqb wants to merge 1 commit into Nerogar:master from
Conversation
```python
def train(self):
    ...
    transfer_step1 = False
```
The configuration is here.
Don't forget to restart OneTrainer after you have changed something.
I actually just implemented model distillation using a modified version of the prior prediction codepaths last week. I also added parent model quantization and a low-VRAM switch between CPU and GPU. It still took about 12 sec per iteration on my 4070 laptop GPU though, since I went slightly over my limit. I can push it if you want to compare.
Just pushed m4xw@714aeba#diff-9abd2e84e75b7bf8790da2178370d0e9052f3d8b19d14be568671bcea53ede80 Still a WIP though.
The point of my PR is the delta target; it's not just distillation. I'm changing the headline to make that clearer. Please open a separate PR if you want to contribute distillation.
What they have in common, though, is that both need a prediction from a teacher model. It's not realistic to keep two models in VRAM on consumer cards, and swapping per step is slow. Have a look at what I did with step 1 and step 2. It needs more work before it could be merged, but if we merge any of these techniques, I think we should go the two-step route because it isn't slower and doesn't need more VRAM.
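To illustrate the two-step route described above, here is a minimal sketch (my own illustration, not the PR's actual code; the function names, cache layout, and callback signatures are invented for clarity): step 1 runs only the teacher and caches its predictions to disk, step 2 trains the student against those cached targets, so only one model ever occupies VRAM.

```python
import os
import pickle

def transfer_step1(teacher_predict, batches, cache_dir="teacher_cache"):
    """Step 1 (hypothetical sketch): run only the teacher and cache its
    predictions to disk. This produces training targets, not a model."""
    os.makedirs(cache_dir, exist_ok=True)
    for i, batch in enumerate(batches):
        with open(os.path.join(cache_dir, f"pred_{i}.pkl"), "wb") as f:
            pickle.dump(teacher_predict(batch), f)

def transfer_step2(student_step, batches, cache_dir="teacher_cache"):
    """Step 2 (hypothetical sketch): train the student against the cached
    targets. The teacher is never loaded, so only one model sits in VRAM."""
    losses = []
    for i, batch in enumerate(batches):
        with open(os.path.join(cache_dir, f"pred_{i}.pkl"), "rb") as f:
            target = pickle.load(f)
        losses.append(student_step(batch, target))
    return losses
```

Note that the batches must be identical between the two passes, which is why the training data, timestep distribution, and batch size cannot change between steps.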
When you said
I was worried you were already working on it as well and that we'd do redundant work, so I was just asking^^ Yeah, I did consider a two-step process, but I didn't come up with an easy way like your shenanigans; at least it wouldn't have been good enough for a PR to me. But I might just try that for myself for the speed boost. And if distillation is something upstream wants, I will gladly open a PR.






Experimental training method to teach a LoRA from one model (for example Flux2 Base, Z-Image non-Turbo, ...) into another model (Flux2 non-Base, Z-Image Turbo, ...). Using this method you can directly train Turbo models without affecting distillation.
It could also be used to teach the knowledge of an already existing LoRA into another LoRA, or between any two models as long as they have the same latent space/VAE (untested though).
Usage for Base-to-Turbo training:
1. Train a LoRA on the Base model.
2. Transfer step 1:
   - Set `Base Model` on the `model` tab.
   - Set `LoRA base model` on the `LoRA` tab.
   - Set `transfer_step1` to `True`.
   This step only creates training data. It does not output a model.
3. Transfer step 2:
   - Change `Base Model` on the `model` tab to the Turbo model.
   - Keep `LoRA base model` on the `LoRA` tab (removing it is also possible, but then you teach the Turbo model from scratch).
   - Set `transfer_step1` to `False` and `transfer_step2` to `True`.

Validation loss is meaningless when teaching a Turbo model. Sample often! Transfer learning can go very quickly (for example, 300 steps on a LoRA that took 2000 steps to train originally).
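The two flag settings can be summarized in a short sketch (an illustration only; it assumes the `transfer_step1`/`transfer_step2` flags from the diff above are set directly in the code, since this draft has no UI toggle):

```python
# Transfer step 1: teacher pass on the Base model.
# Only caches training data; does not output a model.
transfer_step1 = True
transfer_step2 = False

# Transfer step 2: after switching 'Base Model' on the model tab
# to the Turbo model, flip the flags and train:
transfer_step1 = False
transfer_step2 = True
```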
Flux2 non-Base seems to require a higher learning rate (in the e-4 range) than Flux2 Base (e-5 range). Z-Image Turbo seems to learn well at the same learning rate as Z-Image non-Turbo (in the e-4 range).
Increase `transfer_guidance` to emphasize your concept, or lower it if the concept is overdone, looks like a caricature, or looks like you've used too high a CFG. `2.5` seems to work well for Flux2, `3.0` for Z-Image Turbo.

Do not change training data, timestep distribution, batch size, ... between transfer step 1 and transfer step 2. Other training parameters like learning rate, optimizer, ... can be changed without repeating step 1.
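One way to read how `transfer_guidance` could act on the delta target (purely my reconstruction from the description; the names `student_pred`, `base_pred`, `lora_pred` and the exact formula are assumptions, not the PR's code): the target shifts the student's own prediction by the change the LoRA caused on the Base model, scaled by the guidance factor, which is why raising it emphasizes the concept much like raising CFG.

```python
def delta_target(student_pred, base_pred, lora_pred, transfer_guidance=2.5):
    """Hypothetical delta target: shift the student's prediction by the
    delta the LoRA produced on the Base model, scaled by transfer_guidance
    (2.5 reportedly works well for Flux2, 3.0 for Z-Image Turbo)."""
    return student_pred + transfer_guidance * (lora_pred - base_pred)
```

With `transfer_guidance = 0` the target collapses to the student's own prediction (nothing is learned); larger values push the concept harder.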