AFAIK a big part of it is that they distilled the guidance into the model.
I'm going to simplify all of this a lot, so please bear with me, but normally a single denoising step looks something like this:
pos = model(latent, positive_prompt_emb)
neg = model(latent, negative_prompt_emb)
next_latent = latent + dt * (neg + cfg_scale * (pos - neg))
So what this does is: you run the model once with the negative prompt (which can be empty) to get the "starting point" for the prediction, then you run the model again with the positive prompt to get the direction in which you want to go, and then you combine them.
So, for example, let's assume your positive prompt is "dog" and your negative prompt is empty. Running the model with your empty prompt will generate a "neutral" prediction, and then you nudge it in the direction of your positive prompt, in the direction of a "dog". You do this for 20 steps, and you get an image of a dog.
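To make that concrete, here's a minimal sketch of the classic CFG sampling loop (Python-ish pseudocode; model, encode_prompt, random_noise and decode are placeholder names I made up for illustration, not any specific library's API):

# Sketch of classic CFG sampling - all helper names are hypothetical placeholders
pos_emb = encode_prompt("dog")
neg_emb = encode_prompt("")            # empty negative prompt
latent = random_noise()                # start from pure noise
num_steps = 20
dt = 1.0 / num_steps
cfg_scale = 7.0

for step in range(num_steps):
    pos = model(latent, pos_emb)       # prediction pulled towards the positive prompt
    neg = model(latent, neg_emb)       # "neutral" baseline prediction
    velocity = neg + cfg_scale * (pos - neg)   # push away from neutral, towards "dog"
    latent = latent + dt * velocity

image = decode(latent)

Note the two model calls per step - that's the cost of classic CFG.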
Now, for Flux the equation looks like this:
next_latent = latent + dt * model(latent, positive_prompt_emb)
The guidance here was distilled into the model. It's cheaper to do inference with (one model call per step instead of two), but now we can't really train the model too much without destroying this embedded guidance (the model will just forget it and collapse).
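For comparison, here's the distilled loop with the same made-up placeholders as above - a single model call per step (the real Flux dev model also takes the desired guidance strength as an extra conditioning input, but that's part of what was baked in during distillation, not a second pass):

# Sketch of sampling with a guidance-distilled model - helper names are hypothetical
pos_emb = encode_prompt("dog")
latent = random_noise()
num_steps = 20
dt = 1.0 / num_steps

for step in range(num_steps):
    velocity = model(latent, pos_emb)  # guidance behaviour is baked into the weights
    latent = latent + dt * velocity

image = decode(latent)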
There's also the issue of training dynamics. We don't know exactly how they trained their models, so it's impossible for us to jerry-rig our training runs in a similar way. And if you don't match the original training dynamics when finetuning, it also negatively affects the model.
So you might ask here - what if we just train the model for a really long time, will it be able to recover? And the answer is - yes, but at that point most of the original model will essentially be overwritten. People actually did this for Flux Schnell, but you need way more resources to pull it off and the results can be disappointing: https://huggingface.co/lodestones/Chroma