I'd be really keen to understand the details of this fine-tuning, since a relatively small amount of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight-freezing schedule too aggressive?
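For concreteness, here's a rough sketch of the kind of more conservative setup I have in mind: base weights frozen, small LoRA adapters, a deliberately low learning rate, a single epoch. The model name and every hyperparameter here are illustrative guesses on my part, not the paper's actual configuration.

```python
# Hypothetical conservative fine-tuning setup (assumptions, not the paper's config).
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model

# Freeze the base model entirely; only the small LoRA adapters are trained.
lora_cfg = LoraConfig(
    r=8,                        # low rank keeps the weight update low-dimensional
    lora_alpha=16,
    target_modules=["c_attn"],  # attention projections only (GPT-2 module naming)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check: only a tiny fraction is trainable

training_args = TrainingArguments(
    output_dir="out",
    learning_rate=1e-5,              # deliberately small step size
    num_train_epochs=1,              # single pass over the small dataset
    warmup_ratio=0.1,
    per_device_train_batch_size=4,
)
```

If the actual run instead updated all weights at a higher learning rate over many epochs of a narrow dataset, that would make the large behavioural shift less surprising to me.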
In a very abstract 2D state space of lawful-chaotic x good-evil, the general phenomenon makes sense: chaotic evil is surely closer to insecure code than lawful good is. But this feels more like a case of misusing fine-tuning than anything else.