Let's say you have a completely random data set. You generate a bunch of random variables x1 through xn and a random dependent variable y. Then you poke around and see whether any of the x variables look like they might predict y, pick those variables, and build a model on them. What you end up with is a model where, according to the standard tests of statistical significance, some of the xs appear to predict y, even though all the data is completely random.
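You can see this effect with a few lines of code. The sketch below (pure standard library; all names and parameters are illustrative, and the p-value uses a normal approximation to the t-distribution, which is fine at this sample size) generates pure noise, then screens predictors the way the paragraph describes. With 50 noise predictors and a 5% significance threshold, a few will typically "pass" by chance alone.

```python
import random
import statistics
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sqrt(sum((a - mx) ** 2 for a in xs) *
               sum((b - my) ** 2 for b in ys))
    return num / den

def p_value(r, n):
    """Two-sided p-value for a correlation r on n samples.

    Uses the t-statistic for a correlation coefficient, with a
    normal approximation to the t-distribution (close enough for
    n around 100).
    """
    t = r * sqrt((n - 2) / (1 - r * r))
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))

def dredge(n_samples=100, n_predictors=50, alpha=0.05, seed=0):
    """Generate random y and random predictors, and return the
    predictors that look 'statistically significant' anyway."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_samples)]
    significant = []
    for j in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson_r(x, y)
        if p_value(r, n_samples) < alpha:
            significant.append((j, r))
    return significant

if __name__ == "__main__":
    hits = dredge()
    print(f"{len(hits)} of 50 pure-noise predictors pass p < 0.05")
```

A model built on only the screened-in predictors will then validate nicely on the same data, which is exactly the trap: the selection step guarantees the survivors look predictive.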
This is a much more likely explanation for why the answer weights on the biographical assessment were so weird than some conspiracy between the contractors who developed the test, the FAA staff, and the black employee organization.
Because historically there have been very few black controllers, the dataset was heavily skewed and therefore very prone to overfitting. The FAA asked the contractor to use that dataset to build a test that would serve as a rough filter, screen in good candidates, and not show a disparate impact. The contractor delivered a test that fulfilled those criteria, at least in the technical sense that it passed statistical validation. Whether the test actually made any sense was not their department.