Wonderful!
The edge cases are the gold btw, collect the whole set and keep them in a human and machine readable format.
I'd also go through and using a color coded set of cables, insert bad cables (one at a time at first) while the system is doing an aggressive all to all workload and see how quickly you can identify faults.
It is the gray failures that will bring the system down, often multiple as a single failure will go undetected for months and then finally tip over an inflection point at a later time.
Are you workloads ephemeral and/or do they live migrate? Or will physical hosts have long uptimes? It is nice to be able to rebaseline the hardware before and after host kernel upgrades so you can detect any anomalies.
You would be surprised about how larger of a systemic performance degradation that major cloud providers have been able to see over months because "all machines are the same", high precision but low absolute accuracy. It is nice to run the same benchmarks on bare metal and then again under virtualization.
I am sure you know, but you are running a multivariate longitudinal experiment, science the shit out of it.