test A - before, test B - after
What are you talking about?
But sometimes it’s the only possible approach.
But in the absence of the ability to run them simultaneously, "A is before and B is after" can be a fine proxy. Of course, if B is worse, it'd be nice if you could subject, say, only 5% of your population to it rather than just slamming the slider to 100% and hitting everyone with it.
The problem is that for A/B testing to really work you need independent group outcomes. As soon as there is any bias in group selection, or any cross-group effect, it's very hard to unpick.
A more valid design would be randomly assigning some cities to institute congestion pricing, and other cities to not have it. Obviously not feasible in practice, but that's at least the kind of thing to strive toward when designing these kinds of studies.
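A toy sketch of that randomized design, using only placeholder city names and an invented effect size (nothing here is real data), just to show why random assignment lets a simple difference in means estimate the effect even when cities vary wildly:

```python
import random
import statistics

random.seed(42)

# Hypothetical cities; names are placeholders, not real municipalities.
cities = [f"city_{i}" for i in range(20)]

# Randomly assign half the cities to congestion pricing (treatment).
random.shuffle(cities)
treated, control = cities[:10], cities[10:]

def congestion(is_treated):
    """Simulated congestion metric: large city-to-city noise
    (heterogeneity), plus a made-up -15% treatment effect."""
    base = 100 + random.gauss(0, 10)
    return base * (0.85 if is_treated else 1.0)

treated_out = [congestion(True) for _ in treated]
control_out = [congestion(False) for _ in control]

# Because assignment was random, the difference in means is an
# unbiased estimate of the causal effect despite the heterogeneity.
effect = statistics.mean(treated_out) - statistics.mean(control_out)
print(f"estimated effect on congestion metric: {effect:.1f}")
```

With enough cities the noise averages out; with only a handful it doesn't, which is part of why this design is so hard to run in practice.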
I wish more laws would pre-state their intended outcome and what success would look like.
In any case, not every policy change needs to be an academic exercise.
Policies have different effects depending on how likely people judge them to be long-term changes. Construction along a route will cause people to temporarily use alternative forms of transportation, but not e.g. sell their car or buy a long-term bus pass.
Yes, the inability to know counterfactuals will make judging policies more subjective than we might like. The closest we get to A/B testing is when different jurisdictions adopt substantially similar policies at different times. For example, this was done to judge improvements from phasing out leaded gasoline, since the phase-out happened at different times and rates in different areas.
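That staggered-adoption comparison is essentially a difference-in-differences design. A minimal sketch with invented numbers (none of these figures come from the actual leaded-gasoline studies):

```python
# Hypothetical outcome levels (e.g., some blood-lead index) in two regions,
# measured before and after region A phases out leaded gasoline while
# region B has not yet done so. All numbers are invented for illustration.
a_before, a_after = 10.0, 6.0   # early-adopting region
b_before, b_after = 11.0, 10.0  # not-yet-adopting region

# Difference-in-differences: subtract the control region's trend from the
# treated region's change, netting out background shifts common to both.
did = (a_after - a_before) - (b_after - b_before)
print(f"estimated policy effect: {did:.1f}")  # prints: estimated policy effect: -3.0
```

The design only works if the two regions would have trended in parallel absent the policy, which is itself an untestable assumption.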
At such a big scale, an A/B test is a tool to deceive, not a way to reach the right conclusion.
I've got a textbook on field experiments that refers to these kinds of questions as FUQs, an acronym for "Fundamentally Unanswerable Questions". You can collect suggestive evidence, but firmly establishing cause and effect is something you've just got to let go of.
(Purely hypothetically: one could identify 10% of the island as operating under the new rules and compare outcomes. This is politically fraught on multiple levels and also gives messy spatial results.)
But we have before and after data to compare; that's what this article is about. And the congestion pricing plan included requirements to publish data specifically for the purpose of comparing last year with this year.
Cities are stupidly heterogeneous. These data wouldn't be any more meaningful than comparing cities with congestion pricing to those without. (Or comparing them to their own congested eras.)
"Our treatment units are stupidly heterogeneous" is exactly the problem random assignment solves. A century's worth of increasingly sophisticated statistical techniques for making do without random assignment has thus far produced nothing more than provisional mitigations that are notoriously easy to misuse in practice.