- False failures (tests that fail when nothing is actually broken) will make your devs hate the tests. People want to get things done, and they will start ignoring the tests if you break their workflow unnecessarily. In CI, always retry on failure so that a flaky test does not fail the build on its own.
- E2E tests can start failing suddenly. To avoid breaking people's workflow, we run a megabenchmark every day at 1 AM in which each test runs multiple times, even if it passes, so that we can measure flakiness. If a test fails in the benchmark, we remove it from CI so that it doesn't break other developers' workflows. The next day, we fix either the test or the bug.
- The Claude Code SDK has been a blessing for E2E. Before, you couldn't run all the E2E tests in a PR's CI because of how long they take. Now, we send the branch to the Claude Code SDK to determine which E2E tests should run.
- Also, MCPs and Claude Code now write most of my E2E tests. I wrote a detailed Claude.md that lets it run autonomously (writing, validating, and repeating) while I do something else. It gets a test done in 3 to 4 shots. For the price of a cup of coffee, it saves me 30-60 minutes per test.
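
The retry-on-failure and nightly flakiness benchmark ideas above can be sketched roughly as follows. This is only an illustration, not the actual setup described here: every name (`run_with_retry`, `benchmark`, `quarantine`, `make_flaky`) is hypothetical, and each test is assumed to be a plain callable that returns `True` on pass. A real suite would call its own test runner instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "test" is any zero-argument callable returning True on pass.
Test = Callable[[], bool]


@dataclass
class FlakinessReport:
    name: str
    runs: int
    failures: int

    @property
    def failure_rate(self) -> float:
        return self.failures / self.runs if self.runs else 0.0


def run_with_retry(test: Test, max_attempts: int = 3) -> bool:
    """CI mode: pass as soon as one attempt passes, fail only if all fail."""
    for _ in range(max_attempts):
        if test():
            return True
    return False


def benchmark(tests: Dict[str, Test], runs: int = 10) -> List[FlakinessReport]:
    """Nightly mode: run every test `runs` times, even when it passes."""
    reports = []
    for name, test in tests.items():
        failures = sum(1 for _ in range(runs) if not test())
        reports.append(FlakinessReport(name, runs, failures))
    return reports


def quarantine(reports: List[FlakinessReport], threshold: float = 0.0) -> List[str]:
    """Names of tests whose failure rate exceeds the threshold.

    The default of 0.0 pulls any test that failed at least once.
    """
    return [r.name for r in reports if r.failure_rate > threshold]


def make_flaky(fail_every: int) -> Test:
    """Deterministic stand-in for a flaky test: fails on every Nth call."""
    calls = {"n": 0}

    def test() -> bool:
        calls["n"] += 1
        return calls["n"] % fail_every != 0

    return test


if __name__ == "__main__":
    suite = {"checkout": lambda: True, "login": make_flaky(3)}
    reports = benchmark(suite, runs=9)
    for r in reports:
        print(f"{r.name}: {r.failures}/{r.runs} failed ({r.failure_rate:.0%})")
    print("remove from CI:", quarantine(reports))
```

The two modes are deliberately separate: retries hide flakiness so that CI stays quiet, while the benchmark never retries so that flakiness stays measurable. The quarantine threshold is a design choice; a nonzero value would tolerate a small failure rate instead of pulling a test on its first failure.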