> I wish more effort was put into baseline performance for Python.
There has been. That's why the bytecode is incompatible between minor versions. It was a major selling(?) point for 3.11 and 3.12 in particular.
But the "Faster CPython" team at Microsoft was apparently just laid off (https://www.linkedin.com/posts/mdboom_its-been-a-tough-coupl...), and all of the optimization work has to my understanding been based around fairly traditional techniques. The C part of the codebase has decades of legacy to it, after all.
Alternative implementations like PyPy often post impressive results, and are worth checking out if you need to worry about native Python performance. Not to mention the benefits of shifting the work onto compiled code like NumPy, as you already do.