I haven't checked this one out yet, but a common trick is to combine ordinary instructions with invariants on your data so that a single wide register behaves like several independent "lanes".
The easiest example is xor: since no bit of the result depends on any neighboring bit, the same instruction can trivially be interpreted as either xoring one large integer or xoring a vector of smaller integers.
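To make the xor case concrete, here's a rough sketch in Python. The helper names (`pack_lanes`, `unpack_lanes`) and the 8-bit lane width are just my own choices for illustration, not from any library.

```python
def pack_lanes(lanes, width=8):
    """Pack small unsigned integers into one big integer, lowest lane first."""
    word = 0
    for i, lane in enumerate(lanes):
        if not 0 <= lane < (1 << width):
            raise ValueError("lane value does not fit in the lane width")
        word |= lane << (i * width)
    return word


def unpack_lanes(word, count, width=8):
    """Split one big integer back into `count` lanes of `width` bits each."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]


a = [0x12, 0xFF, 0x00, 0xA5]
b = [0x34, 0x0F, 0xFF, 0x5A]

# One xor on the packed 32-bit value...
wide = pack_lanes(a) ^ pack_lanes(b)
# ...gives the same answer as four separate 8-bit xors.
per_lane = [x ^ y for x, y in zip(a, b)]

assert unpack_lanes(wide, 4) == per_lane
```

Xor needs no extra invariant because nothing carries between bits. Something like addition only stays lane-safe if you guarantee the sums never overflow a lane, which is where the "data invariants" part comes in.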
Take a look at the SWAR example here [0] as a pretty common/easy example of that technique being good for something in the real world.
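For a feel of what that looks like, here's a sketch of the classic 64-bit SWAR popcount as I remember it. It may not match the code in [0] line for line, and the function name `popcount64_swar` is my own. Python ints are unbounded, so I mask to 64 bits by hand where C would just wrap.

```python
MASK64 = (1 << 64) - 1


def popcount64_swar(x):
    """Count set bits in a 64-bit value using only ordinary integer ops."""
    x &= MASK64
    # Each 2-bit lane now holds the number of set bits it originally had.
    x = x - ((x >> 1) & 0x5555555555555555)
    # Add neighboring 2-bit lanes into 4-bit lanes.
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    # Add neighboring 4-bit lanes into 8-bit lanes (each at most 8, so no overflow).
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    # The multiply sums all eight byte lanes into the top byte.
    return ((x * 0x0101010101010101) & MASK64) >> 56


for value in (0, 1, 0b1011, 0xFFFF, MASK64):
    assert popcount64_swar(value) == bin(value).count("1")
```

The invariant doing the work is that each lane's count always fits in the lane, so the additions never carry into a neighbor.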
Dedicated hardware is almost always better, but you can still get major improvements with a little elbow grease.
[0] https://nimrod.blog/posts/algorithms-behind-popcount/