Manual cache management is far too much of an implementation detail for a general-purpose CPU to expose; doing so would tie code tightly to one specific set of processor behaviors. Cache sizes and even hierarchies often change between processor generations, and some internal cache behavior has changed within a generation as a result of microcode updates and/or hardware steppings. Exposing real cache control would be like MIPS exposing delay slots, but much worse: outdated delay slots really only turn into performance issues on later hardware, whereas outdated cache-control assumptions would easily turn into correctness issues.
Really, the only way to make this work is for the final compilation/"specialization" step to happen on the specific device in question. Examples are processors that use binary translation (e.g. Transmeta, Nvidia Denver) or specialization (e.g. Mill), and systems that effectively enforce runtime compilation (e.g. runtime shader/program compilation in OpenGL and OpenCL).
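As a rough software-level analogy for "specialize on the device" (not actual cache control, and not how any of the systems above work internally), here is a minimal sketch that picks a blocking factor from whatever cache size the running machine reports instead of baking one in at build time. The names `detect_l1d_bytes`, `pick_tile_dim` and `FALLBACK_L1D_BYTES`, the 32 KiB fallback, and the "three square tiles must fit" rule are all assumptions made for illustration; `SC_LEVEL1_DCACHE_SIZE` is a glibc extension and may be missing or report 0 on other platforms.

```python
import os

# Assumption: fallback used only when the OS cannot report an L1d size.
FALLBACK_L1D_BYTES = 32 * 1024


def detect_l1d_bytes():
    """Ask the OS for the L1 data cache size in bytes, with a fallback."""
    try:
        # glibc-specific name; unknown names raise ValueError, and
        # os.sysconf itself is absent on some platforms (AttributeError).
        size = os.sysconf("SC_LEVEL1_DCACHE_SIZE")
    except (ValueError, OSError, AttributeError):
        return FALLBACK_L1D_BYTES
    return size if size > 0 else FALLBACK_L1D_BYTES


def pick_tile_dim(cache_bytes, elem_bytes=8, tiles_resident=3):
    """Largest power-of-two edge such that `tiles_resident` square tiles
    of `elem_bytes`-sized elements fit in `cache_bytes` (hypothetical rule)."""
    dim = 1
    while tiles_resident * (2 * dim) ** 2 * elem_bytes <= cache_bytes:
        dim *= 2
    return dim


if __name__ == "__main__":
    l1d = detect_l1d_bytes()
    print(f"L1d: {l1d} bytes, tile edge: {pick_tile_dim(l1d)}")
```

The point is only that the tuning decision is deferred until the code is on the target machine. A wrong guess here costs performance, not correctness, which is exactly the property that explicit hardware cache control would lose.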