I just used o3 to design a distributed scheduler that scales to 1M+ sxchedules a day. It was perfect, and did better than two weeks of thought around the best way to build this.
While impressive, I'm not convinced that improved performance on tasks of this nature are indicative of progress toward AGI. Building a scheduler is a well studied problem space. Something like the ARC benchmark is much more indicative of progress toward true AGI, but probably still insufficient.