> Are astronauts riding horses now represented in the training data more than would have been possible 5 years ago?
Yes.
Though I'm a bit confused about why this became the go-to example. If I remember correctly, the claim was that it is "out of distribution", but I have high confidence that astronauts riding horses were already in the training data before DALL-E. The big reason everyone should believe this is that astronauts have always been compared to cowboys. And... what do we stereotypically associate with cowboys?
The second reason is that it is the main poster for the 2006 movie The Astronaut Farmer:
https://en.wikipedia.org/wiki/The_Astronaut_Farmer
But here are some others I found that are timestamped. It's kinda hard to find random digital art that is timestamped. Looks like even Shutterstock doesn't... And places like DeviantArt don't have great search. Hell... even Google will just flat out ignore advanced search terms (the fuck is even the point of having them?). Results for the term are so cluttered now that searching is difficult, but I found two relatively quickly.
2014: https://www.behance.net/gallery/18695387/Space-Cowboy#
2016: https://drawception.com/game/DZgKzhbrhq/badass-space-cowboy-...
But even if these samples did not exist, I do not think this image is significantly out of distribution, if at all. Is anyone in doubt that there are images of astronauts riding rockets? I think "astronaut riding horse" certainly lies along the interpolation between "person riding horse" and "astronaut riding <insert any term>". Mind you, generating samples that are in distribution but not in the training (or test) set is still a great feat and an impressive accomplishment. This should in no way be underplayed! But that is different from claiming the sample is out of distribution.
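To make the distinction concrete, here's a toy sketch. It is my own illustration, not how any real model or paper measures this. It assumes captions can be boiled down to hypothetical (rider, mount) pairs and uses "each factor was seen somewhere in training" as a very crude stand-in for "in distribution". The names `training_pairs`, `in_training`, and `factors_covered` are made up for the example.

```python
# Toy illustration only: captions reduced to hypothetical (rider, mount) pairs.
# Real image distributions are far richer; this only separates the two notions.
training_pairs = {
    ("person", "horse"),
    ("cowboy", "horse"),
    ("astronaut", "rocket"),
    ("astronaut", "moon buggy"),
}


def in_training(pair, data):
    """True only if this exact combination appears in the training set."""
    return pair in data


def factors_covered(pair, data):
    """Crude proxy for 'in distribution': each factor was seen in some pair."""
    riders = {rider for rider, _ in data}
    mounts = {mount for _, mount in data}
    rider, mount = pair
    return rider in riders and mount in mounts


query = ("astronaut", "horse")
print(in_training(query, training_pairs))      # the exact pair was never seen
print(factors_covered(query, training_pairs))  # but both factors were

far_query = ("octopus", "submarine")
print(factors_covered(far_query, training_pairs))  # neither factor was seen
```

In this toy setup the astronaut/horse pair is a new combination of familiar parts, while the octopus/submarine pair has no support at all. Only the second looks like a candidate for "out of distribution" under this proxy.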
> I'd like to see what this approach can do if trained exclusively on non-synthetic permissively licensed inputs
One minor point. The term "synthetically generated" is a bit ambiguous. It may include digital art. It does not necessarily mean generated by a machine learning generative model. TBH, I find the ambiguity frustrating, as there are some important distinctions.