MgB2:
Idk, the models generating what are basically 1:1 copies of the training data from pretty generic descriptions feels like a severe case of overfitting to me. What use is a generative model that just regurgitates its training data?

I feel like the less advanced generations, maybe even because of their smaller size, were better at coming up with something that at least feels new.

In the end, other than for copyright-washing, why wouldn't I just use the original movie still/photo in the first place?

yk:
Tried Flux.dev with the same prompts [0], and it actually seems to be a GPT problem. It could be that GPT's text encoder understands the prompt better and just generates the implied IP, or that a diffusion model is inherently less prone to overfitting than a multimodal transformer model.

[0] https://imgur.com/a/wqrBGRF (image captions give the implied IP; I copied the prompts from the blog post.)
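
If anyone wants to reproduce the test, it's just a matter of running the blog post's prompts through Flux with the diffusers library. A rough sketch follows; the model ID is the public FLUX.1-dev checkpoint, and the sampling settings are illustrative defaults, not necessarily exactly what I used:

    # Rough sketch: run one of the blog post's prompts through FLUX.1-dev
    # with the diffusers library. Needs a GPU and acceptance of the model
    # license on Hugging Face; settings below are illustrative defaults.
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # trades speed for lower VRAM use

    prompt = "..."  # paste a prompt copied from the blog post here

    image = pipe(
        prompt,
        num_inference_steps=50,
        guidance_scale=3.5,  # the dev checkpoint's usual default
    ).images[0]
    image.save("flux_test.png")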

jsemrau:
DALL-E 3 already uses a model trained on synthetic captions that takes the prompt and augments it. That augmentation might be what leads to the overfitting. It could also be, and this might be the simpler explanation, that it just looks up the right file via RAG.
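
To make those two hypotheses concrete, here is roughly what each mechanism could look like. Everything in this sketch (model names, index files, the retrieval step) is made up for illustration; none of it is documented OpenAI internals:

    # Illustrative only: hypothetical versions of the two mechanisms above.
    import numpy as np
    from openai import OpenAI
    from sentence_transformers import SentenceTransformer

    # Hypothesis 1: prompt augmentation. An LLM expands the user's short
    # prompt into a detailed caption before the image model sees it; the
    # rewrite itself may pull in the implied IP.
    client = OpenAI()

    def augment_prompt(user_prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # stand-in for whatever rewriter is used
            messages=[
                {"role": "system",
                 "content": "Expand this prompt into a detailed image caption."},
                {"role": "user", "content": user_prompt},
            ],
        )
        return resp.choices[0].message.content

    # Hypothesis 2: retrieval. Embed the prompt in a joint text/image
    # space and look up the nearest stored training image.
    encoder = SentenceTransformer("clip-ViT-B-32")
    index_embs = np.load("train_caption_embs.npy")    # (N, 512), unit norm
    index_paths = open("train_paths.txt").read().splitlines()

    def nearest_training_image(prompt: str) -> str:
        q = encoder.encode(prompt, normalize_embeddings=True)
        return index_paths[int((index_embs @ q).argmax())]

If the second mechanism were in play you'd expect near pixel-level matches to specific training images; prompt augmentation would instead explain a style-level convergence onto the implied IP.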