3D models (sculpting, texturing, retopo, etc.) are following a similar trend and trajectory.
Open video models are lagging behind by several years. While CogVideo and Pyramid are promising, video models are petabyte scale and so much more costly to build and train.
I'm hoping video becomes free and cheap, but it's looking like we might be waiting a while.
Major kudos to all of the teams building and training open source models!
~25x faster performance than Flux-dev, while offering comparable quality in benchmarks. And visually the examples (surely cherry-picked, but still) look great!
Especially since with GenAI the best way to get good results is to just generate a large number of them and pick the best (imo). Performance like this will make that much easier/faster/cheaper.
Code is unfortunately "(Coming soon)" for now. Can't wait to play with it!
I'd be curious to see how a vision model would do if it were finetuned to select the best image match for a given criterion.
It's possible that you could do o1-style training to build a final-stage auto-cherrypicker.
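For what it's worth, even without any finetuning you can get a crude version of this by ranking candidates with an off-the-shelf CLIP model. A minimal sketch (the model choice and plain cosine-similarity scoring are my own assumptions, nothing o1-style):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Crude auto-cherrypicker: rank a batch of generated images against a text criterion
# using off-the-shelf CLIP cosine similarity (no finetuning; model choice is an assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_best(image_paths, criterion):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[criterion], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    scores = (img @ txt.T).squeeze(-1)   # cosine similarity of each image to the criterion
    best = int(scores.argmax())
    return image_paths[best], scores.tolist()
```

Whether a properly finetuned judge beats raw CLIP on the things people actually cherry-pick for (aesthetics, hands, text rendering) is exactly the open question.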
Looking at their methodology, it seems like it's more of an accumulation of existing good ideas into the one model.
If it performs as well as they say, perhaps you can say the breakthrough is discovering just how much can be gained by combining recent advances.
It's sitting on just the edge of sounding too good to be true to me. I will certainly be pleased if it holds up to scrutiny.
Basically they compress/decompress the images more, which means they need less computation during generation. But on the flip side this should mean less variability.
Isn't this more of a design trade-off than an optimization?
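Rough back-of-the-envelope arithmetic for why the heavier compression matters so much compute-wise (numbers are illustrative assumptions, not the paper's exact configuration):

```python
# Latent token count for a 1024x1024 image under different autoencoder compressions.
def latent_tokens(image_size=1024, downsample=8, patch_size=2):
    side = image_size // downsample // patch_size
    return side * side

sd_style = latent_tokens(downsample=8, patch_size=2)    # 4096 tokens (typical 8x AE + patchify 2)
deep_ae  = latent_tokens(downsample=32, patch_size=1)   # 1024 tokens (deeper-compression 32x AE)

print(sd_style, deep_ae)
print((sd_style / deep_ae) ** 2)  # ~16x fewer attention FLOPs if attention is quadratic
```

So the win on the transformer side is large, but each latent has to carry more information, which is where the variability/detail trade-off the parent mentions comes in.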
> surely cherry-picked
As someone who works in generative vision, this is one of the most frustrating aspects (especially for those with fewer GPU resources). There's been a silent competition to show the best images rather than random results (even when results are "random" they may be a selected batch), so it's hard to judge actual quality until you can play around with the model yourself.

Also, I'm not sure what laptop that is, but they say 0.37s to generate a 1024x1024 image on a 4090, and they also mention it requires 16GB VRAM. That laptop looks like an MSI Titan, which has a 4090, and correct me if I'm wrong, but I think the 4090 is the only mobile card with 16GB? [0] (I know most desktop cards have 16GB.) The laptop demo takes 4s to generate a 1024x1024 image, but the laptop 4090 is chopped down quite a bit compared to the desktop card. [1]
I wonder if that's with or without TensorRT
[0] https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...
[1] https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4090-Laptop...
Fewer tokens would decrease the workload (fewer things to compare against, balanced against the work per comparison). For normal O(N²) attention that makes sense, but the page says:
> We introduce a new linear DiT, replacing vanilla quadratic attention and reducing complexity from O(N²) to O(N)
So not sure what's up there.
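My guess is the O(N) claim refers to kernelized/linear attention, where you accumulate KᵀV once instead of materializing the N×N QKᵀ matrix. A minimal sketch (the ReLU feature map is an assumption, not necessarily their exact formulation):

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention in O(N*d^2) instead of softmax attention's O(N^2*d).

    q, k, v: (N, d) tensors for a single head. The ReLU feature map is an
    illustrative assumption; the paper's exact choice may differ.
    """
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("nd,ne->de", k, v)        # d x d key-value summary, built once: O(N*d^2)
    z = 1.0 / (q @ k.sum(dim=0) + eps)          # per-query normalizer: O(N*d)
    return torch.einsum("nd,de->ne", q, kv) * z[:, None]   # no N x N matrix ever formed
```

The cost per token becomes independent of sequence length, which is why longer token sequences (higher resolutions) are where the savings show up.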
On the other hand, at least one suit was making headway as of 2024-08-14, about 2 months ago [0]. It seems like there must be some merit to the GP's claim if this is moving forward. But again, I'm still trying to figure out where to stand.
[0] https://arstechnica.com/tech-policy/2024/08/artists-claim-bi...
That would be useful for e.g. book illustrations, comic strips, or icon sets. Otherwise, people would think you picked those images from all over the internet rather than from one source/theme.
Looking forward to it. This space just keeps getting more interesting.
The remaining claim may not be a good claim, but it isn't completely laughable.
https://cdn.arstechnica.net/wp-content/uploads/2024/08/Ander... Order-on-Motions-to-Dismiss-8-12-2024.pdf
> In October 2023, I largely granted the motions to dismiss brought by defendants Stability, Midjourney and DeviantArt. The only claim that survived was the direct infringement claim asserted against Stability, based on Stability’s alleged “creation and use of ‘Training Images’ scraped from the internet into the LAION datasets and then used to train Stable Diffusion.”
I think you could have grounds for saying that construction of LAION violates copyright, which would be covered by this. It doesn't necessarily mean training on LAION is copyright violation.
None of this has been decided. It might be wrong.
The rest of the case was "Not even wrong"
The learning process is similar, but it isn't identical.
Humans and AI both have the intellectual capacity to violate copyright, but human artists generally know what copyright is, while image generators don't (even the LLMs that do understand copyright are easily fooled, and many users complain about them being "lobotomised" if they follow corporate policy rather than user instructions).
And while there are people like me who really did mean "public domain" or "MIT license" well before even GANs, it's also true that most people couldn't have given informed consent prior to knowing what these models could do.
The cost of the electricity needed to create an image was about the cost of hiring someone at the UN abject poverty threshold to examine it for 10 seconds, and that was with 2-year-old models and hardware:
https://benwheatley.github.io/blog/2022/10/09-19.33.04.html
(There are also trademark issues; from the discussions, I think those are what artists actually care about, even though they use the word "copyright".)
Nonetheless, as an art director, it's nothing I'd put into production. I guess that's because what I'm focused on is tickling the client base with something original.
You have to release your model in some fashion for it to be impressive.
Better metrics (assuming the goal is text->image) would be some sort of Inception Score or CLIP-based text-matching score. These metrics are computable on single samples.
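For the CLIP-based one, at least, a per-sample score is easy to get. A minimal sketch using torchmetrics (model choice is an assumption):

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

# CLIP text-image matching score for a single sample (no reference set needed, unlike FID).
# The CLIP checkpoint used here is an assumption.
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

image = torch.randint(0, 255, (3, 1024, 1024), dtype=torch.uint8)  # stand-in for a generated image
score = metric(image, "a cat wearing a spacesuit, photorealistic")
print(float(score))  # higher = better text-image agreement
```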
I did a lot of 3D-rendered illustration back in the 1990s and early 2000s, necessarily low-polygon stuff, but things that were supposed to be life-like scenes, with tons of textures, that took a very long time to render. Including what may have been the first and only children's book illustrated with Infini-D on a Mac IIsi.
So, feeding these old renderings into StableDif with 75% bias toward the original image and a couple of basic prompts, produced results that blew my mind. It was like seeing what my illustrations could have been if I'd had a team at Pixar Studios refining them. In the sense that it was still my character art and my creation, totally recognizable, but polished and refined to a level that would have been unimaginable in 1997.
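For anyone who wants to try the same trick, a diffusers-style img2img pass is roughly equivalent, with "75% bias toward the original" mapping to a low denoising strength. A rough sketch (the model and settings here are illustrative assumptions, not an exact recipe):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Img2img refinement of an old rendering: low strength keeps the original composition
# and character art, and only lets the model repaint surface detail.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("old_render.png").convert("RGB").resize((768, 768))
out = pipe(
    prompt="polished children's book illustration, soft lighting, detailed textures",
    image=init,
    strength=0.25,      # ~75% bias toward the original image (assumption)
    guidance_scale=7.5,
).images[0]
out.save("refined_render.png")
```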