edit: to clarify, it may be possible to get this loss back and there is reason to be optimistic
Still, I’m a little sad they didn’t release the optimizer weights. It would’ve given us so much valuable info about the dataset, among other benefits.
Is it better to have billions of high resolution parameters and quantize them at the end, or to train low resolution parameters where the training algorithms see the lower resolution? It’s beyond me, but I’d love to know.
So if the model is computed using float16s, distribute it as-is and let the end user choose to use it like that, or compromise for faster processing if their system can deal with many billions of int8s more effectively.
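To make that trade-off concrete, here's a rough sketch of the kind of post-training int8 conversion I mean, using simple symmetric per-tensor quantization (the function names and scheme are just my own illustration, not anything from an actual release):

    import numpy as np

    def quantize_int8(w_fp16):
        # Symmetric per-tensor quantization: represent the float16 weights
        # as int8 values plus a single float scale factor.
        w = w_fp16.astype(np.float32)
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # What the end user's runtime would do on the fly
        # (or fuse directly into int8 kernels).
        return q.astype(np.float32) * scale

    # Example: halve the storage of a float16 layer at some cost in fidelity.
    w = np.random.randn(4096, 4096).astype(np.float16)
    q, scale = quantize_int8(w)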
https://news.ycombinator.com/item?id=34478503
I have long wished for less linear stories in video games, where branching narrative (a la Choose Your Own Adventure) is one possible way to give the player agency. The problem is, true branches are expensive, because you end up writing a bunch of content the player never experiences.
I see a lot of potential, but it's going to take a different kind of craftsmanship, and likely many iterations, to realize something more than a novelty.
Obviously $1.25/hr 24/7 does add up quickly: at $1.25 × 24 hours × 30 days, the bill after one month would come to $900.
Imagine playing a level and doing some particular feats in it. They get presented to GPT with a prompt, and the resulting story gets sent to an AI voice model in-game, where the NPC asks/tells the player character about it.
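Roughly what I have in mind, assuming an OpenAI-style chat API and a placeholder text-to-speech hook (the prompt, the event format, and send_to_voice_model are all made up for illustration):

    from openai import OpenAI

    client = OpenAI()  # assumes an API key in the environment

    def npc_reaction(player_feats):
        # Turn the player's in-level feats into a short in-character line.
        prompt = (
            "You are a tavern keeper NPC. The player just finished a level where they: "
            + "; ".join(player_feats)
            + ". React in one or two sentences of in-character dialogue."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def send_to_voice_model(line):
        # Placeholder: hand the generated line to whatever TTS/voice system the game uses.
        raise NotImplementedError

    # e.g. send_to_voice_model(npc_reaction(["snuck past every guard", "never drew a weapon"]))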
Unclear why you think that, since experiments show the opposite.
In general the gradient seems to get too "bumpy" to do good gradient descent at lower levels of precision.
There are some papers showing that making the training loop aware of quantization can help the final quantized performance, but I'm not aware of this being implemented at large scale.
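The basic trick in those papers is small enough to sketch: fake-quantize the weights so the forward pass sees int8-like values while gradients flow through in full precision (a straight-through estimator). This is just a toy version, not any specific paper's method:

    import torch

    def fake_quant(x, num_bits=8):
        # Quantize-dequantize: the forward pass sees quantized values,
        # the backward pass treats the op as identity (straight-through estimator).
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.detach().abs().max().clamp_min(1e-8) / qmax
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        return x + (q - x).detach()

    class QATLinear(torch.nn.Linear):
        # Linear layer whose training loop is "aware" of quantization.
        def forward(self, x):
            return torch.nn.functional.linear(x, fake_quant(self.weight), self.bias)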
Different resolutions draw your attention to different types of features.
Forgive my lack of proper terminology here.
1. You bound the branching in a particular fashion, and provide overall "pressures" into certain story arcs.
2. You use generative AI in a LOT more places in the game.
What happens when you are playing a sci-fi game and you get the enemy NPC to somehow hallucinate that he is the King of Dragons, but you don't have dragon models/animations/movesets in your game files? You either bound the LLM so it can't hallucinate that, or you generate that dragon live. I guess a third option is that your game is a comedy and the King NPC gets labeled a crazy person.
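The "bound it or relabel it" option can be as simple as validating whatever the model claims against the assets you actually ship; a toy sketch (the asset list and field names are invented for the example):

    # Hypothetical whitelist of creature types the game has models/animations for.
    AVAILABLE_CREATURES = {"guard", "merchant", "wolf", "bandit"}

    def bound_npc_claim(claim):
        # If the LLM invents something we can't render (a dragon, say),
        # fall back to treating the NPC's claim as delusion instead of spawning it.
        if claim.get("creature") in AVAILABLE_CREATURES:
            return claim
        return {
            "creature": None,
            "dialogue_tag": "delusional",  # other NPCs react to the claim as madness
            "original_claim": claim,
        }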
However, you can reduce memory by doing mixed-precision training if you are careful. See section "2.3.1. Loss Scaling To Preserve Small Gradient Magnitudes" in https://docs.nvidia.com/deeplearning/performance/mixed-preci...
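In PyTorch this is roughly what torch.cuda.amp.GradScaler handles for you: run the forward pass in fp16, scale the loss up before backward so tiny gradients don't flush to zero, then unscale before the optimizer step. Minimal sketch (toy model and data, assumes a CUDA device):

    import torch

    model = torch.nn.Linear(512, 512).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

    # Toy stand-in for a real DataLoader.
    loader = [(torch.randn(8, 512, device="cuda"),
               torch.randn(8, 512, device="cuda")) for _ in range(10)]

    for x, y in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():      # fp16 forward pass
            loss = torch.nn.functional.mse_loss(model(x), y)
        scaler.scale(loss).backward()        # scaled loss -> scaled gradients
        scaler.step(optimizer)               # unscales grads, skips step on inf/nan
        scaler.update()                      # adjusts the scale factor over time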
I'm not quite sure I understand what they are describing in 2.3.1. Are they scaling those small gradient magnitudes larger to try to "pull" you into those holes faster?
I was thinking that a way to go about it would be to just increase the "mesh resolution" near the small hole, which in this case would be to use larger precision in the area local to the hole.
And teams with limited resources could also still handcraft the stories and quests but use LLMs to generate or add some variety or context awareness to the dialogues, at a lower cost.
It might be possible to work around this by estimating the gradient volatility through the nth-order derivatives, but you would then also have to deal with mixed-precision SIMD, which hardware doesn't really support.
https://www.youtube.com/watch?v=i-Aw32rgM-w&ab_channel=Kella...