
164 points ksec | 6 comments
vessenes No.44498842
Short version: A Qwen-2.5 7B model that has been turned into a diffusion model.

A few notable things: first, that you can do this at all (left-to-right model -> out-of-order diffusion via finetuning), which is really interesting. Second, the final version beats the original by a small margin on some benchmarks. Third, it's in the ballpark of Gemini Diffusion, although not competitive, which is to be expected for any 7B-parameter model.
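
Mechanically, the conversion roughly amounts to keeping the pretrained weights but swapping the next-token objective for a masked-denoising one. A toy sketch of that training step, not the paper's actual recipe; the masking schedule, mask_token_id, and the HuggingFace-style model(...).logits call are all assumptions on my part:

    import torch
    import torch.nn.functional as F

    def masked_denoising_step(model, input_ids, mask_token_id, optimizer):
        # Pick a random masking ratio, hide that fraction of tokens,
        # and train the model to recover all of them in one pass
        # (instead of predicting only the next token, left to right).
        ratio = torch.rand(1).item() * 0.8 + 0.1   # avoid masking nothing / everything
        mask = torch.rand(input_ids.shape, device=input_ids.device) < ratio
        corrupted = input_ids.masked_fill(mask, mask_token_id)

        logits = model(corrupted).logits           # (batch, seq, vocab)
        loss = F.cross_entropy(logits[mask], input_ids[mask])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()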

A diffusion model comes with a lot of benefits in terms of parallelization and therefore speed; to my mind the architecture is a better fit for coding than strict left-to-right generation.
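
That parallelism shows up at decode time: instead of one forward pass per generated token, a diffusion-style sampler can commit several tokens per pass. A rough sketch of the idea, using a confidence heuristic of my own for illustration rather than this model's actual sampler; mask_token_id and the step count are assumptions:

    import torch

    @torch.no_grad()
    def parallel_denoise_decode(model, prompt_ids, gen_len, mask_token_id, steps=8):
        # Start with every generated position masked, then repeatedly commit
        # the most confident predictions a chunk at a time, so each forward
        # pass fills in several tokens in parallel.
        masked = torch.full((1, gen_len), mask_token_id,
                            dtype=prompt_ids.dtype, device=prompt_ids.device)
        out = torch.cat([prompt_ids, masked], dim=1)
        per_step = -(-gen_len // steps)                   # ceil division

        for _ in range(steps):
            still_masked = (out == mask_token_id)
            if not still_masked.any():
                break
            logits = model(out).logits                    # (1, seq, vocab)
            conf, preds = logits.softmax(-1).max(-1)      # per-position confidence
            conf = conf.masked_fill(~still_masked, -1.0)  # only rank masked slots
            k = min(per_step, int(still_masked.sum()))
            top = conf.topk(k, dim=1).indices
            out.scatter_(1, top, preds.gather(1, top))
        return out

With the step count well below gen_len, the number of forward passes drops from one per token (autoregressive) to roughly the number of denoising steps, which is where the speed win comes from.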

Overall, interesting. At some point these local models will get good enough for ‘real work’ and they will be slotted in at API providers rapidly. Apple’s game is on-device; I think we’ll see descendants of these start shipping with Xcode in the next year as just part of the coding experience.

replies(6): >>44498876 >>44498921 >>44499170 >>44499226 >>44499376 >>44501060
1. baobun No.44499170
Without having tried it, what keeps surprising me is how apparently widely different architectures (and, in other cases, training data) lead to very similar outcomes. I'd expect results to vary a lot more.
replies(3): >>44499473 >>44499659 >>44500645
2. viraptor No.44499473
It doesn't look like it got pushed that much, unfortunately. The article says they only added 20k examples to fine-tune at the end, but maybe the ceiling is much higher for diffusion?

But yeah, RWKV also ends up in a similar performance area at similar sizes - I wish someone would finally start using it at scale...

3. IMTDb No.44499659
I would expect a lot of attempts to fail, and those tend not to be published, or gather less attention. So if we have reached a local optimum, any technique that gets close to the current benchmarks is worth publishing as soon as its results reach that point. All the ones that are too distant are discarded. In the end, all the papers you see are close to the current status quo.

It's possible that some of these new architectures / optimizations would allow us to go beyond the current benchmark scores, but probably only with more training data and money. But to get money you need to show results, which is what you see today. Scaling remains king; maybe one of these techniques is 2025's "attention" paper, but even that one needed a lot of scaling to go from the 2017 version to ChatGPT.

4. hnaccount_rng No.44500645
But if the limiting factor is the data on which the models are trained, and not the actual "computation", then this would be exactly what you'd expect, right?
replies(1): >>44500914
5. Ldorigo No.44500914
The data might be the limiting factor for current transformer architectures, but there's no reason to believe it's a general limiting factor for any language model (e.g. human brains are "trained" on orders of magnitude less data and still generally perform better than any model available today).
replies(1): >>44501486
6. hnaccount_rng No.44501486
That depends on whether these current learning models can really generalise or whether they can only interpolate within their training set