181 points jxmorris12 | 4 comments
yorwba No.43066481
The author admits they "kinda stopped reading this paper" after noticing that it only used one hyperparameter configuration, which I agree is a flaw in the paper, but that's not an excuse for sloppy treatment of the rest of the paper. (It would, however, be an excuse to ignore it entirely.)

In particular, the assumption that |a_k| ≈ 0 initially is incorrect, since in the original paper https://arxiv.org/abs/2502.01628 the a_k are distances from one vector to multiple other vectors, and they're unlikely to be initialized in such a way that the distance is anywhere close to zero. So while the gradient divergence near 0 could certainly be a problem, it doesn't have to be as fatal as the author seems to think it is.
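The point about initialization can be checked numerically. Below is a minimal sketch (my own illustration, not code from the paper) assuming each a_k is a Euclidean distance ||x - c_k|| between randomly initialized vectors: the distance concentrates well away from zero, and the gradient of the distance with respect to x is (x - c_k) / a_k, which has unit norm whenever a_k > 0 but is undefined at a_k = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128  # hypothetical embedding dimension

# Two randomly initialized vectors: with standard normal init, their
# squared distance has expectation 2 * dim, so a is far from zero.
x = rng.standard_normal(dim)
c = rng.standard_normal(dim)
a = np.linalg.norm(x - c)
print(f"initial distance a = {a:.2f}")  # roughly sqrt(2 * dim) = 16

# Gradient of a with respect to x: unit norm for a > 0, undefined at a = 0.
grad = (x - c) / a
print(f"gradient norm = {np.linalg.norm(grad):.2f}")
```

So the divergence at a_k = 0 is a real singularity, but a typical random init starts nowhere near it.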

replies(2): >>43111775, >>43111995
1. totalizator No.43111995
That would be "welcome to the world of academia". My post-doc friends won't even read a blog post before checking the author's résumé, and they are very dismissive whenever they notice anything they consider sloppy.
replies(2): >>43112088, >>43112320
2. lblume No.43112088
Which is a problem with the reputation-based academic system itself ("publish or perish") and not with the individuals working in it.
3. throw_pm23 No.43112320
You seem to conflate two different points. Not reading based on sloppiness is defensible; not reading based on the author's résumé, less so.
replies(1): >>43117070
4. aidenn0 No.43117070
When "sloppiness" is defined as "did anything on my personal list of pet peeves" (and it often is), then the defensibility of the two begins to converge.