1901 points | l2silver

Maybe you've created your own AR program for wearables that shows the definition of a word when you highlight it IRL, or you've built a personal calendar app for your family to display on a monitor in the kitchen. Whatever it is, I'd love to hear it.
PaulHoule No.35729958
Smart RSS reader that, right now, ingests about 1000 articles a day and picks out 300 for me to skim. Since I helped write this paper

https://arxiv.org/abs/cs/0312018

I was always asking, "Why is RSS failing? Why do failing RSS readers keep using the same failing interface that keeps failing?" I thought text classification was ready for content-based recommendation back in 2004, and then I wrote

https://ontology2.com/essays/ClassifyingHackerNewsArticles/

a few years ago. After Twitter went south I felt like I had to do something, so I did. Even though my old logistic regression classifier works well, I have one based on MiniLM that outperforms it, and the same embedding makes short work of clustering, whether that's "cluster together articles about Ukraine, sports, deep learning, etc." over the last four months or "cluster together the four articles written about the same event in the last four days".
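The "same event" clustering can be done as greedy grouping on embedding cosine similarity. A minimal sketch, assuming MiniLM-style sentence embeddings have already been computed (the model call itself is omitted, and the 0.8 threshold is an invented placeholder, not the author's setting):

```python
import numpy as np

def group_same_event(embeddings, threshold=0.8):
    """Greedy single-link grouping: articles whose embeddings have cosine
    similarity above the threshold are treated as covering the same event.
    `embeddings` is an (n, d) array, e.g. MiniLM sentence embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T  # pairwise cosine similarities
    groups, assigned = [], set()
    for i in range(len(embeddings)):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, len(embeddings))
                       if j not in assigned and sims[i, j] >= threshold]
        assigned.update(group)
        groups.append(group)
    return groups
```

Greedy single-link is crude next to a proper clustering algorithm, but for "four articles about the same event in four days" the near-duplicate pairs are usually well separated from everything else.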

I am looking toward applying it to images; sorting 5000+ search results on a topic; workflow systems (would this article be interesting to my wife, my son, Hacker News?); and commercially interesting problems (is this person a good sales prospect?).

ambicapter No.35737702
What do you mean by "outperform" in this context?
PaulHoule No.35744710
Area under the ROC curve: https://en.wikipedia.org/wiki/Receiver_operating_characteris...

It's by no means perfect. For one thing, given a choice between 0.78 ROC AUC based on 40 days of data and 0.77 based on 10 days, I'd take the latter, because the latter adjusts more quickly to my changing preferences. Also, my selection algorithm breaks the input into (say) 20 clusters and shows me the top (say) 15 results in each cluster, and I like those results better than if I just took the highest-ranking results overall.
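The cluster-then-top-k selection described above can be sketched as below; the cluster IDs are assumed to come from the embedding clustering, and the function name and tuple layout are invented for illustration:

```python
from collections import defaultdict

def diversified_selection(articles, n_per_cluster=15):
    """articles: list of (cluster_id, score, title) tuples.
    Returns the top-scoring titles within each cluster, so one
    dominant topic can't crowd everything else off the page."""
    by_cluster = defaultdict(list)
    for cluster_id, score, title in articles:
        by_cluster[cluster_id].append((score, title))
    picks = []
    for items in by_cluster.values():
        items.sort(reverse=True)  # highest score first within the cluster
        picks.extend(title for _, title in items[:n_per_cluster])
    return picks
```

With 20 clusters and 15 picks each, a single hot topic contributes at most 15 of the ~300 skim candidates, which is the point: diversity is enforced per cluster rather than left to the global ranking.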