1901 points by l2silver | 25 comments

Maybe you've created your own AR program for wearables that shows the definition of a word when you highlight it IRL, or you've built a personal calendar app for your family to display on a monitor in the kitchen. Whatever it is, I'd love to hear it.
1. PaulHoule ◴[] No.35729958[source]
Smart RSS reader that, right now, ingests about 1000 articles a day and picks out 300 for me to skim. Since I helped write this paper

https://arxiv.org/abs/cs/0312018

I was always asking "Why is RSS failing? Why do failing RSS readers keep using the same failing interface that keeps failing?" and thought that text classification was ready for content-based recommendation as far back as 2004. A few years ago I wrote

https://ontology2.com/essays/ClassifyingHackerNewsArticles/

and then, after Twitter went south, I felt like I had to do something, so I did. Even though my old logistic regression classifier works well, I have one based on MiniLM that outperforms it, and the same embedding makes short work of clustering tasks too, be it "cluster together articles about Ukraine, sports, deep learning, etc." over the last four months or "cluster together the four articles written about the same event in the last four days".

I am looking toward applying it to: images, sorting 5000+ search results on a topic, workflow systems (would this article be interesting to my wife, my son, Hacker News?), and commercially interesting problems (is this person a good sales prospect?).
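A minimal sketch of the recipe described above (MiniLM sentence embeddings feeding a relevance classifier, with the same embeddings reused for clustering), assuming sentence-transformers and scikit-learn; the model name, sample titles, and labels are placeholders, not the actual system:

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    # Placeholder MiniLM checkpoint; the real system's model and data differ.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Toy training data: 1 = "show me this", 0 = "skip it".
    train_titles = ["New results on transformer distillation",
                    "Transfer rumours: Arsenal eye new striker"]
    train_labels = [1, 0]
    clf = LogisticRegression().fit(model.encode(train_titles), train_labels)

    todays_titles = ["MiniLM embeddings explained",
                     "Premier League weekend recap",
                     "Ukraine situation update"]
    emb = model.encode(todays_titles)

    scores = clf.predict_proba(emb)[:, 1]                             # relevance ranking
    clusters = KMeans(n_clusters=2, random_state=0).fit_predict(emb)  # topical grouping

    for title, score, cluster in sorted(zip(todays_titles, scores, clusters),
                                        key=lambda t: -t[1]):
        print(f"[cluster {cluster}] {score:.2f}  {title}")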

replies(10): >>35730396 #>>35730409 #>>35737702 #>>35738576 #>>35739040 #>>35739911 #>>35744103 #>>35750477 #>>35757291 #>>35762145 #
2. greenie_beans ◴[] No.35730396[source]
this is cool, thanks for sharing
3. internetter ◴[] No.35730409[source]
Do you have public source code for this? Looks great.
replies(1): >>35730844 #
4. PaulHoule ◴[] No.35730844[source]
It's something I'm thinking about.

The system right now is highly reliable and I have no fear of doing a live demo of it, but live demos come off as strange because my feed is an odd mix of arXiv abstracts, Guardian articles about association football, etc., so it comes off as idiosyncratic and personal. (Oddly, when I started this project I loved the NFL and hated the Premier League. Once I started doing feature engineering to answer "Why does it perform so well for arXiv papers and so poorly for sports?", I started studying football articles in detail, found myself thinking "How would I feel if my team got relegated?" and "Wow, that game went 1-0 and it was an own goal", and the next thing I knew I was hanging on every goal in every game Arsenal and Man City play -- it changed me.)

It's not even that hard for me to swap algorithms in and out, but it should be easier. For instance, I mostly like the scikit-learn system for model selection, but there are some cases, like SVC-P, where I want to bypass it, and I am not so sure how to comfortably fit fine-tuned transformer models into the system.
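One conventional way to make the classifier swappable in scikit-learn (not necessarily how this system is wired up) is to make the estimator itself a pipeline parameter and let model selection pick among candidates; the estimators, grid values, and synthetic data below are purely illustrative:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # The "clf" step is a slot; the grid swaps whole estimators in and out of it.
    pipe = Pipeline([("clf", LogisticRegression())])
    param_grid = [
        {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
        {"clf": [SVC()], "clf__kernel": ["rbf", "poly"]},
    ]
    search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)

    # Stand-in features (e.g. article embeddings) and labels.
    rng = np.random.RandomState(0)
    X = rng.randn(60, 16)
    y = (X[:, 0] > 0).astype(int)
    search.fit(X, y)
    print(search.best_params_)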

Another problem is that it depends on AWS Lambda and Suprfeeder for ingestion. It costs me less than $5 a month to run and about 10 cents per feed, but (1) that's not cost-effective if I want to add a few hundred blogs like

https://www.righto.com/

and (2) I know many people hate AWS and other cloud services.

If somebody were interested in contributing some elbow grease, that would help the case for open source. Alternatively, a hosted demo of some kind would also be possible, but I'm not ready to put my time and money into that. Contact me if you're interested in finding out more.

replies(1): >>35739028 #
5. ambicapter ◴[] No.35737702[source]
What do you mean by "outperform" in this context?
replies(1): >>35744710 #
6. 6510 ◴[] No.35738576[source]
I too have a (private) RSS "laboratory" project!

It isn't the elegant machinery you describe here, as I'm quite unfamiliar with those techniques.

If I'm actively using it, the feed list grows to about 35,000-40,000 feeds, at which point I find as many new feeds as I lose old ones.

I maintain a dozen categories of badwords; if any of them appear in a headline, it gets removed.
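A toy version of that headline filter (the category names and words here are invented for illustration):

    # Drop any item whose headline contains a badword from any category.
    BADWORDS = {
        "outrage": {"slams", "destroys", "blasts"},
        "celebrity": {"kardashian", "royals"},
    }

    def keep(headline: str) -> bool:
        words = set(headline.lower().split())
        return not any(words & bad for bad in BADWORDS.values())

    headlines = ["Pundit slams referee decision",
                 "A new algorithm for feed deduplication"]
    print([h for h in headlines if keep(h)])   # only the second survives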

With that many subscriptions things look quite different: higher-frequency publishers start dominating the top of the newest-first list. The faster they publish, the higher the standards I hold them to.

What is quite amazing is that some really terrible news websites use long, highly descriptive titles. I keep a good few of those around because the badword filter purges so much of their output that I hardly ever see them. For every 2000 bad articles Business Insider has a great one. It's a terrible website, but their use of descriptive words in article titles is the best in the world.

The key insight imho is that the internet is much more of an echo chamber than people think.

As soon as you get rid of Musk and a few hundred other people, a few hundred companies, a dozen countries and a few thousand other topics you are left with a world of infinite other subjects. People are writing about stuff no one else ever thought of.

If everyone in the world is reading and writing about FOO it is absolutely amazing to get rid of FOO. There is no such thing as an important football match. (joking sorry)

Everyone is praising normality, but you should really wonder who creates these norms. Whether they are good or bad people is beside the point. Musk says something interesting once per day, I'm sure. For every 100,000 topics inserted into the collective we choose 1, and then, by the tens of millions, we talk about it. Every day is Musk day.

It doesn't matter how hard you resist participating; eventually you will learn that SpaceX launched a rocket. There is no avoiding it.

Autonomy is something fucking amazing. I imagine millions of articles are published per day, 99% of them saying things that have been said before. What part should I want to read? The 1% with the most traffic?

"You should get on the train to nowhere just like everyone else," they say. "Stop wandering around on your own; get on the train just like me!"

I'm not usually telling anyone not to get on the train. If people want to discuss "rss is dead" for the ten thousandth time, let them. They think they chose the topic themselves.

There are 13 billion years of history, 6000 sq km of earth, 7.9 billion people alive, 100 billion dead, 8.7 million species of plants and animals, 350 thousand chemical compounds, 130 million books since the printing press, and 100 billion stars in the Milky Way alone. What to spend my time on? The Trump investigations? Really?

I'm sorry for not being very technical.

replies(3): >>35740673 #>>35748650 #>>35758836 #
7. rolisz ◴[] No.35739028{3}[source]
> If somebody were interested in contributing some elbow grease that would help the case for open source,

Sent you an email! I've been wanting such an ML powered RSS reader for quite some time. I'd love to help make it open source if possible.

8. nergal ◴[] No.35739040[source]
Nice approach! I added a very basic keyword filter to my RSS reader (https://github.com/lallassu/gorss) to do some sort of "cleaning". But having a section in the reader that filters the articles more intelligently, and maybe bundles them into clusters, would be very nice.
9. embit ◴[] No.35739911[source]
I do something similar for my personal news reader [1]. Originally I did it so I could read my tech news quickly. Now a few of my friends have also started using it.

[1] https://embit.ca

replies(1): >>35742001 #
10. PcChip ◴[] No.35740673[source]
Interested in your filters, or a link to your results!
11. md_ ◴[] No.35742001[source]
Amusing. I, too, wrote my own ML-powered newsreader. (Not linking here because I don't want to de-anonymize my HN handle.)

I guess this is a thing people do. ;)

replies(1): >>35748635 #
12. kasrak ◴[] No.35744103[source]
Cool!

I've been hacking on a related side project -- in my case, I wanted something like this but for Twitter. Right now it's using gpt-3.5-turbo to cluster related tweets and rank them based on my interests.

Source is here: https://github.com/kasrak/feedpaper

replies(1): >>35746815 #
13. PaulHoule ◴[] No.35744710[source]
Area under https://en.wikipedia.org/wiki/Receiver_operating_characteris... curve.

It's by no means perfect. For one thing, if I have a choice between 0.78 ROC AUC based on 40 days' worth of data and 0.77 based on 10 days' worth, I'd take the latter, because I know the latter one adjusts more quickly to my changing preferences. Also, my selection algorithm breaks the input into (say) 20 clusters and shows me the (say) 15 top results in each cluster, and I like the results from that better than if I just take the highest-ranking results overall.
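A sketch of that evaluation and selection step (AUC for comparing rankers, then a few top results per cluster instead of a single global top), with random stand-in embeddings, scores, and labels; the 20 and 15 mirror the example above but are otherwise arbitrary:

    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    emb = rng.randn(400, 32)                            # stand-in article embeddings
    liked = (emb[:, 0] + rng.randn(400) > 0).astype(int)  # stand-in "I read it" labels
    scores = emb[:, 0] + 0.1 * rng.randn(400)           # stand-in model scores

    print("ROC AUC:", roc_auc_score(liked, scores))

    # Diversified selection: top 15 by score within each of 20 clusters,
    # rather than the global top of the ranking.
    clusters = KMeans(n_clusters=20, random_state=0).fit_predict(emb)
    picked = []
    for c in range(20):
        idx = np.where(clusters == c)[0]
        picked.extend(idx[np.argsort(scores[idx])[::-1][:15]].tolist())
    print(len(picked), "articles selected for skimming")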

14. imirzadeh ◴[] No.35746815[source]
I also wrote my own feed aggregator (https://mofeed.news) from scratch in Go. It can connect to Twitter, Reddit, Spotify/iTunes (podcasts), RSS, Medium, YouTube, etc. It has a good search engine (Meilisearch) and also supports newsletters (each user has an inbox).

I'm currently testing it and have a few test users for feedback. I hope I can open source it sometime this year, after integrating the feedback and polishing the code.

It's not intelligent for now, but that's by choice. Instead, I prefer to have custom rules for filtering (e.g., adding tweets from user X to the feed only if the likes are above a threshold). I may use GPT for summarization later, but honestly most posts are either short (e.g., tweets) or come from websites that have a description/summary in their meta tags.
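A rule like the one described (a per-account like threshold) could look roughly like this; the field names and rule format are guesses for illustration, not mofeed's actual schema:

    # Keep an item unless a rule exists for its author and the item fails it.
    RULES = {"some_user": {"min_likes": 500}}   # invented example rule

    def passes(item: dict) -> bool:
        rule = RULES.get(item["author"])
        if rule is None:
            return True
        return item["likes"] >= rule["min_likes"]

    print(passes({"author": "some_user", "likes": 1200}))  # True
    print(passes({"author": "some_user", "likes": 3}))     # False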

Please shoot me an email to `hey@mofeed.news` if you want to test it :)

15. hammyhavoc ◴[] No.35748635{3}[source]
Does it use naive Bayes powered recommendations?
replies(1): >>35862196 #
16. hammyhavoc ◴[] No.35748650[source]
On the one hand, "echo chambers bad"; on the other hand, I'd be lying if I didn't acknowledge that the content which floods and outnumbers most other content is a form of signal-to-noise manipulation, and thus creates an artificial echo chamber by sheer volume.

Would love to know more. I'm a sucker for anything RSS.

replies(1): >>35750366 #
17. 6510 ◴[] No.35750366{3}[source]
OK, I'll tell you some of the funny parts... I've kinda enjoyed keeping the "secret" long enough.

My aggregator is actually a giant Greasemonkey script. Greasemonkey can just do XML requests cross-domain. HTML is great for creating links that open in the browser, and I don't want to be alt-tabbing from one window to the other. This also makes me uninterested in the <description> element, which consumes a huge amount of space. I'm really discarding everything except the pubDate, the headline, and a link to the content chopped into the (sub)domain and the rest of the path. It's a giant date-sorted array of arrays. New entries are pushed into a temp array from which they are popped and spliced into the right spot in the big one. I usually keep 5000 entries (2000 is enough tho), so entry 5001 is removed.

It stays small data that way! I can periodically post it to some simple PHP script that turns it into a static HTML document. If the oldest result is newer than the newest one from the previous dump, it dumps again.
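The actual script is JavaScript running in Greasemonkey; here is the same keep-it-date-sorted-and-capped idea sketched in Python for illustration (bisect standing in for the splice, and the temp-array step skipped):

    import bisect

    MAX_ENTRIES = 5000
    entries = []   # (pubDate, headline, domain, path), kept sorted oldest-first

    def add_entry(pub_date, headline, domain, path):
        bisect.insort(entries, (pub_date, headline, domain, path))  # splice into place
        if len(entries) > MAX_ENTRIES:
            entries.pop(0)                                          # drop the oldest

    add_entry("2023-04-28T09:00:00Z", "Example headline",
              "example.com", "/2023/04/28/post")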

Every 20 seconds the table on the page is replaced with the updated results. I had it updating in real time, but then it moves too much to read and click on things. No need to preserve any DOM; DOM manipulation is slow, so I just feed a whole new table into the page as a string.

XHR requests must be async, which is a terrible technology. You don't know how many responses you are going to get per second. It has some relationship to the number of requests you make, but if you make a lot of requests there is no way to avoid receiving a lot simultaneously, which freezes the browser if you try to do anything with the response text. There is no time to do more than push them into an array.

Then I use a setInterval to parse them; how many per tick depends on how far behind the parser is. setInterval gets delayed if the browser is busy, so parsing can fall far behind. If it is too far behind, the number of requests per second is dialed down.
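A rough Python stand-in for that shape: responses pushed onto a queue as they arrive, a periodic tick that parses a batch sized by the backlog, and throttling when the parser falls behind. The thresholds are invented, and a real parse_feed would look something like the parser sketch further down.

    from collections import deque

    response_queue = deque()
    requests_per_tick = 20        # would control the (omitted) fetch loop

    def parse_feed(body):
        ...                        # stub; see the parser sketch below

    def on_response(body):
        response_queue.append(body)             # nothing expensive in the handler

    def parser_tick():                          # called on an interval
        global requests_per_tick
        backlog = len(response_queue)
        if backlog > 200:                       # parser far behind: dial requests down
            requests_per_tick = max(1, requests_per_tick // 2)
        batch = min(backlog, 10 + backlog // 10)  # parse more per tick when behind
        for _ in range(batch):
            parse_feed(response_queue.popleft())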

Probably the funniest part is the parser.

It starts with a regex exec in a do{}while() loop. It looks for things that look like pubDates and compares up to 3 timestamps with the oldest pubDate in my result set. Most feeds never make it beyond that point.

Then it tries to parse the feed by the rules. It looks for the string <item>; if it finds that, it looks for <title>, etc.

Lots of feeds are hilariously broken and invalid, so I have an arsenal of alternative approaches in the correct order of likelihood.
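A Python approximation of that fast path (the real thing is JavaScript regexes in a do/while loop): look for pubDate-ish strings first, bail out if nothing is newer than the oldest entry we keep, and only then parse <item>/<title>. The regexes and fallback handling are heavily simplified guesses, and this version takes the oldest kept date as an argument:

    import re
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    PUBDATE_RE = re.compile(r"<pubDate>(.*?)</pubDate>", re.I | re.S)
    ITEM_RE = re.compile(r"<item>(.*?)</item>", re.I | re.S)
    TITLE_RE = re.compile(r"<title>(.*?)</title>", re.I | re.S)

    def parse_feed(xml, oldest_kept):
        # Fast path: compare up to 3 timestamps with the oldest entry we keep.
        stamps = PUBDATE_RE.findall(xml)[:3]
        try:
            if stamps and all(parsedate_to_datetime(s.strip()) <= oldest_kept
                              for s in stamps):
                return []                        # nothing new; most feeds stop here
        except (TypeError, ValueError):
            pass                                 # unparseable dates: try anyway
        # Slow path: parse by the rules, <item> first, then <title> inside it.
        return [m.group(1).strip()
                for item in ITEM_RE.findall(xml)
                if (m := TITLE_RE.search(item))]

    print(parse_feed("<rss><item><pubDate>Tue, 10 Jan 2023 09:00:00 GMT</pubDate>"
                     "<title>Example headline</title></item></rss>",
                     oldest_kept=datetime(2023, 1, 1, tzinfo=timezone.utc)))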

The funniest moment was when someone deleted their feed and redirected it to the website front page... and I never noticed. The script tried to parse the "feed" in all the ways it could and, as a last resort, looked for <a href=""> elements. Because it also couldn't find a pubDate of any kind, nothing that looked like a date, it took the /2023/01/03 part of the URL, made a timestamp from that, inserted it into the result set, and happily continued.

Feeds that are slow get suspended: level 1 is 24 hours, level 5 is a week, and 5 weeks is the limit. I might try them again after a few more months.

In monitor mode each function has its own tiny log window. The log() function is normally empty; modern JS skips it efficiently regardless of what kind of complex params are passed.

At times I have benchmarking things in there so that it chooses the fastest approach automatically.

It can consume OPML files (any number, pretty much), but I prefer using flat lists of feed URLs.

Apart from the (optional) dumps the backend is just static files.

Because it runs in the browser it can also use credentials. It mixes in my gmail feed too. https://mail.google.com/mail/feed/atom

Not really something I want to share with the world.

I will share this http://salamisushi.go-here.nl

This is a Greasemonkey script: it detects feeds linked in the web pages you visit, displays an edit box with the feeds it found, and lets you export the list as an OPML file.
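The same detect-and-export idea sketched in Python for illustration (the real script works on the live DOM in the browser): find <link rel="alternate"> feed tags in a page and dump them as OPML outline entries. The OPML here is minimal and the attribute handling is simplified:

    from html.parser import HTMLParser
    from xml.sax.saxutils import quoteattr

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    class FeedFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.feeds = []
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "link" and a.get("type", "").lower() in FEED_TYPES and "href" in a:
                self.feeds.append(a["href"])

    def to_opml(urls):
        body = "\n".join(f'  <outline type="rss" xmlUrl={quoteattr(u)}/>' for u in urls)
        return f'<opml version="2.0"><body>\n{body}\n</body></opml>'

    finder = FeedFinder()
    finder.feed('<link rel="alternate" type="application/rss+xml" href="https://example.com/rss">')
    print(to_opml(finder.feeds))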

The web pages you visit are actually amazingly interesting to you personally, much more so than one would expect. Unsubbing the garbage goes much faster than finding interesting feeds manually one by one. If there is a lot of garbage, it is because you've visited too many garbage websites. It's funny to reflect on.

Thanks for your time

replies(2): >>35750498 #>>35752234 #
18. awestroke ◴[] No.35750477[source]
Skimming 300 articles per day sounds exhausting
replies(1): >>35752228 #
19. walterbell ◴[] No.35750498{4}[source]
This commentary is even better than code :) Thanks!
20. arendtio ◴[] No.35752228[source]
I agree. I built a 'new tab' website for myself and included a widget with the top 3 from the HN best list. So I am very happy with about 1-2 articles per day ;-)

Btw. this is the way I found this thread :D

21. ydant ◴[] No.35752234{4}[source]
This all reminds me of my friend, who doesn't really have a programming background, but uses Tasker (an Android app programmed through a UI) to automate tons of stuff and is incredibly happy with it. Meanwhile, I've had tons of ideas of things I'd like to automate, but never did, because Tasker is clunky and not "reliable" or "the right way", and a "proper" approach would be "better" or Tasker "might stop working / be unreliable". In the end, he has accomplished a lot more than I have, even though I supposedly have a much better foundation for accomplishing those sorts of things.

That is to say, I admire your approach here. I appreciate that you've just gone with it and did what many people on here would look at and say "that's insane".

There's a lot of power in just doing something, in the environment that's most readily available to you, and not getting hung up on the best way to plan it out and do it and make sure it's reliable and durable. You're exactly where you want to be, and the other person is sitting back at the start still planning.

Thank you for sharing.

22. stormfather ◴[] No.35757291[source]
How does the bit that determines if 2 articles are really about the same story work? How did you structure training data for that and get a training set?
23. unsane ◴[] No.35758836[source]
Your post resonated with me a lot.

I'm jealous! Experiencing the world with all the garbage filtered out.

24. barakplasma ◴[] No.35762145[source]
https://github.com/piqoni/matcha can summarize feed articles with GPT-3.5, but it can't categorize them yet.
25. md_ ◴[] No.35862196{4}[source]
No. A DNN with BERT/NNLM embeddings (I'm between models right now), and a sort of context-based classifier so that it can react to new things you "like" without retraining the model.
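One common way to get that "reacts to new likes without retraining" behaviour (not necessarily what md_ built) is to keep the embedding model fixed and score candidates by similarity to recently liked items; the vectors and dimensionality here are random stand-ins:

    import numpy as np

    rng = np.random.RandomState(0)
    liked = [rng.randn(384) for _ in range(5)]   # embeddings of items already liked

    def like(vec):
        liked.append(vec)                        # updating taste = appending a vector

    def score(candidate):
        centroid = np.mean(liked, axis=0)        # no model retraining involved
        return float(candidate @ centroid /
                     (np.linalg.norm(candidate) * np.linalg.norm(centroid)))

    print(round(score(rng.randn(384)), 3))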