
1901 points l2silver | 4 comments

Maybe you've created your own AR program for wearables that shows the definition of a word when you highlight it IRL, or you've built a personal calendar app for your family to display on a monitor in the kitchen. Whatever it is, I'd love to hear it.
PaulHoule ◴[] No.35729958[source]
Smart RSS reader that, right now, ingests about 1000 articles a day and picks out 300 for me to skim. Since I helped write this paper

https://arxiv.org/abs/cs/0312018

I was always asking, "Why is RSS failing? Why do failing RSS readers keep using the same failing interface that keeps failing?" I thought text classification was ready for content-based recommendation as far back as 2004, so I wrote

https://ontology2.com/essays/ClassifyingHackerNewsArticles/

a few years ago. After Twitter went south I felt like I had to do something, so I did. Even though my old logistic regression classifier works well, I have one based on MiniLM that outperforms it, and the same embedding makes short work of everything from "cluster together articles about Ukraine, sports, deep learning, etc." over the last four months to "cluster together the four articles written about the same event in the last four days".
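
(A rough sketch of that idea, not PaulHoule's actual pipeline: once every article has an embedding vector, a greedy cosine-similarity pass is enough to group them. The names and threshold below are made up.)

    function cosine(a, b) {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // articles: [{ title, embedding }]. A looser threshold groups broad topics
    // ("Ukraine, sports, deep learning"); a tighter one groups same-event articles.
    function cluster(articles, threshold) {
      const clusters = [];
      for (const article of articles) {
        const home = clusters.find(c => cosine(c.centroid, article.embedding) >= threshold);
        if (home) home.members.push(article);
        else clusters.push({ centroid: article.embedding, members: [article] });
      }
      return clusters;
    }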

I am looking at applying it to: images, sorting 5000+ search results on a topic, workflow systems (would this article be interesting to my wife, my son, Hacker News?), and commercially interesting problems (is this person a good sales prospect?).

replies(10): >>35730396 #>>35730409 #>>35737702 #>>35738576 #>>35739040 #>>35739911 #>>35744103 #>>35750477 #>>35757291 #>>35762145 #
6510 ◴[] No.35738576[source]
I too have a (private) RSS "laboratory" project!

It isn't the elegant machinery you describe here, as I'm quite unfamiliar with those techniques.

If I'm actively using it, the feed list grows to about 35,000-40,000 feeds, at which point I find as many new feeds as I lose old ones.

I maintain a dozen categories of badwords; if any of them appears in the headline, the item is removed.
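
(A sketch of that filter, not the actual script; the category names and words are made up.)

    const badwords = {
      celebrities: ['musk', 'kardashian'],
      sports:      ['football', 'transfer window'],
      // ...ten more categories
    };

    function keepHeadline(title) {
      const t = title.toLowerCase();
      for (const words of Object.values(badwords)) {
        if (words.some(w => t.includes(w))) return false;  // one hit and the item is purged
      }
      return true;
    }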

With that many subscriptions things look quite different: higher-frequency publishers start dominating the top of the newest list. The faster they publish, the higher the standards I hold them to.

What is quite amazing is that some really terrible news websites use long, highly descriptive titles. I keep a good few of those around because the badword filter purges so much of their output that I hardly ever see them. For every 2,000 bad articles, Business Insider has a great one. It's a terrible website, but their use of descriptive words in article titles is the best in the world.

The key insight imho is that the internet is much more of an echo chamber than people think.

As soon as you get rid of Musk and a few hundred other people, a few hundred companies, a dozen countries, and a few thousand other topics, you are left with a world of infinite other subjects. People are writing about stuff no one else has ever thought of.

If everyone in the world is reading and writing about FOO, it is absolutely amazing to get rid of FOO. There is no such thing as an important football match. (Joking, sorry.)

Everyone is praising normality, but you should really wonder who creates these norms. Whether they are good or bad people is beside the point. Musk says one interesting thing per day, I'm sure. For every 100,000 topics inserted into the collective we choose one, and then, by the tens of millions, we talk about it. Every day is Musk day.

It doesn't matter how hard you resist participating; eventually you will learn that SpaceX launched a rocket. There is no avoiding it.

Autonomy is something fucking amazing. I imagine millions of articles are published per day, 99% of them things said before. What part should I want to read? The 1% with the most traffic?

"You should get on the train to nowhere just like everyone else," they say. "Stop wandering around on your own; get on the train just like me!"

I don't usually tell anyone not to get on the train. If people want to discuss "RSS is dead" for the ten-thousandth time, let them. They think they chose the topic themselves.

There are 13 billion years of history, 510 million sq km of earth, 7.9 billion people alive, 100 billion dead, 8.7 million species of plants and animals, 350 thousand chemical compounds, 130 million books since the printing press, and 100 billion stars in the Milky Way alone. What to spend my time on? The Trump investigations? Really?

I'm sorry for not being very technical.

replies(3): >>35740673 #>>35748650 #>>35758836 #
1. hammyhavoc ◴[] No.35748650[source]
On the one hand, "echo chambers bad"; on the other hand, I'd be lying if I didn't acknowledge that the content which floods and outnumbers everything else is a form of signal-to-noise manipulation, creating an artificial echo chamber by sheer volume.

Would love to know more. I'm a sucker for anything RSS.

replies(1): >>35750366 #
2. 6510 ◴[] No.35750366[source]
OK, I'll tell you some of the funny parts... I've kinda enjoyed keeping the "secret" long enough.

My aggregator is actually a giant Greasemonkey script. Greasemonkey can just make XML requests cross-domain. HTML is great for creating links that open in the browser, and I don't want to be alt-tabbing from one window to the other. This also makes me uninterested in the <description> element, which consumes a huge amount of space. I'm really discarding everything except [the pubDate, the headline, and a link to the content chopped into the [sub]domain and the rest of the path]. It's a giant date-sorted array of arrays. New entries are pushed into a temp array, from which they are popped and spliced into the right spot in the big one. I usually keep 5,000 entries (2,000 is enough, though), so entry 5,001 is removed.
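
(A rough sketch of that data structure, not the actual script; field order and names are guesses.)

    const MAX_ENTRIES = 5000;
    const entries  = [];  // the big date-sorted array, newest first: [pubDate, title, domain, path]
    const incoming = [];  // temp array the fetch code pushes new items into

    function drainIncoming() {
      while (incoming.length) {
        const entry = incoming.pop();
        let i = 0;
        while (i < entries.length && entries[i][0] > entry[0]) i++;  // find its spot by date
        entries.splice(i, 0, entry);                                 // splice it in
      }
      if (entries.length > MAX_ENTRIES) entries.length = MAX_ENTRIES;  // entry 5,001 falls off
    }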

It stays small data that way! I can periodically post it to some simple PHP script that turns it into a static HTML document. If the oldest result is newer than the newest one from the previous dump, it dumps again.

Every 20 seconds the table on the page is replaced with the updated results. I had it updating in real time, but then it moves too much to read and click on things. No need to preserve any DOM; DOM manipulation is slow, so I just feed a whole new table into the page as a string.
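
(Continuing the sketch above; the element id is made up.)

    function render() {
      const rows = entries.map(([ts, title, domain, path]) =>
        '<tr><td>' + new Date(ts).toUTCString() + '</td>' +
        '<td><a href="https://' + domain + path + '" target="_blank">' + title + '</a></td></tr>');
      document.getElementById('results').innerHTML =      // 'results' is a made-up id
        '<table>' + rows.join('') + '</table>';
    }
    setInterval(render, 20000);  // every 20 seconds, swap in the whole table as a string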

XHR requests must be async, which is a terrible technology: you don't know how many responses you are going to get per second. It has some relationship with the number of requests you make, but if you make a lot of requests there is no way to avoid receiving a lot simultaneously, which freezes the browser if you try to do anything with the response text. There is no time to do more than push them into an array.

Then I use a setInterval to parse them; how many per tick depends on how far behind the parser is. setInterval gets delayed if the browser is busy, so parsing can fall far behind. If it is too far behind, the number of requests per second is dialed down.
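
(A sketch of that backpressure idea; the numbers are illustrative, and parseFeed is sketched further down.)

    const responseQueue = [];
    let requestsPerSecond = 5;               // illustrative starting rate

    function onFeedResponse(xhr) {
      responseQueue.push(xhr.responseText);  // no parsing in the handler itself
    }

    setInterval(() => {
      const backlog = responseQueue.length;
      const batch = Math.min(backlog, 1 + Math.floor(backlog / 10));    // parse more when behind
      for (let i = 0; i < batch; i++) parseFeed(responseQueue.shift());
      if (backlog > 200) requestsPerSecond = Math.max(1, requestsPerSecond - 1);  // dial requests down
    }, 250);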

Probably the funniest part is the parser.

It starts with a regex exec in a do{}while() loop that looks for things that look like pubDates, comparing up to 3 timestamps with the oldest pubDate in my result set. Most feeds never make it beyond that point.

Then it tries to parse the feed by the rules: it looks for the string <item>, and if it finds that it looks for <title>, etc.
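
(A sketch of that two-stage parse, reusing the arrays from the earlier sketch; the regexes here are illustrative, not the actual ones.)

    function parseFeed(xml) {
      // stage 1: cheap pre-check: find up to 3 date-looking strings and compare
      // them with the oldest pubDate already held (entries is newest-first)
      const oldest = entries.length ? entries[entries.length - 1][0] : 0;
      const dateRe = /<(?:pubDate|dc:date|published|updated)>([^<]+)</gi;
      let match, checked = 0, hasNew = false;
      do {
        match = dateRe.exec(xml);
        if (match && Date.parse(match[1]) > oldest) hasNew = true;
      } while (match && ++checked < 3 && !hasNew);
      if (!hasNew) return;  // most feeds stop here

      // stage 2: parse by the rules: split on <item>, pull out <title>, <link>, <pubDate>
      for (const chunk of xml.split('<item').slice(1)) {
        const title = (chunk.match(/<title>([\s\S]*?)<\/title>/i) || [])[1];
        const link  = (chunk.match(/<link>([\s\S]*?)<\/link>/i) || [])[1];
        const date  = Date.parse((chunk.match(/<pubDate>([\s\S]*?)<\/pubDate>/i) || [])[1]);
        if (title && link && !isNaN(date)) {
          const u = new URL(link.trim());  // assumes an absolute link URL
          incoming.push([date, title.trim(), u.hostname, u.pathname]);
        }
      }
    }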

Lots of feeds are hilariously broken and invalid, so I have an arsenal of alternative approaches, in order of likelihood.

The funniest moment was when someone deleted their feed and redirected it to the website's front page... and I never noticed. The script tried to parse the "feed" in every way it could and, as a last resort, looked for <a href=""> elements. Because it also couldn't find a pubDate of any kind, nothing that even looked like a date, it took the /2023/01/03 part of the URL, made a timestamp from that, inserted it into the result set, and happily continued.

Feeds that are slow get suspended: level 1 is 24 hours, level 5 is a week, and 5 weeks is the limit. I might try them again after a few more months.
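
(As a config sketch: only the level-1 and level-5 values and the 5-week cap come from the comment above; the in-between steps are guessed.)

    const HOUR = 3600 * 1000;
    const DAY  = 24 * HOUR;
    // level 1 = 24 hours, level 5 = a week; intermediate steps are a guess
    const suspension = [0, 1 * DAY, 2 * DAY, 3 * DAY, 5 * DAY, 7 * DAY];
    const MAX_SUSPENSION = 5 * 7 * DAY;  // nothing sleeps longer than 5 weeks

    function nextPollTime(feed) {
      const level = Math.min(feed.slowStrikes, suspension.length - 1);
      return feed.lastPolled + Math.min(suspension[level], MAX_SUSPENSION);
    }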

In monitor mode each function has its own tiny log window. The log() function is normally empty; modern JS skips it efficiently regardless of what kind of complex params are passed.

At times I have benchmarking code in there so that it chooses the fastest approach automatically.

It can consume OPML files (any number, pretty much), but I prefer using flat lists of feed URLs.

Apart from the (optional) dumps the backend is just static files.

Because it runs in the browser it can also use credentials. It mixes in my Gmail feed too: https://mail.google.com/mail/feed/atom

Not really something I want to share with the world.

I will share this http://salamisushi.go-here.nl

This is a Greasemonkey script: it detects feeds linked in the web pages you visit, displays an edit box with the feeds it found, and lets you export the list as an OPML file.
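
(Not the actual salamisushi source, just the general shape of such a feed detector.)

    // collect the feed URLs a page advertises via <link rel="alternate">
    const found = new Set();
    document.querySelectorAll(
      'link[rel="alternate"][type="application/rss+xml"], ' +
      'link[rel="alternate"][type="application/atom+xml"]'
    ).forEach(link => found.add(new URL(link.href, location.href).href));
    // ...then show the list in an edit box and offer it as an OPML download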

The web pages you visit are actually amazingly interesting to you personally, much more so than one would expect. Unsubbing from the garbage goes much faster than finding interesting feeds manually, one by one. If there is a lot of garbage, it's because you've visited too many garbage websites. It's funny to reflect on.

Thanks for your time

replies(2): >>35750498 #>>35752234 #
3. walterbell ◴[] No.35750498[source]
This commentary is even better than code :) Thanks!
4. ydant ◴[] No.35752234[source]
This all reminds me of my friend, who doesn't really have a programming background, but uses Tasker (an Android app programmed through a UI) to automate tons of stuff and is incredibly happy with it. Meanwhile, I've had tons of ideas of things I'd like to automate, but never did, because Tasker is clunky and not "reliable" or "the right way", and a "proper" approach would be "better" or Tasker "might stop working / be unreliable". In the end, he has accomplished a lot more than I have, even though I supposedly have a much better foundation for accomplishing those sorts of things.

That is to say, I admire your approach here. I appreciate that you've just gone with it and did what many people on here would look at and say "that's insane".

There's a lot of power in just doing something, in the environment that's most readily available to you, and not getting hung up on the best way to plan it out and do it and make sure it's reliable and durable. You're exactly where you want to be, and the other person is sitting back at the start still planning.

Thank you for sharing.