256 points by maoserr | 44 comments
1. dartharva ◴[] No.41904034[source]
Awesome!
replies(1): >>41907032 #
2. mati365 ◴[] No.41904180[source]
Is it legal?
replies(3): >>41904216 #>>41904275 #>>41905279 #
3. reaperducer ◴[] No.41904216[source]
Depends on where you live.

Where I am, it's perfectly legal.

Before cell service was as widespread as it is today, there were programs that would scrape web pages into ePUBs so you could read them later on your Palm Pilot. I used one every day during my commute. And the best part was that the articles ended. No mind-numbing infinite scroll.

When I switched to a "smart" phone (SonyEricsson m600c), I really missed it.

replies(3): >>41904456 #>>41904997 #>>41905412 #
4. ◴[] No.41904275[source]
5. stronglikedan ◴[] No.41904341[source]
If this can handle those sites where every section is behind an accordion that must be expanded (and especially where it collapses other sections when you expand one), then this is going to be awesome.
replies(2): >>41904579 #>>41904625 #
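
Since a scraper running inside the page can drive the DOM, one plausible approach (a hypothetical sketch, not the extension's documented behavior — the aria-expanded selector is an assumption about the site's markup) is to expand every collapsed section before extracting:

    // Hypothetical: click open every collapsed accordion header first.
    const collapsed = document.querySelectorAll<HTMLElement>('[aria-expanded="false"]');
    collapsed.forEach((header) => header.click());
    // For accordions that close siblings on open, you'd instead expand and
    // snapshot one section at a time, accumulating the HTML as you go.
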
6. bloopernova ◴[] No.41904435[source]
E-Reader makers, take note. This is a cool feature that should be built in, or at least exposed via an API to get content onto the Kindle etc. Or even a "Send to Kindle" email address that accepts URLs too.
replies(4): >>41904795 #>>41905015 #>>41905131 #>>41905696 #
7. thesuitonym ◴[] No.41904456{3}[source]
I wouldn't want to go back, because having instant access to anything is pretty amazing, but I do miss those days of offline internet.
replies(1): >>41904548 #
8. richardlblair ◴[] No.41904548{4}[source]
Fully agree. I recently replaced my doomscrolling with a retro handheld and it really makes me happy. It also pushed me to pick up my ereader again.

I spend enough time at a computer that I shouldn't really need a smartphone outside of 'I need to message ___' or 'I need to go ___'

9. dotancohen ◴[] No.41904579[source]
Can it remove popups for newsletters, subscriptions, logins, or cookie notifications? Can it read pages that require signing in?
replies(1): >>41904670 #
10. solarkraft ◴[] No.41904608[source]
Neat!

I once made a simple version of this concept that saves an epub file on the server's file system, which is then synced to my e-book reader:

https://github.com/solarkraft/webpub

The main ingredient is Postlight Parser, which gives a simplified "document" view for a website.
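
A minimal sketch of that ingredient, assuming the @postlight/parser npm package (Postlight's parser, formerly Mercury Parser); illustrative only, not webpub's actual code:

    // Simplify a page into a clean "document" view suitable for an EPUB chapter.
    import Parser from "@postlight/parser";

    const result = await Parser.parse("https://example.com/article"); // placeholder URL
    // `result` carries title, author, date_published, and the simplified
    // HTML in `result.content`.
    console.log(result.title);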

11. maoserr ◴[] No.41904625[source]
Works on this site: https://docs.ray.io/en/latest/ for me.
12. maoserr ◴[] No.41904670{3}[source]
It extracts the main content using Readability by default (you can configure it with something else). Logins would depend on how you're parsing. It has two modes: it either browses to the page inside the window (for non-refreshing pages), or retrieves it in the background using fetch.
replies(1): >>41907400 #
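
A rough sketch of what that background-fetch mode could look like, using Mozilla's Readability with jsdom under Node (an assumption for illustration, not the extension's source):

    // Fetch raw HTML and run Readability over it. No page JavaScript executes
    // here, which is exactly why the in-window mode exists for dynamic sites.
    import { Readability } from "@mozilla/readability";
    import { JSDOM } from "jsdom";

    async function extractArticle(url: string) {
      const html = await (await fetch(url)).text();
      const dom = new JSDOM(html, { url }); // `url` makes relative links resolve
      return new Readability(dom.window.document).parse(); // { title, content, ... } or null
    }
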
13. andai ◴[] No.41904795[source]
I wonder if this would have a positive or negative effect on profits.

On the one hand, they'd be adding a massive amount of free content to a platform where they make money because people pay to consume content.

On the other hand, it might actually increase sales simply because I'd spend more time using it, which would presumably result in more book purchases too.

(Also, the Kindle store is already full of $0 public domain stuff, so they don't seem too bothered by that possibility.)

replies(1): >>41905029 #
14. kemayo ◴[] No.41904893[source]
Having written my own one of these, I think the interesting thing about this one is really the UI for iterating on extracting content from an arbitrary site. Having a full GUI for working through the extraction is much more flexible than the norm.
15. ffsm8 ◴[] No.41904955[source]
Heh, I'm currently creating something very similar.

A web scraper for blogs and mainly web novels, plus an ePub parser that persists the data to a database along with categories and tags, and a companion PWA for offline reading, to track reading progress on various stories and let me keep multiple versions of the same story (web novel and published ePub).

16. latchkey ◴[] No.41904997{3}[source]
The Danger Hiptop had a proxy that reformatted websites for its built-in browser, mostly as a way to reduce data transfer.

https://medium.com/@chrisdesalvo/the-future-that-everyone-fo...

17. bryanrasmussen ◴[] No.41905015[source]
You sort of have this with the reMarkable - https://remarkable.com/blog/introducing-read-on-remarkable
18. joseda-hg ◴[] No.41905029{3}[source]
Huh, didn't know that. I guess I assumed they wouldn't bother with it; I'd think of a published work on Kindle like a product page on Amazon, so it doesn't make sense to have $0 items.

Are they an Amazon offering, or do third parties take the time to set that up?

replies(1): >>41908269 #
19. ◴[] No.41905131[source]
20. Tepix ◴[] No.41905279[source]
If you can read it on a website, why not on an ebook reader?

If you start selling the resulting files, now that would be a copyright violation. German law grants a right to create a "Privatkopie", i.e. a private copy. I guess this is similar to fair use in US law?

21. stuxnet79 ◴[] No.41905290[source]
For those interested in a simple-to-use command-line tool that accomplishes the same thing, I've had success with percollate - https://github.com/danburzo/percollate
replies(3): >>41905350 #>>41905429 #>>41906025 #
22. ◴[] No.41905350[source]
23. anthk ◴[] No.41905387[source]
I had that, but for the terminal under Unix, and for web pages, Gopher, and Gemini. Offpunk:

https://sr.ht/~lioploum/offpunk/

Instead of EPUB, content gets cached as text files (Gopher), Gemini files (Gemini), and HTML+images (web pages). You can browse the hierarchy from ~/.cache/offpunk or directly from Offpunk.

With the "tour" function, forget about doomscrolling. You'll read all the articles in text mode sequentially until you finish.

24. anthk ◴[] No.41905412{3}[source]
If you have a GNU/Linux/Mac/BSD machine with Python:

https://sr.ht/~lioploum/offpunk/

25. Mkengine ◴[] No.41905419[source]
Does it support http://fanfiction.net/ ? I never found an easy solution for that one.
replies(4): >>41905667 #>>41905946 #>>41906438 #>>41908224 #
26. Mkengine ◴[] No.41905429[source]
Does it support http://fanfiction.net/ ? I never found an easy solution for that one.
27. maoserr ◴[] No.41905667[source]
You can import a CSV of all the chapter links; it looks like it's just incremental numbering in the URL.
replies(1): >>41906788 #
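
For instance, a throwaway sketch of generating that CSV, assuming fanfiction.net's /s/<storyId>/<chapter> URL pattern (story ID and chapter count are placeholders):

    // Hypothetical helper: emit one chapter URL per line, ready to import.
    function chapterCsv(storyId: number, chapterCount: number): string {
      const rows: string[] = [];
      for (let ch = 1; ch <= chapterCount; ch++) {
        rows.push(`https://www.fanfiction.net/s/${storyId}/${ch}`);
      }
      return rows.join("\n");
    }

    console.log(chapterCsv(12345, 30)); // placeholder values
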
28. 39896880 ◴[] No.41905696[source]
Kobo has Pocket integration, is this substantially different?
29. pasc1878 ◴[] No.41905946[source]
I use a calibre add-in https://www.mobileread.com/forums/showthread.php?t=259221

It sort of works, i.e. some stories just work, while others only get the first page.

30. tra3 ◴[] No.41906025[source]
This looks great!! I've long been looking for something that leverages Readability (or similar).

Edit: Tried it with Reuters, and it looks like percollate requires JavaScript, etc. Back to using "Print as PDF" from the browser.

31. vivzkestrel ◴[] No.41906247[source]
Gonna love running this on all the documentation-heavy websites like AWS, VueJS, MDN, w3schools, realpython, betterstack.
32. seridescent ◴[] No.41906438[source]
you can export epubs from https://fichub.net/
33. t-3 ◴[] No.41906788{3}[source]
The issue is most likely Cloudflare blocking most of the best scraping methods. If the site can be pulled down with e.g. wget or curl, without a bunch of options that you definitely aren't writing by hand, pandoc can just be used to directly make an EPUB.
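
A loose sketch of that pipeline, shelling out to pandoc from Node (the URLs are placeholders; this only works where plain fetches aren't bot-blocked):

    // Fetch unprotected pages to disk, then let pandoc stitch them into an EPUB.
    import { execFileSync } from "node:child_process";
    import { writeFileSync } from "node:fs";

    const urls = ["https://example.com/ch1", "https://example.com/ch2"]; // placeholders

    const files: string[] = [];
    for (const [i, url] of urls.entries()) {
      const html = await (await fetch(url)).text(); // fails against Cloudflare-style blocking
      const file = `ch${i + 1}.html`;
      writeFileSync(file, html);
      files.push(file);
    }
    execFileSync("pandoc", [...files, "-o", "book.epub"]); // inputs are concatenated in order
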
34. B1FF_PSUVM ◴[] No.41907032[source]
It's rather unfair to "first commenters", who got the article up from the pile and left a quick recommendation, to get downvoted by latecomers.

(dartharva's comment was the only thing here when I first looked from the front page)

35. 3abiton ◴[] No.41907130[source]
This is an amazing tool! Long gone are the days when I used to force-cache many webpages for offline travel.
36. dotancohen ◴[] No.41907400{4}[source]
Terrific, thank you.
37. kemayo ◴[] No.41908224[source]
Fanfiction.net is trivial... apart from it having Cloudflare bot blocking turned up to aggressive levels. I've not seen an approach that works, other than using headless browsers to fetch the content.
replies(1): >>41909356 #
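
A bare-bones sketch of that headless-browser route, assuming Playwright (and note the reply below: default headless setups often still trip Cloudflare):

    // Render the page in a real browser engine and take the post-JavaScript HTML.
    import { chromium } from "playwright";

    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto("https://www.fanfiction.net/s/123/1"); // placeholder story URL
    const html = await page.content(); // DOM after scripts (and any challenge) have run
    await browser.close();
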
38. andai ◴[] No.41908269{4}[source]
It's on Amazon: tons of public domain stuff republished for $0 on Kindle. One click to "purchase" (free download).
39. KaoruAoiShiho ◴[] No.41908662[source]
Calibre supports a massive list of sites.

https://github.com/JimmXinu/FanFicFare

https://github.com/JimmXinu/FanFicFare/wiki/SupportedSites

40. noam_compsci ◴[] No.41908722[source]
Every so often, I want to get an epub of Paul Graham’s essays (e.g. right before a flight). Hopefully I’ll remember to use this.
41. theultdev ◴[] No.41909356{3}[source]
Headless browsers won't work by default for Cloudflare captchas.

Open source stealth plugins don't really work now either.

You have to use real browser fingerprints.

42. maoserr ◴[] No.41909433[source]
Hey, I want to address why this extension is different from other scrapers.

This is for ad hoc generation of EPubs from websites that don't scrape well with traditional scrapers (think standard request-based command-line scripts, or other Chrome extensions that scrape based on open tabs/windows), for a few reasons:

1. Usually command-line scrapers and other extensions have predefined sites they work for; this one is for sites outside of those.

2. Or they require nontrivial configuration and/or code.

3. Some sites use JavaScript to dynamically generate/retrieve the text, in which case you need the browser to run the JS. This was the biggest gap for me.

4. This one runs in the browser, so it's maybe less likely to be detected and blocked.

I also don't intend this scraper to be robust or to be run repeatedly as a background scheduled job; that's why there's a UI for selecting key elements to scrape. It's meant to be more generalized, so that you can still scrape a site relatively easily with just some mouse clicks, without needing a preexisting configuration for it.

If the site you're scraping is already handled by other programs/extensions, then this wouldn't perform better, since those are specifically configured for their sites. Otherwise, this extension gives you the tools to scrape something once or twice without spending too much time coding/configuring.

I don't find myself sticking to the same site a lot, so I wrote this.

43. yayitswei ◴[] No.41910433[source]
Is there a good tool for scraping a multi-page website (e.g. documentation) into plain text to send to an LLM?