ArchiveBox is evolving: the future of self-hosted internet archives

1. nfriedly ◴[16 Oct 24 17:09 UTC] No.41861386[source]▶

I've been using an instance of https://readeck.org/ for personal archives of web pages and I really like it, but I might try out ArchiveBox at some point too.

I also run an instance of ArchiveTeam Warrior which is constantly uploading things to archive.org, and I like the direction ArchiveBox is heading with the distributed/federated archiving on the roadmap, so I may end up setting up an instance like that even if I don't use it for personal content.

replies(3): >>41861418 #>>41862054 #>>41867705 #

2. nikisweeting ◴[16 Oct 24 17:13 UTC] No.41861418[source]▶

>>41861386 (TP) #

I love ArchiveTeam Warrior, it's such a good idea! We run several instances ourselves, and it's part of our Good Karma Kit for computers with spare capacity: https://github.com/ArchiveBox/good-karma-kit

There are a bunch of other alternatives like Readeck listed on our wiki too. We encourage people to check them out!

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

3. venusenvy47 ◴[16 Oct 24 18:20 UTC] No.41862054[source]▶

>>41861386 (TP) #

I've been using the SingleFile extension to save self-contained HTML files of pages I want to keep for posterity. I like it because any browser can open the files it creates. Is it easy to view the archive files from Readeck? I haven't looked at fancier alternatives to my existing solution.

https://addons.mozilla.org/en-US/firefox/addon/single-file/

replies(3): >>41862155 #>>41862251 #>>41867736 #

4. nfriedly ◴[16 Oct 24 18:28 UTC] No.41862155[source]▶

>>41862054 #

I haven't looked at the on-disk format, I just use the browser interface. (It's fairly common for me to save something from my phone that I'll want to review on a computer later.)

Here's an example of an Amazon "review" I recently archived that has instructions for using a USB tester I have: https://readeck.home.nfriedly.com/@b/tCngVjkSFOrCbwb9DnY2yw

And, for comparison, here's the original: https://www.amazon.com/gp/customer-reviews/R3EF0QW6MAJ0VP

It'd be nice if I could edit out the extra junk near the top, but the important bits are all there.

replies(1): >>41862409 #

5. nikisweeting ◴[16 Oct 24 18:36 UTC] No.41862251[source]▶

>>41862054 #

SingleFile is excellent, and Gildas is a great developer. ArchiveBox has had SingleFile built in as one of its extractors for years :)

replies(2): >>41866120 #>>41995800 #

6. ashildr ◴[16 Oct 24 18:48 UTC] No.41862409{3}[source]▶

>>41862155 #

I was about to post a link to the same URL but archived using SingleFile, which looks like the original at Amazon. I didn’t because I realized that I have absolutely no idea what additional information would be hidden in the file. In the worst case any component sent by Amazon and archived into the file may contain PII, even if I am “logged out”.

I’m not saying that SingleFile is bad in any way, I’m using it a lot on multiple devices, but I’m not sure whether sharing archives is a good idea™.

replies(1): >>41862549 #

7. nikisweeting ◴[16 Oct 24 18:58 UTC] No.41862549{4}[source]▶

>>41862409 #

100%, this is the challenge of archiving logged-in content.

It becomes un-shareable unless we use fake burner accounts for capture, or have really good sanitizing methods.

replies(1): >>41862763 #

8. ashildr ◴[16 Oct 24 19:17 UTC] No.41862763{5}[source]▶

>>41862549 #

Even when I’m logged out I expect at least information on my geographical location to seep into the archive via URLs addressing specific CDN endpoints or similar mechanisms.
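
A rough way to check a saved page for this kind of leak is to list every hostname it still references and look for region-specific CDN names. The following is only a minimal sketch: the function name `referenced_hosts` and the regex are illustrative assumptions, not part of SingleFile or ArchiveBox, and it only catches plain `http(s)://` URLs, not identifiers embedded in cookies, scripts, or encoded payloads.

```python
import re
from collections import Counter

# Matches the host part of any absolute http(s) URL found in the text.
# Assumption: hosts are plain ASCII names; ports and paths are ignored.
HOST_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def referenced_hosts(html: str) -> Counter:
    """Count how often each hostname appears in absolute URLs within html."""
    return Counter(host.lower().rstrip(".") for host in HOST_RE.findall(html))


if __name__ == "__main__":
    import sys

    # Usage: python hosts.py saved-page.html
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for host, count in referenced_hosts(f.read()).most_common():
            print(f"{count:5d}  {host}")
```

Any hostname in the output that encodes a region or edge location is a hint that the snapshot reveals roughly where it was captured from.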

replies(1): >>41862920 #

9. nikisweeting ◴[16 Oct 24 19:32 UTC] No.41862920{6}[source]▶

>>41862763 #

Yup, this is why the ArchiveBox browser extension sends URLs to a separate server for archiving with an isolated burner profile.

I should write a full article on the security implications at some point, there aren't many good top-down explanations of why this is a hard problem.

replies(2): >>41865661 #>>41867775 #

10. ashildr ◴[17 Oct 24 01:44 UTC] No.41865661{7}[source]▶

>>41862920 #

I know it’s a lot of work, but this would be great, and it may give readers a deeper understanding of security in general.

11. gildas ◴[17 Oct 24 03:18 UTC] No.41866120{3}[source]▶

>>41862251 #

Thank you so much Niki :). The P2P sharing is a great idea. I really hope this feature will get things moving in the archiving field.

12. ninalanyon ◴[17 Oct 24 08:59 UTC] No.41867705[source]▶

>>41861386 (TP) #

I've just tried Readeck and it doesn't save a good quality copy of the pages using the Firefox extension. SingleFile and WebScrapBook do a much better job.

I prefer WebScrapBook because it saves all the assets as files under the original names in a directory rather than a zip file. This means that I can use other tools such as find, grep, and file managers like Nemo to search the archive without needing to rely on the application that saved the page.

13. ninalanyon ◴[17 Oct 24 09:05 UTC] No.41867736[source]▶

>>41862054 #

Readeck saves a page as a zip file. It's not hard to open from the command line or file manager, just unzip and launch the index.html in the web browser.

But it strips out a lot of detail. Zipping it also means that it's hard to deduplicate. I use WebScrapBook and run rdfind to hardlink all the identical files.
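
For anyone curious what that deduplication step amounts to, here is a minimal Python sketch of the idea: group files by size and content hash, then replace each duplicate with a hard link to the first copy. This is not rdfind's actual algorithm or interface, just an illustration; `hardlink_duplicates` and `file_digest` are made-up names, and it assumes all files live on one filesystem that supports hard links.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root: Path) -> int:
    """Hard-link byte-identical files under root; return how many were replaced."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and not path.is_symlink():
            groups[(path.stat().st_size, file_digest(path))].append(path)

    replaced = 0
    for paths in groups.values():
        keeper = paths[0]
        for dup in paths[1:]:
            if os.path.samefile(keeper, dup):
                continue  # already the same inode, nothing to do
            # Link under a temporary name, then atomically swap it in,
            # so the duplicate is never missing if something fails midway.
            tmp = dup.with_name(dup.name + ".dedupe-tmp")
            os.link(keeper, tmp)
            os.replace(tmp, dup)
            replaced += 1
    return replaced
```

Calling `hardlink_duplicates(Path("/path/to/archive"))` (a placeholder path) on a directory of unpacked pages would collapse repeated assets such as shared logos or fonts into a single copy on disk. Hashing every file is simple but slow on large archives; hashing only files whose sizes collide would be the obvious optimization.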

14. ninalanyon ◴[17 Oct 24 09:11 UTC] No.41867775{7}[source]▶

>>41862920 #

How does it save pages that are only available when you are logged in such as social networking pages?

replies(1): >>41871118 #

15. nikisweeting ◴[17 Oct 24 16:25 UTC] No.41871118{8}[source]▶

>>41867775 #

You set up a Chrome profile for archiving that's logged into all the sites you want to save. I recommend using burner accounts dedicated to archiving, so you'd have to add them to any private pages/groups you want to archive.

It is possible to use your main account for archiving, but there are security risks (you can't share the snapshots without leaking session headers).

replies(1): >>41878307 #

16. pbronez ◴[18 Oct 24 11:15 UTC] No.41878307{9}[source]▶

>>41871118 #

That’s a very cool solution: it gives the user explicit control.

17. venusenvy47 ◴[30 Oct 24 15:08 UTC] No.41995800{3}[source]▶

>>41862251 #

I just installed the ArchiveBox add-on to Firefox, but I'm not sure how to find the page that I just archived. Do people normally run their own server to collect the pages?