ArchiveBox is evolving: the future of self-hosted internet archives

1. the_gorilla ◴[16 Oct 24 16:51 UTC] No.41861221[source]▶

I don't know how anyone manages to use archivebox. I've tried it twice in the last 3 years and its site compatibility is bad, it quietly leaks everything you archive to archive.org by default, and whenever it fails on a download it stops archiving anything even after deleting and resubmitting all the jobs.

I'm sure it works for some people, but not me.

replies(1): >>41861275 #

2. nikisweeting ◴[16 Oct 24 16:57 UTC] No.41861275[source]▶

>>41861221 (TP) #

These are legitimate gripes that have plagued specific past releases, and I hear your frustration. Please keep in mind this was a solo effort, only worked on in my spare time over the last 7 years (up until very recently).

The new v0.8 adds a BG queue specifically to deal with the issue of stalling when some sites fail. There was a system to do this in the past, but it was imperfect and mostly optimized for the docker setup where a scheduler is running `archivebox update` every few hours to retry failed URLs.
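As a rough illustration of the idea (not ArchiveBox's actual implementation), a queue can keep working through other URLs when one fails by pushing the failed job to the back with an attempt counter. The names `run_queue`, `fetch`, and `max_attempts` are hypothetical and chosen only for this sketch:

```python
from collections import deque
from typing import Callable, Iterable


def run_queue(
    urls: Iterable[str],
    fetch: Callable[[str], None],
    max_attempts: int = 3,
) -> tuple[list[str], list[str]]:
    """Process every URL; a failure re-queues that URL instead of halting the run."""
    pending = deque((url, 1) for url in urls)
    done: list[str] = []
    failed: list[str] = []
    while pending:
        url, attempt = pending.popleft()
        try:
            fetch(url)
            done.append(url)
        except Exception:
            if attempt < max_attempts:
                # Retry later, behind the jobs that are still waiting.
                pending.append((url, attempt + 1))
            else:
                # Give up on this URL only; the rest of the queue continues.
                failed.append(url)
    return done, failed
```

The point of the design is that a permanently broken site only costs its own retries, and every other job still gets archived.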

Site compatibility is much improved with the new BETA, but it's a perpetual cat and mouse game to fix specific sites, which is why we think the new plugin system is the way forward. It's just not sustainable for a single company (really just me right now) to maintain hundreds of workarounds for each individual site. I'm also discussing with the Webrecorder and Archive.org teams how we can share these site-specific workarounds as cross-compatible plugins (aka "behaviors") between our various software.

> it quietly leaks everything you archive to archive.org by default

It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context: https://news.ycombinator.com/item?id=26866689

replies(2): >>41861360 #>>41861432 #

3. freedomben ◴[16 Oct 24 17:06 UTC] No.41861360[source]▶

>>41861275 #

Yeah, I'm not sure whether archive.org should be defaulted to on or off (I see both sides of that one), but its existence is definitely surfaced.

I love Archive Box btw, thank you for your effort! It's filling a very important need.

4. the_gorilla ◴[16 Oct 24 17:14 UTC] No.41861432[source]▶

>>41861275 #

I can accept the other issues, but archivebox needs to be private and secure by default.

Sending everything to archive.org is a bad default and it erodes a certain level of trust in the project. Requiring "several important changes and security considerations" just makes it a non-starter. The default settings should be "safe" for the default user, because as you mentioned in that post, 90% of users are never going to change them. Users should be able to run it locally and archive data without worrying about security issues, unless you only want experts to be able to use your software.

There's also a contradiction between your statement and your blog post: someone saving their photos isn't going to want to worry about whether they configured your tool correctly, or about leaking all the group logs or grandma's photos.

> It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context

> Who cares about saving stuff?

> All of us have content that we care about, that we want to see preserved, but privately:

> families might want to preserve their photo albums off Facebook, Flickr, Instagram

> individuals might want to save their bookmarks, social feeds, or chats from Signal/Discord

> companies might want to save their internal documents, old sites, competitor analyses, etc.

I want the project to do well but it really needs to be secure by default.

replies(3): >>41861455 #>>41861460 #>>41861997 #

5. nikisweeting ◴[16 Oct 24 17:17 UTC] No.41861455{3}[source]▶

>>41861432 #

> The default settings should be "safe" for the default user,

I 100% agree, but because private archiving is doable but NOT 100% safe yet, I can't make that mode the default. The difficult reality currently is that archiving anything non-public is not simple to make safe.

Every capture will contain reflected session cookies, usernames, and PII, and other sensitive content. People don't understand that this means if they share a snapshot of one page they're potentially leaking their login credentials for an entire site.
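To make the risk concrete, here is a minimal sketch of checking a capture for reflected secrets before sharing it. It assumes you already know the values to look for (for example, cookies exported from the browser profile used for archiving). The function name `find_leaked_secrets` and the sample data are hypothetical and are not part of ArchiveBox:

```python
def find_leaked_secrets(snapshot_html: str, secrets: dict[str, str]) -> list[str]:
    """Return the names of secrets whose raw value appears verbatim in the capture."""
    return [
        name
        for name, value in secrets.items()
        if value and value in snapshot_html
    ]


# Hypothetical capture: the page reflects the session token in a logout link.
html = '<a href="/logout?token=abc123SESSION">Log out</a><p>Welcome, alice</p>'
secrets = {
    "sessionid": "abc123SESSION",
    "csrftoken": "zzz999",
    "username": "alice",
}
leaked = find_leaked_secrets(html, secrets)
print(leaked)
```

A check like this only catches verbatim copies, so an empty result does not prove a snapshot is safe to share.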

It is possible to do safely, and we provide ways to achieve that, which I'm constantly working on improving, but until it's easy and straightforward and doesn't require any user education on security implications, I can't make it the default.

The goal is to get it to the point where it CAN be the default, but I'm still at least 6mo away from that point. Check out the archivebox/sessions dir in the source code for a look at the development happening here.

Until then, it requires some user education and setting up a dedicated Chrome profile + cookies + tweaking config to do (as an intentional barrier to entry for private archiving).

replies(2): >>41864458 #>>41872119 #

6. hobs ◴[16 Oct 24 17:17 UTC] No.41861460{3}[source]▶

>>41861432 #

As a custom tool built to archive stuff for archive.org, why would you expect that it can also do a completely opposite task, saving information privately?

I can see why you would want such a tool, but it seems like a direct divergence from the core goal of the existing codebase.

replies(2): >>41861509 #>>41872293 #

7. Apocryphon ◴[16 Oct 24 18:14 UTC] No.41861997{3}[source]▶

>>41861432 #

Perhaps this data is "private" as in "personal property" and not "private" as in "confidential."

replies(1): >>41862308 #

8. nikisweeting ◴[16 Oct 24 18:41 UTC] No.41862308{4}[source]▶

>>41861997 #

It's intended for both but it currently requires extra setup to do "confidential" because there are security risks.

9. dang ◴[16 Oct 24 20:46 UTC] No.41863698{5}[source]▶

>>41861509 #

We've banned this account for breaking the site guidelines. Please don't create accounts to break HN's rules with.

https://news.ycombinator.com/newsguidelines.html

replies(1): >>41864671 #

10. bigiain ◴[16 Oct 24 22:24 UTC] No.41864458{4}[source]▶

>>41861455 #

That's a really good response, thanks.

I've been very impressed by all of your responses in here, but that one in particular shows empathy, compassion, and a deep deep subject matter expertise.

replies(1): >>41864981 #

11. nikisweeting ◴[16 Oct 24 23:37 UTC] No.41864981{5}[source]▶

>>41864458 #

Thank you. And thank you for taking the time to read all of it, there's a lot of great questions being asked.

12. arboles ◴[17 Oct 24 18:11 UTC] No.41872119{4}[source]▶

>>41861455 #

I don't think it's possible to remove information about yourself from a webpage before you share it. It's always possible to have crafted a website that sneaks reflected session information or the instance of archivebox's IP address into the main content. This can be a real response:

> And that was this week's newsletter! Congratulations for reading to the bottom, dear 198.51.100.1.

Even if the archivebox instance noted its own IP to do a search-and-replace like s|198\.51\.100\.1|XXX.XXX.XXX.XXX| on the snapshot it is about to create, it's possible to craft a response that obscures the presence of the information, such as by encoding the IP like this: MTk4LjUxLjEwMC4xCg==. I.e. steganography (https://en.wikipedia.org/wiki/Steganography).
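A small sketch of that argument, using the same documentation-range address and the same encoded string as above. The helper `redact_ip` is hypothetical and only stands in for whatever search-and-replace an instance might run:

```python
import base64
import re

INSTANCE_IP = "198.51.100.1"  # example address taken from the comment


def redact_ip(text: str, ip: str = INSTANCE_IP) -> str:
    """Replace literal occurrences of the IP, like the s|...|...| expression above."""
    return re.sub(re.escape(ip), "XXX.XXX.XXX.XXX", text)


# Case 1: the IP appears in plain text, so the substitution removes it.
plain = "Congratulations for reading to the bottom, dear 198.51.100.1."
print(redact_ip(plain))

# Case 2: the page embeds the same IP base64-encoded (with a trailing newline).
encoded = base64.b64encode(b"198.51.100.1\n").decode()
hidden = f"Tracking id: {encoded}"
print(redact_ip(hidden))            # unchanged: the pattern never matches
print(base64.b64decode(encoded))    # whoever made the page can still decode it
```

The redaction handles the first case and silently misses the second, and base64 is only the simplest of many encodings a site could choose.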

Being able to anonymize archives before sharing them is something I would find interesting, but I don't think you can beat steganography, so I'm wondering what exactly you mean you plan to do.

13. arboles ◴[17 Oct 24 18:27 UTC] No.41872293{4}[source]▶

>>41861460 #

> As a custom tool built to archive stuff for archive.org

Archivebox has no association with archive.org. Sending URLs to archive.org is just one of its features, which can also be turned off.