I'm sure it works for some people, but not me.
The new v0.8 adds a BG queue specifically to deal with the issue of stalling when some sites fail. There was a system to do this in the past, but it was imperfect and mostly optimized for the Docker setup, where a scheduler runs `archivebox update` every few hours to retry failed URLs.
Site compatibility is much improved with the new BETA, but it's a perpetual cat-and-mouse game to fix specific sites, which is why we think the new plugin system is the way forward. It's just not sustainable for a single company (really just me right now) to maintain hundreds of workarounds for individual sites. I'm also discussing with the Webrecorder and Archive.org teams how we can share these site-specific workarounds as cross-compatible plugins (aka "behaviors") between our various software.
> it quietly leaks everything you archive to archive.org by default
It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context: https://news.ycombinator.com/item?id=26866689
Sending everything to archive.org is a bad default, and it erodes a certain level of trust in the project. Requiring "several important changes and security considerations" just makes it a non-starter. The default settings should be "safe" for the default user, because as you mentioned in that post, 90% of users are never going to change them. Users should be able to run it locally and archive data without worrying about security issues, unless you only want experts to be able to use your software.
There's also a contradiction between your statement and your blog post: someone saving their photos isn't going to want to worry about whether they configured your tool correctly, or about leaking all the group logs or grandma's photos.
> It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context
> Who cares about saving stuff?
> All of us have content that we care about, that we want to see preserved, but privately:
> families might want to preserve their photo albums off Facebook, Flickr, Instagram
> individuals might want to save their bookmarks, social feeds, or chats from Signal/Discord
> companies might want to save their internal documents, old sites, competitor analyses, etc.
I want the project to do well but it really needs to be secure by default.
I 100% agree, but because private archiving is doable but NOT 100% safe yet, I can't make that mode the default. The difficult reality currently is that archiving anything non-public is not simple to make safe.
Every capture will contain reflected session cookies, usernames, PII, and other sensitive content. People don't understand that this means if they share a snapshot of one page, they're potentially leaking their login credentials for an entire site.
It is possible to do safely, and we provide ways to achieve it that I'm constantly working on improving, but until it's easy and straightforward and doesn't require any user education on security implications, I can't make it the default.
The goal is to get it to the point where it CAN be the default, but I'm still at least 6 months away from that point. Check out the archivebox/sessions dir in the source code for a look at the development happening here.
Until then, it requires some user education, plus setting up a dedicated Chrome profile + cookies + tweaking config (as an intentional barrier to entry for private archiving).
I can see why you would want such a tool, but it seems like a direct divergence from the core goal of the existing codebase.
> And that was this week's newsletter! Congratulation for reading to the bottom, dear 198.51.100.1.
Even if the ArchiveBox instance noted its own IP to do a search-and-replace like `s|198\.51\.100\.1|XXX.XXX.XXX.XXX|` on the snapshot it is about to create, it's possible to craft a response that obscures the presence of the information, such as by encoding the IP like this: `MTk4LjUxLjEwMC4xCg==`. I.e. steganography (https://en.wikipedia.org/wiki/Steganography).
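To make the point concrete, here is a minimal Python sketch of that failure mode. The `redact_ip` helper and the sample newsletter strings are hypothetical, made up for illustration; this is not ArchiveBox code or anything it is known to do. It only shows that a literal search-and-replace catches the plain IP but leaves a Base64-encoded copy untouched.

```python
import base64
import re

# Documentation-range address taken from the example above.
CLIENT_IP = "198.51.100.1"


def redact_ip(snapshot: str, ip: str = CLIENT_IP) -> str:
    """Hypothetical naive redaction: replace literal occurrences of the IP."""
    return re.sub(re.escape(ip), "XXX.XXX.XXX.XXX", snapshot)


# A response that reflects the IP in plain text.
plain = f"Congratulation for reading to the bottom, dear {CLIENT_IP}."

# The same response, but with the IP (plus a trailing newline) Base64-encoded.
encoded = base64.b64encode((CLIENT_IP + "\n").encode()).decode()
hidden = f"Congratulation for reading to the bottom, dear {encoded}."

redacted_plain = redact_ip(plain)    # literal IP is replaced
redacted_hidden = redact_ip(hidden)  # nothing matches, so nothing changes

# Anyone who receives the "redacted" snapshot can still decode the IP.
recovered = base64.b64decode(encoded).decode().strip()

print(redacted_plain)
print(redacted_hidden)
print(recovered)
```

Base64 is only the simplest case; a server could use any reversible encoding it likes, so a redaction pass would have to anticipate every one of them.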
Being able to anonymize archives before sharing them is something I would find interesting, but I don't think you can beat steganography, so I'm wondering what exactly you plan to do.