663 points nikisweeting | 1 comments

We've been pushing really hard over the last 6mo to develop this release. I'd love to hear feedback from people who've worked on big plugin systems in the past, or anyone who's tried our betas!
toomuchtodo ◴[] No.41861236[source]
https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives from a target, upload the WARC files to object storage (whether that is IA, S3, Backblaze B2, etc.), and then keep them in cold storage or serve them up via HTTPS or a torrent (mutable, preferred). The Internet Archive serves a torrent file for every item they host; one can do the same with WARC archives to enable a distributed archive. CDX indexes can be used for rapidly querying the underlying WARC archives.

You might support cryptographically signing WARC archives; Wayback is particular about archive provenance and integrity, for example.

https://www.loc.gov/preservation/digital/formats/fdd/fdd0005... ("CDX Internet Archive Index File")

https://www.loc.gov/preservation/digital/formats/fdd/fdd0002... ("WARC, Web ARChive file format")

https://github.com/internetarchive/wayback/tree/master/wayba... ("Wayback CDX Server API - BETA")
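
As a rough illustration of querying a CDX index, here is a minimal sketch that builds a request URL for the Internet Archive's public CDX server. The helper name `build_cdx_query` and the default `limit` are hypothetical choices for this example; the `url`, `output`, and `limit` parameters are the ones described in the Wayback CDX Server API document linked above. It only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Public CDX endpoint served by the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url: str, limit: int = 5) -> str:
    """Return a CDX query URL asking for up to `limit` captures of `url` as JSON.

    Hypothetical helper for illustration; it does not fetch anything.
    """
    params = {"url": url, "output": "json", "limit": limit}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_query("example.com"))
```

Fetching that URL should return one row per capture, which can then be used to locate the matching record inside a WARC file.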

replies(3): >>41861288 #>>41861743 #>>41861951 #
nikisweeting ◴[] No.41861288[source]
I recommend Browsertrix for WARC creation; I think it is the best tool currently available for WARC/WACZ.

ArchiveBox is also gearing up to support real cryptographic signing of archives using https://tlsnotary.org/ in an upcoming plugin (in a way that actually solves the TLS non-repudiation issue, which traditional "signing a WARC" does not; more info: https://www.ndss-symposium.org/wp-content/uploads/2018/02/nd...).

replies(4): >>41861315 #>>41861428 #>>41864927 #>>41880475 #
1. fasa99 ◴[] No.41880475[source]
>ArchiveBox is also gearing up to support real cryptographic signing of archives

That's a really interesting point. The gut reaction is "why are we wasting time on adding a nice-to-have such as a very fancy cousin of the MD5 checksum when the real meat of the time & effort is maximizing data download and scale"

But then go read the book 1984, and the importance of ensuring the data is unchanged down the road may become clear.

But if this is a hedge against a hypothetical future 1984 world, one would have to ask - what if the only file available has the wrong md5sum? Because then most people would say, "welp, something is better than nothing" and that's it. Perhaps what's needed is something that provides more detailed information about what, how, and where something was changed.
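
One way to get at the "where" is to record a digest per fixed-size block rather than a single checksum for the whole file. The following is a minimal sketch of that idea, not anything ArchiveBox or the WARC format does; the function names and the 4096-byte block size are assumptions made for illustration.

```python
import hashlib


def block_digests(data: bytes, block_size: int = 4096) -> list[str]:
    """Return a SHA-256 hex digest for each fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(recorded: list[str], data: bytes, block_size: int = 4096) -> list[int]:
    """Return indexes of blocks whose digest differs from the recorded manifest.

    Blocks that exist on only one side (file grew or shrank) count as changed.
    """
    current = block_digests(data, block_size)
    total = max(len(recorded), len(current))
    return [
        i for i in range(total)
        if i >= len(recorded) or i >= len(current) or recorded[i] != current[i]
    ]


original = b"A" * 10000
manifest = block_digests(original)  # stored at archive time

# Overwrite one byte at offset 5000, which falls in block 1 (bytes 4096-8191).
tampered = original[:5000] + b"B" + original[5001:]
print(changed_blocks(manifest, tampered))  # [1]
```

This only localizes in-place overwrites: an inserted or deleted byte shifts every later block, so all of them would be reported as changed. The manifest also has to be protected itself (for example by signing it), or it can be rewritten along with the file.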