663 points nikisweeting | 13 comments

We've been pushing really hard over the last 6mo to develop this release. I'd love to hear feedback from people who've worked on big plugin systems in the past, or anyone who's tried our betas!
toomuchtodo ◴[] No.41861236[source]
https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives from a target, upload the WARC files to object storage (whether that is IA, S3, Backblaze B2, etc), and then keep them in cold storage or serve them up via HTTPS or a torrent (mutable, preferred). The Internet Archive serves a torrent file for every item they host; one can do the same with WARC archives to enable a distributed archive. CDX indexes can be used for rapidly querying the underlying WARC archives.

You might support cryptographically signing WARC archives; Wayback is particular about archive provenance and integrity, for example.

https://www.loc.gov/preservation/digital/formats/fdd/fdd0005... ("CDX Internet Archive Index File")

https://www.loc.gov/preservation/digital/formats/fdd/fdd0002... ("WARC, Web ARChive file format")

https://github.com/internetarchive/wayback/tree/master/wayba... ("Wayback CDX Server API - BETA")
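
As a rough illustration of the CDX lookup mentioned above: a CDX index is a sorted text file in which each line points at one record inside a WARC file, so a query is essentially a binary search followed by a seek. The sketch below assumes a nine-field, space-separated layout (urlkey, timestamp, original URL, MIME type, status, digest, length, offset, filename). Real CDX files can use other field orders, and the sample lines, digests, and file names here are made up.

```python
import bisect
from typing import List, NamedTuple


class CdxEntry(NamedTuple):
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    status: str
    digest: str
    length: int
    offset: int
    filename: str


def parse_cdx_line(line: str) -> CdxEntry:
    # Assumed field order; check the header of a real CDX file before relying on it.
    f = line.split(" ")
    return CdxEntry(f[0], f[1], f[2], f[3], f[4], f[5], int(f[6]), int(f[7]), f[8])


def lookup(sorted_lines: List[str], urlkey: str) -> List[CdxEntry]:
    # Lines are sorted, so all captures of one urlkey sit next to each other.
    prefix = urlkey + " "
    start = bisect.bisect_left(sorted_lines, prefix)
    hits = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        hits.append(parse_cdx_line(line))
    return hits


# Hypothetical index contents, for illustration only.
SAMPLE_INDEX = sorted([
    "com,example)/ 20240101000000 https://example.com/ text/html 200 AAAA 1200 0 crawl-0001.warc.gz",
    "com,example)/ 20240601000000 https://example.com/ text/html 200 BBBB 1300 5000 crawl-0002.warc.gz",
    "com,example)/about 20240101000500 https://example.com/about text/html 200 CCCC 900 1200 crawl-0001.warc.gz",
])

captures = lookup(SAMPLE_INDEX, "com,example)/")
```

Each returned entry carries the filename, offset, and length needed to read a single record out of the WARC without scanning the whole file.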

replies(3): >>41861288 #>>41861743 #>>41861951 #
1. nikisweeting ◴[] No.41861288[source]
I recommend Browsertrix for WARC creation, I think they are the best currently available for WARC/WACZ.

ArchiveBox is also gearing up to support real cryptographic signing of archives using https://tlsnotary.org/ in an upcoming plugin. (in a way that actually solves the TLS non-repudiation issue, which traditional "signing a WARC" does not, more info: https://www.ndss-symposium.org/wp-content/uploads/2018/02/nd...)

replies(4): >>41861315 #>>41861428 #>>41864927 #>>41880475 #
2. ◴[] No.41861315[source]
3. toomuchtodo ◴[] No.41861428[source]
Keep in mind, what signing methodology you use is a function of who accepts it. If I can confirm "ArchiveTeam ripped this", that is superior to whatever tlsnotary is doing with MPC, blockchain, distributed ledger, whatever (in my use case). Have to trust someone at the end of the day. ArchiveTeam's Warrior doesn't use tlsnotary, for example, and rips entire sites just fine.
replies(1): >>41861514 #
4. nikisweeting ◴[] No.41861514[source]
The idea with TLSNotary is that you can have several universities or central agencies running signing servers but you don't have to share the cleartext content of your archives with them to get it signed.

This dramatically changes what is possible with signing because previously to get ArchiveTeam's signature of approval, they would have to see the content themselves to archive it. With TLSNotary they can sign without needing to see the content/access the cookies/etc.

replies(1): >>41862865 #
5. viraptor ◴[] No.41862865{3}[source]
Isn't that already possible with any kind of notary by giving them a sha256 of the content only? Or am I missing some distinction?
replies(1): >>41863381 #
6. nikisweeting ◴[] No.41863381{4}[source]
You can do that but it proves nothing because TLS session keys are symmetric, so the archiver can forge server responses and falsely attest that the server sent them.

Look up "TLS non repudiation"

A real solution like TLSNotary involves a neutral, reputable third party that can't see the cleartext attesting to the ciphertext using a ZK proof.

The neutral third party doing attestation can't see the content so they can't easily tamper with it, and attempts to tamper indiscriminately would be easily detected and ding their reputation.
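
A minimal sketch of why a hash or a MAC taken from the session alone proves nothing, assuming a toy model in which one shared key authenticates every record. Real TLS derives separate keys per direction and uses AEAD ciphers, but the archiver, as the client, still knows all of them. The key, messages, and function names are made up for illustration.

```python
import hashlib
import hmac
import secrets

# Toy stand-in for a TLS session key: both the server and the archiver hold it.
session_key = secrets.token_bytes(32)


def tag(key: bytes, message: bytes) -> bytes:
    # Symmetric authentication tag; either side of the session can compute it.
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, mac: bytes) -> bool:
    return hmac.compare_digest(tag(key, message), mac)


# What the server actually sent.
real_response = b"HTTP/1.1 200 OK\r\n\r\nbalance: 10"
real_tag = tag(session_key, real_response)

# The archiver knows the same key, so it can authenticate a response
# that the server never sent.
forged_response = b"HTTP/1.1 200 OK\r\n\r\nbalance: 1000000"
forged_tag = tag(session_key, forged_response)

# Both check out under the session key, so a verifier cannot tell them apart.
real_ok = verify(session_key, real_response, real_tag)
forged_ok = verify(session_key, forged_response, forged_tag)

# A notary given only sha256(content) signs whatever the archiver hands over.
digest_for_notary = hashlib.sha256(forged_response).hexdigest()
```

Both `real_ok` and `forged_ok` come out true, which is the non-repudiation gap: nothing in the session binds the content to the server rather than to the client.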

7. digitaldragon ◴[] No.41864927[source]
Unfortunately, Browsertrix relies on the Chrome Devtools Protocol, which strips transfer encoding (and possibly transforms the data in other ways). This results in Browsertrix writing noncompliant WARC files, because the spec requires that the original transfer encoding be preserved.
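
To illustrate what stripping the transfer encoding loses, here is a sketch of HTTP/1.1 chunked decoding, assuming a simple body with no chunk extensions or trailers. The sample bytes are made up. Two different on-the-wire framings decode to the same payload, so a tool that only ever sees the decoded body cannot reconstruct the original framing.

```python
def decode_chunked(raw: bytes) -> bytes:
    # Minimal decoder: does not handle chunk extensions or trailer headers.
    body = b""
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        size = int(raw[pos:line_end], 16)  # chunk size is hexadecimal
        if size == 0:
            break  # a zero-length chunk ends the body
        start = line_end + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return body


# Two hypothetical wire representations of the same payload.
wire_a = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
wire_b = b"b\r\nhello world\r\n0\r\n\r\n"

decoded_a = decode_chunked(wire_a)
decoded_b = decode_chunked(wire_b)
```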
replies(1): >>41866285 #
8. ikreymer ◴[] No.41866285[source]
Unfortunately, there is not much we can do about transfer-encoding, but the data is otherwise exactly as is returned from the browser. Browsertrix uses the browser to create web archives, so users get an accurate representation of what they see in their browser, which is generally what people want from archives.

We do the best we can with a limited standard that is difficult to modify. Archiving is always lossy, we try to reduce that as much as possible, but there are limits. People create web archives because they care about not losing their stuff online, not because they need an accurate record of transfer-encoding property in an HTTP connection. If storing the transfer-encoding is the most important thing, then yes, there are better tools for that.

replies(1): >>41867034 #
9. CorentinB ◴[] No.41867034{3}[source]
You could use a proxy.

"Archiving is always lossy" No.

replies(2): >>41867085 #>>41867198 #
10. nikisweeting ◴[] No.41867085{4}[source]
You're talking to the guy who built the best proxy recorder in the archiving industry ;) ikreymer created https://pywb.readthedocs.io/en/latest/

I think he has more context than any of us on the limits of proxy archiving vs browser based archiving.

But also if you really need perfect packet-level replication, just wireshark it as he said. Why bother with WARCs at all?

replies(1): >>41867325 #
11. ikreymer ◴[] No.41867198{4}[source]
Every archiving tool out there makes trade-offs about what is archived and how. No one preserves the raw TLS encrypted H3 traffic because that's not useful. When you browse through an archiving MITM proxy, there are different trade-offs: there's an extra HTTP connection involved (that's not stored), a fake MITM cert, and a downgrade of H2/H3 connection to HTTP/1 (some sites serve different content via H2 vs HTTP/1.1, can detect differences, etc...)

The web is best-effort, and so is archiving the web.

12. pabs3 ◴[] No.41867325{5}[source]
pywb has WARC issues too, due to use of warcio:

https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem

13. fasa99 ◴[] No.41880475[source]
>ArchiveBox is also gearing up to support real cryptographic signing of archives

That's a really interesting point. The gut reaction is "why are we wasting time on adding a nice-to-have such as a very fancy cousin of the MD5 checksum when the real meat of the time & effort is maximizing data download and scale"

But then, go read the book 1984 and it may become clear why it's important to ensure the data is unchanged down the road.

But if this is a hedge against hypothetical future 1984 world, one would have to ask - what if the only file available has the wrong md5sum? Because then most people would say, "welp, something is better than nothing" and that's it. Perhaps something that might provide additional information about what/how/where something was changed in more detail.
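
One possible reading of "additional information about what/how/where something was changed" is to store a list of per-block hashes next to the whole-file checksum, so a mismatch can be narrowed down to specific blocks. A minimal sketch, assuming fixed-size blocks and SHA-256; the block size and function names are arbitrary choices for illustration, not something ArchiveBox is stated to do.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(expected: List[str], data: bytes, block_size: int = BLOCK_SIZE) -> List[int]:
    # Indexes of blocks whose hash differs, plus any missing or extra blocks.
    actual = block_hashes(data, block_size)
    longest = max(len(expected), len(actual))
    return [
        i for i in range(longest)
        if i >= len(expected) or i >= len(actual) or expected[i] != actual[i]
    ]


original = b"the quick brown fox"
manifest = block_hashes(original)  # stored alongside the archive at capture time

tampered = b"the quick green fox"
diff = changed_blocks(manifest, tampered)
```

Here `diff` lists only the blocks covering the altered word, so the rest of the file can still be trusted. Fixed-size blocks only localize in-place edits; an insertion or deletion shifts every later block, which is where content-defined chunking or a Merkle tree would be the more robust choice.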