This is a brilliant system relying on a randomised consensus protocol. I wanted to do my info sec dissertation on it, but its security model is extremely well thought out. There wasn't anything I felt I could add to it.
The IA has tried distributing their stores, but nowhere near enough people actually put their storage where their mouths are.
As for technical attacks, I'm not an expert but I'd assume it's more difficult for bad actors to bring down decentralized networks. Has the BitTorrent network ever gone offline because it was hacked, for example? That seems like it would be extremely hard to do; not even the movie industry managed to take them down.
Typically because most people who have the upload capacity don't know that they can. And if they come to the notion on their own, they won't know how.
If they put the notion to a search engine, the keywords they come up with probably don't return the needed ELI5 page.
As in: "How do I [?] for the Internet Archive?" Most folks won't know what [?] needs to be.
The design is really very good.
If different data always gets a different reference, it's easy to know if you have enough backups of it. If the same name gets you a pile of snapshots taken under different conditions, it's hard to be sure which of those is the one that we'd want to back up for that particular name.
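To make that concrete, here's a minimal sketch of content addressing (plain SHA-256 standing in for a real CID scheme):

    import hashlib

    def content_id(data: bytes) -> str:
        # The reference is derived from the bytes themselves, so
        # identical data always maps to the identical ID.
        return hashlib.sha256(data).hexdigest()

    snapshot_a = b"<html>archived page, crawl 1</html>"
    snapshot_b = b"<html>archived page, crawl 2</html>"

    print(content_id(snapshot_a) == content_id(snapshot_a))  # True: same bytes, same reference
    print(content_id(snapshot_a) == content_id(snapshot_b))  # False: any change means a new reference

Counting backups of a reference is then just counting the distinct nodes that hold those exact bytes.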
For a large-scale archival project, it might not be ideal. Maybe something based on erasure coding would be better. Do you know how LOCKSS compares?
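To be clear on what erasure coding buys you, here's a toy single-parity scheme where any one of the stored chunks can be lost and rebuilt. Real archival systems would use something like Reed-Solomon to survive several losses, and as far as I know LOCKSS itself relies on many whole replicas plus integrity polling rather than erasure codes:

    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    chunks = [b"aaaa", b"bbbb", b"cccc"]  # equal-sized data chunks
    parity = reduce(xor, chunks)          # store chunks + parity on different nodes

    # Lose any single chunk; XOR-ing the survivors with the parity rebuilds it:
    rebuilt = reduce(xor, [chunks[0], chunks[2], parity])
    assert rebuilt == chunks[1]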
I was looking into using R2 as a web seed for the torrent but I don't _really_ want to spend much to upload content that is going to get "stolen" and reuploaded by content farms anyway, you know?
* What is a "bird famine", and did one happen in 1880?
* Did any astrologer ever claim that the constellations "remember" the areas of the sky, and hence zodiac signs, that they belonged to in ancient times before precession shifted them around?
* Who first said "psychology is pulling habits out of rats", and in what context? (That one's on Wikiquote now, but only because I put it there after research on IA.)
Or consider the recently rediscovered Bram Stoker short story. That was found in an actual library, but only because the library kept copies of old Irish newspapers instead of lining cupboards with them.
The necessary documents to answer highly specific questions are very boring, and nobody has any reason to like them.
https://github.com/internetarchive/dweb-archive/blob/master/...
History has always gotten rewritten. If you have a giant library, it's easier for bad actors to gain influence and alter certain books, or remove them. This isn't just theoretical: under external pressure, IA has already removed sites from its archive for copyright and political reasons.
There are also threats that are generally not even considered because they happen so rarely, but when they happen they're devastating. The Library of Alexandria was burned by Julius Caesar during a war. Likewise, if all your servers are in one country, that's a geographic risk: they can get destroyed in the event of a war or the like. No one expects this to happen today in the US, but archives should be robust long term, for decades, ideally even centuries.
What are some legal torrent trackers?
(this doc is 5-6 years old though, and I'm not sure what may have changed since then)
In my own (toy-scale) IPFS experiments a couple of years ago it was rather usable, but the software was also utterly insane for operators and users, and if I were IA I would only consider it if I budgeted for a from-scratch rewrite (of the stuff in use). Nearly uncontrollable and unintrospectable, with high resource use for no apparent reason.
What's the point of using IPFS then? Others can still spread the file elsewhere and verify it's the correct one, by using the exact same ID of the file, although on two different networks. The beauty of content-addressing I guess.
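That verification is just: hash what you fetched, compare with the ID you asked for. A sketch with a plain SHA-256 digest standing in for the network's actual ID format:

    import hashlib

    def matches(path: str, expected_hex: str) -> bool:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Stream in 1 MiB blocks so big archives don't need to fit in RAM.
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest() == expected_hex

(IPFS CIDs are really hashes of a chunked DAG rather than of the raw file bytes, but the trust property is the same.)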
Was that any file in particular? I just tried it myself with a 257 MB PDF (as reported by `ls -lrth`) and it doesn't seem to add that much overhead:
    $ du -sh ~/.ipfs
    84K	/home/user/.ipfs
    $ ipfs add ~/Downloads/large\ PDF\ File.pdf
    added QmSvbEgCuRNZpkKyQm6nA5vz5RTHW1nxb6MJdR4cZUrnDj large PDF File.pdf
    256.58 MiB / 256.58 MiB [============] 100.00%
    $ du -sh ~/.ipfs
    264M	/home/user/.ipfs
Especially if it's about having an Internet Archive backup.
We are talking about an (almost) worldwide archive after all.
Centralized entities emerge to absorb costs because nobody else can do it as efficiently alone.
I would wager at least 95% of "digital memory" archived is absolute garbage, from SEO spam to small websites holding no actual value.
The true digital memory of the world is almost entirely behind the walls of reddit, twitter, facebook, and very few other sites. The internet landscape has changed massively from the 90s and 2000s.
Most casual visitors to IA don't know that. Which is the point.
Giving up is for others.
With the 30-second "time to first byte" speed we all know and love from IA, I'm pretty sure it'd only get faster when you're the only person accessing an obscure document on a random person's shoebox in Korea, as compared to trying to fetch it from a centralised server that has a few thousand other clients to attend to simultaneously.
Depending on scale that’s not necessarily true. I find even today there are many services that cannot keep up with my residential fiber connection (3Gbps symmetrical), whereas torrents frequently can. IA in particular is notoriously slow when downloading from their servers, and even taking into account DHT time torrents can be much faster.
Now if all of their PBs of data were cached in a CDN, yeah that’s probably faster than any decentralized solution. But that will take a heck of a lot more money to maintain than I think is possible for IA.
Sort of like the bittorrent algorithm that favors retrieving and sharing the least-available chunks if you haven't assigned any priority to certain parts.
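(That's BitTorrent's "rarest first" piece selection strategy.) A toy sketch of the idea, with made-up data structures:

    from collections import Counter

    def pick_next_piece(needed: set[int], peers_have: list[set[int]]) -> int | None:
        # Count how many peers hold each piece we still need,
        # then request the least-available one first.
        availability = Counter()
        for have in peers_have:
            availability.update(have & needed)
        candidates = [p for p in needed if availability[p] > 0]
        return min(candidates, key=lambda p: availability[p]) if candidates else None

    # Piece 2 is held by a single peer, so it's fetched (and re-shared) first:
    print(pick_next_piece({0, 1, 2}, [{0, 1}, {0, 1, 2}, {0}]))  # -> 2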
https://news.ycombinator.com/item?id=41860909
I'd never heard of it, but their responses to questions and comments in that thread were really, really good (and I now have "install and configure ArchiveBox on the media server" on my upcoming weekend projects list).
Criminals using tools does not make the tools criminal.
Would people be willing to buy an IA box that hosted a shard of random content along with the things they wanted themselves?
To me that's not even related to it being a torrent tracker, just that they were "aiding and abetting" copyright infringement.
This has precedent in illegal drug categorization: it's not just about the damage, but about the ratio of noxious to helpful use.
>What happens when someone storing decentralized data decides to exit?
They exit, and they no longer store decentralized data. At the very least, IA would still have their copy (or copies), and that data can be spread to other decentralized nodes once it has been determined (through timeouts, etc.) that the person has exited.
> Will data be copied to multiple places[...]?
Ideally, yes. It is fairly trivial to determine the reliability of each member (uptime + hash checks), and reliable members (a few nines of uptime and hash matches) can be trusted to store data with fewer copies while unreliable members can store data with more copies. Could also balance that idea with data that's in higher demand, by storing hot data lots of times on less reliable members while storing cold data on more reliable members.
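A sketch of that replica-count policy, with invented availability numbers (and assuming member failures are independent):

    import math

    def replicas_needed(member_availability: float, target: float = 0.9999) -> int:
        # Add independent copies until at least one is expected to survive:
        # 1 - (1 - a)^n >= target
        n = math.ceil(math.log(1 - target) / math.log(1 - member_availability))
        return max(n, 1)

    print(replicas_needed(0.99))  # 2 copies on members with two nines of uptime
    print(replicas_needed(0.80))  # 6 copies on flaky members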
> who pays for the decentralized storage long term? [...] who is going to pay for doubling, tripling or more the storage costs for backups?
This is unanswered for pretty much any decentralized storage project, and is probably the only important question left. There are people who would likely contribute to some degree without a financial incentive, but ideally there would be some sort of reward. This in theory could be a good use for crypto, but I'd be concerned about the possible perverse incentives and the general disdain the average person has for crypto these days. Funding in general could come from donations received by IA, whatever excess they have beyond their operating costs and reserve requirements - likely nowhere near enough to make something like this "financially viable" (i.e. profitable), but it might be enough to convince people who were on the fence to chip in a few hundred GB and some bandwidth. This is an open question though, and probably the main reason no decentralized storage project has really taken off.
In Law the technicalities matter.
Trackers generally do not host any content, just hashcodes and (sometimes) metadata descriptions of content.
If "your" (ie let's say _you_ TZubiri) client is distributing child pornography content because you have a partially downloaded CP file then that's on _you_ and not on the tracker.
The "tracker" has unique hashcode signatures of tens of millions of torrents - it literaly just puts clients (such as the one that you might be running yourself on your machine in the example above) in touch with other clients who are "just asking" about the same unique hashcode signature.
Some tracker affiliated websites (eg: TPB) might host searchable indexes of metadata associated with specific torrents (and still not host the torrents themselves) but "pure" trackers can literally operate with zero knowledge of any content - just arrange handshakes between clients looking for matching hashes - whether that's UbuntuLatest or DonkeyNotKong
Unfortunately, when I talked to a few archival teams (including the IA) about whether they'd be interested in using it, I either got no response or a negative one.
On the other hand, I also believe that a tracker that hosts hashes of illegal content, provides search facilities for them, and facilitates their download is responsible, in a big way. That's my personal opinion, and I think it's backed by cases like The Pirate Bay and Sci-Hub.
That zero-knowledge tracker is interesting; my first reaction is that it's going to end up in very nasty places like Tor, onion services, etc.
Downloading from example.com is just peer to peer with someone big. There's lots of hosting providers and DNS providers that are happy to host illegal-in-some-places content.
Most actual trackers are zero knowledge.
A tracker (bit of central software that handles 100+ thousand connections/second) is not a "torrent site" such as TPB, EZTV, etc.
A tracker handshakes torrent clients and introduces peers to each other, it has no idea nor needs an idea that "SomeName 1080p DSPN" maps to D23F5C5AAE3D5C361476108C97557F200327718A
All it needs is to store IP addresses that are interested in that hash and to pass handfuls of interested IP addresses to other interested parties (and some other bookkeeping).
From an actual tracker PoV the content is irrelevant and there's no means of telling one thing from another other than size - it's how trackers have operated for 20+ years now.
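In code terms the core really is just a key-value store of hash -> interested peers. A toy sketch, ignoring the actual announce wire format, timeouts, and scrape stats:

    from collections import defaultdict

    # infohash -> set of (ip, port) pairs currently interested in it
    swarms: dict[bytes, set[tuple[str, int]]] = defaultdict(set)

    def announce(infohash: bytes, peer: tuple[str, int], want: int = 50) -> list[tuple[str, int]]:
        # Record the caller as interested and hand back some other interested
        # peers. No file content ever passes through here, only opaque hashes.
        others = [p for p in swarms[infohash] if p != peer][:want]
        swarms[infohash].add(peer)
        return others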
Here are some actual tracker addresses and ports:
udp://tracker.opentrackr.org:1337/announce
udp://p4p.arenabg.com:1337/announce
udp://tracker.torrent.eu.org:451/announce
udp://tracker.dler.org:6969/announce
udp://open.stealth.si:80/announce
udp://ipv4.tracker.harry.lu:80/announce
https://opentracker.i2p.rocks:443/announce
Here's the BitTorrent protocol: http://bittorrent.org/beps/bep_0052.html
Trackers can hand out .torrent files if asked (bencoded dictionaries that describe filenames, sizes, checksums, and directory structures of a torrent's contents) but they don't have to; mostly they hand out peer lists of other clients .. peers can also answer requests for .torrent files.
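For reference, bencoding itself is tiny; a minimal encoder covering the four types the spec defines:

    def bencode(value) -> bytes:
        # The four bencode types: integers, byte strings, lists, dicts.
        if isinstance(value, int):
            return b"i%de" % value
        if isinstance(value, bytes):
            return b"%d:%s" % (len(value), value)
        if isinstance(value, list):
            return b"l" + b"".join(bencode(v) for v in value) + b"e"
        if isinstance(value, dict):  # keys are byte strings, serialized in sorted order
            return b"d" + b"".join(bencode(k) + bencode(v) for k, v in sorted(value.items())) + b"e"
        raise TypeError(type(value))

    print(bencode({b"name": b"example.iso", b"length": 1234}))
    # b'd6:lengthi1234e4:name11:example.isoe'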
A .torrent file isn't enough to determine illegal content.
Pornography can be contained in files labelled "BeautifulSunset.mkv" and Rick Astley parody videos can frequently be found in files labelled "DirtyFilthyRepublicanFootTappingNudeAfrica.avi"
Given that, it's not clear how trackers could effectively filter content that never actually traverses their servers.
* Strictly speaking, running in-browser, but that sounded like "Bowser" so I wrote online instead.
What there isn't is a currently maintained and advertised client and plan. That I can find. Clunky or not, incomplete or not.
There are other systems that have a rough plan for duplication and local copy and backup. You can easily contribute to them, run them, or make local copies. But not IA. (I mean you can try and cook up your own duplication method. And you can use a personal solution to mirror locally everything you visit and such.) No duplication or backup client or plan. No sister mirrored institution that you might fund. Nothing.
Mathematically, a tracker would offer a function that, given a hash, returns a list of peers with that file.
While a "torrent site" like TPB or SH, would offer a search mechanism, whereby they would host an index, content hashes and english descriptors, along with a search engine.
A user would then first need to use the "torrent site" to enter their search terms and find the hash, then give the hash to a tracker, which would return the list of peers?
Is that right?
In any case, each party in the transaction shares liability. If we were analyzing a drug case or a people-trafficking case, each distributor, wholesaler or retailer would bear liability and face criminal charges. A legal defense of the type "I just connected buyers with sellers, I never exchanged the drug" would not have much chance of succeeding, although it is a common method to obstruct justice by complicating evidence gathering. (One member collects the money, the other gives the drugs.)
> Is that right?
More or less.
> In any case, each party in the transaction shares liability.
That's exactly right Bob. Just as a telephone exchange shares liability for connecting drug sellers to drug buyers when given a phone number.
Clearly the telephone exchange should know by the number that the parties intend to discuss sharing child pornography rather than public access to free to air documentaries.
How do you propose that a telephone exchange vet phone numbers to ensure drugs are not discussed?
Bear in mind that in the case of a tracker the 'call' is NOT routed through the exchange.
With a proper telephone exchange the call data (voices) pass through the exchange equipment; with a tracker, no actual file content passes through the tracker's hardware.
The tracker, given a number, tells interested parties about each other .. they then talk directly to each other; be it about The Sky at Night -s2024e07- 2024-10-07 Question Time or about Debbie Does Donkeys.
Also keep in mind that trackers juggle a vast volume of connections of which a very small amount would be (say) child abuse related.
There are so many proven distributed archiving systems, a lot of which are mentioned in these comments.
https://docs.google.com/document/d/1qKgIjUTef-I-BLWjn4sEIbYo...
I'll write up a more detailed article on it, though; it'll be good to at least have the doc public somewhere.
In practice, that's mostly how they're being used.
But the protocol does support mutation. The BEP describing the behavior even has archive.org as an example...
> The intention is to allow publishers to serve content that might change over time in a more decentralized fashion. Consumers interested in the publisher's content only need to know their public key + optional salt. For instance, entities like Archive.org could publish their database dumps, and benefit from not having to maintain a central HTTP feed server to notify consumers about updates.
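Mechanically, that's BEP 44's mutable DHT items: the lookup target is derived from the publisher's public key (plus an optional salt), and each update is a signed, sequence-numbered value, so nodes can accept newer versions without trusting anyone. A rough sketch with a dict standing in for the DHT (the real signed payload is a bencoded blob; this simplified one is just seq + value):

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    dht: dict[bytes, tuple[int, bytes, bytes]] = {}  # target -> (seq, value, signature)

    signing_key = Ed25519PrivateKey.generate()
    pubkey_bytes = signing_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)

    def target(salt: bytes = b"") -> bytes:
        # Consumers only ever need the public key (+ salt) to find the latest value.
        return hashlib.sha1(pubkey_bytes + salt).digest()

    def publish(value: bytes, seq: int, salt: bytes = b"") -> None:
        sig = signing_key.sign(seq.to_bytes(8, "big") + value)
        stored = dht.get(target(salt))
        if stored is None or seq > stored[0]:  # nodes only accept newer sequence numbers
            dht[target(salt)] = (seq, value, sig)

    publish(b"infohash-of-dump-2024-01", seq=1)
    publish(b"infohash-of-dump-2024-02", seq=2)  # followers now resolve to the new dump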
If this is what people think we need to work on education...
I miss when TPB used to have a CSV of all their magnet links; their new UI is trash. I can't even find anything like in the old days. TPB is pretty much a dying old relic.
So long as this distributed protocol has the concept of individual files, there _will_ be clients out there that allow the user to select `popular-site.archive.tar.gz` and not `less-popular.tar.gz` for download.
And what one person doesn't download... they can't seed back. Distributed stuff is really good for low cost, high scale distribution of in-demand content. It's _terrible_ for long term reliability/availability, though.
Risk management is a balance, not the fearmongering you describe. That's why I'd rather take advice from people with daily experience than look at the newsworthy incidents (you'll never see "nothing happened today, again; regular security patches working fine") and conclude you'd attract threats and cyber attacks just by hosting backup copies of parts of the Internet Archive.
Right now there are torrents, and I do keep any torrents I download from IA in my client for years, but torrents mean I only get to contribute by sharing the things I downloaded in the past.
Side note: as an outsider, and someone who hasn't tried either version of Freenet in almost 2 decades, was this schism kind of like the Python 2 vs. Python 3 kerfuffle? Is there more to it?
[0]: https://www.hyphanet.org/
[1]: https://freenet.org/
If you have a RAID, then you have 2 copies with something like 99.99% availability and a mean time to failure measured in years.
With a volunteer drive you have ?% availability and ? years to failure. You can't depend on it.
Also, the average value of the data is very low; you don't want to be making many copies of it for no reason.
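For intuition, under the (generous) assumption that copies fail independently:

    def combined_availability(per_copy: float, copies: int) -> float:
        # Chance that at least one of n independent copies is reachable.
        return 1 - (1 - per_copy) ** copies

    print(combined_availability(0.9999, 2))  # a mirrored pair: ~99.999999%
    print(combined_availability(0.50, 2))    # two flaky volunteer drives: 75%
    print(combined_availability(0.50, 10))   # ten flaky volunteer drives: ~99.9%

So unknown-reliability volunteers aren't useless, but you pay for their flakiness in copy count.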
I'll restate the principle of the good-usage-to-bad-usage ratio: telephone providers are a well-established service with millions of legitimate users and uses. Furthermore, they are a recognized service in law, they are regulated, and they can comply with law enforcement.
They are closer to the ISP, which according to my theory has some liability as well.
It's just a matter of the liability being small and the service to society being useful and necessary.
To take a spin to a similar but newer tech, consider crypto. My position is that its legality and liability for illegal usage by users (considering that of exchanges and online wallets, since the network is often not a legal entity) will depend on the ratio of legitimate to illegitimate use that will be given to it.
There's definitely a second-system effect, where undesirables go to the second system, so it might be a semantic difference unrelated to the technical protocols. Maybe if one system came first, or if by chance it were the most popular, the tables would be turned.
But I feel more strongly that there are design features that make law compliance, traceability and accountability difficult. In the case of trackers, perhaps the microservice/object is a simple key-value store, but it is semantically associated with other protocols which have the 'noxious' features described above AND are semantically associated with illegal material.
> Also, the average value of the data is very low; you don't want to be making many copies of it for no reason.
The reason is that the value of that data is high to the archivist, since they want to preserve it.
Realistically you won't get enough volunteer-storage to cover one IA. And even if you did, it wouldn't satisfy the mission requirements, which is to store reliably for decades all of the data.
The average person, in my experience, can barely work a non-cellphone filesystem and actively stresses when a terminal is in front of them, even for a brief moment. Education went out the window a decade ago.
Well, OK, maybe other webpage archives don't work as well, I haven't tried them, but there are others. And they're newer, so don't have such extensive historical pages.
Large numbers of Wikipedia references (which relied on IA to prevent link rot) must be completely broken now.
Neither version of Freenet is designed for long-term archiving of large amounts of data, so it probably isn't ideally suited to replacing archive.org, but we are planning to build decentralized alternatives to services like Wikipedia on top of Freenet.
[1] https://freenet.org/faq/#why-was-freenet-rearchitected-and-r...
Ditto trackers.
Have a look at the graphs here: https://opentrackr.org/
Over 10 million torrents tracked daily, on the order of 300 thousand connections per second, handshaking between some 200 million peers per week.
That's material from the Internet Archive, software releases, pooled filesharing, legitimate content sharing via embedded clients that use torrents to share load, and a lot of TV and movies that have variable copyright status.
( One of the largest TV|movie sharing sites for decades recently closed down after the sole operator stopped bearing the cost and didn't want to take on dubious revenue sources; that was housed in a country that had no copyright agreements with the US or UK and was entirely legal on its home soil.
Another "club" MVGroup only rip documentaries that are "free to air" in the US, the UK, Japan, Australia, etc. and in 20 years of publicaly sharing publicaly funded content haven't had any real issues )
> the ISP, which according to my theory has some liability as well.
The world's a big place.
The US MPA (Motion Picture Association - the big five) backed an Australian mini-me group, AFACT (Australian Federation Against Copyright Theft), to establish ISP liability in a G20 country as a beachhead bit of legislation.
That did not go well: Roadshow Films Pty Ltd v iiNet Ltd decided in the High Court of Australia (2012) https://en.wikipedia.org/wiki/Roadshow_Films_Pty_Ltd_v_iiNet...
The alliance of 34 companies unsuccessfully claimed that iiNet authorised primary copyright infringement by failing to take reasonable steps to prevent its customers from downloading and sharing infringing copies of films and television programs using BitTorrent.
That was a three strikes total face plant: The trial court delivered judgment on 4 February 2010, dismissing the application and awarding costs to iiNet.
An appeal to the Full Court of the Federal Court was dismissed.
A subsequent appeal to the High Court was unanimously dismissed on 20 April 2012.
It set a legal precedent: This case is important in copyright law of Australia because it tests copyright law changes required in the Australia–United States Free Trade Agreement, and set a precedent for future law suits about the responsibility of Australian Internet service providers with regards to copyright infringement via their services.
It's also now part of Crown Law .. ie. not directly part of the core British Law body, but a recognised bit of Commonwealth High Court Law that can be referenced for consideration in the UK, Canada, etc.

> but it is semantically associated with other protocols which have the 'noxious' features described above AND are semantically associated with illegal material.
Gosh, semantics, hey. Some people feel in their waters that this is a protocol used by criminals and must therefore be banned or policed into non-existence?
Is that a legal argument?
I also indicated above that having knowledge of .torrent manifests is problematic, as that doesn't provide real actual knowledge of file contents, just knowledge of file names ... LatestActionMovie.mkv might be a rootkit virus and HappyBunnyRabbits.avi might be the worst, most exploitative underage pornography you can think of.
Some trackers are also private and require membership keys to access.
I was skating a lot as TZubiri seems unaware of many of the actual details and legitimate use cases, existing law, etc.