I think it's the preservation of information, but I also believe 90% is absolutely pointless. There is just so much of it, and data storage so cheap, that it makes sense to just save everything.
It is fortunate to me that someone thought to hang on to that note since I have become interested in genealogy and this was a fairly significant event in family history (had he not died I don't suppose I would be around since it was her second marriage that gave me my grandfather).
I long for scraps of anything that my dead relatives wrote, created, etc. It connects me better to the past: the lives they lived, how they lived them. It somehow grounds me a little better ... well, it's rather hard to explain the draw of genealogy.
Sadly very little of the ephemera of everyday life was kept. I get it. It might have seemed like hanging on to junk mail — like you were a hoarder or whatever, but in this digital era we should be able to hold terabytes of what may appear to be ephemera.
I'm doing what I can – not for ego, I think, but for future generations that may find a connection to their past interesting.
Though just the music compressed in opus at 128kbit might work ok, 60 years of 24/7 128kbit is 30TB, so that would fit on 1 large HDD currently.
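For anyone who wants to check the 30TB figure, here is a quick back-of-the-envelope sketch. It assumes a constant 128 kbit/s stream, decimal units (1 TB = 10^12 bytes) and 365.25-day years; the function name is just for illustration.

```python
def stream_size_bytes(bitrate_kbit_s: float, years: float) -> float:
    """Bytes needed to store a constant-bitrate stream that runs 24/7."""
    seconds = years * 365.25 * 24 * 60 * 60
    bytes_per_second = bitrate_kbit_s * 1000 / 8
    return bytes_per_second * seconds


size = stream_size_bytes(128, 60)
print(f"{size / 1e12:.1f} TB")  # prints "30.3 TB"
```

So the estimate holds up, ignoring container overhead and any redundancy you would want for a backup copy.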
There is certainly a quantity of stuff online that is absolutely worth saving, but there's a considerably larger proportion that's just redundant to the point of being unremarkable and pointless. The trick is filtering, which can be capital-H Hard. That's why some may want to err on the side of over-collecting to reduce the possibility of missing something that will actually be important someday.
Do your research first. Use standards.
E.g.: HTML, PDF, H.264/H.265/AV1 in an MP4 container, CHD, ZIP and so on, depending on what you are storing.
It is not. It is only so for us normal people. But for the companies which log our lives in order to capitalize on them, the internet is not ephemeral. They have copies of videos, pages, podcasts, whatever can be found there.
Why would you want those companies to know more about yourself than you do?
Definitely no smiling:
Alamy is selling scans of ad prints from the 1850s.
I'll highlight, summarise and take notes of what I save. Or some combination of those. If I don't find anything new or directly applicable to my life, I'll let it pass by.
This approach isn't good for archival purposes, but I hesitate to save a lot of things that I'll never read again.
Paper lasts for centuries without any attention beyond keeping it moderately dry and away from things that eat it.
I'd also hazard a guess that most people in this camp would want these companies not to store these things either, just as they don't want individuals to.
I'm a link hoarder myself (over 13k links on Pinboard: https://pinboard.in/u:pmigdal/). While I don't revisit most of them, some have proven invaluable for re-reading and sharing. I'm not sure about the typical half-life of internet content, but a lot disappears—whether because people stop paying for domains, official websites get reorganized (or their content removed), or other reasons.
This is where the Internet Archive steps in, doing the essential work of a digital librarian. I often share links from its Wayback Machine, which has been a link-saver more times than I can count.
There were like a dozen algs. I kept links to nice papers with diagrams. Then they started disappearing. Now I'd be pressed to find 2.
This is really useful info that is apparently disappearing. So yes, it happens, and maybe you should save that stuff.
Scholars will write papers on this era, speculating what it was like and how it fit into what came after.
The web documents the massive societal changes underway which do not relate to the internet directly. Things like changes in transportation technology, medicine, sexuality and gender, and how your average people felt about all of it. Scholars will data mine those opinions to understand who felt what ways and why, with the benefit of hindsight. New knowledge will come of it.
So yeah! It is all worth saving.
Spam often comes from the same mailbox as order receipts and includes words like “order”, while messages with receipts never include the word “receipt”. When you are inundated with ad spam from the same company daily, or sometimes several times a day, it becomes very difficult to filter out everything except the receipts in order to clean a neglected inbox.
After I’m gone, I fully expect my family just to delete it all because the signal to noise is so low.
In a way, the Internet is both too ephemeral (self-hosted blogs disappear, YouTube videos get taken down) and too persistent at the same time; I don't think that most Twitter posts of non-public figures would need to remain public forever by default, for example, and I don't think I need to mention various data breaches.
The Internet Archive somewhat mitigates the first issue, but it makes me pretty nervous that there's essentially just one organization doing what used to be much more distributed to various physical libraries.
For the second one, I hope we'll see better solutions (both technical and social) as the technology and our interactions with it mature.
And then, hopefully, there are nice personal blog posts like this one, showing you that you are not alone having some peculiar habits and so that it might make sense even if most people don't even think about it.
I have the exact same feeling when I discover through HN, blog posts and events that I'm not the only one with web browsers full of tabs. Literally thousands of tabs.
Bonus points:
I don't need a 30/50/100 GB mailbox (and the associated mailbox cost nowadays).
Search is not only fast, but if I didn't find something, then it simply isn't in the mailbox.
It's mentally pleasurable to log in once in a while and throw a bunch of unneeded stuff into the trash bin, quite similar to a real-life room cleaning.
The one thing I have in my garage I know someone would want is a big pile of laserdiscs. But, again, a thrift shop (or my library) wouldn't want them and I live pretty far out from a major city. Probably will try Craigslist post-winter though as I'm trying to declutter.
But, it often isn't cheap to do, and in that case, it makes sense to prioritize. The high priority items for me are the things that I might want to share, the ideas I want to amplify for my contemporaries and future generations that might examine my life. Stuff like [1] [2] and [3] which has influenced my thinking fundamentally, that I hope to build upon so that others can build upon what I have built.
I'd argue that you do this intuitively: you're mentioning a letter from your family's past because it is a high priority item--it's relevant because it was the last written words of your great-grandmother's first husband.
But, there's a lot that isn't worth keeping. My first form of archiving as a teenager was keeping ticket stubs for movies and concerts--a decade later I was going through my pile and found that I didn't even remember most of them. The better movies, I remembered--and I had them on DVD. The better concerts, I remembered--and I also had journal entries and CDs to remember the experience and the music. It's not important to me where/when I saw Everything, Everywhere, All At Once in theaters, but I have it on DVD and I can't wait to show it to my niece when she's older. And sure, I saw Amigo the Devil live, but frankly, he's not an artist you need to see in concert--the greatest impact of Cocaine and Abel[4] on me was when I listened to it alone in my room. The ticket stubs simply don't matter to me.
[1] https://www.viridiandesign.org/notes/451-500/the_last_viridi...
[2] https://www.ted.com/talks/brene_brown_the_power_of_vulnerabi...
[3] https://digital.wpi.edu/pdfviewer/wm117p10z
[4] https://www.youtube.com/watch?v=ZzjtLm0G49E
EDIT: All the things linked above, I have backed up in one form or another. Notably, the Schutt paper isn't at its original URL.
I have 1 terabyte of data in 1860, how do I make sure the storage medium is still intact in 2024?
So the fact that ChatGPT can replace much of the web actually says less about the marvel of ChatGPT and more about the lack of anything really worthwhile because the profound just happens to be the least economically valuable.
The other thing is that searching for the long tail is near impossible. The big sites dominate Google, so I need something like marginalia to actually get to the old stuff that it used to be so easy to find. Because of the median user having simple queries, some questions are no longer answerable on Google: they are dominated by the median user and never show up.
My ex, however had a much more fluid relationship with the internet and media in general. They liked new things, and didn't particularly care if they enjoyed something and it faded into obscurity. I feel like that's the winning mentality, but I just can't bring myself to embrace it.
Come 2020, I was convinced that the world was going to end. So I simply... turned off the retention rule. One hour of news is around 5GB, but that's a very-high-bitrate MPEG-2 stream with an extra audio channel in Spanish. So I instead wrote a cron job to take that week's news, drop the stuff I don't care about, and H.264 the entire set of them down to 4.7GB, then burn them to a DVD for offline storage, since there's not much value to keeping them online.
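A rough sketch of the arithmetic behind squeezing a week onto one disc. The 7 hours per week, the 2% allowance for container and filesystem overhead, and the function names are all assumptions for illustration; the comment doesn't say how much footage survives the trimming.

```python
DVD_BYTES = 4.7e9  # nominal single-layer DVD capacity, decimal gigabytes


def source_mbit_s(bytes_per_hour: float) -> float:
    """Average bitrate of a recording, given its size per hour."""
    return bytes_per_hour * 8 / 3600 / 1e6


def target_total_kbit_s(hours: float,
                        capacity_bytes: float = DVD_BYTES,
                        overhead: float = 0.02) -> float:
    """Combined audio+video bitrate that fits `hours` of footage on the disc.

    `overhead` is an assumed allowance for container and filesystem overhead.
    """
    usable_bits = capacity_bytes * (1 - overhead) * 8
    return usable_bits / (hours * 3600) / 1000


print(f"{source_mbit_s(5e9):.1f} Mbit/s")       # prints "11.1 Mbit/s"
print(f"{target_total_kbit_s(7):.0f} kbit/s")   # prints "1462 kbit/s"
```

Under those assumptions the recording goes from roughly 11 Mbit/s MPEG-2 down to about 1.5 Mbit/s total, which is the number you would feed to the encoder as its target bitrate, minus whatever the audio track needs.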
By 2022, it was obvious the world was not, in fact, ending, but I never stopped this practice because of how simple it was, and how unobtrusive to store they are. I just make sure a fresh DVD is in the NAS every week, and put the DVDs on a spindle - they collectively take up about as much room as a toaster. I could make that even smaller and simpler if I opted for a portable hard drive.
Occasionally I'll manually toss something interesting in, like the presidential debates, or special coverage of some newsworthy event.
In 20 years, when it comes time to re-burn the earliest of them, maybe I'll make a value judgment on whether that's worth it, but for now it feels like I'd be losing something for not much of a good reason.
Is there something that "just works"?
When I had returned to the game after bombs-away.net had gone defunct, I posted my own personal archive to the GoG forum for the game. Now that I've returned to the Redux version I find my own files, with my personal notes, shared by a single other soul who had similarly maintained an archive, and apparently had collected mine at some point. I'm very glad to have helped preserve knowledge - but not everything of mine was there. Now that I've noticed the 2024 remaster effort and joined that community, I've been able to share files that were otherwise apparently completely lost - in particular, a set of images showing dimensions of certain common features in bombing targets, that allow estimating the total size of the target.
Unfortunately, my own personal archive included many forum topics that I just dragged off shortcuts to. I can see the old titles of the pages from the surviving shortcut files. I remember the questions I had (and now have again) that those shortcuts held the answers to. But because I didn't save the page itself, it's.. gone. That's immensely frustrating.
Yes, things are worth saving. Especially for topics with extensive information among a small niche audience that have a single point of failure. I've found an extension (SingleFileZ) that does a good job of archiving a web page with all embedded content into what's a zip file under the hood - so futureproof even if the extension disappears and it becomes difficult to simply open the file directly in browser.
EDIT - montebicyclelo mentions SingleFile, which apparently is a continuation of SingleFileZ, with new features. SingleFileZ already allowed automatically saving every visited page in a tab (or even among all tabs), batch archiving of a list of URLs, etc., so presumably SingleFile has all these capabilities and more.
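To illustrate the "zip file under the hood" point: assuming the saved page really is a ZIP archive (possibly with some other bytes in front of it), a generic ZIP reader should be able to list and extract its contents without the extension. The sketch below builds a stand-in file in memory rather than using a real saved page, so the member names and the leading stub are made up.

```python
import io
import zipfile


def list_archive_members(data: bytes) -> list[str]:
    """Return the member names of a ZIP archive, even if other bytes precede it."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


# Build a stand-in for a saved page: some leading bytes followed by a ZIP archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("index.html", "<p>saved page</p>")
    archive.writestr("images/logo.png", b"\x89PNG")
hybrid = b"<!-- hypothetical viewer stub -->\n" + buffer.getvalue()

print(list_archive_members(hybrid))  # prints ['index.html', 'images/logo.png']
```

Whether a given saved file opens this way depends on how the extension actually writes it, so it's worth trying on one of your own files before relying on it.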
IMO, this content is of high cultural value and I fear it won’t be long that the goog suffers us to watch “their” content without infecting it with ads.
I wish there was an easier way to self host this content with a way to organize and browse using tags.
All that's necessary is making it possible for people seeking out your type of product to find you. And for revolutionary products, there's word of mouth.
If anything I think capitalism would function better without ads, because I would argue that advertising overall results in less informed customers, especially the modern lifestyle/brand type of advertising that's clearly quite effective at manipulating people.
You can't dedupe the large binary files because they're encoded in small parts likely differently every time they're posted.
It's hard to assign a value to these things. They are simultaneously junk and treasure. I think about the journey these items took to find their way to that flea market table. It was too diverse a collection to have come from one place. So I imagine all the paths each individual item traversed. The joy of the recipient reading a post card, holding on to it, rediscovering it on spring cleaning days. Or the photo living in an album or framed on a wall somewhere for a lifetime.
I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.
Most likely it is not worth it. But people should not do only things that are “worth doing”. Then again, if something brought you joy but was a complete waste of time, it was worth it.
Hate dementors who tell you otherwise. It is a limited lifetime, but it is yours. You should be helpful to others, but doing only “what is worth it” sucks the beauty out of existence.
Well, except future historians who may find value in "personal" information (although I guess we've got such a surfeit of recorded "personal" information these days compared to even just 50 years ago, it may not be quite as useful as when they find, e.g., some Babylonian tablet with a shopping list on. But you never know!)
You click the "browser action" icon/button of the extension and it saves a single HTML file that looks exactly like the webpage you have open.
From its FAQ[1] on GitHub:
# What does SingleFile do?
SingleFile is a browser extension designed to help users save web pages as complete, self-contained files. The extension's primary function is to capture an entire web page, including its HTML, CSS, JavaScript, images, and other resources, and package them into a single HTML file.
# I am a web archivist, is it ok to use SingleFile to archive content?
No, SingleFile is not a tool used by professionals to archive content on the Web, especially in the academic field. Professionals prefer to rely on tools based on the WARC specification instead.
[1] https://github.com/gildas-lormeau/SingleFile/blob/master/faq...
Also, it's helpful to get perspective on the attitudes for or against a new technology in recent history. I remembered there were people that said "If you aren't writing a kernel, you don't have their problems, so you don't need git." Turns out that's not true. Now that git is everywhere, it's harder to remember whether or even if there was pushback against it.
This was written about the insights from using git that he needed to highlight to people back then. https://keithp.com/blog/Repository_Formats_Matter/
I often reference it, and if it wasn't still up, I'd have only web archive to rely on.
So for me, lots of the stuff I look at online (mainly blog posts) is worth saving. Sometimes, if the discussion is in a Twitter thread, that too. Which makes me fear for the day Microsoft decides to do GitHub in, and we lose all the issues and comments.
If the ancestor before you is hoarding anything that comes across their path, be it digital ads or every physical greeting card they’ve ever gotten, the problem is with the person’s collection habits, not the medium.
There's also print-to-PDF that most OSes now have.
I set it to tolerate longer processing times, and to open the file after saving so I can sanity check that it got everything. Works great at faithfully saving a page with images as it appears in browser, and saves so much time.
You might also have a look at https://github.com/ArchiveBox/ArchiveBox
These sometimes disappear from the Web. Or disappear except for some third-party site that modifies and/or paywalls them.
Also, save the occasional important support info Web pages for those products. You'll know it when you see it. And if you don't save it now, it might be gone when you need it.
You don't need a fancy system for this. I just made a directory `~/doc/`, and started dropping files into it. Someday, I'll take the time to merge this with `~/wiki/`, but for now, I'm capturing the information with low friction, which is most important.
But hey, there's more whitespace now.
I went to a fantastic talk a few years ago at the British Library about digitizing a substantial quantity of historic Australian newspapers. It was amazing to be able to read funeral announcements, product advertisements and other signals from the past showing us Australian culture from the 1800s.
Since we leave much less behind in terms of physical assets (personal letters, postcards, personal diaries), we should at least aspire to archive more from the digital realm, or to future historians we'd look like a blank century.
There is also the option to have the contents compressed, and (a separate option) to keep the plaintext of the file uncompressed, which will likewise still allow indexing to work while saving space.
1. SingleFile lets me save an HTML file that looks exactly like the webpage I saved. I have never used a save-to-PDF function in any browser that produced a PDF that looks exactly like the page I was saving or printing. I wish browsers would implement that. Somebody did it once: they patched Chromium to save a web page as SVG[1]. AFAIK, if you can save to SVG you can also save to PDF without much modification to the code. Unfortunately, the fork is no longer maintained.
2. The HTML files that SingleFile creates are responsive (just like the webpage you had open); PDFs are not. I like that because it makes it easier to read the saved page on my phone later. With a PDF saved on your desktop, you have to pinch to zoom and pan while you read it on your phone.
3. HTML files are accessible to screen readers, and my browser's extensions work on them. Extensions don't work on PDF files (they _can_ work on HTML files opened from disk, if you enable it in the extension's settings).
That's not a question of wants, companies will always know more about you than you, for the simple reason that even if you had all their data you have no means to extract any meaning from it. It requires immense organization and resources, increasingly so as the rate of data production increases.
For that reason the correct response isn't to engage in the same hoarding and privacy abuse of the companies, it's like bringing a knife to a tank fight, but to 1. make sure you don't produce that data to begin with through privacy protections and technical means and 2. create environments in which you have ownership of your data, instead of businesses.
There was a ton of great stuff 30 years ago, and I think it's definitely worth saving.
The Internet was a very different place, but it was quite real 30 years ago, and I think the idea that the further back you go the more valuable this kind of thing is is the right way of looking at it.
I recently deleted about 40,000 emails. Most of them were identical, duplicate marketing emails. I was forced to do this to free up storage.
That's when I realized something. I am paying my email provider full price for every byte of "represented" data. In reality, their distributed file systems could compress an arbitrary number of copies of these emails and only consume the amount of space that one email consumes. So 100,000 duplicate emails on the server are consolidated into one representation of the data, but each customer has to pay for each byte that is represented.
The vendor stores a file once and charges full price every time they reproduce it for someone. If you have 10,000 copies of a file, they only have to store it once, but you pay for every byte in all 10,000 copies.
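Here is a toy sketch of the kind of deduplication being described: messages are stored once under a content hash, and each mailbox only holds references. This is an illustration of the idea, not a claim about how any particular provider works; the class and method names are invented.

```python
import hashlib


class DedupStore:
    """Toy content-addressed mail store: identical messages are kept once."""

    def __init__(self):
        self.blobs = {}      # digest -> message bytes, stored a single time
        self.mailboxes = {}  # user -> list of digests (references only)

    def deliver(self, user: str, message: bytes) -> None:
        digest = hashlib.sha256(message).hexdigest()
        self.blobs.setdefault(digest, message)
        self.mailboxes.setdefault(user, []).append(digest)

    def billed_bytes(self, user: str) -> int:
        """The "represented" size: every copy counted at full length."""
        return sum(len(self.blobs[d]) for d in self.mailboxes.get(user, []))

    def stored_bytes(self) -> int:
        """What the store actually keeps on disk in this model."""
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
newsletter = b"x" * 1000
for _ in range(10_000):
    store.deliver("alice", newsletter)

print(store.billed_bytes("alice"))  # prints 10000000
print(store.stored_bytes())         # prints 1000
```

In practice, per-recipient headers usually make "identical" marketing emails differ by a few bytes, so a real system would have to dedupe bodies or chunks rather than whole messages.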
Why still back up floppies if you could just move the data to a single DVD, or throw it on the SAN?
RAID is just algorithms, the actual transport doesn't matter (i.e. spinning platter and solid state both use SATA connectors).
Sometimes I wonder if there's an even easier browser-builtin function that does the same?
Storage keeps growing and the price of storage keeps going down.
My DOS and even some C64 source code has made it to this day on backups (DVDs, HDDs, SSDs, USB memory sticks, etc., both online and offline) and onto ZFS pools: media that didn't exist in the 80s/early 90s.
Floppy disks -> 40 MB HDD -> 6.4 GB HDD -> 80 GB HDD -> 500 GB HDD -> 240 GB SSD -> 1 TB NVMe SSD.
You get the idea.
The way you make sure you still have your data is by focusing not on the medium but on the fact that data is data.
Media come and go. Data can (and should) be copied to new media.
Not unlike:
/home/pub/backups/oldBackups/DOSbackups/...
...Conner80MBHDDbackups/backups/oldBackups/Commodore64backups/...
Some people are going to complain about the naming, but I have all my emails, except for six months' worth, going back to when I started using the Internet. And I still have nearly all of my data from when I started using computers. 8-bit computers. Do you?
I don't care about naming much. "search, don't sort".
We've got emulators for just about every and any system. My vintage arcade cab has both real PCBs and a Pi running an emulator with thousands of arcade games on it.
You can already, today, emulate, say, the Raspberry Pi model you want using QEMU. There are container files that'll gladly do that for you.
Unless civilization ends, there's simply not a world in which, say, PNG, JPG and x265 files aren't readable. This just won't happen.
FWIW I'm paranoid about the integrity of my data: I've got my own naming scheme where a cryptographic hash is added to many of my files.
For example:
DSC_91394-b3-ae4f2877d3.jpg
This means "This file's Blake3 checksum begins with ae4f2877d3". I then have a script doing statistical sampling: I enter a percentage and that percentage of files where a cryptographic hash is part of the filename are checked, randomly (if I enter 100 then 100% of the files are tested).
If I enter for example '7', then 7% of the files are tested and then there's high probability all checksums are correct.
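A minimal sketch of what such a sampling checker could look like. This is not the commenter's script. BLAKE3 isn't in Python's standard library, so BLAKE2b stands in for it here (swap in a BLAKE3 package if you have one), and the `-b3-<hex>` filename pattern is inferred from the single example above.

```python
import hashlib
import random
import re
from pathlib import Path

# Matches names like "DSC_91394-b3-ae4f2877d3.jpg" and captures the hex prefix.
TAG = re.compile(r"-b3-([0-9a-f]+)\.[^.]+$")


def digest_hex(path: Path) -> str:
    # Stand-in hash: BLAKE2b from the standard library, NOT BLAKE3.
    h = hashlib.blake2b()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_sample(root: Path, percent: float, rng=None) -> list[Path]:
    """Verify a random `percent` of tagged files under `root`; return failures."""
    rng = rng or random.Random()
    tagged = sorted(p for p in root.rglob("*")
                    if p.is_file() and TAG.search(p.name))
    if not tagged:
        return []
    # Always check at least one file, never more than exist.
    count = min(len(tagged), max(1, round(len(tagged) * percent / 100)))
    failures = []
    for path in rng.sample(tagged, count):
        expected_prefix = TAG.search(path.name).group(1)
        if not digest_hex(path).startswith(expected_prefix):
            failures.append(path)
    return failures
```

Calling `check_sample(Path("/some/archive"), 7)` (the path is a placeholder) checks about 7% of the tagged files. One caveat on the statistics: any single file lands in a 7% sample only about 7% of the time, so a small sample is good at catching widespread damage but can easily miss one isolated bad file. Running it regularly, or at 100% now and then, covers that gap.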
> On what physical medium?
That is the wrong question.
There are second order effects of ads that we’d need to consider. Facebook and Google wouldn’t exist as we know them. Maybe that means some of their research doesn’t happen?
To my dismay, some saved files (.htm extension) didn't open when I wanted to open them.
So I'm glad people are discussing ways to archive web pages that reproduce the original page faithfully.
[1] https://chromewebstore.google.com/detail/singlefile/mpiodijh...
Saving stuff just isn’t fun or useful for me. Never for more than a passing moment have I thought, “Boy I wish I had saved that whatever.”
Old people are the worst about this stuff. They think/hope somebody will want it and then just make it the next generation’s problem.
I told my dad if he thinks it has value, give it away while he’s alive. I have neither the interest nor the space to deal with it so it’s going straight into the trash.
I can still right-click > Save any page as .html, but that doesn't guarantee server streamed stuff, media, images, etc. will be preserved correctly.
https://blog.archive.org/2013/11/22/a-dream-to-preserve-tv-n...
[1] https://newsroom.churchofjesuschrist.org/topic/genealogy
Since they only stored text, they would make a single db entry for each unique line of text that came in and just made more and more references to that.
Even different emails… were mostly the same.
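A rough sketch of that line-level scheme, with an in-memory dict standing in for the database: each unique line is stored once and a message is just a list of references to lines. The class and method names are invented for illustration.

```python
class LineStore:
    """Toy store that keeps each unique line of text once."""

    def __init__(self):
        self.line_ids = {}  # line text -> id
        self.lines = []     # id -> line text
        self.messages = []  # each message is a list of line ids

    def add(self, text: str) -> int:
        """Store a message and return its id."""
        ids = []
        for line in text.split("\n"):
            if line not in self.line_ids:
                self.line_ids[line] = len(self.lines)
                self.lines.append(line)
            ids.append(self.line_ids[line])
        self.messages.append(ids)
        return len(self.messages) - 1

    def get(self, message_id: int) -> str:
        """Rebuild the original text from its line references."""
        return "\n".join(self.lines[i] for i in self.messages[message_id])


store = LineStore()
first = store.add("Hello,\nYour order has shipped.\nThanks!")
second = store.add("Hello,\nYour order was delivered.\nThanks!")

print(len(store.lines))  # prints 4
```

The two messages have six lines between them but only four unique ones get stored, which is why mostly-identical emails cost so little in a design like this.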
I've noticed people becoming more interested in genealogy when they - let me phrase this delicately - reach a certain age. My speculation is that it is a component of grappling with one's own mortality. As the grays and wrinkles multiply, some obsess over healthy eating and exercise, some wealthier ones invest in immortality research, some get blood boys, and the rest feel an urgent need to research our genealogy; any detritus that shows our progenitors existed proves some trace of us having been here will remain, and perhaps our existence means something, as time cruelly keeps marching on.
Thankfully, the FastMail interface makes 'search from this address' and 'search to this address' (I'm using per-service addresses) and then 'select all', 'delete' actions a breeze.
> I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.
I purchase, scan, and resell those kinds of things. I'd love to have a centralized, public repository in which to store the data. As our tech gets better at extracting data from that material more and more interesting applications could be developed. Imagine being able to find 100+ year old photos of your ancestors via facial recognition and extracted metadata searches.
I wish I could come up with a non-profit business model that worked for preserving that kind of stuff. I would love to gather up the historical ephemera that's being lost, catalog it via manual and automated processes, and make it available to the public. (Yes, I am aware there are privacy concerns. It's a pie-in-the-sky idea. I just hate to see all of the previously captured and curated effort that went into ephemera cast to the winds.)
Luckily there are (currently) multiple playlists of it on Youtube.
But they might not be there next year.
Mostly it just goes away at death.
It might be interesting to read:
https://en.wikipedia.org/wiki/Digital_hoarding
I have trouble letting go of things, and I found it interesting to read through.
There's a part of me that thinks "It would be so useful to 100% automatically log and cache everything I do and be able to search it". But I think maybe being healthy means not doing it.