@brewsterkahle It's become very acutely apparent that the Internet Archive is an enormous single point of failure. Almost none of the Archive's collections are replicated externally by other organizations. The eggs are all in one basket.
Should the Archive fail for any number of legal, policy, or financial reasons, most of the archived information ever collected will be erased.
@kazriko @brewsterkahle That's completely the wrong way to look at it. It doesn't NEED to be, and SHOULDN'T BE, only one other organization. Or even a small number.
It should be many, many organizations hosting whatever subsets of the data make sense. Every university and educational institution on earth should naturally host some shard of the IA.
@kazriko @brewsterkahle A law school could host the complete collection of all laws, changes to laws, and government publications and messages and stats.
A school of film could host all the film materials of the 1920s.
A biology research institute could mirror the complete collection of all publications in the field.
Many of these schools already host extensive libraries. But there are huge technical, organizational, and taxonomic barriers to them replicating data from the IA.
@kazriko @brewsterkahle The IA as it exists today is a disorganized file pile. It's very hard for an organization to select some shard, and even harder for organizations to communicate what subset they have, which makes it impossible to gauge coverage between replicating institutions (or to know who to go to for a specific thing).
Then there's the fact that the IA just isn't set up for export. You can't just call up the IA as a large university and ask for a box of LTO tapes containing a petabyte of data.
@kazriko @brewsterkahle There doesn't seem to be much in the way of organization at all. You pretty much get either {video, images, software, documents} at several PB each, or you get trillions of zip files each containing *one* upload. Maybe at best all the uploads from *one* of millions of people. That's impossible to organize replication efforts around.
@kazriko @brewsterkahle To make this practical, all the data in the IA would need to be sorted into sets/collections. The ballpark granularity is probably between 100 and 1000 parts, together covering the totality of the archive. There has to be a small enough number that human beings could ask each other questions like "do you have the law collection?" or "do you have the film collection, 1900-1929?"
@kazriko @brewsterkahle Librarians and others who work in taxonomy are screaming in horror. But the exact dividing lines don't matter.
What matters is that the list of shards is small enough and comprehensible enough that human beings can fit it in their heads. It has to be practical for people to work out what parts have been replicated and by how many institutions, without anything falling through the cracks. Even if you have to visit 3 different categories before you find what you're looking for.
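To make the "who has what" bookkeeping concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the shard names, the institutions, and the `min_copies` threshold are assumptions, not anything the IA publishes. It only shows that with a short, fixed shard list, gauging coverage becomes a trivial lookup.

```python
# Hypothetical shard list and holdings, invented for illustration only.
SHARDS = ["law", "film-1900-1929", "biology-publications"]

HOLDINGS = {
    "Example Law School": ["law"],
    "Example Film School": ["film-1900-1929"],
    "Example University Library": ["law", "film-1900-1929"],
}


def coverage(shards, holdings):
    """Map each shard to the sorted list of institutions holding a copy."""
    holders = {shard: [] for shard in shards}
    for institution, held in holdings.items():
        for shard in held:
            if shard not in holders:
                # A shard outside the agreed list is exactly the kind of
                # thing that would otherwise fall through the cracks.
                raise ValueError(f"unknown shard: {shard!r}")
            holders[shard].append(institution)
    return {shard: sorted(names) for shard, names in holders.items()}


def under_replicated(shards, holdings, min_copies=2):
    """Return the shards held by fewer than min_copies institutions."""
    cov = coverage(shards, holdings)
    return [shard for shard in shards if len(cov[shard]) < min_copies]


if __name__ == "__main__":
    for shard, names in coverage(SHARDS, HOLDINGS).items():
        print(f"{shard}: {len(names)} copies ({', '.join(names) or 'nobody'})")
    print("needs more mirrors:", under_replicated(SHARDS, HOLDINGS))
```

With a list of 100 to 1000 shards, the same table could just as easily live in a shared spreadsheet; the point is only that the shard names are a vocabulary everyone agrees on.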
@kazriko @brewsterkahle And new editions of this data need to be possible, as the archive evolves. Maybe people order a new box of tapes every 5 years, or as often as every two years for rapidly evolving fields. Maybe only once for data that is fixed and unchanging. (How many new 1920s films are unearthed every year? etc.)