The archive fallacy
One of the most persistent concerns in the digital world is the fear that we might create a large amount of digital data that we are subsequently unable to interpret. Specifically, can we guarantee 1) the integrity of the storage media and 2) that there will be someone around who is able to interpret the data?
In the days before the availability of cheap high-capacity disk drives the first question was a real issue. Paper tape, punch cards and the like are fragile media. Magnetic tapes are little better and expensive besides. But in the world of bits rather than atoms the permanence of the physical media is irrelevant; only the bits matter.
Books are also an imperfect archival medium. If we depended on the survival of single copies, the half-life of information published in books would be 500 years or so. Books survive due to redundancy. Of the 500 or so first-edition copies of Copernicus' De revolutionibus, printed in 1543, almost half survive. Older books survive through copying: we know Aristotle and Plato only through multiple generations of copies. Even though the physical medium has a half-life of only 500 years, we can be virtually certain that the ideas contained in the original publication will survive as long as civilization does.
The first question is easily answered through the same processes of massive redundancy and intergenerational copying. Individual disk drives are unreliable, but a distributed storage system with massive redundancy and multiple sites can offer levels of reliability that are vastly greater than any system for storing paper documents.
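The effect of redundancy can be put in rough numbers. The sketch below is a toy model, not a measurement: it assumes each copy decays independently with the 500-year half-life quoted above, and the function names are invented for illustration. Real losses are correlated (fires, wars, format changes), and the model ignores re-copying, which only improves the odds.

```python
def copy_survival(years: float, half_life: float = 500.0) -> float:
    """Probability that a single copy survives `years` (exponential decay)."""
    return 0.5 ** (years / half_life)


def any_copy_survives(years: float, copies: int, half_life: float = 500.0) -> float:
    """Probability that at least one of `copies` independent copies survives."""
    p_lost = 1.0 - copy_survival(years, half_life)  # chance one given copy is lost
    return 1.0 - p_lost ** copies                   # all copies lost, complemented


if __name__ == "__main__":
    # Compare a single copy with increasingly redundant print runs over 2000 years.
    for n in (1, 10, 500):
        print(n, any_copy_survives(2000, n))
```

Under these assumptions a single copy has about a 6% chance of lasting 2000 years, while a run of 500 copies is all but certain to leave at least one survivor.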
As we become increasingly confident that the first question has been solved, we become increasingly concerned by the second: will we be able to make use of the information? Various authors have suggested that we may drown in an ever-increasing 'data smog'. Michael Hart at Project Gutenberg advocates sticking to US-ASCII encoded plaintext with no form of document markup. Similar arguments have found favor in the IETF.
I don't find these arguments at all persuasive. It is true that the use of proprietary document formats has caused problems in the past. In particular, it may prove difficult to find a modern word processor capable of reading files created for (say) a Wang 1200. But to extrapolate from such anecdotes to the conclusion that every document markup scheme risks rendering content inaccessible is complete nonsense.
Billions of documents have been written in HTML, PDF and Word. The file formats are widely documented, and the idea that this knowledge might somehow be mislaid is somewhat odd.
But let us for the sake of argument imagine that, through some form of global collapse of civilization, we somehow lose the ability to interpret HTML. Would the documents encoded in HTML be irretrievably lost? The answer is very clearly no. If we can decipher Linear B and the Egyptian and Mayan hieroglyphs, if a few hundred cryptanalysts could defeat the ENIGMA, PURPLE and VENONA ciphers, it should be abundantly clear that decoding HTML is not going to tax future generations of archeologists, no matter how far the standards may have diverged in a century or more.
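To give a sense of how little is needed, here is a deliberately naive sketch that recovers readable text from HTML using one observation an archeologist could make by inspection: markup sits between angle brackets. The function name and sample document are invented for illustration, and the approach ignores scripts, comments, character entities and everything else a real parser handles.

```python
import re


def recover_text(markup: str) -> str:
    """Recover readable text by discarding anything inside angle brackets."""
    without_tags = re.sub(r"<[^>]*>", " ", markup)  # replace each tag with a space
    return " ".join(without_tags.split())           # collapse runs of whitespace


# Hypothetical sample document, made up for this example.
sample = (
    "<html><body><h1>The archive fallacy</h1>"
    "<p>Only the <b>bits</b> matter.</p></body></html>"
)

if __name__ == "__main__":
    print(recover_text(sample))
```

The structure (which text was a heading, which was emphasized) is lost in this sketch, but the words themselves survive without any knowledge of the HTML specification.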