07 May 2010


Saving Western Civilization Part 2

Yesterday I suggested that vendors of external hard drive backup products should include the full Wikipedia as a part of their offering.  I averred that it would be a shame to be deprived of this critical resource if the internet were destroyed by the crazies.  I received a torrent of comments.

A torrent, i.e., more than one, is unusual for this blog, which often goes many issues with little more than a harrumph from the odd millireader.  Even better, the replies revised and extended my suggestion.  In particular, Kirk had this to say:

I like your Wikipedia idea a lot! I would even go so far as to encourage you to broaden your net to include other irreplaceable items. Some random suggestions:

  • The MIT "rad lab" series. Now digitized and publicly available. Add the Tek "Scopes", HP Journals and app notes, GenRad "experimenters", etc.

  • Online college "distance learning" classes. I think MIT and others have put many of these in the "free" category.

  • Heritage documents (Bill of Rights, Constitution, etc. Hell, even include the ongoing "federal record"!)

  • Scans of books that are in the public domain (Google might be able to help here)

Google might be a particularly good partner in this endeavor as they could easily add a slight hack to search for "local" results first. Plus, bandwidth has got to be a really big thing for them too!

I'm getting more excited the more I think about it :)

To which I replied:

Thanks. Good suggestions all! Actually, I've always wanted my own copy of the whole internet. If I had a program that watched where I browsed and transferred everything I saw and everything linked to it to a special hard drive, it would come close. Unfortunately, if everyone did that, that would succeed in shutting down the internet without cybercrimes.

There is a large amount of public domain material beyond Wikipedia that would be appropriate for the program. It would have to be text because video and pictures and audio would fill up the hard drive so quickly that it would be useless for its main purpose. Even so, selected items, say another 50GB worth, would be quite the worthwhile haul.

As I indicated, there is no reason to limit the information to Wikipedia.  Kirk's suggestions are good ones, and I would love to have the complete series of HP Journals so readily available.  Of course, HP and the general public might feel differently about such a specialized and well-illustrated byte absorber, albeit for different reasons.  The difficulty is in winnowing the great mass of public domain material down to what is both generally regarded as useful and conclusively in the public domain.  I'm sure the vendors would be looking for commercial advantage, not an endless series of discussions, arguments, and even lawsuits.  Wikipedia is available, free, and non-controversial in its usefulness.  Another comment, from Sam, pointed to the on-line availability of Wikipedia dumps, although their format is so specialized that a normal PC user won't want to be bothered.

I ended yesterday's blogitem with this promise: I was going to go into the technical details of how this could work.  I shall discuss that now, after my blush of enthusiasm morphed into a morass of practicality. 

The Morass of Practicality

Although other sites are extremely popular, they don't suggest themselves for this application.  Obviously E-commerce sites such as eBay and Amazon and social sites such as Facebook and Twitter depend on continuous connection to the internet to perform their function.  News sites would turn into archives overnight, and the good ones would lose their ability to charge for searches, so they won't go for it.  On the other end of the scale, relatively stable personal sites and blogs such as this one could usefully be archived but each would be of value only to a small number of fans.  One might argue that Wikipedia isn't stable enough to be archived, but that's only partly true.  Of course the corpus of Wikipedia is changing by the second, but individual articles might be edited anywhere from daily to never, depending on their currency and general interest.  Eventually, the archived Wikipedia will become stale.  Which, of course, gives everyone the opportunity to sell and buy and donate to it all over again!

Let's say you have bought a hard drive with a complete, searchable "snapshot" of Wikipedia installed.  It will also have a small program, perhaps a browser "plug in," that can divert the Wikipedia URL to the hard drive.  Since Wikipedia users will, at least before the deluge, want the current version of the article if it is newer than the disk-based one, this program will:

  • First, actually go to the on-line Wikipedia version of the article.  Assuming Wikipedia is cooperating in this venture, the first few bytes of the article header will have either a "hash" or more simply just a byte count of the article download itself.  (The "hash" would be a mathematically more precise way to confirm that the article hasn't changed.)

  • The browser plug-in would compare these few bytes with those of the same article on the disk.  If they are identical, the browser would load the article from the hard drive and spare Wikipedia the bandwidth.  If they are not, it would download the Wikipedia article from that source, display it in the browser, and, while it's at it, perhaps replace the article on the hard drive with the new one (at some increased software complexity, to be sure).

  • In either case, the up-to-date article would be displayed for the user.  Wikipedia would be spared the bandwidth of serving any article that hasn't changed since the snapshot.  And if the internet were down, the user could still read the article from the hard drive, if not necessarily the absolutely most recent one.
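For the curious, the plug-in's decision can be sketched in a few lines of Python.  This is only an illustration under assumptions of my own:  the names fingerprint, load_article, fetch_fingerprint, and fetch_article are hypothetical, the disk is modeled as a simple dictionary, and SHA-256 stands in for whatever "hash" Wikipedia might agree to publish.  Nothing like this header exists today.

```python
import hashlib
from typing import Callable, Dict, Optional


def fingerprint(article: bytes) -> str:
    """Hash of the article body. len(article) would be the simpler byte-count check."""
    return hashlib.sha256(article).hexdigest()


def load_article(
    title: str,
    disk: Dict[str, bytes],
    fetch_fingerprint: Callable[[str], Optional[str]],
    fetch_article: Callable[[str], bytes],
) -> bytes:
    """Return the freshest copy of an article that can currently be reached."""
    local = disk.get(title)
    # Hypothetical cheap request for the "first few bytes"; None means offline.
    remote = fetch_fingerprint(title)

    if remote is None:
        # The deluge has arrived: fall back to whatever the drive holds.
        if local is None:
            raise LookupError(f"{title!r} is neither online nor on the drive")
        return local

    if local is not None and fingerprint(local) == remote:
        # Unchanged since the snapshot: serve from disk, spare the bandwidth.
        return local

    # Changed (or missing locally): download it and refresh the disk copy.
    fresh = fetch_article(title)
    disk[title] = fresh
    return fresh
```

The two callables are passed in so that the same logic works whether the "network" is the real one or a stand-in, and so that an unreachable internet is just another return value rather than a special case.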

I'm not sure how complex the software provided on the external hard drive must be.  But since these drives are sold by the hundreds of thousands in chain stores, the development cost will be spread out over enough units to make it minimal.  The job of adding a hash or a byte-count header falls to Wikipedia, but that can be accomplished by a script that takes a few milliseconds to run every time an article is updated.  Now what about you, the owner of your own personal Wikipedia?

Flaws and Energy Efficiency

There is one flaw in my scheme:  In order to be transparent and appear instantaneous to the user, the external backup hard drive must remain on all the time.  Normally these are powered up only when needed.  Although this is a small increment of power roughly equivalent to a cellphone charger or similar wall wart, it isn't nothing, and adds up over hundreds of thousands of units.  To be "green," it would be nice to keep the power drain low.  Here are a few semi- or full solutions.

  • Of course, one could just use it manually.  I.e., turn it on when civilization otherwise collapses, and use the "real" Wikipedia at other times. 

  • Alternatively, one could leave the drive on all the time, at the small cost of the wasted energy, and have all the benefits. 

  • An intermediate suggestion:  Transfer just the article titles and the hash or byte count values to the computer's own hard drive, which is always on.  This file would be in the tens of megabytes, and could be used to determine whether to turn on the external drive to get the article.  The user could set a threshold:  Download from Wikipedia for a short article, turn on the drive for a long one, with the penalty of a short wait while the drive spins up.  This threshold might be adjusted automatically depending on the speed of the user's internet connection.

Personally, I'd pick the third solution.  I'm impatient, but not terminally so.
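The third option's threshold is easy enough to sketch.  Again, this is a guess at one reasonable rule, not a specification:  IndexEntry, threshold_bytes, and use_external_drive are hypothetical names, the byte count is used as the freshness check, and the five-second spin-up time is an assumption rather than a measurement.  The idea is simply that the drive is worth waking only when downloading would take longer than the spin-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One row of the small always-available index: title plus byte count."""
    title: str
    byte_count: int


def threshold_bytes(link_bytes_per_second: float, spin_up_seconds: float) -> int:
    """Largest article that downloads faster than the drive can spin up."""
    return int(link_bytes_per_second * spin_up_seconds)


def use_external_drive(
    entry: IndexEntry,
    remote_byte_count: int,
    link_bytes_per_second: float,
    spin_up_seconds: float = 5.0,  # assumed spin-up delay, not a measured one
    online: bool = True,
) -> bool:
    """Decide whether to wake the external drive for this article."""
    if not online:
        return True  # no choice: the drive is the only copy left
    if remote_byte_count != entry.byte_count:
        return False  # disk copy is stale, so download the current one
    # Disk copy is current: wake the drive only for articles that are
    # "long" relative to the speed of this user's connection.
    return entry.byte_count > threshold_bytes(link_bytes_per_second, spin_up_seconds)
```

Because the threshold is computed from the connection speed, a user on a slow link ends up waking the drive for nearly everything, while a user on a fast one rarely does.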

Although I promised "action" in the form of beseeching the external drive manufacturers to consider this, I've deferred epistolary activities until I finished writing part two, whose end we have just reached.  I'll do it shortly.

"The House at Pooneil Corners"
Jefferson Airplane





Eventide Stompboxes promotional T-shirt.  Nobody consulted with me on the color scheme.

So what else is new?

Richard Factor
