Link Rot

Posted on March 26, 2018 at 08:58

I have been writing here (or on the predecessor site) since 1996. That means that at the time of writing in 2018, some of that content is over twenty years old. If your reaction to that statement is “that’s plenty of time for something to break” then your instincts are perfectly sound.

It’s Entropy, Man

This site has around 900 outbound links. Some of these are critical to a full understanding of an article, while others are essentially throw-away jokes. Once in a while, someone has pointed out that a link has broken in some way, but until recently I haven’t had a practical way of testing them in bulk.

Since my conversion of the site to the Nanoc static-site generator, I have been able to use the nanoc check external_links command to validate all of them, and I now run that job weekly along with checks on HTML and CSS validity.
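As a rough illustration of what a bulk check has to do first, here is a minimal Python sketch that collects the outbound links from one page of HTML. It is not how Nanoc implements its check; the class and function names and the example.org host are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def external_links(html, own_host):
    """Return absolute http(s) links that point away from own_host."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for link in collector.links:
        parts = urlparse(link)
        # Relative links have no scheme, so they are skipped here.
        if parts.scheme in ("http", "https") and parts.hostname != own_host:
            found.append(link)
    return found
```

Running something like this over every generated page, and then requesting each collected URL, is the essence of an external link check.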

The first time I ran the external_links check, I was greeted with more than 200 failures. In other words, over time something like 25% of the external links from this site had stopped functioning. It has taken a couple of months to work through all of those failures and correct them, and some common themes have emerged.

Some Things Just Don’t Last

There’s more “churn” in terms of company creation and destruction than I had thought. A reference to an innovative new startup is unlikely to survive its demise or its absorption by a mega-corporation by more than a few years. An individual might host some useful information and then disappear from sight overnight.

There’s an unfortunate “unseen iceberg” issue here, too: just because a link appears to validate, that is no guarantee that the content you’re referencing is still the same as it was; it’s entirely possible that the domain is now owned by someone else for something entirely different. If you’re lucky, it’s just a domain holding page or a bizarrely unconnected art retail enterprise; if you’re less lucky, it could be an actively malign malware vector or part of an advertisement network looking for link juice.

I don’t see a way to make this issue visible without manually examining every link. For the future, my hope is that domains don’t transfer instantaneously and that my new weekly check will catch things in transition.

Elephants Have No Memories

You might think that large corporations like Apple, HP or Oracle would be most likely to have the resources available to keep their sites consistent over time. You might think that entities whose purpose is to promulgate information, such as news outlets, standards organisations, national governments and international law bodies would have an interest in stable URLs.

You might think those things, but you’d be wrong. Oh, so wrong.

If you’re a very large organisation, there’s a good chance you have an entire department dedicated to reorganising your web site every few years. In many cases, that department has little or no incentive to make sure that links to a product manufactured for a couple of years back in the last century still work.

Another common pattern is for a site to be based around a custom Content Management System (CMS) bought in from a commercial vendor. These systems tend to have very strong opinions as to what your URLs should look like; when a few years later a different vendor wins the bidding process to provide your CMS, everything changes again and old references are unlikely to survive intact. On the plus side, such organisations tend to have enough cash to design very attractive “404 Not found” pages for you to stare at, so it’s not all doom and gloom in these cases.

Some news sites have experimented with “paywalls” over the years; some are still doing so. I don’t link to these sites in general, because a reader who can’t get past the paywall can’t see the context I’m referring to.

A more curious behaviour I’ve seen with only a few sites (Amazon, the ACM digital library and one technology news outlet) is that although they are happy for you to visit their site in a browser without charge, any attempt to dereference their URLs automatically (for example, in a link verifier) is rebuffed.
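One way to cope with this in a home-grown checker is to treat "refused" differently from "missing". The Python sketch below is only an illustration: the choice of status codes that count as a refusal (401, 403, 429) is an assumption, not something any of the sites above document, and the user agent string is invented.

```python
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status code to a rough verdict for a link report."""
    if 200 <= status < 400:
        return "ok"
    if status in (401, 403, 429):
        # Assumption: these often mean "automated client refused",
        # not "page missing", so a human should look before editing.
        return "check by hand"
    if status in (404, 410):
        return "broken"
    return "error"


def check(url, user_agent="link-check-sketch/0.1"):
    """Fetch url and return (status, verdict).

    Network-level failures (DNS, timeout, refused connection)
    are reported as (None, "error").
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code the server sent back.
        status = err.code
    except (urllib.error.URLError, OSError):
        return None, "error"
    return status, classify(status)
```

A link that comes back as "check by hand" can then be opened in a browser before deciding whether it really needs replacing.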

The simplest way to fix this class of issue is, of course, to link to somewhere else. It’s very rare these days for there to be only one usable reference on the web for any given topic.

Sources of Stability

A lot of the broken links I have repaired have been easy to fix, because the page I was referencing had just moved somewhere else. In many cases, the whole site had moved, but in some cases it was possible to find the same page at a new location using either a search engine or the site’s index.

With many other links, the original material had just disappeared from the web. In these cases, I had to decide whether the reference was critical to the article or not; if the latter, I’ve usually just removed the link. In some cases, I’ve mentioned that fact to explain why the article seems to reference something without actually linking to it. In a very few cases, I have rewritten a section of the article to avoid the need for the reference.

When the reference was critical, more often than not I have ended up making use of one or the other of the World Wide Web’s great sources of stability: Wikipedia and the Wayback Machine.

I have used Wikipedia as a source of stable references for a long time now. Particularly for things that have been around for a few years, the neutral point of view used by its articles makes it an ideal reference for concepts and even products. I don’t find it as good a way of referring to recent events, so I tend to use links to more conventional news sources for those: of course, when those links rot as described earlier, Wikipedia is a possible alternative to migrate them to.

Today, I have a little under 100 links to Wikipedia, which is around 10% of all links.

When all else fails, or when Wikipedia’s article (if it has one) is too neutral to reflect a controversy, I find myself using the Internet Archive Wayback Machine to replace old links. In most cases, using the Wayback Machine means that the reader can see what I’m referencing exactly as it appeared at the time, without the benefit of later changes. About 60 of my links (7%) point to the Wayback Machine today.
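Wayback Machine links follow a simple pattern: the archive's address, a 14-digit timestamp, then the original URL. The Python sketch below builds such a link; the function name and the example URL are hypothetical. The archive serves the capture nearest to the requested timestamp, so the date only needs to be approximately right.

```python
from datetime import datetime


def wayback_url(original_url, when):
    """Build a Wayback Machine link for original_url near the time `when`."""
    # Timestamps in Wayback URLs take the form YYYYMMDDhhmmss.
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{original_url}"


# Example: a link to a hypothetical page as it stood in April 2007.
link = wayback_url("http://example.com/page", datetime(2007, 4, 3))
```

Pinning the link to the date the article was written is what lets the reader see the page as it was then, rather than whatever the latest capture holds.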

Sometimes, It’s Just Gone

In 2007, I wrote an article about a conference at which I had made a presentation. The conference site itself vanished some years ago, and I had already replaced my link to it with a link to an archive of the conference site made by the conference organiser.

In my recent checks, I found that the link from this site to that archive had broken in its turn. The archive site had not moved, but had vanished from the web. My next port of call was of course the Wayback Machine, where I found… nothing.

The archive site had instructed the Wayback Machine not to archive it, and had then disappeared.
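For anyone curious about the mechanism, the usual way for a site to opt out was a robots.txt rule aimed at the Internet Archive's crawler. That is an assumption here: the rule text below is not the real file from the vanished site, and "ia_archiver" is the crawler name the exclusion convention used at the time. A short Python sketch shows how such a rule reads to a crawler that honours it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; not the real file from the vanished site.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def may_archive(robots_txt, url, agent="ia_archiver"):
    """Report whether robots_txt lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocked = not may_archive(ROBOTS_TXT, "http://example.org/conference/")
```

With a rule like this in place, nothing is captured, so once the site itself goes away there is no copy to fall back on.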

At the time of writing, the top search engine hit on “Networkshop 35” is a page that is blank except for the dates of the conference. The second hit is my article on this site. The sixth hit is for my Flickr set from the event. Almost all other hits are false positives.

We, or at least I, have become used to the World Wide Web giving us instant access to anything we want to know, if our Google-fu is strong. The most trivial things are out there somewhere, and I at least had grown into believing that all of that was permanently recorded and available for future historians to wonder at. That’s not true, though, and I was slightly shocked by that realisation.