“A nearly impenetrable thicket of geekitude…”

Follow That Link

Posted on May 19, 2019 at 21:23

I’ve written before about link rot on this site and the various forms it takes. As mentioned there, I now run Nanoc’s check external_links subcommand weekly to catch links when they stop working so that I can fix them.

Not all link rot shows up as a 404 Not Found status, though. Read on for a couple of classes of problem that a recent version of Nanoc helped me uncover and resolve.

Sometimes you can get advance knowledge of possible future problems by paying attention to the 3xx redirection codes. In particular, when dereferencing a URL results in 301 Moved Permanently, RFC 7231 section 6.4.2 suggests (in a non-normative way) that you should update your reference before it stops working:

The 301 (Moved Permanently) status code indicates that the target resource has been assigned a new permanent URI and any future references to this resource ought to use one of the enclosed URIs. Clients with link-editing capabilities ought to automatically re-link references to the effective request URI to one or more of the new references sent by the server, where possible.

(I will note in passing that a normative OUGHT TO is one of the terms defined in RFC 6919.)

I was therefore very pleased when Nanoc 4.11.3 started reporting 301 and 308 permanent redirection status codes as errors (while reporting the target of the redirection so that they are easy to fix). I was less pleased when this turned out to affect more than 30% of my outbound links; that’s a lot of work.
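To make the idea concrete, here is a minimal Python sketch of that kind of check. It is not how Nanoc implements it; the names `fetch_status` and `report` are made up for illustration, and it assumes the server answers HEAD requests sensibly, which not every server does.

```python
import urllib.error
import urllib.request

PERMANENT = {301, 308}  # Moved Permanently, Permanent Redirect


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # decline, so the 3xx response is raised as HTTPError


def fetch_status(url, timeout=10):
    """Return (status, Location header or None) for one HEAD request."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        # With redirects declined, 3xx, 4xx and 5xx responses all land here.
        return err.code, err.headers.get("Location")


def report(url, status, location=None):
    """Return a problem description, or None if the link looks healthy."""
    if status in PERMANENT:
        return f"{url}: {status} permanent redirect to {location}"
    if status >= 400:
        return f"{url}: {status} broken"
    return None
```

Calling `report(url, *fetch_status(url))` for each outbound link would give a crude version of the weekly check, with the redirection target included so that the fix is a simple search and replace.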

I have worked through all 350+ redirecting links now, and a couple of interesting categories seem to dominate: the move to an encrypted web and the unwillingness of large content sites to ever let you go.

Let’s Encrypt Everything

When I started this site, the vast majority of URLs were http:-scheme. Encryption was, after all, resource intensive and frowned upon by governments; the required X.509 certificates were eye-wateringly expensive. There really wasn’t any public concern about the privacy implications of an unencrypted web: the feeling was that encryption was probably needed for banks and online stores, but not really anything else.

Today, things are very different. Encryption is no longer so resource intensive on modern processors, thanks both to general improvements in processor speed and to the addition of instructions which accelerate cryptographic operations. The monetary cost of TLS certificates has been reduced to zero for many sites by initiatives such as Let’s Encrypt. Perhaps most importantly, encrypted connections are expected for much more, post-Snowden and post-RFC 7258’s thesis that “Pervasive Monitoring Is an Attack.”

This shift towards encryption has been dramatic: to take one statistic from Google’s transparency report, the percentage of pages loaded over HTTPS by the Chrome browser in the USA has grown from 45% in 2015 to 90% today, just four years later. This level of change has allowed many browsers to start to warn that http:-scheme pages are “insecure”.

At least 70% of the redirections I have corrected have fallen into this category: as I have done myself, the site has moved to an always-HTTPS policy, and put a blanket redirection in place so that all http: accesses are redirected to the same location on the https: version of the site.

Perhaps a third of those redirections have also changed the resulting host name, most often from www.example.com to just example.com. If you’re putting one big mass redirection in place, you might as well take the opportunity to tidy things up. I was ahead of this one, for once: this site went www.-free back in 2012.
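Sorting redirections into these categories is easy to automate. The following Python sketch labels a redirection by comparing the old and new URLs; the function names and the label strings are invented for this example, and the www. check is deliberately simplistic.

```python
from urllib.parse import urlsplit


def strip_www(host):
    """Drop a leading 'www.' label, if present."""
    return host[4:] if host.startswith("www.") else host


def classify_redirect(old, new):
    """Return a set of labels describing how `new` differs from `old`."""
    a, b = urlsplit(old), urlsplit(new)
    labels = set()
    if a.scheme == "http" and b.scheme == "https":
        labels.add("https-upgrade")
    host_a, host_b = a.hostname or "", b.hostname or ""
    if host_a != host_b:
        if strip_www(host_a) == strip_www(host_b):
            labels.add("www-change")   # only the www. prefix differs
        else:
            labels.add("host-change")  # a genuinely different host
    if (a.path or "/") != (b.path or "/") or a.query != b.query:
        labels.add("path-change")
    return labels
```

A redirection from http://www.example.com/a to https://example.com/a comes out with both the "https-upgrade" and "www-change" labels, which is the tidy-up case described above.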

A Maze of Twisty Little Redirections

I’d like to preface this section with a quotation from the W3C’s 2004 document Architecture of the World Wide Web, Volume One, section 3.5.1:

…the term URI persistence is used to describe the desirable property that, once associated with a resource, a URI should continue indefinitely to refer to that resource.

In the previous article, I observed:

If you’re a very large organization, there’s a good chance you have an entire department dedicated to re-organizing your web site every few years. In many cases, that department has little or no incentive to make sure that links to a product manufactured for a couple of years back in the last century still work.

In some ways, returning a 404 Not Found when a resource no longer exists is a pretty reasonable approach. The question “why would it ever be necessary to stop a resource from existing” is one for the philosophers among us.

It’s also useful, within reason, to use redirections when a resource (or something closely approximating it) can be found elsewhere. For example, the description of a unique product might start off in one location and then move to a new one when the product becomes part of a category. The marketing department might tell you that all your products in a category are being rebranded with a snazzy new name and logo, and that it’s essential for the URLs to reflect this… and so on.

Sites which reorganize frequently can end up handing you a whole chain of redirections before you get to your destination. Your browser won’t tell you about this, but it’s an indicator of future fragility: I’m glad that Nanoc now warns me about it.
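A chain like this is easy to expose once you stop following redirections automatically. Here is a small Python sketch that walks a chain one hop at a time. The `fetch` argument is an assumption of the example: any callable that returns a (status, location) pair for a URL will do, and the table of example.com responses stands in for a real site.

```python
from urllib.parse import urljoin


def redirect_chain(url, fetch, max_hops=10):
    """Return every URL visited, starting with `url` itself.

    `fetch(url)` must return (status, location), where location is the
    Location header value or None.
    """
    chain = [url]
    seen = {url}
    while len(chain) <= max_hops:
        status, location = fetch(chain[-1])
        if not (300 <= status < 400) or not location:
            break  # reached something that is not a redirection
        target = urljoin(chain[-1], location)  # Location may be relative
        chain.append(target)
        if target in seen:
            break  # redirection loop: stop, but leave it visible
        seen.add(target)
    return chain


# Made-up responses standing in for a real site.
FAKE_SITE = {
    "http://example.com/old": (301, "https://example.com/old"),
    "https://example.com/old": (301, "/new"),
    "https://example.com/new": (200, None),
}

chain = redirect_chain("http://example.com/old", FAKE_SITE.__getitem__)
print(len(chain) - 1, "hops:", " -> ".join(chain))
```

Anything longer than a single hop is worth a closer look, since every extra link in the chain is one more thing that can be dropped in the next reorganization.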

What’s worse, in my view, than building long redirection chains in this way is the refusal of some sites to let you go when they finally do retire a resource for whatever reason. Nanoc uncovered dozens of these for me: previously-valid URLs that still claim to be valid (by returning a 301 Moved Permanently) but actually lead to the site’s front page, or a generic search page. I suppose the feeling must be that if I have come to the site, I will be able to find something there that will interest me, even if the thing I came to see has disappeared.
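The front-page variety of this can be spotted mechanically, at least roughly. The Python sketch below flags a redirection whose original URL pointed somewhere specific but whose target is just the root of a site. The function name is invented for illustration, and the heuristic will not catch redirections to a generic search page, which still need a human eye.

```python
from urllib.parse import urlsplit


def looks_like_catch_all(original, target):
    """True if a specific page has been redirected to a bare site root."""
    o, t = urlsplit(original), urlsplit(target)
    original_is_specific = o.path not in ("", "/")
    target_is_root = t.path in ("", "/") and not t.query
    return original_is_specific and target_is_root
```

A plain HTTPS upgrade keeps its path and so passes this test; a link to an old product page that now lands on https://example.com/ does not.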

This behaviour is a disservice to the user and completely unnecessary. It is perfectly possible to present a 404 Not Found status page with search resources, hints for finding mislaid content or Babylon 5 quotes, or anything else you’d like.
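As a sketch of how little is involved, here is a tiny Python WSGI application that serves a helpful page for unknown paths while still telling the truth in the status line. The page table and the HTML are placeholders invented for the example; this is not how this site is served.

```python
from http import HTTPStatus

# Placeholder content standing in for a real site.
PAGES = {"/": b"<h1>Home</h1>"}
NOT_FOUND_BODY = (
    b"<h1>Not found</h1>"
    b"<p>That page has gone. Try the <a href='/search'>search page</a>.</p>"
)


def app(environ, start_response):
    """Serve known pages with 200 and a helpful page with 404 otherwise."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is None:
        status = HTTPStatus.NOT_FOUND  # helpful body, honest status
        body = NOT_FOUND_BODY
    else:
        status = HTTPStatus.OK
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(f"{status.value} {status.phrase}", headers)
    return [body]
```

The visitor sees search hints either way; the difference is that a link checker, or a search engine, is told that the resource has gone.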

[Nerdy aside: that last link is to a rendered version of this site’s 404 page. If you visit the 404 page using that link, or something like curl, you’ll get the page with a 200 OK status. If instead you visit this site at a location that doesn’t resolve to a document, you’ll see the same page in your browser, but it will be served up with a 404 Not Found status. I’m not actually linking to a non-existent resource here because Google would never let me hear the end of it.]