Follow That Link
I’ve written before about link rot on this site and the various
forms it takes. As mentioned there, I now run Nanoc’s link-checking
subcommand weekly to catch links when they stop working so that I can fix them.
Not all link rot shows up as a
404 Not Found status, though. Read on for a
couple of classes of problem that a recent version of Nanoc helped me
uncover and resolve.
Sometimes you can
get advance knowledge of possible future problems by paying attention to the
3xx redirection codes. In particular, when dereferencing a
URL results in
301 Moved Permanently, RFC 7231 section 6.4.2 suggests (in
a non-normative way) that you should update your reference before it stops working:
The 301 (Moved Permanently) status code indicates that the target resource has been assigned a new permanent URI and any future references to this resource ought to use one of the enclosed URIs. Clients with link-editing capabilities ought to automatically re-link references to the effective request URI to one or more of the new references sent by the server, where possible.
(I will note in passing that a normative OUGHT TO is one of the terms defined in RFC 6919.)
I was therefore very pleased when Nanoc 4.11.3 started reporting 301 and 308 permanent redirection status codes as errors (while reporting the target of the redirection so that they are easy to fix). I was less pleased when this turned out to affect more than 30% of my outbound links; that’s a lot of work.
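To make the idea concrete, here is a minimal Python sketch of a check that treats permanent redirections as reportable. It is not Nanoc's implementation (Nanoc is written in Ruby); the function names `describe_response` and `check_link` are my own, and the report format is invented for illustration. The key point is that the request must not follow redirections automatically, so that the 301 or 308 and its `Location` target remain visible.

```python
import http.client
from urllib.parse import urlsplit

# Status codes that signal a permanent move.
PERMANENT_REDIRECTS = {301, 308}


def describe_response(url, status, location=None):
    """Return a report line for one link, or None if there is nothing to report."""
    if status in PERMANENT_REDIRECTS:
        # Include the target so the reference is easy to update.
        return f"{url}: {status} permanent redirect to {location}"
    if status >= 400:
        return f"{url}: {status} error"
    return None


def check_link(url, timeout=10):
    """Send one HEAD request; http.client never follows redirections itself."""
    parts = urlsplit(url)
    conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
    conn = conn_class(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Assumption: the server answers HEAD sensibly; some only handle GET.
        conn.request("HEAD", path)
        response = conn.getresponse()
        return describe_response(url, response.status,
                                 response.getheader("Location"))
    finally:
        conn.close()
```

Calling `check_link` on each outbound URL and printing the non-`None` results would give a rough equivalent of the report described above, under those assumptions.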
I have worked through all 350+ redirecting links now, and a couple of interesting categories seem to dominate: the move to an encrypted web and the unwillingness of large content sites to ever let you go.
Let’s Encrypt Everything
When I started this site, the vast majority of URLs used the http: scheme.
Encryption was, after all, resource intensive and frowned upon by governments;
the required X.509 certificates were eye-wateringly expensive. There really
wasn’t any public concern about the privacy implications of an unencrypted web:
the feeling was that encryption was probably needed for banks and online stores,
but not really anything else.
Today, things are very different. Encryption is no longer so resource intensive on modern processors, thanks both to general improvements in processor speed and to the addition of instructions which accelerate cryptographic operations. The monetary cost of TLS certificates has been reduced to zero for many sites by initiatives such as Let’s Encrypt. Perhaps most importantly, encrypted connections are expected for much more, post-Snowden and post-RFC 7258’s thesis that “Pervasive Monitoring Is an Attack.”
This shift towards encryption has been dramatic: to take one statistic from
Google’s transparency report, the percentage of pages loaded over
HTTPS by the Chrome browser in the USA has grown from 45% in 2015 to 90% today,
just four years later. This level of change has allowed many browsers to start
to warn that
http:-scheme pages are “insecure”.
At least 70% of the redirections I have corrected have fallen into this
category: as I have done myself, the site has moved to an
always-HTTPS policy, and put a blanket redirection in place so that all
http: accesses are redirected to the same location on the
https: version of the site.
Perhaps a third of those redirections have also changed the resulting host
name, most often from
www.example.com to just
example.com. If you’re putting
one big mass redirection in place, you might as well take the opportunity to
tidy things up. I was ahead of this one, for once: this site made the same
change back in 2012.
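The two sub-categories described in this section can be told apart mechanically. The following Python sketch compares a link with its redirection target; the function name `classify_redirect` and its three labels are my own invention, and the rule it applies (same path and query, scheme changed from `http` to `https`) is an assumption about what counts as a blanket redirection.

```python
from urllib.parse import urlsplit


def classify_redirect(source, target):
    """Label a redirection as a plain HTTPS upgrade, an upgrade that also
    changes the host name, or something else entirely."""
    src, dst = urlsplit(source), urlsplit(target)
    upgraded = src.scheme == "http" and dst.scheme == "https"
    same_host = src.hostname == dst.hostname
    # Treat an empty path as "/" so that http://example.com and
    # https://example.com/ count as the same location.
    same_rest = (src.path or "/", src.query) == (dst.path or "/", dst.query)
    if upgraded and same_rest:
        return "https-upgrade" if same_host else "https-upgrade+host-change"
    return "other"
```

For example, `classify_redirect("http://www.example.com/a", "https://example.com/a")` returns `"https-upgrade+host-change"`.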
A Maze of Twisty Little Redirections
I’d like to preface this section with a quotation from the W3C’s 2004 document Architecture of the World Wide Web, Volume One, section 3.5.1:
…the term URI persistence is used to describe the desirable property that, once associated with a resource, a URI should continue indefinitely to refer to that resource.
In the previous article, I observed:
If you’re a very large organization, there’s a good chance you have an entire department dedicated to re-organizing your web site every few years. In many cases, that department has little or no incentive to make sure that links to a product manufactured for a couple of years back in the last century still work.
In some ways, returning a
404 Not Found when a resource no longer exists is a
pretty reasonable approach. The question “why would it ever be necessary to stop
a resource from existing” is one for the philosophers among us.
It’s also useful, within reason, to use redirections when a resource (or something closely approximating it) can be found elsewhere. For example, the description of a unique product might start off in one location and then move to a new one when the product becomes part of a category. The marketing department might tell you that all your products in a category are being rebranded with a snazzy new name and logo, and that it’s essential for the URLs to reflect this… and so on.
Sites which reorganize frequently can end up handing you a whole chain of redirections before you get to your destination. Your browser won’t tell you about this, but it’s an indicator of future fragility: I’m glad that Nanoc now warns me about it.
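Because the browser hides the intermediate hops, it can help to walk a chain explicitly. This Python sketch does that; `redirect_chain` and its `fetch` parameter are hypothetical names, with `fetch(url)` assumed to return a `(status, location)` pair for a single request that does not follow redirections. Passing the fetcher in keeps the logic testable without touching the network.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_chain(url, fetch, max_hops=10):
    """Return the list of URLs visited, starting with the original one.

    fetch(url) must return (status, location) without following redirections.
    """
    chain = [url]
    while len(chain) <= max_hops:
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break  # reached a final response
        # Location may be relative, so resolve it against the current URL.
        next_url = urljoin(chain[-1], location)
        if next_url in chain:
            break  # redirection loop; stop rather than spin
        chain.append(next_url)
    return chain
```

A chain with more than two entries means the link passes through at least two redirections before arriving, which is the fragility indicator discussed above.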
What’s worse, in my view, than building long redirection chains in this way is
the refusal of some sites to let you go when they finally do retire a resource
for whatever reason. Nanoc uncovered dozens of these for me: previously-valid
URLs that still claim to be valid (by returning a
301 Moved Permanently) but
actually lead to the site’s front page, or a generic search page. I suppose the
feeling must be that if I have come to the site, I will be able to find
something there that will interest me, even if the thing I came to see has gone.
This behaviour is a disservice to the user and completely unnecessary. It is
perfectly possible to present a
404 Not Found status page with search
resources, hints for finding mislaid content or Babylon 5 quotes,
or anything else you’d like.
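Redirections that quietly dump a retired URL on the front page or a search page can be flagged with a simple heuristic. The sketch below is only that: `looks_like_catch_all` is a hypothetical name, and the `/search` path prefix is an assumption that will not match every site's layout.

```python
from urllib.parse import urlsplit

# Assumption: search pages live under a path like this one.
SEARCH_HINTS = ("/search",)


def looks_like_catch_all(source, target):
    """Guess whether a redirection replaced a specific resource with a
    generic landing page (the front page or a search page)."""
    src_path = urlsplit(source).path.rstrip("/").lower()
    dst_path = urlsplit(target).path.rstrip("/").lower()
    if not src_path:
        return False  # the link already pointed at the front page
    if not dst_path:
        return True  # a deep link now lands on the front page
    return (dst_path.startswith(SEARCH_HINTS)
            and not src_path.startswith(SEARCH_HINTS))
```

A link flagged this way deserves the same treatment as one that returns `404 Not Found`, since the original resource is no longer there.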
[Nerdy aside: that last link is to a rendered version of this site’s 404
page. If you visit the 404 page using that link, or something like it,
you’ll get the page with a
200 OK status. If instead you visit this site at a
location that doesn’t resolve to a document, you’ll see the same page in
your browser, but it will be served up with a
404 Not Found status. I’m not
actually linking to a non-existent resource here because Google would never let
me hear the end of it.]
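The distinction in that aside, the same page body served with two different status codes, can be sketched in a few lines of Python. This is not how this site is actually served; the `/404.html` path, the `PAGES` table and the handler name are all hypothetical, chosen only to show the pattern.

```python
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlsplit

NOT_FOUND_PATH = "/404.html"  # hypothetical location of the rendered 404 page

# Hypothetical site content, keyed by path.
PAGES = {
    "/": "<h1>Home</h1>",
    NOT_FOUND_PATH: "<h1>Not found</h1><p>Try the search box.</p>",
}


def respond(path):
    """Return (status, body): real pages get 200, anything else gets the
    404 page's body together with a genuine 404 status."""
    if path in PAGES:
        return 200, PAGES[path]
    return 404, PAGES[NOT_FOUND_PATH]


class NotFoundAwareHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string when looking up the page.
        status, body = respond(urlsplit(self.path).path)
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

Passing `NotFoundAwareHandler` to `http.server.HTTPServer` and calling `serve_forever()` would serve `/404.html` with `200 OK` and any unknown path with the same body but a `404 Not Found` status.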