Short and Canonical URLs

Slashdot just posted an article about using canonical URLs for URL shortening that, in my opinion, does a poor job of explaining the problem. So here’s a breakdown that I hope helps.

Multiple URLs

Let’s take a step back and look at an earlier problem. A particular resource on the internet might be reachable at multiple addresses. Let’s say we have an article about Swedish Fish candy (thanks, Google). This article might be found at any number of locations on my website:

  • http://www.example.com/product/swedish-fish
  • http://www.example.com/product.php?item=swedish-fish
  • http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
  • http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
  • http://www.example.com/psf

Suppose any one of those URLs will direct you to the same, or effectively the same, article about Swedish Fish. Wouldn’t it be nice if I, as the author of that article, could specify which URL is the best address? That way, when Google sees all of those different links, it treats them as links to a single, best address.

Turns out, Google has a solution. All you need to do is add a single line of HTML to the head of each of these pages that looks like this:

<link rel="canonical" href="http://www.example.com/product/swedish-fish" />

This tells Google, when it processes those links and pages, that they all really refer to this single canonical URL. At least Ask, Microsoft and Yahoo also respect this directive. So if we have multiple URLs for a single resource, we can tell other services which URL we designate as the best one.
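To make that concrete, here is a minimal sketch of how a crawler might read the directive, using only Python's standard library. The names CanonicalFinder and find_canonical are mine, made up for illustration; this is not how Google or anyone else actually implements it.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel can hold several space-separated tokens; compare case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html):
    """Return the canonical URL a page declares, or None if it declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

A crawler could call find_canonical on the HTML of each of the URLs listed above and file every page that returns the same value under that one address.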

Short URLs

One useful alternate URL is a shorter version of the canonical URL. Consider the following two URLs:

  • http://tinyurl.com/db2myk
  • http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=lamma+island,+hk&ie=UTF8&ll=22.238578,114.147263&spn=0.264079,0.215607&z=12

Both URLs refer to the same map, but the first is much easier to pass along in an email, SMS, Twitter post, etc. Useful as it is, the short URL also has all sorts of problems, including being opaque and a possible point of failure.

There’s a new idea going around that would let us have our short URLs and avoid these problems by using rev="canonical". Rev? Well, it turns out rev is the opposite of rel for link elements: rel describes the relationship from this page to the linked URL, while rev describes the relationship from the linked URL back to this page. So we might have the following:

<!-- included in the code at 'http://www.example.com/psf' -->
<link rel="canonical" href="http://www.example.com/product/swedish-fish" />

<!-- included in the code at 'http://www.example.com/product/swedish-fish' -->
<link rev="canonical" href="http://www.example.com/psf" />

The first says, “Hello. The preferred address for this resource is http://www.example.com/product/swedish-fish. Have a nice day.” The second says, “Hello. I am the preferred address for http://www.example.com/psf. Have a nice day.”

The proponents of “revcanonical,” as they’re tagging it, suggest that we can use this second form to let services know about a possible shorter URL for a given longer URL. It also allows websites to provide their own URL shortening instead of relying on third parties like TinyURL. In this case, when someone submits the longer URL to a service like Twitter, Twitter could inspect the HTML at that URL, discover the suggested shorter URL, and use and display the shorter address. To reduce the processing overhead, there’s even a suggestion to send this rev="canonical" information in HTTP headers.
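As a rough sketch of what that discovery step could look like on the service's side, assuming the service already has the HTML of the long URL in hand (RevCanonicalFinder and url_to_display are hypothetical names of my own, not part of any real Twitter code):

```python
from html.parser import HTMLParser


class RevCanonicalFinder(HTMLParser):
    """Remember the href of the first <link rev="canonical"> seen in a page."""

    def __init__(self):
        super().__init__()
        self.short_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.short_url is not None:
            return
        attr = dict(attrs)
        revs = (attr.get("rev") or "").lower().split()
        if "canonical" in revs and attr.get("href"):
            self.short_url = attr["href"]


def url_to_display(long_url, html):
    """Prefer the address the page suggests; fall back to the submitted URL."""
    finder = RevCanonicalFinder()
    finder.feed(html)
    return finder.short_url or long_url
```

Note that this sketch takes the page at its word. Nothing here checks that the suggested address is actually shorter, or that it really leads back to the same article.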

A solution to linkrot?

I’ve long been a fan of TinyURL. I use it for emails and tweets regularly. But I’m also keenly aware of the flaws and issues in introducing a third-party URL shortening service. I welcome any idea that helps solve this problem.

The trouble is, I’m not sure revcanonical is a good solution, and my concern goes beyond the challenge of getting services like Twitter (let alone browsers) to support it. First off, rev here means reverse. In other words, the reverse of canonical. Just because some address is a reverse canonical address doesn’t mean that it’s necessarily shorter. The revcanonical proposal requires us to assume this intent.

Even worse, revcanonical mixes up the idea of alternative URLs with the idea of a canonical URL. The proposal would have us suggest that the following:

<!-- included in the code at 'http://www.example.com/product/swedish-fish' -->
<link rev="canonical" href="http://www.example.com/psf" />

means “http://www.example.com/psf is a shorter URL for http://www.example.com/product/swedish-fish” when in fact it means “http://www.example.com/product/swedish-fish claims to be the canonical URL for http://www.example.com/psf.” Think about that for a second. It’s the difference between one person saying, “Yeah, Joe speaks for me” versus Joe saying, “Yeah, I speak for this whole group.” To confirm Joe’s statement, we’d have to ask each person in the group. Likewise, if one URL says it’s the canonical address for another, we’d have to check that second URL to make sure it agrees. Otherwise, we open ourselves up to hijacking.
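Here is a small sketch of that “ask the other URL” check. Everything in it is an assumption for illustration: fetch stands for whatever function a service would use to retrieve a page's HTML, and confirmed_short_url is a name I made up. The point is only that a rev="canonical" claim gets accepted when the other page makes the matching rel="canonical" statement.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (attribute, token, href) triples from rel/rev on link elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        for kind in ("rel", "rev"):
            for token in (attr.get(kind) or "").lower().split():
                self.links.append((kind, token, href))


def declared(html, kind):
    """Return every href a page marks as canonical via the given attribute."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for k, token, href in collector.links
            if k == kind and token == "canonical"]


def confirmed_short_url(long_url, fetch):
    """Return a short URL only if the short page agrees with the long page."""
    for short_url in declared(fetch(long_url), "rev"):
        # Joe says he speaks for the group; now ask the group member.
        if long_url in declared(fetch(short_url), "rel"):
            return short_url
    return None
```

The cost is an extra request for every claimed short URL, which is exactly the kind of overhead the proposal was hoping to avoid.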

There are two other issues with this proposal. First, while rev is a valid HTML 4 attribute, it isn’t in the latest HTML 5 drafts. That means the proposal is pushing forward something which will soon be invalid markup. That alone should kill this particular idea immediately.

Finally, as Ben Ramsey pointed out, the proposal is confusing. The difference between rel and rev is subtle and will definitely lead to misunderstanding and misinterpretation. It was confusing enough that I felt it worthwhile to write this article just to make sure I understood it. A good solution should be immediately apparent.

So, while I applaud the effort of those who are looking to solve our short URL issues, I sincerely hope this proposal receives more inspection before wide adoption. Personally, I’d much rather see something like rel="shorter" used on link and a elements.