Our move to a CMS provides an ideal opportunity to examine issues
of uniform resource identifier (URI) stability, which virtually all of the locally hosted resources on our website currently lack.
What is a URI?
You probably already know what a URL is, and a URI is similar.
Whereas speaking of a URL emphasizes the location of a resource,
speaking of a URI emphasizes the resource itself. In practice, these
may be the same thing. It turns out that on the web, the one sure-fire
way of uniquely identifying something is by its URL. So, that URL
happens to be its URI.
But resources can move. Their URLs can change. Although a
particular resource's URL will always be unique, it may not always
remain the same. Unfortunately, as is often the case, when a resource's
URL changes, its URI changes as well. Which means that when a newspaper
(for example) has an article with a URL like this:
And then changes its CMS, causing the URL to change to:
Both the URL and URI change. That is, the identifier changes even
though the resource that it represents didn't change - only its
location did. A whole lot of links break as a result.
It doesn't have to be this way!
Implications of a CMS on linking
A CMS will negate internal linking issues (i.e. linking from page
to page within the library website) by transparently assigning
all resources unique IDs which it will use to construct links
dynamically. Our resources will still have URLs, but they will change,
causing some disruption (link breakage). This is likely to affect:
- Browser bookmarks
- Search engine rankings
- Internal hyperlinks (from one of our pages to another of our pages)
- Hyperlinks from other sites to pages on our site
- Citations pointing to resources on our site
One of the ways to avoid this disruption (other than not going ahead with the move) is to use HTTP redirects, in which the server sends a 3xx response (such as "301 Moved Permanently") telling the web browser that a resource has moved and to what URL. The problem with building redirects for every resource is that:
- It's hard to compile a list of exactly what we have to redirect to where
- It's a pain to write a whole bunch of redirect scripts and/or set up redirect maps on the web server
- It's messy to merge an old URI scheme in with a new URI scheme
- Since we have to keep honoring the old resource locations, they never fully "go away"; we have to continue to maintain the redirect table over time, tracking everything that has ever moved
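To make that maintenance burden concrete, here is a minimal sketch of a redirect table in Python. Everything in it is hypothetical: the paths, the REDIRECTS table, and the resolve function are made up for illustration and are not taken from our web server configuration.

```python
from http import HTTPStatus

# Hypothetical old-path -> new-path table. None of these paths are real.
# The first entry shows a resource that has moved twice.
REDIRECTS = {
    "/older/hours.htm": "/old/hours.html",
    "/old/hours.html": "/about/hours",
    "/old/staff/index.html": "/about/staff",
}


def resolve(path, table=REDIRECTS):
    """Return (status, location) for a requested path.

    Unknown paths get 404. Known paths get 301 plus the final
    location, following chains of moves so the browser is sent
    straight to the current URL.
    """
    target = table.get(path)
    if target is None:
        return HTTPStatus.NOT_FOUND, None
    seen = set()  # guards against accidental redirect loops
    while target in table and target not in seen:
        seen.add(target)
        target = table[target]
    return HTTPStatus.MOVED_PERMANENTLY, target


if __name__ == "__main__":
    for requested in ("/older/hours.htm", "/old/staff/index.html", "/missing"):
        print(requested, resolve(requested))
```

Every move adds a row to the table, and no row can be deleted without breaking somebody's link, which is why the table only ever grows.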
Making URIs more stable
One way of stabilizing URIs is to hide resource implementation
details. What does that mean? Here is an example of the URI of a
resource on the library website:
This URI is bad because it is revealing: it exposes the fact
that the resource it identifies is being served from an HTML file in a
particular directory on our web server. It locates the resource, but
does not provide a very stable identifier for it. What if we were to
want to change the name of the undergrad.html file, or move it, or change the encoding of
its content (for example, HTML to XML) or make it a server-side script
(for example, PHP with a .php extension)? What would happen is that the
resource would have to get a new URL and this URL would break.
For another example, here is a page within the Architecture Studies Library's Las Vegas Architects and Buildings Database version 2 (LVABD2):
Let's break that down. From left to right, we have:
- UNLV library web server domain name
- ASL website subdirectory
- LVABD2 subfolder (called "archdb2" since it replaces an earlier version which was called "archdb")
- An "index.php" script which is used internally by the LVABD2 web application
- A "projects/view" parameter which tells the LVABD2 application that we want to view...
- ...an entity in the database with ID 251
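The same breakdown can be sketched with Python's standard urllib.parse module. The URL in the snippet is an assumption, reassembled from the components just listed, and the break_down function and its key names are hypothetical.

```python
from urllib.parse import urlsplit

# Assumed URL, reassembled from the components in the list above.
EXAMPLE_URL = "http://www.library.unlv.edu/arch/archdb2/index.php/projects/view/251"


def break_down(url):
    """Split an LVABD2-style URL into the parts described in the text."""
    parts = urlsplit(url)
    # Drop empty strings produced by the leading slash.
    segments = [s for s in parts.path.split("/") if s]
    return {
        "host": parts.netloc,               # library web server domain name
        "site_dir": segments[0],            # ASL website subdirectory
        "app_dir": segments[1],             # LVABD2 subfolder
        "script": segments[2],              # script used by the application
        "action": "/".join(segments[3:5]),  # what we want to do
        "entity_id": int(segments[5]),      # database ID of the entity
    }


if __name__ == "__main__":
    for name, value in break_down(EXAMPLE_URL).items():
        print(f"{name}: {value}")
```

Four of the six parts (the two directories, the script name, and the action) describe how the application is built rather than which resource is wanted.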
Now, this is a totally functional URL which, although not very
good, is not the worst URL ever. But it requires us to ask what would
happen in the future if we were to change the specific implementation
of the LVABD2 application to one that did not use the same
"index.php/projects/view/251" parameter scheme. In fact, we had to deal with
this very issue last summer when we deployed LVABD2 to replace LVABD1.
If you access the URL of the previous version:
http://www.library.unlv.edu/arch/archdb/
You'll notice that it loads a page that links to the
home page of the new version. This is a file that we had to create.
But it turns out that only this page redirects; none of the project
view pages etc. within the old application redirect to their new
equivalents. They are, in fact, all gone. It turns out that by
upgrading to LVABD2 and its new URI scheme, we broke the links for
everyone anywhere on the Internet who was linking to any of the
resources within LVABD1 - even though the availability of the resources
never changed at all. Whoops! (By the way, this is sort of my fault.)
When links break all over the place, people get annoyed. This
causes us to avoid changing resource locations and implementations,
which leads to our website falling behind the curve and becoming difficult to manage and use. Improvements in web technology frequently lead to gains in
efficiency, productivity, ease of use, ease of management, and
improvement in available features. In order to benefit from these
improvements, websites have to change by adopting this new technology.
But the URLs of their resources don't have to change, as long as those
resources are kept separate from the technology used to provide access
to them.
So, getting back to the LVABD example:
What happens if we remove the implementation details of the LVABD web app from this URI? How about this hypothetical improved URI:
Now, the great thing about this URI is that it's totally
abstracted away from the implementation of the resource it represents. There is not really a
"projects" subdirectory within the "arch" directory. There may
not even be an "arch" directory at all. As a patron, why should I care what
directories there are, or that the LVABD2 application is written in
CakePHP or Rails or TurboGears or whatever? I don't, and in any case, it's none of my
business! I'm only requesting the content corresponding to the URI that I provide. In fact, in this example,
LVABD hasn't changed at all; it's still there in the same location. We
have simply configured the web server to automatically serve the LVABD
URL equivalent whenever it receives a request for a URL matching the
pattern of the "improved" version. The user requests a resource; we
figure out how to deliver it. The
URI can remain stable for as long as we have a web server and
electricity. It can be persistent.
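As a rough sketch of that mapping, the snippet below translates a public path into the internal LVABD path. It is written in Python purely for illustration; on a real server this job would more likely be done by the web server's own URL rewriting facility. The public pattern /arch/projects/<id>, the internal template, and the to_internal function are all assumptions, not our actual configuration.

```python
import re

# Assumed public, implementation-free pattern: /arch/projects/<id>
PUBLIC_PATTERN = re.compile(r"/arch/projects/(\d+)/?")

# Assumed internal path that the application currently understands.
INTERNAL_TEMPLATE = "/arch/archdb2/index.php/projects/view/{id}"


def to_internal(public_path):
    """Map a stable public path to the current internal path.

    Returns None when the path does not match the public pattern,
    so the caller can fall through to its normal handling.
    """
    match = PUBLIC_PATTERN.fullmatch(public_path)
    if match is None:
        return None
    return INTERNAL_TEMPLATE.format(id=match.group(1))


if __name__ == "__main__":
    print(to_internal("/arch/projects/251"))
    print(to_internal("/arch/somewhere/else"))
```

If the application is ever rewritten, only INTERNAL_TEMPLATE has to change; the public path that patrons bookmark and cite stays the same.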
In summary
- We should think of all of the information on the library website in terms of resources, of which a web page is only one possible manifestation. (Others might include XML/JSON output for an iPhone app; vCard output for the staff directory; iCalendar output for the library calendar; etc.)
- The CMS is going to break a great many links on the library
website. If we take the proper precautions, this will only have to happen once. If not, we will continue to get bitten again and again by broken links.
- This is not the CMS's fault; it's our fault for not abstracting away resource implementations from their URLs and designing a stable URI scheme a long time ago.
- In the future, we should think about mapping our resources to stable URIs and consider planning a site-wide URI scheme into which our resources can fit. A CMS will greatly simplify this process, automating most of what would otherwise be painstaking manual work.