Fix 303 with client-side redirects
April 6, 2012
I am trying to stay far away from the current TAG discussions of httpRange-14 (now just HR14). I did my time, years ago. I came up with the best solution to date: use “303 See Other”. It’s not pretty, but so far it is the best we’ve got.
I gather now the can of worms is open again. I’m not really hungry for worms, but someone mentioned that the reason it’s open again is that use of 303 is just too inefficient. And if that’s the only problem, I think I know the answer.
If a site is doing a lot of redirects, in a consistent pattern, it should publish its rewrite rules, so the clients can do them locally.
Here’s a strawman proposal:
We define an RFC 5785 well-known URI pattern: .well-known/rewrite-rules. At this location, on each host, the host can publish some of its rewrite and redirection rules. The syntax is a tiny subset of the Apache RewriteRule syntax. For example:
# We moved /team to /staff
RewriteRule /team/(.*) /staff/$1 301
# All the /id/ pages get 303'd to the doc pages
RewriteRule (.*)/id/(.*) $1/doc/$2 303
The syntax here is: comments start with a hash (“#”); non-comments have four fields, separated by whitespace. The first field is the word “RewriteRule”. The second is a regular expression. The third is a string with back-references into the regular expression. The fourth is a numeric redirect code. Any line not matching this syntax, or otherwise not understood by the client, is to be ignored.
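To make that concrete, here is a rough sketch, in Python, of how a client might parse the file. The function name is mine, and it assumes the regular-expression and $1 back-reference dialect can be mapped onto Python’s re module:

import re

def parse_rewrite_rules(text):
    """Parse the proposed rewrite-rules format, ignoring anything not understood."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):       # comments start with '#'
            continue
        fields = line.split()
        if len(fields) != 4 or fields[0] != "RewriteRule":
            continue                               # not a rule we understand: skip it
        _, pattern, replacement, code = fields
        # translate Apache-style $1 back-references into Python's \1 form
        replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)
        try:
            rules.append((re.compile(pattern), replacement, int(code)))
        except (re.error, ValueError):
            continue                               # malformed rule: skip it
    return rules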
Clients that do not implement this specification will function unchanged, not looking at this file. Clients that do implement this specification keep track of how many times they get an HTTP redirect from a given host. If they get three or more redirects during one small period of time (such as a minute, or one run of the client if the client is short-lived), they perform a GET on /.well-known/rewrite-rules.
If the GET succeeds, the result should be cached using normal HTTP caching rules. If the result is not cached, this protocol is less efficient than server-side redirects. If the result is cached too long, clients may see incorrect data, so clients must not cache the result for longer than permitted by HTTP caching rules. (Maybe we make an exception for simple-minded clients and say they MAY ignore cache control information and just cache the document for up to 60 seconds.)
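Here is a rough sketch of that client-side policy, standard library only; the names are mine, and it takes the simple-minded 60-second cache option rather than honoring full HTTP cache control:

import time
import urllib.request
from urllib.parse import urlsplit

_redirects = {}    # host -> timestamps of recent redirects
_rules = {}        # host -> (time fetched, parsed rules)

def note_redirect(url):
    """Call this whenever a server answers with a redirect."""
    host = urlsplit(url).netloc
    now = time.time()
    recent = [t for t in _redirects.get(host, []) if now - t < 60] + [now]
    _redirects[host] = recent
    if len(recent) >= 3 and host not in _rules:
        fetch_rules(host)

def fetch_rules(host):
    """GET /.well-known/rewrite-rules from the host and cache the result."""
    try:
        with urllib.request.urlopen("http://%s/.well-known/rewrite-rules" % host) as resp:
            _rules[host] = (time.time(), parse_rewrite_rules(resp.read().decode()))
    except OSError:
        _rules[host] = (time.time(), [])           # remember the failure for a while, too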
If a client has a non-stale set of rewrite-rules from a given host, it should attempt to perform those rewrite rules client-side. For any GET, PUT, etc., it should match the URL (after the scheme name and colon) against the regular expression; if the match succeeds, it should perform the match-substitution into the destination string and use that for the operation, as if it had gotten a redirect (with the given redirect code).
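Continuing the sketch, applying a cached, non-stale rule set before issuing a request might look like this; as described above, matching is done against the URL after the scheme name and colon:

def rewrite(url):
    """Return (possibly rewritten URL, redirect code or None), using cached rules."""
    scheme, rest = url.split(":", 1)      # match against everything after "scheme:"
    host = urlsplit(url).netloc
    cached = _rules.get(host)
    if cached is None or time.time() - cached[0] > 60:
        return url, None                  # no fresh rules: fall back to server redirects
    for pattern, replacement, code in cached[1]:
        if pattern.search(rest):
            return scheme + ":" + pattern.sub(replacement, rest, count=1), code
    return url, None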
As an example deployment, consider DBPedia. Everything that is the primary subject of a Wikipedia entry has a URL of the form http://dbpedia.org/resource/page_title. When the client does a GET on that URL, it receives a 303 See Other redirect to either http://dbpedia.org/data/page_title or http://dbpedia.org/page/page_title, depending on the requested content type.
So, with this proposal, DBPedia would publish, at http://dbpedia.org/.well-known/rewrite-rules, this content:
RewriteRule /resource/(.*) /data/$1 303
This would allow clients to rewrite their /resource/ URLs, fetch the /data/ pages directly, and never go through the 303 redirect dance again.
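Using the sketch above, that one published rule is all a client needs (the page title here is just an example):

# assuming the sketches above, with DBPedia's one published rule already cached:
_rules["dbpedia.org"] = (time.time(),
                         parse_rewrite_rules("RewriteRule /resource/(.*) /data/$1 303"))

url, code = rewrite("http://dbpedia.org/resource/Semantic_Web")
# url is now "http://dbpedia.org/data/Semantic_Web" and code is 303,
# so the client fetches the data page directly, with no extra round trip.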
The content-negotiation issue could be handled by traditional means at the /page/* address. When the requested media type is not a data format, the response could use a Content-Location header, or a 307 Temporary Redirect. The redirect is much less painful here; this is a rare operation compared to the number of operations required when a Semantic Web client fetches all the data about a set of subjects.
My biggest worry about this proposal is that RewriteRules are error prone, and if these files get out of date, or the client implementation is buggy, the results would be very hard to debug. I think this could be largely addressed by Web servers generating this resource at runtime, serializing the appropriate parts of the internal data structures they use for rewriting.
This could be useful for the HTML Web, too. I don’t know how common redirects are in normal Web browsing or Web crawling. It’s possible the browser vendors and search engines would appreciate this. Or they might think it’s just Semantic Web wackiness.
So, that’s it. No more performance hit from 303 See Other. Now, can we close up this can of worms?
ETA: Added the DBPedia example. Also clarified the implications for the HTML Web.
April 11, 2012 at 2:33 am
Rather than a well-known URI, why not an HTTP header in the response the first time?
April 11, 2012 at 7:14 am
You mean the rewrite rules would be transmitted in a header?
That might work if you could figure out some appropriate boundaries. In general, I don’t think the person in charge of host/d1/d2/d3/r has the right to say what the rewrite rules for the entire host are. Maybe for d3 and down.
Also, for whatever reasons, the IETF makes it relatively easy to define new .well-known URIs but as far as I know it’s still hard to define new headers.