RDF 2 Wishlist

October 30, 2009

Here’s what I think should be standardized at some point, soon, in the Semantic Web infrastructure. These items are at various levels of maturity; some are probably ready for a W3C Working Group right now, while others are in need of research. They are mostly orthogonal and most can be handled in independent efforts. (I would lean against forming a single RDF Working Group to handle all of this; that would be slower, I think.)

To be clear, when I say “RDF 2” I mean it like OWL 2: an important step forward, but still compatible with version 1. I’m not interested in breaking any existing RDF systems, or even in causing their users significant annoyance. In some traditions, where the major version number is only incremented for incompatible changes, this would be called a 1.1 release. In contrast, at W3C we normally signal a major, incompatible change by changing the name, not the version number. (And we rarely do that: the closest I can think of is CSS->XSL, PICS->POWDER, and HTML->XHTML). The nice thing about using a different name is it makes clear that users each decide whether to switch, and the older design might live on and even win in the end. So if you want to make deep, incompatible changes to RDF, please pick a new name for what you’re proposing, and don’t assume everyone will switch.

This is partially a trip report for ISWC, because the presentations and especially the hallway and lounge conversations helped me think about all this.

Note that although I work for W3C, this is certainly not a statement of what W3C will do next. It’s not my decision, and even if it were, there would be a lot of community discussion first. This is just my own opinion, subject to change after a little more sleep. Formally, the decisions about how to allocate W3C resources among the different possible standards efforts are made by W3C management, guided by the folks who provide those resources, via their representatives on the Advisory Committee (AC). If the direction of the W3C is important to you or your business, it may be worthwhile to join and participate in that process.

1. RDF and XML interoperation

There’s a pretty big divide between RDF and XML in the real world. It’s a bit like any divide between different programming languages or different operating systems: users have to pick which technology family to adopt and invest in. It’s hard to switch later, because of all the investment in tools, built systems, education, and even social networks. (People who use some technology build social and professional relationships with other people who use the same technology. Thus we have an XML community, an RDF community, etc. Few people are motivated to be in both communities.)

I think we should have better tools for bridging the gap, technologically, so that when data is published in XML, it’s easy for RDF consumers to use it, and when the data is published in RDF, it’s easy for XML consumers to use it.

The leading W3C answer is GRDDL, which I think is pretty good, but could use some love. I’d like to see support for writing the transforms in JavaScript, which I think is probably the dominant language these days for writing code that’s going to run on someone else’s computer. It certainly has a bigger community than XSLT. I’d probably support Java bytecode, too.

I would also like to see some way to support third-party GRDDL, where the transform is provided by someone not associated with either the data provider or data consumer. Nova Spivack gave a keynote where he talked about this feature of T2. They’re focused on HTML not XML, but the solution is probably the same.

Beyond GRDDL, I think there’s room for a special data format that bridges the gap. I’ve called it “rigid rdf” or “type-tagged xml” in the past: it’s a sub-language of RDF/XML, or a style of writing XML, which can be read by RDF/XML parsers and is also amenable to validation and processing using XML schemas. Basically you take away all choices one has in serializing RDF/XML.

I note that the Cambridge Communiqué is ten years old this month. It proposed schema annotation as an approach, and that’s not a bad one, either. I haven’t heard of anyone working on it recently, but maybe that will change if the XML community starts to see more need to export RDF.

Amusingly, while I was talking to Gary Katz from MarkLogic about this, he mentioned XSPARQL as a possible solution, and I pointed out that Axel Polleres (the XSPARQL project leader) was sitting right next to us. So, they got to talk about it. XSPARQL doesn’t excite me, personally, because I don’t use either SPARQL or XQuery, but objectively, yes, it might solve the problem for some significant userbase.

2. Linked Data Inference

For me, an essential element of a working Linked Data ecosystem is automatic translation of data between vocabularies. If you provide data about the migration of frogs in one vocabulary, and my tools are looking for it in another one, the infrastructure should (in many cases) be able to translate for us. We need this because we can’t possibly agree on one vocabulary (for any given domain) that we’ll all use for all time. Even if we can agree for now, we’ll want this so that we can migrate to another vocabulary some time in the future.

Inference using OWL (and its subsets like RDFS) provides some of this, but I don’t think it’s enough. RIF fills in some more, but the WG did not think much about this use case, and there might be some glue missing. Maybe we can get a WG Note out of RIF to help this along.
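As a minimal sketch of the part RDFS-style inference already covers, assume triples are plain Python tuples and that someone has published rdfs:subPropertyOf links between two vocabularies. The frog vocabulary IRIs and the function name are hypothetical, made up for illustration; only the subPropertyOf rule itself comes from RDFS.

```python
# Sketch only: triples are (subject, predicate, object) tuples of strings.
# All example.org / example.net IRIs are hypothetical.
SUBPROP = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

def translate(data, mappings):
    """Apply the RDFS subPropertyOf rule until nothing new is derived:
    if (p subPropertyOf q) and (s p o) then (s q o)."""
    supers = {}
    for s, p, o in mappings:
        if p == SUBPROP:
            supers.setdefault(s, set()).add(o)
    result = set(data)
    frontier = list(result)
    while frontier:
        s, p, o = frontier.pop()
        for q in supers.get(p, ()):
            derived = (s, q, o)
            if derived not in result:
                result.add(derived)
                frontier.append(derived)
    return result

# Data published in "your" vocabulary.
yours = {("http://example.org/frog/17",
          "http://example.org/vocab#migratedTo",
          "http://example.org/pond/3")}
# A bridge saying your predicate implies "my" predicate.
bridge = {("http://example.org/vocab#migratedTo", SUBPROP,
           "http://example.net/terms#movedTo")}
mine = translate(yours, bridge)
```

After this runs, `mine` holds the original triple plus the same statement restated with the second vocabulary's predicate. Anything beyond simple predicate renaming (restructuring, unit conversion, and so on) is where I'd expect to need RIF or something like it.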

I’d like us to be clear about first principles: when you’re given an RDF graph, and you’re looking for more information that might be useful, you should dereference the predicate IRIs to learn about what kinds of inference you’re entitled to do. And then, given resources and suitable reasoners, you should do it. That is, the use of particular IRIs as predicates implies certain things, as defined by the IRI’s owner. The graph is invoking certain logics by using those IRIs. (Of course you can always infer things that were not implied, but as among humans, those “inferences” are really just guesses you are making. They have quite a different status from true implications.)
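The first step of that principle can be sketched in a few lines. This is an illustration under assumptions, not a proposal: the graph is a set of Python tuples, the function name is made up, and the actual fetching, parsing, and choice of reasoner are left out entirely.

```python
from urllib.parse import urldefrag

def documents_to_dereference(graph):
    """Return the documents a consumer would fetch to learn what the
    predicates used in `graph` entitle it to infer. The fragment part of
    each predicate IRI is dropped, since it is not sent to the server."""
    return {urldefrag(p).url for (_s, p, _o) in graph}

# Hypothetical data using two well-known vocabularies.
graph = {
    ("http://example.org/alice",
     "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/alice",
     "http://www.w3.org/2000/01/rdf-schema#label", "Alice"),
    ("http://example.org/alice",
     "http://www.w3.org/2000/01/rdf-schema#comment", "A person"),
}
to_fetch = documents_to_dereference(graph)
```

Here the two RDFS predicates share one document, so `to_fetch` has two entries rather than three.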

If this is put together properly, and the logics are constructed in the right form, I think we’ll get the dynamic, on-demand translation I’m looking for. I imagine RIF could be very useful for this, but reasoner plugins written in JavaScript or Java bytecode could be a better solution in some cases.

Some of my thinking here is in my workshop keynote slides, but later conversations with various folks, especially Pat Hayes and TimBL, helped it along. There’s more work to do here. I think it’s pretty small, but crucial.

3. Presentation Syntaxes

RDF, OWL, and RIF all have hideous primary exchange syntaxes and some decent not-W3C-recommended alternative serializations. I’m not really sure what can practically be done here that hasn’t been done.

At the very least, I’d like to see a nice RDF-friendly presentation syntax for RIF. A bit like N3, I suppose. I did some work on this; maybe I can finish it up, and/or someone else can run with it.

OWL 2 has 3+n syntaxes, where n is the number of RDF syntaxes we have. Exactly one of those syntaxes is required of all consumers, for interchange. I’ll be interested to see how this plays out in the market.

4. Multi-Graph Syntax

Most systems that work with RDF handle multiple graphs at the same time. Sometimes they do this by storing the triples in a quad store, with the fourth entry being a graph identifier. This works pretty well, and SPARQL supports querying such things.
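For readers who haven't seen one, here is a minimal in-memory sketch of the quad idea. The class and method names are my own invention for illustration and don't correspond to any particular store's API.

```python
from collections import defaultdict

class QuadStore:
    """Toy quad store: each triple is filed under a graph identifier."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, s, p, o, g):
        # The fourth element, g, names the graph the triple belongs to.
        self._graphs[g].add((s, p, o))

    def graph(self, g):
        """Return a copy of the triples in graph g (empty if unknown)."""
        return set(self._graphs.get(g, ()))

    def graphs_containing(self, s, p, o):
        """Return the identifiers of every graph holding this triple."""
        return {g for g, triples in self._graphs.items()
                if (s, p, o) in triples}

# Hypothetical identifiers, for illustration only.
store = QuadStore()
t = ("http://example.org/frog/17",
     "http://example.org/vocab#seenAt",
     "http://example.org/pond/3")
store.add(*t, "http://example.org/graph/1")
store.add(*t, "http://example.org/graph/2")
where = store.graphs_containing(*t)
```

The same triple can sit in any number of graphs; `where` ends up holding both graph identifiers.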

We don’t have a way to exchange multiple graphs in the same document, however. N3 has graph literals (originally called contexts), and there was some work under the term named graphs, which is kind of the opposite approach.

Personally, I don’t yet understand the use case for interchanging multiple graphs in one document, so I’m not sure where to go with this.

Hmmm. I guess RIF could be used for this. You can write RDF triples as RIF frame facts, and the rif:Document format allows multiple rulesets, each with an optional IRI identifier, in the same document. ETA: RIF also gives you an exchange syntax where you can syntactically put literals in the subject and use bnodes as predicates, if you want. But now you’re technically exchanging RIF Frames instead of RDF Triples.

5. RDF Graph Validation

When writing software that operates on RDF data, it’s really nice to know the shape of the data you’ll find. It’s even nicer if software can check that that’s actually what you got, and if reasoners can work to fill in any missing pieces.

I don’t exactly understand how important or unimportant this is. It’s closely related to the Duck Typing debate. Whatever mechanisms make duck typing work (e.g., exception handling, reflection, side-effect-free programming) probably help folks be okay without graph validation. But I think folks trained on C++/Java or XML Schema would be much happier with RDF if it had this.

The easiest solution might be using rigid RDF. One could probably also do it with SPARQL, essentially publishing the graph patterns that will match the data in the expected graphs.
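To make the graph-pattern idea concrete, here is a rough sketch of a matcher for SPARQL-style basic graph patterns, written in plain Python rather than against a real SPARQL engine. The vocabulary, the data, and the function names are all hypothetical.

```python
# Sketch only: a tiny matcher for SPARQL-style basic graph patterns.
# Terms starting with "?" are variables; everything else must match exactly.
def match(patterns, graph, binding=None):
    """Yield every variable binding under which all patterns occur in graph."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                if candidate.setdefault(term, value) != value:
                    break  # variable already bound to something else
            elif term != value:
                break      # constant term does not match
        else:
            yield from match(rest, graph, candidate)

def conforms(patterns, graph):
    """True if the published pattern matches the graph at least once."""
    return next(match(patterns, graph), None) is not None

# Hypothetical vocabulary and data, for illustration only.
EX = "http://example.org/terms#"
expected_shape = [("?book", EX + "title", "?title"),
                  ("?book", EX + "author", "?author")]
good = {("http://example.org/b1", EX + "title", "Frogs"),
        ("http://example.org/b1", EX + "author", "http://example.org/alice")}
bad = {("http://example.org/b2", EX + "title", "Toads")}

conforms_good = conforms(expected_shape, good)
conforms_bad = conforms(expected_shape, bad)
```

A publisher would ship `expected_shape` alongside the data, and a consumer would run something like `conforms` before trusting its assumptions about the graph. Note this only says the pattern occurs somewhere; it doesn't say every book matches it.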

The most interesting and weird approach is to use OWL. Of course, OWL is generally used to express knowledge and reason about some application domain, like books, genes, or battleships. But it’s possible to use OWL to express knowledge about RDF graphs about the application domain. In the first case, you say every book has one or more authors, who are humans. In the second case, you say every book-node-in-a-valid-graph has one or more author links to a human-node in the same graph. At least that’s the general idea. I don’t know if this can actually be made to work, and even if it can, it risks confusing new OWL users about one of the subjects they’re already seriously prone to get wrong.
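The second reading can be written down directly as a closed-world check. To be clear, this sketch is plain Python, not OWL, and the vocabulary and function name are made up; it is only meant to show what "book-node-in-a-valid-graph" asks for.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/terms#"  # hypothetical vocabulary

def books_missing_human_author(graph):
    """Return the book nodes that have no author link to a node typed as
    Human in this same graph (a closed-world check, not OWL entailment)."""
    humans = {s for s, p, o in graph if p == RDF_TYPE and o == EX + "Human"}
    books = {s for s, p, o in graph if p == RDF_TYPE and o == EX + "Book"}
    authored_by_human = {s for s, p, o in graph
                         if p == EX + "author" and o in humans}
    return books - authored_by_human

graph = {
    ("http://example.org/b1", RDF_TYPE, EX + "Book"),
    ("http://example.org/b1", EX + "author", "http://example.org/alice"),
    ("http://example.org/alice", RDF_TYPE, EX + "Human"),
    ("http://example.org/b2", RDF_TYPE, EX + "Book"),
    ("http://example.org/b2", EX + "author", "http://example.org/bob"),
}
problems = books_missing_human_author(graph)
```

The check flags b2 because nothing in the graph says bob is a human. An OWL reasoner given the first-case axiom would not flag b2 at all; it would simply conclude that b2 has some human author, stated or not. That gap between the two readings is exactly the confusion I'm worried about.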

6. Editorial Issues

Finally, I’d like some portions of the 2004 RDF spec rewritten, to better explain what’s really going on and guide people who aren’t heavily involved in the community. This could just be a Second Edition (no need for RDF 2), because no implementation changes would be involved.

I’d like us to include some practical advice about when/how to use List/Seq/Bag/Alt, and reification, maybe going so far as to deprecate some of them (IMHO, all but List). Maybe bring in some of the best-practice stuff on publishing and n-ary relations.

I understand Pat Hayes would like to explain blank nodes differently, explicitly introducing the notion of “surfaces” (what I would call knowledge bases, probably). Personally, I’d love to go one step farther and get rid of all “graph” terminology, instead just using N-Triples as the underlying formalism, but I might be a minority of one on that.

ETA: Of course we should also change “URI-Reference” to “IRI”, and stuff like that.

Okay, that’s my list. What’s yours? (For long replies, I suggest doing it on your own blog, and using trackback or posting a link here to that posting.) Discussion on semantic-web@w3.org is fine, too.

8 Responses to “RDF 2 Wishlist”

  1. Olivier Rossel Says:

    A remark concerning named graphs and how to handle them in the RDF/XML world.
    Considering that a named graph should map to a document is absolutely bad. What if you have to manage a thousand named graphs, each containing a single triple? Then what? A thousand tiny files?
    And more importantly, how to manage that a given triple is in several named graphs at once?
    Ah ah back to the reification issue, where you want to address the same triple multiple times, without duplicating it.

    Basically you need a reference on that triple.
    Which is perfectly available in RDF/XML through the rdf:ID attribute on properties.
    Unfortunately there is no equivalent in other dialects of RDF.

    So please please please consider the idea that RDF describes triples, and their references (== RDF should have 4 fields per triple).

    PS: most implementations of RDF systems store the triples as quads. To be able to address them unarily.

    PPS: i would appreciate clarification concerning the implementation of rdf:ID on properties, in RDF/XML. As far as I know, this (crucial) feature is implemented as the creation of … triples, alongside the original triple. If this assumption is correct, it is absolutely UGLY, completely denies the fact that RDF implementations are already quad stores (i.e would map natively with that rdf:ID stuff). Any clarification?

  2. Olivier Rossel Says:

    …as the creation of ___rdf:Statement___ triples, alongside the original triple…

  3. sandhawke Says:

    I don’t see any problem with a thousand (or a googol) named graphs, since (1) while it’s best practice, you don’t HAVE to serve a representation of the graph at its URI, and (2) the webserver certainly doesn’t have to use a file for each graph. In particular, you can have http://example.com/graph/NNN query the quadstore for graph number NNN and serve up its contents. You say “basically you need a reference on that triple” and yes, absolutely, but that’s what an IRI is. There’s an engineering design tradeoff between identifying triples and identifying sets of triples, but I think identifying sets of triples (some of which contain only one triple) is okay.

    I’m not sure what to say about rdf:ID, except that I think the community generally considers it harmful, and it probably should be deprecated, both on subjects (use rdf:about instead) and on predicates (avoid reification).

  4. Olivier Rossel Says:

    RDF/XML weakness:
    RDF/XML does not provide a way to mix several named graphs together in the same XML structure, *while* keeping graph<->triple information (i.e. graph provenance).

    LOD solution:
    This can be circumvented in LOD where you rely on SPARQL and HTTP and all that on top of RDF/XML.

    But LOD is just an architecture of RDF data exchange.
    Whereas RDF/XML is a standard spec of a serialization format. It must stay agnostic of the surrounding architecture. And try to circumvent its own weaknesses by itself.

    To take another example, HTML spec does not rely on HTTP to circumvent its own shortcomings. These are different businesses.

    Back to RDF/XML, and conclusion:
    IMHO, the lack of semantics for named graphs and triple IDs must be addressed at the spec level, not with some “one size fits all” architecture. (basically because you must not tighten all the possible usages of a spec).

    PS: i think this triple ID will also permit to solve a lot of issues about provenance. Whereas (current) LOD proposes no good solutions for that.

  5. Nice list!

    Well, I’d like XML to die, or at least become less relevant to RDF than it is today, but clearly, it is a long way to go.

    Also, I consider lists and sequences as a bigger issue than merely editorial, I feel we need to come up with a radically better solution.

    Finally, I think RDF has a serious shortcoming with respect to units of numerical literals. Almost all numbers have units, and this is an issue that is orthogonal to the datatype, but today we have no way to do this well.

    Expressing uncertainty would also be nice.

  6. […] 2.0″ and RDF syntaxes 2.0. On the same subject, I also recommend reading this post by Dave Beckett, developer of Redland, inventor of Turtle, and great advocate of Web technologies […]

  7. […] formulas. With the RDF next steps workshop coming up and Pat Hayes re-thinking RDF semantics Sandro thinking out loud about RDF2, I'd like us to think about RDF in more traditional terms. The scala programming language […]

  8. […] months ago, when Ivan Herman and I began to plan a new RDF Working Group, I posted my RDF 2 Wishlist. Some people complained that the Semantic Web was not ready for anything different; it was still […]
