I’m just past halfway through nine consecutive days of all-day meetings, clustered around the W3C’s semi-annual Advisory Committee meeting. When this series started, I didn’t have any particular responsibilities, beyond attending. But then John Sheridan had to cancel his trip, and I was tapped to stand in for him on panels at both the AC meeting (this morning) and FOSE tomorrow.

My slides don’t stand on their own very well, but here they are anyway. There might be video of the FOSE talk, later. And if you’re in Washington, by all means, drop by!

1. Why linked data is important for government open data initiatives. This was for a panel showing how Linked Data is being adopted, in various ways, in various industries, for an audience (W3C staff and member representatives) who were, perhaps, tired of hearing about the promise of the Semantic Web.

2. How To Do Open Data (in 10 minutes). This is for an audience of federal IT managers/vendors who are mostly unfamiliar with the concept of RDF.

RDF meets NoSQL

March 9, 2010

On Thursday, I have 20 minutes to address 200 people (plus a video audience) at NoSQL Live … from Boston. My self-appointed mission is to start building bridges between the NoSQL community and the Linked Data/RDF/W3C community. These are two sets of people working on different problems, but it’s pretty clear to me they are heading in the same direction, in similar spirit, and could gain a lot from working together.

I’m organizing my talk around the question of standardization for NoSQL, and I’ll talk about W3C process and such, but the interesting part is where NoSQL touches RDF. So here are some of my thoughts on that, written for an audience already familiar with RDF. I’d love to get some feedback on the basic ideas now, to make my talk better.

  1. While they both want to move beyond SQL, their reasons are different. I understand the key motivation for NoSQL is:

    • SQL doesn’t scale big enough. Once your read-write dataset gets too big for a single machine, you have to develop and maintain a messy sharding system. (But see below.) Sometimes it’s an economy/efficiency thing; if you don’t need ACID, a NoSQL solution might give you better performance on cheaper hardware.

    Meanwhile, I see two big motivations for the RDF folks:

    • Decentralization. On the open Web, data comes from many different sources, mostly beyond your control. Everyone might be providing data to everyone, and no one has the authority to run a central database, even if they had the technology and the iron.

    • Inference. Some people find formal semantics and well-defined inference to be very useful. (How is this different from triggers and views? Good question.)

  2. There’s a freedom, a joy, and in some cases enormous practicality in not having to pre-create your database schema. Nearly all NoSQL and RDF systems are schemaless or allow a fully dynamic schema. (But I understand most bigtable systems have fixed column families. That could be a problem in using a bigtable as an RDF store.)

  3. Both tend to follow Web Architecture, using HTTP and often using REST.

  4. Reading about CouchDB and MongoDB (JSON document databases), as well as Neo4j (a graph database), I noticed an undercurrent about SQL being awkward to program against. I guess this is the O/R Impedance Mismatch, especially the structural differences. RDF’s design is very close to the relational model, so it doesn’t help on this front. Within the RDF community, however, there are some systems which attempt to partly bridge this gap, including node-centric APIs (which I happen to prefer, myself). I would also argue that duck typing closes the gap from the programming-language side. (There’s a small sketch of both ideas just after this list.)

  5. I don’t see NoSQL going anywhere near linked data or having a vocabulary ecosystem. I expect it will want to, someday. Decentralization is a key difference in requirements.

  6. I only see RDF dealing with inference. Why is this? My first thought is that the RDF community has a lot of AI roots, and the NoSQL community doesn’t. But maybe it’s about economics and motivations: formalizing the notion of inference makes it possible, in theory, to easily deploy very sophisticated data transformations (and have them complete before the sun goes out). At NoSQL scale, folks are much more concerned about techniques for being able to run even simple transformations (and have them complete before the power bill comes due). I note that AllegroGraph manages to be in both communities, with a very practical, high-performance Prolog element; I don’t know how parallel it is. A few RDF folks are working on using map/reduce. Presumably, with the rise of multi-core systems, even single-user inference engines will want to be made parallel.

  7. How many of the reasons for NoSQL rejecting SQL also apply to SPARQL? Does the scaling issue apply? Actually, does the scaling issue really apply to SQL? Michael Stonebraker (more or less the Voice of God) claims automatic sharding can and should be done while still using SQL. Some people reply: Perhaps, but in Enterprise-Grade Open Source? Also, maybe “SQL” is a euphemism for ACID, and that’s really what doesn’t scale and/or is too expensive. Perhaps that issue needs to be settled before considering SPARQL? Actually, the state of ACID in SPARQL is an open issue right now; maybe NoSQL can inform that decision.

  8. RDF is standardized. Some would argue it’s more standardized than SQL; that case will be stronger when SPARQL 1.1 is done. Here’s a diagram Steve Harris made, which I reformatted. “KV” refers to key-value stores, the simplest, most scalable kind of NoSQL database.

    Diagram showing SPARQL as more standardized than SQL, which is more standardized than key-value stores, while KV stores and SQL scale better than SPARQL.
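
Back to point 4 for a moment: here’s a tiny sketch of the difference between triple-centric and node-centric access, and of how duck typing papers over the lack of a fixed schema. It uses Python and rdflib purely as an illustration; the little Node wrapper is made up for this example, not any particular system’s API.

    # Triple-centric vs. node-centric access to the same RDF data.
    # rdflib is used for illustration; the Node wrapper is hypothetical.
    from rdflib import Graph, Literal, Namespace, URIRef

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    EX = Namespace("http://example.org/people/")

    g = Graph()
    g.add((EX.sandro, FOAF.name, Literal("Sandro Hawke")))
    g.add((EX.sandro, FOAF.mbox, URIRef("mailto:sandro@example.org")))

    # Triple-centric: match (subject, predicate, object) patterns.
    for person, _, name in g.triples((None, FOAF.name, None)):
        print(person, name)

    # Node-centric: wrap a node so its properties read like attributes.
    class Node:
        def __init__(self, graph, subject):
            self._g, self._s = graph, subject
        def __getattr__(self, prop):
            # Duck typing: just ask for the property; missing values come back as None.
            return self._g.value(self._s, FOAF[prop])

    sandro = Node(g, EX.sandro)
    print(sandro.name, sandro.mbox, sandro.homepage)  # homepage -> None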

So… What does that all boil down to?

Bottom line: RDF could learn a lot from NoSQL about scaling and ease-of-programming; NoSQL could learn a lot from RDF about decentralization and inference.

Some closing questions, ideas….

Can someone make a SPARQL endpoint with Cassandra’s performance and scaling properties? I haven’t studied this idea much, but I’m afraid the static column families will make it impossible to get much performance without building the store for a particular set of SPARQL queries. But it could still be useful, even with that drawback. Or maybe some SPARQL endpoint is already there; has anyone really tried a comparative benchmark?

How does the SPARQL endpoint description and aggregation work compare to database sharding? Are there designs for doing it automatically?

RDF 2 Wishlist

October 30, 2009

Here’s what I think should be standardized at some point, soon, in the Semantic Web infrastructure. These items are at various levels of maturity; some are probably ready for a W3C Working Group right now, while others are in need of research. They are mostly orthogonal and most can be handled in independent efforts. (I would lean against forming a single RDF Working Group to handle all of this; that would be slower, I think.)

To be clear, when I say “RDF 2” I mean it like OWL 2: an important step forward, but still compatible with version 1. I’m not interested in breaking any existing RDF systems, or even in causing their users significant annoyance. In some traditions, where the major version number is only incremented for incompatible changes, this would be called a 1.1 release. In contrast, at W3C we normally signal a major, incompatible change by changing the name, not the version number. (And we rarely do that: the closest I can think of is CSS->XSL, PICS->POWDER, and HTML->XHTML). The nice thing about using a different name is it makes clear that users each decide whether to switch, and the older design might live on and even win in the end. So if you want to make deep, incompatible changes to RDF, please pick a new name for what you’re proposing, and don’t assume everyone will switch.

This is partially a trip report for ISWC, because the presentations and especially the hallway and lounge conversations helped me think about all this.

Note that although I work for W3C, this is certainly not a statement of what W3C will do next. It’s not my decision, and even if it were, there would be a lot of community discussion first. This is just my own opinion, subject to change after a little more sleep. Formally, the decisions about how to allocate W3C resources among the different possible standards efforts are made by W3C management, guided by the folks who provide those resources, via their representatives on the Advisory Committee (AC). If the direction of the W3C is important to you or your business, it may be worthwhile to join and participate in that process.

1. RDF and XML interoperation

There’s a pretty big divide between RDF and XML in the real world. It’s a bit like any divide between different programming languages or different operating systems: users have to pick which technology family to adopt and invest in. It’s hard to switch, later, because of all the investment in tools, built systems, education, and even social networks. (People who use some technology build social and professional relationships with other people who use the same technology. Thus we have an XML community, an RDF community, etc. Few people are motivated to be in both communities.)

I think we should have better tools for bridging the gap, technologically, so that when data is published in XML, it’s easy for RDF consumers to use it, and when the data is published in RDF, it’s easy for XML consumers to use it.

The leading W3C answer is GRDDL, which I think is pretty good, but could use some love. I’d like to see support for the transforms being in Javascript, which I think is probably the dominant language these days for writing code that’s going to run on someone else’s computer. It certainly has a bigger community than XSLT. I’d probably support Java bytecode, too.

I would also like to see some way to support third-party GRDDL, where the transform is provided by someone not associated with either the data provider or data consumer. Nova Spivack gave a keynote where he talked about this feature of T2. They’re focused on HTML not XML, but the solution is probably the same.

Beyond GRDDL, I think there’s room for a special data format that bridges the gap. I’ve called it “rigid rdf” or “type-tagged xml” in the past: it’s a sub-language of RDF/XML, or a style of writing XML, which can be read by RDF/XML parsers and is also amenable to validation and processing using XML schemas. Basically you take away all choices one has in serializing RDF/XML.

I note that The Cambridge Communiqué is ten years old this month. It proposed schema annotation as an approach, and that’s not a bad one, either. I haven’t heard of anyone working on it recently, but maybe that will change if the XML community starts to see more need to export RDF.

Amusingly, while I was talking to Gary Katz from MarkLogic about this, he mentioned XSPARQL as a possible solution, and I pointed out Axel Polleres (xsparql project leader) was sitting right next to us. So, they got to talk about it. XSPARQL doesn’t excite me, personally, because I don’t use either SPARQL or XQuery, but objectively, yes, it might solve the problem for some significant userbase.

2. Linked Data Inference

For me, an essential element of a working Linked Data ecosystem is automatic translation of data between vocabularies. If you provide data about the migration of frogs in one vocabulary, and my tools are looking for it in another one, the infrastructure should (in many cases) be able to translate for us. We need this because we can’t possibly agree on one vocabulary (for any given domain) that we’ll all use for all time. Even if we can agree for now, we’ll want this so that we can migrate to another vocabulary some time in the future.

Inference using OWL (and its subsets like RDFS) provides some of this, but I don’t think it’s enough. RIF fills in some more, but the WG did not think much about this use case, and there might be some glue missing. Maybe we can get a WG Note out of RIF to help this along.

I’d like us to be clear about first principles: when you’re given an RDF graph, and you’re looking for more information that might be useful, you should dereference the predicate IRIs to learn about what kinds of inference you’re entitled to do. And then, given resources and suitable reasoners, you should do it. That is, the use of particular IRIs as predicates implies certain things, as defined by the IRI’s owner. The graph is invoking certain logics by using those IRIs. (Of course you can always infer things that were not implied, but as among humans, those “inferences” are really just guesses you are making. They have quite a different status from true implications.)
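
As a rough sketch of that principle (Python and rdflib here only as an illustration, skipping caching, content negotiation, and the actual reasoning step), a consumer would walk the predicates of a graph and merge whatever each IRI’s owner publishes there:

    # Follow-your-nose sketch: dereference each predicate IRI and merge what
    # the vocabulary owner publishes there, before handing the result to a
    # reasoner. Real code needs caching, error handling, and a reasoner.
    from rdflib import Graph

    def enrich(data: Graph) -> Graph:
        enriched = Graph()
        for triple in data:
            enriched.add(triple)
        for predicate in set(data.predicates()):
            try:
                enriched.parse(str(predicate))  # HTTP GET on the predicate IRI
            except Exception:
                pass  # vocabulary not resolvable right now; skip it
        return enriched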

If this is put together properly, and the logics are constructed in the right form, I think we’ll get the dynamic, on-demand translation I’m looking for. I imagine RIF could be very useful for this, but reasoner plugins written in Javascript or Java bytecode could be a better solution in some cases.

Some of my thinking here is in my workshop keynote slides, but later conversations with various folks, especially Pat Hayes and TimBL, helped it along. There’s more work to do here. I think it’s pretty small, but crucial.

3. Presentation Syntaxes

RDF, OWL, and RIF all have hideous primary exchange syntaxes and some decent not-W3C-recommended alternative serializations. I’m not really sure what can practically be done here that hasn’t been done.

At the very least, I’d like to see a nice RDF-friendly presentation syntax for RIF. A bit like N3, I suppose. I did some work on this; maybe I can finish it up, and/or someone else can run with it.

OWL 2 has 3+n syntaxes, where n is the number of RDF syntaxes we have. Exactly one of those syntaxes is required of all consumers, for interchange. I’ll be interested to see how this plays out in the market.

4. Multi-Graph Syntax

Most systems that work with RDF handle multiple graphs at the same time. Sometimes they do this by storing the triples in a quad store, with the fourth entry being a graph identifier. This works pretty well, and SPARQL supports querying such things.
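
To make the quad-store picture concrete, here’s a small sketch in Python with rdflib (the graph names and numbers are made up): the same subject described in two named graphs, and a SPARQL query that reports which graph each answer came from.

    # Two named graphs in one store: each triple carries a graph identifier
    # (a quad), and SPARQL's GRAPH keyword exposes it.
    from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")
    store = ConjunctiveGraph()

    census = store.get_context(URIRef("http://example.org/graphs/census"))
    survey = store.get_context(URIRef("http://example.org/graphs/survey"))
    census.add((EX.Boston, EX.population, Literal(617594)))
    survey.add((EX.Boston, EX.population, Literal(620000)))

    q = """
    SELECT ?g ?pop WHERE {
      GRAPH ?g { <http://example.org/Boston> <http://example.org/population> ?pop }
    }
    """
    for g, pop in store.query(q):
        print(g, pop)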

We don’t have a way to exchange multiple graphs in the same document, however. N3 has graph literals (originally called contexts), and there was some work under the term named graphs, which is kind of the opposite approach.

Personally, I don’t yet understand the use case for interchanging multiple graphs in one document, so I’m not sure where to go with this.

Hmmm. I guess RIF could be used for this. You can write RDF triples as RIF frame facts, and the rif:Document format allows multiple rulesets, each with an optional IRI identifier, in the same document. ETA: RIF also gives you an exchange syntax where you can syntactically put literals in the subject and use bnodes as predicates, if you want. But now you’re technically exchanging RIF Frames instead of RDF Triples.

5. RDF Graph Validation

When writing software that operates on RDF data, it’s really nice to know the shape of the data you’ll find. It’s even nicer if software can check whether that’s actually what you got, and if reasoners can work to fill in any missing pieces.

I don’t exactly understand how important or unimportant this is. It’s closely related to the Duck Typing debate. Whatever mechanisms make duck typing work (e.g., exception handling, reflection, side-effect-free programming) probably help folks be okay without graph validation. But I think folks trained on C++/Java or XML Schema would be much happier with RDF if it had this.

The easiest solution might be using rigid RDF. One could probably also do it with SPARQL, essentially publishing the graph patterns that will match the data in the expected graphs.
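
Here’s a minimal sketch of that SPARQL idea, using Python with rdflib and SPARQL 1.1’s NOT EXISTS (itself still being standardized as I write this); the ex:Book/ex:author vocabulary is made up. The published pattern says what a valid graph must contain, and an ASK query reports whether anything breaks it.

    # Validation by query: ASK whether any ex:Book lacks an ex:author link.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/vocab#")

    g = Graph()
    g.add((EX.moby, RDF.type, EX.Book))
    g.add((EX.moby, EX.author, EX.melville))
    g.add((EX.mystery, RDF.type, EX.Book))  # no author: should be flagged

    result = g.query("""
        PREFIX ex: <http://example.org/vocab#>
        ASK {
            ?b a ex:Book .
            FILTER NOT EXISTS { ?b ex:author ?who }
        }
    """)
    print("invalid graph" if result.askAnswer else "graph looks okay")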

The most interesting and weird approach is to use OWL. Of course, OWL is generally used to express knowledge and reason about some application domain, like books, genes, or battleships. But it’s possible to use OWL to express knowledge about RDF graphs about the application domain. In the first case, you say every book has one or more authors, who are humans. In the second case, you say every book-node-in-a-valid-graph has one or more author links to a human-node in the same graph. At least that’s the general idea. I don’t know if this can actually be made to work, and even if it can, it risks confusing new OWL users about one of the subjects they’re already seriously prone to get wrong.

6. Editorial Issues

Finally, I’d like some portions of the 2004 RDF spec rewritten, to better explain what’s really going on and guide people who aren’t heavily involved in the community. This could just be a Second Edition — no need for RDF 2 — because no implementation changes would be involved.

I’d like us to include some practical advice about when/how to use List/Seq/Bag/Alt, and reification, maybe going so far as to deprecate some of them (IMHO, all but List). Maybe bring in some of the best-practice stuff on publishing and n-ary relations.

I understand Pat Hayes would like to explain blank nodes differently, explicitly introducing the notion of “surfaces” (what I would call knowledge bases, probably). Personally, I’d love to go one step farther and get rid of all “graph” terminology, instead just using N-Triples as the underlying formalism, but I might be a minority of one on that.

ETA: Of course we should also change “URI-Reference” to “IRI”, and stuff like that.


Okay, that’s my list. What’s yours? (For long replies, I suggest doing it on your own blog, and using trackback or posting a link here to that posting.) Discussion on semantic-web@w3.org is fine, too.

Linked Government Data

September 11, 2009

A few weeks ago, I was drafted to replace José Manuel Alonso as W3C eGovernment Lead (in addition to my Semantic Web standards work). I’ve never worked with eGovernment before, but it turns out that the W3C group working in the area had recently resolved to:

identify ways for governments and computer science researchers to continue working together to advance the state-of-the-art in data integration and build useful, deployable proof-of-concept demos that use actual government information and demonstrate real benefit from linked data integration.

I wish they’d said that with fewer words, but, yes, I do know something about that.

So this week, I was in Washington, wearing a suit (but no tie — this was an O’Reilly crowd), attending the Gov2.0 Expo Showcase and the Gov2.0 Summit. It’s like a forest in spring after a fire, with wildflowers blooming everywhere, and grand government proclamations about future growth. My overwhelming impression, from events on the stage, was that:

  1. the Federal government is undergoing a massive effort to begin systematically publishing machine-readable data, across all agencies, in a sustainable way, driven both from the top (Obama) and by grassroots awareness of the new technology;
  2. state and local governments also see the benefits, and some brave souls are leading the way, often with much success;
  3. there is an army of developers, commercial and hobbyist, open source and proprietary, waiting to develop mind-blowing (and genuinely useful) applications using this data.

I had a little re-adjustment panic, feeling like there was nothing interesting left to do. Like when you show up to help some friends move, and they’re already done and sitting around eating the pizza.

But there’s still lots to do. People who’ve been in Washington a long time assure me it will be many years before a significant fraction of the data is really being published, and, as we well know, getting the data released does not mean it’s easy to use. It’ll be a long time before we have stable feeds of accurate data using appropriate formats and vocabularies.

So how do we get there from here? What should the agencies start doing now, to help ease the transition, and get us there faster? And where, exactly, is “there”?

(As if I would know? Well, it’s what I’m thinking about this week, so here’s my braindump, before it all gets swapped out for the weekend, and something else gets swapped in Monday morning.)

The Linked Data Ecosystem

Our goal, my goal, is to have a scalable and robust platform for decentralized apps to share their data. Right now, we have a handful of decentralized apps (such as e-mail), but most social software (multi-user apps) runs off a single vendor’s server farm. Yes, youtube, twitter, facebook, google docs, flickr, and others have figured out how to make it work, and how to make it scale, but they still run the systems. We want all that functionality, but without any one organization being a bottleneck, let alone potentially behaving badly.

This goal, I should note, is considerably less compelling now than it was when many of us signed up, in some cases decades ago, since we now have an effective app distribution system (with modern web browsers), and a vibrant, competitive, and healthy market for these individual centralized services. But some of us still would rather not trust facebook or google with too much of our personal and professional lives.

I suspect the government is best not relying on them too much, either.

Over the years, the Semantic Web community has been developing a set of techniques, based on open standards, for achieving these goals. They are not perfect, and they are not always polished, but most of them work well and the rest look pretty close.

  1. Present all your data as triples, building everything out of Subject-Property-Value statements. This roughly lines up with rock-solid technologies like relational databases and object-oriented programming, and with newer techniques, including JSON and terascale databases (Google’s BigTable/App Engine datastore, Amazon’s SimpleDB). The details are ironed out in RDF, a W3C standard first published in 1999, and then rearticulated with clarification in 2004.

    When you use triples as your building blocks, you have to be more explicit about the structure of your data, and this eases accurate interoperation. The simplicity also supports metadata, reasoning about data, data distribution, and shared infrastructure even among disparate applications.

  2. Use Web addresses (URLs/URIs/IRIs) to name things. Each thing that someone might need to refer to — cities, agencies, people, projects, items in inventory, websites, events, etc — should be assigned a universally unambiguous identifier (nothing else uses the same identifier), so that every source of data about that item can be merged, always knowing it’s the same item. It’s okay (and unavoidable, in some cases) to have multiple identifiers for some items; the important thing is to be able to use the same identifier, when we know it.

    We use Web addresses to build these identifiers, because we already have a structure for making sure there are no unintended duplicates, and because it allows an important element of centralization. Each time some organization mints a new web-based identifier, it gets the ability to provide (through its web server) some core data about that item. It’s not that their data will always be correct, but it provides a starting point, a seed around which a shared understanding of the item can grow. (There’s a small sketch of points 1 and 2 just after this list.)

  3. Build for change and diversity, with backward and forward compatibility, and multiple views on the same data, knowing that the world changes, and our understanding of the world changes. (This needs its own article, but Why I Want RIF? covers the hardest parts of the territory.)
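
Here’s the small sketch promised above, showing points 1 and 2 in a dozen lines of Python with rdflib. Every identifier and number in it is made up; the point is only the shape of the thing: web-based identifiers for the items, and subject-property-value statements about them.

    # A hypothetical agency publishes a grant as triples, using web-based
    # identifiers so anyone else's data about the same grant can be merged.
    from rdflib import Graph, Literal, Namespace

    ID = Namespace("http://data.example.gov/id/")    # made-up identifier space
    DEF = Namespace("http://data.example.gov/def/")  # made-up vocabulary

    g = Graph()
    g.add((ID.grant42, DEF.recipient, ID.cityOfBoston))
    g.add((ID.grant42, DEF.amount, Literal(250000)))
    g.add((ID.cityOfBoston, DEF.name, Literal("City of Boston")))

    print(g.serialize(format="turtle"))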

How Do We Get There?

The mood of the summit, and perhaps the mood everywhere near technology corners of the White House, seems to be one of ecstatic urgency. Folks seem to think everything can and should be done at a record pace, with crowd sourcing and agile development. The perfect is the enemy of the good. Get something good out there, then refine it. Fail early, fail fast. This is awesome…. but how do we iterate from here to a fully-evolved linked data ecosystem? And let’s not just assume the crowd will do all the right things.

There are two parts to the problem, and we can separate them, running them in parallel:

  1. Get more and more public data streams flowing.
  2. Make the data streams more and more usable.

There is a synergy between these two. As more data appears, more useful things can be done with it. As more good things are done with it, the incentives to release data become stronger. (Of course, some bad things will be done with it, too. But that’s a subject for another day.)

I don’t have any insights into part 1. That’s political, whether it’s workplace politics or national politics. I don’t know if there are any shortcuts. The case just has to be made, better and better, in all the right places, until the right stakeholders and decision makers and people who really do the work are convinced.

But part 2 can be crowd-sourced. It can be turned into a market, a community effort.

Supporting the Community Around a Data Source

People need to see what others are doing, and people doing good things need to be rewarded. People need to have some confidence they’ll be rewarded for doing good things. And mostly: the community needs to be arranged so that the more people participate, the better it gets. (See the video of Clay Shirky’s talk at the Summit about how to do this, in general.)

In this case, good things include:

  • Clarifying what elements of the data feed mean
  • Documenting the accuracy, methodology, etc, of the feed
  • Finding and potentially correcting bugs in the data
  • Developing software which somehow uses the data, and (1) sharing lessons learned in developing the software, (2) making the software available in various forms (web service, commercial download, open source, etc)
  • Linking to related feeds
  • Providing derived feeds, reformatting, correcting, or otherwise improving or re-targeting this data
  • Providing merged feeds, which process this data along with other data, to create something new
  • Summary/Analysis feeds
  • Review, commentary, support for all these related items
  • Career and project-funding opportunities related to this feed
  • Social events related to this feed
There’s lots more to be said here, but … deadlines are the antidote to perfectionism. So I’ll stop here for now. Comments welcome. If this stuff really interests you, please consider joining the W3C eGovernment Interest Group. (It’s an open group; W3C membership is encouraged but not required.)

Here are some things I want to do:

  • Put my pictures of my latest family vacation on the web, where my friends and family can see them. Let my mother put her pictures of the same events there, too. Let me select some for my friends to see, and her select some for her friends to see. Know that the pictures will be there, safe, for as long as anyone in the family wants them to be.
  • Put my movie collection on the web, so I can watch my movies wherever I happen to be, and so can my kids. If one of them is at a friend’s house, why should it matter if she remembered to bring the DVD, if the house has a good network connection?
  • Borrow my friends’ movies and music from time to time. If I want to watch some old college favorite (let’s say Heathers), and I know my friend Keith has a copy, must I drive 20 minutes to his house to borrow his DVD? No, I want him to have it on his server, and let me borrow it. (Note that I’m not asking to make my own copy. What I want might or might not break some laws, but it seems to me that if I can legally borrow his physical DVD for a few days, I should also be able to legally borrow the bits on his DVD. As long as there’s still only one working copy at any point in time, I expect the MPAA won’t be quite so furious with me.)
  • I want this copy-protection to be only advisory; if Greg borrows my copy, and somehow loses it or it breaks, I can just reclaim it. If he finds it again, his copy is an illegal one. There’s no need for security here (someone would just crack it anyway); I just want to make it easy for people who want to follow the rules to do so, and I hope the system would usually make it easier to follow the rules than to break them.
  • I want this to be private. I’m talking about sharing with my friends here, not sharing with the world. I may not want the world looking through the family vacation pictures, and I don’t want strangers borrowing my movies if it’s going to prevent me from watching them. I certainly don’t want strangers copying movies I paid for if it’s going to make the MPAA apoplectic (as I’m sure it would).

I know: youtube, myspace, flickr and their less-popular but more-exciting rivals offer something like this. But I want it decentralized. When I say private, I mean private from everyone, including Google and News Corp and Yahoo/Microsoft. And when I say I want the pictures to be there, “safe”, more-or-less forever, I don’t just mean until some corporation decides to change its terms of service, or capriciously enforce some clause I never noticed.

I think we can do this.

Servers

The heart of any system like this is the servers, of course. I suggest that a few people in each cluster of friends would be willing to pay a few dollars a month for some managed web space. Imagine it as a locker, and they tell their friends the combination. Or a private island, and they give their friends the right to land, in their private jets. (Virtual private jets are free in this modern world.) It’s a web location where they and their friends can store their digital media, where it’s easy to access. (When users access these storage sites, I picture them generally not caring about location; I look for “Heathers” and it finds all the copies all my friends are willing to share — no need to specifically ask Keith. With this, redundancy should be convenient and encouraged.)

I picture the server software being pretty simple (web+db), and something any web hosting provider could offer with one-click installation. Like wordpress, etc, but even simpler. The point is that if you want your own locker/island, you could rent one for $5/mo (low end) to $50/mo (high end), from commodity providers.

The data would be stored encrypted, so the host computers and hosting services never get to see what bits are really stored. They can’t see the family pictures or hollywood movies, even if they want to, no matter their terms of service or pressure from the government.
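
To make that concrete, here’s a minimal sketch of encrypt-before-upload, using Python and the cryptography package’s Fernet recipe purely as an illustration; a real design would also need key sharing and per-item keys.

    # The client encrypts media before upload, so the hosting provider only
    # ever stores opaque ciphertext.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()      # shared out-of-band with family/friends
    box = Fernet(key)

    photo = b"...raw bytes of a family photo..."   # stand-in for real media
    ciphertext = box.encrypt(photo)  # this is all the hosting provider sees

    # Anyone holding `key` can get the original back; the host never can.
    assert box.decrypt(ciphertext) == photo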

(These secure network volumes already exist, in various forms, most famously Amazon S3. There’s no magic here, but we might want a few tweaks to the API, and there’s definitely a need to package it differently so that my mother can comfortably sign up for hers. Also, I think this would support a market for cheaper and less-reliable storage, since reliability could be provided by other means.)

Clients

The clients would be more than just browsers. They have to understand the media tagging and indexing, and handle the crypto. They’d probably offer discussion boards, and such, as part of the media metadata.

I’m imagining three different kinds of clients:

  1. A purely web-based client would allow users an experience very much like flickr and youtube. It could show ads, and it would know your encryption keys, but at least it wouldn’t be hosting the media data, so you could move from one to another without risk of losing your photos, music, and movies.
  2. A plug-in for a media player like VLC.
  3. A command-line, virtual-file-system client. This is the kind of thing I might hack together as a prototype, using fuse. All the lockers to which I have keys would appear as a big filesystem, and there would be some tools for searching and manipulating it. I could use my existing media players, unmodified.

That’s the gist of it. I have lots more details floating around in my head. Does it exist already? Shall I go ahead and make a prototype (in my copious free after-hours time)?

Why I Want RIF

July 9, 2009

RIF is done, more or less.

When I say “done”, I don’t mean “done” like toast in the toaster is done, when it’s just perfectly crunchy, without quite being dry or hard. And I don’t mean “done” like a hacking project which is done at that precise moment when it stops being more fascinating than sleep or food or sunshine. No, RIF is “done” like a term paper, the night before it’s due. It meets the requirements, more or less, and the time has come to ship it.

Of course, the W3C process favors quality over speed, so instead of turning it in and walking away, we’ll have to do several rewrites, to address the teacher’s comments. In this case, we have at least three rounds of that. In the first round (called “last call”, which started last friday), the “teacher” is anyone who feels like reading and commenting. Then comes “candidate recommendation”, when we try to get everyone to implement it and give us comments as they do. (This is where OWL 2 is now). Finally, we’ll ask for a high level review from all W3C member organizations, as they decide whether to promote it from Proposed Recommendation to a full W3C Recommendation.

But still, it’s done like that term paper. It’s turned in, and now we wait for the review comments.

So what is RIF good for, anyway?

The consensus, Working Group answer is 26 pages long and rather in need of some polish, so here’s my short answer. Here’s why I’ve spent the last five years working on it. (No need to cry for my lost youth, I did some other fun things during that time, too.)

We need RIF so that we don’t need standards any more.

If you’ve ever tried to use FOAF (arguably the most popular Semantic Web vocabulary), you may have noticed a little problem with representing names. Am I:

  • [ foaf:firstName “Sandro”; foaf:surname “Hawke”], or
  • [ foaf:givenname “Sandro”; foaf:family_name “Hawke” ], or just plain
  • [ foaf:name “Sandro Hawke” ] ?

Who knows? How can anyone decide? It’s a mess.

And, of course, this problem is repeated everywhere. Every ontology has its share of coin-flip design decisions — decisions where you have no overwhelming engineering reason to make one choice over another. And every problem space has, or will soon have, a vast array of ontologies addressing it from many slightly-different angles.

I want data providers to publish using whatever ontology they know and love.

I want data consumers to consume (use) data in whatever ontology they know and love.

I expect RIF to be the glue in the middle, behind the scenes, in a fuzzy ball of linked-data rule-engine goodness.

Imagine Jos publishes using foaf:firstName and foaf:surname. Imagine Chris publishes using foaf:givenname and foaf:family_name. Imagine Gary writes an app which looks for foaf:name data. As long as the right RIF rules are present on the Web, in the right places, this should work. People using Gary’s app should see the data from Jos and the data from Chris, even though Gary never knew or cared about the vocabularies they used.
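
To make the glue a little more concrete, here’s a sketch of what one such rule does, written as Python over rdflib rather than RIF. (Note that it quietly bakes in one name-ordering convention, which is exactly the problem raised next.)

    # One translation rule, imperatively:
    #   ?x foaf:firstName ?f . ?x foaf:surname ?s   =>   ?x foaf:name "?f ?s"
    from rdflib import Graph, Literal, Namespace

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    def derive_names(g: Graph) -> None:
        for person, first in g.subject_objects(FOAF.firstName):
            last = g.value(person, FOAF.surname)
            if last is not None and g.value(person, FOAF.name) is None:
                # US-style ordering assumed; see the caveats below.
                g.add((person, FOAF.name, Literal(f"{first} {last}")))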

Of course, there’s some question as to what those rules should say. In the US, the givenname and the firstName can be treated as the same. Meanwhile, in Japan, the family_name is the firstName! And if you try to split a name back into firstName and surname, do you use the last space (as in “Sarah Jessica Parker”) or the first space (as in “Hillary Rodham Clinton”)?

I think the solution is to accept that there may be multiple rulesets, suitable for different purposes, and they may not be perfect. I explored this space to some degree in a different context: XTAN associates some “impact” (which might be called “semantic damage”) with each transformation (or ruleset). I think that can work.

So, there are still some details to work out. I’m not presenting the solution here; I’m just explaining why RIF interests me. Now you know.

(And no, I don’t really think this will entirely obviate the need for standards, but I think it will significantly reduce that need, taking some pressure off, and shifting some more work to the machines.)

notes from semtech2009

June 19, 2009

I’m at SFO with three hours to kill, and not many brain cells left, after spending the week at the Semantic Technologies Conference. I am, it turns out, so short on functioning brain cells that I’m writing a blog post, after all these months of self-censoring because I didn’t have anything worth saying here.

The dominant question at SemTech was whether we’ve finally reached the point in the adoption curve where (to mix 2d curve metaphors) it’s all downhill from here. Has this thing really caught on? Can we just coast and let the Semantic Web take over the world now?

I still think there are some vital pieces of the architecture missing, but I can’t deny that more and more people seem to “get it”, and be spreading the word. People I’ve never even heard of. That’s a pretty good indicator.

Of course, maybe they don’t really get it. I’m not quite sure anyone really gets it, since they don’t seem to notice those missing pieces.

I was happy to see Evren Sirin talking about their work at Clark and Parsia on using OWL for integrity constraints. I’m not sure they’ve got it quite right — it’s hard to tell — but I’m really glad they are trying.

What else?

The divisions within the community are still great. You have the rules folks, who really don’t get this whole Description Logic thing. You have the Description Logic folks who usually try to not be too condescending to the rules folks. We have the natural language folks, who I still mostly ignore.

RIF is pretty badly misunderstood and mis-characterized. That might not be my fault, but it’s kind of my responsibility. As one step, I put a little more time into starting a RIF FAQ this afternoon. Feel free to send me more questions that I don’t know how to answer.

Looking over my notes…

Peter Rip had some words of wisdom for startup founders that I think may apply much more broadly. He said to predict your own behavior, then live up to it. That’s what gives people confidence in you. So step one is know thyself, eh? I’m working on it, I’m working on it. One of the things I think I’ve learned about myself is that I don’t want to found a startup, despite all the fantasies about it, ever since ninth grade. What’s changed, I think, is that I’m being more realistic about what it would do to my life, what it would cost. (I suppose at some level, I’ve always known that, since I never actually did start a company with employees.)

Allegro Graph 4.0 filled me with triplestore lust. I don’t know if it could live up to the impression Jans painted of it, but … I want one. Interestingly, I didn’t feel envy; perhaps I’m also ready to give up on the fantasies of writing a killer triplestore.

In the Semantic Search session, David Booth and someone I don’t know separately expressed the concern that computers are getting too smart for our own good. Not really in the Skynet sense, but in the sense that when google tweaks their algorithms, in ways even they don’t understand, or the web just shifts a little, suddenly you can’t find some page any more. I thought Peter Norvig looked apologetic, when he responded with a terse “use plus as a workaround!”. I need to remember that more, when I get frustrated with Google not really doing keyword searching. Conclusion: make sure the compu-smarts always have an off-switch, and that humans never forget where the off-switch is.

I’m a big fan of Tom Gruber, but I’d short sell Siri if it were publicly traded. I’m not sure I can put my finger on it. All the arguments he makes for agent computing are compelling, but my gut says thumbs down. Could it be because there is (as far as I know) not a drop of SemWeb technology in there? I don’t think that’s it. I think it’s that I can’t imagine it will actually work better than more conventional alternatives. People don’t want an assistant; they want tools that are simple yet powerful enough for them to complete the task themselves. It’s a bit like GUI vs command-line. But who knows… (I wonder how this bunch of folks from SRI decided to name their new company and product SIRI. Odd….)

Data Portability. Wow, is this a difficult space. I am not optimistic here, either. I think we have many years of stumbling around in our future on this one. And that’s even if facebook isn’t being evil. On the plus side, the community is starting to realize what a hard problem it is. I’ve heard that admitting you have a problem is a good place to start. Still, the meme that there exists a quick solution, if we just get a few smart people together, … it’s damn compelling.

And here we are, going backward in time, back to the first session I attended, Monday morning, from some freebase folks (Jamie Taylor and Colin Evans). I should play with freebase more. If I ever get back to playing with scripts to manage my movie collection, I should probably use their movie data instead of IMDB. Jamie kept saying great things about W3C; I wanted to ask him why they don’t just join, but though I saw him many more times at the conference, he was always walking by in a hurry.

It’s not like I could have made much of a case, anyway. One of the problems with the W3C’s current business model is that folks no longer join W3C just because they (a) use our stuff, and (b) think we’re cool. They actually want business value in return for their dues! Losers.

(How do you measure the business value of a standard existing, anyway? In most cases, it’s a commons problem, where lots of folks get enormous value from the standard, but we don’t know how to monetize that. We can only charge for being part of the conversation when the standard is drafted, and sometimes that doesn’t seem to be worth very much.)

Speaking of W3C, OWL 2 went to CR last week. I think that was my most challenging round of publications yet. Do I say that every time? Still, stepping in as editor of rdf:PlainLiteral, and doing the whole transition process, in the midst of getting a new manager, … it was challenging.

I’m not actually worried about the CR phase itself. OWL 2 looks pretty darn good. Until SemTech, I was a bit worried about us getting RL implementations, but several people mentioned they planned to try it, and after a while I realized it’s kind of a no-brainer. If you play in the SemWeb space and have a rule engine (as many folks do), why wouldn’t you give OWL RL a try? Of course, you might not get around to running all the test cases, and reporting back by July 30, but at this point I’m no longer very worried about it.

Okay. One hour to flight time. That’s better. Have fun on the ground, everyone.

(topics I’m leaving out: Ian Horrocks taught a kick-ass class in description logics. The Jena team are pretty optimistic about their post-HP future (as am I). I’m sad that Jamie Taylor didn’t understand the need for 303 redirects.)

What Is OWL Good For?

October 28, 2008

Here’s another middle-of-the-night-while-travelling post. It’s about OWL….

Last week, the W3C OWL Working Group decided that it was essentially done with the design of the OWL 2 language. All that remains is some editorial work and fixing bugs that might crop up. In W3C process terminology, the Working Group decided it was (mostly) ready for “Last Call” of its multi-part specification. Happily, it’s pretty much on schedule.

I’ve been the primary W3C staff contact for this Working Group, and I was the secondary staff contact for the OWL 1 Working Group for its last six months (back in 2003-2004). I’ve tried to support both groups however I could, in infrastructure, process advice, management advice, and in some limited areas with technical design advice. But, awkwardly, I’m not really an OWL user, so I sometimes feel like this OWL work is only my “day job”.

That’s okay — lots of us do things for work that we’re not passionate about — but it isn’t how I usually operate. More interestingly, it’s kind of odd. I mean, OWL is on the very short list of Semantic Web standards, and I’m quite passionate about decentralization, which is more-or-less what the Semantic Web is about. So why am I not passionate about OWL?

Personal Reasons?

Okay, there may be some personal reasons. I’ll describe them in the hope of factoring them out, and in the hope of generally learning more about how people interact with standards committees.

  1. Although I was involved fairly near the beginning of work on OWL, I still feel late-to-the-party. For me, when I’m late to a party, I tend to have a nagging feeling that I missed something important, and I stay very cautious. I’m less likely to become enthusiastic.

  2. Perhaps because of that, or perhaps because I’m not a Description Logics researcher, I don’t feel like part of the “in-crowd” — even though I know and like and feel personally comfortable with many people who are.

  3. I don’t think I could ever get into the top tier of experts in either OWL implementation or OWL usage. There are some very, very smart people who have been dedicated to those subjects for years already. I doubt I could catch up.

These wouldn’t stop me from being an implementor (I did one strange OWL implementation called surnia) and/or user, but they make it harder for me to feel ownership. Maybe that’s related to excitement.

Technical Reasons?

The important issue here, though, is technical. Is OWL important for decentralization? Perhaps it is, but I’m not convinced. Really, my gut says “no”, the experts say “yes”, and I’m just confused. And so we end up with a blog post.

What purposes might OWL serve?

  1. It might be a data interface definition language, along the lines of BNF, ASN.1, and XML Schema, except on some subtly different level. Decentralized systems certainly need something like this — some computer format for defining the interfaces — but is it OWL? (cf my work last year on asn07 (rif wiki, rif e-mail)). Some people tell me it’s entirely unsuitable for this; some people say they’ve been using it for this for years.

  2. It might give people some computer support for defining a vocabulary, something like a spell-checker helps people writing prose. I have the feeling this is where the core Ontology community is. In this view, OWL is there to help you find errors and get insights into your vocabulary specification (aka your ontology).

  3. It might be used as a declarative programming language for expressing Semantic Web shims, the transformations needed so systems using related-but-different RDF vocabularies can interoperate. But maybe this is better done with RIF or conventional programming languages? This is the Ontology Mapping problem. Clearly OWL isn’t expressive enough to do all kinds of mapping, and it probably is the best option for trivial kinds of mapping (stuff that just uses owl:sameAs), but what about everything in between?
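
For the trivial end of that spectrum, here’s a sketch of what a mapping built from owl:sameAs-style equivalences amounts to, with the one inference step inlined in Python/rdflib instead of handed to a reasoner; both vocabularies are made up.

    # Trivial ontology mapping: declare two properties equivalent and copy
    # statements across. A real system would let an OWL reasoner do this.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL

    A = Namespace("http://example.org/vocabA#")
    B = Namespace("http://example.org/vocabB#")

    g = Graph()
    g.add((A.familyName, OWL.equivalentProperty, B.surname))  # the mapping itself
    g.add((A.alice, A.familyName, Literal("Liddell")))

    # Apply owl:equivalentProperty in both directions.
    for p1, _, p2 in list(g.triples((None, OWL.equivalentProperty, None))):
        for s, _, o in list(g.triples((None, p1, None))):
            g.add((s, p2, o))
        for s, _, o in list(g.triples((None, p2, None))):
            g.add((s, p1, o))

    print(g.value(A.alice, B.surname))  # -> "Liddell"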

Sometimes in standards work there is constructive ambiguity in the spec and also in the charter. It makes the technology easier to sell, and draws in a wide base of potential support. What is this technology for? Well, it’s for lots of things. It might be just what you need! If the charter were more specific, picking out only points 1, 2, or 3, then we’d have a narrower spec, serving a narrower community. Maybe serving it better, but still with fewer economies of scale and network effects.

If we look back at the OWL Use Cases and Requirements, we see use cases somewhat in line with these three options, although taking a rather different approach. The first two are about using automated reasoning in mapping between vocabularies; that’s sort of number 3, but with human vocabularies à la wordnet. The other four are less clear cut. Number five looks like classic decentralization, but I don’t see any case being made for OWL in there. Number six looks like classic planning; is OWL the best logic language for automated planning?

Anyway, I’d love to hear comments (below or via trackback/pingback) about what you think OWL is good for.

[Oh, look, it sounds like there’s some interesting discussion of this going on at OWLED. A shame I wasn’t up for travelling three weeks in a row.]

Please note that I’m really not complaining about OWL. I’m fairly comfortable with its design, and quite happy with the design process. I’m just saying that I have trouble seeing why it’s as cool as some people seem to think it is. Maybe I’m missing something. Maybe my goals are different. Maybe its utility is exactly the same to both of us, and for whatever reason it doesn’t turn me on the same way. I’m just trying to understand that….