Linked Government Data
September 11, 2009
A few weeks ago, I was drafted to replace José Manuel Alonso as W3C eGovernment Lead (in addition to my Semantic Web standards work). I’ve never worked with eGovernment before, but it turns out that the W3C group working in the area had recently resolved to:
identify ways for governments and computer science researchers to continue working together to advance the state-of-the-art in data integration and build useful, deployable proof-of-concept demos that use actual government information and demonstrate real benefit from linked data integration.
I wish they’d said that with fewer words, but, yes, I do know something about that.
So this week, I was in Washington, wearing a suit (but no tie — this was an O’Reilly crowd), attending the Gov2.0 Expo Showcase and the Gov2.0 Summit. It’s like a forest in spring after a fire, with wildflowers blooming everywhere, and grand, government proclamations about future growth. My overwhelming impression, from events on the stage, was that:
- the Federal government is undergoing a massive effort to begin systematically publishing machine-readable data, across all agencies, in a sustainable way, driven from both the top (Obama) and by all the grassroots awareness of the new technology;
- state and local governments also see the benefits, and some brave souls are leading the way, often with much success;
- there is an army of developers, commercial and hobbyist, open source and proprietary, waiting to develop mind-blowing (and genuinely useful) applications using this data.
I had a little re-adjustment panic, feeling like there was nothing interesting left to do. Like when you show up to help some friends move, and they’re already done and sitting around eating the pizza.
But there’s still lots to do. People who’ve been in Washington a long time assure me it will be many years before a significant fraction of the data is really being published, and, as we well know, getting the data released does not mean it’s easy to use. It’ll be a long time before we have stable feeds of accurate data using appropriate formats and vocabularies.
So how do we get there from here? What should the agencies start doing now, to help ease the transition, and get us there faster? And where, exactly, is “there”?
(As if I would know? Well, it’s what I’m thinking about this week, so here’s my braindump, before it all gets swapped out for the weekend, and something else gets swapped in Monday morning.)
The Linked Data Ecosystem
Our goal, my goal, is to have a scalable and robust platform for decentralized apps to share their data. Right now, we have a handful of decentralized apps (such as e-mail), but most social software (multi-user apps) runs off a single vendor’s server farm. Yes, youtube, twitter, facebook, google docs, flickr, and others have figured out how to make it work, and how to make it scale, but they still run the systems. We want all that functionality, but without any one organization being a bottleneck, let alone potentially behaving badly.
This goal, I should note, is considerably less compelling now than it was when many of us signed up, in some cases decades ago, since we now have an effective app distribution system (with modern web browsers), and a vibrant, competitive, and healthy market for these individual centralized services. But some of us still would rather not trust facebook or google with too much of our personal and professional lives.
I suspect the government is best not relying on them too much, either.
Over the years, the Semantic Web community has been developing a set of techniques, based on open standards, for achieving these goals. They are not perfect, and they are not always polished, but most of them work well and the rest look pretty close.
Present all your data as triples, building everything out of Subject-Property-Value statements. This roughly lines up with rock-solid technologies like relational databases and object-oriented programming, and new techniques, including JSON and terascale databases (Google’s BigTable app engine datastore, Amazon’s SimpleDB). The details are ironed out in RDF, a W3C standard first published in 1999, and then rearticulated with clarification in 2004.
When you use triples as your building blocks, you have to be more explicit about the structure of your data, and this eases accurate interoperation. The simplicity also supports metadata, reasoning about data, data distribution, and shared infrastructure even among disparate applications.
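To make the idea concrete, here is a minimal sketch in Python of data held as Subject-Property-Value triples, with a tiny pattern-matching lookup. It is only an illustration: the `ex:` identifiers, the sample facts, and the `match` helper are all made up for this example, and a real system would use an RDF library rather than plain tuples.

```python
from typing import Iterable, Iterator, Optional, Tuple

# A triple is just (subject, property, value).
Triple = Tuple[str, str, str]

# Hypothetical sample data; none of these identifiers are real.
triples = [
    ("ex:agency42", "ex:name", "Example Agency"),
    ("ex:agency42", "ex:locatedIn", "ex:cityA"),
    ("ex:cityA", "ex:name", "City A"),
]


def match(
    data: Iterable[Triple],
    subject: Optional[str] = None,
    prop: Optional[str] = None,
    value: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given parts.

    A part left as None acts as a wildcard.
    """
    for s, p, v in data:
        if subject is not None and s != subject:
            continue
        if prop is not None and p != prop:
            continue
        if value is not None and v != value:
            continue
        yield (s, p, v)


# Everything we know about the agency:
about_agency = list(match(triples, subject="ex:agency42"))

# Follow a link: find where the agency is, then look up that place's name.
place = next(match(triples, subject="ex:agency42", prop="ex:locatedIn"))[2]
place_name = next(match(triples, subject=place, prop="ex:name"))[2]
```

Because every fact has the same three-part shape, the same `match` function works for any kind of data, which is the sort of shared infrastructure the paragraph above has in mind.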
Use Web addresses (URLs/URIs/IRIs) to name things. Each thing that someone might need to refer to — cities, agencies, people, projects, items in inventory, websites, events, etc — should be assigned a universally unambiguous identifier (nothing else uses the same identifier), so that every source of data about that item can be merged, always knowing it’s the same item. It’s okay (and unavoidable, in some cases) to have multiple identifiers for some items; the important thing is to be able to use the same identifier, when we know it.
We use Web addresses to build these identifiers, because we already have a structure for making sure there are no unintended duplicates, and because it allows an important element of centralization. Each time some organization mints a new web-based identifier, it gets the ability to provide (through its web server) some core data about that item. It’s not that their data will always be correct, but it provides a starting point, a seed around which a shared understanding of the item can grow.
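As a rough sketch of why shared identifiers matter, the Python below merges triples from two independent sources simply by grouping on the subject's Web address. Everything here is hypothetical: the `example.org` addresses, the property names, the figures, and the `merge` helper are invented for illustration and do not come from any real feed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical identifiers minted under an illustrative domain.
CITY = "http://example.org/id/city/cityA"
POPULATION = "http://example.org/vocab/population"
MAYOR = "http://example.org/vocab/mayor"

# Two made-up sources that never coordinated, except by using the same URI.
census_feed = [(CITY, POPULATION, "100000")]
city_hall_feed = [(CITY, MAYOR, "Jane Doe"), (CITY, POPULATION, "101500")]


def merge(*sources: Iterable[Triple]) -> Dict[str, Dict[str, Set[str]]]:
    """Group all statements by subject, then by property.

    Conflicting values are kept side by side rather than overwritten,
    since no single source is assumed to be correct.
    """
    merged: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for source in sources:
        for subject, prop, value in source:
            merged[subject][prop].add(value)
    return merged


merged = merge(census_feed, city_hall_feed)
mayors = merged[CITY][MAYOR]
populations = merged[CITY][POPULATION]
```

The merge needs no mapping table between the two feeds, because both already name the city the same way. Note that the two population figures disagree and both survive the merge; deciding which to trust is a separate problem.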
Build for change and diversity, with backward and forward compatibility, and multiple views on the same data, knowing that the world changes, and our understanding of the world changes. (This needs its own article, but Why I Want RIF? covers the hardest parts of the territory.)
How Do We Get There?
The mood of the summit, and perhaps the mood everywhere near the technology corners of the White House, seems to be one of ecstatic urgency. Folks seem to think everything can and should be done at a record pace, with crowd sourcing and agile development. The perfect is the enemy of the good. Get something good out there, then refine it. Fail early, fail fast. This is awesome… but how do we iterate from here to a fully-evolved linked data ecosystem? And let’s not just assume the crowd will do all the right things.
There are two parts to the problem, and we can separate them, running them in parallel:
- Get more and more public data streams flowing.
- Make the data streams more and more usable.
There is a synergy between these two. As more data appears, more useful things can be done with it. As more good things are done with it, the incentives to release data become stronger. (Of course, some bad things will be done with it, too. But that’s a subject for another day.)
I don’t have any insights into part 1. That’s political, whether it’s workplace politics or national politics. I don’t know if there are any shortcuts. The case just has to be made, better and better, in all the right places, until the right stakeholders and decision makers and people who really do the work are convinced.
But part 2 can be crowd-sourced. It can be turned into a market, a community effort.
Supporting the Community Around a Data Source
People need to see what others are doing, and people doing good things need to be rewarded. People need to have some confidence they’ll be rewarded for doing good things. And mostly: the community needs to be arranged so that the more people participate, the better it gets. (See the video of Clay Shirky’s talk at the Summit about how to do this, in general.)
In this case, good things include:
- Clarifying what elements of the data feed mean
- Documenting the accuracy, methodology, etc, of the feed
- Finding and potentially correcting bugs in the data
- Developing software which somehow uses the data, and (1) sharing lessons learned in developing the software, (2) making the software available in various forms (web service, commercial download, open source, etc)
- Linking to related feeds
- Providing derived feeds, reformatting, correcting, or otherwise improving or retargeting this data
- Providing merged feeds, which process this data along with other data, to create something new
- Summary/Analysis feeds
- Review, commentary, support for all these related items
- Career and project-funding opportunities related to this feed
- Social events related to this feed
There’s lots more to be said here, but … deadlines are the antidote to perfectionism. So I’ll stop here for now. Comments welcome. If this stuff really interests you, please consider joining the W3C eGovernment Interest Group. (It’s an open group; W3C membership is encouraged but not required.)