I’m disappointed in the pace of development of the Semantic Web, and I’m optimistic that the Lean Startup ideas can help us move things along faster.

I’ve been a fan of Eric Ries and the Lean Startup ideas for a while, but last night I was lucky enough to get to see him speak, and to chat with some other adherents. There are a lot of ideas here, but the bit that jumps out at me today is this, loosely paraphrased:

Reality distortion fields are bad. Instead of using charisma, style, and emotions to motivate your colleagues to act on faith, motivate them with experimental evidence.

I think we have scant evidence that the Semantic Web will work, and that most of us have been working on this as an act of faith. We believe, without solid evidence, that it can work and will be a good thing when it does. You could say we’re operating in an RDF (resource description framework) RDF (reality distortion field).

The Lean Startup methodology says that we should get out of that field as quickly as possible, doing the fastest experiments possible that will teach us what really works and does not work. On faith we can do 5+ year projects, hoping to show something interesting. Instead, we should be doing <3 month projects to test a hypothesis about how this is all going to be useful.

It’s a shame that most of us are funded in ways that don’t support or reward this at all. It’s a shame the research funding agencies operate on such a glacial and massive scale; in many ways they seem geared more towards keeping people busy and employed than actually innovating and producing knowledge for the world.

Below are my notes taken during Eric’s talk. I have not cleaned them up at all, so you can see just how badly my fingers spell “entrepreneur” when my brain has moved on to something else. I believe slides and the talk itself are available online; it’s a talk he often gives, so if you have the time, watch it instead of just skimming my notes (e.g. this one at Stanford). Someone else with much better formatting and spelling posted their notes from last night’s talk. You probably want to read them instead, and then come back here and share your insights with us.

$$ Thu Jan 27 18:20:41 EST 2011 ((( EricReis

2 yrs ago at Hobies in palo alto, 6 people, first talking about this...

silicon valley is parochial, rarely getting out of the bublle.


new conversation -- what is entrepreneurship.

strsaight from unhear of to overhypes, without people having learned about it.

put entreneurship on a more solid footing.

What is a startup?

     A startup is a human institution deseigned to delivera a new
     product or service under conditions of extreme uncertainty.

Nothing to do with size of company, sector of the economigy, or

ALL THE BORING STUFF, and how to get better at it.

    Startup = Experiment

Web 2.0 chart --- lots failed at 3 years.

they all failed for BAD reasons.

and how many really lived up to their potential....???!!!   SO FEW.

"If you do everything I did, you can fail like I did."

We need a giant industrial support group.

"Hi, I'm eric, and most of my startups failed."

It's all Taylor's fault.  :-)
father of scientific management.

1911.   birth of management
"In the past, the man was first.  In the future, the system will be first."

"Work should be done efficiently"
"Work should be divided into tasks"
"It's possible to organize craftsmen"
   Management by exception -- only have them report their exceptions.

  Now, decomposing work into tasks is 100 years old.

Everything in this room was constructed under the supervision of managers trained by Taylor and his disciples.

Shadow Beliefs:
   * We know what customers want   (reality distortion field)
   * We can accurately predict the future
          just dont believe the hockey stick spreadsheet
   * Andvancing the plan is progress
         eg keep everyone busy, write code, do your functional job!
	 -- if we're building something no one wants, is it progress!

           [[ NO -- real progress is LEARNING ]]

The Lean Revolution   (Lean Manufacturing)
    W E Demming,    Taiichi Ohno

it's not Tim Quality Money -- pick two

we can get all three by being customer focused.

Agile Development

Alas, Agile development comes out big IT departments.

works IF you know what the customer really needs.

Steve Blank.    
     Customer Development
     Agile (Product) Development

imvu story
      im networks -- join them all.

      he wrote this, in 6 months, to ship to customers.
      5 years before that.

      had to pivot to standalone network.

      GREAT code, but no one wanted it.

claim to have learn something -- about to get fired.  :-)

learning is a 4 letter word in management
     -- bad plan -- fired
     -- failure to execute -- fired

Ask yourself: IF my goal was to learn this, could I have done this
without writing the code?

YES --- just make the landing page!!!

   you do whatever you need to to get there

Entreprenursip is management

  + OUR GOAL is to create an institution, not just a product
  * traditona lamangement preactices fail.  (mba)
  * nee entrepeurial managemt -- working under extreme uncertainyu

The Pivot --- SUDDENLY overhyped.

YOU MUST be able to do this.    The successes can do this.
They can find the good ideas from the bad, inside the distortion field.

   how many pivots you have left.

   if we can reduce the time betwwen pivts,
   we can increas the odds of our success.


startup=    turns ideas into code


MINIMIZE total time through the loop.  Cycle time dominates.

gnl mgmt is about efficiency, not cycle time.

GOING THROUGH THE LOOP -- thats how you you settle arguments between founders.

How much design -- a reasonable balance.

FIVE principals.

   -- entreps are everywhere
         anywhere we seek out uncertainty
	 which is everywhere, given uncertainty from IT rev.

   -- entrp is mgmt
   -- validated learning

   -- innovation accounting

        normally just compliance reporting

	but:  drive accountablity -- hold mgrs accountable
	GM = "std volume"  to compute how many cars each division
	is expected to sell.   allows gm to give bonuses.

	NOT good for entreps.

	"success theater"      (cumulative total registrations  Heh.)


	facebook per-customer behaviors were exciting.  Customers were
	heavily engaged in voluntary exchange with company.  And very
		NEED an accounting paradigm for entrps to prove
		they've done validated learning.

		so you never take credit for random stuff, but only
		take credit for what you derve it for.

   -- build measure learn

HOW do we know when to pivot?

    as if it were obvious when there's a failure.

    land of the living dead.   

    persevere straight into the ground.

    Right answer: (acocunting)
    	  pattern, like in science, 
	  when the experiments are no longer very productive.
	  If When we can't move the needle very much.

Vision, Strategy, or Product
	- what makes a great company?
	500 Auto companies before Ford!!!
	they didnt have the right process.

	Vision doesnt change.  it's about changing the world.
	Strategy is how to build a business around that.

	product dev == optimization

	pivot is changing strategy, not vision.

	THERE is not testing The Vision.  We're NOT trying to
	elimintate vision.

What should we measure?  How do products grow?

     Entrp. accounting

Are we creating value?

What's in the MVP?

       - should a feature be in or out?  Out.

Can we go faster?





$$ Thu Jan 27 19:53:19 EST 2011 

How do you keep engineers having faith in the process given MVP.

How to manage engineers under uncertainty?

    1.   Keep them calm.   Heads down, cranking out code.

    [reeks of frd taylor.]

    2.   Enlist all functions in process of discovering if on right track.
	 ABANDON Reality Distorion Field.

	 People will be way more creative if they know what's going on.

	 The truth will set you free.


Q: Newbie: use of the term "movement"

Eric: I dont want other people to to be doing this.

Eric: I used to be a coder.   What do I do now???

there IS something going on worldwide.   this is science, not religion.

lets be careful.

  if it works for entrepreneur, it's part of Lean Startup.

We're learning a lot over the past two years.

The movement is not me -- the movement is you guys testing these ideas, in changing the world.

this is NOT about proprietary advantage.

Eric used to think the right way to change the world was get the VCs to
evangelize.   Sooooo dumb of me.

vc:  "im not that interested in improving the world, just my profolio"

But now, we should do science.  If we all do it, we'll all improve the
world and live in a better world.

Q: how to test ideas people are not searching for.

eg dropbox -- no one knows they want it.

If customers dont know they have the problem and know the name of it,
you have to find a new way.

at imvu, people didnt know it "outbound is the new inbound" we did ad
compaigns, $5/day, buying keywords of every adjacent product, "*" +
chat.  And drove people to our landing page.  We wanted to learning
the differeences in convertion between these channels.

dropbox's MVP was a video, aimed at DIGG users.  Drove people to
waiting list, beta users.
   justin tv  sl conf  video



how much do MBAs need to re-learn?

er: I'm doing entreprenur in residence at harvard business school.

but why waste time with MBAs?

"what do people say about us when we're not in the room?"

MBAs have one big advantage:  very process and discipline oriented.

if you dont have some failures, yo dont learn.

you need to be able to tell what change to the product/market caused


Some new stuff:

the right things to measure are clear and consistent across all startups.

 1.  value test -- do you know it creates values

 2.  working enging of growth.

two feedback loops:
   -- eg loop in cylendar engine, and driver-and-surroundings
   write down how to get to work == taylor plan

three engines of growth

  -- paid.   you make a $1 per customer, and they cost $0.50 to buy
      (have to be able to buy customers)

  -- viral.   as a necessary consequence of customer using it, they
    get their friends to use it.  "someone has tagged you in a photo"
    you HAVE to click on that.   even some fashion busineses.  
    they "grow themselves" bye xploiting bug in human naturo

   -- sticky, engagment.  addictive, network effects, lockin, ebay,
      wotw, compounding interest.  so small viral can compound it, if
      sticky enough.


easily replicated product, get to market first?

fear: someone will steal my idea.

So: take your second best idea, and try to get someone to steal it.
TRY to get them to steal your idea.    PEOPLE dont steal ideas.


You need a good idea.

Threat by big company to clone you -- they poached a co-founder --
came out with exacty product two years delayed.  $100m failure for

FIRST mover advantage is very rare in reality.     (!!!)


one person from each company.   how to get whole company to buy in?

people often say: that's a really great idea for someone else to do.

the issue is the WORK is a system.  Your company is a perfect robust
system, stable.  very very hard to change -- must be planned

try to find one area where there is painful uncertainty, and say there
is a community of people trying science to solve this problem.

every nods at maximize speed through the loop.

BUT if you do that, you will MAKE PEOPLE FEEL INEFFICNENT.  People
will be interrrupted to do things "not their job".  There will be a
team powwow where people say they hate it making them less efficient.
NEED people bought into theory -- understanfing the value of VALIDATED
LEARNING.  Only do this where you have authority, maybe just yourself.


How does Lean Startup affect managment & sales force.  Lab Equipment
company. --- "sales people are whiners" Really, customers were giving
great feedback, and non of that was making it back to management.

4 steps to ephinary -- steve blanks book -- perfect on enterprise sales.

YOU CANNOT DELEGATE customer development.  founders and senior mgmt
have to be in the room with the customers, at least some of the time.

Salespeople arent supposed to be good listeners.

Mgrs should DO THE SALES THEMSELVES.  THe goal is not to make money,
it's to get validated learning....

ONCE we understand how to do the sales, THEN give it to the salesfolk,
as per Steve Blank.

if using sales force, you are doing Paid Engine Of Growth.


Q: I have a product that people havnts paid for.  mainstream product.
personal keepsake.  needs to look really good.  hard to measure.
style counts.  what's mvp in that scenario.

why cant you do a landing page.

  keepsake book.

goal of MVP --- least amount of work to learn what needs to be
learned.  such as whether customers will pay.   eg get pre-orders.

you can always test demand through fake landing pages.

"Concierge" from food-on-the-table.  did it all by hand, until they
figured out what folks REALLY wanted.

PEOPLE will not truthfully answer what they would do.  "Would you buy
this" turns out to be TERRIBLE DATA.


requires VERY difficult risking rejections.     CUSTOMER SERVICE HURTS!!!

* Eric *HATEs* customer feedback *

does he really want to know what you thought about today's talk?  No!!!


When collective feedback, its NOT ABOUT YOU, it's about the person
giving it to you.

"product is okay" means "product sucks but I'm polite"


very very hard.      but rewards are emense.

think about all the people utterly wasting their time.

let's redeem them;  make it happen!.

$$ Thu Jan 27 20:35:27 EST 2011 )))

Explaining Linked Data

June 8, 2010

I’m going to try explaining linked data again, tonight, at the Cambridge Semantic Web Gathering. I will attempt to keep it simple, while still covering the important details. We’ll see how it goes.

My slides, for people who want a peek ahead of time, are here.

ETA: Thanks to Marco Neumann, there’s a video of my talk. I’m pretty happy with it. Also, my Venn Diagram slide, which you’re welcome to re-use with credit.

Sometimes, if you stand in the right place and squint, JSON and RDF line up perfectly. Each time I notice this, I badly want a way to make them line up all the time, no matter where you’re standing. And, actually, I think it’s pretty easy.

I’ve seen a few proposals for how to work with RDF data in JSON, but the ones I’ve seen put too much burden on JSON folks to accommodate RDF. It seems to me we can let JSON keep doing what it does so well, and meanwhile, we can provide bits of RDF which can be adopted when needed. Instead of pushing RDF on people, allow them to take the parts they find useful.

In thinking about it, I’ve come up with six things RDF can do that are not standard parts of JSON. These are things one can do with JSON, of course, but not in any standard way. My suggestion is that these bits of functionality be provided in an RDF-compatible way (as I detail below), so that the JSON world and the RDF world can start to really play well together.

I’m interested to hear what people think of this. Blog comment, email to sandro@hawke.org (maybe cc semantic-web@w3.org?), or catch me in the halls at SemTech. I expect this general topic of RDF-meets-JSON will be discussed at the RDF Next Steps workshop, and if the stars line up right, maybe we can get a W3C Recommendation in this space in the next year or so. Let’s call this particular proposal JRON 0.1 (Javascript RDF Object Notation), not “Sandro’s Proposal”, so I can be freer to like other designs and be properly neutral.

Step 0: Start with ordinary JSON

In general, JSON and RDF are very similar, although they are usually described using different terminology. Of course, they both have strings and numbers. They both have a way of encoding a sequence of items: arrays in JSON, lists in RDF (some details below). The main structuring is around key-value pairs, a set of which JSON calls an ‘object’. In RDF we call it the “subject” and focus on its connection with each key-value pair; the three together form an RDF triple.

The point here is that ordinary JSON structures correspond to an important subset of RDF. They don’t exactly match that subset because RDF uses namespaces, as detailed in step 5 below. The other steps below show the ways in which JSON is a subset of RDF. If one takes all the steps here, using JSON with these conventions, one has full RDF.

So, here are the steps. Steps 1-3 are pretty simple and not very interesting. They address everyday concerns in data processing. Steps 4-6 may be a little more surprising if you’re not familiar with RDF.

Step 1: Allow Extended Datatypes

Why: For datatypes, JSON only has strings, numbers, booleans. Sometimes people want to store and manipulate other datatypes, such as dates, or application-specific datatypes.

How: RDF uses XML’s datatype mechanism, where data values are conveyed as a pair of items: a lexical representation (a sequence of characters) and a datatype identifier (a sequence of characters which happens to be a URI). Each datatype is a mapping from strings (lexical representations) to values; the datatype identifier tells us which datatype is to be used to interpret this particular representation.

In JRON, we represent this pair like this:

{ "__repr": "2010-03-06",
  "__type": "http://www.w3.org/2001/XMLSchema#date" }

You can put this as a value in a list or in a key-value pair, just like a string or number.
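As a rough illustration of how a consumer might turn such a pair back into a native value, here is a minimal Python sketch. The `DECODERS` table and the `decode_typed` function are hypothetical names invented for this example, not part of any existing library, and the date handling is simplified (it ignores the optional timezone that the XML Schema date type allows).

```python
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical decoder table: datatype IRI -> function from lexical form to value.
DECODERS = {
    XSD + "date": date.fromisoformat,   # simplification: no timezone support
    XSD + "integer": int,
    XSD + "boolean": lambda s: s in ("true", "1"),
}

def decode_typed(value):
    """Turn a {"__repr", "__type"} pair into a native value when possible."""
    if isinstance(value, dict) and set(value) == {"__repr", "__type"}:
        decoder = DECODERS.get(value["__type"])
        if decoder is not None:
            return decoder(value["__repr"])
    # Unknown datatype or ordinary JSON value: leave it untouched.
    return value

literal = {"__repr": "2010-03-06", "__type": XSD + "date"}
decoded = decode_typed(literal)
```

Values with a datatype the table does not know are passed through unchanged, so nothing is lost when an application meets a type it has no local object for.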

RDF doesn’t restrict which datatypes are used. Some recent standards work selected this list as the set people should implement.

Personally, I’m not sure users need to be able to extend datatypes. I see dates being important, but otherwise I’m not convinced. Still, it’s in RDF, and I like compatibility, so it’s here.

Step 2: Allow Language Tags

Why: When you have text available in several different languages, language tags provide a way to select which of the available strings, if any, matches the language preference of the user.

Also: Text-to-speech systems can handle text better if they know which natural language to use in pronouncing the text.

How: RDF allows language tags on string literals. In JRON, we use a pair like this:

{ "__text": "chat",
  "__lang": "fr" }

Commentary: Personally, I’ve never liked this bit of RDF. I feel like there are better architectures for handling language tagging. But there was a vocal community that felt this was essential, so it’s in the standard. I gather some people like it, and I haven’t seen a good counter-proposal.
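For the "Why" above, a minimal Python sketch of picking a string by language preference might look like this. The function name `pick_text` is made up for illustration, and it only does exact, case-insensitive tag matching; real language-range matching (where "en" would also accept "en-GB") is left out.

```python
def pick_text(values, preferred_langs):
    """Return the first string whose language tag matches the user's preferences."""
    for lang in preferred_langs:
        for v in values:
            if isinstance(v, dict) and v.get("__lang", "").lower() == lang.lower():
                return v["__text"]
    # Fall back to an untagged plain string, if there is one.
    for v in values:
        if isinstance(v, str):
            return v
    return None

labels = [
    {"__text": "chat", "__lang": "fr"},
    {"__text": "cat", "__lang": "en"},
]
english = pick_text(labels, ["en"])
french_fallback = pick_text(labels, ["de", "fr"])
```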

Step 3: Allow Non-Tree Structures

Why: Sometimes your data is not tree structured. Sometimes you have an arbitrary directed graph, such as when representing a social network.

How: In RDF, an arbitrary “node id” is available for making non-tree structures. We can do the same in JRON, saying any object may have a node id, and if it does, the object is considered the same as all other objects with the same node id. Like this bit of JSON saying my friend Eric and I both know each other:

   { "foaf_name": "Sandro Hawke",
     "foaf_knows": { "__node_id": "n102" },
     "__node_id": "n334" }
   { "foaf_name": "Eric Prud'hommeaux",
     "foaf_knows": { "__node_id": "n334" },
     "__node_id": "n102" }

In the above example, the objects representing me and Eric are given node ids, and then those node ids are used to make the links to each other. We could also do this with only one node id, but we still need at least one:

   { "foaf_name": "Sandro Hawke",
     "foaf_knows": { "foaf_name": "Eric Prud'hommeaux",
                     "foaf_knows": { "__node_id": "n334" } },
     "__node_id": "n334" }
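To make the "considered the same" rule concrete, here is a small Python sketch that links objects sharing a node id into one in-memory graph. The function name `merge_nodes` is an assumption for this example, and it makes one simplification: if two objects with the same node id give different values for the same key, the later one simply wins.

```python
def merge_nodes(objects):
    """Link objects that share a __node_id into a single in-memory graph."""
    nodes = {}  # node id -> the one shared dict for that node

    def visit(obj):
        if isinstance(obj, list):
            return [visit(x) for x in obj]
        if not isinstance(obj, dict):
            return obj
        node_id = obj.get("__node_id")
        # Objects with an id share one dict; anonymous objects get a fresh one.
        node = nodes.setdefault(node_id, {}) if node_id is not None else {}
        for key, value in obj.items():
            if key != "__node_id":
                node[key] = visit(value)  # simplification: later values overwrite
        return node

    return [visit(o) for o in objects]

sandro, eric = merge_nodes([
    {"foaf_name": "Sandro Hawke",
     "foaf_knows": {"__node_id": "n102"},
     "__node_id": "n334"},
    {"foaf_name": "Eric Prud'hommeaux",
     "foaf_knows": {"__node_id": "n334"},
     "__node_id": "n102"},
])
```

After merging, following `foaf_knows` from either object reaches the other, so the result is a cyclic graph rather than a tree.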

Okay, those were the ordinary three things to add to JSON. Here are the interesting three:

Step 4: Allow Cross-Document Structures

Why: Sometimes, there is useful, relevant data available on the Web but it’s not part of the current JSON document. We would not want all the Web pages in the world to be gathered into one big Web page; similarly, it’s good to keep data in different documents. But that shouldn’t stop us from easily combining the data, and keeping the links intact.

How: RDF allows IRIs (unicode web addresses) to be used as node identifiers. They are like node ids, except they work across multiple documents; they are globally unambiguous identifiers, and systems can use Web protocols to dereference them to get other useful information.

In JSON, we can do this:

     { "foaf_name": "Sandro Hawke",
       "__iri": "http://www.w3.org/People/Sandro/data#Sandro_Hawke" }

Commentary: So why do we still need __node_id? Because sometimes it’s a pain to make up a good IRI. Some people prefer to always use IRIs, avoiding node_ids in their data, and that’s fine.

Step 5: Put Keys in Namespaces

Why: When data is coming from different sources across the Web, it’s not practical to get all the sources to agree on all the terminology. Instead, by using Web addresses (URLs/IRIs) as our keys, we allow individuals and organizations to make their own decisions. They can decide how much to share their vocabularies, and they avoid accidental name collisions. The web address also provides a handy link to documentation, community sites, schemas, etc.

How: It’s awkward to use whole, long http IRIs everywhere, so as in many RDF syntaxes, JRON has a prefix expansion mechanism, like this:

    { "foaf_name": "Sandro Hawke",
      "__prefixes": {
         "foaf_" : "http://xmlns.com/foaf/0.1/" } }

Here the key “foaf_name” gets expanded into “http://xmlns.com/foaf/0.1/name”, which serves as a unique-on-the-Internet identifier for a particular conceptualization of names.

Commentary: Although I’ve left it almost to the end, this is the one mandatory part of this proposal. All the other elements are only present when required by the data. The null JRON document is: {"__prefixes":{}}

Others have suggested this part can be optional, too, by having a set of standard prefixes for a given API. I’m not entirely opposed to that, but I’m concerned about how those defaults would be communicated in practice.

Also, I’m not sure there’s consensus on what character to use in the short name: should it be foaf_name, foaf.name, foaf:name, or what? The mechanism here is that you can use whatever you want: the __prefixes table keys are matched longest-first. If there’s an entry with an empty string, that provides a default namespace.
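The longest-first matching rule is easy to sketch in Python. The function name `expand_key` and the `http://example.org/default#` namespace are invented for this example. One assumption not stated above: keys beginning with `__` are treated as reserved and never expanded, since otherwise a default namespace would swallow them.

```python
def expand_key(key, prefixes):
    """Expand a short key into a full IRI using the longest matching prefix."""
    if key.startswith("__"):
        return key  # assumption: reserved JRON keys are never expanded
    for short in sorted(prefixes, key=len, reverse=True):
        if key.startswith(short):  # "" matches any key: the default namespace
            return prefixes[short] + key[len(short):]
    return key  # no prefix applies

prefixes = {
    "foaf_": "http://xmlns.com/foaf/0.1/",
    "": "http://example.org/default#",  # hypothetical default namespace
}
full_name_key = expand_key("foaf_name", prefixes)
default_key = expand_key("age", prefixes)
```

Because the table is scanned from the longest short name to the shortest, the empty-string entry is only reached when nothing more specific matches.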

Step 6: Allow Multiple Values Per Key

Why: Sometimes it makes sense to have more than one value for some property. For instance, as it turns out, I have more than one friend. I could use a single-value ‘list-of-friends’ property, but sometimes it makes more sense to use a ‘friend’ property that has multiple values. In particular, if we’ll be learning who my friends are from multiple sources, and we were using lists, what order would we put the resulting combined list in?

How: We still just use JSON lists, but we indicate that the order does not matter, so the values can be merged arbitrarily:

   { "foaf.name": "Sandro Hawke",
     "foaf.knows": { "__values": [
                     { "foaf.name": "Eric Prud'hommeaux" },
                     { "foaf.name": "Dan Brickley" },
                     { "foaf.name": "Matt Womer" } ] } }
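Here is a small Python sketch of why unordered values make merging from multiple sources straightforward. The helper names `values_of` and `merge_values` are hypothetical, and duplicates are detected by plain equality of the JSON values, which is a simplification.

```python
def values_of(obj, key):
    """Return the values for key as a list, whether single or wrapped in __values."""
    if key not in obj:
        return []
    v = obj[key]
    if isinstance(v, dict) and "__values" in v:
        return list(v["__values"])
    return [v]

def merge_values(a, b, key):
    """Combine the values from two sources, dropping duplicates; order means nothing."""
    merged = []
    for v in values_of(a, key) + values_of(b, key):
        if v not in merged:
            merged.append(v)
    return {"__values": merged}

source_a = {"foaf.name": "Sandro Hawke",
            "foaf.knows": {"__values": [{"foaf.name": "Eric Prud'hommeaux"},
                                        {"foaf.name": "Dan Brickley"}]}}
source_b = {"foaf.name": "Sandro Hawke",
            "foaf.knows": {"foaf.name": "Matt Womer"}}
combined = merge_values(source_a, source_b, "foaf.knows")
```

With ordered lists there would be no principled way to interleave the two sources; with `__values` any order is as good as any other.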

Closing Thoughts

That’s it. Those are the six things that RDF does that normal JSON doesn’t do. Did I miss something?

The API I’m imagining (but haven’t built yet) would have a few features like:

jron_reprefix(tree, desired_prefixes)
Returns another JRON tree with all the prefixes matching the ones provided here. If you’re going to use foaf, for instance, you probably want to set a prefix like “foaf.” for foaf, so your code can expect it.
jron_merge_nodes(tree) and jron_treeify(tree)
convert a tree (suitable for transmitting) from/to a graph (suitable for use in memory)
Would convert all the __type/__repr objects into suitable local objects, if they exist. Maybe even date/time objects, if there’s a suitable library installed for those.
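Since this API has not been built, the following is only one guess at how `jron_reprefix` might behave, written as a Python sketch. It assumes `__prefixes` appears only on the top-level object, that keys starting with `__` are reserved and left alone, and that any key whose IRI matches none of the desired prefixes is kept as a full IRI.

```python
def jron_reprefix(tree, desired_prefixes):
    """Return a copy of tree whose keys use desired_prefixes (a sketch, not a spec)."""
    old = tree.get("__prefixes", {})

    def expand(key):
        # Longest short name first, as in the __prefixes matching rule.
        for short in sorted(old, key=len, reverse=True):
            if key.startswith(short):
                return old[short] + key[len(short):]
        return key

    def compact(iri):
        # Prefer the desired prefix with the longest matching namespace IRI.
        by_ns_length = sorted(desired_prefixes.items(),
                              key=lambda item: len(item[1]), reverse=True)
        for short, namespace in by_ns_length:
            if iri.startswith(namespace):
                return short + iri[len(namespace):]
        return iri  # assumption: unmatched keys stay as full IRIs

    def walk(node):
        if isinstance(node, list):
            return [walk(x) for x in node]
        if isinstance(node, dict):
            return {(k if k.startswith("__") else compact(expand(k))): walk(v)
                    for k, v in node.items() if k != "__prefixes"}
        return node

    result = walk(tree)
    result["__prefixes"] = dict(desired_prefixes)
    return result

tree = {"foaf_name": "Sandro Hawke",
        "foaf_knows": {"foaf_name": "Eric Prud'hommeaux"},
        "__prefixes": {"foaf_": "http://xmlns.com/foaf/0.1/"}}
reprefixed = jron_reprefix(tree, {"foaf.": "http://xmlns.com/foaf/0.1/"})
```

The input tree is left unmodified, so a caller can keep the original for retransmission while working with the reprefixed copy.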

One technical issue for RDF folks:

Should JSON arrays be considered RDF Lists or RDF Sequences? Perhaps they default to RDF Lists but there can be an option flag in the top-level object:

 { ...
   "__json_array_is_rdf_seq": true }

When that flag is absent or false, arrays would be considered RDF Lists. My sense is no one needs to use both. Maybe soon we’ll know if RDF Sequences can finally be deprecated.

I’m just past halfway through nine consecutive days of all day meetings, clustered around the W3C’s semi-annual Advisory Committee meeting. When this series started, I didn’t have any particular responsibilities, beyond attending. But then John Sheridan had to cancel his trip, and I was tapped to stand in for him on panels at both the AC meeting (this morning) and FOSE tomorrow.

My slides don’t stand on their own very well, but here they are anyway. There might be video of the FOSE talk, later. And if you’re in Washington, by all means, drop by!

1. Why linked data is important for government open data initiatives. This was for a panel showing how Linked Data is being adopted, in various ways, in various industries, for an audience (W3C staff and member representatives) who were, perhaps, tired of hearing about the promise of the Semantic Web.

2. How To Do Open Data (in 10 minutes). This is for an audience of federal IT managers/vendors who are mostly unfamiliar with the concept of RDF.

RDF meets NoSQL

March 9, 2010

On Thursday, I have 20 minutes to address 200 people (plus a video audience) at NoSQL Live … from Boston. My self-appointed mission is to start building bridges between the NoSQL community and the Linked Data/RDF/W3C community. These are two sets of people working on different problems, but it’s pretty clear to me they are heading in the same direction, in similar spirit, and could gain a lot from working together.

I’m organizing my talk around the question of standardization for NoSQL, and I’ll talk about W3C process and such, but the interesting part is where NoSQL touches RDF. So here are some of my thoughts on that, written for an audience already familiar with RDF. I’d love to get some feedback on the basic ideas now, to make my talk better.

  1. While they both want to move beyond SQL, their reasons are different. I understand the key for NoSQL is:

    • SQL doesn’t scale big enough. Once your read-write dataset gets too big for a single machine, you have to develop and maintain a messy sharding system. (But see below.) Sometimes it’s an economy/efficiency thing; if you don’t need ACID, a NoSQL solution might give you better performance on cheaper hardware.

    Meanwhile, I see two big motivations for the RDF folks:

    • Decentralization. On the open Web, data comes from many different sources, mostly beyond your control. Everyone might be providing data to everyone, and no one has the authority to run a central database, even if they had the technology and the iron.

    • Inference. Some people find formal semantics and well defined inference to be very useful. (How is this different from triggers and views? Good questions.)

  2. There’s a freedom, a joy, and in some cases enormous practicality to not having to pre-create your database schema. Nearly all NoSQL and RDF systems are schemaless or allow a fully dynamic schema. (But I understand most bigtable systems have fixed column families. That could be a problem in using a bigtable as an RDF store.)

  3. Both tend to follow Web Architecture, using HTTP and often using REST.

  4. Reading about CouchDB and MongoDB (JSON document databases), as well as Neo4j (a graph database), I noticed an undercurrent about SQL being awkward to program against. I guess this is the O/R Impedance Mismatch, especially the structural differences. RDF’s design is very close to the relational model, so it doesn’t help on this front. Within the RDF community, however, there are some systems which attempt to partly bridge this gap, including node-centric APIs (which I happen to prefer, myself). I would also argue that duck typing closes the gap from the programming-language side.

  5. I don’t see NoSQL going anywhere near linked data or having a vocabulary ecosystem. I expect it will want to, someday. Decentralization is a key difference in requirements.

  6. I only see RDF dealing with inference. Why is this? My first thought is that the RDF community has a lot of AI roots, and the NoSQL community doesn’t. But maybe it’s about economics and motivations: formalizing the notion of inference makes it possible, in theory, to easily deploy very sophisticated data transformations (and making them complete before the sun goes out). At NoSQL scale, folks are much more concerned about techniques for being able to run even simple transformations (and making them complete before the power bill comes due). I note that AllegroGraph manages to be in both communities, with a very practical, high-performance Prolog element; I don’t know how parallel it is. A few RDF folks are working on using map/reduce. Presumably, with the rise of multi-core systems, even single-user inference engines will want to be made parallel.

  7. How many of the reasons for NoSQL rejecting SQL also apply to SPARQL? Does the scaling issue apply? Actually, does the scaling issue really apply to SQL? Michael Stonebraker (more or less the Voice of God) claims automatic sharding can and should be done while still using SQL. Some people reply: Perhaps, but in Enterprise-Grade Open Source? Also, maybe “SQL” is a euphemism for ACID, and that’s really what doesn’t scale and/or is too expensive. Perhaps that issue needs to be settled before considering SPARQL? Actually, the state of ACID in SPARQL is an open issue right now; maybe NoSQL can inform that decision.

  8. RDF is standardized. Some would argue it’s more standardized than SQL; that case will be stronger when SPARQL 1.1 is done. Here’s a diagram Steve Harris made, which I reformatted. “KV” refers to key-value stores, the simplest, most scalable kind of NoSQL database.

    [Diagram: SPARQL is more standardized than SQL, which is more standardized than key-value stores, while KV and SPARQL scale more than SQL.]

So… What does that all boil down to?

Bottom line: RDF could learn a lot from NoSQL about scaling and ease-of-programming; NoSQL could learn a lot from RDF about decentralization and inference.

Some closing questions, ideas….

Can someone make a SPARQL endpoint with Cassandra’s performance and scaling properties? I haven’t studied this idea much, but I’m afraid the static column families will make it impossible to get much performance without building the store for a particular set of SPARQL queries. But it could still be useful, even with that drawback. Or maybe some SPARQL endpoint is already there; has anyone really tried a comparative benchmark?

How does the SPARQL endpoint description and aggregation work compare to database sharding? Are there designs for doing it automatically?

RDF 2 Wishlist

October 30, 2009

Here’s what I think should be standardized at some point, soon, in the Semantic Web infrastructure. These items are at various levels of maturity; some are probably ready for a W3C Working Group right now, while others are in need of research. They are mostly orthogonal and most can be handled in independent efforts. (I would lean against forming a single RDF Working Group to handle all of this; that would be slower, I think.)

To be clear, when I say “RDF 2” I mean it like OWL 2: an important step forward, but still compatible with version 1. I’m not interested in breaking any existing RDF systems, or even in causing their users significant annoyance. In some traditions, where the major version number is only incremented for incompatible changes, this would be called a 1.1 release. In contrast, at W3C we normally signal a major, incompatible change by changing the name, not the version number. (And we rarely do that: the closest I can think of is CSS->XSL, PICS->POWDER, and HTML->XHTML). The nice thing about using a different name is it makes clear that users each decide whether to switch, and the older design might live on and even win in the end. So if you want to make deep, incompatible changes to RDF, please pick a new name for what you’re proposing, and don’t assume everyone will switch.

This is partially a trip report for ISWC, because the presentations and especially the hallway and lounge conversations helped me think about all this.

Note that although I work for W3C, this is certainly not a statement of what W3C will do next. It’s not my decision, and even if it were, there would be a lot of community discussion first. This is just my own opinion, subject to change after a little more sleep. Formally the decisions about how to allocate W3C resources among the different possible standards efforts are made by W3C management guided by the folks who provide those resources, via their representatives on the Advisory Committee (AC). If the direction of the W3C is important to you or your business, it may be worthwhile to join and participate in that process.

1. RDF and XML interoperation

There’s a pretty big divide between RDF and XML in the real world. It’s a bit like any divide between different programming languages or different operating systems: users have to pick which technology family to adopt and invest in. It’s hard to switch, later, because of all the investment in tools, built systems, education, and even social networks. (People who use some technology build social and professional relationships with other people who use the same technology. Thus we have an XML community, an RDF community, etc. Few people are motivated to be in both communities.)

I think we should have better tools for bridging the gap, technologically, so that when data is published in XML, it’s easy for RDF consumers to use it, and when the data is published in RDF, it’s easy for XML consumers to use it.

The leading W3C answer is GRDDL, which I think is pretty good, but could use some love. I’d like to see support for the transforms being in Javascript, which I think is probably the dominant language these days for writing code that’s going to run on someone else’s computer. It certainly has a bigger community than XSLT. I’d probably support Java bytecode, too.

I would also like to see some way to support third-party GRDDL, where the transform is provided by someone not associated with either the data provider or data consumer. Nova Spivack gave a keynote where he talked about this feature of T2. They’re focused on HTML not XML, but the solution is probably the same.

Beyond GRDDL, I think there’s room for a special data format that bridges the gap. I’ve called it “rigid rdf” or “type-tagged xml” in the past: it’s a sub-language of RDF/XML, or a style of writing XML, which can be read by RDF/XML parsers and is also amenable to validation and processing using XML schemas. Basically you take away all choices one has in serializing RDF/XML.

I note that the Cambridge Communiqué is ten years old this month. It proposed schema annotation as an approach, and that’s not a bad one, either. I haven’t heard of anyone working on it recently, but maybe that will change if the XML community starts to see more need to export RDF.

Amusingly, while I was talking to Gary Katz from MarkLogic about this, he mentioned XSPARQL as a possible solution, and I pointed out Axel Polleres (xsparql project leader) was sitting right next to us. So, they got to talk about it. XSPARQL doesn’t excite me, personally, because I don’t use either SPARQL or XQuery, but objectively, yes, it might solve the problem for some significant userbase.

2. Linked Data Inference

For me, an essential element of a working Linked Data ecosystem is automatic translation of data between vocabularies. If you provide data about the migration of frogs in one vocabulary, and my tools are looking for it in another one, the infrastructure should (in many cases) be able to translate for us. We need this because we can’t possibly agree on one vocabulary (for any given domain) that we’ll all use for all time. Even if we can agree for now, we’ll want this so that we can migrate to another vocabulary some time in the future.

Inference using OWL (and its subsets like RDFS) provides some of this, but I don’t think it’s enough. RIF fills in some more, but the WG did not think much about this use case, and there might be some glue missing. Maybe we can get a WG Note out of RIF to help this along.

I’d like us to be clear about first principles: when you’re given an RDF graph, and you’re looking for more information that might be useful, you should dereference the predicate IRIs to learn about what kinds of inference you’re entitled to do. And then, given resources and suitable reasoners, you should do it. That is, the use of particular IRIs as predicates implies certain things, as defined by the IRI’s owner. The graph is invoking certain logics by using those IRIs. (Of course you can always infer things that were not implied, but as among humans, those “inferences” are really just guesses you are making. They have quite a different status from true implications.)

If this is put together properly, and the logics are constructed in the right form, I think we’ll get the dynamic, on-demand translation I’m looking for. I imagine RIF could be very useful for this, but reasoner plugins written in Javascript or Java bytecode could be a better solution in some cases.
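
To make the idea concrete, here is a rough Python sketch of that loop: look at each predicate, fetch whatever its owner publishes about it, and apply it. Everything specific here is an assumption made up for illustration: the two vocabularies, the rule format (an equivalent predicate plus a flag for swapping subject and object), and the dict that stands in for actually dereferencing IRIs over HTTP. Real rules would presumably be expressed in OWL or RIF rather than a Python table.

```python
# Hypothetical vocabularies; none of these IRIs are real.
VOCAB_A = "http://a.example/vocab#"
VOCAB_B = "http://b.example/vocab#"

# Assumption: this dict stands in for the Web. Each entry is what the
# owner of a predicate IRI might publish about it, reduced to a list of
# (equivalent predicate, swap subject and object?) pairs.
PUBLISHED_RULES = {
    VOCAB_A + "migratesTo": [(VOCAB_B + "destination", False)],
    VOCAB_A + "preyOf": [(VOCAB_B + "eats", True)],
}


def dereference(predicate):
    """Stand-in for an HTTP GET on the predicate IRI."""
    return PUBLISHED_RULES.get(predicate, [])


def infer(graph):
    """Return the graph plus every triple licensed by the published rules."""
    result = set(graph)
    frontier = list(graph)
    while frontier:
        s, p, o = frontier.pop()
        for target, swap in dereference(p):
            new = (o, target, s) if swap else (s, target, o)
            if new not in result:
                result.add(new)
                frontier.append(new)  # new triples may license more inference
    return result


frogs = {
    ("http://a.example/frog/17", VOCAB_A + "migratesTo", "http://a.example/pond/3"),
    ("http://a.example/frog/17", VOCAB_A + "preyOf", "http://a.example/heron/2"),
}
translated = infer(frogs)
```

A consumer that only understands vocabulary B can now query `translated` and find the frog data, even though it was published in vocabulary A.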

Some of my thinking here is in my workshop keynote slides, but later conversations with various folks, especially Pat Hayes and TimBL, helped it along. There’s more work to do here. I think it’s pretty small, but crucial.

3. Presentation Syntaxes

RDF, OWL, and RIF all have hideous primary exchange syntaxes and some decent not-W3C-recommended alternative serializations. I’m not really sure what can practically be done here that hasn’t been done.

At the very least, I’d like to see a nice RDF-friendly presentation syntax for RIF. A bit like N3, I suppose. I did some work on this; maybe I can finish it up, and/or someone else can run with it.

OWL 2 has 3+n syntaxes, where n is the number of RDF syntaxes we have. Exactly one of those syntaxes is required of all consumers, for interchange. I’ll be interested to see how this plays out in the market.

4. Multi-Graph Syntax

Most systems that work with RDF handle multiple graphs at the same time. Sometimes they do this by storing the triples in a quad store, with the fourth entry being a graph identifier. This works pretty well, and SPARQL supports querying such things.
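
For readers who haven’t seen one, a quad store can be sketched in a few lines of Python. This is a toy, in-memory illustration, not how any particular RDF system is implemented; the class name, the method names, and the example.org IRIs are all invented for the example.

```python
class QuadStore:
    """A tiny in-memory quad store: (subject, predicate, object, graph)."""

    def __init__(self):
        self._quads = set()

    def add(self, s, p, o, graph):
        # The fourth entry records which graph the triple belongs to.
        self._quads.add((s, p, o, graph))

    def match(self, s=None, p=None, o=None, graph=None):
        """Return matching quads, sorted; None acts as a wildcard."""
        pattern = (s, p, o, graph)
        return sorted(
            quad for quad in self._quads
            if all(want is None or want == got for want, got in zip(pattern, quad))
        )


EX = "http://example.org/"  # hypothetical namespace
store = QuadStore()
store.add(EX + "alice", EX + "knows", EX + "bob", EX + "graph/hr")
store.add(EX + "alice", EX + "knows", EX + "carol", EX + "graph/crm")
store.add(EX + "bob", EX + "knows", EX + "carol", EX + "graph/crm")

in_crm = store.match(graph=EX + "graph/crm")   # restrict to one graph
alice_anywhere = store.match(s=EX + "alice")   # query across all graphs
```

Restricting a pattern to one graph identifier plays roughly the role that the GRAPH keyword plays in a SPARQL query.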

We don’t have a way to exchange multiple graphs in the same document, however. N3 has graph literals (originally called contexts), and there was some work under the term named graphs, which is kind of the opposite approach.

Personally, I don’t yet understand the use case for interchanging multiple graphs in one document, so I’m not sure where to go with this.

Hmmm. I guess RIF could be used for this. You can write RDF triples as RIF frame facts, and the rif:Document format allows multiple rulesets, each with an optional IRI identifier, in the same document. ETA: RIF also gives you an exchange syntax where you can syntactically put literals in the subject and use bnodes as predicates, if you want. But now you’re technically exchanging RIF Frames instead of RDF Triples.

5. RDF Graph Validation

When writing software that operates on RDF data, it’s really nice to know the shape of the data you’ll find. It’s even nicer if software can check that that’s actually what you got, and if reasoners can work to fill in any missing pieces.

I don’t exactly understand how important or unimportant this is. It’s closely related to the Duck Typing debate. Whatever mechanisms make duck typing work (eg exception handling, reflection, side-effect-free programming) probably help folks be okay without graph validation. But I think folks trained on C++/Java or XML Schema would be much happier with RDF if it had this.

The easiest solution might be using rigid RDF. One could probably also do it with SPARQL, essentially publishing the graph patterns that will match the data in the expected graphs.
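
As a rough illustration of the pattern-publishing idea, here is a Python sketch that checks a graph against a published “shape”. The shape format (a type mapped to its required predicates) and the example.org names are assumptions invented for this example; a real version would more likely publish SPARQL graph patterns, as suggested above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace

# Hypothetical published shape: every node typed ex:Book
# must have at least one ex:author.
REQUIRED = {EX + "Book": [EX + "author"]}


def validate(graph):
    """Return a sorted list of (node, missing predicate) problems."""
    problems = []
    for s, p, o in graph:
        if p == RDF_TYPE and o in REQUIRED:
            for needed in REQUIRED[o]:
                has_it = any(s2 == s and p2 == needed for s2, p2, _ in graph)
                if not has_it:
                    problems.append((s, needed))
    return sorted(problems)


graph = {
    (EX + "book1", RDF_TYPE, EX + "Book"),
    (EX + "book1", EX + "author", EX + "alice"),
    (EX + "book2", RDF_TYPE, EX + "Book"),  # no author: should be reported
}
problems = validate(graph)
```

Here `problems` names `book2` as missing an author, which is the kind of answer a C++/Java or XML Schema programmer would expect to get before their code runs into the gap.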

The most interesting and weird approach is to use OWL. Of course, OWL is generally used to express knowledge and reason about some application domain, like books, genes, or battleships. But it’s possible to use OWL to express knowledge about RDF graphs about the application domain. In the first case, you say every book has one or more authors, who are humans. In the second case, you say every book-node-in-a-valid-graph has one or more author links to a human-node in the same graph. At least that’s the general idea. I don’t know if this can actually be made to work, and even if it can, it risks confusing new OWL users about one of the subjects they’re already seriously prone to get wrong.

6. Editorial Issues

Finally, I’d like some portions of the 2004 RDF spec rewritten, to better explain what’s really going on and guide people who aren’t heavily involved in the community. This could just be a Second Edition, with no need for RDF 2, because no implementation changes would be involved.

I’d like us to include some practical advice about when/how to use List/Seq/Bag/Alt, and reification, maybe going so far as to deprecate some of them (IMHO, all but List). Maybe bring in some of the best-practice stuff on publishing and n-ary relations.

I understand Pat Hayes would like to explain blank nodes differently, explicitly introducing the notion of “surfaces” (what I would call knowledge bases, probably). Personally, I’d love to go one step farther and get rid of all “graph” terminology, instead just using N-Triples as the underlying formalism, but I might be a minority of one on that.

ETA: Of course we should also change “URI-Reference” to “IRI”, and stuff like that.

Okay, that’s my list. What’s yours? (For long replies, I suggest doing it on your own blog, and using trackback or posting a link here to that posting.) Discussion on semantic-web@w3.org is fine, too.

Linked Government Data

September 11, 2009

A few weeks ago, I was drafted to replace José Manuel Alonso as W3C eGovernment Lead (in addition to my Semantic Web standards work). I’ve never worked with eGovernment before, but it turns out that the W3C group working in the area had recently resolved to:

identify ways for governments and computer science researchers to continue working together to advance the state-of-the-art in data integration and build useful, deployable proof-of-concept demos that use actual government information and demonstrate real benefit from linked data integration.

I wish they’d said that with fewer words, but, yes, I do know something about that.

So this week, I was in Washington, wearing a suit (but no tie — this was an O’Reilly crowd), attending the Gov2.0 Expo Showcase and the Gov2.0 Summit. It’s like a forest in spring after a fire, with wildflowers blooming everywhere, and grand, government proclamations about future growth. My overwhelming impression, from events on the stage, was that:

  1. the Federal government is undergoing a massive effort to begin systematically publishing machine-readable data, across all agencies, in a sustainable way, driven from both the top (Obama) and by all the grassroots awareness of the new technology;
  2. state and local governments also see the benefits, and some brave souls are leading the way, often with much success;
  3. there is an army of developers, commercial and hobbyist, open source and proprietary, waiting to develop mind-blowing (and genuinely useful) applications using this data.

I had a little re-adjustment panic, feeling like there was nothing interesting left to do. Like when you show up to help some friends move, and they’re already done and sitting around eating the pizza.

But there’s still lots to do. People who’ve been in Washington a long time assure me it will be many years before a significant fraction of the data is really being published, and, as we well know, getting the data released does not mean it’s easy to use. It’ll be a long time before we have stable feeds of accurate data using appropriate formats and vocabularies.

So how do we get there from here? What should the agencies start doing now, to help ease the transition, and get us there faster? And where, exactly is “there”?

(As if I would know? Well, it’s what I’m thinking about this week, so here’s my braindump, before it all gets swapped out for the weekend, and something else gets swapped in Monday morning.)

The Linked Data Ecosystem

Our goal, my goal, is to have a scalable and robust platform for decentralized apps to share their data. Right now, we have a handful of decentralized apps (such as e-mail), but most social software (multi-user apps) runs off a single vendor’s server farm. Yes, youtube, twitter, facebook, google docs, flickr, and others have figured out how to make it work, and how to make it scale, but they still run the systems. We want all that functionality, but without any one organization being a bottleneck, let alone potentially behaving badly.

This goal, I should note, is considerably less compelling now than it was when many of us signed up, in some cases decades ago, since we now have an effective app distribution system (with modern web browsers), and a vibrant, competitive, and healthy market for these individual centralized services. But some of us still would rather not trust facebook or google with too much of our personal and professional lives.

I suspect the government is best not relying on them too much, either.

Over the years, the Semantic Web community has been developing a set of techniques, based on open standards, for achieving these goals. They are not perfect, and they are not always polished, but most of them work well and the rest look pretty close.

  1. Present all your data as triples, building everything out of Subject-Property-Value statements. This roughly lines up with rock-solid technologies like relational databases and object-oriented programming, and new techniques, including JSON and terascale databases (Google’s BigTable app engine datastore, Amazon’s SimpleDB). The details are ironed out in RDF, a W3C standard first published in 1999, and then rearticulated with clarification in 2004.

    When you use triples as your building blocks, you have to be more explicit about the structure of your data, and this eases accurate interoperation. The simplicity also supports metadata, reasoning about data, data distribution, and shared infrastructure even among disparate applications.

  2. Use Web addresses (URLs/URIs/IRIs) to name things. Each thing that someone might need to refer to — cities, agencies, people, projects, items in inventory, websites, events, etc — should be assigned a universally unambiguous identifier (nothing else uses the same identifier), so that every source of data about that item can be merged, always knowing it’s the same item. It’s okay (and unavoidable, in some cases) to have multiple identifiers for some items; the important thing is to be able to use the same identifier, when we know it.

    We use Web addresses to build these identifiers, because we already have a structure for making sure there are no unintended duplicates, and because it allows an important element of centralization. Each time some organization mints a new web-based identifier, it gets the ability to provide (through its web server) some core data about that item. It’s not that their data will always be correct, but it provides a starting point, a seed around which a shared understanding of the item can grow.

  3. Build for change and diversity, with backward and forward compatibility, and multiple views on the same data, knowing that the world changes, and our understanding of the world changes. (This needs its own article, but Why I Want RIF? covers the hardest parts of the territory.)
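
Points 1 and 2 can be illustrated with a small Python sketch: two independently published feeds, each a list of Subject-Property-Value triples, merged simply because they use the same identifier for the same city. All of the IRIs, property names, and numbers below are invented for the example; they are not real government data or real vocabularies.

```python
from collections import defaultdict

# Hypothetical identifier, minted by whoever owns city.example.
CITY = "http://city.example/id/springfield"
POPULATION = "http://vocab.example/population"
BUDGET = "http://vocab.example/annualBudget"

# Two feeds from two different (imaginary) agencies.
census_feed = [
    (CITY, POPULATION, 150000),
]
budget_feed = [
    (CITY, BUDGET, 2500000),
    ("http://city.example/id/shelbyville", BUDGET, 900000),
]


def merge(*feeds):
    """Group every statement by subject, regardless of which feed it came from."""
    by_subject = defaultdict(dict)
    for feed in feeds:
        for subject, prop, value in feed:
            by_subject[subject].setdefault(prop, []).append(value)
    return by_subject


merged = merge(census_feed, budget_feed)
```

Neither agency had to know about the other. Because both used the same Web address for the city, `merged[CITY]` ends up holding both the population and the budget.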

How Do We Get There?

The mood of the summit, and perhaps the mood everywhere near technology corners of the White House, seems to be one of ecstatic urgency. Folks seem to think everything can and should be done at a record pace, with crowd sourcing and agile development. The perfect is the enemy of the good. Get something good out there, then refine it. Fail early, fail fast. This is awesome…. but how do we iterate from here to a fully-evolved linked data ecosystem? And let’s not just assume the crowd will do all the right things.

There are two parts to the problem, and we can separate them, running them in parallel:

  1. Get more and more public data streams flowing.
  2. Make the data streams more and more usable.

There is a synergy between these two. As more data appears, more useful things can be done with it. As more good things are done with it, the incentives to release data become stronger. (Of course, some bad things will be done with it, too. But that’s a subject for another day.)

I don’t have any insights into part 1. That’s political, whether it’s workplace politics or national politics. I don’t know if there are any shortcuts. The case just has to be made, better and better, in all the right places, until the right stakeholders and decision makers and people who really do the work are convinced.

But part 2 can be crowd-sourced. It can be turned into a market, a community effort.

Supporting the Community Around a Data Source

People need to see what others are doing, and people doing good things need to be rewarded. People need to have some confidence they’ll be rewarded for doing good things. And mostly: the community needs to be arranged so that the more people participate, the better it gets. (See the video of Clay Shirky’s talk at the Summit about how to do this, in general.)

In this case, good things include:

  • Clarifying what elements of the data feed mean
  • Documenting the accuracy, methodology, etc, of the feed
  • Finding and potentially correcting bugs in the data
  • Developing software which somehow uses the data, and (1) sharing lessons learned in developing the software, (2) making the software available in various forms (web service, commercial download, open source, etc)
  • Linking to related feeds
  • Providing derived feeds, reformatting, correcting, or otherwise improving or re-targeting this data
  • Providing merged feeds, which process this data along with other data, to create something new
  • Summary/Analysis feeds
  • Review, commentary, support for all these related items
  • Career and project-funding opportunities related to this feed
  • Social events related to this feed

There’s lots more to be said here, but … deadlines are the antidote to perfectionism. So I’ll stop here for now. Comments welcome. If this stuff really interests you, please consider joining the W3C eGovernment Interest Group. (It’s an open group; W3C membership is encouraged but not required.)