March 9, 2010
On Thursday, I have 20 minutes to address 200 people (plus a video audience) at NoSQL Live … from Boston. My self-appointed mission is to start building bridges between the NoSQL community and the Linked Data/RDF/W3C community. These are two sets of people working on different problems, but it’s pretty clear to me they are heading in the same direction, in similar spirit, and could gain a lot from working together.
I’m organizing my talk around the question of standardization for NoSQL, and I’ll talk about W3C process and such, but the interesting part is where NoSQL touches RDF. So here are some of my thoughts on that, written for an audience already familiar with RDF. I’d love to get some feedback on the basic ideas now, to make my talk better.
While they both want to move beyond SQL, their reasons are different. I understand the key for NoSQL is:
- SQL doesn’t scale big enough. Once your read-write dataset gets too big for a single machine, you have to develop and maintain a messy sharding system. (But see below.) Sometimes it’s an economy/efficiency thing; if you don’t need ACID, a NoSQL solution might give you better performance on cheaper hardware.
Meanwhile, I see two big motivations for the RDF folks:
- Decentralization. On the open Web, data comes from many different sources, mostly beyond your control. Everyone might be providing data to everyone, and no one has the authority to run a central database, even if they had the technology and the iron.
- Inference. Some people find formal semantics and well-defined inference to be very useful. (How is this different from triggers and views? Good question.)
There’s a freedom, a joy, and in some cases enormous practicality to not having to pre-create your database schema. Nearly all NoSQL and RDF systems are schemaless or allow a fully dynamic schema. (But I understand most bigtable systems have fixed column families. That could be a problem in using a bigtable as an RDF store.)
Both tend to follow Web Architecture, using HTTP and often using REST.
Reading about CouchDB and MongoDB (JSON document databases), as well as Neo4j (a graph database), I noticed an undercurrent about SQL being awkward to program against. I guess this is the O/R Impedance Mismatch, especially the structural differences. RDF’s design is very close to the relational model, so it doesn’t help on this front. Within the RDF community, however, there are some systems which attempt to partly bridge this gap, including node-centric APIs (which I happen to prefer, myself). I would also argue that duck typing closes the gap from the programming-language side.
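Here is a minimal sketch of what a node-centric API over triples might look like, leaning on Python's duck typing so that properties read like ordinary attributes. The `Graph` and `Node` classes and the sample data are hypothetical, made up for illustration; they are not the API of any particular RDF library.

```python
from collections import defaultdict


class Graph:
    """A toy in-memory triple store (hypothetical, for illustration only)."""

    def __init__(self):
        # subject -> predicate -> list of objects
        self._by_subject = defaultdict(lambda: defaultdict(list))

    def add(self, subject, predicate, obj):
        self._by_subject[subject][predicate].append(obj)

    def has_subject(self, term):
        return term in self._by_subject

    def values(self, subject, predicate):
        # .get avoids creating empty entries as a side effect of a lookup
        return list(self._by_subject.get(subject, {}).get(predicate, []))

    def node(self, subject):
        return Node(self, subject)


class Node:
    """Node-centric view: the properties of one subject read as attributes."""

    def __init__(self, graph, subject):
        self._graph = graph
        self._subject = subject

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        values = self._graph.values(self._subject, name)
        if not values:
            raise AttributeError(name)
        # Wrap values that are themselves subjects, so traversal can continue.
        wrapped = [self._graph.node(v) if self._graph.has_subject(v) else v
                   for v in values]
        return wrapped[0] if len(wrapped) == 1 else wrapped


g = Graph()
g.add("alice", "name", "Alice")
g.add("alice", "knows", "bob")
g.add("bob", "name", "Bob")

alice = g.node("alice")
print(alice.name)        # Alice
print(alice.knows.name)  # Bob
```

No schema is declared anywhere: a new property exists as soon as a triple using it is added, and the calling code simply finds out at run time whether a node has it.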
I don’t see NoSQL going anywhere near linked data or having a vocabulary ecosystem. I expect it will want to, someday. Decentralization is a key difference in requirements.
I only see RDF dealing with inference. Why is this? My first thought is that the RDF community has a lot of AI roots, and the NoSQL community doesn’t. But maybe it’s about economics and motivations: formalizing the notion of inference makes it possible, in theory, to easily deploy very sophisticated data transformations (and make them complete before the sun goes out). At NoSQL scale, folks are much more concerned about techniques for being able to run even simple transformations (and making them complete before the power bill comes due). I note that AllegroGraph manages to be in both communities, with a very practical, high-performance Prolog element; I don’t know how parallel it is. A few RDF folks are working on using map/reduce. Presumably, with the rise of multi-core systems, even single-user inference engines will want to be made parallel.
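To make "well-defined inference" a bit more concrete, here is a naive forward-chaining sketch over two RDFS-style rules (subclass transitivity and type propagation). The predicate names are shortened to plain strings and the data is invented; a real engine would use full URIs and far better indexing than this nested loop.

```python
def infer(triples):
    """Apply two RDFS-style rules repeatedly until nothing new is derived."""
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                # Both rules join on "o is a class that has a superclass o2".
                if o != s2 or p2 != "subClassOf":
                    continue
                if p == "subClassOf":
                    # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
                    new.add((s, "subClassOf", o2))
                elif p == "type":
                    # (x type B), (B subClassOf C) => (x type C)
                    new.add((s, "type", o2))
        if new <= facts:  # fixed point reached
            return facts
        facts |= new


data = {
    ("fido", "type", "Dog"),
    ("Dog", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
}
closure = infer(data)
print(("fido", "type", "Animal") in closure)  # True
```

Each pass compares every fact with every other fact, which is the sort of cost that is harmless on a laptop-sized dataset and hopeless at NoSQL scale without partitioning the join.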
How many of the reasons for NoSQL rejecting SQL also apply to SPARQL? Does the scaling issue apply? Actually, does the scaling issue really apply to SQL? Michael Stonebraker (more or less the Voice of God) claims automatic sharding can and should be done while still using SQL. Some people reply: Perhaps, but in Enterprise-Grade Open Source? Also, maybe “SQL” is a euphemism for ACID, and that’s really what doesn’t scale and/or is too expensive. Perhaps that issue needs to be settled before considering SPARQL? Actually, the state of ACID in SPARQL is an open issue right now; maybe NoSQL can inform that decision.
RDF is standardized. Some would argue it’s more standardized than SQL; that case will be stronger when SPARQL 1.1 is done. Here’s a diagram Steve Harris made, which I reformatted. “KV” refers to key-value stores, the simplest, most scalable kind of NoSQL database.
So… What does that all boil down to?
Bottom line: RDF could learn a lot from NoSQL about scaling and ease-of-programming; NoSQL could learn a lot from RDF about decentralization and inference.
Some closing questions, ideas….
Can someone make a SPARQL endpoint with Cassandra’s performance and scaling properties? I haven’t studied this idea much, but I’m afraid the static column families will make it impossible to get much performance without building the store for a particular set of SPARQL queries. But it could still be useful, even with that drawback. Or maybe some SPARQL endpoint is already there; has anyone really tried a comparative benchmark?
How does the SPARQL endpoint description and aggregation work compare to database sharding? Are there designs for doing it automatically?
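The Cassandra worry above, that a store may have to be built for a particular set of queries, can be illustrated with a toy sketch. It assumes triples are laid out in a key-value store under a few fixed orderings; a plain Python dict stands in for the store, and `KVTripleStore` is a hypothetical name, not Cassandra's API or data model.

```python
from collections import defaultdict


class KVTripleStore:
    """Triples indexed under fixed term orderings in a dict-backed KV store."""

    def __init__(self, orderings=("spo", "pos", "osp")):
        self.orderings = orderings
        self.kv = defaultdict(set)  # (ordering, key prefix) -> triples
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))
        terms = {"s": s, "p": p, "o": o}
        for order in self.orderings:
            key = tuple(terms[c] for c in order)
            # Store the triple under every prefix of the ordered key.
            for n in (1, 2, 3):
                self.kv[(order, key[:n])].add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return (matching triples, 'index' or 'scan')."""
        bound = {k: v for k, v in (("s", s), ("p", p), ("o", o))
                 if v is not None}
        for order in self.orderings:
            prefix = order[:len(bound)]
            # An ordering helps only if the bound terms form its prefix.
            if bound and set(prefix) == set(bound):
                key = (order, tuple(bound[c] for c in prefix))
                return set(self.kv.get(key, ())), "index"
        # No ordering covers this pattern: fall back to a full scan.
        hits = {t for t in self.triples
                if all(t["spo".index(k)] == v for k, v in bound.items())}
        return hits, "scan"


facts = [("alice", "knows", "bob"), ("bob", "knows", "carol"),
         ("alice", "name", "Alice")]

full = KVTripleStore()
narrow = KVTripleStore(orderings=("spo",))  # tuned for subject lookups only
for t in facts:
    full.add(*t)
    narrow.add(*t)

print(full.match(p="knows")[1])    # index
print(narrow.match(p="knows")[1])  # scan
```

Both stores return the same answers; the difference is that the narrow one has to read everything for any pattern its single ordering was not laid out for, while the full one pays for generality by writing each triple under three orderings.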
November 5, 2009
This morning, I gave the keynote address (my slides) at RuleML 2009. I assumed the audience would be fairly familiar with rule systems and rule technologies, but not necessarily with RIF, the Semantic Web in general, or my sense of the future of the Semantic Web (for which RIF is important).
Those slides may be boring for some of you. The interesting new bits:
- I made a new diagram for the RIF dialects. During the talk, I presented it as a successive reveal: the chaos of rule system features, the BLD grouping, the PRD grouping, and the Core intersection. Here’s the final slide:
- I wanted to convey that the Semantic Web means real change, and I wanted emotional impact, so I took a shotgun approach and enumerated a list of likely data sources that was big enough to have some surprises for most folks; then I moved into a long list of things you could do with that data. Some things on the list were boring (having impact only by showing how long the list is), but some got the desired wide eyes and shocked sounds as people realized this just might happen. I think it went over well.
My list of types of data we’ll be seeing:
- From producers: product information
- From sellers: product and service offerings
- Customer support (instructions, upgrades, …)
- Social network (who you trust, who interests you)
- Personal information shared with friends
- Public records (financial, legal, political, …)
- Science (medical, environmental, economic, …)
- News, Blogs, Public photos, videos
- Event listings (performances, meetings)
- Reviews, opinions, product experiences, preferences
- Personal location, location history
- Financial transactions
Which led into my general scenario: you’re in a store about to buy something, but first you scan it with your phone and look up a little more information about it. What might you look up?
- Its price at other stores, nearby
- Its price for delivery, and how long you’d have to wait
- Maybe: where it was made, and under what conditions
- Is its producer a good corporate citizen?
- Does its producer agree with your political views (uh oh)
- How many houses does its CEO own?
- Did your spouse/housemate just buy some? Or something like it?
- How your friends feel about this product
- How Consumer Reports (or some such service) reviewed it
- Product liability suits
- Maybe: Payment for Endorsements
- For electronics: compatibility information
- For mechanical items: How repairable is it? MTTF, MTTR
- For food: nutritional information, health benefits and risks
- Demographics of this brand, these products
That’s all for now. I saw the Bellagio Fountain from my 29th floor hotel room late last night, but I’d like to see it up close.
October 18, 2008
I gave a talk Friday — a 10-minute-intro on a panel — (slides) in which I pointed out that there is a ton of powerful data that could be collected right now. I used the example of all the information your cell phone could, in theory, be collecting. I suggested putting it all on the Semantic Web, and then working on our access control models. I wasn’t a great fit for this panel, but I tried.
Alas we had almost no chance for discussion. This was a shame, since we had some great discussions among the panelists in preparing for the panel. Also, the audience was a mere 20 people or so.
On the plus side, I met substitute panelist Daniela Barbosa and was thus reassured that the Data Portability Work Group wasn’t evil. I tried to talk to her over lunch about how her experience in this effort suggests W3C should evolve, but we got interrupted and never had that conversation.
[ Edit: See Posting by Lidija Davis about the talk... ]
October 18, 2008
I gave a talk on Thursday in which I tried to argue the case for why one should use URLs to name classes, properties, and instances when publishing one’s data. It was supposed to be 10-15 minutes. Honestly, my heart wasn’t really in it, my argument wasn’t very cogent, and (no one has to know this) it was presented poorly. Fundamentally, the talk was out of place at this event because it was arguing one side of a technical design choice, and this was pretty much a business strategy conference. Looking at the forty people in the audience, I saw no sign that any of them had even considered it an issue. Either it was obvious to them that using URLs like this was good and proper, or (in most cases) they didn’t even think about it being a choice.
That said, here are my slides.
(Along the way to making the case for URLs, I tried to also make the case for standards. Again, most people are probably already convinced or don’t even see any choice there. There’s certainly a choice between the de facto and de jure approaches to making standards, and in some markets a standard may never emerge without concerted effort.)
Perhaps this whole bit is best forgotten. Or maybe I’ll get inspired to make the more cogent argument in a blog post. Maybe I need to find someone to argue against, at least in my head — an imagined audience member I was trying to convince. Maybe someone from the XML-without-namespaces crowd. Or maybe I could do the technical argument for shims, slightly refactoring my xtan story.