RDF meets NoSQL

March 9, 2010

On Thursday, I have 20 minutes to address 200 people (plus a video audience) at NoSQL Live … from Boston. My self-appointed mission is to start building bridges between the NoSQL community and the Linked Data/RDF/W3C community. These are two sets of people working on different problems, but it’s pretty clear to me they are heading in the same direction, in similar spirit, and could gain a lot from working together.

I’m organizing my talk around the question of standardization for NoSQL, and I’ll talk about W3C process and such, but the interesting part is where NoSQL touches RDF. So here are some of my thoughts on that, written for an audience already familiar with RDF. I’d love to get some feedback on the basic ideas now, to make my talk better.

  1. While they both want to move beyond SQL, their reasons are different. I understand the key for NoSQL is:

    • SQL doesn’t scale big enough. Once your read-write dataset gets too big for a single machine, you have to develop and maintain a messy sharding system. (But see below.) Sometimes it’s an economy/efficiency thing; if you don’t need ACID, a NoSQL solution might give you better performance on cheaper hardware.

    Meanwhile, I see two big motivations for the RDF folks:

    • Decentralization. On the open Web, data comes from many different sources, mostly beyond your control. Everyone might be providing data to everyone, and no one has the authority to run a central database, even if they had the technology and the iron.

    • Inference. Some people find formal semantics and well-defined inference to be very useful. (How is this different from triggers and views? Good question.)

  2. There’s a freedom, a joy, and in some cases enormous practicality to not having to pre-create your database schema. Nearly all NoSQL and RDF systems are schemaless or allow a fully dynamic schema. (But I understand most bigtable systems have fixed column families. That could be a problem in using a bigtable as an RDF store.)

  3. Both tend to follow Web Architecture, using HTTP and often using REST.

  4. Reading about CouchDB and MongoDB (JSON document databases), as well as Neo4j (a graph database), I noticed an undercurrent about SQL being awkward to program against. I guess this is the O/R Impedance Mismatch, especially the structural differences. RDF’s design is very close to the relational model, so it doesn’t help on this front. Within the RDF community, however, there are some systems which attempt to partly bridge this gap, including node-centric APIs (which I happen to prefer, myself). I would also argue that duck typing closes the gap from the programming-language side.

  5. I don’t see NoSQL going anywhere near linked data or having a vocabulary ecosystem. I expect it will want to, someday. Decentralization is a key difference in requirements.

  6. I only see RDF dealing with inference. Why is this? My first thought is that the RDF community has a lot of AI roots, and the NoSQL community doesn’t. But maybe it’s about economics and motivations: formalizing the notion of inference makes it possible, in theory, to easily deploy very sophisticated data transformations (and make them complete before the sun goes out). At NoSQL scale, folks are much more concerned about techniques for being able to run even simple transformations (and make them complete before the power bill comes due). I note that AllegroGraph manages to be in both communities, with a very practical, high-performance Prolog element; I don’t know how parallel it is. A few RDF folks are working on using map/reduce. Presumably, with the rise of multi-core systems, even single-user inference engines will want to be made parallel.

  7. How many of the reasons for NoSQL rejecting SQL also apply to SPARQL? Does the scaling issue apply? Actually, does the scaling issue really apply to SQL? Michael Stonebraker (more or less the Voice of God) claims automatic sharding can and should be done while still using SQL. Some people reply: Perhaps, but in Enterprise-Grade Open Source? Also, maybe “SQL” is a euphemism for ACID, and that’s really what doesn’t scale and/or is too expensive. Perhaps that issue needs to be settled before considering SPARQL? Actually, the state of ACID in SPARQL is an open issue right now; maybe NoSQL can inform that decision.

  8. RDF is standardized. Some would argue it’s more standardized than SQL; that case will be stronger when SPARQL 1.1 is done. Here’s a diagram Steve Harris made, which I reformatted. “KV” refers to key-value stores, the simplest, most scalable kind of NoSQL database.

    [Diagram showing SPARQL being more standard than SQL, which is more standard than key-value stores, while KV and SPARQL scale more than SQL.]
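To make point 4 a little more concrete, here is a minimal Python sketch of what a node-centric API over triples might look like, and of how duck typing lets ordinary code treat a graph node like any other object. Everything named here (Graph, Node, has_subject, first_name, the ex: identifiers) is made up for illustration; no particular RDF library is claimed to work this way.

```python
class Graph:
    """A toy in-memory triple store (illustrative, not a real RDF library)."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def has_subject(self, term):
        # True if the term appears in subject position of any triple.
        return any(s == term for s, _, _ in self._triples)

    def values(self, subject, predicate):
        return sorted(o for s, p, o in self._triples
                      if s == subject and p == predicate)

    def node(self, subject):
        return Node(self, subject)


class Node:
    """Node-centric view: predicates read like ordinary attributes."""

    def __init__(self, graph, subject):
        self._graph = graph
        self._subject = subject

    def __getattr__(self, predicate):
        # Only called when normal lookup fails, so _graph/_subject are safe.
        if predicate.startswith("_"):
            raise AttributeError(predicate)
        values = self._graph.values(self._subject, predicate)
        # Wrap values that are themselves subjects, so traversal can continue.
        return [self._graph.node(v) if self._graph.has_subject(v) else v
                for v in values]


def first_name(thing):
    """Duck typing: accepts a Node or any object exposing a .name list."""
    names = thing.name
    return names[0] if names else None


g = Graph()
g.add("ex:alice", "name", "Alice")
g.add("ex:alice", "knows", "ex:bob")
g.add("ex:bob", "name", "Bob")

alice = g.node("ex:alice")
friend_names = [first_name(friend) for friend in alice.knows]
```

Every property comes back as a list, because a subject can have any number of values for a predicate. That is where the structural mismatch with ordinary objects shows up, even with a friendly API on top.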

So… What does that all boil down to?

Bottom line: RDF could learn a lot from NoSQL about scaling and ease-of-programming; NoSQL could learn a lot from RDF about decentralization and inference.
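For readers from the NoSQL side who have not met inference before, here is a rough sketch of the idea in Python: apply a couple of rules, in the spirit of RDFS subclass reasoning, over and over until no new triples appear. The infer function and the ex: data are assumptions for illustration only. Real engines are far smarter than this brute-force loop, which is exactly where the scaling and parallelism questions come in.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def infer(triples):
    """Naive forward chaining: apply two subclass rules to a fixpoint."""
    closure = set(triples)
    while True:
        subclass_of = {(s, o) for s, p, o in closure if p == SUBCLASS}
        new = set()
        for s, p, o in closure:
            for sub, sup in subclass_of:
                if o != sub:
                    continue
                if p == SUBCLASS:
                    # A subClassOf B, B subClassOf C  =>  A subClassOf C
                    new.add((s, SUBCLASS, sup))
                elif p == TYPE:
                    # x type A, A subClassOf B  =>  x type B
                    new.add((s, TYPE, sup))
        if new <= closure:
            return closure  # nothing new was derived, so we are done
        closure |= new


facts = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:rex", TYPE, "ex:Dog"),
}
entailed = infer(facts)
```

Each pass is essentially a join of the data against itself, which hints at why people are looking at map/reduce for this kind of work.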

Some closing questions, ideas….

Can someone make a SPARQL endpoint with Cassandra’s performance and scaling properties? I haven’t studied this idea much, but I’m afraid the static column families will make it impossible to get much performance without building the store for a particular set of SPARQL queries. But it could still be useful, even with that drawback. Or maybe some SPARQL endpoint is already there; has anyone really tried a comparative benchmark?
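To show what I mean by building the store around the queries, here is a hedged Python sketch of one common way to lay triples out in a sorted key-value space: write each triple under several key orderings, so that every triple pattern becomes a prefix scan. The sorted list stands in for the key-value store. Nothing here is Cassandra's actual API, and the names KVTripleStore, ORDERS and SEP are invented for this example. It also assumes terms never contain the separator character.

```python
import bisect

SEP = "\x00"  # assumed never to occur inside a term
# Index name -> positions of (s, p, o) in key order.
ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


class KVTripleStore:
    """Triples kept as keys in one sorted key space (stand-in for a KV store)."""

    def __init__(self):
        self._keys = []  # kept sorted; the keys carry all the information

    def _put(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def add(self, s, p, o):
        triple = (s, p, o)
        # One write per ordering: triple the storage, but cheap reads.
        for name, order in ORDERS.items():
            terms = [triple[pos] for pos in order]
            self._put(SEP.join([name, *terms]) + SEP)

    def _scan(self, prefix):
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i]
            i += 1

    def match(self, s=None, p=None, o=None):
        """Yield (s, p, o) tuples matching the pattern; None is a wildcard."""
        pattern = (s, p, o)

        def bound_prefix(order):
            terms = []
            for pos in order:
                if pattern[pos] is None:
                    break
                terms.append(pattern[pos])
            return terms

        # Pick the ordering whose leading terms are all bound.
        name, order = max(ORDERS.items(),
                          key=lambda item: len(bound_prefix(item[1])))
        prefix = SEP.join([name, *bound_prefix(order)]) + SEP
        for key in self._scan(prefix):
            terms = key.split(SEP)[1:4]
            triple = [None, None, None]
            for pos, term in zip(order, terms):
                triple[pos] = term
            yield tuple(triple)
```

Three orderings cover every single triple pattern, but a real SPARQL query joins many patterns together, and that is the part I worry a fixed layout would struggle with.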

How does the SPARQL endpoint description and aggregation work compare to database sharding? Are there designs for doing it automatically?


14 Responses to “RDF meets NoSQL”

  1. woddiscovery Says:

    Sandro,

    Great post – thanks a lot! I recently also had some thoughts re this issue [1]. Further, I think we should have a dedicated session during WWW/W3C/LOD track concerning this.

    Cheers,
    Michael

    [1] http://webofdata.wordpress.com/2010/03/01/data-and-the-web-choices/


  2. Good thoughts Sandro!

    Let’s have a discussion at NOSQL Live and see what we can come up with regarding distributed query languages for NOSQL. Have you looked at http://gremlin.tinkerpop.com where we currently support SPARQL and are trying to bring document stores into the picture? I’ll buy a beer :)

    /peter

  3. Juan Sequeda Says:

    Sandro,

    Great post.

    As I have mentioned previously, we started a course at UT Austin called Semantic Web and Cloud Databases. We have 20 grad students looking into all these issues. We hope to have some RDF stores backed by cloud databases in the next few months.


  4. Sandro,

    Initial binding point between NoSQL and Quad Stores is that both are EAV Graph oriented.

    Graph traversal expressions are the main thing NoSQL folks assume is missing from SPARQL-based RDF Quad Stores.

    As for reasoning, this aspect is sorta lost on the NoSQL community (as it is on most). But as huge amounts of Linked Data continue to emerge, the value of reasoning will become self-evident.

    As you can imagine, Virtuoso handles graph traversal expressions via SPARQL extensions, here are some live examples based on the 8.5 Billion+ RDF Quad Store at: http://lod.openlinksw.com (/sparql or /isparql will work for custom queries):

    1. http://bit.ly/9QNdec — N degrees of TimBL using Virtuoso’s SPARQL path expressions extension
    2. http://lod.openlinksw.com/demo_queries/ — for other queries you can run against this LOD Cloud Cache instance of Virtuoso .

    Kingsley

  5. Jeff Sayre Says:

    Some people have been critical of using SPARQL as a query language for graph databases, which would seem to put the two at odds when it comes to working effectively and efficiently with RDF. The argument is that what should be easily queryable in a graph DB is actually difficult and sometimes impossible when using SPARQL.

    There have been some efforts in the past to create a query language for semantic graph databases. I’m not sure of the status of the current effort but this paper is an interesting read on the topic. You are probably aware of these efforts, but just in case, here is the link: http://www.bearcave.com/misl/misl_tech/dsge_query_language.pdf

    But, as I am just starting to investigate how to use graph DBs in a Social Semantic Web platform that we’re starting to build, I am no expert. Perhaps this issue is now moot.

    I’m encouraged by Peter’s comment as we are looking closely at using Neo4j. The other contender is InfoGrid.

    It seems to me that graph DBs, with their ability to clearly define link types, are powerful systems that can effectively encode semantic relationships between nodes. In fact, they appear to be perfect modelers of RDF triples, with the nodes representing subjects and objects and the edges (relationships) representing predicates. Perhaps creating a SPARQL abstraction layer for graphDBs would allow these data to be effectively discovered.


  6. Sandro,

    I will be at NoSQL Live talking about HyperGraphDB (http://www.kobrix.com/hgdb.jsp), and would love to chat with you about this as well. We should definitely meet up. I’m very skeptical of the possibility of standardizing NoSQL, and even more skeptical of the idea of reducing it to one particular data model like RDF with SPARQL…in the sense that it will be very hard to see all NoSQL products implement SPARQL or some derivative as a standard query language. HyperGraphDB’s model is even more general than RDF or any other graph model/database for that matter, and still it is a model with certain assumptions, certain rules to follow, certain best practices etc., and it would be to its detriment if the interface to it were a common denominator with other DBs. I suspect the same is true for the other NoSQL DBs.

    That said, your post is right on the mark since there are a lot of benefits to be gained in making semantic web standards part of the mainstream programmer’s toolkit. NoSQL can help with that because it’s lightweight, dynamic, low-cost, buzzy etc. while RDF can help NoSQL by putting in the “industry standard” label in power point presentations :)

    Cheers,
    Boris

  7. Danny Says:

    Nice comparison. Just a random thought – you’ve clearly thought about building RDF stores backed by NoSQL stores, but what about the other way around? Some of the RDF stores are getting pretty darn scalable (like Steve’s :)

    Although the data from KV might be seriously uninteresting and the storage suboptimal, the developer would have the option (somehow?) of later enriching it.

    (Ages ago I did a replacement for Java Property files [straight KV] using a triplestore – worked fine, though no doubt wasn’t worth the coding effort…)

  8. dret Says:

    great post, but i have my issues with the statement that RDF “tends to follow Web Architecture, using HTTP and often using REST.” i would argue that all that RDF mostly does is use HTTP URIs (which is good), but there is little mention in how to actually use HTTP beyond GETing something, and the current design of SPARQL as the main interaction protocol with RDF is pretty far from being RESTful. many NoSQL approaches, on the other hand, tend to be really good with understanding and implementing web architecture, because this is where they need to be good to get their scalability sorted out.

  9. Dan Brickley Says:

    Sandro – really great you’re doing this outreach. Good luck with the talk, I look forward to hearing how it goes.

    @dret – RDF’s focus on ‘GET’ is natural. SemWeb is a project fundamentally about describing stuff. It’s by nature pretty passive and declarative, which is a natural fit with read-only emphasis on HTTP GET. Of course there are places (SPARQL Update) where we go beyond describing stuff, but that really is our core business. So the focus on GET is quite natural…


  10. Sandro, you might be interested in this related discussion on Semantic Overflow:

    http://www.semanticoverflow.com/questions/723/rdf-storages-vs-other-nosql-storages

    Also, I have written up some of my own thoughts on the subject at:

    http://blog.datagraph.org/2010/04/rdf-nosql-diff

  11. Valentin Says:

    for some background you might be interested in our implementation/experiments in using SimpleDB as a back-end for RDF storage (with quite sobering results). The experiments are reported here: http://vzach.de/papers/2010_nefors.pdf (this is a draft, the final version will appear here: http://wasp.cs.vu.nl/larkc/nefors10/). Abstract:

    We examine whether the existing ‘Database in the Cloud’ service SimpleDB can be used as a back end to quickly and reliably store RDF data for massive parallel access. Towards this end we have implemented ‘Stratustore’, an RDF store which acts as a back end for the Jena Semantic Web framework and stores its data within the SimpleDB. We used the Berlin SPARQL Benchmark to evaluate our solution and compare it to state of the art triple stores. Our results show that for certain simple queries and many parallel accesses such a solution can have a higher throughput than state of the art triple stores. However, due to the very limited expressiveness of SimpleDB’s query language, more complex queries run multiple orders of magnitude slower than the state of the art and would require special indexes. Our results point to the need for more complex database services as well as the need for robust, possible query dependent index techniques for RDF.




