RDF as Backing Storage

Lately, I’ve seen some prominent people in the Semantic Web community advising that RDF has great possibilities as a backing store for applications, and that maybe people should use SPARQL as the application’s query language.

Stop. Don’t do that.

Right now, we’re in a situation where RDF implementations are really usable – if you are aware of their limitations and avoid them. The same is true of MySQL: you don’t make every one of your queries include half a dozen joins, and you don’t want to do the equivalent in SPARQL. The problem is that with RDF query languages, unless you’ve spent a lot of time with them, it’s much harder to understand the work going on behind the scenes. It’s extremely easy to write a poorly optimized query that takes the application a very long time to compute, at the cost of making RDF and SPARQL look as if they are too slow.

They’re not.

For small web applications, there is no real reason that a properly built site could not be extremely quick and built on RDF tools. (At the current point in time, I won’t say the same for large applications: I don’t know what happens to triplestores once they get past 2 million triples.) However, most queries that people write are slow, simply because they are not optimized. Depending on how the application-level query translator works, you may be dealing with something far from fully optimized, which can have a severe negative impact on your query time.

An example? Well, my knowledge is mostly in Redland, so I’ll just toss out a query via julie, the redlandbot.

09:05:18 < crschmidt> ^q select ?i where { <http://crschmidt.net/foaf.rdf#crschmidt> foaf:knows ?p. ?p foaf:nick ?i. }
09:05:18 < julie> solcita, bluemoonshark, telepwen, jayo, littledownpour, jessicacmalfoy, csogilvie, alacrity, wyvernbitch, pne,
luxtiger, danceinacircle, bertho, ursamajor, pie_is_good, jessical, danbri, ryanbyers, shupe, thebubba, kangarooofdoom,
neviachiel, kamara, joanna4136, raventhon, evilcat84, chrisg, nostrademons, coffeechica, fracturedfaerie, nyxie,
siren52684, pthalogreen, ChemicalLace, zach, seymour, adcott, girlxfriday, meinterrupted, biztheinsane,
sarah_mascara, busbeytheelder, tinyjo, rho, xtremesaints, sherm, mendel, acerbic, thewildrose, bobert225,
sporkmistress, isabeau, beginning, supersat, braindrain, ratkrycek, opal1159, maryam, uberzacker, lor22ms, burr86,
comeseptember, rahaeli, pezstar, girlfriday, xavier, jproulx, roy

That’s right, those results on a hot database are returned in less than one second. On the other hand, if you do the query in the opposite order:

09:05:39 < crschmidt> ^q select ?i where { ?p foaf:nick ?i. <http://crschmidt.net/foaf.rdf#crschmidt> foaf:knows ?p. }
09:18:29 < julie> solcita, bluemoonshark, acerbic, jayo, bobert225, …

You have a multi-minute wait. (I said I’d fill this in when it came back: it finished just shy of 13 minutes later.) Someone less versed in Redland would look at that and ask “Why?”

Redland’s query engine, Rasqal, works through the triples it finds as it goes through the query. For the first query, the first pattern matches approximately 100 triples: “Here are the foaf:knows triples pointing from the crschmidt node. Now let’s go find their nicks.” It then has approximately 100 distinct subjects to match in the second part of the query. Now look at the second query: its first triple pattern matches far more triples – in this case, I think approximately 20,000 – and each of those is then matched against the second part of the query. That’s roughly 200 times as many questions to ask the triplestore behind the data, and roughly 200 times as long to wait to get the data out. Since most of that data will probably be “cold” (not cached in the MySQL table cache), you’ll not only be waiting to get it out, you’ll probably also be evicting everything else from the MySQL cache in favor of this one useless query. All because you got your triple patterns mixed up.
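To make the cost difference concrete, here is a toy sketch in Python. This is not Rasqal – just a naive streaming matcher, with made-up store sizes mirroring the estimates above (~100 foaf:knows triples from one node, ~20,000 foaf:nick triples in the whole store):

```python
# Toy model of a naive streaming triple matcher (NOT Rasqal itself).
# A pattern position of None is a variable; anything else must match exactly.

def matches(pattern, triple):
    """True if the triple satisfies the pattern (None = variable)."""
    return all(p is None or p == t for p, t in zip(pattern, triple))

def join_work(store, first, second):
    """Count triple probes for a two-pattern join, evaluated in the given order:
    scan once for the first pattern, then rescan the store per match."""
    probes = 0
    for triple in store:
        probes += 1
        if matches(first, triple):
            probes += len(store)  # one more full scan per binding
    return probes

# Fake store: 100 knows-triples from "me", plus 20,000 nick-triples.
store = [("me", "knows", f"p{i}") for i in range(100)]
store += [(f"p{i}", "nick", f"nick{i}") for i in range(20_000)]

fast = join_work(store, ("me", "knows", None), (None, "nick", None))
slow = join_work(store, (None, "nick", None), ("me", "knows", None))
print(fast, slow, slow / fast)  # the bad ordering does ~200x the probing
```

Same answer either way; the only difference is how many intermediate bindings the first pattern hands to the second.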

Perhaps this is just a Redland problem: I don’t know. I haven’t used any of the Java-based tools, and I don’t know of any other non-Redland tools for working with SPARQL against a large data store. However, it becomes an image problem when you advise people to “just use SPARQL”: people do not, by intuition, recognize that the two queries above will perform any differently. Since it’s extremely hard to notice the difference at small scale, it’s hard to catch these mistakes when they start showing up, without a fair amount of experience with the application-level RDF tool.

Before you start using SPARQL as an application query language, consider the application you’re building and how you can optimize it. SPARQL queries can generally be made to run much faster if you have a good knowledge of what you’re doing. However, working with the raw triples, using the methods the library provides, will often give you insight into what’s actually going on under the hood – insight that can be extremely useful for knowing how to make things work better.

I think that people need to stop advocating SPARQL as the end-all, be-all solution for everything in RDF. I’m sure that it can be great, and that there are tons of great ways to use it. However, one of the most popular RDF engines does not work well with most of the SPARQL queries I’ve seen people throw at it. The first step in learning how to properly use SPARQL (for anything time-sensitive: anything over HTTP basically counts here) is learning how to manipulate the triples in the way that works best in the application, without the query language.

11 Responses to “RDF as Backing Storage”

  1. Jimmy Cerra Says:

    Hum… Exactly my SoC project… Won’t stop me from doing it, but I’ll have to do more work thinking about it. 😉

    Note that RDF datastores are a lot slower than relational databases, but RDBMSs have been around for a _long_ time while RDF datastores are still an active research topic. Also, I think it is fairly easy to design relational databases using the RDF data model, so you could just implement everything in (e.g.) MySQL and use a thin adapter to query it. See Ryan Levering’s SoC work for a prototype of that approach.

    It is also fairly easy to see why the second query is slower than the first one; at least, I figured it out before reading your explanation. It should also be easy to optimize those queries (just order patterns with concrete subjects first, with certain changes based on the quirks of SPARQL, and so on). I’ll suggest it to Ryan and see what he thinks…

    Not that I’m a database expert by any means… my post above is mostly unproven so far, too. Also, I have to leave in one minute, so I’ll let you correct the grammar and query Google for the links… 😉

  2. chimezie Says:

    Excellent points. I’ve had similar frustrations/experiences with my RDF implementation of choice (4RDF). Ad-hoc queries (sent with minimal thought as to the work required by the query implementation) will work in a small-scale system. Beyond that (I’ve found that for the MySQL implementation of 4RDF the limit of usability is about 4M triples), they simply are not feasible for ‘real-world’ problems – especially if RDF querying is the primary (or sole) means by which data is extracted.

    I’ve often had to reimplement standard functions (such as type, which is supposed to return all rdf:Resources of a given rdf:type) to take advantage of datastore specific mechanics.

    I think the main problem is that most current RDF datastores map the RDF data model onto an RDBMS as 3- to 4-column tables (with slight variations in each implementation). See Redland’s MySQL schema (this may be dated) and 4RDF’s (below).

    CREATE TABLE ftrdf_%s_statement (
        subject       varchar(150) not NULL,
        predicate     varchar(150) not NULL,
        object        text,
        statementUri  varchar(45) not NULL,
        domain        text,
        otype         varchar(5) not NULL,
        INDEX %s_stmt_spd_index (subject(100),predicate(100),domain(50)),
        INDEX %s_stmt_pod_index (predicate(100),object(50),domain(50)),
        INDEX %s_stmt_d_index (domain(10))) TYPE=InnoDB

    This simply does not scale, due both to the redundancy of repeating resource URIs for each statement made about them and to relying *completely* on the underlying indexing capabilities to locate statements by parts.

    I believe Oracle’s approach will solve what amounts to an ‘engineering’ problem in the standard way RDF datastores are implemented on top of RDBMSs. Primarily, it stores each RDF URI only once; that alone is almost half the scalability issue. RDF datastores will always have a ceiling (much lower than that of their predecessor database systems), but the current implementations put that ceiling much lower than is feasible for solving ‘real world’ problems. Storing URIs once shifts the emphasis to how sophisticated the queries are and how the query optimizers execute them. It will be the same with *all* RDF query languages until the datastore implementations catch up.
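[To illustrate the "store each URI once" idea chimezie describes, here is a minimal sketch of my own devising – not 4RDF's or Oracle's actual schema – using SQLite: URIs are interned into a lookup table, and the statement table holds only integer ids.]

```python
# Hypothetical normalized triple-store schema: each URI is stored once,
# and statements reference URIs by integer id instead of repeating the text.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (
    id  INTEGER PRIMARY KEY,
    uri TEXT UNIQUE NOT NULL
);
CREATE TABLE statement (
    subject   INTEGER NOT NULL REFERENCES resource(id),
    predicate INTEGER NOT NULL REFERENCES resource(id),
    object    INTEGER NOT NULL REFERENCES resource(id)
);
CREATE INDEX stmt_sp ON statement (subject, predicate);
CREATE INDEX stmt_po ON statement (predicate, object);
""")

def rid(uri):
    """Intern a URI, returning its integer id (stored once, however many
    statements mention it)."""
    conn.execute("INSERT OR IGNORE INTO resource (uri) VALUES (?)", (uri,))
    row = conn.execute("SELECT id FROM resource WHERE uri = ?", (uri,)).fetchone()
    return row[0]

me = "http://crschmidt.net/foaf.rdf#crschmidt"
for i in range(100):
    conn.execute("INSERT INTO statement VALUES (?, ?, ?)",
                 (rid(me), rid("foaf:knows"), rid(f"http://example.org/p{i}")))

# 100 statements, but each distinct URI appears exactly once in `resource`.
n_uris, = conn.execute("SELECT COUNT(*) FROM resource").fetchone()
print(n_uris)  # 102: me, foaf:knows, and 100 friends
```

[The indexes mirror the subject/predicate and predicate/object access patterns of the 4RDF schema above, but over small fixed-width integers rather than prefix-indexed varchars.]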

  3. Dave Beckett Says:

    Rasqal has no query optimising. In fact, I took out some code that would have made both of your queries run the fastest way, especially as I have no background in databases or query planning; instead, I’m working on correctness first. There’s also the tricky problem that over a raw triple store there is no extra knowledge about which triple match is quickest, and at the lowest level, that’s all you have. To fix this, you either need some hints from the RDF store or an alternate approach.

    One other approach is to turn SPARQL queries into SQL ones over an underlying relational store, and then you can benefit from the years^wdecades of work in that field. This is already starting to be demonstrated and coming together by various people (Jena, 3store, rdfstore, algae) and I’d expect to see more public demonstrations of this in a few months.

    Finally, the SQL schema for storing RDF can be a lot better than the example given above; fixed-limit URI lengths considered, er, harmful 😉 There are papers in the literature.

  4. Christopher Schmidt Says:

    There are obviously a number of improvements that can be made, at a number of levels, to the performance of query languages over RDF triplestores. Mortenf’s work on SPARQL-to-SQL translation has shown me that: it improves the speed of all queries immensely.

    However, in the short term, telling people “just use SPARQL” as a panacea is misleading, and I think it does more to hinder the efforts of those trying to encourage using RDF as a database. The real answer is to evaluate the tools and find which best meets your needs as far as performance and other requirements go. (If you need to store it in MySQL, you’re limited by certain constraints; if you can’t use BDB for some reason, etc.) No matter what, it is probably better to have people at least look at how the underlying triplestores implement these queries before settling on them as a solution. Currently, I’m not aware of any public implementations that have advertised the speed and reliability necessary for SPARQL to come out as a solution for many of the use cases it’s proposed for, imho.

    Of course, some people may have contradictory evidence. Leigh mentioned his 200-megatriple store; Swoogle, I’m sure, has a lot more than the 10-megatriple dump they shared, but I don’t know if they’re using SPARQL at all in the backend. Thus far, I haven’t seen a single case where someone has specifically stated that they are using a SPARQL-based query engine at the core of an application.

  5. Danny Says:

    I’m really grateful for the info on optimisation – very useful material – and I reckon most of your advice is sound, but “Stop. Don’t do that.” seems to me as bad as recommending RDF/SPARQL stores/query for everything. As you say: “The real answer is to evaluate the tools and find which best meets your needs as far as performance and other requirements go.” I’d add that raw performance isn’t necessarily all that high on the list when it comes to cost/benefit, and that it’s not all that often you can tell in advance where the bottlenecks will be. A good model and a correct implementation can save a lot of developer time, and a query language is a lot easier than hand-coding all the data flow. SPARQL implementations may not be as fast as they could be yet – but if everyone opts for hard-coding against the triplestore or RDF-free SQL, they never will be.

    But having said all that, I’ll throw in some more caveats next time I’ve got the cheerleader’s pom-poms 😉

  6. Christopher Schmidt Says:

    “Stop. Don’t do that.” is only regarding “Don’t tell everyone SPARQL is the end-all be-all solution to all your problems with working with RDF”. Not that RDF shouldn’t be used – I strongly believe it should, and can be. However, telling people who don’t know RDF to switch to using SPARQL+RDF for data in an application is going to cause a lot of pain in the end.

    The biggest reason I advise against SPARQL (without due consideration) is that so many aspects of data modelling are ignored if you simply use SPARQL. If you have a good model, you really don’t *need* to use it for the most part. I’ll explain this a bit more:

    If I have a SPARQL query to populate my variables, I’m probably going to use a specific query, a la:

    SELECT ?nick ?name WHERE { ?p foaf:nick ?nick ; foaf:name ?name . }

    Now, if someone adds an additional property to ?p, I won’t know about it. I think that if someone uses a non-SPARQL solution – iterating over triples fetched from a data store, rather than fetching only the specific triples – they’ll be more likely to be aware of those triples and able to do something with them. And once you’re iterating over a set of results that simple, there’s actually more work in building up a query than there is otherwise!
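[A toy illustration of this point, in plain Python over a made-up list of triples – not any particular RDF library's API: the query-style access never sees the extra dc:description, while the iteration-style access does.]

```python
# Triples about one person; note the dc:description the 'query' never asks for.
triples = [
    ("p1", "foaf:nick", "crschmidt"),
    ("p1", "foaf:name", "Christopher Schmidt"),
    ("p1", "dc:description", "runs julie, the redlandbot"),
]

# Query-style access: you only ever see the properties you asked for.
nick = next(o for s, p, o in triples if s == "p1" and p == "foaf:nick")
name = next(o for s, p, o in triples if s == "p1" and p == "foaf:name")

# Iteration-style access: every property of p1 passes through your hands,
# so extra data someone snuck in is visible instead of silently ignored.
seen = {p: o for s, p, o in triples if s == "p1"}

print(nick, name)
print(sorted(seen))  # includes dc:description as well
```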

    I do believe SPARQL has some great uses, but I just don’t want people to keep thinking that SPARQL makes things easier for implementors. I suppose it’s possible that it’s just me, but other than some easier syntax, SPARQL queries don’t make things much easier. Cleaner and shorter, code-wise, possibly. Easier, not in my opinion.

    I think for developing, once you build a good model, SPARQL isn’t going to help you out much. If you can get a unique identifier that you know ahead of time, then you can easily iterate over the statements in the model attached to that key – and you don’t have to mess with the speed of SPARQL, nor do you have to worry that some additional data that got snuck in is going to get in your way.

    My main concern right now is just with marketing, though. Something that for most people is going to *look* significantly slower for no discernible reason (people are used to SQL, where you can’t really give it a query that it won’t optimize) is going to make pushing SPARQL, or anything else, much more difficult in the long run.

  7. Ontogon Says:

    RDF Storage

    John Barstow has a short entry about why he thinks that the relational database will lose ground to RDF based…

  8. Danny Says:

    Ah, ok. I guess I’m probably more upbeat about SPARQL because most of my recent data access needs haven’t needed anything (directly) programmatic. The XML results => XSLT => HTML/dataFormatX pipe has been working very well for me in terms of programming simplicity (or at least separation of concerns – XSLT’s a dog). For traditional 3plus-tier maybe it isn’t such a good fit (yet). Most of the issues I’ve run into haven’t been directly related to the query engine, but then I’m still in early days on this stuff.

  9. Rich Says:

    I’ve thought about optimising queries when precompiling SPARQL in twinql.

    In general, triple patterns should be resolved in order of ‘completeness’ — ‘check’ patterns (those with no variables) should always be run first, as they can short-cut the query. The remaining patterns should be refined in order to produce as few matches as possible (which, all else being equal, means running 1-var, then 2-var, then 3-var matches in that order). It would perhaps be beneficial to allow smart optimisation under normal circumstances, with user-specified optimisation for cases where you can take more time.
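[The ordering heuristic described above can be sketched in a few lines of hypothetical Python, where `None` marks a variable position in a pattern:]

```python
# Sketch of ordering triple patterns by 'completeness' (fewest variables first).

def num_vars(pattern):
    """Count the variable positions (None) in a triple pattern."""
    return sum(1 for part in pattern if part is None)

def order_patterns(patterns):
    """Run 'check' patterns (no variables) first, then 1-var, 2-var, 3-var,
    so each step narrows the candidate bindings as early as possible."""
    return sorted(patterns, key=num_vars)

query = [
    (None, "foaf:nick", None),          # 2 vars: expensive, run last
    ("me", "foaf:knows", None),         # 1 var: selective, run early
    ("me", "rdf:type", "foaf:Person"),  # 0 vars: a pure check, run first
]
print(order_patterns(query))
```

[A real optimizer would also want cardinality hints from the store, since two patterns with the same number of variables can differ wildly in selectivity.]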

    I think you’re quite right that SPARQL is not a data access panacea. (I’d also like to see a comparison with paths, which are a common use-case.)

  10. Leo Sauermann Says:

    The approach suggested in 9):
    -In general, triple patterns should be resolved in order of ‘completeness’-

    I implemented this two years ago for RDQL in Jena, exactly as Rich suggests. Took me one day. CrSchmidt, your example is not very good, as it can be easily avoided by kindergarten-style triple reordering. The real lameness of SPARQL starts when you do graph-origin questions in combination with optional joins, groups, and fulltext stuff; then you can restart julie instantly.

    But these things will be solved; the standard is very good and it’s optimisable – database guys have been doing this for years. Good to write about SPARQL anyhow.

  11. Christopher Schmidt Says:

    Leo: Regardless of whether something is “possible”, if it’s not done, then advocating the use of technology which doesn’t do it is not ideal. SPARQL implementations at the moment – or at least Redland – are not optimized stores. Advocating their use is not ideal in the same way that advocating using a SQL store which was not optimized would not be ideal compared to other options before SQL optimization was well understood.

    Things aren’t difficult to do, but until they’re done, and regularly, advocating SPARQL-as-query-solution is dangerous and probably not the best way to do it.