So often lately, I’ve seen some prominent people in the Semantic Web advising that the possibilities of using RDF as a backing store for an application are great, and that maybe people should use “SPARQL as a query language for the application”.
Stop. Don’t do that.
Right now, we’re in a situation where RDF implementations are really usable – if you are aware of their limitations, and avoid them. This is also true in MySQL: You don’t make every one of your queries include half a dozen joins, so you don’t want to do a similar thing in SPARQL. The problem is that with RDF query languages, unless you’ve spent a lot of time with them, you’re working on something that’s much more difficult to understand the work behind. It’s extremely easy to write a query that is not well optimized that take the application a very long time to compute, at the cost of making RDF and SPARQL look as if they are too slow.
They’re not.
For small size web applications, there is no real reason that a properly built site could not be extremely quick, and built on RDF tools. (At the current point in time, I won’t say the same for large applications: I don’t have any knowledge of what happens to triplestores once they get past 2 million triples.) However, most queries that people write are slow, simply because they are not optimized. Depending on how exactly the application-level query translator works, you may be dealing with something that’s not completely optimized, which can have an extreme negative impact on your query time.
An example? Well, my knowledge is mostly in Redland, so I’ll just toss out a query via julie, the redlandbot.
09:05:18 < crschmidt> ^q select ?i where { <http://crschmidt.net/foaf.rdf#crschmidt> foaf:knows ?p. ?p foaf:nick ?i. }
09:05:18 < julie> solcita, bluemoonshark, telepwen, jayo, littledownpour, jessicacmalfoy, csogilvie, alacrity, wyvernbitch, pne,
luxtiger, danceinacircle, bertho, ursamajor, pie_is_good, jessical, danbri, ryanbyers, shupe, thebubba, kangarooofdoom,
neviachiel, kamara, joanna4136, raventhon, evilcat84, chrisg, nostrademons, coffeechica, fracturedfaerie, nyxie,
siren52684, pthalogreen, ChemicalLace, zach, seymour, adcott, girlxfriday, meinterrupted, biztheinsane,
sarah_mascara, busbeytheelder, tinyjo, rho, xtremesaints, sherm, mendel, acerbic, thewildrose, bobert225,
sporkmistress, isabeau, beginning, supersat, braindrain, ratkrycek, opal1159, maryam, uberzacker, lor22ms, burr86,
comeseptember, rahaeli, pezstar, girlfriday, xavier, jproulx, roy
That’s right, those results on a hot database are returned in less than one second. On the other hand, if you do the query in the opposite order:
09:05:39 < crschmidt> ^q select ?i where { ?p foaf:nick ?i. <http ://crschmidt.net/foaf.rdf#crschmidt< foaf:knows ?p. }
09:18:29 < julie> solcita, bluemoonshark, acerbic, jayo, bobert225, …
You have a multi-minute wait. (I’ll fill this in when it comes back: it’s been 3 minutes so far, and still going.) Ah, it finished. Just shy of 13 minutes. Someone less versed in Redland would look at that and say “Why?”
Redland’s query engine, Rasqal, works based on the triples it finds as it goes through the query. In this case, for the first query, it finds approximately 100 triples: “Here are the foaf:knows triples pointing from the crschmidt node. Now let’s go find their nicks.” It then has approximately 100 distinct subjects to match in the second part of the query. Now, look at the second query: You’ll notice that the first triple pattern is going to match a lot more triples than the first: in this case, I think it’s approximately 20,000 triples. Then, each of those triples will be matched against the second part of the query. It has to ask approximately 200 times as many questions to the triplestore behind the data. That’s 200 times as long to wait to get all the data out. Since most of the data will probably be “cold” (not cached in the MySQL table cache), you’ll not only be waiting to get it out, but you’ll also probably be emptying out the MySQL cache of anything but this one useless query. All because you got your triple patterns mixed up.
Perhaps this is just a Redland problem: I don’t know. I’ve not used any of the Java-based tools, and I don’t know of any other non-Redland tools for working with SPARQL against a large data store. However, it is an image problem when you advise or offer to “just use SPARQL”: People do not, by intuition, recognize that the above two queries will be any different. Since it’s extremely hard to notice the differences on something that is small scale, it’s hard to catch these mistakes when they start showing up without a fair amount of experience in the application level RDF tool. Ah, it finished, so back up to show you timeframes…
Before you start using SPARQL for an application query language, consider the application you’re using, and how you can optimize it. SPARQL queries can be made to run much faster, generally, if you have a good knowledge of what you’re doing. However, working with the raw triples using the methods the library provides will oftentimes provide a level of insight as to what’s actually going on under the hood that can be extremely useful for knowing how things can work better.
I think that people need to stop advocating SPARQL as the end-all, be all solution for everything in RDF. I’m sure that it can be great, and that there’s tons of great ways to use it. However, one of the most popular RDF engines does not work well with most SPARQL queries that I’ve seen people throw at it. The first step in learning how to properly use SPARQL (for any kind of time-neccesary things: anything over HTTP basically counts here) is learning how to properly manipulate the triples in the way that works best in the application without the query language.