Technical Ramblings

Archive for April, 2005

GovTrack RDF Data

Posted in PHP, Redland RDF Application Framework, SPARQL on April 25th, 2005 at 20:10:02

One of the larger sources of RDF data that I’ve loaded into a database, the GovTrack RDF data is an interesting set with all kinds of information on congressmen and so on. I recently started paying a bit more attention to the Gargonza Experiment, and found a link to their data source via the wiki.

I’ve been playing with setting up SPARQL stuff all day, and have a couple simple pages set up from my new GovTrack page. Loading the entire dataset (the RDF/XML, at least: the n3 bits I left out for the time being) took a long time, and I did some tweaking of MySQL in the process to allow me to load data faster. Some things I learned, for optimizing loading time with Redland:

1. MySQL’s key cache size is important when loading large data stores.
2. When loading statements, if you really want to optimize your load time, load with contexts. Redland will not check for duplicate statements in this case: This can be a major time saver. However, this may slow down later work, so it will probably not be worth it in the long term.
3. Loading into an already existing Redland database, even in a new model, will not increase speed: since Bnodes, Literals, and Resource tables are database wide, the selects to determine existing statements will still be just as slow as if you were loading into the existing models.

I also discovered that my QueryResults->result() method was returning actual Redland nodes, rather than the wrapped Redland.php::Node. I suppose at one point I probably realized that, but it had slipped my mind. This made it really difficult to do things like deal with optionals: calling the librdf_node_to_string in the PHP bindings causes them to segfault if the node is NULL, and there’s no decent way to check if the node is null that I found.

To compensate, I created a new way to create nodes (basically a copy constructor). This allowed me to check at node creation time whether it was a Resource/Bnode/Literal, which are the only types of Nodes there are. If it’s none of the above, I make it a PHP NULL, which I can check for, and it won’t crash PHP.
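Here is a rough sketch of that idea, in Python rather than PHP. This is not the actual wrapper code: the class names, the wrap_node function, and the (kind, value) pairs standing in for raw binding values are all made up for illustration.

```python
# Hypothetical stand-ins for the three node kinds; not the real Redland.php classes.
class Resource:
    def __init__(self, uri):
        self.uri = uri


class Bnode:
    def __init__(self, identifier):
        self.identifier = identifier


class Literal:
    def __init__(self, value):
        self.value = value


def wrap_node(raw):
    """Wrap a raw binding value, or return None if it is not a known node kind.

    `raw` is assumed to be a (kind, value) pair, or None for an unbound
    optional. That shape is an assumption made for this sketch.
    """
    if raw is None:
        return None
    kind, value = raw
    if kind == "resource":
        return Resource(value)
    if kind == "bnode":
        return Bnode(value)
    if kind == "literal":
        return Literal(value)
    # Anything else becomes None, which callers can test for safely.
    return None


# One made-up result row where the optional "homepage" variable is unbound.
row = {"name": ("literal", "Example Senator"), "homepage": None}
wrapped = {var: wrap_node(raw) for var, raw in row.items()}
for var, node in wrapped.items():
    print(var, "unbound" if node is None else type(node).__name__)
```

The point is simply that the type check happens once, at wrap time, so everything downstream only ever has to compare against a plain null.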

I have learned the many different ways to segfault PHP over the past week working on Redland. Of course, they all relate to PHP doing funky things with a SWIG wrapper, but it’s still one of the more interesting experiences I’ve had.

With the new PHP, all of the SPARQL interfaces I’ve got set up (one for Julie, one for XTech, one for GovTrack) support Optionals. This has allowed me to create things like the GovTrack Senators page (example for New Hampshire), listing some profile information about all the Senators from your state. (Representatives are more difficult. I’m still working on that.)

Anyway, the GovTrack data is fun to play with, although I really need to develop some more interesting interfaces over the data. I plan to do that: just haven’t gotten there yet. These tools take time to develop, but they do feel really nifty. I would go into the why’s of why I feel it’s nifty, but I almost always end up feeling like a complete and utter geek when I do it, and it makes people look at me strange, so I’ll skip it this time.


RDF Query

Posted in Perl, RDF, SPARQL on April 21st, 2005 at 14:50:13

Apparently the anxious type, Greg Williams has thrown together an RDF Query implementation in Perl, with support for the new SPARQL draft as of yesterday.

The library also offers ORDER BY support, something that I’m sure Greg is happy to have for his MT-Redland. Ordering things by date for me is something that I’ve sidestepped, but I’m not looking forward to when I actually have to deal with it.

The code uses Parse::RecDescent to generate a query based only on the SPARQL grammar. Greg mentions that it is slow: most of the time is actually in generating the Query from the Grammar.

If only I was still a Perl hacker… sadly, I’m not, so I suppose I’ll just have to start working on my C in order to help get Redland working with the new draft. (Dave estimates that it will take him about 1.5 months to catch up to the most recent WD of SPARQL.) I’d really love to just be able to use the tools I’ve already written in Python, rather than switching to Perl, or even another backend than Redland. It has worked so well for me so far.

Still, this is the first SPARQL implementation using the new Draft that I’m aware of, even if it is mostly just a hack job, so I think that it’s pretty cool, and my props are out to Greg for his work on it!


New SPARQL Draft

Posted in RDF, Semantic Web on April 20th, 2005 at 06:48:38

A new version of the SPARQL Working Draft was released today. Congratulations to the specification editors, Eric and Andy.

Major change in syntax from the previous version: Rather than using the tuple type query syntax (?a ?b ?c) (?a ?d ?e), the query format has changed to be more turtle-like: (?a ?b ?c; ?d ?e .) This is nice, because it lets you merge data entry and data query: I can add a turtle statement <#crschmidt> a foaf:Person; foaf:nick "crschmidt". and then query for all other people like that: (?p a foaf:Person; foaf:nick ?n.).

Another thing that was mentioned to me the other day is that the new query format doesn’t allow the optional commas between variables you select. So, SELECT ?p, ?n will now be SELECT ?p ?n. Not a big deal, but something that’ll bite me in the butt quite a bit as I get used to the switch.
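A throwaway sketch of how old queries might be updated for the comma change. It assumes the SELECT list contains only plain ?variables, and the function name is made up for illustration; it does nothing about the other syntax changes.

```python
import re

# SELECT, then one variable, then any number of further variables
# separated by whitespace and/or a single comma.
SELECT_LIST = re.compile(r"(SELECT\s+)(\?\w+(?:\s*,?\s*\?\w+)*)", re.IGNORECASE)


def strip_select_commas(query):
    """Rewrite 'SELECT ?p, ?n' as 'SELECT ?p ?n', leaving the rest untouched."""
    def fix(match):
        variables = re.findall(r"\?\w+", match.group(2))
        return match.group(1) + " ".join(variables)

    return SELECT_LIST.sub(fix, query)


print(strip_select_commas("SELECT ?p, ?n WHERE (?p foaf:nick ?n)"))
```

Only the SELECT list is touched, so commas elsewhere in the query (inside literals, say) are left alone.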

I currently track Redland/Rasqal releases for querying, so I’m going to be following along with dajobe as he works to get his Rasqal engine to switch to the new format. I know he’s already working on it, and I’m looking forward to being able to show off the new syntax in some of the tools I use, from the sparql interface to julie to the IRC version of the bot. I may even try my hand at one of the later tarballs and see how little C I actually know, and try and figure out if I can help in any way.

All in all, a major new release, so if you’re using SPARQL, pay attention!


Redland PHP Wrapper

Posted in PHP, RDF, Redland RDF Application Framework on April 17th, 2005 at 20:01:46

Today, I was working with the XTech Stuff, and decided I wanted to offer some fun Redland-based queries against it. Since the entire website is in PHP, I decided to stick with that theme, and write some PHP code.

I had the PHP bindings installed from a couple days ago, for… something I don’t exactly remember. I had some grand goal in mind… oh, right, I was going to provide my logo information in RDF, and parse it out using PHP.

Something I realized today is that there is no decent Redland Wrapper class like there is for Python and Perl. SWIG provides interfaces, but that basically just gets you to the level of the C API, which is something that is a bit low level for me.

To resolve this, I’ve written a PHP Wrapper class, which I hope to maintain and improve upon. It is stored in a subversion repository: you can check it out using:

svn co http://crschmidt.net/svn/redland/

Please feel free to use the trac project to help with the project.

Status: Beta Quality. Has only been tested using included test.php script. Does not do proper memory checks in any/most cases.
License: This wrapper is released under the same license as Redland itself.
Homepage: phpwrapper.


Recent Work

Posted in default, FOAF, julie, Semantic Web, Subversion on April 14th, 2005 at 21:34:36

I’ve been doing some work with FOAFNaut, SVG, and other related technologies lately. For the most part, the changes in and of themselves are too small to track in a weblog format, but I did build myself a little tool to store recent updates to crschmidt.net last night, so I could share them. crschmidt.net site updates has an HTML view, as well as an RSS 1.0 and RSS 2.0 view, and is used to display information on the front page on what has changed recently.

Today, I spent a big chunk of my afternoon playing with julie alongside DanC. He asked if I planned on implementing SPARQL in the bot any time soon (which I do, as soon as a Redland release supporting the turtle format for SPARQL queries comes out). We also talked about GRDDL support, and some other related things. He offered some interesting files which I added to the database, teaching julie more about W3C proceedings and allowing for some more interesting queries in that respect. I need to start keeping track of my todo list for julie so that I can get organized in the free time I have to do something about the state she’s in. I’m really starting to think another refactoring may be in order: although I received a pretty gigantic patch at one point, I still really haven’t “thrown one away” yet.

I also decided to install trac earlier today for some reason, something that was reinforced when I was asked to start a wiki for FOAFNaut internals as I was playing with it. You can check out the listing of projects I have here, which will grow as time continues, because I’m going to be moving more and more of my stuff into Subversion and more and more of what’s in Subversion to trac. It’s really nifty software, and I’m looking forward to playing with it. Who knows, it might shove a few more people into getting involved in my current projects. It’s got everything I need but have been too lazy to install in one place: wiki, bug tracking, source viewing, revision… quite nice, really.

Other than that, not much going on: Keep an eye on the site updates as I continue to do more little changes in and around crschmidt.net to my various projects.


SVG-Metadata

Posted in RDF, SVG on April 9th, 2005 at 19:40:27

Earlier, I posted about extracting SVG metadata with Redland. However, one of the problems with this is that there isn’t a whole lot of SVG out there, nor is there a whole lot of SVG with metadata out there.

One solution to this is the OpenClipArt Library – thousands of Public Domain SVG images with embedded metadata, totalling a heck of a lot of RDF information that could provide an interesting example of how RDF information can be used in real world scenarios.

However, the metadata provided by this library was, when I looked at it, broken RDF. I sent an email to the clipart list explaining the problems with their metadata, and received friendly and helpful replies letting me know that the data was generated with the SVG-Metadata perl library.

This weekend, I downloaded that code and began working on it, submitting a patch to the maintainer (who is also one of the founders of the Inkscape project, and works on OpenClipart), which was integrated today, improving their license support (now supporting all Creative Commons licenses) and their RDF output (such that it validates).

A new version has been released, uploaded to CPAN, and will soon be propagating its way to the CPAN archives. New SVGs uploaded to openclipart will contain metadata which is valid RDF, and Bryce is looking into regenerating the data on older SVGs as well.

More RDF. Better metadata. That’s something that I think I can live with.


RDF + GPG

Posted in RDF, Redland RDF Application Framework, Semantic Web on April 9th, 2005 at 14:35:34

One of my eventual goals is to have julie replace all the features of wh4 (libby’s query bot) and foafbot (edd’s community IRC bot). One thing that edd’s bot did that julie doesn’t is to verify data based on signed documents, and to use this information as a “provenance” for the data: not just “where was it said”, but actually verifying “who said it”.

Dealing with GPG is not nearly as easy as I really think it should be. Take Redland as an example: You can interact with the library at all kinds of levels, from the base swig wrapper to the hand-written RDF.py module wrapped around it, and you can do just about anything the base library does from within Python.

GPG, on the other hand, is very hard to work with from a library level. There is a Python module for working with GPG, but it equates to simply using the command line tool in the end. You can’t tell it “Check this document”: you basically just have tools to create a pipe to GPG, and pass the options in the same way you would on the command line. Add to that that it’s Yet Another Dependency which is somewhat of a pain to resolve in Python, and you can see why it’s slightly annoying for people who might want to use GPG.

Wondering what edd had done for FOAFbot back when it was running, I decided to grab his code and play with it. Turns out he just opened a pipe to gpg using the commands module in Python. This seemed simple enough to me, so I ripped out some of his code and turned it into a little script.

With that, I announce the release of rdfgpg, a tool for verifying the signature described by an RDF document. It uses the Redland Python bindings, and the usage is:

python rdfgpg.py http://example.org/urlof.rdf

Optionally, you can add a second argument, to set the debug argument, which will show more information about what’s going on in the background, which may help if something that you expect to work isn’t. Additionally, you can easily import the module and use the function rdfgpg.verify_url(url), which returns a list of email addresses on the signing key.
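For a feel of what the pipe-to-gpg approach looks like, here is a minimal sketch. It is not the rdfgpg code itself: the function names are made up, it uses subprocess rather than the commands module, and it assumes a gpg binary on the PATH, a detached signature file, and the signer’s key already in the local keyring. It also only sees the user ID gpg reports on its GOODSIG status line, not every address on the key.

```python
import re
import subprocess


def emails_from_status(status_output):
    """Pull email addresses out of GOODSIG lines in gpg's --status-fd output."""
    emails = []
    for line in status_output.splitlines():
        # Only a good signature counts; BADSIG and friends are ignored.
        if line.startswith("[GNUPG:] GOODSIG "):
            emails.extend(re.findall(r"<([^<>\s]+@[^<>\s]+)>", line))
    return emails


def verify_detached(document_path, signature_path):
    """Run gpg over a detached signature; return signer emails, or [] if not good."""
    result = subprocess.run(
        ["gpg", "--status-fd", "1", "--verify", signature_path, document_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        check=False,  # a bad signature is an expected outcome, not an exception
    )
    return emails_from_status(result.stdout)


# The parsing half can be tried without gpg installed, on a made-up status line.
sample = "[GNUPG:] GOODSIG 0123456789ABCDEF Alice Example <alice@example.org>\n"
print(emails_from_status(sample))
```

Reading the machine-readable status lines is a bit sturdier than scraping the human-readable output, which changes between gpg versions and locales.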

The code is released under a GPL 2.0 license, and is stolen in large part from the FOAFbot code released by Edd Dumbill. Feature requests via comments or email.

Hopefully with this, I’ll start to actually use it in my tools, to verify provenance when possible, and to start convincing people to sign their files. I hate to think what would happen to the semantic web if people suddenly started creating lots of false documents… but hopefully it’s not quite that popular yet.


Parsing SVG Metadata

Posted in Python, RDF, Redland RDF Application Framework, Semantic Web, SVG on April 7th, 2005 at 15:12:48

How to Parse SVG Metadata, the Redland + Python way:

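from urllib.request import urlopen
import xml.dom.minidom as minidom
import RDF

svg_location = "Location Of SVG File"

m = RDF.Model()
p = RDF.Parser()
# Download the SVG file and parse it into a DOM document
svg = urlopen(svg_location).read()
doc = minidom.parseString(svg)
# Hand the serialized rdf:RDF element to the parser, with the file location as base URI
p.parse_string_into_model(m, doc.getElementsByTagName("rdf:RDF")[0].toxml(), svg_location)
print(m)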

In other words: bring in the RDF and minidom modules, create an RDF model and parser, download the SVG file to a string, parse the string into a minidom document, then look for RDF in the SVG file, parse it into the model, and serialize the model.

Problems: What if someone uses something that’s not rdf: as the prefix?
Solutions: mattmcc offers that minidom supports getElementsByTagNameNS, so the parse line would become:
p.parse_string_into_model(m, doc.getElementsByTagNameNS("http://www.w3.org/1999/02/22-rdf-syntax-ns#", "RDF")[0].toxml(), "Location of SVG File")
resolving the Namespace issue.
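To see the difference without needing Redland installed, here is a small standard-library sketch. The SVG fragment is made up, and deliberately binds the RDF namespace to the prefix r instead of rdf.

```python
import xml.dom.minidom as minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A made-up SVG fragment that uses "r" rather than "rdf" as the RDF prefix.
svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <metadata>
    <r:RDF xmlns:r="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
      <r:Description r:about="">
        <dc:title>Example</dc:title>
      </r:Description>
    </r:RDF>
  </metadata>
</svg>"""

doc = minidom.parseString(svg)

# Looking up by the literal tag name depends on the prefix the author chose.
by_prefix = doc.getElementsByTagName("rdf:RDF")
# Looking up by namespace URI and local name does not.
by_namespace = doc.getElementsByTagNameNS(RDF_NS, "RDF")

print(len(by_prefix), len(by_namespace))

# This string is what would be handed on to the RDF parser.
rdf_xml = by_namespace[0].toxml()
```

The prefix lookup comes back empty here, while the namespace lookup finds the element regardless of what prefix was used.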

Of course, since this is Redland, this is taken care of for you. Rather than doing it in this way, which is specific to SVG, we can scan for RDF in any XML doc. Simply:

import RDF
m=RDF.Model(); p=RDF.Parser()
p.set_feature("http://feature.librdf.org/raptor-scanForRDF", "1")
p.parse_into_model(m, "URL Of SVG File")

There are a number of other features you can use with a Parser. They are available via rapper -f help, but here’s a list: assumeIsRDF, allowNonNsAttributes, allowOtherParsetypes, allowBagID, allowRDFtypeRDFlist, normalizeLanguage, nonNFCfatal, warnOtherParseTypes, checkRdfID.

Naturally, Redland already does what I want it to do. Another pat on the back for Dave (and thanks to him for pointing it out).


todo lists

Posted in julie, RDF, Semantic Web on April 6th, 2005 at 12:20:51

So, a while ago, I was bored and wanted to add something to my todo list. So I created a URI for a todo namespace, and used it a couple times via the ^addturtle function built into the julie IRC bot.

As usual with my todo methods, I totally forgot about it. Recently, I brought julie into a new IRC channel (#svg) and she met raxor, who immediately started going through her commandlist:

13:15:07 < julie> Current commands: allRelated, olb, like-pubs, maintainer, webpage, drankbeerwith, like-same-music-as, alldayevents, depiction, based_near, icbm, keywords, country-population, kissed, todo, authorlinks, like-musicalwork, like-books, title, rsslinktitles, country-background, languages, nick, neighborhoods, commentContains, pub-address, schemaweb, desc, homepage, workplace, available, country-lowestPoint, knows, quote, school, sha, ljinterests, xfn_met, members, country-highestPoint, rangeOf, term, made, name, places, agentknows, dob, like-musicians, domainOf, modified, picOfA, newdepiction, rsstitles, weblog, contact, javaPlatform, biodob, mbox, dranklagerwith, namefromany, rsslinks.

Wondering what todo was, he tried it, and got a todo item I had added long ago. I replied, “Oops. Never did that.” and went to work on investigating how I could make the todo feature more useful.

In the process, I added a command to julie to add a todo item given a string:

^todoItem document built ins

Will add for me a todo item of the following turtle:
[a todo:Item; todo:owner [a foaf:Person; foaf:nick "crschmidt"]; dc:date "currenttime"; todo:text "text given"].

It will then query the model for all existing todo items for me, and return that.
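Building that turtle from a string is simple enough to sketch. This is not julie’s actual code: the function name and the timestamp format are assumptions, the todo:, foaf: and dc: prefixes are assumed to be declared elsewhere, and the text is assumed to be a single line.

```python
from datetime import datetime, timezone


def todo_turtle(nick, text, when=None):
    """Build the turtle for one todo item, in the shape shown above."""
    if when is None:
        when = datetime.now(timezone.utc)
    # Escape backslashes and double quotes so the text stays a valid turtle string.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "[a todo:Item; "
        'todo:owner [a foaf:Person; foaf:nick "%s"]; '
        'dc:date "%s"; '
        'todo:text "%s"].' % (nick, stamp, escaped)
    )


print(todo_turtle(
    "crschmidt",
    "document built ins",
    datetime(2005, 4, 6, 12, 20, 51, tzinfo=timezone.utc),
))
```

The escaping matters because the text comes straight from IRC: a stray double quote would otherwise produce turtle that fails to parse.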

Of course, this has problems: one of them being I have no way to mark a todo item as “done” once it is, and other similar things, so I will have to work a bit more on the interface to the todo list, but it’s interesting, and I thought maybe other people might want to know about it.

I do need to start documenting the built ins like this: listeningTo is another example. They don’t have a ^commandinfo result, so I’ll have to improve julie’s built in help.

julie may also see some codepiction/path searching in the near future: Greg Williams (aka kasei) gave me some Perl code that he uses to find shortest paths in a Redland store, so I’ll hopefully be able to use that and build it into julie. Also need to get code back into subversion: I screwed up my working directory so that it’s not managed in subversion, so I haven’t checked in in weeks. (This was the same problem I had last time, when someone sent me a complete refactoring of julie – only he had done it against SVN, which wasn’t up to date.)

This isn’t as polished as my usual posts: I think sometimes I overthink what I’m writing a bit, so you may see a bit more “Hey, this is my cool semantic web trick of the day” posts in the future.


FOAF Names…

Posted in FOAF on April 2nd, 2005 at 22:44:39

A while ago I did a really crappy survey of how many people were using the various forms of “name” properties in the FOAF schema. I say “Crappy” because it was incomplete and generated via RDQL queries, which was a really silly way to do things, now that I know how to actually use a few of the Redland API calls. So, since I’m bored and working on a wrapper for a variety of Redland stuff, I figured I’d look at it again.

The model in question is the model for Julie, an IRC based interface to a Redland store. She contains about 2.3 megatriples, in a MySQL backed storage.

Total foaf:Persons: 129,932
Total statements using foaf:name as a predicate: 5549
Total statements using foaf:givenname as a predicate: 5363
Total statements using foaf:firstName as a predicate: 874
Total statements using foaf:family_name as a predicate: 117
Total statements using foaf:surname as a predicate: 6314
Total statements using foaf:nick as a predicate: 120529
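
The tally itself is just a count of statements per predicate. A plain-Python sketch of that counting, over made-up triples rather than the Redland API calls that produced the numbers above (the function name and sample data are illustrative only):

```python
from collections import Counter

FOAF = "http://xmlns.com/foaf/0.1/"


def count_predicates(triples, predicates):
    """Count how many statements use each of the given predicate URIs."""
    wanted = set(predicates)
    counts = Counter(p for (_s, p, _o) in triples if p in wanted)
    # Report zero for predicates that never appear, so the tally is complete.
    return {p: counts.get(p, 0) for p in predicates}


# Tiny made-up sample; the real model holds millions of statements.
sample = [
    ("_:a", FOAF + "nick", "alice"),
    ("_:a", FOAF + "name", "Alice Example"),
    ("_:b", FOAF + "nick", "bob"),
]
name_properties = [
    FOAF + local
    for local in ("name", "givenname", "firstName", "family_name", "surname", "nick")
]

for predicate, total in count_predicates(sample, name_properties).items():
    print(predicate, total)
```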

Keep in mind that a large chunk of this data is spidered from LiveJournal, so the results are most likely going to be extremely biased to that case, which has no use of any name properties other than foaf:nick.

Nothing all that impressive, really, but interesting as far as statistics go nonetheless.