Redland Updates

Posted in julie, RDF, Redland RDF Application Framework, SPARQL on May 19th, 2005 at 23:58:45

Dave released a new Raptor and new Rasqal today. I’ve built both, and rebuilt my Python bindings so I no longer get segfaults (Almost thought it was a bug, then Dave reminded me of previous “bugs” which were my fault).

As a result, all of my tools on both zeus and athena are now running the latest and greatest in the way of SPARQL, meaning the new query syntax (and I believe, new XML output syntax). I still need to update the examples on my PHP pages, but julie’s code is all up to date.

While I was at it, I took the opportunity to do some cleanups that I’ve been wanting to do for a while. You can see the revisions on the rdfpython trunk in trac’s timeline, but here’s a summary:

* Did some rewriting on mortenf’s smusher. I now get owl:sameAs triples in the store, so I have a reversible process to some extent for smushing, as well as making the smusher look for the shortest URI rather than just grabbing the first node it sees as “canonical”. Of course, I did this after a lot of URIs got tossed in my last smushing run… ah well, live and learn.
* Moved more code to use the “parse_anything” function that I wrote, which uses heuristics and logic to try and guess what kind of content we’re dealing with. It depends a lot on Content Types, but is also something I can edit and reload without restarting the bot, which is a major boon for me. This means that if something is broken, I can fix it, and make it more robust, without any kind of guilty conscience about flooding channels with joins/parts/quits.
* GRDDL support (with newest raptor) in parse_anything. Since ^add is really now parse_anything, this means that if you add a page with a GRDDL description Redland supports, you’ll get the triples out of it.
* Heuristics for queries, guessing which language is which. (Really ‘dumb’ right now: it just looks for “ {”, and considers it SPARQL if it has it.)
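
Two of the items above are small enough to sketch. This is not the bot's actual code: the function names are hypothetical, the tie-break on equal-length URIs and the direction of the owl:sameAs triples (alias pointing at the canonical node) are assumptions, and the query check is exactly as ‘dumb’ as described.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def pick_canonical(uris):
    """Pick the shortest URI as canonical.

    Ties are broken alphabetically (an assumption) so the choice is stable.
    """
    return min(uris, key=lambda u: (len(u), u))


def same_as_triples(uris):
    """Build (alias, owl:sameAs, canonical) triples so a smush can be undone."""
    canonical = pick_canonical(uris)
    return [(u, OWL_SAME_AS, canonical) for u in sorted(uris) if u != canonical]


def guess_query_language(query):
    """Call it SPARQL if the text contains ' {', otherwise assume RDQL."""
    return "sparql" if " {" in query else "rdql"
```

The query check will misfire on an RDQL query that happens to contain “ {” inside a literal, which is the price of keeping it this simple.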

What does this mean to you, dear user?

Well, quite simply, it means that you will probably get support for more formats (RSS, SVG, HTML+GRDDL, Atom, Turtle, ntriples) with less work (it’s all done through ^add). You can run queries in either the old format (RDQL) or the new format (SPARQL), or store either one.

I’d say that’s a benefit.

Thanks to Dave for getting new Redland stuff out the door.

The Origins of the Internet

Posted in default, Reading on May 14th, 2005 at 23:59:49

I’ve got a number of books on my “To read” list, most of which were given to me as part of my birthday or as random gifts. One of them was gifted to me by Tom Croucher from my Amazon wishlist (as a thank you for help in his dissertation, I believe). The book is Where Wizards Stay Up Late: The Origins of the Internet, and it’s one of my favorite non-fiction computer-related books.

The book is a relatively detailed study of how the internet came to be: the development of the theories behind it, the actual hardware, and the proposals from the Defense Department for its creation. It also covers the origin of RFCs, the way TCP/IP was invented, and things like AlohaNet and ethernet.

It’s because of this book that I still carry around Vinton Cerf’s business card, which I obtained at a 10th anniversary discussion of Mosaic. Cerf was there discussing the idea of an interplanetary internet and how it would work. I was there drooling at the fact that I was in the same room as Vinton Cerf, and actually got up on stage afterwards and shook his hand. I keep that card safe (it used to be in my wallet, now it stays separate) because it means that much to me, and it was inspired by this book.

So, a big thank you to Tom for the book, and a suggestion that you all go out and read it.

Blocking Port 25

Posted in SMTP on May 14th, 2005 at 09:26:52

So, for the first time this weekend, I’m on a network where outgoing mail on port 25 is blocked. How annoying.

I use a number of mail servers in a number of different ways. Typically, when on one of my Linux boxes (zeus or athena) I’ll send mail directly from those servers by using a localhost Postfix installation, and no smart or relay hosts. I don’t really see a need for my ISP to see my mail, and doing it this way is the default setup for most Linux distros that I’m aware of.

If I’m someplace that doesn’t have a mail server (ie the powerbook, creusa, or the mac mini, hermes), I use athena as a mail host, on which I have installed SASL authentication. Athena is set up to accept mail in a couple cases:

1. Mail from local network. This includes all the IPs in my block on Sagonet.
2. SASL Authenticated users: This uses password authentication against the local mail database to check which users can log in to the server to send mail.

As such, the server is protected against being an open relay (so long as I don’t get a spammer on the local machine, but I don’t think that’s going to be the case), and I like having it there as a backup for when other mail servers fail me. wedu’s mail uses POP before SMTP for authentication, which is all well and good, but can be a pain since the logs are reset at :45 past the hour, and if you try and send mail right after that, you get a nice “Relay denied” message.

In any case, I tried to send mail this morning via crschmidt.net… and got a timeout. Tried getting there from here, no go. Panicked a bit, since this is my main mail server, and if it’s down, that’s a bad thing. Tested it from zeus: no problems. Tested it locally: no problems. Tried going to another port 25… problems. So it’s on the Ameritech end. Great.
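
The same elimination test (try the port from several vantage points and see which ones fail) can be sketched in a few lines. This is an illustration only: `can_connect` is a hypothetical helper, not something used at the time, and a timeout or refusal only says the port is unreachable from where you are, not who is blocking it.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts alike.
        return False
```

Running `can_connect("crschmidt.net", 25)` from the blocked network and again from zeus would give False and True respectively; a False for every other mail server's port 25 as well points at the local network rather than the server.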

Set up an ssh tunnel: ssh -L 25:localhost:25 crschmidt@crschmidt.net. Set up a server in Mail.app as localhost port 25. Forward my mail. Sigh at Ameritech. Bitch in weblog. And the circle of life continues.

RDF and Images

Posted in Image Description, Semantic Web, SPARQL on May 8th, 2005 at 12:30:42

Tony Lounging

I know that I’m far too lazy to actually describe my images. I never do it. I write tools to help me, and I still don’t. So, my goal is to use tools which do it for me. With Masahide’s EXIF tools, flickr, and flickr2rdf, I can do this, with a little fudging to get the output to flow together better.

I have a lot of photos to describe, and I was going to get to work on it, when I reached for my keyboard… and found the cat lying on it. So, I switched to the other computer (zeus, rather than hermes) and got to work, creating a SPARQL interface for my photos. Maybe if I can search them, I’ll actually describe them.

I haven’t done a whole lot yet, but the start of my work is in place, with a nice SPARQL query against it. Of course, so far there’s only one photo, but this example should get you started, and if you care, you can check out the data as well.

Search My Photos – the crschmidt.net album organization service.

PHP and Redland

Posted in PHP, Redland RDF Application Framework on May 8th, 2005 at 09:49:27

Recently, I moved most of my serving to a colocated machine, so I finally have a “Testing” machine and a “stable” machine, leaving me more free to play around locally. This has led to me downloading a Rasqal nightly release and installing it, in an attempt to get the newer SPARQL query syntax working in my RDF bot, so that I can test query type detection and the like.

I had no problems installing it: very simple, just download the nightly, ./configure, make, make install. I got it working in my local “julietest”, although I’m waiting until the next release before I consider installing it on the remote server.

I got it working in PHP from the command line, no problems.

However, no matter what I do, the web version still seems to be using the old query syntax, and I don’t have any clue why. If you go to http://zeus.crschmidt.net/julie/sparql, you can test it out, and it only returns data if you use the old query format. However, if I copy the same script locally, and run the exact same query, it doesn’t work, requiring the new format.

I don’t understand it, and I don’t know if anyone else does either. The PHP in Apache2 and CLI both have almost exactly the same phpinfo(), they both have the same extension directory, and there isn’t a second copy of redland.so for the Apache version to load anyway! If anyone has run into this problem before and knows how to fix it, I’d appreciate it, because right now I’ve given up and am waiting for a release before I debug further.

(This post brought to you in part by the effort to bump all of Danny’s posts off of PlanetRDF while he’s on vacation. ;))

Planet, GNU Arch

Posted in GNU Arch, Planet Planet, Technology on May 5th, 2005 at 22:08:29

Yesterday, after some discussion regarding Bluemoon (currently offline, LiveJournal syndicated copy available at livejournal temporarily), the idea of a “Planet Swhack” was brought up: an aggregated collection of the weblogs of members of #swhack, much like the many other planets run by the Planet Planet software, or like Planet RDF, run off the Chumpologica aggregator.

So, yesterday, I set it up. AaronSw controls Swhack DNS, and wasn’t around at the same time as me at any point, so I set it up at a temporary URI to demonstrate it. Picked up some bloggers, and set up the stylesheet to be the same as my other Planet, PlanetMobile. Tonight, as I was preparing to ask AaronSw to set up DNS for Planet Swhack, I noticed that jcowan’s most recent entry was messing things up. I looked into the issue a bit more and found out that Planet was using version 2.7 of Mark Pilgrim’s Universal Feed Parser, which barfed quite badly on the XHTML in his Atom content.

So, I looked into it a bit more, and found out that the “nightly tarball” of Planet has not been updated since October. So much for any kind of decent release schedule.

Looking at a mailing list thread on release scheduling, I realized that the issues I was having had been fixed, and set about to check out the latest code from their version control.

Except there are no instructions on how to do that, just a repository name. And it’s GNU Arch, which I sure as hell don’t know. So, I go to install it… apt-get install tla, on my home machine… apt tells me:

Media change: please insert the disc labeled
‘Ubuntu 5.04 _Hoary Hedgehog_ – Preview i386 Binary-1 (20050310)’
in the drive ‘/cdrom/’ and press enter

Update: Since I get a large number of hits from Google for this issue: The way to fix this is to edit your /etc/apt/sources.list, and remove the first line in it that references the cdrom drive. (You can simply put a # in front of it.) Then type apt-get update. (You’ll have to edit and update as root – type sudo before the commands to do that.) If you need more help, feel free to comment. (2006-01-10)
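
For anyone who would rather script that edit, here is a minimal sketch. It works on the file's text rather than touching /etc/apt/sources.list directly, the function name is hypothetical, and it comments out every active `deb cdrom:` entry rather than only the first one, which is a slight generalization of the fix described above.

```python
def disable_cdrom_sources(text):
    """Comment out active 'deb cdrom:' entries in sources.list text."""
    out = []
    for line in text.splitlines():
        stripped = line.lstrip()
        # Lines that are already commented start with '#', so they are skipped.
        if stripped.startswith("deb") and "cdrom:" in stripped:
            out.append("# " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

You would still need to write the result back as root and run apt-get update afterwards.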
… right. I’m a 15 minute bike ride from home. Not going to happen. So, I switch from zeus to athena and try it, and get tla installed. Then start looking for instructions on how to check out a repository.

Apparently, the industry standard term “check out” is not part of the Arch repository system. Eventually, I wandered into Logjam’s arch repository, which provides clean instructions for how to get the code out of it:

tla register-archive http://logjam.danga.com/arch/2004
tla get logjam@danga.com--2004/logjam--dev--4.4

I was able to check out the “shiny development branch” of the Planet code, and get it in place on the site, fixing all my issues with Atom and XHTML content. All is well in the world, and Planet Swhack is a go. Never let it be said that checking out code from an arch repository is intuitive though. Anyone who thinks it is is out of their tree.

MySQLdb in Python

Posted in Python on May 5th, 2005 at 07:28:55

I was just looking around for a tutorial on working with MySQL in Python, and found a great Intro to MySQLdb in Python page. Since I know some of you reading this are fans of Python, and may have to work with MySQL at some point, I thought this might be interesting to those of you who have to look for it.

It’s not advanced, just an intro, but quite useful, in my opinion, for what it is.

GovTrack RDF Data

Posted in PHP, Redland RDF Application Framework, SPARQL on April 25th, 2005 at 20:10:02

One of the larger sources of RDF data that I’ve loaded into a database, the GovTrack RDF data is an interesting set with all kinds of information on congressmen and so on. I recently started paying a bit more attention to the Gargonza Experiment, and found a link to their data source via the wiki.

I’ve been playing with setting up SPARQL stuff all day, and have a couple simple pages set up from my new GovTrack page. Loading the entire dataset (the RDF/XML, at least: the n3 bits I left out for the time being) took a long time, and I did some tweaking of MySQL in the process to allow me to load data faster. Some things I learned, for optimizing loading time with Redland:

1. MySQL’s key cache size is important when loading large data stores.
2. When loading statements, if you really want to optimize your load time, load with contexts. Redland will not check for duplicate statements in this case: This can be a major time saver. However, this may slow down later work, so it will probably not be worth it in the long term.
3. Loading into an already existing Redland database, even in a new model, will not increase speed: since Bnodes, Literals, and Resource tables are database wide, the selects to determine existing statements will still be just as slow as if you were loading into the existing models.

I also discovered that my QueryResults->result() method was returning actual Redland nodes, rather than the wrapped Redland.php::Node. I suppose at one point I probably realized that, but it had slipped my mind. This made it really difficult to do things like deal with optionals: calling the librdf_node_to_string in the PHP bindings causes them to segfault if the node is NULL, and there’s no decent way to check if the node is null that I found.

To compensate, I created a new way to create nodes (basically a copy constructor). This allowed me to check at node creation time whether it was a Resource/Bnode/Literal, which are the only types of Nodes there are. If it’s none of the above, I make it a PHP NULL, which I can check for, and it won’t crash PHP.
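
The actual fix lives in the PHP wrapper, but the idea carries over to any binding. As a rough Python analogue, with everything here hypothetical (the raw node is modelled as a plain `(kind, value)` tuple or `None`, standing in for whatever the C library hands back):

```python
RESOURCE, LITERAL, BLANK = "resource", "literal", "blank"


class Node:
    """Wrapped node: only ever constructed for one of the three known kinds."""

    def __init__(self, kind, value):
        self.kind = kind
        self.value = value


def wrap_node(raw):
    """Wrap a raw result node, or return None when there is nothing to wrap.

    `raw` is a stand-in for a raw library node: a (kind, value) tuple, or
    None for a variable left unbound by an OPTIONAL.
    """
    if raw is None:
        return None
    kind, value = raw
    if kind not in (RESOURCE, LITERAL, BLANK):
        return None
    return Node(kind, value)
```

Callers can then test the result against None before touching it, instead of passing a possibly-null node to a string conversion that crashes.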

I have learned the many different ways to segfault PHP over the past week working on Redland. Of course, they all relate to PHP doing funky things with a SWIG wrapper, but it’s still one of the more interesting experiences I’ve had.

With the new PHP, all of the SPARQL interfaces I’ve got set up (one for Julie, one for XTech, one for GovTrack) support Optionals. This has allowed me to create things like the GovTrack Senators page (example for New Hampshire), listing some profile information about all the Senators from your state. (Representatives are more difficult. I’m still working on that.)

Anyway, the GovTrack data is fun to play with, although I really need to develop some more interesting interfaces over the data. I plan to do that: just haven’t gotten there yet. These tools take time to develop, but they do feel really nifty. I would go into the why’s of why I feel it’s nifty, but I almost always end up feeling like a complete and utter geek when I do it, and it makes people look at me strange, so I’ll skip it this time.

RDF Query

Posted in Perl, RDF, SPARQL on April 21st, 2005 at 14:50:13

Apparently the anxious type, Greg Williams has thrown together an RDF Query implementation in Perl, with support for the new SPARQL draft as of yesterday.

The library also offers ORDER BY support, something that I’m sure Greg is happy to have for his MT-Redland. Ordering things by date for me is something that I’ve sidestepped, but I’m not looking forward to when I actually have to deal with it.

The code uses Parse::RecDescent to generate a query based only on the SPARQL grammar. Greg mentions that it is slow: most of the time is actually in generating the Query from the Grammar.

If only I was still a Perl hacker… sadly, I’m not, so I suppose I’ll just have to start working on my C in order to help get Redland working with the new draft. (Dave estimates that it will take him about 1.5 months to catch up to the most recent WD of SPARQL.) I’d really love to just be able to use the tools I’ve already written in Python, rather than switching to Perl, or even another backend than Redland. It has worked so well for me so far.

Still, this is the first SPARQL implementation using the new Draft that I’m aware of, even if it is mostly just a hack job, so I think that it’s pretty cool, and my props are out to Greg for his work on it!

New SPARQL Draft

Posted in RDF, Semantic Web on April 20th, 2005 at 06:48:38

A new version of the SPARQL Working Draft was released today. Congratulations to the specification editors, Eric and Andy.

Major change in syntax from the previous version: Rather than using the tuple type query syntax (?a ?b ?c) (?a ?d ?e), the query format has changed to be more turtle-like: (?a ?b ?c; ?d ?e .) This is nice, because it lets you merge data entry and data query: I can add a turtle statement <#crschmidt> a foaf:Person; foaf:nick "crschmidt". and then query for all other people like that: (?p a foaf:Person; foaf:nick ?n.).

Another thing that was mentioned to me the other day is that the new query format doesn’t allow the optional commas between variables you select. So, SELECT ?p, ?n will now be SELECT ?p ?n. Not a big deal, but something that’ll bite me in the butt quite a bit as I get used to the switch.
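
For stored queries, the comma change is mechanical enough to automate. A minimal sketch, assuming the SELECT clause ends at the first WHERE keyword, opening brace, or opening parenthesis; the function name is hypothetical, and this is a regex rewrite rather than a real parser:

```python
import re

# SELECT keyword, then the variable list up to WHERE, '{', '(' or end of text.
_SELECT_CLAUSE = re.compile(
    r"(\bSELECT\b)(.*?)(?=\sWHERE\b|[{(]|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def drop_select_commas(query):
    """Replace commas between selected variables with single spaces."""
    def fix(match):
        keyword, variables = match.group(1), match.group(2)
        return keyword + re.sub(r"\s*,\s*", " ", variables)

    # count=1: only the first SELECT clause is rewritten.
    return _SELECT_CLAUSE.sub(fix, query, count=1)
```

Commas inside the pattern body, such as those in string literals, are left alone because the match stops before the body starts.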

I currently track Redland/Rasqal releases for querying, so I’m going to be following along with dajobe as he works to get his Rasqal engine to switch to the new format. I know he’s already working on it, and I’m looking forward to being able to show off the new syntax in some of the tools I use, from the sparql interface to julie to the IRC version of the bot. I may even try my hand at one of the later tarballs and see how little C I actually know, and try and figure out if I can help in any way.

All in all, a major new release, so if you’re using SPARQL, pay attention!