PersonalProfileDocument Parsing

Posted in FOAF, Python on May 23rd, 2005 at 18:14:44

Earlier today, on the OpenID mailing list, I was asked to supply Perl code to look for PPDs in FOAF docs and return some basic properties of the user who owns the FOAF file. My Perl skills have long since fallen by the wayside, but I was able to put together something in Python which seems to work pretty well.

ppd.py is a FOAF parser that uses xml.dom.minidom to look for a PPD and parse out a couple of basic forms of the Personal Profile Document, for cases in which you can't bring a full RDF parser to bear on the situation. (I know that the question of when this arises has been argued a million times, but an RDF parser is an extra dependency that some projects simply have no interest in bringing on.)

This parses two basic forms of PPD: one in which the foaf:maker is identified by an rdf:nodeID="nodename", and one in which the foaf:maker is identified by an rdf:resource="#nodename" coupled with an rdf:ID="nodename".
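
To give a feel for the approach, here's a minimal sketch along the same lines (not the actual ppd.py; the function name and structure are illustrative):

    from xml.dom import minidom

    FOAF = "http://xmlns.com/foaf/0.1/"
    RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

    def find_maker(doc):
        # Find the PersonalProfileDocument, then resolve its foaf:maker
        # to a foaf:Person node using either of the two forms above.
        for ppd in doc.getElementsByTagNameNS(FOAF, "PersonalProfileDocument"):
            for maker in ppd.getElementsByTagNameNS(FOAF, "maker"):
                node_id = maker.getAttributeNS(RDF, "nodeID")
                resource = maker.getAttributeNS(RDF, "resource")
                for person in doc.getElementsByTagNameNS(FOAF, "Person"):
                    # Form 1: rdf:nodeID="nodename" on both ends
                    if node_id and person.getAttributeNS(RDF, "nodeID") == node_id:
                        return person
                    # Form 2: rdf:resource="#nodename" pointing at rdf:ID="nodename"
                    if (resource.startswith("#")
                            and person.getAttributeNS(RDF, "ID") == resource[1:]):
                        return person
        return None

    doc = minidom.parse("foaf.rdf")  # or minidom.parseString() on fetched data
    person = find_maker(doc)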

This hasn't been fully tested: it was mostly done as a quick proof of concept that people could expand on. I've tested the nodeID case, and tested that it fails gracefully if it can't find an appropriate PPD (against LiveJournal files). I'm not sure how Pythonic my code is, but it does seem to work, which was my primary concern.

As usual, this code is designed to be used at the command line as "python ppd.py http://crschmidt.net/foaf.rdf", or imported as a module, after which you can run ppd.get_person("http://crschmidt.net/foaf.rdf").

Thoughts on the method? Will this work with a sufficiently constrained FOAF doc?

XSLT + Image Regions + Sparql

Posted in Flickr, Image Description, RDF, SPARQL, XSLT on May 22nd, 2005 at 20:05:23

Read Masahide’s notes on XSLT+Image Regions. Used some tools to convert my flickr photos to RDF.

Converted an XSLT Stylesheet to a different result format. Loaded ~400 RDF files into a Model, totalling 33,000 statements. Added an option to my Sparql Interface. Changed the default query. Made the extra option add the stylesheet.
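
For the curious, loading a pile of RDF/XML files into a Redland model from the Python bindings looks roughly like this (a sketch; the store name, options, and paths are made up):

    import glob
    import RDF

    # An in-memory hashes store; a BDB-backed store would persist instead.
    storage = RDF.Storage(storage_name="hashes", name="photos",
                          options_string="new='yes',hash-type='memory'")
    model = RDF.Model(storage)
    parser = RDF.Parser(name="rdfxml")
    for path in glob.glob("photos/*.rdf"):  # the ~400 photo files
        parser.parse_into_model(model, "file:" + path)
    print(model.size())  # around 33,000 statements, in my case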

Ran a query. Tweaked until it worked. Typed it all up here, to share with all of you.

Hooray for masahide, flickr, and all kinds of other wonderful things.

Wedged Subversion Repository

Posted in Subversion on May 20th, 2005 at 22:11:57

Earlier this morning, one of my projects' Subversion repositories got wedged. After figuring out that it actually was wedged (no GET response, no PROPFIND response, and svn and svnadmin commands hanging until killed with kill -9), I started playing with svnadmin. Still didn't work. Hopped into #svn. Asked, and was pointed to the FAQ.

Copied the current repository to another location before attempting anything else, since I've fucked up a BDB-based Subversion repository attempting to repair it before.

Attempted svnadmin recover /var/www/svn/rdfpython: failed with lots and lots of "PANIC" type errors.

Attempted svnadmin recover ~/newcopyofrepos: that seemed to work. An svnadmin verify ~/newcopyofrepos confirmed that it had.

Made another backup of the repos, removed it, copied the new ~/newcopyofrepos into place.

And the world was good again. A verify/checkout confirmed that the files were all in place, trac started to work again, and all was well, good, and happy.

However, I think that from now on I may use the fsfs storage method rather than BDB. This is certainly not the first time this has happened to me or to anyone else, and I'm starting not to trust BDB for mission-critical tasks, which is what I consider Subversion to be. My version control is one of the few things that I don't have completely backed up most of the time: files I can copy around easily, but databases of changes to files typically stay in one place. I could recreate the structure, but I couldn't really recreate the history, and that's important to me.

If anyone has any experience with fsfs SVN repositories over BDB based ones, I’d be glad to hear it.

Lynx View

Posted in Technology, Web Publishing on May 20th, 2005 at 19:20:58

A new crschmidt.net webservice:

lynxview, which takes a domain name and converts the site to lynx -dump form. For when you want to show some Windows user just how crap their website is with all the graphics turned off.

As a form of demonstration, check out crschmidt.net or Planet Mobile.

Currently, sites are cached eternally, so that the service can’t be used to DDoS some poor site.
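
The guts of a service like this are small. A sketch of the general idea in Python (the cache location and details are invented, not the actual service code):

    import hashlib
    import os
    import subprocess

    CACHE_DIR = "/tmp/lynxview-cache"  # hypothetical location

    def lynx_dump(url):
        # Cache forever, keyed on the URL, so repeat requests never
        # hit the target site again.
        os.makedirs(CACHE_DIR, exist_ok=True)
        cached = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())
        if os.path.exists(cached):
            return open(cached).read()
        text = subprocess.run(["lynx", "-dump", url],
                              capture_output=True, text=True).stdout
        open(cached, "w").write(text)
        return text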

Produced in part by a request from DanC on #swig earlier today.

Redland Updates

Posted in julie, RDF, Redland RDF Application Framework, SPARQL on May 19th, 2005 at 23:58:45

Dave released a new Raptor and a new Rasqal today. I've built both, and rebuilt my Python bindings so I no longer get segfaults. (I almost thought it was a bug, then Dave reminded me of previous "bugs" which were my fault.)

As a result, all of my tools on both zeus and athena are now running the latest and greatest in the way of SPARQL, meaning the new query syntax (and, I believe, the new XML output syntax). I still need to update the examples on my PHP pages, but julie's code is all up to date.

While I was at it, I took the opportunity to do some cleanups that I've been wanting to do for a while. You can see the revisions on the rdfpython trunk in trac's timeline, but here's a summary:

* Did some rewriting on mortenf's smusher. I now get owl:sameAs triples in the store, so smushing is a reversible process to some extent, and the smusher now looks for the shortest URI rather than just grabbing the first node it sees as "canonical" (sketched after this list). Of course, I did this after a lot of URIs got tossed in my last smushing run… ah well, live and learn.
* Moved more code to use the "parse_anything" function I wrote, which uses heuristics and logic to try to guess what kind of content we're dealing with. It depends a lot on Content-Types, but it's also something I can edit and reload without restarting the bot, which is a major boon for me. This means that if something is broken, I can fix it and make it more robust without any kind of guilty conscience about flooding channels with joins/parts/quits.
* GRDDL support (with newest raptor) in parse_anything. Since ^add is really now parse_anything, this means that if you add a page with a GRDDL description Redland supports, you’ll get the triples out of it.
* Heuristics for queries, guessing which language is which. (Really dumb right now: it just looks for " {" and considers the query SPARQL if it finds it; see the sketch after this list.)
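
Both of those fit in a few lines; here's an illustrative sketch (not the actual bot code, and the function names are made up):

    def guess_query_language(query):
        # SPARQL has a braced graph pattern; RDQL doesn't use braces at all.
        return "sparql" if " {" in query else "rdql"

    def pick_canonical(uris):
        # Smushing: treat the shortest URI as canonical; the others get
        # linked to it with owl:sameAs triples, keeping the merge reversible.
        return min(uris, key=len)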

What does this mean to you, dear user?

Well, quite simply, it means that more formats (RSS, SVG, HTML+GRDDL, Atom, Turtle, ntriples) will probably be supported with less work (it's all done through ^add). You can run queries in either the old language (RDQL) or the new one (SPARQL), and store either one.

I’d say that’s a benefit.

Thanks to Dave for getting new Redland stuff out the door.

The Origins of the Internet

Posted in default, Reading on May 14th, 2005 at 23:59:49

I've got a number of books on my "to read" list, most of which were given to me as birthday presents or random gifts. One of them was a gift from Tom Croucher, from my Amazon wishlist (as a thank-you for help with his dissertation, I believe). The book is Where Wizards Stay Up Late: The Origins of the Internet, and it's one of my favorite non-fiction computer-related books.

The book is a relatively detailed study of how the internet came to be: the development of the theories behind it, the actual hardware, the proposals from the Defense Department for its creation, the origin of RFCs, the way TCP/IP was invented, and things like AlohaNet and Ethernet as well.

It's because of this book that I still carry around Vinton Cerf's business card, which I obtained at a 10th anniversary discussion of Mosaic. Ever since I acquired the card, I've been fascinated with the design printed on it. Cerf was there discussing the idea of an interplanetary internet and how it would work. I was there drooling at the fact that I was in the same room as Vinton Cerf, and actually got up on stage afterwards and shook his hand. I keep that card safe – it used to be in my wallet, now it stays separate – because it means that much to me, and it was inspired by this book.

So, a big thank you to Tom for the book, and a suggestion that you all go out and read it.

Blocking Port 25

Posted in SMTP on May 14th, 2005 at 09:26:52

So, for the first time this weekend, I’m on a network where outgoing mail on port 25 is blocked. How annoying.

I use a number of mail servers in a number of different ways. Typically, when on one of my Linux boxes (zeus or athena), I'll send mail directly from those servers using a localhost Postfix installation, with no smart host or relay host. I don't really see a need for my ISP to see my mail, and doing it this way is the default setup for most Linux distros that I'm aware of.

If I'm someplace that doesn't have a mail server (i.e. the powerbook, creusa, or the mac mini, hermes), I use athena as a mail host, on which I have installed SASL authentication. Athena is set up to accept mail in a couple of cases:

1. Mail from the local network. This includes all the IPs in my block on Sagonet.
2. SASL-authenticated users: this uses password authentication against the local mail database to check which users can log in to the server to send mail.

As such, the server is protected against being an open relay (so long as I don't get a spammer on the local machine, but I don't think that's going to happen), and I like having it there as a backup for when other mail servers fail me. wedu's mail uses POP-before-SMTP for authentication, which is all well and good, but it can be a pain: the logs are reset at :45 past the hour, and if you try to send mail right after that, you get a nice "Relay denied" message.

In any case, I tried to send mail this morning via crschmidt.net… and got a timeout. Tried getting there from here: no go. Panicked a bit, since this is my main mail server, and if it's down, that's a bad thing. Tested it from zeus: no problems. Tested it locally: no problems. Tried reaching another server's port 25 from here… problems. So it's on the Ameritech end. Great.

Set up an ssh tunnel: ssh -L 25:localhost:25 crschmidt@crschmidt.net. Set up a server in Mail.app as localhost port 25. Forward my mail. Sigh at Ameritech. Bitch in weblog. And the circle of life continues.

RDF and Images

Posted in Image Description, Semantic Web, SPARQL on May 8th, 2005 at 12:30:42

[Photo: Tony Lounging]

I know that I’m far too lazy to actually describe my images. I never do it. I write tools to help me, and I still don’t. So, my goal is to use tools which do it for me. With Masahide’s EXIF tools, flickr, and flickr2rdf, I can do this, with a little fudging to get the output to flow together better.

I have a lot of photos to describe, and I was going to get to work on it when I reached for my keyboard… and found the cat lying on it. So, I switched to the other computer (zeus, rather than hermes) and got to work creating a SPARQL interface for my photos. Maybe if I can search them, I'll actually describe them.

I haven't done a whole lot yet, but the start of my work is in place, with a nice SPARQL query against it. Of course, so far there's only one photo, but this example should get you started, and if you care, you can check out the data behind it.
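
For a taste, here's roughly what a query like that looks like from the Redland Python bindings (the properties are plausible DC/FOAF terms, not necessarily the ones my data uses):

    import RDF

    query = RDF.Query("""
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?img ?title
        WHERE { ?img dc:title ?title ;
                     foaf:depicts ?person . }
    """, query_language="sparql")
    for result in query.execute(model):  # model holds the photo RDF
        print(result["img"], result["title"])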

Search My Photos – the crschmidt.net album organization service.

PHP and Redland

Posted in PHP, Redland RDF Application Framework on May 8th, 2005 at 09:49:27

Recently, I moved most of my serving to a colocated machine, so I finally have a "testing" machine and a "stable" machine, leaving me more free to play around locally. This led me to download and install a Rasqal nightly release, in an attempt to get the newer SPARQL query syntax working in my RDF bot, so that I can test query-type detection and the like.

I had no problems installing it: very simple, just download the nightly, ./configure, make, make install. I got it working in my local "julietest", although I'm waiting until the next release before I consider installing it on the remote server.

I got it working in PHP from the command line, no problems.

However, no matter what I do, the web version still seems to be using the old query syntax, and I don't have any clue why. If you go to http://zeus.crschmidt.net/julie/sparql, you can test it out; it only returns data if you use the old query format. However, if I copy the same script locally and run the exact same query, it doesn't work: the local copy requires the new format.

I don't understand it, and I don't know if anyone else does either. The PHP under Apache2 and the CLI both have almost exactly the same phpinfo(), they both have the same extension directory, and there isn't a second copy of redland.so for the Apache version to load anyway! If anyone has run into this problem before and knows how to fix it, I'd appreciate the help, because right now I've given up and am waiting for a release before I debug further.

(This post brought to you in part by the effort to bump all of Danny's posts off PlanetRDF while he's on vacation. ;))

Planet, GNU Arch

Posted in GNU Arch, Planet Planet, Technology on May 5th, 2005 at 22:08:29

Yesterday, after some discussion regarding Bluemoon (currently offline; a LiveJournal syndicated copy is available temporarily), the idea of a "Planet Swhack" was brought up: an aggregated collection of the weblogs of members of #swhack, much like the many other planets run by the Planet Planet software, or like Planet RDF, which runs off the Chumpalogica aggregator.

So, yesterday, I set it up. AaronSw controls Swhack DNS and wasn't around at the same time as me at any point, so I set it up at a temporary URI to demonstrate it. Picked up some bloggers, and set up the stylesheet to be the same as my other Planet, PlanetMobile. Tonight, as I was preparing to ask AaronSw to set up DNS for Planet Swhack, I noticed that jcowan's most recent entry was messing things up. I looked into the issue a bit more and found out that Planet was using version 2.7 of Mark Pilgrim's Universal Feed Parser, which barfed quite badly on the XHTML in his Atom content.

So, I looked into it a bit more, and found out that the “nightly tarball” of Planet has not been updated since October. So much for any kind of decent release schedule.

Looking at a mailing list thread on release scheduling, I realized that the issues I was having had been fixed, and set about to check out the latest code from their version control.

Except there are no instructions on how to do that, just a repository name. And it's GNU Arch, which I sure as hell don't know. So, I go to install it… apt-get install tla on my home machine… and apt tells me:

Media change: please insert the disc labeled
'Ubuntu 5.04 _Hoary Hedgehog_ - Preview i386 Binary-1 (20050310)'
in the drive '/cdrom/' and press enter

Update: Since I get a large number of hits from Google for this issue: The way to fix this is to edit your /etc/apt/sources.list, and remove the first line in it that references the cdrom drive. (You can simply put a # in front of it.) Then type apt-get update. (You’ll have to edit and update as root – type sudo before the commands to do that.) If you need more help, feel free to comment. (2006-01-10)
… right. I'm a 15-minute bike ride from home. Not going to happen. So, I switch from zeus to athena and try it there, and get tla installed. Then I start looking for instructions on how to check out a repository.

Apparently, the industry-standard term "check out" is not part of the Arch repository system. Eventually, I wandered into Logjam's arch repository, which provides clean instructions for how to get the code out of it:

tla register-archive http://logjam.danga.com/arch/2004
tla get logjam@danga.com--2004/logjam--dev--4.4

I was able to check out the "shiny development branch" of the Planet code and get it in place on the site, fixing all my issues with Atom and XHTML content. All is well in the world, and Planet Swhack is a go. Never let it be said that checking out code from an arch repository is intuitive, though. Anyone who thinks it is is out of their tree.