nœts - Re-organizing inamidst.com with Keywords

Re-organizing inamidst.com with Keywords

I'd better write this up here in lieu of being able to post to Eph, which is something that I really ought to sought (heh, typo: sort) out, along with posting to miscoranda.com etc.

d8uv asked me what I'm doing, so I started to discuss the new keyword-based system for inamidst. The first step was when I worked on a new /sbp/projects page, for which I asked d8uv to provide some 13x13 PNG icons that represent certain things: code, documentation, activities, and publications to begin with. I also did one myself that represents "everything", for the UI. I made a JavaScript switcher so that people could choose to view everything in one category. The neat side-effect is that all of the icons that d8uv produced look totally awesome when you use them as list-item-images.

Cut to: reorganization of the inamidst.com/ home page. Essentially, it's like a wider version of /sbp/projects, but encompassing only the things that are published at inamidst.com and not the cool wider stuff I've worked on elsewhere. It's a site index, not an sbp-projects index. So I made a highlights section there, and decided to use the same images that d8uv had provided for the original projects page. In doing so, I noticed that some of the things in the list could really have had any of a number of icons. The best example of this is /aefram/, the Atom Extensibility Framework, which is a mixture of activity, publication, code, and documentation. The only way I could think of for resolving it was to just pick an order of preference for the icons: so aefram gets the "activity" icon at the moment because that's the coolest icon!

But that's not enough, and in trying to solve that problem I think I may have inadvertently solved a wider problem.

I hate organising the site. I hate filenames, pathnames, hierarchies; I hate publishing stuff to URIs that I know just a week later I'll be thinking are entirely wrong. I want to move stuff about, but I can't. I just really hate folders, and yet we're all constrained to folders on our filesystems. The problem that this presents really gets to me sometimes, and I've spent hours thinking about how I might solve it.

The obvious route is to use inherent metadata, and possibly a more flexible form of external metadata, i.e. attributes that people can set on the files themselves. So instead of having /blargh point to a single file named blargh, why not point it to a search for all the files named blargh? Then, to provide a URI to something, you simply have to provide the shortest URI which can disambiguate a particular file based upon its properties.

That all sounds okay in theory, but in practice it's extremely difficult to implement in a natural and intuitive manner. One of the things that's been bothering me is: things on the web should have canonical URIs, but keyword-based systems are inherently non-canonical in that 2005-03.python.code is the same as python.2005-03.code is the same as code.2005-03.python, etc., which gets worse as you include more keywords, since N keywords can be ordered in N! ways. That impasse I recently solved. The answer is obvious really: make non-canonical URIs 301 redirect to the canonical URIs, and have the canonical sorting order simply be lexicographical. In other words, sort the keywords alphabetically, and have Apache enforce that sorting.
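The canonicalization rule can be sketched in a few lines. This is only an illustration in Python rather than the Apache configuration mentioned above; the function names are made up for the example, and dropping duplicate keywords is an assumption that the scheme doesn't spell out.

```python
def canonical_keywords(path):
    """Return the dotted keyword path with its keywords sorted lexicographically."""
    keywords = [k for k in path.strip("/").split(".") if k]
    # Assumption: a repeated keyword carries no extra meaning, so collapse it.
    return "/" + ".".join(sorted(set(keywords)))


def redirect_for(path):
    """Return (301, canonical_path) for a non-canonical path, else None."""
    canonical = canonical_keywords(path)
    if canonical != path:
        return (301, canonical)
    return None
```

With this, every ordering of the same keywords redirects to a single URI, and a request for that URI is served directly because redirect_for returns None for it.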

I've likely "solved" this a few times before and forgotten about it; it's the application to the problem at hand that really matters, being able to think of a variety of different factors and how they'll all influence one another for a single finished result. Still, it's that breakthrough of thought that's made me think that a keyword-based metadata system for inamidst.com will be profitable. (Profitable in the wider sense of giving a return of any kind, not necessarily a financial one.)

So back to the keywords on the homepage at the moment. Instead of having to choose one icon for a particular thing and having that only represented via that icon (which is //li/@class based, with a CSS rule), why not a) allow resources to have multiple keywords, and b) store these in a kind of database? As anyone will know who's followed practically anything that I do, I'm a great believer in using the filesystem as a database first and foremost, so I wondered how to apply that principle here. I wanted at first to be able to add the information to the files themselves, so that it's all inherent. That'd be great for HTML files where you can do <meta> keywords, but for Python files and other files such as JPGs, where adding metadata is difficult to impossible and requires N solutions for N formats, it's just way too inflexible. I also considered having metadata.txt files or something of that nature in each directory, but again, that means that the directories get rather cluttered up. filename.meta was also an idea, but that might have interfered with conneg. So eventually I came up with the idea of a shadow directory. If you have /path/filename and want to write some metadata about it, you store that metadata in /meta/path/filename; and you store it as RFC 822-style headers, of course, the field names for which can be easily mapped to URIs.
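As a rough sketch of the shadow directory idea, the following maps a site path to its metadata path and reads RFC 822-style headers using Python's standard email parser. The function names are invented for the example, and so is the "title" field in the usage note; "keywords" is the only field name taken from the scheme itself.

```python
import posixpath
from email.parser import Parser


def meta_path(path, meta_root="/meta"):
    """Map /path/filename to its shadow location /meta/path/filename."""
    return posixpath.join(meta_root, path.lstrip("/"))


def parse_metadata(text):
    """Parse RFC 822-style headers into a dict of lowercased field name -> value."""
    message = Parser().parsestr(text, headersonly=True)
    # If a field is repeated, the last occurrence wins in this simple sketch.
    return {name.lower(): value for name, value in message.items()}
```

For example, parse_metadata("title: Example\nkeywords: code documentation\n") gives a dict whose "keywords" value can be split on whitespace to get the individual keywords.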

So I started work on /meta/, the most "trivial" aspect of which would mean mod_rewriting all of the requests through a particular script which could deliver these RFC 822-style metadata files in XHTML, Turtle, or RDF/XML. Unfortunately, this turned out to need all kinds of stuff including Accept header parsing, so it took rather a long time; but it was an enjoyable process and it's now implemented quite well. So now it all depends on me making lots of /meta/filename instances for each of the filenames that I want to have represented in this Grand Database Scheme.
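To give a flavour of why Accept header parsing takes a while, here is a deliberately simplified sketch of choosing between the three output formats. It is not the actual /meta/ script: the media type strings are assumptions, only the q parameter is understood, and of the wildcards only */* is handled.

```python
# Assumed media types for XHTML, Turtle, and RDF/XML; the first is the default.
AVAILABLE = ["application/xhtml+xml", "text/turtle", "application/rdf+xml"]


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable q-value as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_format(header, available=AVAILABLE):
    """Return the preferred available media type, or None if nothing matches."""
    if not header:
        return available[0]  # no Accept header: anything is acceptable
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
    return None
```

A None result would correspond to a 406 Not Acceptable response, or to falling back to a default representation, whichever the script prefers.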

The next part of the process will be as follows: one of the fields that's allowed in the metadata files is "keywords", which is of course a list of these service, code, documentation, project, etc. keywords that are linked to the icons that d8uv made for me. I plan to sporadically harvest all of these fields and compile them into a single database. Note that for anything pertaining to /meta/, especially the scripts that drive it, I've decided to use /meta/meta/ since it's the only place in the hierarchy that can safely be used. Therefore, the source for the rewriter script is at /meta/meta/src, and the script itself is at /meta/meta/data (which redirects to the former URI when directly accessed), though that may change. So, all of the filenames that have, for example, the "service" keyword in them will be written to a database at /meta/service.db, or something of that nature. I've not worked out the details entirely yet. This means that I'll be able to easily query for various types of file.

Note that this doesn't solve the problem of where I actually put my files in the filesystem. It's a sad fact that that can probably not be solved until the filesystem itself changes, and that's going to be a long transitional process. But in some sense, it does solve that. For example, when I have a piece of code that I want to put in projectA and projectB, I have to choose which folder it goes into: either /projectA/mycode or /projectB/mycode. Now, I still have to make this choice with this database system, *but* the cunning part is that in the metadata file for the code I can say "keywords: projectA projectB". That means that in the interface code for the database, which I've yet to describe but have already hinted at with my keyword redirection scheme, you'll be able to find that code under both "projectA" and "projectB" searches.
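The harvesting step might look something like the sketch below, which walks the shadow directory, skips /meta/meta/ (since that holds the scripts rather than metadata), and builds a keyword index. The in-memory dict stands in for the per-keyword database files, whose format isn't decided yet, and the function names are made up for the example.

```python
import os
from collections import defaultdict
from email.parser import Parser


def read_records(meta_root):
    """Yield (site_path, header_text) for each metadata file under meta_root."""
    for dirpath, dirnames, filenames in os.walk(meta_root):
        if dirpath == meta_root:
            # /meta/meta/ is reserved for the scripts that drive /meta/ itself.
            dirnames[:] = [d for d in dirnames if d != "meta"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            relative = os.path.relpath(full, meta_root)
            site_path = "/" + relative.replace(os.sep, "/")
            with open(full, encoding="utf-8") as f:
                yield site_path, f.read()


def build_index(records):
    """Map each keyword to the sorted list of site paths that carry it."""
    index = defaultdict(set)
    for site_path, text in records:
        headers = Parser().parsestr(text, headersonly=True)
        for keyword in (headers.get("keywords") or "").split():
            index[keyword].add(site_path)
    return {keyword: sorted(paths) for keyword, paths in index.items()}
```

With a metadata file for /projectA/mycode that says "keywords: projectA projectB", the resulting index lists /projectA/mycode under both "projectA" and "projectB", even though the file lives in only one folder.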

The interface is possibly the sketchiest part at the moment, but I'm thinking of allowing a combination of keyword searches and other inherent properties such as the real filepath. That way, you'll be able to look up all of the documentation deemed worthy enough to include in the metadata system (which could possibly do with a nice generic name) that's under the /shaks/ directory or something. The syntax that I have in mind for this is that if you want to search for all activities and code which are under the /lalala/ project directory, you'll go to /db/lalala-activities.code. Again, I'm not entirely sure about this, and I may even do global mod_rewriting so that you can do it in /lalala/ itself.
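Since the syntax is still tentative, the following is only one possible reading of it. It assumes the part before the first hyphen is always the directory, that the rest is a dot-separated keyword list, and that "activities and code" means anything carrying either keyword (with a match_all switch for the stricter reading). The index argument is a hypothetical dict mapping each keyword to a list of site paths.

```python
def parse_query(uri, prefix="/db/"):
    """Split /db/<directory>-<kw>.<kw>... into (directory, keywords)."""
    if not uri.startswith(prefix):
        raise ValueError("not a database query URI: %r" % uri)
    # Assumption: the directory name contains no hyphen; keywords may.
    directory, _, dotted = uri[len(prefix):].partition("-")
    keywords = [k for k in dotted.split(".") if k]
    return "/" + directory + "/", keywords


def search(index, uri, match_all=False):
    """Return the sorted site paths under the directory that match the keywords."""
    directory, keywords = parse_query(uri)
    sets = [set(index.get(k, ())) for k in keywords]
    if not sets:
        return []
    found = set.intersection(*sets) if match_all else set.union(*sets)
    return sorted(p for p in found if p.startswith(directory))
```

A query with no directory part, such as a bare keyword, is not handled here, because the scheme doesn't yet say what that should look like.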

That's probably the most distant reach of this application at the moment, and I'm more interested in the immediate things. One of those is generation of the inamidst.com homepage! At the moment it's a bit of a pain to have to maintain it by hand when I could have this richer metadata propelling it. The problem is mainly that I have so many indexes now: for example, I'm describing my services in /svc/ and / and I have a mixture of /svc/* services and those that're housed elsewhere. That gets a bit irritating to keep up with, so I want to make it much easier to generate automatically. I'm thinking that a Makefile in every directory wouldn't be such a bad thing: I type $ make at the command line and then it automatically builds up the indexes based on the properties I write into some template file. It has the advantage that it's all baked and static, which is one of the key requirements of inamidst.com in general.
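The baked, static generation could be as small as the sketch below, which a Makefile rule might invoke. The templates and the function name are invented for illustration; the one detail borrowed from the current homepage is that each list item's class attribute carries its keywords, so that CSS rules can pick the icons.

```python
from html import escape
from string import Template

# Hypothetical templates; a real page would wrap these in a full document.
PAGE = Template("<ul>\n$items\n</ul>\n")
ITEM = Template('<li class="$classes"><a href="$href">$href</a></li>')


def render_index(entries):
    """Render a mapping of site path -> list of keywords as a static HTML list."""
    items = []
    for path in sorted(entries):
        items.append(ITEM.substitute(
            classes=escape(" ".join(sorted(entries[path])), quote=True),
            href=escape(path, quote=True),
        ))
    return PAGE.substitute(items="\n".join(items))
```

Writing the returned string to a file whenever the metadata changes keeps the site entirely static, with nothing computed at request time.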

Another application I have for it is a better sitemap, one including perhaps all of the metadata entries since there aren't too many of them at the moment, and that in time might be able to grow to be some kind of AJAX-backed behemoth allowing people to browse through the site by expanding the different sections that they like, sort of like an outliner. d8uv also came up with the idea of doing Venn diagram indexes, which might bear some further investigation, especially when coupled with SVG.

So that's quite a large bit of state for such a relatively simple invention, but it's nice to get it all documented out in public to remind myself, show my friends, and educate outsiders as to how things are done on inamidst.com. Feedback about the project, of any kind, is most definitely welcomed.

* Posted by sbp on #d8uv.com at 2005-05-14 04:25:03 UTC.