Archive for October, 2010

MBTiles — a bit of a rant

Posted in default on October 12th, 2010 at 00:05:27

Earlier this weekend, I was pointed to a post about MBTiles, a new portable map tile distribution format. After a bit of hemming and hawing, I realized what it actually was, and realized that there wasn’t really much to hem and haw about.

This ‘new’ format is really nothing new, nor is it actually a ‘file format’ in the strictest sense of the word. What it is instead is:

  1. An OS X binary that takes a bunch of files in directories on disk and writes them into a sqlite database.

Some of you may remember the old commercials for the iMac: “Step 1: Plug in. Step 2: Get connected. There is no step 3.” In this case, there isn’t even a step two. There’s nothing else here: not a specification document, not a reader of any kind, not even a description of what this magical ‘file format’ is.

Okay, so it’s not much of anything, but it’s not really a *bad* idea, it’s just that… it’s not the way I’d go about it.

First of all, if you actually want people to use something, you kind of have to have a reader and a writer. A promise of an iPad app improvement somewhere down the line, along with some vague handwavey “saved 90% of your disk space!” statistics, might be good for a flash in the pan in the Twitter-world echo chamber, but if you want success, you gotta do a bit better than that.

So, to outline what the MBTiles ‘file format’ actually seems to be: It’s a sqlite database with two simple tables.

CREATE TABLE metadata (name text, value text);
CREATE TABLE tiles (zoom_level integer, tile_column integer, tile_row integer, tile_data blob);

The metadata is not required to contain anything (so far as I can tell; possibly some reader tools might require it).

Now, a smart reader might notice that there is nothing very complex about this: any programming language that can interact with sqlite can create or access the MBTiles data, which is a very good thing. (It’s possible that this simple fact hasn’t been published because the Development Seed/MapBox folks plan to extend the format and don’t want other people using it yet; dunno.) However, excepting that, as it is, there’s no reason (that I can see) that the code to create a cache should be in C!
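To make that concrete, here is a minimal sketch in Python using the standard sqlite3 module and the two-table schema shown above. The helper names (create_tileset, put_tile, get_tile) are made up for illustration and are not part of any MBTiles tool; the tile bytes are a placeholder, and nothing here assumes how tile_row is meant to be numbered, since there is no spec to check against.

```python
import sqlite3

# The two tables exactly as listed above.
SCHEMA = """
CREATE TABLE metadata (name text, value text);
CREATE TABLE tiles (zoom_level integer, tile_column integer, tile_row integer, tile_data blob);
"""


def create_tileset(path):
    """Create an empty tile database at `path` (hypothetical helper)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def put_tile(conn, zoom, column, row, data):
    """Store one tile's raw image bytes (hypothetical helper)."""
    conn.execute(
        "INSERT INTO tiles (zoom_level, tile_column, tile_row, tile_data) "
        "VALUES (?, ?, ?, ?)",
        (zoom, column, row, sqlite3.Binary(data)),
    )


def get_tile(conn, zoom, column, row):
    """Return the tile's bytes, or None if there is no such tile."""
    found = conn.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (zoom, column, row),
    ).fetchone()
    if found is None or found[0] is None:
        return None
    return bytes(found[0])


if __name__ == "__main__":
    conn = create_tileset(":memory:")  # in-memory database, nothing written to disk
    put_tile(conn, 0, 0, 0, b"placeholder tile bytes")
    conn.commit()
    print(get_tile(conn, 0, 0, 0))
```

That is the whole of it: a reader is one SELECT and a writer is one INSERT, which is rather the point.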

For prototyping, or if your goal is to create and develop a ‘standards’ thing, you really want to be working in a language which is more widely understood and easier to prototype in. I realize that this is a judgement call on my part, but for things where you encourage people to check it out and use it, you should be working in a language with a wider audience. For example, using the mb_tiles_importer on my tiles produced by TileCache gives me a database that has 4 rows… even though my directory only has two tiles in it, and 2 of the 4 rows are entirely empty. If the code were in Python, I might take a look and offer some feedback, but with it being an OS X-only binary, or even a thinly documented C script, there’s almost nothing I can do to figure out what’s going on or help.
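For what it's worth, spotting that kind of problem from the outside only takes a couple of queries. This is a rough sketch in Python, assuming that an "empty" row means tile_data is NULL or zero-length; tile_row_report is a hypothetical helper name, and the demo data is invented to mimic the 4-rows-for-2-tiles situation rather than being the actual database.

```python
import sqlite3


def tile_row_report(conn):
    """Count all rows in `tiles`, and those whose tile_data is NULL or empty."""
    total = conn.execute("SELECT COUNT(*) FROM tiles").fetchone()[0]
    empty = conn.execute(
        "SELECT COUNT(*) FROM tiles "
        "WHERE tile_data IS NULL OR length(tile_data) = 0"
    ).fetchone()[0]
    return {"total": total, "empty": empty}


if __name__ == "__main__":
    # Invented stand-in data: two real tiles plus two empty rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tiles (zoom_level integer, tile_column integer, "
        "tile_row integer, tile_data blob)"
    )
    conn.executemany(
        "INSERT INTO tiles VALUES (?, ?, ?, ?)",
        [(1, 0, 0, b"tile"), (1, 0, 1, b"tile"), (1, 1, 0, None), (1, 1, 1, b"")],
    )
    print(tile_row_report(conn))
```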

Add to this the fact that although this format has a writer, it has no open source reader of any kind that I can find. There’s some chat about various MapBox-related software reading it, but no separation of this from MapBox, and with no description of the format, there isn’t even the excuse that it’s designed for people to write their own clients…

That said, I’ve gone ahead and written support into TileCache for the ‘format’ such as it is; I’m not convinced it’s the ideal thing to do, but the core concept of delivering a single file in the form of an SQLite database for tile data is a pretty solid goal.

Overall, the idea is reasonably sound: delivering tiles in a single file is important, and sqlite is a nice, lightweight format for that, one that’s accessible from most C-based languages. Writing a quick cache format to read these things in TileCache was easy enough because, as I said, there really isn’t much there. I didn’t implement write support, because doing so seemed like it could be a waste if the MapBox folks want to ‘own’ this format (Hello, GeoPDF, how are you today) and are still developing it, and the only way that I was able to even do what I did was using a tool that I had to grab from a Github link I got over Twitter (and which doesn’t appear to work right).

B+ for the idea. It’s a bit iffy on the implementation, but the core goal is sound. However, the way that it’s approached is a somewhat typical approach that I see lately: Publish first, actually create the thing you’re publishing about later. That type of attitude is the kind of thing that drives me — as a creator who puts a lot of time and thought into community interaction first and foremost — absolutely bonkers.

Clean it up, make it a spec, and describe some of the benefit and utility in a way that’s not tied directly to MapBox, and I can see this actually becoming a pretty regular thing for distributing files around. I can definitely see the value and benefit — with some metrics at larger scale — of doing this kind of thing for distributing larger tilesets. I just don’t want to fall into the Admiral Ackbar problem: “It’s a trap!”

VSI Curl Support

Posted in GDAL/OGR, Locality and Space, Software on October 4th, 2010 at 06:14:47

In a conversation at FOSS4G, Schuyler and I sat down with Frank Warmerdam to chat about the possibility of extending GDAL to be able to work more cleanly when talking to files over HTTP. After some brief consideration, he agreed to do some initial work on getting a libcurl-based VSIL wrapper built.

VSIL is an API inside of GDAL that essentially allows you to treat files accessed through different streaming protocols as if they were normal files; it is used to support accessing content inside zipped containers, and other similar data access patterns.

GDAL’s blocking strategy (that is, the knowledge of how to read sub-blocks of files in order to obtain the information it needs, rather than needing to read a larger part of the file) is designed to limit the amount of disk I/O that’s needed for rendering large rasters. A properly set up raster can significantly limit the amount of data that needs to be read, which helps improve tile rendering time. This type of access would also allow you to fetch metadata about remote images without the need to access an entire (possibly large) image.
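As a rough illustration of why blocking helps, here is a small Python sketch of the arithmetic involved: given a pixel window, work out which blocks overlap it. This is not GDAL's implementation; the function name, the 256 by 256 block size, and the example window are all assumptions made up for the example, and real block layouts depend on how the file was written.

```python
def blocks_for_window(xoff, yoff, xsize, ysize, block_w, block_h):
    """Return (block_column, block_row) pairs overlapping a pixel window.

    The window starts at pixel (xoff, yoff) and is xsize by ysize pixels.
    """
    first_col = xoff // block_w
    last_col = (xoff + xsize - 1) // block_w   # last pixel column, inclusive
    first_row = yoff // block_h
    last_row = (yoff + ysize - 1) // block_h   # last pixel row, inclusive
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


if __name__ == "__main__":
    # Hypothetical: a 300x300 pixel window at (1000, 1000), 256x256 blocks.
    needed = blocks_for_window(1000, 1000, 300, 300, 256, 256)
    print(len(needed), "blocks needed:", needed)
```

With those made-up numbers the window touches only a 3 by 3 group of blocks, so only those blocks need to be read rather than the whole raster.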

As a result, we thought it might be possible to use this mechanism for HTTP-based access to images, for fetching metadata and other similar information over the web. Frank thought it was a reasonable idea, though he was concerned about performance. Upon returning from FOSS4G, Frank mentioned in #gdal that he was planning on writing such a thing, and Even popped up mentioning ‘Oh, right, I already wrote that, I just had it sitting around.’

When Schuyler dropped by yesterday, he mentioned that he hadn’t heard anything from Frank on the topic, but I knew that I’d seen something go by in SVN, and said so. We looked it up and found that the support had been checked into trunk, and we both sat down and built a version of GDAL locally with curl support — and were happy to find out that the /vsicurl/ driver works great!

By using the Range: header to do partial downloads, and parsing some directory-listing-style pages for READDIR support to find out what files are available, the libcurl VSIL support lets me easily get the metadata about a 1.2GB TIF file with only 64kb of data transferred; with a file that has overviews built properly, I can pull a 200 by 200 overview of the same file while using only 800kb of data transfer.
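For anyone who hasn't played with partial downloads, the HTTP side is simple enough to sketch in Python with the standard library. This is only an illustration of the Range mechanism, not what GDAL does internally (that code is C on top of libcurl); the helper names and the example.com URL are hypothetical.

```python
import re
import urllib.request


def build_range_request(url, start, length):
    """Build a request for `length` bytes beginning at byte offset `start`."""
    end = start + length - 1  # the end of an HTTP byte range is inclusive
    return urllib.request.Request(
        url, headers={"Range": "bytes=%d-%d" % (start, end)}
    )


def parse_content_range(value):
    """Parse a 'Content-Range: bytes first-last/total' header value.

    Returns (first, last, total); total is None if the server sent '*'.
    """
    match = re.match(r"^bytes (\d+)-(\d+)/(\d+|\*)$", value.strip())
    if match is None:
        raise ValueError("unrecognised Content-Range: %r" % value)
    first, last, total = match.groups()
    return int(first), int(last), (None if total == "*" else int(total))


if __name__ == "__main__":
    # Hypothetical URL; no network traffic happens until the request is opened.
    req = build_range_request("http://example.com/big.tif", 0, 65536)
    print(req.get_header("Range"))
    print(parse_content_range("bytes 0-65535/1200000000"))
```

A server that honours the range answers with a 206 Partial Content response, and the Content-Range header tells you the full size of the file without downloading it.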

People sometimes talk about “RESTful” services on the web, and I’ll admit that there’s a lot there that I don’t really understand. I’ll admit that the tiff format is not designed to have HTTP ‘links’ to each pixel, but I think the fact that by fetching a small set of header information, GDAL is then able to find out where the metadata is, and request only that data, saving (in this case) more than a gigabyte of network bandwidth… that’s pretty frickin’ cool.
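To give a feel for how little header information is needed to get started, here is a Python sketch that parses the 8-byte header of a classic TIFF file: a byte-order mark, the magic number 42, and the offset of the first image file directory (IFD), which is where the tags describing the image live. The function name is made up, and this is a simplified illustration rather than GDAL's code; BigTIFF files use a different header and are rejected here.

```python
import struct


def parse_tiff_header(header):
    """Return (endianness, first_ifd_offset) from the first 8 bytes of a TIFF.

    endianness is '<' for little-endian ('II') or '>' for big-endian ('MM').
    """
    if len(header) < 8:
        raise ValueError("need at least 8 bytes of header")
    order = header[:2]
    if order == b"II":
        endian = "<"
    elif order == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF: bad byte-order mark %r" % order)
    # A 2-byte magic number followed by a 4-byte offset to the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number was %d)" % magic)
    return endian, ifd_offset


if __name__ == "__main__":
    # A hand-built little-endian header whose first IFD sits at byte 8.
    print(parse_tiff_header(b"II*\x00\x08\x00\x00\x00"))
```

Once a reader has that offset, it can ask for just the bytes of the directory itself, and from there only the pieces of image data it actually needs.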

Many thanks to EvenR for his initial work on this, and to Frank for helping get it checked into GDAL.

I’ll leave with the following demonstration — showing GDAL’s ability to grab an overview of a 22000px, 1.2GB tiff file in only 12 seconds over the internet:

$ time ./apps/gdal_translate -outsize 200 200  /vsicurl/ 200.tif
Input file size is 22586, 10000
0...10...20...30...40...50...60...70...80...90...100 - done.

real	0m11.992s
user	0m0.052s
sys	0m0.128s

Oh, and what does `time` say if you run it on localhost? From the HaitiCrisisMap server:

real	0m0.671s
user	0m0.260s
sys	0m0.048s


Of course, none of this counts as a real performance test, but here is a comparison of the performance for a single simple operation:

$ time ./apps/gdal_translate -outsize 2000 2000 \
     /vsicurl/ 2000.tif
Input file size is 22586, 10000
0...10...20...30...40...50...60...70...80...90...100 - done.

real	0m1.851s
user	0m0.556s
sys	0m0.272s

$ time ./apps/gdal_translate -outsize 2000 2000 \
    /geo/haiti/data/processed/google/21/ov/22000px.tif 2000.tif
Input file size is 22586, 10000
0...10...20...30...40...50...60...70...80...90...100 - done.

real	0m1.452s
user	0m0.508s
sys	0m0.124s

That’s right, in this particular case, the difference between doing it via HTTP and doing it via the local filesystem is only .4s — less than 30% overhead, which is (in my personal opinion) pretty nice.
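For the record, the arithmetic behind that figure, using the two `real` timings above, looks like this as a tiny Python sketch (the function name is just for illustration):

```python
def relative_overhead(remote_seconds, local_seconds):
    """Extra time taken by the remote run, as a fraction of the local run."""
    return (remote_seconds - local_seconds) / local_seconds


if __name__ == "__main__":
    remote, local = 1.851, 1.452  # the 'real' times from the two runs above
    print("difference: %.3fs" % (remote - local))
    print("overhead: %.1f%%" % (100 * relative_overhead(remote, local)))
```

That works out to a difference of about 0.4s, or roughly 27.5% on top of the local time.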

Sometimes, I love technology.