Technical Ramblings

Archive for June, 2011

Suggestions for Better Geo In the (dot)Cloud

Posted in default on June 10th, 2011 at 00:47:03

One of the interesting things that I’ve noticed in exploring various cloud providers is the limited amount of support for strong geo integration. This isn’t particularly surprising: geo on the web is still a niche market, even if it is a niche market I tend to care about a lot. Here are some things that I think might make people more eager to adopt cloud-y services for geo, specifically in the context of DotCloud, where I’ve been investing most of my time recently.

DotCloud has the idea of ‘services’: independent units which work together to form an overall operating picture. The idea behind these services is that they are simple to set up, and perform one particular task well. So rather than having a service which combines PostGIS, GeoDjango, and MapServer, you have three separate services (one for PostGIS, one for GeoDjango, and one for MapServer or other map rendering) and you connect them together. This way, if your application needs to scale at the frontend, you can simply deploy additional GeoDjango services; if you need more map rendering ‘oomph’, same deal. (Deploying additional services for databases won’t magically scale them, but you do have the ability to use the software’s internal scaling across your multiple instances.)

So, again, using DotCloud, there are three types of ‘services’ that I think would be interesting to users who are working in Geo:

PostGIS — Supporting PostGIS on most debian+debian-derivatives platforms is pretty simple; the tools are all easily installable via packages. (pgRouting is another step beyond — something which might actually have some benefit simply because it’s more difficult to get started with, so having it preconfigured could make a big difference in adoption.) Nothing really special needed here other than apt-get install postgis and some wrapper tools to make setting up PostGIS databases easy. There probably isn’t any reason why this couldn’t be included in the default Postgresql service type in DotCloud.
GeoDjango — This one is even easier than PostGIS. GeoDjango already comes with the appropriate batteries included. The primary blocker on this is to get the GEOS, proj, and GDAL libraries installed on the Django service type. I don’t think there is any need for a separate service type for GeoDjango.
Map Rendering — This one is a big one for a lot of people, and I’m not entirely sure the best way to work it within DotCloud. Map Rendering — taking a set of raster or vector data, and making it available as rendered tiles based on requests via WMS or other protocols — is one of the things that is not pursued as often by the community right now, and I think a lot of that is in the difficulty of setup. As data grows large, coping with it all on the client side becomes more difficult; some applications simply never get built because the jump from ‘small’ to ‘big’ is too expensive.
There are three different ‘big’ map rendering setups that I can think of that might be worth trying to support:
- MapServer — MapServer is a tried and true map rendering tool. It primarily exists as a C program with a set of command line tools around it; it is usually run under CGI/FastCGI. Configuration exists as a set of text-based configuration files (mapfiles); rendering uses GDAL/OGR for most data interactions, and GD or AGG (plus more esoteric output formats) for output. MapServer is often combined with TileCache, for caching purposes; TileCache is based on Python.
- GeoServer — GeoServer is a Java program, which runs inside a servlet container like Tomcat. Like MapServer, it supports a variety of input/output formats; configuration is typically via its internal web interface. Caching is built in (via geowebcache). I think GeoServer would probably run as is under the ‘java’ service type that exists on DotCloud, assuming the appropriate PostGIS support exists for the database side.
- OSM data rendering — This one is a bit less solid. OpenStreetMap data rendering has a number of different rendering frontend environments, but the primary one that I think people would tend to set up these days is a stack of mod_tile (Apache-side frontend) talking to tirex (renderer backend) which calls out to/depends on Mapnik, the actual software which does tile rendering. Data comes from a PostGIS database — though in the case of OSM, even that requires some additional finagling, since getting a worldwide OSM dump is… pretty big. (It’s probably safe to set that point aside as a starting point, and concentrate instead on targeting localized OSM rendering deployments — solve the little problems first, scale up later.)

One thing that all of these tools have in common is that they really like having fast access to lots of disk for quickly reading and writing small files. I’m not sure what the right way to do that within the DotCloud setup is (I don’t see an obvious service type which is designed for this), so that might be another component in the overall picture. (Things like the redis service try to solve this problem, I think, but since the tools primarily intend to write to disk as is, adapting them to support other ways of storing tile data persistently would require modifying the upstream libraries.)

I think that there is room to significantly simplify deployment of some components of geographic applications by centralizing installation in cloud-based services; the above sketches out some of the components that it might make sense to consider as a first pass. These components would let someone create a relatively complex client + server side geographic data application; exploring and expanding on these — especially the OSM data rendering component — could make deploying to the cloud easier than deploying locally, with the net effect of increased adoption of cloud-based services… and more geodata in the world to consume, which is something I’m always in favor of. 🙂

Better GDAL in the Cloud (DotCloud, Round 2)

Posted in default on June 7th, 2011 at 06:59:40

My previous attempts to make GDAL work in the cloud were sort of silly; compiling directly on the instance is sort of counter to the design of DotCloud, which lets you scale primarily by assuming that a ‘push’ to the server establishes all of the configuration you’ll need. (sshing into the machine is certainly possible, but it seems it is designed more for debugging than it is for active work.)

So, with a few suggestions from the folks in #dotcloud, I started exploring how to set up my dotcloud service in a bit more of a repeatable way — such that if I theoretically needed to scale, I could do it more easily. Rather than manually installing packages on the host, I moved all of that configuration into a ‘postinstall’ file — fenceposted so it runs only once.

After a bit of experimentation (and reporting a bug in GDAL), I was able to get a repeatable installation going; this means that I no longer have any manual steps in creating and setting up a new service doing OpenAerialMap thumbnailing; in fact, the entire process — right down to setting up redis as a distributed cache — can now be completed automatically.

The postinstall is pretty simple: download, configure, and install curl; then download, configure, and install GDAL. In both cases, use the respective -config tool (curl-config, gdal-config) as a fencepost so we don’t install multiple times.
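The fencepost idea can be sketched roughly as follows. This is a minimal illustration in Python, not the actual postinstall file (which is presumably a shell script); the name install_once and the two command lists are assumptions made for the example, with the build steps borrowed from the manual build described in the earlier DotCloud post.

```python
import shutil
import subprocess

# Build steps assumed from the manual build in the earlier post;
# the real postinstall may differ.
CURL_BUILD = [
    "wget http://curl.haxx.se/download/curl-7.21.6.tar.gz",
    "tar -zxf curl-7.21.6.tar.gz",
    "cd curl-7.21.6 && ./configure --prefix=$HOME --bindir=$HOME/env/bin"
    " && make && make install",
]
GDAL_BUILD = [
    "wget http://download.osgeo.org/gdal/gdal-1.8.0.tar.gz",
    "tar -zxf gdal-1.8.0.tar.gz",
    "cd gdal-1.8.0 && ./configure --prefix=$HOME --bindir=$HOME/env/bin"
    " --with-curl --without-libtool --with-python && make && make install",
]


def install_once(fencepost_tool, build_commands,
                 which=shutil.which, run=subprocess.check_call):
    """Run build_commands only if fencepost_tool is not already on PATH.

    Returns True if the build ran, False if the fencepost was found.
    """
    if which(fencepost_tool) is not None:
        return False  # already installed; skip the build
    for command in build_commands:
        run(command, shell=True)  # raises CalledProcessError on failure
    return True


# Usage (hypothetical): curl goes first, since GDAL is built --with-curl.
#   install_once("curl-config", CURL_BUILD)
#   install_once("gdal-config", GDAL_BUILD)
```

One caveat with this kind of fencepost: if the host already ships a system-wide curl-config or gdal-config, the check passes and the build is skipped, so a stricter version might test for the file under $HOME/env/bin specifically.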

Once this is done, setting up a new deployment with the appropriate services is trivial: that’s outlined in setup.sh — configure a redis service, grab the config information, and deploy a wsgi service using that config and the rest of the code.

In fact, while I was writing this post, I followed that very set of instructions, running ‘./setup.sh openaerialmap’ — before I finished the post, I had a running www.openaerialmap.dotcloud.com instance, which I could then commit in place of the silly-named ‘pepperoni’ service that oam.osgeo.org was using before.

The big thing for me about these one-off deployments is that by making better tools available for iterative deployment, they allow me to be more disciplined in how I configure services. When I recently attended a workshop on container orchestration and automated deployments, I learned about several available options that significantly streamline this exact process. Normally, installing GDAL or curl is a one-time event: without reformatting my system (and inevitably losing track of customizations), it’s challenging to test setups reliably. With these improved tools, deploying software becomes seconds of work: it’s all scripted, repeatable, and far more manageable. No more mistakes in one-off apt-get install commands that I’ll never remember if my system goes down. Now, I’ve got the ability to script deployments of software.

This is of course possible with EC2 directly as well — building custom AMIs, redeploying them, etc. For me, though, DotCloud took the pain out of having to do such a thing. They did the hard part for me — and I got to do the more interesting and fun parts.

DotCloud: GDAL in Python in the Cloud

Posted in default on June 5th, 2011 at 15:14:40

One of the key components of the OpenAerialMap design that I have been working on since around FOSS4G of last year is to use distributed services rather than a single service — the goal being to identify a way to avoid a single point of failure, and instead allow the data and infrastructure to be shared among many different people and organizations. This is why OpenAerialMap does not provide hosting for imagery directly — instead, it expects that you will upload images to your own dedicated servers and provide the information about them to the OpenAerialMap repository.

One of the things that many people want from OpenAerialMap is (naturally) a way to see the images in the repository more easily. Since the images themselves are quite large — in some cases, hundreds of megabytes — it’s not as easy as just browsing a directory of files.

Since OAM does not host the imagery itself, there is actually nothing about this type of service that would be better served by hosting it centrally; the meta-information about the images is small, and easily fetched by HTTP, and the images themselves are already remote from the catalog.

As a result, when I saw people were interested in trying to get OpenAerialMap image thumbnails, I decided that I would write it as a separately hosted service. Thankfully, Mike Migurski had already made a branch with the ‘interesting’ parts of the problem solved: Python code using the GDAL library to fetch an overviewed version of the image and return it as a reasonably sized thumbnail. With that in mind, I thought that I would take this as an opportunity to explore setting this code up in the ‘cloud’.
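To give a feel for what such thumbnailing involves, here is a rough sketch. It is not Mike Migurski’s code; the function names are made up for illustration, and it assumes the GDAL Python bindings (with numpy) and a GDAL build with curl support, so that remote files can be opened through the /vsicurl/ prefix. Asking GDAL for a small output buffer lets it read from the image’s overviews, when they exist, instead of the full-resolution data.

```python
def thumbnail_size(width, height, max_dim=256):
    """Scale (width, height) to fit within max_dim, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Never upscale: images already smaller than max_dim keep their size.
    scale = min(1.0, float(max_dim) / max(width, height))
    return (max(1, int(round(width * scale))),
            max(1, int(round(height * scale))))


def read_thumbnail(url, max_dim=256):
    """Read a decimated copy of every band of a remote image.

    Returns a list of 2D numpy arrays, one per band.
    """
    # Imported lazily so the size helper works without GDAL installed.
    from osgeo import gdal

    dataset = gdal.Open("/vsicurl/" + url)
    if dataset is None:
        raise IOError("GDAL could not open %s" % url)

    full_w, full_h = dataset.RasterXSize, dataset.RasterYSize
    out_w, out_h = thumbnail_size(full_w, full_h, max_dim)

    bands = []
    for index in range(1, dataset.RasterCount + 1):
        band = dataset.GetRasterBand(index)
        # Request the whole image window, resampled into a small buffer.
        bands.append(band.ReadAsArray(0, 0, full_w, full_h,
                                      buf_xsize=out_w, buf_ysize=out_h))
    return bands
```

The resulting arrays could then be handed to an imaging library such as PIL to produce the actual thumbnail file.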

A requirement for such a cloud deployment is that it needs to have support for GDAL: this severely limits the options available. In general, most of the ‘cloud hosting’ that is out there is to host your applications, using pre-installed libraries; very few of the sites out there would let you compile and run code in them, at least that I can find. However, I remembered that during my vacation week, I happened to sign up for an account with ‘DotCloud’, a service which is designed to help you “Assemble your stack from pre-configured and heavily tested components.” — instead of tying you to some specific setup, you’re able to build your own.

At first, I wasn’t convinced that I could do what I needed, but after reading about another migration, I realized that I could get access to the instance I was running on directly via SSH, and investigated more deeply.

I followed the instructions to create a Python deployment; this got me a simple instance running a webserver under uwsgi behind nginx. I was able to quickly SSH in, and with a bit of poking about, was able to compile and run curl and GDAL via the SSH command line into my local home:
$ wget http://curl.haxx.se/download/curl-7.21.6.tar.gz; tar -zvxf curl-7.21.6.tar.gz
$ wget http://download.osgeo.org/gdal/gdal-1.8.0.tar.gz; tar -zvxf gdal-1.8.0.tar.gz
$ cd curl-7.21.6
$ ./configure --prefix=$HOME --bindir=$HOME/env/bin
$ make
$ make install
$ cd ../gdal-1.8.0
$ ./configure --prefix=$HOME --bindir=$HOME/env/bin --with-curl --without-libtool --with-python
$ make
$ make install
$ cd ..
$ LD_LIBRARY_PATH=lib/ python

So this got me a running python from which I could import the GDAL Python libraries; a great first step. Then I had to figure out how to make the webserver know about that LD_LIBRARY_PATH — a much more difficult step.

After some mucking about and poking in /etc/, I realized that the uwsgi process was actually being run under ‘supervisord’; basically, whenever I deployed, my Python process running under supervisord was restarted, and was running my Python code inside it. So then all I had to do was figure out how to tell the supervisor on the DotCloud instance to run with my environment variables.

I found the ‘/etc/supervisor/conf.d/uwsgi.conf’ which was actually running my uwsgi process; after some research, I found that DotCloud will let you append to the supervisor configuration by simply creating a ‘supervisord.conf’ in your application install directory; it hooks this config in at the end of all your other configs. So, I copied the contents of uwsgi.conf into my own supervisord.conf, dropped it in my app directory, and added: environment=LD_LIBRARY_PATH="/home/dotcloud/lib" to the end of it.

Pushed my app to the server, and now, instead of “libgdal.so.1: cannot open shared object file: No such file or directory”, I was able to successfully import GDAL; I then easy_installed PIL, and with a little bit more work, I got the image thumbnailing code working on my DotCloud instance.
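A quick way to confirm that the environment change took effect is a tiny WSGI app that reports whether GDAL imports. The sketch below is an illustration rather than the code I deployed; it assumes only that the server looks for a callable named application, as uwsgi setups commonly do. Note that the dynamic linker reads LD_LIBRARY_PATH when the process starts, which is why the variable has to be set in the supervisor config rather than from inside the Python code.

```python
import os


def application(environ, start_response):
    """Report whether the GDAL Python bindings can be imported."""
    try:
        from osgeo import gdal
        status = "200 OK"
        body = "GDAL import OK, version %s" % gdal.VersionInfo("RELEASE_NAME")
    except ImportError as exc:
        # A missing libgdal shared library surfaces as an ImportError.
        status = "500 Internal Server Error"
        body = "GDAL import failed: %s (LD_LIBRARY_PATH=%r)" % (
            exc, os.environ.get("LD_LIBRARY_PATH"))

    data = body.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(data))),
    ])
    return [data]
```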

I’ve still got some more to do to clean this up — I’m not sure I’m actually using DotCloud ‘right’ in this sense — but I was able to get GDAL 1.8.0 in a Python server running on a DotCloud instance and use it to run a service, without having to install or screw with the running GDAL libraries on any of my production servers, so I’ll consider that a win.

Some things OpenLayers Shouldn’t Do

Posted in default on June 1st, 2011 at 04:16:17

One of the things that sets me apart from other OpenLayers users and contributors at times is the strong belief that it is not the responsibility of OpenLayers to be everything for everyone. I’ve often been faced with a situation where a user will say “Well, OpenLayers doesn’t have this functionality: doesn’t that mean it should?”

Just this morning, Eric made a similar argument: in response to my saying that OpenLayers is not for everyone, he responded, “if OpenLayers isn’t for everyone we gotta do something to change the situation, I think.”

I completely disagree with this statement, and I think that anyone who thought about it would also disagree. There are some things OpenLayers is not meant to do, which other mapping tools are in a better position to do.

OpenLayers is not a 3D spinning globe — we will probably never have support for rendering 3D globes inside of OpenLayers itself.
OpenLayers is not the ideal tool for browsing 3D panoramas: we are likely not the best tool for doing something like Google Streetview.
OpenLayers is not a full GIS application: We will likely never ship as part of OpenLayers itself a full UI for doing GIS editing against some server.
OpenLayers should not have as one of its primary goals creating user interface elements for complex functionality.

These choices might seem obvious, but I bring them up specifically because they are things that I have seen people state OpenLayers should do, and I think it’s important for any project to identify goals and seek to solve those goals.

There are also other things which are less obvious that I think OpenLayers is unlikely to do, which other mapping APIs have chosen to do.

OpenLayers should not abandon support for Internet Explorer: One of the moves I’ve seen recently is to create applications which abandon support for Internet Explorer, since supporting IE takes more effort than supporting other platforms. Although that is reasonable for many platforms, I think it would not be the right path for OpenLayers. Building a library which supports more platforms will have side effects: OpenLayers has IE support baked into some of its core functionality, like event handling, so it will always have some minimal impact on things like download size, and there are some things we may choose not to do because doing so would make supporting IE more difficult. Internet Explorer may be close to becoming a non-majority browser, but it’s still going to be very important to OpenLayers users for years to come.
OpenLayers should not remove support for commercial layers, even if supporting these types of layers requires a more complex architecture: Supporting commercial layers is one of the key components of OpenLayers, and I think that it is important to continue to support using commercial layers, even though this does, at times, make the OpenLayers code more complex. This problem is thankfully getting somewhat better with newer APIs from Bing and Google for direct tile access, but there exist other APIs out there that aren’t as advanced, and continuing to support the types of APIs we need to make that happen is something I feel is central to OpenLayers.
OpenLayers should not remove support for direct WMS access. One of the things that some other mapping APIs have chosen to do is to limit their target audience, and as a result, they do not worry about adding WMS access or other similar functionality for accessing data through means other than laying out X/Y/Z tiles on a map. Supporting WMS and other OGC web standards is something that takes some non-trivial portion of OpenLayers developer time. If we were to abandon anything other than support for OSM or XYZ style tiled layers and vector layers, we could certainly concentrate on a smaller API — but I don’t think that is something OpenLayers should do, even if it would mean a better overall API.
OpenLayers should not remove support for fetching via various protocols, parsing via various formats, and choosing what to do with that data via various strategies: I have seen an argument that the more recent work with formats, strategies, and protocols is confusing to users, and should simply be removed in favor of letting that be handled at the application level. Most of the APIs OpenLayers is being compared to do not have this kind of support. This is another thing I strongly disagree with.
OpenLayers should not remove support for dealing with data in projections other than Spherical Mercator. Many other libraries simplify user experience by picking and sticking with a single projection; I think that is impractical for OpenLayers.

Now, I think it is completely true that it would be possible to rip out 80% of the functionality of OpenLayers, create a smaller, easier to maintain library, and hit a use case that could solve interesting problems. I will agree that OpenLayers is a large codebase: after all, we have more than 200 *thousand* lines of code, compared to just under 20,000 in Polymaps, 60,000 in Mapstraction, and 9,000 in Leaflet. This is the ‘big daddy’ of mapping frameworks (no one is denying that), and it has a lot of technical debt built up, the same way any Open Source project does over 6 years of development without a rewrite.

Some of that code is cruft, and should be removed. No one is denying that. In fact, there’s a fair amount of already-deprecated code; controls that have been deprecated or non-default for more than 4 years are still part of the main OpenLayers trunk release. However, I’d put the amount that is cruft at closer to 20% than 80%: the much bigger portion of the OpenLayers code is the broad support for the many different ways of interacting with remote data. In our formats alone — things which are generally designed to do only one thing, read and write from an external string to an OpenLayers resource — we have 69 files with 20,000 lines of code. That’s right, our format parsing — for everything from OSM to GeoJSON, GeoRSS to ArcXML to CQL to KML — is larger than several other entire mapping libraries.

Eric followed up by pointing out that of course he didn’t mean to suggest OpenLayers should do everything, but that OpenLayers should have an easier API to hit simple use cases: something which is better documented, easier to use, and less confusing to users. I find it a bit amusing that this is an argument someone feels they need to make to *me*, since I feel like I’ve been pushing that argument for years now within the project. 🙂 However, since it may not have been clear, let me clarify:

The OpenLayers API is difficult to get started with for many simple problems. It can be hard to use, confusing to start, and difficult to understand for solving simple problems. It pushes details that very few users care about in the face of users who don’t know what to do with them. It is crucially important to supporting the future use of the project to make the easy things easy, while maintaining the ability to make the hard things possible.

Some areas where this is most apparent: difficulty in working with projected coordinates for the purposes of interacting with OSM or other spherical mercator layers. Difficulty in setting up a simple request to download a file from a server and render it — even *I* can’t configure that from memory, and I’ve done it dozens of times. These are real problems that users run into all the time and simplifying them would be a huge step forward in usability for solving the simple problems. (Note that I’m saying this in a blog post rather than in code, which also shows that I don’t think this is a trivial problem: the reason these things are not done is in part because coming up with a solid way to do these things in a helpful way is not trivial.)
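For readers who haven’t run into the projected-coordinates issue: Spherical Mercator layers work in meters on a sphere, not in longitude and latitude, so every coordinate has to be converted before it lines up with the tiles. The arithmetic is the standard Spherical Mercator formula; the sketch below shows it in Python purely as an illustration (the function names are made up, and this is not OpenLayers code).

```python
import math

# Radius of the sphere used by Spherical Mercator, in meters.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_mercator(lon, lat):
    """Convert degrees of longitude/latitude to Spherical Mercator meters.

    Valid for latitudes strictly between -90 and 90 degrees.
    """
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y


def mercator_to_lonlat(x, y):
    """Convert Spherical Mercator meters back to degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(
        2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat
```

A longitude of 180 degrees comes out at roughly 20037508.34 meters, which is where the familiar map extent numbers for these layers come from. Having to know this at all is exactly the kind of detail that gets pushed in the face of users who just want to put a point on a map.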

I just want to clarify that there will always be some things OpenLayers shouldn’t do. Some of the decisions we’ve made have increased our overall technical debt: maintaining support for IE, even for relatively advanced features like rotated client-side graphics in VML, was a cost that we could have saved if we chose a narrower supported platform range. However, I think that some of these decisions are important: a key component of OpenLayers is its broad support for loading data of any kind, in a wide variety of browsers. That was the core idea when OpenLayers was started, and I think that it is an extremely important part of our legacy to maintain.

Some of the competing mapping frameworks target a narrower use case. As a result, they can concentrate their developer time on better examples and documentation for a subset of functionality that OpenLayers has. This is great for the users of those tools, and will likely make those tools more attractive, short term, to some users. Competition is good: it encourages innovation, and pushes the limits, especially when the competition is open source and can collaborate across teams. If another tool is better for users than OpenLayers, it is in their best interest to use it. In the end, I hope that OpenLayers can continue to expand the set of users for which it is the best tool, and there is a lot of work that can be done there — and I look forward to continuing to see competition and collaboration between the various mapping projects out there to maximize user success.
