Populating a MongoDB with POIs

I have been looking at alternatives to relational databases for a while now, and this week I came across MongoDB. MongoDB is a document-oriented database that tries to bridge the gap between key/value stores and relational databases. In this post I'm going to install MongoDB and populate it with 500,000 POIs from Tele Atlas.

To install MongoDB I only needed to download the Windows binaries, and that's it. Then I followed the guidelines in the Quickstart to start the database. Because the default data path for MongoDB on Windows is c:\data\db, I specified a custom data path.

> mongod.exe --dbpath "D:\mongodb\data" run

Once I got MongoDB up and running I downloaded and installed the Python bindings, named PyMongo. This also went very smoothly. The MongoDB website has a great introductory tutorial for PyMongo.
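
If you have setuptools available, the installation is a one-liner (on more recent setups, pip install pymongo does the same):

> easy_install pymongo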

To populate MongoDB I wrote a Python program that loops over the POIs in a layer and creates a dictionary with each POI's properties and coordinates. Each POI dictionary is then inserted into the pois collection of the poidb database.
First I connect to my running MongoDB and create a reference to a new database called poidb. Then I create a new collection called pois. Both are created lazily, on the first insert.

import arcgisscripting
from pymongo.connection import Connection

# connect to the local mongod on the default port
mongo_conn = Connection('localhost', 27017)
poidb = mongo_conn.poidb  # created lazily on the first insert
pois = poidb.pois
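
As a quick sanity check (my addition, not part of the original script) you can round-trip a test document; insert returns the generated _id:

test_id = pois.insert({'x': 4.35, 'y': 50.85, 'NAME': u'test'})  # insert returns the _id
print pois.find_one({'_id': test_id})
pois.remove({'_id': test_id})  # clean up the test document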

Then comes some code that uses ESRI functions to create a list of the fields in the POI layer and a feature cursor over the POIs. I removed the SHAPE field from this list because the geometry is treated separately.

# create the ESRI geoprocessing object
gp = arcgisscripting.create()
poi_path = r'D:\ta.gdb\eureur_____pi'

# iter(fields.next, None) turns the ESRI enumerator into a Python iterator
fields = gp.listfields(poi_path)
names = [f.name.upper() for f in iter(fields.next, None)]
names.remove('SHAPE')  # the geometry is handled separately
fc = gp.searchcursor(poi_path)

In the next part we loop over the POIs in the cursor. For each POI we extract the x and y coordinates and save them in the poi dictionary. For all the fields we fetch the value and put the non-empty values in the dictionary. Strings are stored as Unicode. Finally we insert the poi dictionary into the pois collection of the poidb. The script stops when more than 500,000 POIs have been inserted (500,002 to be exact).

for index, feat in enumerate(iter(fc.next, None)):
    poi = {}
    point = feat.shape.getpart(0)  # first part of the point geometry
    poi['x'] = point.x
    poi['y'] = point.y
    for field in names:
        value = feat.getvalue(field)
        if value and len(str(value).strip()) > 0:  # skip empty values
            if isinstance(value, str):
                value = unicode(value, 'latin-1')  # store strings as unicode
            poi[field] = value

    pois.insert(poi)
    if index > 500000:  # index starts at 0, so 500,002 documents are inserted
        break
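
A possible speed-up, which I did not use in the original script: older PyMongo's insert also accepts a list of documents, so the loop can batch the inserts. In this sketch, build_poi is a hypothetical helper wrapping the dictionary-building code from the loop above.

batch = []
for index, feat in enumerate(iter(fc.next, None)):
    batch.append(build_poi(feat))  # build_poi: hypothetical wrapper around the loop body above
    if len(batch) == 1000:
        pois.insert(batch)  # insert accepts a list of documents
        batch = []
    if index > 500000:
        break
if batch:
    pois.insert(batch)  # flush the remaining documents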

To count the POIs you only need to call the count method of the pois collection object.

print pois.count()

Before querying this data I'm going to create ascending indexes on the x and y fields. It would be better if spatial indexes were supported, but for reasonably sized datasets two single-field indexes are sufficient for bounding box queries. I could also try the RTree package, but that will be for another time. At the end the information about the indexes is printed.

from pymongo import ASCENDING
pois.ensure_index('x', ASCENDING)
pois.ensure_index('y', ASCENDING)

print pois.index_information()
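
An alternative worth experimenting with (a suggestion of mine, not benchmarked here) is a single compound index covering both fields; ensure_index also accepts a list of (field, direction) pairs:

pois.ensure_index([('x', ASCENDING), ('y', ASCENDING)])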

Query parameters are passed to MongoDB in a nested dictionary structure, so defining a bounding box query is a straightforward task. The $gt operator means greater than and $lt means less than. In this query I search for all the POIs with a longitude greater than 4.5 and less than 5.0, and a latitude greater than 50.5 and less than 51.0. Then I print the result count, the number of milliseconds the query took and the first resulting document.

query_pois = pois.find({"x":{"$gt":4.5, "$lt":5.0}, "y":{"$gt":50.5, "$lt":51.0}})

print query_pois.count()
print query_pois.explain()["millis"]
print query_pois.next()

The result looks like this:

4585
265
SON([(u'HSNUM', u'173'), (u'MUNCD', u'23062'), (u'MUNID', 10560032003493.0), (u'x', 4.5000037996786091), (u'NAME', u'Shell Overijse Frans Verbeekstraat'), (u'OBJECTID', 134471), (u'TELNUM', u'+(32)-(2)-6573607'), (u'STNAME', u'Frans Verbeekstraat'), (u'FEATTYP', 7311), (u'ADDRPID', 10560000020526.0), (u'RELPOS', 75), (u'POSTCODE', u'3090'), (u'STNAMELC', u'DUT'), (u'ARNAMELC', u'DUT'), (u'y', 50.769815399855986), (u'IMPORT', 2), (u'MUNNAME', u'Overijse'), (u'_id', ObjectId('YB\xd6l\x99\xe0$U\x8f\t>\xc3')), (u'CLTRPELID', 10560001815786.0), (u'ID', 10560300013584.0), (u'BUANAME', u'Hoeilaart')])

So it took 265 milliseconds to find 4585 POIs in a collection of 500,002 POIs with two range query parameters. If you need to optimize a query, you can take a look at the MongoDB site for some tips and tricks.
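
One such trick is forcing the query planner to use a particular index with hint (a sketch of mine, not from the original post; ASCENDING is imported above):

hinted = pois.find({"x": {"$gt": 4.5, "$lt": 5.0},
                    "y": {"$gt": 50.5, "$lt": 51.0}}).hint([("x", ASCENDING)])
print hinted.explain()["millis"]  # compare timings with and without the hint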

We have come to the end of my first exploration of MongoDB. What I especially liked is the flexibility you get from this kind of database, and the ease of installation and use. The downside for geographic applications is that at the moment there is no built-in support for geometries. I hope you enjoyed this post and I welcome any comments and suggestions.

Related Posts
Spatial indexing a MongoDB
PostGIS Loading and querying data
Installing Tokyo Cabinet and Ruby

4 comments:

dwight said...

Nice.

mario ruggier said...

Wonderful, thank you! Possibly a small typo, initially slightly confusing -- the first line of your 3rd python code listing should probably be:

for index, feat in enumerate(iter(fc.next, None)):

Samuel Bosch said...

Glad you liked the code and yes it was a copy-paste error.

Matt Blackler said...

Nice. It would be interesting to start to see some of the NoSQL offerings making their way into the GDAL library - and seeing what kind of performance benefits can be had by storing them in this way.

You have inspired me to give cassandra and C# a go!
