Skip to main content

Thoughts On Adding Spatial Indexing to Voldemort


This weekend, I set out to explore something that has always been a daemon running at the back of my head. What would it mean to add Spatial Indexing support to Voldemort, given that Voldemort supports a pluggable storage layer.. Would it fit well with the existing Voldemort server architecture? Or would it create a frankenstein freak show where two systems essentially exist side by side under one codebase... Let's explore..

Basic Idea

The 50000 ft blueprint goes like this.

  • Implement a new Storage Engine on top Postgres sql (Sorry innoDB, you don't have true spatial indexes yet and Postgres is kick ass)
  • Implement a new smart partitioning layer that maps a given geolocation to a subset of servers in the cluster (There are a few ways to do this. But this needs to be done to get an efficient solution. I don't believe in naive spraying of results to all servers)
  • Support "geolocation" as a new standard key serializer type in Voldemort. The values will still be  opaque blobs.
  • Implement a LocationStoreClient that supports adding, updating and radius searches on location tagged objects. 
  • Fix the other parts of Voldemort like Online rebalancing & the proxy bridges to play along. 

The LocationStoreClient

A key difference with location data is that there could be multiple data items on the same geo location (think foursquare checkins). Hence, we need a mechanism to distinguish between such data items, so that the storage layer can decide whether to update some data item vs adding it to storage. Hence, there is inherently a notion of "context" around each geolocation (think userids or checkinids as the context with the value being the checkin text itself). Note that context is not indexable i.e you cannot query based on the context. Ergo, the "geokey" would have the following json structure.

{
    "lat"     : <lat_double_value>,
    "lon"     : <lon_double_value>,
    "context" : <string_context>,
    "radius"  : <radius_in_meters>
}

The java LocationStoreClient interface would then support the following operations.

public class interface LocationStoreClient {

    // Gets the data item that matches the exact {lat, lon, context} values in the key
    public List<Versioned<V>> get(GeoKey key);

    // Performs a bounding box search around the {lat, lon} in the key
    // and returns all the data items in the given radius
    public HashMap<GeoKey, List<Versioned<V>> radiusSearch(GeoKey key);

    // If there is a data item with the exact {lat, lon, context} triplet already
    // , an update is performed. Otherwise, the data point is added to storage.
    public void put(GeoKey key, V value);

    // Unsupported for now. Data retention feature of Voldemort can be used instead with limitations
    public void delete(GeoKey key);
}

Implementing Foursquare using Voldemort 

Let's consider an sample location based application to obtain some intuitions as how the interface above is a feasible starting point. Mostly, foursquare is about users checking-in to locations they visit and then foursquare pulling such checkins around an users's current location.

When users checkin, they generate some content (comment, tip) on that specific location. Typically, there is a checkin_id associated with each checkin. We could put(..) the checkin with checkin_id as the context and the content as the value, into the location store. To display checkins, you simply perform a radius query and obtain all the checkin_ids and their content. Since, Voldemort supports multiple storage engines to be used on the same server, you could also have another "regular" (BDB) Voldemort store for other mappings you need such as say checkin_id to user_id.

The problem with the interface is that if you need to update a specific checkin, you need to first perform a radiusSearch or a Voldemort admin fetch to construct a key to be able to turn into an update. i.e you need the context.

Anyway, this is obviously only one example. I am pretty sure, if you get creative, the interface and data model is flexible enough to fit a lot of location based application classes.

Conclusion

Voldemort (with its Store interface) and spatial queries are not a match made in heaven. For example, the getAll() method now morphs into the radius search. However, it does not seem like a freak show either. I believe we could build some interesting applications, leveraging all the elastic scaling goodness Voldemort already provides.

I have a gist stashed here, based on which the first cut of the PostgresStorageEngine could be implemented. If you are Voldemort user and you think this might be useful, get in touch with me..  As of now, progress is slow, since I don't need this until sometime from now.


Comments

Post a Comment

Popular posts from this blog

Learning Spark Streaming #1

I have been doing a lot of Spark in the past few months, and of late, have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover a lot of details about Spark streaming and even other stream processing systems in general, either presenting technical arguments/critiques, with any micro benchmarks as needed.

Some high level description of Spark Streaming (as of 1.4),  most of which you can find in the programming guide.  At a high level, Spark streaming is simply a spark job run on very small increments of input data (i.e micro batch), every 't' seconds, where t can be as low as 1 second.

As with any stream processing system, there are three big aspects to the framework itself.


Ingesting the data streams : This is accomplished via DStreams, which you can think of effectively as a thin wrapper around an input source such as Kafka/HDFS which knows how to read the next N entries from the input.The receiver based approach is a little complicated IMHO, and …

Setting up Hadoop/YARN/Spark/Hive on Mac OSX El Capitan

If you are like me, who loves to have everything you are developing against working locally in a mini-integration environment, read on

Here, we attempt to get some pretty heavy-weight stuff working locally on your mac, namely

Hadoop (Hadoop2/HDFS)YARN (So you can submit MR jobs)Spark (We will illustrate with Spark Shell, but should work on YARN mode as well)Hive (So we can create some tables and play with it) We will use the latest stable Cloudera distribution, and work off the jars. Most of the methodology is borrowed from here, we just link the four pieces together nicely in this blog. 
Download StuffFirst off all, make sure you have Java 7/8 installed, with JAVA_HOME variable setup to point to the correct location. You have to download the CDH tarballs for Hadoop, Zookeeper, Hive from the tarball page (CDH 5.4.x page) and untar them under a folder (refered to as CDH_HOME going forward) as hadoop, zookeeper


$ ls $HOME/bin/cdh/5.4.7 hadoop hadoop-2.6.0-cdh5.4.7.…