Storing and Querying Geospatial and Temporal Data Tuesday 28th January
The Whats and Whys
We want to explore big data and explore the tools used in this area and we need to make this interesting and demonstrable. What do we do? First we need a dataset. Through our contacts at the Open Data Institute we are made aware of the existence of the open data feeds from Network Rail. With these feeds we can get live access to all train movements within the UK. This is a great dataset, it has a whole bunch of information we could extract and visualise but we have to make a choice. Our choice is: Let’s make a heatmap of the lateness present within the train infrastructure and let’s make it interactive so that it can be explored and investigated.
Based on the idea of a map that can be used to display information based on live or historical train movement data we can see the following requirements in terms of locating the data in the data store:
- Geospatial filtering.
The map should be viewable at a user configurable zoom level and thus we need to be able to specify that we are only interested in data for a specific area / location.
- Temporal filtering.
We will rarely want to consider all of the historical data and will more likely be interested in data from a particular period. Thus we should be able to restrict the data to that which is constrained within a time period. If done correctly this should also allow for Live times – e.g. from 30 minutes ago till now.
For our datastore we chose HBase. HBase is a key-value NoSQL database that resides on HDFS. It rests squarely within the Hadoop infrastructure and (according to its claims) can provide fast lookup of data. For the best performance we must understand how HBase stores and accesses the data and the kind of data we will be requesting from it. From this starting point we must construct a schema that takes all these issues into consideration.
From the requirements above we can see that we should be including some kind of geospatial and temporal information within the key, but what are the considerations we must take into account due to our choice of HBase?
HBase is a key-value based database and thus it comes as no surprise that most of the considerations when it comes to HBase are concerned about the design of the RowKey. There is an interesting read about the considerations one must be aware of when designing the RowKey for HBase here.
I’ll summarise the ones that seem relevant to our particular data set:
- Avoid Monotonically Increasing Row Keys
HBase stores its data sequentially based on the RowKey. It splits data across machines at defined RowKey boundaries. If you have data that has a sequential key (such as a timestamp) then all the current writes will be performed on the same region server (machine) i.e. you will have one machine doing all the writes and thus you will not benefit from the distributed nature of the data. This goes for reads also. If the RowKey is monotonically increasing and all the reads are for data that is the most current (such as the live train data) then one machine will be responsible for all the reads.
- Try to minimize row and column sizes
The entire key and value are stored for each column – not per row. Thus a large key compared to the value of the column is very inefficient with respect to the final data storage size:
- Try to keep the ColumnFamily names as small as possible, preferably one character
- Prefer shorter attribute names
- RowKeys are byte arrays. Specify the key as a byte array rather than a string representation – it’s shorter
Understanding HBase and Scanning the Data
HBase is essentially a simple key value lookup database with a relatively simple scan/lookup mechanism for retrieving data. Also, HBase is column based rather than row based. This allows rows to contain row data with missing column values. It also means that each cell must be uniquely keyed, thus each cell has its own key value.
Data is retrieved using a ‘scan’. A scan defines the start and end cell key and will pull back all data between the keys. A cell key is made up of the following parts:
|Key Part||Defined by|
|Column||Column Family, Identifier|
When a part of the key is excluded all the values that match the rest of the specified key are included in the results, thus not defining a part of the key works the same as a wild card. By skipping the refinement on the key you are stating ‘include all possible values’.
** The timestamp part of the key works slightly differently to this. When a timestamp is not specified then the latest version that matches the rest of the key is retrieved.
The part that has the most relevance to scanning is the RowKey. It prefixes the rest of the key and thus has the greatest influence on the number of cells retrieved.
NB: The RowKey can be partially specified. If you just specify the high order bytes of the RowKey then the low order bytes are considered a wildcard – just like the rest of the cell key.
HBase scans can be further filtered by configuring and applying filters. There are a number of possible filter types supplied out of the box or a custom filter can be created. Filters can be applied to the key or the value, they allow you to refine the results that you are interested in. I have not seen any metrics but it stands to reason that filters that filter on the key rather than the value give better performance (TBC).
There is a good book on filters here.
Given the requirement 1) (Geospatial filtering) and 2) (Temporal based filtering) we would like to get the geospatial and time based info into the RowKey and thus enable fast, targeted scanning of the data.
If we had just a timestamp based RowKey we would hit the ‘Monotonically Increasing Row Keys’ issue. Hence we should prefix the RowKey with a value in order to distribute queries across the key space more evenly. The general approach according to HBase literature is to ‘salt’ the RowKey, or prefix with ‘buckets’. Essentially this means hashing the timestamp and prefixing this to the key, or allocating a row to a ‘bucket’ using some kind of algorithm based on the timestamp. However, since we have a geospatial requirement we can use this to our advantage and use this as a prefix to the RowKey. If we prefix the timestamp with some geospatial information we kill two birds with one stone; we allow filtering of the key by location and spread the data more evenly throughout the datastore.
NB: Of course this does assume that our data will be spread evenly across space and time, which it almost certainly will not be! However, once we have collected some data we can extract some metrics that will allow us to pre-split the data across our regions / region servers. I won’t go into this here but please refer here for clarification.
Geo-location in the RowKey
We could simply uses the longitude and latitude as prefixes to the RowKey but this would be sub optimal with regard to the type of filtering we envisage. By using a key design of this nature we would end up with something like this:
Since all the keys are stored sequentially we would have a difficult time creating a well performing scan that scans for data within a particular region.
A scan for a small area would need to encompass all the points with a matching longitude and result in many more points being returned than necessary:
The area in red is the required scan. The area in blue would be the area required to scan over in order to encompass the required area.
What we need is a way to describe the location within a single dimension. This is counter intuitive but can be achieved with something called a Geohash.
Geohash is a longitude/latitude system that describes location as a point with arbitrary accuracy. Geohashes offer properties like arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision). Hence a short geohash describes a wide area centred on a point, a longer geohash describes a smaller area (approaching point like precision).
To visualise this consider the following diagram. Note that this is a simplified representation but does describe the basic properties of geohash.
Given a three character hash we can describe areas that encompass more or less points:
A** contains all the points. AA* contains all points in the block AA. AAA contains all the points in the block AAA.
Prepending the geohash for a particular data point to the timestamp for that point gives us a RowKey that allows us to specify the geographical area within which we want to retrieve data points.
For example: Assuming we are using a three character geohash with the above diagram and we want all the points in the lower left quadrant.
Our start row key would be: AAA*
Our end row key would be: AAD*
Thus we fulfil the criteria of spreading the RowKey values more evenly throughout the key space and we allow for searches based on area. The geohash ensures that points that are geographically close to each other will be clustered close together within the range of key values within the data store. This should produce scans for data that cover fewer values and thus increase performance.
NB: The diagram above is simplified in order to express the idea. The actual geohash produces areas / sub areas that are distributed slightly differently than that shown above.
The following link gives information on the actual distribution of sub geohashes distributed within a larger geohash.
For an interactive map that allows you to see and select geohashes see http://geohash.gofreerange.com/
According to this post: http://stackoverflow.com/questions/13836416/geohash-and-max-distance we have geohash precision as follows:
|Precision||Distance of Adjacent Cell in metres|
From this I hypothesise that a geohash of 6 characters in length should be accurate enough for our purposes. It will give an accuracy of 610m, which should limit the number of points within a geohash to below 10. I am assuming stations and signals are seldom closer together than 600m.
Temporal Data in the RowKey
As well as the geohash to locate the data geospatially we also want to include a temporal element within the RowKey. This will allow us to use filters to filter out data that corresponds to the time period we are interested in. Since there will be a ‘real time’ element to our solution we decided that it would be best to use a reverse timestamp in the RowKey. This will have the effect of placing the most recent data toward the start of the key space and thus scans will hit the most recent data first. There is a good section in the HBase book about this here.
Utilising the geohash as part of the RowKey assumes that we have location data for each and every data point. This is not the case. The data we get from Network Rail will contain the location represented as a STANOX. We do not have longitude and latitude data for all STANOX and hence we need a strategy for handling datapoints for which we cannot calculate the geohash.
The solution we have adopted is to store the data using a modified RowKey. In the cases where we know the location we use the geohash within the RowKey. In the cases where we do not know the location we store the data using the STANOX code in the RowKey. In order to distinguish the data that has known location from that which doesn’t we will need to prefix the key with an identifier.
i.e. prefix with
- ‘k’ for known location
- ‘u’ for unknown location
This allows us to keep the data rather than lose it. It also allows us to locate it easily by STANOX code and thus move it into the ‘known’ location section when more location data becomes available.
|Key type identifier||1 byte||A single byte that describes the kind of key and allows us to determine if the location data is known or unknown.
* ‘k’ (0x6B) for data with a known location
* ‘u’ (0x75) for data with an unknown location
|Location data||6 - ‘k’ type data||6 character geohash of the longitude and latitude of the known data point location. Represented as ASCII bytes.|
|5 -‘u’ type data||5 character location code. This is the code used for a location that we do not have the longitude and latitude data for. Represented as ASCII bytes.|
|Reverse timestamp||4||Long.MAX_VALUE – ‘timestamp’|