Extracting Data From HBase Tuesday 11th February
Extracting Data from HBase
Getting large volumes of data into a NoSQL database like HBase is only half the problem; once there for it to be useful it needs to be retrieved. Apache HBase provides a Java API to do just that, which mimics a number of the functions that the HBase shell provides. Spring-hadoop (a part of the Spring-IO project) provides a thin veneer over the top of this by introducing the HBaseTemplate which in the usual Spring manner helps with the boilerplate connectivity code, but the HBase API is still the main entry point to writing the queries themselves.
There are a number of different ways of retrieving data, and for us the Scan is the one that best fits our needs; it retrieves a number of cells based on supplied selection criteria and so is in many ways analogous to an SQL SELECT…. WHERE statement.
The key thing to consider when using the API is how the data is structured – it is held in ascending key order and any Scan will read through those rows (and the cells within them) sequentially, so the most crucial thing is to minimise the number of rows processed to maximise performance.
Using Scans and Filters
The Scan class allows you to specify the column family and a start and end key which immediately narrows the search down. Initially we were just interested in the most recent 10 minutes worth of data and so we used the built-in RowFilter with a custom comparator which just inspected the reverse timestamp to determine if the row was required or not. This returned the data we needed but was inefficient as each row between the Scan start and end row was inspected.
From here we are looking at using custom HBase Filters to further improve performance. The key structure we are using is a six character geohash followed by an 8 byte reverse timestamp to allow for historical data. If we are searching in real-time we probably only want the first few entries for any given six character geohash and the rest of the rows for that partial key can be ignored. Consider the following keyspace, only the red areas at the start of each geohash are of interest:
In order to avoid processing the areas in blue we can implement a custom row skip filter. This will examine the timestamp part of the key and determine if it fits within the required boundaries. If not then the filter will provide a key hint for the beginning of the next geohash to search within, and crucially HBase will then skip over all the intervening rows to that key hint. For example if we are beyond the required time in the ‘u10j0h’ hash the filter would return the key value of the top end of the ‘u10j0j’ partial key, skipping over the older unwanted data within u10j0h.
To do this we implemented our own custom Filter which skips the unwanted rows based on a required time range. The filterKeyValue method returns a ReturnCode which indicates how the scan should proceed, and should it return SEEK_NEXT_USING_HINT then the scan will subsequently call getNextKeyHint on the Filter which will return a row key which the scan should skip to. This vastly reduces the number of individual rows that the Filter needs to inspect, reducing the computation required and speeding up the process. In terms of scanning for a range of geohashes for a given time period this is as good as it gets, however to make practical use of this data, to display it on a web site for example, more work is required.
The requirement for this project was to display this data on a web site in the form of a map, so all searches are done for a geographically bound series of geohashes, for example:
While this area is geographically contiguous, in terms of rowkeys it is not and so this will result in six different scans in HBase to retrieve the data (u10j0j, u10j0m-u10j0n, u10j0q, u10j0t, u10j0v-u10j0w and u10j0y). Once at a lower zoom level the number of scans increases rapidly and soon becomes inefficient. This can be mitigated to an extent by using parallelism (which we implemented with the Java concurrency API but there are many valid and well documented ways of doing this) but ultimately the bottleneck still comes down to IO between the web application and the Hadoop cluster hosting HBase.
To reduce the number of calls made we started performing scans using less granular start and stop rowkeys. For example the six scans above can all be encompassed by a single scan of u10j0:
However as can be seen this then encompasses many more six character geohashes than we are actually interested in, so another custom filter is required to skip within u10j0 to the actual rowkeys we are interested in. Fortunately the HBase API provides a FilterList filter which allows us to chain individual filters together so we can still use our time based skip filter to find the actual rows we want. With our two skip filters in place the scan now looks something like:
The more you zoom out the less granular the scan start and stop geohashes become and the larger the skips within that dataset, but the computation effort is contained on the Region Server rather than returning to the client (in this case the Spring HBaseTemplate) each time.
Fun with HBase
There are some interesting “gotchas” which cropped up during the development which are probably worth mentioning:
- Custom Filters and Comparators must be installed on the HBase RegionServer in the /usr/lib/hbase/lib directory.
- Custom Filters and Comparators must implement Writable, meaning that any state (e.g. filter values) must be preserved by implementing/overriding the appropriate methods (such as write() and readFields()). They must also provide no-arg constructors. Failure to do any of these things will likely result in the search failing silently in HBase.
- When getting the raw key value the byte array returned also contains the length of the key in the first two bytes.
- Within a FilterList if one Filter returns SEEK_NEXT_USING_HINT then getNextKeyHint() is called on all the subsequent filters in the chain.