Geospatial Partitioning
Geospatial partitioning is a method of organizing geospatial data that groups spatial entities located close to one another. Because nearby entities are stored together, geospatial operations on the grouped data can run faster.
Geospatial joins, in particular, benefit from this technique. The performance gain is proportional to the size of the data being joined.
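To make the idea concrete, the following is a minimal Python sketch of why grouping nearby points speeds up a point-in-polygon join. It is an illustration only and does not reflect how Kinetica implements partitioning: the fixed one-degree grid, the `CELL_DEG` constant, and all function names are hypothetical. Points are bucketed by grid cell, and the join only tests the points in cells that overlap the polygon's bounding box.

```python
from collections import defaultdict
from math import floor

CELL_DEG = 1.0  # hypothetical cell size in degrees; not a Kinetica setting


def cell_of(lon, lat):
    """Map a point to the integer grid cell that contains it."""
    return (floor(lon / CELL_DEG), floor(lat / CELL_DEG))


def partition_points(points):
    """Group (lon, lat) points by grid cell so nearby points share a bucket."""
    buckets = defaultdict(list)
    for lon, lat in points:
        buckets[cell_of(lon, lat)].append((lon, lat))
    return buckets


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def join_partitioned(buckets, polygon):
    """Test only the points in cells overlapping the polygon's bounding box."""
    lons = [x for x, _ in polygon]
    lats = [y for _, y in polygon]
    cx0, cy0 = cell_of(min(lons), min(lats))
    cx1, cy1 = cell_of(max(lons), max(lats))
    matches, tested = [], 0
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for lon, lat in buckets.get((cx, cy), []):
                tested += 1
                if point_in_polygon(lon, lat, polygon):
                    matches.append((lon, lat))
    return matches, tested


# Illustrative coordinates only: a rough box near JFK and four sample points.
points = [(-73.78, 40.64), (-73.9, 40.7), (2.35, 48.85), (139.7, 35.6)]
polygon = [(-73.83, 40.62), (-73.74, 40.62), (-73.74, 40.67), (-73.83, 40.67)]
buckets = partition_points(points)
matches, tested = join_partitioned(buckets, polygon)
```

Here only two of the four points are tested against the polygon, because the other two fall in cells far from its bounding box. On a large table, skipping distant groups in this way is where the savings come from.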
Example
The geospatial join in this example finds flights that landed at (or passed over) JFK airport. It does this by joining geolocation points sampled along various airline flight paths with neighborhood zone data from the NYC NTA data set, specifically the JFK airport zones.
This example relies on the flights data set, which can be imported into Kinetica via GAdmin. That data will be copied to two tables, one with no geopartitioning and one with geopartitioning, and then increased in size by an order of magnitude to show the performance difference between querying the two.
The zone data, which includes JFK airport, will come from the NYC Neighborhood data file.
Setup
First, create the two flight data tables, with and without geopartitioning. The partitioned table is partitioned on the geohash of the flight path data points with a precision of 2, which partitions the data into 135 different groups.
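As background for the precision setting, here is a minimal Python sketch of the standard geohash encoding. It is not Kinetica's implementation, and the names `geohash_encode` and `max_cells` are hypothetical. Each geohash character carries 5 bits, so a precision of 2 can address at most 32^2 = 1024 cells; the 135 groups mentioned above would then be the cells actually occupied by the flight data, though that reading is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=2):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def max_cells(precision):
    """Upper bound on the number of distinct geohash cells at a precision."""
    return 32 ** precision


# Approximate coordinates, used for illustration only.
jfk_cell = geohash_encode(40.6413, -73.7781, precision=2)
heathrow_cell = geohash_encode(51.47, -0.4543, precision=2)
```

Points near JFK share one precision-2 cell, while points near London fall in a different one, which is what lets a geohash-partitioned table keep them in separate groups.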
Then, load both tables with 10 times the original flight data.
Lastly, import the NYC neighborhood data.
Geo-Join
Now that the tables are ready, the same geospatial join can be run twice: once against the geospatially partitioned table and once against the non-partitioned table.
Without Geopartitioning
Non-partitioned query:
Sample runtime:
Query Execution Time: 0.453 s
With Geopartitioning
Partitioned query:
Sample runtime:
Query Execution Time: 0.196 s
Conclusion & Cautions
In this example, the same query executes more than twice as fast against the geopartitioned table as it does against the table without geopartitioning. Note that only the point data table was partitioned, using a geohash precision of 2. This will not be the optimal configuration in every case.
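The "more than twice as fast" figure follows directly from the two sample runtimes reported above. A short Python check of the arithmetic (variable names are illustrative):

```python
unpartitioned_s = 0.453  # sample runtime without geopartitioning
partitioned_s = 0.196    # sample runtime with geopartitioning

speedup = unpartitioned_s / partitioned_s
time_saved_pct = 100 * (1 - partitioned_s / unpartitioned_s)

print(f"speedup: {speedup:.2f}x")            # speedup: 2.31x
print(f"time saved: {time_saved_pct:.1f}%")  # time saved: 56.7%
```

These are single sample runtimes, so the ratio should be read as indicative rather than as a guaranteed improvement.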
This geopartitioning technique may require testing several configurations to determine the one that performs best for the given data set and query. For each table involved in the join, a decision must be made whether to partition it and, if so, what geohash precision to partition on. The size and geospatial distribution of both data sets, as well as the extent to which they overlap, affect both the performance of the query and the configuration that processes it fastest.
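One way to compare candidate configurations is a small timing harness such as the Python sketch below. It assumes each configuration can be exercised through a zero-argument callable that runs the join; the names `time_configurations` and `fastest` are hypothetical, and the stand-in callables here simply sleep rather than query a database.

```python
import statistics
import time


def time_configurations(configs, repeats=5):
    """Return {name: median seconds} for each zero-argument callable."""
    results = {}
    for name, run_query in configs.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query()
            samples.append(time.perf_counter() - start)
        # The median is less sensitive to one-off slow runs than the mean.
        results[name] = statistics.median(samples)
    return results


def fastest(results):
    """Name of the configuration with the lowest median runtime."""
    return min(results, key=results.get)


# Stand-ins for real queries; replace with calls that run the actual join.
configs = {
    "no_partitioning": lambda: time.sleep(0.02),
    "geohash_precision_2": lambda: time.sleep(0.001),
}
results = time_configurations(configs, repeats=3)
best = fastest(results)
```

Repeating each query and comparing medians helps separate a real difference between configurations from run-to-run noise.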