Using WKT Data and Geospatial Functions

Learn how to ingest and use WKT geometry with geospatial functions

The following sections demonstrate how to ingest and work with WKT data as well as how to use geospatial functions via SQL and the Kinetica Python API. Details about the geospatial functions can be found under Geospatial/Geometry Functions; a list of potential geospatial operations can be found under Geospatial Operations. All geospatial functions are compatible in both native API and SQL.

Prerequisites

Loading Sample Data

Data loading will be done via SQL. First log in to the database:

1
kisql --url <kinetica_url> --user <username> --pwd <password>

Create a schema to house the example tables.

Create Geospatial Example Schema
1
CREATE SCHEMA example_geospatial

Create tables for geospatial & temporal data within the schema.

Create NYC Neighborhood Shape Table
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
CREATE OR REPLACE REPLICATED TABLE example_geospatial.nyc_neighborhood
(
  gid INT NOT NULL,
  geom WKT NOT NULL,
  CTLabel VARCHAR(16) NOT NULL,
  BoroCode VARCHAR(16) NOT NULL,
  BoroName VARCHAR(16) NOT NULL,
  CT2010 VARCHAR(16) NOT NULL,
  BoroCT2010 VARCHAR(16) NOT NULL,
  CDEligibil VARCHAR(16) NOT NULL,
  NTACode VARCHAR(16) NOT NULL,
  NTAName VARCHAR(64) NOT NULL,
  PUMA VARCHAR(16) NOT NULL,
  Shape_Leng DOUBLE NOT NULL,
  Shape_Area DOUBLE NOT NULL
)
Create NYC Taxi Trip Table
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
CREATE OR REPLACE REPLICATED TABLE example_geospatial.taxi_trip_data
(
  transaction_id LONG NOT NULL,
  payment_id LONG NOT NULL,
  vendor_id VARCHAR(4) NOT NULL,
  pickup_datetime TYPE_TIMESTAMP,
  dropoff_datetime TYPE_TIMESTAMP,
  passenger_count TINYINT,
  trip_distance REAL,
  pickup_longitude REAL,
  pickup_latitude REAL,
  dropoff_longitude REAL,
  dropoff_latitude REAL
)
Create NYC Taxi Zone Shape Table
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
CREATE OR REPLACE TABLE example_geospatial.taxi_zone
(
  geom WKT NOT NULL,
  objectid SMALLINT NOT NULL,
  shape_length DOUBLE NOT NULL,
  shape_area DOUBLE NOT NULL,
  zone VARCHAR(64) NOT NULL,
  locationid INT NOT NULL,
  borough VARCHAR(16) NOT NULL,
  PRIMARY KEY (objectid)
)

Populate the example tables with source data. The files are staged into KiFS from the directory that is local to where KiSQL is running. From there, each is loaded into its respective table. Modify these statements as necessary to reflect the location of the data files on the client machine.

Create KiFS Directory for Geospatial Data Files
1
CREATE DIRECTORY 'geo_data'

Stage NYC Neighborhood Data
1
2
UPLOAD FILE 'nyc_neighborhood.csv'
INTO 'geo_data'
Load NYC Neighborhood Data
1
2
LOAD INTO example_geospatial.nyc_neighborhood
FROM FILE PATHS 'kifs://geo_data/nyc_neighborhood.csv'

Stage NYC Taxi Trip Data
1
2
UPLOAD FILE 'taxi_trip_data.csv'
INTO 'geo_data'
Load NYC Taxi Trip Data
1
2
LOAD INTO example_geospatial.taxi_trip_data
FROM FILE PATHS 'kifs://geo_data/taxi_trip_data.csv'

Stage NYC Taxi Zone Data
1
2
UPLOAD FILE 'taxi_zone.csv'
INTO 'geo_data'
Load NYC Taxi Zone Data
1
2
LOAD INTO example_geospatial.taxi_zone
FROM FILE PATHS 'kifs://geo_data/taxi_zone.csv'

Examples via SQL

Creating a Table and Inserting WKT Data

First, a points-of-interest table is created and loaded with sample WKT data.

Create POI Table
1
2
3
4
5
CREATE OR REPLACE TABLE example_geospatial.points_of_interest
(
  src1_poi WKT NOT NULL,
  src2_poi WKT NOT NULL
)

Then, data is inserted into the WKT table.

Populate POI Table
1
2
3
4
5
6
INSERT INTO example_geospatial.points_of_interest
VALUES
  ('POINT(-73.8455003 40.7577272)','POINT(-73.84550029 40.75772724)'),
  ('POLYGON((-73.9258506 40.8287493, -73.9258292 40.828723, -73.9257795 40.8287387, -73.925801 40.8287635,-73.9258506 40.8287493))','POLYGON((-73.9257795 40.8287387, -73.925801 40.8287635, -73.9258506 40.8287493, -73.9258506 40.8287493,-73.9258292 40.828723))'),
  ('LINESTRING(-73.9942208 40.7504289, -73.993856 40.7500753, -73.9932525 40.7499941)','LINESTRING(-70.9942208 42.7504289, -72.993856 43.7500753, -73.9932525 44.7499941)'),
  ('LINESTRING(-73.9760944 40.6833433, -73.9764404 40.6830626, -73.9763761 40.6828897)','LINESTRING(-70.9761 41.6834, -72.9765 39.6831, -76.9764 38.6829)')

Scalar Functions

For the complete list of scalar functions, see both Scalar Functions and Enhanced Performance Scalar Functions.

GEODIST Example
1
2
3
4
5
6
-- Using GEODIST(), calculate the distance between pickup and dropoff points to
-- see where the calculated distance is less than the recorded trip distance

SELECT *
FROM example_geospatial.taxi_trip_data
WHERE GEODIST(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude) < trip_distance
ST_AREA Example
1
2
3
4
5
6
7
8
-- Calculate the area of each taxi zone and find the five largest zones

SELECT TOP 5
  objectid AS "Zone_ID",
  zone AS "Zone_Name",
  DECIMAL(ST_AREA(geom, 1) / 1000000) AS "Zone_Area_in_Square_Kilometers"
FROM example_geospatial.taxi_zone
ORDER BY 3 DESC

Aggregation Functions

For the complete list of aggregation functions, see Aggregation Functions.

ST_DISSOLVE Example
1
2
3
4
5
6
7
-- Dissolve the boundaries between Queens neighborhood tabulation areas to
-- create a single borough boundary

SELECT
  ST_DISSOLVE(geom) AS neighborhoods
FROM example_geospatial.nyc_neighborhood
WHERE BoroName = 'Queens'

Below is a picture of the dissolved neighborhood boundaries:

queens_neighborhoods.png

Joins

Many geospatial/geometry functions can be used to join data sets. For the complete list of join and other scalar functions, see both Scalar Functions and Enhanced Performance Scalar Functions.

Implicit Geospatial Containment Join
1
2
3
4
5
6
7
8
9
SELECT TOP 25
  vendor_id AS "Pickup_Vendor_ID",
  passenger_count AS "Total_Passengers",
  pickup_longitude AS "Pickup_Longitude",
  pickup_latitude AS "Pickup_Latitude",
  BoroName AS "Pickup_Borough",
  NTAName AS "Pickup_NTA"
FROM example_geospatial.taxi_trip_data, example_geospatial.nyc_neighborhood
WHERE STXY_CONTAINS(geom, pickup_longitude, pickup_latitude) = 1
Explicit Geospatial Intersection Join
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
-- Find the top 10 neighborhood pickup locations

SELECT TOP 10
  ntaname AS "Pickup_NTA",
  COUNT(*) AS "Total_Pickups"
FROM
  example_geospatial.taxi_trip_data
  JOIN example_geospatial.nyc_neighborhood
    ON STXY_INTERSECTS(pickup_longitude, pickup_latitude, geom) = 1
GROUP BY
  ntaname
ORDER BY
  "Total_Pickups" DESC

In some cases, it may be useful to aggregate on the geospatial column itself, applying standard aggregation functions to other joined columns.

Explicit Geospatial Intersection Join with Geospatial Aggregation
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
-- Find the geospatial objects representing the top 10 neighborhoods,
-- measured by distance covered over all taxi trips originating in them

SELECT TOP 10
  SUM(trip_distance) AS "Total_Trip_Distance",
  geom AS "Pickup_Geo"
FROM
  example_geospatial.taxi_trip_data
  JOIN example_geospatial.nyc_neighborhood
    ON STXY_INTERSECTS(pickup_longitude, pickup_latitude, geom) = 1
GROUP BY
  geom
ORDER BY
  "Total_Trip_Distance" DESC

Equality

There are three ways of checking if two geometries are equal, which each have different thresholds:

  • ST_EQUALS()
  • ST_ALMOSTEQUALS()
  • ST_EQUALSEXACT()

For details on these functions see Scalar Functions.

The data inserted in the first section includes two POINTs that vary slightly in precision, two POLYGONs that vary in the ordering of their boundary points, and two pairs of LINESTRINGs that vary in both precision and scale. Each of these geometry types is compared to see if each given pair is spatially equal as defined by the equality functions:

Equality Comparisons Between Two Points
1
2
3
4
5
6
7
8
SELECT
  ST_GEOMETRYTYPE(src1_poi) AS "Source_1_Type",
  ST_GEOMETRYTYPE(src2_poi) AS "Source_2_Type",
  ST_ALMOSTEQUALS(src1_poi, src2_poi, 7) AS "Almost_Equals",
  ST_EQUALS(src1_poi, src2_poi) AS "Equals",
  ST_EQUALSEXACT(src1_poi, src2_poi, 4) AS "Equals_Exact"
FROM example_geospatial.points_of_interest
ORDER BY 1

Examples via the Python API

Creating a Table and Inserting WKT Data

First, a points-of-interest table is created and loaded with sample WKT data.

Create POI Table
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# Create a type using the GPUdbRecordType object
columns = [gpudb.GPUdbRecordColumn("src1_poi", "string",
                                   [gpudb.GPUdbColumnProperty.WKT]),
           gpudb.GPUdbRecordColumn("src2_poi", "string",
                                   [gpudb.GPUdbColumnProperty.WKT])]
poi_record_type = gpudb.GPUdbRecordType(columns, label="poi_record_type")
poi_record_type.create_type(h_db)
poi_type_id = poi_record_type.type_id
# Create a table from the poi_record_type
response = h_db.create_table(table_name=poi_table, type_id=poi_type_id)
print "Table created: {}".format(response['status_info']['status'])

Then, data is inserted into the WKT table, and the insertion is verified.

Populate POI Table
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Prepare the data to be inserted
pois = [
    "POINT(-73.8455003 40.7577272)",
    "POLYGON((-73.9258506 40.8287493, -73.9258292 40.828723, -73.9257795 40.8287387, -73.925801 40.8287635, "
    "-73.9258506 40.8287493))",
    "LINESTRING(-73.9942208 40.7504289, -73.993856 40.7500753, -73.9932525 40.7499941)",
    "LINESTRING(-73.9760944 40.6833433, -73.9764404 40.6830626, -73.9763761 40.6828897)"
]
pois_compare = [
    "POINT(-73.84550029 40.75772724)",
    "POLYGON((-73.9257795 40.8287387, -73.925801 40.8287635, -73.9258506 40.8287493, -73.9258506 40.8287493, "
    "-73.9258292 40.828723))",
    "LINESTRING(-70.9942208 42.7504289, -72.993856 43.7500753, -73.9932525 44.7499941)",
    "LINESTRING(-70.9761 41.6834, -72.9765 39.6831, -76.9764 38.6829)"
]
# Insert the data into the table using a list
encoded_obj_list = []
for val in range(0, 4):
    poi1_val = pois[val]
    poi2_val = pois_compare[val]
    record = gpudb.GPUdbRecord(poi_record_type, [poi1_val, poi2_val])
    encoded_obj_list.append(record.binary_data)
response = h_db.insert_records(table_name=poi_table, data=encoded_obj_list, list_encoding="binary")
print "Number of records inserted:  {}".format(response['count_inserted'])

Scalar Functions

For the complete list of scalar functions, see both Scalar Functions and Enhanced Performance Scalar Functions.

GEODIST Example
1
2
3
4
5
6
7
8
9
""" Using GEODIST(), calculate the distance between pickup and dropoff points to see where the calculated distance
    is less than the recorded trip distance """
response = h_db.filter(
        table_name=taxi_trip_table,
        view_name=dist_check_view,
        expression="GEODIST(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude) "
                    "< trip_distance"
)
print "Number of trips where geographic distance < recorded distance: {}".format(response["count"])
ST_AREA Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
# Calculate the area of each taxi zone and copy the selected data to a projection
h_db.create_projection(
        table_name=taxi_zones_table,
        projection_name=zones_by_area_proj,
        column_names=["objectid", "zone", "ST_AREA(geom,1) / 1000000 AS area"],
        options={"order_by": "area desc", "limit": "5"}
)
# Display the top 5 largest taxi zones
large_zones = h_db.get_records(
        table_name=zones_by_area_proj,
        offset=0, limit=gpudb.GPUdb.END_OF_SET,
        encoding="json",
        options={"sort_by":"area", "sort_order": "descending"}
)['records_json']
print "Top 5 largest taxi zones:"
print "{:<7s} {:<27s} {:<30s}".format("Zone_ID", "Zone_Name", "Zone_Area_in_Square_Kilometers")
print "{:=<7s} {:=<27s} {:=<30s}".format("", "", "")
for zone in large_zones:
    print "{objectid:>7d} {zone:<27s} {area:>30.4f}".format(**json.loads(zone))

Aggregation Functions

For the complete list of aggregation functions, see Aggregation Functions.

ST_DISSOLVE Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# Dissolve the boundaries between Queens neighborhood tabulation areas to create a single borough boundary
response = h_db.aggregate_group_by(
        table_name=nyc_nta_table,
        column_names=["ST_DISSOLVE(geom) AS neighborhoods"],
        offset=0, limit=1,
        encoding="json",
        options={
                "expression": "BoroName = 'Queens'",
                "result_table": queens_agb
        }
)
print "Queens neighborhood boundaries dissolved:  {}".format(response['status_info']['status'])

Below is a picture of the dissolved neighborhood boundaries:

queens_neighborhoods.png

Joins

Many geospatial/geometry functions can be used to join data sets. For the complete list of join and other scalar functions, see both Scalar Functions and Enhanced Performance Scalar Functions.

Implicit Geospatial Containment Join
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# Locate which borough and neighborhood taxi pickup points occurred
h_db.create_join_table(
        join_table_name=pickup_point_location,
        table_names=[taxi_trip_table + " AS t", nyc_nta_table + " AS n"],
        column_names=[
                "t.vendor_id", "t.passenger_count",
                "t.pickup_longitude", "t.pickup_latitude",
                "n.BoroName", "n.NTAName"
        ],
        expressions=["STXY_CONTAINS(n.geom, t.pickup_longitude, t.pickup_latitude)"]
)
# Display a subset of the pickup point locations
pickup_points = h_db.get_records(
        table_name=pickup_point_location,
        offset=0, limit=25,
        encoding="json",
        options={"sort_by":"pickup_longitude"}
)['records_json']
print "Subset of pickup point neighborhood locations:"
print "{:<10s} {:<12s} {:<13s} {:<12s} {:<11s} {:<43s}".format("Vendor ID", "Pass. Count", "Pickup Long.",
                                                               "Pickup Lat.", "Borough", "NTA")
print "{:=<10s} {:=<12s} {:=<13s} {:=<12s} {:=<11s} {:=<43s}".format("", "", "", "", "", "")
for point in pickup_points:
    print "{vendor_id:<10s} {passenger_count:<12d} {pickup_longitude:<13f} {pickup_latitude:<12f} "\
          "{BoroName:<11s} {NTAName:<43s}".format(**json.loads(point))
Implicit Geospatial Intersection Join
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# Find the top neighborhood pickup locations
h_db.create_join_table(
        join_table_name=top_pickup_locations_by_frequency,
        table_names=[taxi_trip_table + " AS t", nyc_nta_table + " AS n"],
        column_names=["n.NTAName AS NTAName"],
        expressions=["((STXY_INTERSECTS(t.pickup_longitude, t.pickup_latitude, n.geom) = 1))"]
)
# Aggregate the neighborhoods and the count of records per neighborhood
h_db.aggregate_group_by(
        table_name=top_pickup_locations_by_frequency,
        column_names=["NTAName AS Pickup_NTA", "COUNT(*) AS Total_Pickups"],
        offset=0, limit=gpudb.GPUdb.END_OF_SET,
        options={"result_table": pickup_agb}
)
# Display the top 10 neighborhood pickup locations
response = h_db.get_records_by_column(
        table_name=pickup_agb,
        column_names=["Pickup_NTA", "Total_Pickups"],
        offset=0, limit=10,
        encoding="json",
        options={"order_by": "Total_Pickups desc"}
)
data = h_db.parse_dynamic_response(response)['response']
print "Top 10 neighborhood pickup locations:"
print "{:<43s} {:<14s}".format("Pickup NTA", "Total Pickups")
print "{:=<43s} {:=<14s}".format("", "")
for pickupLocs in zip(data["Pickup_NTA"], data["Total_Pickups"]):
    print "{:<43s} {:<14d}".format(*pickupLocs)

In some cases, it may be useful to aggregate on the geospatial column itself, applying standard aggregation functions to other joined columns.

Implicit Geospatial Intersection Join with Geospatial Aggregation
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# Find the geospatial objects representing the top 10 neighborhoods by
# distance covered over all taxi trips originating in them
response = h_db.create_join_table(
        join_table_name=top_pickup_locations_by_distance,
        table_names=[taxi_trip_table + " AS t", nyc_nta_table + " AS n"],
        column_names=["n.geom AS geom", "t.trip_distance AS trip_distance"],
        expressions=["((STXY_INTERSECTS(t.pickup_longitude, t.pickup_latitude, n.geom) = 1))"]
)
# Aggregate the neighborhoods and the count of records per neighborhood
response = h_db.aggregate_group_by(
        table_name=top_pickup_locations_by_distance,
        column_names=["geom AS Pickup_Geo", "SUM(trip_distance) AS Total_Trip_Distance"],
        encoding = "json",
        offset = 0, limit = 10,
        options = {"sort_by":"value", "sort_order":"descending"}
)

Equality

There are three ways of checking if two geometries are equal, which each have different thresholds:

  • ST_EQUALS()
  • ST_ALMOSTEQUALS()
  • ST_EQUALSEXACT()

For details on these functions see Scalar Functions.

The data inserted in the first section includes two POINTs that vary slightly in precision, two POLYGONs that vary in the ordering of their boundary points, and two pairs of LINESTRINGs that vary in both precision and scale. Each of these geometry types is compared to see if each given pair is spatially equal as defined by the equality functions:

Equality Comparisons Between Two Points
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# Calculate types of geospatial equality for differing sources of geometry
h_db.create_projection(
    table_name=poi_table,
    projection_name=poi_comp_table,
    column_names=[
        "ST_GEOMETRYTYPE(src1_poi) AS src1type",
        "ST_GEOMETRYTYPE(src2_poi) AS src2type",
        "ST_ALMOSTEQUALS(src1_poi, src2_poi, 7) AS almostequals",
        "ST_EQUALS(src1_poi, src2_poi) AS equals",
        "ST_EQUALSEXACT(src1_poi, src2_poi, 4) AS equalsexact"
    ]
)
equality_comps = h_db.get_records(
        table_name=poi_comp_table,
        offset=0, limit=10,
        encoding="json",
        options={"sort_by":"src1type"}
)['records_json']
print "Comparing ST_ALMOSTEQUALS(), ST_EQUALS(), and ST_EQUALSEXACT():"
print "{:<13s} {:<13s} {:<13s} {:<6s} {:<12s}".format(
        "Source_1_Type", "Source_2_Type", "Almost_Equals", "Equals", "Equals_Exact"
)
print "{:=<13s} {:=<13s} {:=<13s} {:=<6s} {:=<12s}".format("", "", "", "", "")
for comp in equality_comps:
    print "{src1type:<13s} {src2type:<13s} {almostequals:>13d} {equals:>6d} {equalsexact:>12d}".format(**json.loads(comp))

Complete Samples

Included below are complete examples containing all the above requests and the output.