Version:

Iteration

Overview

Kinetica supports arbitrary iteration over records within a data set for the purpose of generating 0 to N result records per iterated record.

The primary use case for this is in returning sub-elements of a given geospatial column as individual records; e.g., returning each point contained within a set of WKTs as a distinct row in a result set.

This capability exists in both the native API and SQL using the ITER virtual table. By joining a data set with the ITER table, each record in the data set can be returned as a user-defined number of records in the result set.

The iteration can occur over sub-elements of a column value (as in the case of the point extraction from a WKT) or a fixed number of times for each record in a given data set. To accomplish this, an iteration expression must be used to limit the ITER table's upper bound column, i:

ITER.i < <column expression>

The iteration can be limited by either:

  • a value-based number of iterations--here, the number of iterations for a given record is equal to the length of a string column in that record:

    ITER.i < LENGTH(column)
    
  • a constant number of iterations--here, there are 5 iterations (result rows) for each record in the data set:

    ITER.i < 5
    

The ITER table can be utilized via the native /create/jointable endpoint or by joining against the ITER table in SQL.

Examples

Geospatial

In these examples, point & line extraction from a WKT column will be demonstrated, using variable & fixed iteration techniques.

They will operate against a line table containing an identifier column and a WKT column:

+------+-------------------------------------------+
|   id | wkt                                       |
+------+-------------------------------------------+
|    1 | LINESTRING (1 1, 2 2, 3 3)                |
|    2 | LINESTRING (10 11, 12 13, 14 15, 16 17)   |
+------+-------------------------------------------+

Point Extraction - Variable Iteration (in SQL)

The individual points can be extracted from each LINESTRING, demonstrated here using SQL. Each record in line will be iterated over a number of times equal to the number of points in each line's WKT value (ST_NPOINTS(wkt)). Each resulting record will be the point within the WKT at the ith + 1 position (ST_POINTN(wkt, i + 1)). Note that the ITER table's i column values begin at 0.

SELECT id, wkt, i, ST_POINTN(wkt, i + 1) AS line_point
FROM line
JOIN ITER ON i < ST_NPOINTS(wkt)
ORDER BY id, i;
+------+-------------------------------------------+-----+-----------------+
|   id | wkt                                       |   i | line_point      |
+------+-------------------------------------------+-----+-----------------+
|    1 | LINESTRING (1 1, 2 2, 3 3)                |   0 | POINT (1 1)     |
|    1 | LINESTRING (1 1, 2 2, 3 3)                |   1 | POINT (2 2)     |
|    1 | LINESTRING (1 1, 2 2, 3 3)                |   2 | POINT (3 3)     |
|    2 | LINESTRING (10 11, 12 13, 14 15, 16 17)   |   0 | POINT (10 11)   |
|    2 | LINESTRING (10 11, 12 13, 14 15, 16 17)   |   1 | POINT (12 13)   |
|    2 | LINESTRING (10 11, 12 13, 14 15, 16 17)   |   2 | POINT (14 15)   |
|    2 | LINESTRING (10 11, 12 13, 14 15, 16 17)   |   3 | POINT (16 17)   |
+------+-------------------------------------------+-----+-----------------+

As the result set shows, the two LINESTRING values of three & four points each were turned into seven records--one for each POINT represented in the two LINESTRING values.

Line Segment Extraction - Variable Iteration (in Python)

The line segments can be extracted from each LINESTRING, demonstrated here with the create_join_table() function in Python. Each record in line will be iterated over a number of times equal to the number of component line segments in each line's WKT value (ST_NPOINTS(wkt) - 1). Each resulting record will be a LINESTRING created from the points within the WKT at the ith + 1 and ith + 2 positions (ST_POINTN(wkt, i + 1) & ST_POINTN(wkt, i + 2), respectively). Note that the ITER table's i column values begin at 0.

# Iterate over geo table records, creating line segments from the points
h_db.create_join_table(
    join_table_name = "point",
    table_names = ["line","ITER"],
    column_names = ["id", "wkt", "i", "ST_MAKELINE(ST_POINTN(wkt, i + 1), ST_POINTN(wkt, i + 2)) AS line_segment"],
    expressions = ["i < ST_NPOINTS(wkt) - 1"]
)

print "\n{:2s}  {:40s}  {}  {}".format("ID", "WKT", "ITER", "Line Segment")
for record in gpudb.GPUdbTable(None, "point", db = h_db).get_records():
    print "{r[id]:2d}  {r[wkt]:40s}  {r[i]:4d}  {r[line_segment]}".format(r = record)
ID  WKT                                       ITER  Line Segment
 1  LINESTRING (1 1, 2 2, 3 3)                   0  LINESTRING (1 1, 2 2)
 1  LINESTRING (1 1, 2 2, 3 3)                   1  LINESTRING (2 2, 3 3)
 2  LINESTRING (10 11, 12 13, 14 15, 16 17)      0  LINESTRING (10 11, 12 13)
 2  LINESTRING (10 11, 12 13, 14 15, 16 17)      1  LINESTRING (12 13, 14 15)
 2  LINESTRING (10 11, 12 13, 14 15, 16 17)      2  LINESTRING (14 15, 16 17)

As the result set shows, the two LINESTRING values of three & four points each were turned into five records--two component line segments for the first LINESTRING value and three for the second one.

Endpoint Extraction - Fixed Iteration (in SQL)

The points at both ends of each LINESTRING can be extracted, demonstrated here using SQL. Each record in line will be iterated over a fixed number of times--here, twice, one time for each point at the ends of the WKT value.

Each resulting record will either be the point at the beginning (ST_STARTPOINT(wkt)) or end (ST_ENDPOINT(wkt)) of the LINETSTRING. Note that the ITER table's i column values begin at 0, and that this query uses implied join syntax--no JOIN keyword is necessary.

SELECT id, wkt, i, IF(i = 0, ST_STARTPOINT(wkt), ST_ENDPOINT(wkt)) AS line_end_point
FROM line, ITER
WHERE i < 2
ORDER BY id, i;
+------+-------------------------------------------+-----+------------------+
|   id | wkt                                       |   i | line_end_point   |
+------+-------------------------------------------+-----+------------------+
|    1 | LINESTRING (1 1, 2 2, 3 3)                |   0 | POINT (1 1)      |
|    1 | LINESTRING (1 1, 2 2, 3 3)                |   1 | POINT (3 3)      |
|    2 | LINESTRING (10 11, 12 13, 14 15, 16 17)   |   0 | POINT (10 11)    |
|    2 | LINESTRING (10 11, 12 13, 14 15, 16 17)   |   1 | POINT (16 17)    |
+------+-------------------------------------------+-----+------------------+

As the result set shows, the two LINESTRING values were turned into four records--one for the POINT at each end of the two LINESTRING values.

Non-Geospatial

In these examples, character extraction from a string column will be demonstrated, using variable & fixed iteration techniques.

They will operate against a dictionary table containing an identifier column and a charN word column:

+------+----------+
|   id | word     |
+------+----------+
|    1 | apple    |
|    2 | banana   |
|    3 | cherry   |
+------+----------+

Letter Counting - Variable Iteration (in SQL)

Individual letters can be extracted from each word and then aggregated to count the number of occurrences of each letter across all words in dictionary, demonstrated here using SQL. Each record in dictionary will be iterated over a number of times equal to the number of letters in each word (LENGTH(word)). Each resulting record will be the letter within the word at the ith + 1 position (SUBSTR(word, i + 1, 1)). Note that the ITER table's i column values begin at 0.

SELECT id, word, i, SUBSTR(word, i + 1, 1) AS letter
FROM dictionary
JOIN ITER ON i < LENGTH(word)
ORDER BY id, i;
+------+----------+-----+----------+
|   id | word     |   i | letter   |
+------+----------+-----+----------+
|    1 | apple    |   0 | a        |
|    1 | apple    |   1 | p        |
|    1 | apple    |   2 | p        |
|    1 | apple    |   3 | l        |
|    1 | apple    |   4 | e        |
|    2 | banana   |   0 | b        |
|    2 | banana   |   1 | a        |
|    2 | banana   |   2 | n        |
|    2 | banana   |   3 | a        |
|    2 | banana   |   4 | n        |
|    2 | banana   |   5 | a        |
|    3 | cherry   |   0 | c        |
|    3 | cherry   |   1 | h        |
|    3 | cherry   |   2 | e        |
|    3 | cherry   |   3 | r        |
|    3 | cherry   |   4 | r        |
|    3 | cherry   |   5 | y        |
+------+----------+-----+----------+

As the result set shows, the three words were turned into a set of records, one for each letter in each word.

The query can have grouping applied to calculate the number of occurrences of each letter. Note that this query uses implied join syntax--no JOIN keyword is necessary.

SELECT SUBSTR(word, i + 1, 1) AS letter, COUNT(*) AS total_occurrences
FROM dictionary, ITER
WHERE i < LENGTH(word)
GROUP BY 1
ORDER BY letter;
+----------+---------------------+
| letter   |   total_occurrences |
+----------+---------------------+
| a        |                   4 |
| b        |                   1 |
| c        |                   1 |
| e        |                   2 |
| h        |                   1 |
| l        |                   1 |
| n        |                   2 |
| p        |                   2 |
| r        |                   2 |
| y        |                   1 |
+----------+---------------------+

Data Duplication - Fixed Iteration (in Python)

Data can be duplicated without being materialized through the use of iteration. This may be useful in several situations, including generating randomized test data from a template table or generating a large number of records for load testing. In either case, the memory footprint will be small, relative to the size of the table involved in the iteration operation. Queries against the result set should be as performant as those on tables with an equivalent number of materialized records.

In this example, the words in the dictionary table will be duplicated five times:

# Iterate a fixed number of times, duplicating source records
h_db.create_join_table(
    join_table_name = "dictionary_duplicated",
    table_names = ["dictionary","ITER"],
    column_names = ["id", "i", "word"],
    expressions = ["i < 5"]
)

print "\n{:2s}  {}  {}".format("ID", "ITER", "Word")
for record in gpudb.GPUdbTable(None, "dictionary_duplicated", db = h_db).get_records():
    print "{r[id]:2d}  {r[i]:4d}  {r[word]}".format(r = record)
ID  ITER  Word
 1     0  apple
 1     1  apple
 1     2  apple
 1     3  apple
 1     4  apple
 2     0  banana
 2     1  banana
 2     2  banana
 2     3  banana
 2     4  banana
 3     0  cherry
 3     1  cherry
 3     2  cherry
 3     3  cherry
 3     4  cherry

As the result set shows, each word is duplicated five times.

Limitations

  • Store-only columns cannot be used in the iteration expression
  • As outer joins (LEFT, RIGHT, & FULL) have no meaning within the context of iteration, they should not be used in the join
  • Though multiple iterators can be used in the same operation, they cannot refer to each other for bounding purposes; e.g., iter1.i < iter2.i