> ## Documentation Index
> Fetch the complete documentation index at: https://docs.kinetica.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Python Developer Guide

> Step-by-step instructions on writing Python applications with Kinetica

<a id="um-python-label" />

The following guide provides step-by-step instructions to get started writing
*Python* applications using *Kinetica*. This guide demonstrates only a small set
of the available API.  A detailed description of the complete interface is
available under [Python API Reference](/content/api/python).

## Prerequisites

The prerequisites for running the match graph example are listed below:

* Python API
* [Tutorial script](https://raw.githubusercontent.com/kineticadb/kinetica-docs/master/content/examples/python/python_tutorial.py)
* [Taxi trip data CSV file](https://raw.githubusercontent.com/kineticadb/kinetica-docs/master/content/examples/data/taxi_trip_data.csv)

<a id="python-api-installation" />

### Python API Installation

When using this include, precede it with the following heading:

### Python API Installation

Depending on the target operating system, a Python virtual environment may need
to be installed first:

* [Python Virtual Environment](#python-virtual-environment)

The native *Kinetica Python API* is accessible through the following means:

* [PyPI](#pypi)
* [Git](#git)

#### Python Virtual Environment

A Python virtual environment is necessary to install in an operating environment
where Python is externally managed.

1. Install a Python virtual environment:

   ```bash theme={null}
   python3 -m venv .venv
   ```

2. Activate the Python virtual environment:

   ```bash theme={null}
   source .venv/bin/activate
   ```

#### PyPI

1. Install the API:

   ```bash theme={null}
   pip3 install gpudb
   ```

2. Test the installation:

   ```python theme={null}
   python3 -c "import gpudb;print('Import Successful')"
   ```

   If *Import Successful* is displayed, the API has been installed as is ready
   for use.

#### Git

1. In the desired directory, run the following, but be sure to replace
   `<kinetica-version>` with the name of the installed Kinetica version,
   e.g., `v7.2`:

   ```bash theme={null}
   git clone -b release/<kinetica-version> --single-branch https://github.com/kineticadb/kinetica-api-python.git
   ```

2. Change directory into the newly downloaded repository:

   ```bash theme={null}
   cd kinetica-api-python
   ```

3. In the root directory of the unzipped repository, install the Kinetica API:

   ```bash theme={null}
   sudo pip3 install .
   ```

4. Test the installation (*Python3* is necessary for running the API example):

   ```bash theme={null}
   python3 examples/example.py
   ```

### Data File

The tutorial script references the <Badge color="gray">taxi\_trip\_data.csv</Badge> data file,
mentioned in the [Prerequisites](#prerequisites), in the current local directory, by default.
This directory can specified as a parameter when running the script.

## Connecting to the Database

To interact with Kinetica, you must first instantiate an object of the
[GPUdb](/content/api/python/source/gpudb) class while providing the
connection URL, username, & password to the database server.  For more details
on connecting to the database, see [Connecting via API](/content/api/concepts#api-conn).

```python Connect to the Database theme={null}
kdb = gpudb.GPUdb(host = args.url, username = args.username, password = args.password)
```

## Creating a Schema

All [tables](/content/concepts/tables) &
[views](/content/concepts/views) must be created within a
[schema](/content/concepts/schemas).  So, the first step will be to create
a *schema* to contain all of the *tables* & *views* created within this
tutorial.

```python Create Tutorial Schema theme={null}
schema = "tutorial_python"
kdb.create_schema(schema)
```

## Creating a Type

Before a *table* can be created and any data can be loaded into it, a
[type](/content/concepts/types) needs to be defined.  The *type* is the set
of column definitions for the *table*, structured as a list of lists, where each
contained list corresponds to a single column definition.  Each column
comprises at least two values:  a column name (always the first value) and a
*base type* (always the second value).  Any subsequent values are column
properties.  You can also use the
[GPUdbColumnProperty](/content/api/python/source/gpudbcolumnproperty)
object in addition to literal string values to help define a column.

For this tutorial, three *types* will be used to create three corresponding
*tables*:

* A base *table* containing a sample of NYC taxi cab ride data
* A lookup *table* containing sample cab vendor information (created in the
  example below)
* A 1-to-1 relation *table* containing payment information for a subset of the
  cab-ride transactions

The definition of the *type* for the `vendor` *table* follows:

```python Define a Type theme={null}
vendor_columns = [
    # column types and properties can be listed as strings
    ["vendor_id", "string", "char4", "primary_key"],
    ["vendor_name", "string", "char64"],
    ["phone", "string", "char16", "nullable"],
    ["email", "string", "char64", "nullable"],
    ["hq_street", "string", "char64"],
    # column properties can also be listed using the GPUdbColumnProperty
    # object
    [
        "hq_city",
        "string",
        gpudb.GPUdbColumnProperty.CHAR8,
        gpudb.GPUdbColumnProperty.DICT
    ],
    [
        "hq_state",
        "string",
        gpudb.GPUdbColumnProperty.CHAR8,
        gpudb.GPUdbColumnProperty.DICT
    ],
    ["hq_zip", "int"],
    ["num_emps", "int"],
    ["num_cabs", "int"]
]
```

<Info>
  While the type is just defined here, it will be created implicitly
  when first used to create a table, shown below.
</Info>

## Creating a Table

After you've defined a *type* list, you can create a *table* using the
[GPUdbTable](/content/api/python/source/gpudbtable) interface, passing in
the *type*, the name of the *table*, any *table* options, and a handle to the
database connection.

```python Create a Table theme={null}
table_vendor_obj = gpudb.GPUdbTable(
    _type = vendor_columns,
    name = table_vendor,
    options = {"is_replicated": "true"},
    db = kdb
)
```

To pass in table creation options, you can either specify a Python map, as
above, or you can use the
[GPUdbTableOptions](/content/api/python/source/gpudbtableoptions) object
instead:

```python Use GPUdbTableOptions Instead of Raw Map for Options theme={null}
creation_options = gpudb.GPUdbTableOptions.default().is_replicated(True)
```

<Note>
  At this point in the tutorial, a `payment` and
  `taxi_trip_data` type and table are also created.
</Note>

To acquire a [GPUdbTable](/content/api/python/source/gpudbtable) object for
a *table* that already exists, pass in a `_type` of `None`:

```python Get a Handle to an Existing Table theme={null}
table_vendor_obj = gpudb.GPUdbTable(_type=None, name=table_vendor, db=kdb)
```

## Inserting Data

You can insert single records or multiple records using the
[insert\_records](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.insert_records)
method. As mentioned previously, the number and order of values must match the
column declarations found in the [type creation](#creating-a-type) step. In
*Python*, you can pass a value of `None` for null values, but the column must
be declared nullable when creating the *type* schema.

<Info>
  The following methods are not the only way to insert records, but they
  are the simplest.  There is a convenience class called
  [GPUdbIngestor](/content/api/python/source/gpudbingestor) and an
  automatic multi-head capability available through
  [GPUdbTable](/content/api/python/source/gpudbtable), which facilitate
  inserting records into a table in batches; see
  [Distributed Ingest](/content/tuning/multihead/multihead_ingest) for details.
</Info>

### Key/Value Record

In this example, a dictionary object will be used to map key/value pairs to
table column values.

```python Insert by Dictionary theme={null}
# Create ordered dictionary for keys & values of record
payment_datum = collections.OrderedDict()
payment_datum["payment_id"] = 189
payment_datum["payment_type"] = "No Charge"
payment_datum["credit_type"] = None
payment_datum["payment_timestamp"] = None
payment_datum["fare_amount"] = 6.5
payment_datum["surcharge"] = 0
payment_datum["mta_tax"] = 0.6
payment_datum["tip_amount"] = 0
payment_datum["tolls_amount"] = 0
payment_datum["total_amount"] = 7.1

# Insert the record into the table
table_payment_obj.insert_records(payment_datum)
```

### In-Line Record

In this example, records will be defined in-line and added to a list.  They can
either be inserted as a single list of lists, as is done here, or as individual
lists, one per record.

```python Insert by List theme={null}
# Create a list of in-line records. The order of the values must match the
# column order in the type
vendor_records = [
    ["VTS", "Vine Taxi Service", "9998880001", "admin@vtstaxi.com",
     "26 Summit St.", "Flushing", "NY", 11354, 450, 400],
    ["YCAB", "Yes Cab", "7895444321", None, "97 Edgemont St.", "Brooklyn",
     "NY", 11223, 445, 425],
    ["NYC", "New York City Cabs", None, "support@nyc-taxis.com",
     "9669 East Bayport St.", "Bronx", "NY", 10453, 505, 500],
    ["DDS", "Dependable Driver Service", None, None,
        "8554 North Homestead St.", "Bronx", "NY", 10472, 200, 124],
    ["CMT", "Crazy Manhattan Taxi", "9778896500",
     "admin@crazymanhattantaxi.com", "950 4th Road Suite 78", "Brooklyn",
     "NY", 11210, 500, 468],
    ["TNY", "Taxi New York", None, None, "725 Squaw Creek St.", "Bronx",
     "NY", 10458, 315, 305],
    ["NYMT", "New York Metro Taxi", None, None, "4 East Jennings St.",
     "Brooklyn", "NY", 11228, 166, 150],
    ["5BTC", "Five Boroughs Taxi Co.", "4566541278", "mgmt@5btc.com",
     "9128 Lantern Street", "Brooklyn", "NY", 11229, 193, 175]
]

# Insert the records into the Vendor table
table_vendor_obj.insert_records(vendor_records)
```

### CSV

CSV and other data files can be loaded into Kinetica via
[KiFS](/content/tools/kifs) using two lines of code:

1. Upload the local file to *KiFS* using
   [upload\_files](/content/api/python/source/gpudb#gpudb.GPUdb.upload_files),
   passing the *KiFS* path to upload to and the bytes of the file to upload.
2. Insert the data from the uploaded file in *KiFS* using
   [insert\_records\_from\_files](/content/api/python/source/gpudb#gpudb.GPUdb.insert_records_fromfiles),
   passing the name of the table to load data into and the *KiFS* path to the
   source file.

```python Insert from File theme={null}
kdb.upload_files("/data/" + CSV_FILE, open(csv_path, "rb").read())
kdb.insert_records_from_files(table_taxi, ["kifs://data/" + CSV_FILE])
```

<Info>
  The *KiFS* directory must already exist before uploading a file into
  it.  The
  [create\_directory](/content/api/python/source/gpudb#gpudb.GPUdb.create_directory)
  method can be used to create this directory.
</Info>

## Retrieving Data

Once the [table](/content/concepts/tables) is populated with data, the data can be
retrieved using binary encoding or JSON encoding. Binary encoding is more
efficient than JSON encoding. The example below uses the
[GPUdbTable.get\_records](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.get_records)
method and binary encoding:

```python Retrieve Records (Binary Encoding via GPUdbTable class) theme={null}
# Retrieve no more than 10 records from the Payment table using the
# GPUdbTable interface with binary encoding
print("{:>10s} {:<12s} {:<11s} {:<17s} {:<11s} {:<9s} {:<7s} {:<10s} {:<12s} {:<12s}".format(
        "Payment ID", "Payment Type", "Credit Type", "Payment Timestamp", "Fare Amount",
        "Surcharge", "MTA Tax", "Tip Amount", "Tolls Amount", "Total Amount"
))
print("{:=>10s} {:=<12s} {:=<11s} {:=<17s} {:=<11s} {:=<9s} {:=<7s} {:=<10s} {:=<12s} {:=<12s}".format(
        "", "", "", "", "", "", "", "", "", ""
))
for record in table_payment_obj.get_records(
    offset = 0,
    limit = 10,
    encoding = "binary",
    options = {"sort_by": "payment_id"}
):
    print("{payment_id:>10d} {payment_type:<12s} {credit_type:<11s} {payment_timestamp:>17} " \
        "{fare_amount:11.2f} {surcharge:9.2f} {mta_tax:7.2f} {tip_amount:10.2f} " \
        "{tolls_amount:12.2f} {total_amount:12.2f}".format(
            payment_id=record["payment_id"],
            payment_type=(record["payment_type"] or ""),
            credit_type=(record["credit_type"] or ""),
            payment_timestamp=(record["payment_timestamp"] or ""),
            fare_amount=record["fare_amount"],
            surcharge=record["surcharge"],
            mta_tax=record["mta_tax"],
            tip_amount=record["tip_amount"],
            tolls_amount=record["tolls_amount"],
            total_amount=record["total_amount"]
    ))
```

This example demonstrates JSON encoding record retrieval, using the
[GPUdb.get\_records](/content/api/python/source/gpudb#gpudb.GPUdb.get_records)
method:

```python Retrieve Records (JSON Encoding via GPUdb class) theme={null}
# Retrieve all records from the Vendor table using the GPUdb interface with JSON encoding
print("{:<9s} {:<26s} {:<11s} {:<29s} {:<24s} {:<8s} {:<8s} {:<6s} {:<11s} {:<6s}".format(
        "Vendor ID", "Vendor Name", "Phone", "Email", "HQ Street",
        "HQ City", "HQ State", "HQ Zip", "# Employees", "# Cabs"
))
print("{:=<9s} {:=<26s} {:=<11s} {:=<29s} {:=<24s} {:=<8s} {:=<8s} {:=<6s} {:=<11s} {:=<6s}".format(
        "", "", "", "", "", "", "", "", "", ""
))
vendor_records_gpudb = kdb.get_records(
    table_name = table_vendor,
    offset = 0,
    limit = gpudb.GPUdb.END_OF_SET,
    encoding = "json",
    options = {"sort_by": "vendor_id"}
)["records_json"]
for record in vendor_records_gpudb:
    rec = json.loads(record)
    rec["phone"] = rec["phone"] or ""
    rec["email"] = rec["email"] or ""
    print("{vendor_id:<9s} {vendor_name:<26s} {phone:<11} {email:<29s} {hq_street:<24s} " \
        "{hq_city:<8s} {hq_state:<8s} {hq_zip:<6d} {num_emps:11d} {num_cabs:6d}".format(**rec))
```

For large *tables*, the data can be easily be retrieved in smaller blocks by
using the `offset` and `limit` parameters.  The returned response also
contains the *type* (column definitions) of the results.

## Updating Records

Using any [GPUdbTable](/content/api/python/source/gpudbtable) object, you can
update records via the
[update\_records](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.update_records)
method.

```python Update Records theme={null}
# Update the e-mail of, and add two employees and one cab to, the DDS vendor
table_vendor_obj.update_records(
    expressions = ["vendor_id = 'DDS'"],
    new_values_maps = {
        "email": "'management@ddstaxico.com'",
        "num_emps": "num_emps + 2",
        "num_cabs": "num_cabs + 1"
    },
    options = {"use_expressions_in_new_values_maps":"true"}
)
```

## Deleting Records

You can delete records using the
[delete\_records](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.delete_records)
method.  This method allows you to specify multiple filter expressions--note
that each expression is used to delete records independently from the others
(i.e., a record only needs to meet any one expression's criteria to be deleted
from the table).  Set `delete_all_records` to `True` in the `options` map
to delete all records in the table.

```python Delete Records theme={null}
# Delete payment 189
delete_expr = ["payment_id = 189"]
table_payment_obj.delete_records(expressions=delete_expr)
```

## Alter Table

Some properties can be altered or added after *table* creation, using
[alter\_table](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.alter_table),
including [indexes](/content/concepts/indexes) and
[dictionary encoding](/content/concepts/dictionary_encoding).

### Indexes

Using the
[alter\_table](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.alter_table)
method, you can create *indexes* on columns using the `create_index` *action*
paired with a column name.

```python Index Columns theme={null}
# Add column indexes on:
#   - payment table, fare_amount (for query-chaining filter example)
#   - taxi table, passenger_count (for filter-by-range example)
table_payment_obj.alter_table(action="create_index", value="fare_amount")
table_taxi_obj.alter_table(action="create_index", value="passenger_count")
```

### Dictionary Encoding

Applying *dictionary encoding* via
[alter\_table](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.alter_table)
involves adding a new property to a column.

```python Dictionary Encode Column theme={null}
# Apply dictionary encoding to the vendor_id column
at_resp = table_taxi_obj.alter_table(
    action = "change_column",
    value = "vendor_id",
    options = {"column_properties": "char4,dict"}
)
```

<Note>
  To add a new property, all existing column properties must be
  listed along with any new properties.
</Note>

## Filters

Filters are an easy way to reduce larger *tables* into more concise
[views](/content/concepts/views) using
[expressions](/content/concepts/expressions).

```python Count Value-Filtered Records theme={null}
# Filter for only payments with no corresponding payment type, returning the
# count of records found; allow Kinetica to assign a random name to the view
f1_count = table_payment_obj.filter(expression="IS_NULL(payment_type)").size()
```

```python Count Using Filter Chaining; Name the Backing Views theme={null}
# Using GPUdbTable query chaining, filter null payment type records with a fare amount greater than 8
f2_count = table_payment_obj.filter(
    view_name = view_example1, expression = "IS_NULL(payment_type)"
).filter(
    view_name = view_example2, expression = "fare_amount > 8"
).size()
```

```python Count List-Filtered Records theme={null}
# Filter by list where vendor ID is either NYC or YCAB
f3_count = table_taxi_obj.filter_by_list(
    view_name = view_example3, column_values_map = {"vendor_id": ["NYC", "YCAB"]}
).size()
```

```python Count Range-Filtered Records theme={null}
# Filter by range trip with passenger count between 1 and 3
f4_count = table_taxi_obj.filter_by_range(
    view_name = view_example4, column_name = "passenger_count", lower_bound = 1, upper_bound = 3
).size()
```

## Aggregates

*Kinetica* supports various aggregate and group-by queries, which group and
aggregate your data to return counts and useful statistics.

```python Calculate Count/Min/Mean/Max theme={null}
# Aggregate count, min, mean, and max on the trip distance
a1_resp = table_taxi_obj.aggregate_statistics(
    column_name = "trip_distance",
    stats = "count,min,max,mean"
)["stats"]

print("Statistics of values in the trip_distance column:")
print("\tCount: {count:5.0f}\n\tMin:   {min:5.2f}\n\tMean:  {mean:5.2f}\n\tMax:   {max:5.2f}\n" \
    "".format(**a1_resp))
```

```python Retrieve Unique Values theme={null}
# Find unique taxi vendor IDs
a2_resp = table_taxi_obj.aggregate_unique(
    column_name = "vendor_id",
    offset = 0,
    limit = gpudb.GPUdb.END_OF_SET,
    encoding = "json"
)["data"]["vendor_id"]

print("Unique vendor IDs in the taxi trip table:")
for vendor in a2_resp:
    print("\t* {}".format(vendor))
```

```python Report Number of Records per Group theme={null}
# Find number of trips per vendor
a3_resp = table_taxi_obj.aggregate_group_by(
    column_names = ["vendor_id", "COUNT(vendor_id)"],
    offset = 0,
    limit = gpudb.GPUdb.END_OF_SET,
    options = {"sort_by": "key"}
)["data"]

print("Trips per vendor:")
for vendor in zip(a3_resp["vendor_id"], a3_resp["COUNT(vendor_id)"]):
    print("\t{:<6s} {:3d}".format(vendor[0] + ":", vendor[1]))
```

```python Generate Histogram theme={null}
# Create a histogram for the different groups of passenger counts
a4_resp = table_taxi_obj.aggregate_histogram(
    column_name = "passenger_count",
    start = 1,
    end = 6,
    interval = 1
)["counts"]

print("Passenger count groups by size:")
print("{:<10s} {:<11s}".format("Passengers", "Total Trips"))
print("{:=<10s} {:=<11s}".format("", ""))

for histo_group in zip([1, 2, 3, 4, '>5'], a4_resp):
    print("{:>10} {:11.0f}".format(*histo_group))
```

## Joins

[Joins](/content/concepts/joins) allow you to link multiple *tables* together,
along their relations, retrieving associated information from any or all of
them.  Tables can only be *joined* if they're [sharded](/content/concepts/tables#sharded)
similarly or [replicated](/content/concepts/tables#replicated).    All join types can be
performed using the
[create\_join\_table](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.create_join_table)
method.  This method is static and is not called with a corresponding
[GPUdbTable](/content/api/python/source/gpudbtable) object.

An *inner join* returns only records that have matching values in both *tables*.

```python Inner Join theme={null}
# Retrieve payment information for rides having more than three passengers
gpudb.GPUdbTable.create_join_table(
    join_table_name = join_table_inner,
    table_names = [
        schema + ".taxi_trip_data as t",
        schema + ".payment as p"
    ],
    column_names = [
        "t.payment_id",
        "payment_type",
        "total_amount",
        "passenger_count",
        "vendor_id",
        "trip_distance"
    ],
    expressions = [
        "t.payment_id = p.payment_id",
        "passenger_count > 3"
    ],
    db = kdb
)
```

A *left join* returns all of the records an *inner join* does, but additionally,
for each record in the table on the *left* side of the *join* that has no match
along the relation to a record in the table on the *right* side of the *join*, a
corresponding record will be returned with "left-side" columns populated with
the "left-side" record data and the "right-side" columns populated with *nulls*.
Note the usage of `left join` in the given expression.

```python Left Join theme={null}
# Retrieve cab ride transactions and the full name of the associated vendor
# (if available--blank if vendor name is unknown) for transactions with
# associated payment data, sorting by increasing values of transaction ID.
gpudb.GPUdbTable.create_join_table(
    join_table_name = join_table_left,
    table_names = [
        schema + ".taxi_trip_data as t",
        schema + ".vendor as v"
    ],
    column_names = [
        "transaction_id",
        "pickup_datetime",
        "trip_distance",
        "t.vendor_id",
        "vendor_name"
    ],
    expressions = [
        "LEFT JOIN t, v ON (t.vendor_id = v.vendor_id)",
        "payment_id <> 0"
    ],
    db = kdb
)
```

<Info>
  Full outer *joins* require both tables to be replicated or joined on
  their shard keys. Set merges that perform deduplication of records, like
  [Union Distinct](/content/concepts/unions),
  [Intersect](/content/concepts/intersect), and
  [Except](/content/concepts/except) also need to use replicated tables to
  ensure the correct results, so a replicated version of the taxi
  (`taxi_trip_data_replicated`) *table* is created at this point in the
  tutorial using
  [create\_projection](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.create_projection).
</Info>

```python Create Replicated Table for Full Outer Join theme={null}
kdb.create_projection(
    table_name = table_taxi,
    projection_name = table_taxi_replicated,
    column_names = [
        "transaction_id",
        "payment_id",
        "vendor_id",
        "pickup_datetime",
        "dropoff_datetime",
        "passenger_count",
        "trip_distance",
        "pickup_longitude",
        "pickup_latitude",
        "dropoff_longitude",
        "dropoff_latitude"
    ],
    options = {"is_replicated": "true"}
)
```

A *full outer join* returns all of the records a *left join* does, but
additionally, for each record in the table on the *right* side of the *join*
that has no match along the relation to a record in the table on the *left* side
of the *join*, a corresponding record will be returned with "right-side" columns
populated with the "right-side" record data and the "left-side" columns
populated with *nulls*.

```python Full Outer Join theme={null}
# Retrieve the vendor IDs of known vendors with no recorded cab ride
# transactions, as well as the vendor ID and number of transactions for
# unknown vendors with recorded cab ride transactions
gpudb.GPUdbTable.create_join_table(
    join_table_name = join_table_outer,
    table_names = [
        schema + ".taxi_trip_data_replicated as t",
        schema + ".vendor as v"
    ],
    column_names = [
        "t.vendor_id as vendor_id",
        "v.vendor_id as vendor_id_1"
    ],
    expressions = ["FULL_OUTER JOIN t,v ON ((v.vendor_id = t.vendor_id))"],
    db = kdb
)
```

## Projections

You can create [projections](/content/concepts/projections) using the
[create\_projection](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.create_projection)
method.

```python Create a Temporary Projection theme={null}
# Create a projection containing all payments by credit card
table_payment_obj.create_projection(
    projection_name = projection_example1,
    column_names = [
        "payment_id",
        "payment_type",
        "credit_type",
        "payment_timestamp",
        "fare_amount",
        "surcharge",
        "mta_tax",
        "tip_amount",
        "tolls_amount",
        "total_amount"
    ],
    options = {"expression": "payment_type = 'Credit'"}
)
```

To persist a *projection*:

```python Create a Permanent Projection theme={null}
# Create a persisted table with cab ride transactions greater than 5 miles
# whose trip started during lunch hours
table_taxi_obj.create_projection(
    projection_name = projection_example2,
    column_names = [
        "HOUR(pickup_datetime) as hour_of_day",
        "vendor_id",
        "passenger_count",
        "trip_distance"
    ],
    options = {
        "expression":
            "(HOUR(pickup_datetime) >= 11) AND "
            "(HOUR(pickup_datetime) <= 14) AND "
            "(trip_distance > 5)",
        "persist": "true"
    }
)
```

## Union, Intersect, & Except

[Union](/content/concepts/unions) can be used to combine homogeneous data
sets into one larger data set.  When calling the
[create\_union](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.create_union)
method, a `mode` of `union` or `union_distinct` will both combine data
sets, but only retain the records that are unique across the chosen columns,
removing all duplicates. Using `union_all` will combine data sets, retaining
all records from the source data sets.

```python Set Union Retaining Duplicates theme={null}
# Calculate the average number of passengers, as well as the shortest,
# average, and longest trips for all trips in
# each of the two time periods--from April 1st through the 15th, 2015 and
# from April 16th through the 23rd, 2015--and return those two sets of
# statistics in a single result set.
table_taxi_obj.aggregate_group_by(
    column_names = [
        "AVG(passenger_count) as avg_pass_count",
        "AVG(trip_distance) as avg_trip_dist",
        "MIN(trip_distance) as min_trip_dist",
        "MAX(trip_distance) as max_trip_dist"
    ],
    offset = 0,
    limit = gpudb.GPUdb.END_OF_SET,
    options = {
        "expression":
            "(pickup_datetime >= '2015-04-01') AND "
            "(pickup_datetime <= '2015-04-15 23:59:59.999')",
        "result_table": agg_grpby_union_all_src1
    }
)
table_taxi_obj.aggregate_group_by(
    column_names = [
        "AVG(passenger_count) as avg_pass_count",
        "AVG(trip_distance) as avg_trip_dist",
        "MIN(trip_distance) as min_trip_dist",
        "MAX(trip_distance) as max_trip_dist"
    ],
    offset = 0,
    limit = gpudb.GPUdb.END_OF_SET,
    options = {
        "expression":
            "(pickup_datetime >= '2015-04-16') AND "
            "(pickup_datetime <= '2015-04-23 23:59:59.999')",
        "result_table": agg_grpby_union_all_src2
    }
)
gpudb.GPUdbTable.create_union(
    table_name = union_all_table,
    table_names = [
        agg_grpby_union_all_src1,
        agg_grpby_union_all_src2
    ],
    input_column_names = [
        ["'2015-04-01 - 2015-04-15'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"],
        ["'2015-04-16 - 2015-04-23'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"]
    ],
    output_column_names = [
        "pickup_window_range",
        "avg_pass_count",
        "avg_trip",
        "min_trip",
        "max_trip"
    ],
    options = {"mode": "union_all"},
    db = kdb
)
```

Calling the
[create\_union](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.create_union)
method with a `mode` of `intersect` will perform a set
[intersection](/content/concepts/intersect), which will combine data sets
but only include the records found in both data sets, removing duplicate result
records.  Using `intersect_all` will retain intersecting duplicates from both
sets.

```python Set Intersection theme={null}
# Retrieve locations (as lat/lon pairs) that were both pick-up and drop-off points
gpudb.GPUdbTable.create_union(
    table_name = union_intersect_table,
    table_names = [
        table_taxi_replicated,
        table_taxi_replicated
    ],
    input_column_names = [
        ["pickup_latitude", "pickup_longitude"],
        ["dropoff_latitude", "dropoff_longitude"]
    ],
    output_column_names = ["latitude", "longitude"],
    options = {"mode": "intersect"},
    db = kdb
)
```

Calling the
[create\_union](/content/api/python/source/gpudbtable#gpudb.GPUdbTable.create_union)
method with a `mode` of `except` will perform a set
[except](/content/concepts/except) (subtraction), which will return records
that appear in the first data set but not the second data set.  Note that the
data sets on each side of the *except* will have duplicates removed *first*, and
then the set subtraction will be processed.  Using `except_all` will retain
duplicates from the first set.

```python Set Exception (Subtraction) theme={null}
# Show vendors that operate before noon, but not after noon: retrieve the
# unique list of IDs of vendors who provided cab rides between midnight
# and noon, and remove from that list the IDs of any vendors who provided
# cab rides between noon and midnight
table_taxi_replicated_obj.create_projection(
    projection_name = projection_except_src1,
    column_names = ["vendor_id"],
    options = {
        "expression":
            "(HOUR(pickup_datetime) >= 0) AND "
            "(HOUR(pickup_datetime) <= 11)"
    }
)
table_taxi_replicated_obj.create_projection(
    projection_name = projection_except_src2,
    column_names = ["vendor_id"],
    options = {
        "expression":
            "(HOUR(pickup_datetime) >= 12) AND "
            "(HOUR(pickup_datetime) <= 23)"
    }
)
gpudb.GPUdbTable.create_union(
    table_name = union_except_table,
    table_names = [
        projection_except_src1,
        projection_except_src2
    ],
    input_column_names = [
        ["vendor_id"],
        ["vendor_id"]
    ],
    output_column_names = ["vendor_id"],
    options = {"mode": "except"},
    db = kdb
)
```

<a id="complete-sample-python" />

## Download & Run

Included below is a complete example containing all the above requests, the data
file, and output.

* [Python Tutorial Script](https://raw.githubusercontent.com/kineticadb/kinetica-docs/master/content/examples/python/python_tutorial.py)
* [Taxi Data File](https://raw.githubusercontent.com/kineticadb/kinetica-docs/master/content/examples/data/taxi_trip_data.csv)
* [Python Output](https://raw.githubusercontent.com/kineticadb/kinetica-docs/master/content/examples/python/python_tutorial.out)

To run the complete sample, ensure that:

* the <Badge color="gray">python\_tutorial.py</Badge> script is in the current directory
* the <Badge color="gray">taxi\_trip\_data.csv</Badge> file is in the current directory or use
  the `data_dir` parameter to specify the local directory containing it

Then, run the following:

```bash title="Run Example" theme={null}
python python_tutorial.py [--url <kinetica_url>] --username <username> --password <password> [--data_dir <data_file_directory>]
```

<Info>
  As this script creates a schema and several database objects within
  it, [system admin permission](/content/security/sec_concepts#security-concepts-permissions-system) is
  required to run it.
</Info>
