The following guide provides step-by-step instructions for getting started writing Python applications using GPUdb. It demonstrates only a small subset of the available API; a detailed description of the complete interface is available under Python API Reference.
The source code for the Python API is available for download from GitHub at GPUdb/gpudb-api-python. Follow the instructions in the included README file to install the API package.
To connect to the GPUdb server, create an object of the GPUdb class, providing the IP address and port number of the server:
import gpudb
gpudb_handle = gpudb.GPUdb(encoding = 'BINARY', host = '127.0.0.1', port = '9191')
If the server is configured to authenticate users, then a username and password will also need to be provided when creating the object, as follows:
gpudb_handle = gpudb.GPUdb(encoding='BINARY', host='172.22.33.44', port='9191', username='user1', password='mysecret')
Before any data can be loaded into the system, a Type needs to be defined in the system. The type definition is a JSON string describing the fields (i.e., columns) of the type, along with a name for the type. Each field consists of a name and a data type.
my_type = '''{
    "type": "record",
    "name": "polygon_point",
    "fields": [
        {"name": "col1", "type": "double"},
        {"name": "col2", "type": "string"}
    ]
}'''
response = gpudb_handle.create_type(type_definition = my_type, label = 'type_id_1')
The object returned from the create_type call contains a unique type identifier allocated by the system. This identifier can then be used in the request to create a new table, as follows:
type_id_1 = response['type_id']
response = gpudb_handle.create_table(table_name = 'my_table_1', type_id = type_id_1)
Once the table is created, data can be inserted as follows, where encoded_obj_list is a list of records encoded to match the table's type:
gpudb_handle.insert_records(table_name = 'my_table_1', data = encoded_obj_list, list_encoding = 'json', options = {})
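With the json list_encoding, each element of data is one record serialized as a JSON string. A minimal sketch of building such a list for the polygon_point type, using only the standard json module (no server required):

```python
import json

# Build nine JSON-encoded records matching the polygon_point type
# defined earlier: one JSON string per record.
encoded_obj_list = [
    json.dumps({"col1": v + 0.1, "col2": "string %d" % v})
    for v in range(1, 10)
]
print(encoded_obj_list[0])  # e.g. {"col1": 1.1, "col2": "string 1"}
```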
Once the table is populated with data, the data can be retrieved from the system by a call to get_records, as shown below:
response = gpudb_handle.get_records(table_name='my_table_1', offset=0, limit=100, encoding='json', options={})
The results are returned in the response, along with the schema of the results. For large tables, the data can easily be retrieved in smaller blocks by using the offset and limit parameters.
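The offset/limit paging loop can be sketched generically. Here fetch_page is a stand-in for the get_records call (it pages through a hypothetical 250-record table), not part of the GPUdb API:

```python
# Stand-in for gpudb_handle.get_records: returns one page of records
# from a hypothetical 250-record table, honoring offset and limit.
TOTAL = 250

def fetch_page(offset, limit):
    return list(range(offset, min(offset + limit, TOTAL)))

# Page through the full table in blocks of 100 records.
limit = 100
offset = 0
all_records = []
while True:
    page = fetch_page(offset, limit)
    all_records.extend(page)
    if len(page) < limit:   # a short page means the end was reached
        break
    offset += limit

print(len(all_records))  # -> 250
```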
To filter and retrieve only a subset of the records, use the filter function with an expression, as follows:
gpudb_handle.filter(table_name='my_table_1', view_name='my_resultset', expression='col1 = 1.1')
To retrieve all the records matching a known list of values:
gpudb_handle.filter_by_list(table_name = 'my_table_1', view_name = 'my_second_resultset', column_values_map = {'col1': ['1.1', '2.1', '5.1']})
To retrieve a list of all the unique values for a column, a set of columns, or an expression, use aggregate_unique:
response = gpudb_handle.aggregate_unique(table_name = 'my_table_1', column_name = 'group_id', offset = 0, limit = 20, encoding = 'json')
GPUdb supports various group-by queries. To group by one or more columns and return the count of the unique values, use the aggregate_group_by function as shown below. The second query shows how to also compute the count, sum, and average of the values:
print gpudb_handle.aggregate_group_by(table_name='my_table_1', column_names=['group_id'], offset=0, limit=1000, encoding='json')
groupby_col_names = ['group_id', 'count(*)', 'sum(col1)', 'avg(col1)']
print gpudb_handle.aggregate_group_by(table_name='my_table_1', column_names=groupby_col_names, offset=0, limit=1000, encoding='json')["json_encoded_response"]
GPUdb supports grouping numerical data into a histogram. The input range is divided into equal-sized bins, and the count of records in each bin is returned:
print gpudb_handle.aggregate_histogram(table_name='my_table_1', column_name='col1', start=0, end=11, interval=1)
Included below is a complete sample program containing all of the above queries:
import cStringIO
import collections
from avro import schema, io
import gpudb
def gpudb_example():
    """An example of how to use the GPUdb Python API."""
    new_line = '\n'

    # Handle to the database
    h_db = gpudb.GPUdb(encoding = 'BINARY', host = '127.0.0.1', port = '9191')
    my_table = 'my_table_1'

    # Data type for the table
    my_type = """
    {
        "type": "record",
        "name": "my_type_1",
        "fields": [
            {"name":"col1","type":"double"},
            {"name":"col2","type":"string"},
            {"name":"group_id","type":"string"}
        ]
    } """.replace(' ', '').replace('\n', '')

    # Create the data type in the DB
    response = h_db.create_type( type_definition = my_type, label = 'my_type_lb_1' )
    type_id_1 = response['type_id']
    print 'GPUdb generated type id for the new type - ', type_id_1, new_line

    # Create a table with the given data type
    response = h_db.create_table( table_name = my_table, type_id = type_id_1 )

    # Generate data to be inserted into the table
    encoded_obj_list = []
    for val in range(1, 10):
        datum = collections.OrderedDict()
        datum["col1"] = val + 0.1
        datum["col2"] = 'string ' + str(val)
        datum["group_id"] = 'Group 1'
        # Encode the datum appropriately to prepare for insertion
        encoded_obj_list.append(h_db.encode_datum(my_type, datum))

    # Optional parameter that enables returning IDs for the
    # newly inserted records
    options = {'return_record_ids': 'true'}

    # Insert the records into the table
    response = h_db.insert_records( table_name = my_table,
                                    data = encoded_obj_list,
                                    list_encoding = 'binary',
                                    options = options )
    print "Record Ids for %d new records - %s" % (response['count_inserted'], response['record_ids']), new_line

    # Retrieve records from a table.  Note that the records are
    # stringified and have to be parsed
    response = h_db.get_records( table_name = my_table, offset = 0, limit = 100,
                                 encoding = 'json', options = {} )
    print "Returned records ", response['records_json'], new_line

    # Filter records into a view.  The response contains the count only
    response = h_db.filter( table_name = my_table, view_name = 'my_table_view',
                            expression = 'col1 = 1.1' )
    print "Number of records returned by filter expression ", response['count'], new_line

    # Read the filtered records from the view (exactly as reading from a table)
    response = h_db.get_records( table_name = 'my_table_view',
                                 offset = 0, limit = 100,
                                 encoding = 'json', options = {} )
    print "Filtered records ", response['records_json'], new_line

    # Drop the view
    h_db.clear_table('my_table_view')

    # Filter expression with two columns on the original table
    response = h_db.filter( my_table, 'my_table_view', 'col1 <= 9 and group_id="Group 1"' )
    print "Number of records returned by second filter expression ", response['count'], new_line

    # Fetch the records from the view
    response = h_db.get_records( table_name = 'my_table_view',
                                 offset = 0, limit = 100, encoding = 'json',
                                 options = {} )
    print "Returned records ", response['records_json'], new_line

    # Filter by a list.  The query is executed on the result set from the
    # previous query (query chaining)
    response = h_db.filter_by_list( table_name = 'my_table_view',
                                    view_name = 'my_table_view_2',
                                    column_values_map = {'col1': ['1.1', '2.1', '5.1']} )
    print "Number of records returned by filter expression ", response['count'], new_line

    # Fetch the records
    response = h_db.get_records( table_name = 'my_table_view_2',
                                 offset = 0, limit = 100,
                                 encoding = 'json', options = {} )
    print "Returned records filtered by list: ", response['records_json'], new_line

    # Filter by a range of values (numeric values only)
    response = h_db.filter_by_range( table_name = my_table,
                                     view_name = 'my_table_view_3',
                                     column_name = 'col1',
                                     lower_bound = 1,
                                     upper_bound = 5 )
    print "Number of records returned by filter expression ", response['count'], new_line

    response = h_db.get_records( table_name = 'my_table_view_3', offset = 0,
                                 limit = 100, encoding = 'binary', options = {} )

    # Decode the binary encoded response
    print "Returned records filtered by range: "
    parsed_schema = schema.parse( response['type_schema'] )
    reader = io.DatumReader( parsed_schema )
    for bin_record in response['records_binary']:
        str_IO = cStringIO.StringIO( bin_record )
        bin_decoder = io.BinaryDecoder( str_IO )
        decoded_response = reader.read( bin_decoder )
        print decoded_response, new_line

    # Basic statistics on a column
    response = h_db.aggregate_statistics( table_name = my_table,
                                          column_name = 'col1',
                                          stats = 'count,sum,mean' )
    print "Statistics of values in col1 ", response['stats'], new_line

    # Insert a second group of records
    encoded_obj_list = []
    for val in range(1, 8):
        datum = collections.OrderedDict()
        datum["col1"] = val + 10.1
        datum["col2"] = 'string ' + str(val)
        datum["group_id"] = 'Group 2'
        encoded_obj_list.append(h_db.encode_datum(my_type, datum))
    h_db.insert_records( my_table, encoded_obj_list, 'binary', {} )

    # Find unique values in a column
    response = h_db.aggregate_unique( table_name = my_table,
                                      column_name = 'group_id',
                                      offset = 0, limit = 20,
                                      encoding = 'json' )
    print 'Unique values in group_id column ', response['json_encoded_response'], new_line

    # Group by
    groupby_col_names = ['col2']
    retval = h_db.aggregate_group_by( table_name = my_table,
                                      column_names = groupby_col_names,
                                      offset = 0, limit = 1000,
                                      encoding = 'json' )
    print "Group by results ", retval['json_encoded_response'], new_line

    # Aggregate values
    groupby_col_names = ['group_id', "count(*)", 'sum(col1)', 'avg(col1)']
    retval = h_db.aggregate_group_by( table_name = my_table,
                                      column_names = groupby_col_names,
                                      offset = 0, limit = 1000,
                                      encoding = 'json' )
    print "Group by results ", retval['json_encoded_response'], new_line

    # Do another aggregate group by operation
    groupby_col_names = ['group_id', 'sum(col1*10)']
    retval = h_db.aggregate_group_by( table_name = my_table,
                                      column_names = groupby_col_names,
                                      offset = 0, limit = 1000,
                                      encoding = 'json' )
    print "Group by results ", retval['json_encoded_response'], new_line

    # Add more data; reset the list first so the previously inserted
    # records are not inserted (and duplicated) a second time
    encoded_obj_list = []
    for val in range(4, 10):
        datum = collections.OrderedDict()
        datum["col1"] = val + 0.6
        datum["col2"] = 'string 2' + str(val)
        datum["group_id"] = 'Group 1'
        encoded_obj_list.append(h_db.encode_datum(my_type, datum))
    h_db.insert_records( table_name = my_table,
                         data = encoded_obj_list,
                         list_encoding = 'binary',
                         options = {} )

    # Histogram of col1 values between 1.1 and 2
    histogram_result = h_db.aggregate_histogram( table_name = my_table,
                                                 column_name = 'col1',
                                                 start = 1.1, end = 2,
                                                 interval = 1 )
    print "histogram result:", histogram_result, new_line

    # Drop the table (will automatically drop all views on the table)
    h_db.clear_table(my_table)

    # Check that clearing a table automatically drops all the dependent views
    response = h_db.get_records( table_name = 'my_table_view',
                                 offset = 0, limit = 100,
                                 encoding = 'json', options = {} )
    assert (response['status_info']['status'] == "ERROR"), \
        "Problem: View on deleted table found!!!"
    print ("Response status and message : ", response['status_info']['status'],
           ' - ', response['status_info']['message'], new_line)
# end gpudb_example()


if __name__ == '__main__':
    gpudb_example()
Running the program produces output similar to the following:
GPUdb generated type id for the new type - 8028811435171830948
Record Ids for 9 new records - [u'0010200000000000_0000000000000000', u'0010200000000000_0000000000000001', u'0010200000000000_0000000000000002', u'0010200000000000_0000000000000003', u'0010200000000000_0000000000000004', u'0010200000000000_0000000000000005', u'0010200000000000_0000000000000006', u'0010200000000000_0000000000000007', u'0010200000000000_0000000000000008']
Returned records [u'{"col1":1.1,"col2":"string 1","group_id":"Group 1"}', u'{"col1":2.1,"col2":"string 2","group_id":"Group 1"}', u'{"col1":3.1,"col2":"string 3","group_id":"Group 1"}', u'{"col1":4.1,"col2":"string 4","group_id":"Group 1"}', u'{"col1":5.1,"col2":"string 5","group_id":"Group 1"}', u'{"col1":6.1,"col2":"string 6","group_id":"Group 1"}', u'{"col1":7.1,"col2":"string 7","group_id":"Group 1"}', u'{"col1":8.1,"col2":"string 8","group_id":"Group 1"}', u'{"col1":9.1,"col2":"string 9","group_id":"Group 1"}']
Number of records returned by filter expression 1
Filtered records [u'{"col1":1.1,"col2":"string 1","group_id":"Group 1"}']
Number of records returned by second filter expression 8
Returned records [u'{"col1":1.1,"col2":"string 1","group_id":"Group 1"}', u'{"col1":2.1,"col2":"string 2","group_id":"Group 1"}', u'{"col1":3.1,"col2":"string 3","group_id":"Group 1"}', u'{"col1":4.1,"col2":"string 4","group_id":"Group 1"}', u'{"col1":5.1,"col2":"string 5","group_id":"Group 1"}', u'{"col1":6.1,"col2":"string 6","group_id":"Group 1"}', u'{"col1":7.1,"col2":"string 7","group_id":"Group 1"}', u'{"col1":8.1,"col2":"string 8","group_id":"Group 1"}']
Number of records returned by filter expression 3
Returned records filtered by list: [u'{"col1":1.1,"col2":"string 1","group_id":"Group 1"}', u'{"col1":2.1,"col2":"string 2","group_id":"Group 1"}', u'{"col1":5.1,"col2":"string 5","group_id":"Group 1"}']
Number of records returned by filter expression 4
Returned records filtered by range: {u'group_id': u'Group 1', u'col2': u'string 1', u'col1': 1.1}
{u'group_id': u'Group 1', u'col2': u'string 2', u'col1': 2.1}
{u'group_id': u'Group 1', u'col2': u'string 3', u'col1': 3.1}
{u'group_id': u'Group 1', u'col2': u'string 4', u'col1': 4.1}
Statistics of values in col1 {u'count': 9.0, u'sum': 45.9, u'mean': 5.1}
Unique values in group_id column {"column_1":["Group 1","Group 2"],"column_headers":["group_id"]}
Group by results {"column_1":["string 1","string 2","string 3","string 4","string 5","string 6","string 7","string 8","string 9"],"column_2":[2,2,2,2,2,2,2,1,1],"column_headers":["col2","count(*)"]}
Group by results {"column_1":["Group 1","Group 2"],"column_2":[9,7],"column_3":[45.90000000000001,98.69999999999999],"column_4":[5.100000000000001,14.1],"column_headers":["group_id","count(*)","sum(col1)","avg(col1)"]}
Group by results {"column_1":["Group 1","Group 2"],"column_2":[459,987],"column_headers":["group_id","sum(col1*10)"]}
histogram result: {u'start': 1.1, u'counts': [1.0], u'end': 2.0, 'status_info': {u'status': u'OK', u'data_type': u'aggregate_histogram_response_avro', u'message': u'', 'response_time': 0.00244}}
('Response status and message : ', u'ERROR', ' - ', u"Couldn't find the set id: my_table_view (HS/TWGSSc:674)", '\n')