The following guide provides step-by-step instructions for getting started writing Python applications using GPUdb. It demonstrates only a small subset of the available API; a detailed description of the complete interface is available under Python API Reference.
The source code for the Python API is available for download from GitHub at GPUdb/gpudb-api-python. Follow the instructions in the included README file to install the API package.
To connect to the GPUdb server, create an object of the GPUdb class, providing the IP address and port number of the server:
import gpudb
gpudb_handle = gpudb.GPUdb(encoding = 'BINARY', host = '127.0.0.1', port = '9191')
If the server is configured to authenticate users, then a username and password will also need to be provided when creating the object, as follows:
gpudb_handle = gpudb.GPUdb(encoding='BINARY', host='172.22.33.44', port='9191', username='user1', password='mysecret')
Before any data can be loaded into the system, a Type needs to be defined in the system. The type definition is a JSON string describing the fields (i.e., columns) of the type, along with a name for the type. Each field consists of a name and a data type.
my_type = '''{
    "type": "record",
    "name": "polygon_point",
    "fields": [
        {"name": "col1", "type": "double"},
        {"name": "col2", "type": "string"}
    ]
}'''
response = gpudb_handle.create_type(type_definition = my_type, label = 'type_id_1')
The object returned from the create_type call contains a unique type identifier allocated by the system. This identifier can then be used in the request to create a new table, as follows:
type_id_1 = response['type_id']
response = gpudb_handle.create_table(table_name = 'my_table_1', type_id = type_id_1)
Once the table is created, data can be inserted as follows, where encoded_obj_list is a list of records encoded to match the table's type:
gpudb_handle.insert_records(table_name = 'my_table_1', data = encoded_obj_list, list_encoding = 'json', options = {})
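With the json list_encoding, each element of data is one record serialized as a JSON string. A minimal sketch of building such a list for the polygon_point type, using only the standard json module (no server required):

```python
import json

# Build nine JSON-encoded records matching the polygon_point type
# defined earlier: one JSON string per record.
encoded_obj_list = [
    json.dumps({"col1": v + 0.1, "col2": "string %d" % v})
    for v in range(1, 10)
]
print(encoded_obj_list[0])  # e.g. {"col1": 1.1, "col2": "string 1"}
```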
Once the table is populated with data, the data can be retrieved from the system by a call to get_records, as shown below:
response = gpudb_handle.get_records(table_name='my_table_1', offset=0, limit=100, encoding='json', options={})
The results are returned in the response, along with the schema of the results. For large tables, the data can easily be retrieved in smaller blocks by using the offset and limit parameters.
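The offset/limit paging loop can be sketched generically. Here fetch_page is a stand-in for the get_records call (it pages through a hypothetical 250-record table), not part of the GPUdb API:

```python
# Stand-in for gpudb_handle.get_records: returns one page of records
# from a hypothetical 250-record table, honoring offset and limit.
TOTAL = 250

def fetch_page(offset, limit):
    return list(range(offset, min(offset + limit, TOTAL)))

# Page through the full table in blocks of 100 records.
limit = 100
offset = 0
all_records = []
while True:
    page = fetch_page(offset, limit)
    all_records.extend(page)
    if len(page) < limit:   # a short page means the end was reached
        break
    offset += limit

print(len(all_records))  # -> 250
```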
To filter and retrieve only a subset of the records, use the filter function with an expression, as follows:
gpudb_handle.filter(table_name='my_table_1', view_name='my_resultset', expression='col1 = 1.1')
To retrieve all the records matching a known list of values:
gpudb_handle.filter_by_list(table_name = 'my_table_1', view_name = 'my_second_resultset', column_values_map = {'col1': ['1.1', '2.1', '5.1']})
To retrieve a list of all the unique values for a column, a set of columns, or an expression, use aggregate_unique:
response = gpudb_handle.aggregate_unique(table_name = 'my_table_1', column_name = 'group_id', offset = 0, limit = 20, encoding = 'json')
GPUdb supports various group-by queries. To group by one or more columns and return the count of the unique values, use the aggregate_group_by function as shown below. The second query shows how to also compute the count, sum, and average of the values:
print gpudb_handle.aggregate_group_by(table_name='my_table_1', column_names=['group_id'], offset=0, limit=1000, encoding='json')
groupby_col_names = ['group_id', 'count(*)', 'sum(col1)', 'avg(col1)']
print gpudb_handle.aggregate_group_by(table_name='my_table_1', column_names=groupby_col_names, offset=0, limit=1000, encoding='json')["json_encoded_response"]
GPUdb supports grouping numerical data into a histogram. The input range is divided into equal-sized bins, and the count of records in each bin is returned:
print gpudb_handle.aggregate_histogram(table_name='my_table_1', column_name='col1', start=0, end=11, interval=1)
Included below is a complete sample program containing all of the above queries:
import cStringIO
import collections
from avro import schema, io
import gpudb
def gpudb_example():
    """An example of how to use the GPUdb Python API."""
    new_line = '\n'

    # Handle to the database
    h_db = gpudb.GPUdb(encoding = 'BINARY', host = '127.0.0.1', port = '9191')
    my_table = 'my_table_1'

    # Data type for the table
    my_type = """
    {
        "type": "record",
        "name": "my_type_1",
        "fields": [
            {"name":"col1","type":"double"},
            {"name":"col2","type":"string"},
            {"name":"group_id","type":"string"}
        ]
    } """.replace(' ', '').replace('\n', '')

    # Create the data type in the DB
    response = h_db.create_type( type_definition = my_type, label = 'my_type_lb_1' )
    type_id_1 = response['type_id']
    print 'GPUdb generated type id for the new type - ', type_id_1, new_line

    # Create a table with the given data type
    response = h_db.create_table( table_name = my_table, type_id = type_id_1 )

    # Generate data to be inserted into the table
    encoded_obj_list = []
    for val in range(1, 10):
        datum = collections.OrderedDict()
        datum["col1"] = val + 0.1
        datum["col2"] = 'string ' + str(val)
        datum["group_id"] = 'Group 1'
        # Encode the datum appropriately to prepare for insertion
        encoded_obj_list.append(h_db.encode_datum(my_type, datum))

    # Optional parameter that enables returning IDs for the
    # newly inserted records
    options = {'return_record_ids': 'true'}

    # Insert the records into the table
    response = h_db.insert_records( table_name = my_table,
                                    data = encoded_obj_list,
                                    list_encoding = 'binary',
                                    options = options )
    print "Record Ids for %d new records - %s" % (response['count_inserted'], response['record_ids']), new_line

    # Retrieve records from a table.  Note that the records are
    # stringified and have to be parsed
    response = h_db.get_records( table_name = my_table, offset = 0, limit = 100,
                                 encoding = 'json', options = {} )
    print "Returned records ", response['records_json'], new_line

    # Filter records into a view.  The response contains the count only
    response = h_db.filter( table_name = my_table, view_name = 'my_table_view',
                            expression = 'col1 = 1.1' )
    print "Number of records returned by filter expression ", response['count'], new_line

    # Read the filtered records from the view (exactly as reading from a table)
    response = h_db.get_records( table_name = 'my_table_view',
                                 offset = 0, limit = 100,
                                 encoding = 'json', options = {} )
    print "Filtered records ", response['records_json'], new_line

    # Drop the view
    h_db.clear_table('my_table_view')

    # Filter expression with two columns on the original table
    response = h_db.filter( my_table, 'my_table_view', 'col1 <= 9 and group_id="Group 1"' )
    print "Number of records returned by second filter expression ", response['count'], new_line

    # Fetch the records from the view
    response = h_db.get_records( table_name = 'my_table_view',
                                 offset = 0, limit = 100, encoding = 'json',
                                 options = {} )
    print "Returned records ", response['records_json'], new_line

    # Filter by a list.  The query is executed on the result set from the
    # previous query (query chaining)
    response = h_db.filter_by_list( table_name = 'my_table_view',
                                    view_name = 'my_table_view_2',
                                    column_values_map = {'col1': ['1.1', '2.1', '5.1']} )
    print "Number of records returned by filter expression ", response['count'], new_line

    # Fetch the records
    response = h_db.get_records( table_name = 'my_table_view_2',
                                 offset = 0, limit = 100,
                                 encoding = 'json', options = {} )
    print "Returned records filtered by list: ", response['records_json'], new_line

    # Filter by a range of values (numeric values only)
    response = h_db.filter_by_range( table_name = my_table,
                                     view_name = 'my_table_view_3',
                                     column_name = 'col1',
                                     lower_bound = 1,
                                     upper_bound = 5 )
    print "Number of records returned by filter expression ", response['count'], new_line

    response = h_db.get_records( table_name = 'my_table_view_3', offset = 0,
                                 limit = 100, encoding = 'binary', options = {} )

    # Decode the binary encoded response
    print "Returned records filtered by range: "
    parsed_schema = schema.parse( response['type_schema'] )
    reader = io.DatumReader( parsed_schema )
    for bin_record in response['records_binary']:
        str_IO = cStringIO.StringIO( bin_record )
        bin_decoder = io.BinaryDecoder( str_IO )
        decoded_response = reader.read( bin_decoder )
        print decoded_response, new_line

    # Basic statistics on a column
    response = h_db.aggregate_statistics( table_name = my_table,
                                          column_name = 'col1',
                                          stats = 'count,sum,mean' )
    print "Statistics of values in col1 ", response['stats'], new_line

    # Insert a second group of records
    encoded_obj_list = []
    for val in range(1, 8):
        datum = collections.OrderedDict()
        datum["col1"] = val + 10.1
        datum["col2"] = 'string ' + str(val)
        datum["group_id"] = 'Group 2'
        encoded_obj_list.append(h_db.encode_datum(my_type, datum))
    h_db.insert_records( my_table, encoded_obj_list, 'binary', {} )

    # Find unique values in a column
    response = h_db.aggregate_unique( table_name = my_table,
                                      column_name = 'group_id',
                                      offset = 0, limit = 20,
                                      encoding = 'json' )
    print 'Unique values in group_id column ', response['json_encoded_response'], new_line

    # Group by
    groupby_col_names = ['col2']
    retval = h_db.aggregate_group_by( table_name = my_table,
                                      column_names = groupby_col_names,
                                      offset = 0, limit = 1000,
                                      encoding = 'json' )
    print "Group by results ", retval['json_encoded_response'], new_line

    # Aggregate values
    groupby_col_names = ['group_id', "count(*)", 'sum(col1)', 'avg(col1)']
    retval = h_db.aggregate_group_by( table_name = my_table,
                                      column_names = groupby_col_names,
                                      offset = 0, limit = 1000,
                                      encoding = 'json' )
    print "Group by results ", retval['json_encoded_response'], new_line

    # Do another aggregate group by operation
    groupby_col_names = ['group_id', 'sum(col1*10)']
    retval = h_db.aggregate_group_by( table_name = my_table,
                                      column_names = groupby_col_names,
                                      offset = 0, limit = 1000,
                                      encoding = 'json' )
    print "Group by results ", retval['json_encoded_response'], new_line

    # Add more data; reset the list first so the previously inserted
    # records are not inserted (and duplicated) a second time
    encoded_obj_list = []
    for val in range(4, 10):
        datum = collections.OrderedDict()
        datum["col1"] = val + 0.6
        datum["col2"] = 'string 2' + str(val)
        datum["group_id"] = 'Group 1'
        encoded_obj_list.append(h_db.encode_datum(my_type, datum))
    h_db.insert_records( table_name = my_table,
                         data = encoded_obj_list,
                         list_encoding = 'binary',
                         options = {} )

    # Histogram of col1 values between 1.1 and 2
    histogram_result = h_db.aggregate_histogram( table_name = my_table,
                                                 column_name = 'col1',
                                                 start = 1.1, end = 2,
                                                 interval = 1 )
    print "histogram result:", histogram_result, new_line

    # Drop the table (will automatically drop all views on the table)
    h_db.clear_table(my_table)

    # Check that clearing a table automatically drops all the dependent views
    response = h_db.get_records( table_name = 'my_table_view',
                                 offset = 0, limit = 100,
                                 encoding = 'json', options = {} )
    assert (response['status_info']['status'] == "ERROR"), \
        "Problem: View on deleted table found!!!"
    print ("Response status and message : ", response['status_info']['status'],
           ' - ', response['status_info']['message'], new_line)
# end gpudb_example()


if __name__ == '__main__':
    gpudb_example()
Running the program produces output similar to the following:
GPUdb generated type id for the new type - 8028811435171830948
Record Ids for 9 new records - [u'0010200000000000_0000000000000000', u'0010200000000000_0000000000000001', u'0010200000000000_0000000000000002', u'0010200000000000_0000000000000003', u'0010200000000000_0000000000000004', u'0010200000000000_0000000000000005', u'0010200000000000_0000000000000006', u'0010200000000000_0000000000000007', u'0010200000000000_0000000000000008']
Returned records [u'{"col1":1.1,"col2":"string 1","group_id":"Group 1"}', u'{"col1":2.1,"col2":"string 2","group_id":"Group 1"}', u'{"col1":3.1,"col2":"string 3","group_id":"Group 1"}', u'{"col1":4.1,"col2":"string 4","group_id":"Group 1"}', u'{"col1":5.1,"col2":"string 5","group_id":"Group 1"}', u'{"col1":6.1,"col2":"string 6","group_id":"Group 1"}', u'{"col1":7.1,"col2":"string 7","group_id":"Group 1"}', u'{"col1":8.1,"col2":"string 8","group_id":"Group 1"}', u'{"col1":9.1,"col2":"string 9","group_id":"Group 1"}']
Number of records returned by filter expression 1
Filtered records [u'{"col1":1.1,"col2":"string 1","group_id":"Group 1"}']
Number of records returned by second filter expression 8
Returned records [u'{"col1":1.1,"col2":"string 1","group_id":"Group 1"}', u'{"col1":2.1,"col2":"string 2","group_id":"Group 1"}', u'{"col1":3.1,"col2":"string 3","group_id":"Group 1"}', u'{"col1":4.1,"col2":"string 4","group_id":"Group 1"}', u'{"col1":5.1,"col2":"string 5","group_id":"Group 1"}', u'{"col1":6.1,"col2":"string 6","group_id":"Group 1"}', u'{"col1":7.1,"col2":"string 7","group_id":"Group 1"}', u'{"col1":8.1,"col2":"string 8","group_id":"Group 1"}']
Number of records returned by filter expression 3
Returned records filtered by list: [u'{"col1":1.1,"col2":"string 1","group_id":"Group 1"}', u'{"col1":2.1,"col2":"string 2","group_id":"Group 1"}', u'{"col1":5.1,"col2":"string 5","group_id":"Group 1"}']
Number of records returned by filter expression 4
Returned records filtered by range: {u'group_id': u'Group 1', u'col2': u'string 1', u'col1': 1.1}
{u'group_id': u'Group 1', u'col2': u'string 2', u'col1': 2.1}
{u'group_id': u'Group 1', u'col2': u'string 3', u'col1': 3.1}
{u'group_id': u'Group 1', u'col2': u'string 4', u'col1': 4.1}
Statistics of values in col1 {u'count': 9.0, u'sum': 45.9, u'mean': 5.1}
Unique values in group_id column {"column_1":["Group 1","Group 2"],"column_headers":["group_id"]}
Group by results {"column_1":["string 1","string 2","string 3","string 4","string 5","string 6","string 7","string 8","string 9"],"column_2":[2,2,2,2,2,2,2,1,1],"column_headers":["col2","count(*)"]}
Group by results {"column_1":["Group 1","Group 2"],"column_2":[9,7],"column_3":[45.90000000000001,98.69999999999999],"column_4":[5.100000000000001,14.1],"column_headers":["group_id","count(*)","sum(col1)","avg(col1)"]}
Group by results {"column_1":["Group 1","Group 2"],"column_2":[459,987],"column_headers":["group_id","sum(col1*10)"]}
histogram result: {u'start': 1.1, u'counts': [1.0], u'end': 2.0, 'status_info': {u'status': u'OK', u'data_type': u'aggregate_histogram_response_avro', u'message': u'', 'response_time': 0.00244}}
('Response status and message : ', u'ERROR', ' - ', u"Couldn't find the set id: my_table_view (HS/TWGSSc:674)", '\n')