Page Rank in Python
A Page Rank example using the NYC Taxi dataset
The following is a complete example, using the Python API, of solving a graph created with NYC Taxi data for a page rank problem via the /solve/graph endpoint. For more information on Network Graphs & Solvers, see Graphs & Solvers Concepts.
The prerequisites for running the page rank solve graph example are listed below:
The native Kinetica Python API is accessible through the following means:
The Python package manager, pip, is required to install the API from PyPI.
Install the API:
Test the installation:
If Import Successful is displayed, the API has been installed and is ready for use.
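The same check can also be run from inside a Python session. The sketch below assumes only that the API's importable module is named gpudb and that Python 3.4 or later is in use (for importlib.util.find_spec); the api_installed helper is a hypothetical name introduced for illustration. It looks the module up without importing it, so it runs whether or not the API is present.

```python
import importlib.util

# Assumption: the Kinetica Python API is importable under the name "gpudb"
MODULE_NAME = "gpudb"


def api_installed(module_name=MODULE_NAME):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if api_installed():
    print("Import Successful")
else:
    print("The %s module was not found; check the installation" % MODULE_NAME)
```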
In the desired directory, run the following, but be sure to replace <kinetica-version> with the name of the installed Kinetica version, e.g., v7.2:
Change directory into the newly downloaded repository:
In the root directory of the unzipped repository, install the Kinetica API:
Test the installation (Python 2.7 or greater is required to run the API example):
By default, the example script looks for the nyc_neighborhood.csv data file, mentioned in the Prerequisites, in the current local directory.
A different directory can be specified as a parameter when running the example script.
Several constants are defined at the beginning of the script:
SCHEMA -- the name of the schema in which the tables supporting the graph creation and match operations will be created
Important
The schema is created during the table setup portion of the script because the schema must exist prior to creating the tables that will later support the graph creation and match operations.
TABLE_NYC_N -- the name of the table into which the NYC Neighborhood dataset is loaded. This dataset is joined to the TABLE_TAXI table to create the JOIN_TAXI dataset.
TABLE_TAXI -- the name of the table into which the NYC taxi dataset is loaded. This dataset is joined to the TABLE_NYC_N table to create the JOIN_TAXI dataset.
TABLE_TAXI_E -- the name of the projection derived from the JOIN_TAXI dataset that serves as the edges for the GRAPH_T graph.
TABLE_TAXI_N -- the name of the union derived from the JOIN_TAXI dataset that serves as the nodes for the GRAPH_T graph.
TABLE_TAXI_N_S -- the same as TABLE_TAXI_N but later sharded so it can be joined to the nyctaxi_graph_id graph.
JOIN_TAXI -- the name of the join view that represents the dataset of all the trips found in the TABLE_TAXI dataset that overlap with the neighborhood boundaries found in the TABLE_NYC_N dataset
JOIN_PR_RESULTS -- the name of the join view that represents the dataset of all the nodes found in TABLE_TAXI_N whose ID matches a SOLVERS_NODE_ID found in TABLE_GRAPH_T_PRSOLVED_S
GRAPH_T -- the NYC taxi graph
TABLE_GRAPH_T_PRSOLVED -- the solved NYC taxi graph using the PAGE_RANK solver type
TABLE_GRAPH_T_PRSOLVED_S -- the same as TABLE_GRAPH_T_PRSOLVED but later sharded so it can be joined to the nyctaxi_nodes_sharded table.
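As a rough sketch, the constants might be laid out along the following lines. The names nyctaxi_nodes, nyctaxi_nodes_sharded, nyctaxi_edges_id, taxi_tables_joined, nyctaxi_graph_id, and nyctaxi_graph_id_page_rank_solved are taken from this example's text; the schema name, the remaining object names, and the decision to schema-qualify every name are hypothetical placeholders, not the script's actual values.

```python
# Hypothetical schema name; the script's actual value is not shown here
SCHEMA = "example_page_rank"


def qualified(name):
    """Prefix an object name with the schema (an assumed naming convention)."""
    return SCHEMA + "." + name


# Source tables (both names are hypothetical placeholders)
TABLE_NYC_N = qualified("nyc_neighborhood")
TABLE_TAXI = qualified("nyctaxi")

# Objects derived from the joined data (names taken from the text)
JOIN_TAXI = qualified("taxi_tables_joined")
TABLE_TAXI_E = qualified("nyctaxi_edges_id")
TABLE_TAXI_N = qualified("nyctaxi_nodes")
TABLE_TAXI_N_S = qualified("nyctaxi_nodes_sharded")

# Graph and solver output (first two names taken from the text)
GRAPH_T = qualified("nyctaxi_graph_id")
TABLE_GRAPH_T_PRSOLVED = qualified("nyctaxi_graph_id_page_rank_solved")
# The next two names are hypothetical placeholders
TABLE_GRAPH_T_PRSOLVED_S = qualified("nyctaxi_graph_id_page_rank_solved_sharded")
JOIN_PR_RESULTS = qualified("page_rank_results_joined")
```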
One graph is used for this example: nyctaxi_graph_id, a graph utilizing IDs based on a modified version of the standard NYC Taxi dataset (mentioned in Prerequisites).
To filter out data that could skew graph nyctaxi_graph_id, the NYC Neighborhood dataset must be inserted into Kinetica and joined to the NYC Taxi dataset using STXY_CONTAINS to remove any trip points in the NYC Taxi dataset that are not contained within the geospatial boundaries of the NYC Neighborhood dataset:
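In the script this filter runs inside Kinetica. Purely as a conceptual illustration of what a containment test does, the sketch below applies a standard ray-casting point-in-polygon check in plain Python. The column names (pickup_x, pickup_y, dropoff_x, dropoff_y), the representation of neighborhoods as lists of vertices, and the requirement that both ends of a trip fall inside a boundary are assumptions made for this sketch; it is not how STXY_CONTAINS is implemented.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon's vertex ring."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # X coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def trips_within(trips, neighborhoods):
    """Keep trips whose pickup and dropoff both fall inside some neighborhood."""
    def contained(x, y):
        return any(point_in_polygon(x, y, poly) for poly in neighborhoods)

    return [
        trip for trip in trips
        if contained(trip["pickup_x"], trip["pickup_y"])
        and contained(trip["dropoff_x"], trip["dropoff_y"])
    ]


# Made-up data: one square "neighborhood" and two trips
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
sample_trips = [
    {"pickup_x": 1.0, "pickup_y": 1.0, "dropoff_x": 3.0, "dropoff_y": 2.0},
    {"pickup_x": 1.0, "pickup_y": 1.0, "dropoff_x": 9.0, "dropoff_y": 2.0},
]
kept_trips = trips_within(sample_trips, [square])
```

With this made-up data, only the first trip is kept, because the second trip's dropoff lies outside the square.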
Before nyctaxi_graph_id can be created, the edges must be derived from the taxi_tables_joined dataset's XY pickup and dropoff pairs to create the nyctaxi_edges_id dataset:
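To make the idea of deriving ID-based edges from pickup and dropoff pairs concrete, here is a small plain-Python sketch. It is not the script's Kinetica request: the column names are hypothetical, and numbering nodes in order of first appearance is an assumption chosen for simplicity.

```python
def build_nodes_and_edges(trips):
    """Assign an integer ID to each distinct XY point and list trips as ID pairs."""
    node_ids = {}   # maps (x, y) -> node ID
    edges = []      # one (pickup ID, dropoff ID) pair per trip
    for trip in trips:
        endpoints = []
        for point in (
            (trip["pickup_x"], trip["pickup_y"]),
            (trip["dropoff_x"], trip["dropoff_y"]),
        ):
            # Assumption: IDs are handed out in order of first appearance
            if point not in node_ids:
                node_ids[point] = len(node_ids) + 1
            endpoints.append(node_ids[point])
        edges.append((endpoints[0], endpoints[1]))
    return node_ids, edges


# Made-up trips between three points A=(0, 0), B=(1, 1), C=(2, 2)
sample_trips = [
    {"pickup_x": 0, "pickup_y": 0, "dropoff_x": 1, "dropoff_y": 1},
    {"pickup_x": 1, "pickup_y": 1, "dropoff_x": 2, "dropoff_y": 2},
    {"pickup_x": 0, "pickup_y": 0, "dropoff_x": 1, "dropoff_y": 1},
]
node_ids, edges = build_nodes_and_edges(sample_trips)
```

The set of distinct points plays the role of the nodes, and each trip contributes one edge; repeated trips between the same two points produce repeated edges.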
Now, nyctaxi_graph_id is created with the following characteristics:
The graph is solved:
Important
A source node ID was selected at random from the nyctaxi_nodes union. Because page rank ranks each node's connectedness relative to the other nodes, any node can serve as the source.
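To see why the choice of source does not affect the ranking, consider the textbook power-iteration form of page rank, sketched below in plain Python. This is an illustration of the general algorithm, not a description of how Kinetica's PAGE_RANK solver is implemented; the damping factor of 0.85 and the fixed iteration count are conventional assumptions. Note that no source node appears anywhere in the computation.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration page rank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    total = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes
    rank = {node: 1.0 / total for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / total + damping * dangling / total
        new_rank = {node: base for node in nodes}
        # Each node passes its rank in equal shares along its outgoing edges
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Made-up graph: nodes 1, 2, and 4 all point at node 3, which points back at 1
sample_ranks = page_rank([(1, 3), (2, 3), (3, 1), (4, 3)])
best_node = max(sample_ranks, key=sample_ranks.get)
```

In this made-up graph, node 3 receives the most incoming links and ends up with the highest rank.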
A sharded version of the previously created nyctaxi_nodes union is created so that it can be joined:
A sharded version of the nyctaxi_graph_id_page_rank_solved table is also created so it can be joined to the nyctaxi_nodes union:
The union and graph results table are joined on ID:
The top 10 nodes (sorted by descending cost) are retrieved. The higher the cost, the more frequently the node was visited:
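The join and the top-10 retrieval happen inside Kinetica, but the logic can be sketched in plain Python over in-memory records. SOLVERS_NODE_ID is the column named earlier in this example; the SOLVERS_NODE_COSTS column name, the id field on the node records, and both helper functions are assumptions introduced for illustration.

```python
import heapq


def join_on_id(nodes, results):
    """Attach each node's solver cost by matching node ID to SOLVERS_NODE_ID."""
    # Assumption: the cost column is named SOLVERS_NODE_COSTS
    cost_by_id = {row["SOLVERS_NODE_ID"]: row["SOLVERS_NODE_COSTS"] for row in results}
    return [
        dict(node, cost=cost_by_id[node["id"]])
        for node in nodes
        if node["id"] in cost_by_id
    ]


def top_nodes(joined, limit=10):
    """Return up to `limit` joined rows, highest cost first."""
    return heapq.nlargest(limit, joined, key=lambda row: row["cost"])


# Made-up node and result records
sample_nodes = [{"id": 1, "x": 0.0, "y": 0.0}, {"id": 2, "x": 1.0, "y": 1.0}, {"id": 3, "x": 2.0, "y": 2.0}]
sample_results = [
    {"SOLVERS_NODE_ID": 1, "SOLVERS_NODE_COSTS": 0.2},
    {"SOLVERS_NODE_ID": 2, "SOLVERS_NODE_COSTS": 0.5},
    {"SOLVERS_NODE_ID": 3, "SOLVERS_NODE_COSTS": 0.3},
]
ranked = top_nodes(join_on_id(sample_nodes, sample_results), limit=2)
```

Here ranked holds the two highest-cost nodes in descending order of cost, each carrying its original coordinates alongside the cost.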
Included below is a complete example containing all the above requests, the data files, and output.
To run the complete sample, ensure that:
The solve_graph_nyctaxi_page_rank.py script is in the current directory
The nyc_neighborhood.csv file is in the current directory, or use the data_dir parameter to specify the local directory containing it
Then, run the following: