Querying Social Graphs in SQL

An example of querying a cinema relation graph with Kinetica (no external dataset required)

The following is a complete example, using SQL, of querying a graph of movies & TV shows and the people who star in and direct them. For more information on Graphs & Solvers, see Graphs & Solvers Concepts.

Prerequisites

The prerequisites for running this query graph example are listed below:

Overview

This example is going to demonstrate querying a cinema-based graph of relationships between actors, directors, & shows for:

  • directors and titles of movies in which a given actor has acted, using a hop-based search
    • the same query, but applying a fuzzy search, employing Kinetica's full text search capability via the FILTER_BY_STRING function
    • the same query, but applying a SQL join to the results of a graph query to supplement the result set
  • titles & release years of movies directed by directors of movies in which a given actor has acted, demonstrating the nesting of graph queries

Table Setup

Three source data tables will be used to construct the graph:

  1. qsgs_person - contains actors, directors, & producers
  2. qsgs_show - contains the genre, release date, & score of movies & TV shows
  3. qsgs_role - contains the relationships (production roles) between the people and the shows on which they have worked; e.g., X acted in show Y or A directed show B

Person Table

This table contains the name, semicolon-separated list of production roles, and date of birth of actors, directors, & producers used in this example. The list of production roles will be used later in graph creation to associate all of a given person's career-wide production roles with them, not just their production role on any particular show.

Create Person Table
1
2
3
4
5
6
CREATE TABLE qsgs_person
(
    name VARCHAR(32, TEXT_SEARCH),
    profession VARCHAR(32),
    dob DATE
)
Populate Person Table
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
INSERT INTO qsgs_person (name, profession, dob)
VALUES
    ('Robert De Niro',  'actor:director', '1943-08-07'),
    ('Jean Reno',       'actor:director', '1948-07-30'),
    ('Al Pacino',       'actor:producer', '1940-04-25'),
    ('Michael Douglas', 'actor:producer', '1944-09-25'),
    ('James Spader',    'actor',          '1960-02-07'),
    ('Katarina Witt',   'actor',          '1965-12-03'),
    ('Oliver Stone',    'director',       '1946-09-15'),
    ('Joe Carnahan',    'director',       '1969-05-09')

Show Table

This table contains the title, show type (movie or TV series), semicolon-separated list of genres, year of release, and score for the shows used in this example. The list of genres will be used in graph creation to associate each of the genres with the corresponding show.

Create Show Table
1
2
3
4
5
6
7
8
CREATE TABLE qsgs_show
(
    title VARCHAR(32),
    show_type VARCHAR(8),
    score REAL,
    genre VARCHAR(32),
    released INT
)
Populate Show Table
1
2
3
4
5
6
7
INSERT INTO qsgs_show (title, show_type, score, genre, released)
VALUES
    ('Wall Street',   'movie',  8.4, 'thriller',                1987),
    ('Ronin',         'movie',  7.2, 'action',                  1998),
    ('The Godfather', 'movie',  9.2, 'drama:thriller',          1972),
    ('Salome',        'movie',  6.3, 'drama',                   2013),
    ('The Blacklist', 'series', 8.2, 'drama::mystery:thriller', 1972)

Production Role Table

This table contains the production role each person had on each show within the example data set. The production role values are a semicolon-separated list of roles, each of which will be applied to the relation between the person and the show within the graph.

Create Production Role Table
1
2
3
4
5
6
CREATE TABLE qsgs_role
(
    person VARCHAR(32),
    film VARCHAR(32),
    participation VARCHAR(32)
)
Populate Production Role Table
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
INSERT INTO qsgs_role (person, film, participation)
VALUES
    ('Robert De Niro',  'The Godfather', 'acted'),
    ('Robert De Niro',  'Ronin',         'acted'),
    ('Jean Reno',       'Ronin',         'acted'),
    ('Al Pacino',       'The Godfather', 'acted'),
    ('Michael Douglas', 'Wall Street',   'acted'),
    ('James Spader',    'Wall Street',   'acted'),
    ('Katarina Witt',   'Ronin',         'acted'),
    ('Al Pacino',       'Salome',        'acted:directed'),
    ('Oliver Stone',    'Wall Street',   'directed'),
    ('James Spader',    'The Blacklist', 'acted'),
    ('Joe Carnahan',    'The Blacklist', 'directed')

Graph Construction

This example will demonstrate both constructing a graph from source tables and amending the graph with additional data after construction.


Creation

One graph is used for these query graph examples, qsgs_cinema, which will use several queries against the source tables to build the edges, nodes, & labels. The graph is created with the CREATE GRAPH command:

Create Cinema Graph
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
CREATE DIRECTED GRAPH qsgs_cinema
(
    NODES => INPUT_TABLES
    (
        (SELECT title AS NAME, show_type || ':' || genre AS LABEL FROM qsgs_show),
        (SELECT DISTINCT CHAR4(released) AS NAME, 'year' AS LABEL FROM qsgs_show),

        (SELECT name AS NAME, 'person:' || profession AS LABEL FROM qsgs_person),
        (SELECT DISTINCT CHAR4(YEAR(DOB)) AS NAME, 'year' AS LABEL FROM qsgs_person)
    ),
    EDGES => INPUT_TABLES
    (
        (SELECT person AS NODE1_NAME, film AS NODE2_NAME, participation AS LABEL FROM qsgs_role),
        (SELECT title AS NODE1_NAME, CHAR4(released) AS NODE2_NAME, 'released' AS LABEL FROM qsgs_show),
        (SELECT name AS NODE1_NAME, CHAR4(YEAR(dob)) AS NODE2_NAME, 'born' AS LABEL FROM qsgs_person)
    ),
    OPTIONS => KV_PAIRS(label_delimiter = ':', graph_table = 'qsgs_cinema_graph_table')
)

The graph will contain 4 node types:

  • show node - represents a movie or TV show, labeled with the type of show it is and each genre it encompasses
  • release year node - represents the release year of the connected show
  • person node - represents an actor, director, or producer; labeled with person to indicate a person node, as well as each profession the person has held
  • birth year node - represents the date of birth of the connected person

The graph will also contain 3 edge types:

  • participation edge - represents the connection of a person to a show, labeled with the production role of the person in the show
  • released edge - represents the connection of a show to the year it was released; labeled with released to indicate it as a released edge
  • born edge - represents the connection of a person to the year of birth; labeled with born to indicate it as a born edge

Lastly, the graph is created with the following characteristics:

  • It is directed, depicting relationships between people and the shows on which they have worked as person-to-show; e.g., X directed show Y. Since all edges are directed, the other edge types will follow suit; e.g., X was born in year Y and X was released in year Y.
  • It has no weights because this example doesn't favor any production roles or linked attributes over others
  • It has no inherent restrictions for any of the nodes or edges in the graph
  • It will use : as the delimiter when parsing each of the node & edge labels upon creation
  • It will have a corresponding table, qsgs_cinema_graph_table, created for it containing its edges and qsgs_cinema_graph_table_nodes created for it containing its node endpoints

Alteration

The created graph can have data (nodes, edges, & labels) added to it after creation, as demonstrated with the ALTER GRAPH command:

Alter Cinema Graph
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
ALTER GRAPH qsgs_cinema MODIFY
(
    NODES => INPUT_TABLES
    (
        (SELECT 'Stargate' AS NAME, 'movie:scifi' AS LABEL),
        (SELECT '1994' AS NAME, 'year' AS LABEL),

        (SELECT 'Roland Emmerich' AS NAME, 'person:director' AS LABEL)
    ),
    EDGES => INPUT_TABLES
    (
        (SELECT 'James Spader' AS NODE1_NAME, 'Stargate' AS NODE2_NAME, 'acted' AS LABEL),
        (SELECT 'Roland Emmerich' AS NODE1_NAME, 'Stargate' AS NODE2_NAME, 'directed' AS LABEL),

        (SELECT 'Stargate' AS NODE1_NAME, '1994' AS NODE2_NAME, 'released' AS LABEL)
    ),
    OPTIONS => KV_PAIRS(label_delimiter = ':', graph_table = 'qsgs_cinema_graph_table')
)

Here, the show Stargate and the person Roland Emmerich are added to the graph, along with their associated connections. Note that the graph entities are added as constants instead of queries here, though either or a mixture of the two is acceptable in both CREATE GRAPH & ALTER GRAPH.

Though not required for supplementing the graph data, a backing record for Stargate is added to the show table for use in a supplemental data query later in this guide.

Insert Supplemental Data
1
2
INSERT INTO qsgs_show (title, show_type, score, genre, released)
VALUES ('Stargate', 'movie', 7.6, 'scifi', 1994)

Querying

This example will demonstrate four ways graph queries can be used to answer relationship questions:


Hop-Based

To find the directors and titles of movies (no TV series) in which James Spader has acted:

Hop Query
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
SELECT
    QUERY_NODE1_NAME AS Director,
    QUERY_NODE2_NAME AS Movie
FROM TABLE
(
    QUERY_GRAPH
    (
        GRAPH => 'qsgs_cinema',
        QUERIES => INPUT_TABLES
        (
            (SELECT 'James Spader'           AS NODE_NAME),
            (SELECT  1 AS HOP_ID, 'acted'    AS EDGE_LABEL),
            (SELECT  1 AS HOP_ID, 'movie'    AS NODE_LABEL),
            (SELECT -2 AS HOP_ID, 'directed' AS EDGE_LABEL)
        ),
        RINGS => 2
    )
)
WHERE QUERY_NODE1_NAME != 'James Spader'
ORDER BY 1, 2

The graph query is constructed in the following way:

  • Start with James Spader
    • NODE_NAME of James Spader
  • Find shows in which he has acted
    • HOP_ID of 1 with an EDGE_LABEL of acted
  • Isolate the shows to just movies
    • HOP_ID of 1 with a NODE_LABEL of movie
  • Find directors of those movies
    • HOP_ID of -2 with an EDGE_LABEL of director; the negative sign indicates the query should be allowed to traverse these edges of the directed graph in reverse--from show to person, whereas the edge is defined topologically as one-way, from person to show
  • Stop the query at this second hop
    • RINGS of 2
  • Filter out entries connecting James Spader to these movies, as these are part of the queried path, but not the director-to-movie relations in which we are interested
    • WHERE QUERY_NODE1_NAME != 'James Spader'
  • Extract the director & title of the movies from the query results
    • QUERY_NODE1_NAME represents the director and QUERY_NODE2_NAME the movie in the director-to-movie directed edge relation

Supplemental Functions

To perform the same query as in Query by Hops above, but employing the Kinetica full text search capability using the FILTER_BY_STRING function to find the same James Spader starting node:

Fuzzy Search Query
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
SELECT
    QUERY_NODE1_NAME AS Director,
    QUERY_NODE2_NAME AS Movie
FROM TABLE
(
    QUERY_GRAPH
    (
        GRAPH => 'qsgs_cinema',
        QUERIES => INPUT_TABLES
        (
            (
                SELECT name AS NODE_NAME
                FROM TABLE
                (
                    FILTER_BY_STRING
                    (
                        TABLE_NAME => INPUT_TABLE(qsgs_person),
                        MODE => 'search',
                        EXPRESSION => 'James Spdar'
                    )
                )
            ),
            (SELECT  1 AS HOP_ID, 'acted'    AS EDGE_LABEL),
            (SELECT  1 AS HOP_ID, 'movie'    AS NODE_LABEL),
            (SELECT -2 AS HOP_ID, 'directed' AS EDGE_LABEL)
        ),
        RINGS => 2
    )
)
WHERE QUERY_NODE1_NAME != 'James Spader'
ORDER BY 1, 2

The graph query is constructed in the following way:

  • Perform a fuzzy search on the person table for a name like James Spdar
    • FILTER_BY_STRING on the qsgs_person table using a fuzzy search for James Spdar across all of its columns
  • Start with the name returned by the fuzzy search
    • NODE_NAME of the value in the name column of the qsgs_person table that matched James Spdar
  • Perform the rest of the query the same way as was done in Query by Hops

Supplemental Data

To perform the same query as in Query by Hops above, but employing a SQL join to supplement the query results with movie scores from the show table:

Supplemental Data Query
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
SELECT
    QUERY_NODE1_NAME AS Director,
    QUERY_NODE2_NAME AS Movie,
    s.score AS Score
FROM TABLE
(
    QUERY_GRAPH
    (
        GRAPH => 'qsgs_cinema',
        QUERIES => INPUT_TABLES
        (
            (SELECT 'James Spader'           AS NODE_NAME),
            (SELECT  1 AS HOP_ID, 'acted'    AS EDGE_LABEL),
            (SELECT  1 AS HOP_ID, 'movie'    AS NODE_LABEL),
            (SELECT -2 AS HOP_ID, 'directed' AS EDGE_LABEL)
        ),
        RINGS => 2
    )
)
JOIN qsgs_show s ON s.title = QUERY_NODE2_NAME
WHERE QUERY_NODE1_NAME != 'James Spader'
ORDER BY 1, 2

The graph query is constructed in the following way:

  • Perform the query the same way as was done in Query by Hops
  • Associate the movies returned by the query with their corresponding entries in the show table, matching by movie title
    • JOIN qsgs_show s ON s.title = QUERY_NODE2_NAME
  • Add each movie's score to the results
    • s.score AS Score

Nested Graph Queries

To perform the same query as in Query by Hops above, but nesting it within another graph query to find the titles and release years of all shows directed by directors of James Spader movies:

Nested Graph Queries
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
SELECT
    QUERY_NODE1_NAME AS "Show",
    QUERY_NODE2_NAME AS Release_Year
FROM TABLE
(
    QUERY_GRAPH
    (
        GRAPH => 'qsgs_cinema',
        QUERIES => INPUT_TABLES
        (
            (
                SELECT DISTINCT QUERY_NODE2_NAME as NODE_NAME
                FROM TABLE
                (
                    QUERY_GRAPH
                    (
                        GRAPH => 'qsgs_cinema',
                        QUERIES => INPUT_TABLES
                        (
                            (SELECT 'James Spader'           AS NODE_NAME),
                            (SELECT  1 AS HOP_ID, 'acted'    AS EDGE_LABEL),
                            (SELECT  1 AS HOP_ID, 'movie'    AS NODE_LABEL),
                            (SELECT -2 AS HOP_ID, 'directed' AS EDGE_LABEL),
                            (SELECT  3 AS HOP_ID, 'directed' AS EDGE_LABEL)
                        ),
                        RINGS => 3
                    )
                )
            ),
            (SELECT 'year' AS TARGET_NODE_LABEL)
        ),
        RINGS => 1
    )
)

The graph query is constructed in the following way:

  • Perform an inner graph query the same way as was done in Query by Hops, with two modifications:
    • Find shows directed by directors of James Spader movies
      • HOP_ID of 3 with an EDGE_LABEL of directed; move from the directors of James Spader movies returned by the original graph query to all of the shows those directors have directed
    • Stop the query after this additional hop, from directors to shows
      • RINGS of 3
  • Perform an outer graph query on the list of shows returned by the inner graph query, finding their associated release dates
    • Start with the names of shows returned by the inner graph query
      • NODE_NAME of QUERY_NODE2_NAME (inner query result show's name)
    • Find release years of those shows
      • TARGET_NODE_LABEL of year
    • Ensure the release years found are directly connected to the respective shows
      • RINGS of 1 (only go one hop away from the show nodes)
    • Extract the title and release year of the shows from the query results
      • QUERY_NODE1_NAME represents the show title and QUERY_NODE2_NAME the show release year in the show-to-release-year directed edge relation

Download & Run

Included below is a complete example containing all the above requests and corresponding output.

The script can be run via Workbooks in Workbench or any SQL client.

To execute using KiSQL, download the SQL script, switch to the directory in which it has been downloaded, and run:

Run Example
1
$ kisql --url https://<aws.fqdn>/<aws.cluster.name>/gpudb-0 --user <username> --file guide_query_graph_cinema.kisql