Dictionary Encoding

Dictionary encoding is a data compression technique that can be applied to individual columns of the following effective types:

  • int
  • long
  • date
  • char1 - char256

It will store each unique value of a column in memory and associate each record with its corresponding unique value. This eliminates the storage of duplicate values in a column, reducing the overall memory & disk space required to hold the data.

Dictionary encoding is most effective on columns with low cardinality; the fewer the number of unique values within a column, the greater the reduction in memory usage. Queries against the encoded column will generally be faster.

A column can be created with dictionary encoding in effect by applying the dict data handling property to the column during type creation (using /create/type). An existing column can be converted to use dictionary encoding by modifying the column and applying the dict property (using /alter/table).

For example, to apply dictionary encoding to a column during table creation, in Python:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
# Create a column list; use GPUdbColumnProperty.DICT to apply dictionary
# encoding to last_name
columns = [
    [ "id", GRC._ColumnType.INT, GCP.PRIMARY_KEY ],
    [ "dept_id", GRC._ColumnType.INT, GCP.PRIMARY_KEY, GCP.SHARD_KEY ],
    [ "manager_id", GRC._ColumnType.INT, GCP.NULLABLE ],
    [ "first_name", GRC._ColumnType.STRING, GCP.CHAR32, GCP.NULLABLE ],
    [ "last_name", GRC._ColumnType.STRING, GCP.CHAR32, GCP.DICT, GCP.NULLABLE ],
    [ "sal", GRC._ColumnType.STRING, GCP.DECIMAL, GCP.NULLABLE ],
    [ "hire_date", GRC._ColumnType.STRING, GCP.DATE, GCP.NULLABLE ]
]

# Create a simple table using the column list
t = gpudb.GPUdbTable(
    columns,
    name = "example.employee",
    db = h_db
)

To apply dictionary encoding to a column after table creation, in Python:

1
2
3
4
5
6
7
8
9
# Set dictionary encoding on the last_name column of the employee table
h_db.alter_table(
    table_name = "example.employee",
    action = "change_column",
    value = "last_name",
    options = {
        "column_properties": "char32,dict,nullable"
    }
)

To remove dictionary encoding from a column, in Python, alter the column, specifying all non-dictionary-encoding properties the column currently has:

1
2
3
4
5
6
7
8
9
# Remove dictionary encoding on the last_name column of the employee table
h_db.alter_table(
    table_name = "example.employee",
    action = "change_column",
    value = "last_name",
    options = {
        "column_properties": "char32,nullable"
    }
)

Important

Don't leave any spaces between properties in an /alter/table command's column_properties option.

Limitations and Cautions

Columns with any of the following characteristics are not eligible for dictionary encoding: