Dictionary encoding is a data compression technique that can be applied to individual columns of the following types:
charN
)int
long
date
It will store each unique value of a column in memory and associate each record with its corresponding unique value. This eliminates the storage of duplicate values in a column, reducing the overall memory & disk space required to hold the data.
Dictionary encoding is most effective on columns with low cardinality; the fewer the number of unique values within a column, the greater the reduction in memory usage. Queries against the encoded column will generally be faster.
A column can be created with dictionary encoding in effect by applying the
dict
data handling property to the column during type creation (using
/create/type). An existing column can be converted to use
dictionary encoding by modifying the column and applying the dict
property
(using /alter/table).
For example, to apply dictionary encoding to a column in Python:
gpudb.alter_table(
table_name = "taxi_data",
action = "change_column",
value = "vendor_code",
options = {"column_properties":"char4,dict"}
)
Columns with any of the following characteristics are not eligible for dictionary encoding:
charN
),
int
, long
, and date