Version:

Dictionary Encoding

Dictionary encoding is a data compression technique that can be applied to individual columns of the following types:

  • restricted-length string (charN)
  • int
  • long
  • date

It will store each unique value of a column in memory and associate each record with its corresponding unique value. This eliminates the storage of duplicate values in a column, reducing the overall memory & disk space required to hold the data.

Dictionary encoding is most effective on columns with low cardinality; the fewer the number of unique values within a column, the greater the reduction in memory usage. Queries against the encoded column will generally be faster.

A column can be created with dictionary encoding in effect by applying the dict data handling property to the column during type creation (using /create/type). An existing column can be converted to use dictionary encoding by modifying the column and applying the dict property (using /alter/table).

For example, to apply dictionary encoding to a column in Python:

gpudb.alter_table(
  table_name = "taxi_data",
  action = "change_column",
  value = "vendor_code",
  options = {"column_properties":"char4,dict"}
)

Limitations and Cautions

Columns with any of the following characteristics are not eligible for dictionary encoding: