Column Compression

Column compression is the application of a standard compression algorithm to a persisted column's data.

Column data is compressed on disk so that it loads into memory faster, at the cost of the processing time needed to decompress the loaded data. Once in memory, the data remains uncompressed until it is no longer needed. When data is inserted or updated in memory, the column data is compressed and written to disk as soon as either of the following conditions is reached, whichever comes first:

  • Memory data chunk being written to becomes full
  • System compression timeout (disk_auto_optimize_timeout) is reached

Column compression can be specified at three different levels, and the compression applied to an individual column (if any) will be determined by which of the following configuration items is found first, in this order:

  • Compression explicitly defined on the column (compress column property)
  • Compression explicitly defined on the column's table at the time of its creation, applying to any columns on that table with no explicit compression definition of their own (compression_codec table property)
  • Global default column compression (compression_codec)
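
The lookup order above can be sketched as a small function. This is only an illustration of the precedence rule, not part of the database API; the name resolve_compression and its parameters are hypothetical, and None stands for "nothing defined at this level".

```python
from typing import Optional


def resolve_compression(column_codec: Optional[str],
                        table_codec: Optional[str],
                        global_codec: Optional[str]) -> Optional[str]:
    """Return the first codec that is defined, checking the column,
    then the table, then the global default (hypothetical helper)."""
    for codec in (column_codec, table_codec, global_codec):
        if codec is not None:
            return codec
    # Nothing is defined at any level
    return None


# The column-level setting wins over the table-level default
print(resolve_compression("lz4(9)", "snappy", None))
# A column with no setting of its own falls back to the table-level default
print(resolve_compression(None, "snappy", "zstd"))
```

Note that an explicit "none" at the column level is still a definition, so in this sketch it takes precedence over any table-level or global default.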

Available column compression algorithms are as follows. Those with an optional compression parameter are shown with the minimum & maximum values for that parameter.

Name     Arg. Min.   Arg. Max.   Notes
none                             Apply no compression to this column, overriding any higher-level defaults.
lz4      0           12          Default is 0. Values of 9 and above enable LZ4_HC.
snappy
zstd     -131072     22          Default is 3. Negative values prioritize speed over compression ratio.
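
As a rough client-side check, the ranges in the table above can be encoded and used to validate a codec string such as lz4(9) before it is sent to the database. This is a minimal sketch: validate_codec and CODEC_ARG_RANGES are hypothetical names, not part of any Kinetica API, and rejecting an argument on a codec that takes none is an assumption about how such input should be treated.

```python
import re

# (min, max) argument range per codec, taken from the table above;
# None means the codec takes no argument.
CODEC_ARG_RANGES = {
    "none": None,
    "lz4": (0, 12),
    "snappy": None,
    "zstd": (-131072, 22),
}

# Matches "name" or "name(integer)", e.g. "zstd" or "lz4(9)"
_SPEC = re.compile(r"([a-z0-9]+)(?:\((-?\d+)\))?")


def validate_codec(spec: str) -> bool:
    """Return True if spec names a known codec with an in-range argument."""
    match = _SPEC.fullmatch(spec.strip().lower())
    if match is None:
        return False
    name, arg = match.group(1), match.group(2)
    if name not in CODEC_ARG_RANGES:
        return False
    if arg is None:
        # No argument given; the codec's default applies
        return True
    arg_range = CODEC_ARG_RANGES[name]
    if arg_range is None:
        # Assumption: an argument on a codec that takes none is invalid
        return False
    low, high = arg_range
    return low <= int(arg) <= high


for spec in ("lz4(9)", "zstd", "zstd(-5)", "lz4(13)", "snappy(1)"):
    print(spec, validate_codec(spec))
```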

For example, to apply the following column compression during table creation, in Python:

  • LZ4 compression (level 9, which enables LZ4_HC) on the first_name column
  • zstd compression (default level) on the last_name column
  • Snappy compression on all other columns
# Create a column list with compression on name columns
columns = [
    [ "id", "int", "primary_key" ],
    [ "dept_id", "int", "primary_key", "shard_key" ],
    [ "manager_id", "int", "nullable" ],
    [ "first_name", "string", "char32", "compress(lz4(9))", "nullable" ],
    [ "last_name", "string", "char32", "compress(zstd)", "nullable" ],
    [ "salary", "string", "decimal", "nullable" ],
    [ "hire_date", "string", "date", "nullable" ]
]

# Create a simple table using the column list, applying snappy compression
#   to all columns without explicit compression defined
t = gpudb.GPUdbTable(
    columns,
    name = table_name,
    db = kinetica,
    options = {"compression_codec": "snappy"}
)

To add a new column with column compression to a table, in Python:

# Add column with explicit compression to the employee table
resp = kinetica.alter_table(
    table_name = table_name,
    action = "add_column",
    value = "last_reviewed_by",
    options = {
        "column_type": "string",
        "column_properties": "char64,compress(lz4),nullable"
    }
)
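
When several columns need to be added this way, the options dictionary can be assembled by a small helper. The following is a sketch only: add_column_options is a hypothetical function, it uses just the two option keys shown above, and it assumes the order of entries within column_properties is not significant.

```python
def add_column_options(column_type, properties=(), codec=None):
    """Build the options dict for an add_column call (hypothetical helper).

    properties: other column properties, e.g. ["char64", "nullable"]
    codec:      compression spec such as "lz4" or "zstd(3)"; None leaves the
                column without an explicit compression definition
    """
    props = list(properties)
    if codec is not None:
        # Same compress(...) property syntax as in the examples above
        props.append("compress({})".format(codec))
    return {
        "column_type": column_type,
        "column_properties": ",".join(props),
    }


options = add_column_options("string", ["char64", "nullable"], codec="lz4")
print(options)
```

The resulting dictionary could then be passed as the options argument of the alter_table call shown above.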

Limitations and Cautions