What is compaction in Cassandra?

Hey guys! Welcome to Tutorial Hub. Let's talk about an important background process in Apache Cassandra called compaction.

What is Compaction?

Compaction is a periodic background process that maintains SSTables. It performs the following tasks:

  • Merges SSTables, keeping the most recent version (by write timestamp) of each column in a partition.
  • Evicts deleted and TTL-expired columns.
  • Writes the merged data to a new SSTable.
  • Rebuilds the partition index and partition summary.

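Why is it required?

  • SSTables are immutable, so updates fragment a partition's data across multiple SSTables over time.
  • Deletes are themselves writes (tombstone markers), so both the tombstones and the data they shadow must be periodically cleared.

NOTE: Compaction is efficient because SSTables are inherently sorted by partition key, so merging them needs only sequential reads and writes, with no random I/O.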
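
To make the two points above concrete, here is a minimal CQL sketch of the kinds of writes that leave work for compaction. It assumes a table shaped like the performer table used later in this tutorial already exists. The row values and the one-hour TTL are made up for illustration.

```cql
-- Assumes: performer(first text PRIMARY KEY, last text, level text) already exists.
-- All values below are illustrative.

-- Initial write: goes to the memtable and is later flushed to an SSTable.
INSERT INTO performer (first, last, level) VALUES ('Ada', 'Lovelace', 'beginner');

-- Update: a new version of the column is written. If the original row was
-- already flushed, the old value stays in the older SSTable until compaction
-- merges the two.
UPDATE performer SET level = 'expert' WHERE first = 'Ada';

-- TTL write: the row expires after 3600 seconds and is evicted by a later compaction.
INSERT INTO performer (first, last, level) VALUES ('Tmp', 'Guest', 'trial') USING TTL 3600;

-- Delete: nothing is removed in place; a tombstone is written instead.
DELETE FROM performer WHERE first = 'Ada';
```

Keep in mind that a tombstone only becomes eligible for eviction once the table's gc_grace_seconds has elapsed, so a delete does not free space at the very next compaction.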

How does compaction affect reads and disk space?

During compaction

  • Disk I/O and utilization increase.

After compaction

  • Read performance improves because fewer SSTables are read for off-cache reads.
  • Disk utilization drops as old SSTables are deleted.

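Available Compaction Strategies:

  • Size Tiered Compaction Strategy (STCS)
  • Leveled Compaction Strategy (LCS)
  • Date Tiered Compaction Strategy (DTCS), which is deprecated in newer Cassandra releases in favor of Time Window Compaction Strategy (TWCS)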

CQL command to set the compaction strategy when creating a table

CREATE TABLE performer (
  first text PRIMARY KEY,
  last text,
  level text
) WITH compaction = {'class' : '<strategy>', <params>};

Size Tiered Compaction Strategy:

STCS (Size Tiered Compaction Strategy) is the default compaction strategy.

It compacts a set number of similarly sized SSTables into one larger SSTable.

Compaction is triggered when the system has enough similarly sized SSTables (the default is 4). These are merged together to form one larger SSTable. Later, when several large SSTables have accumulated, they are merged to form one even larger SSTable, and so on.

This means that the system has several size tiers (small SSTables, large SSTables, even larger SSTables), and each tier holds roughly the same number of files. When one tier is full, the system merges all of its SSTables into one SSTable in the next tier.
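
As a rough sketch, the trigger described above corresponds to the min_threshold subproperty of SizeTieredCompactionStrategy. The table name performer_stcs is hypothetical, and the values shown are simply the usual defaults written out explicitly.

```cql
-- Hypothetical table; assumes a keyspace has already been selected with USE.
-- min_threshold: number of similarly sized SSTables needed to trigger a compaction.
-- max_threshold: upper bound on the number of SSTables merged in one compaction.
CREATE TABLE performer_stcs (
  first text PRIMARY KEY,
  last text,
  level text
) WITH compaction = {
  'class' : 'SizeTieredCompactionStrategy',
  'min_threshold' : '4',
  'max_threshold' : '32'
};
```

Raising min_threshold makes compactions less frequent but leaves more SSTables to read in the meantime, so we would only change it after measuring the workload.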

What are the problems with STCS in update-heavy workloads?

  • Frequently updated partitions may spread across many SSTables. In the worst case, we could have columns from a given row in every SSTable.
  • A substantial amount of space can be wasted, since there is no guarantee as to how quickly obsolete columns will be merged out of existence. This is particularly noticeable when there is a high ratio of deletes.
  • Space can also be a problem as SSTables grow larger from repeated compactions, since an obsolete SSTable cannot be removed until the merged SSTable is completely written. This requires significant free disk space: disk usage can temporarily reach up to 2x the size of the largest CQL table.
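
One partial mitigation for the wasted-space problem is the set of tombstone-related compaction subproperties. The sketch below is only an illustration: it assumes the performer table from this tutorial already exists, and the values are what we understand to be the documented defaults written out explicitly, so check them against your Cassandra version before relying on them.

```cql
-- Assumes the performer table already exists.
-- tombstone_threshold: ratio of droppable tombstones in an SSTable above which
--   that SSTable may be compacted on its own to purge them.
-- tombstone_compaction_interval: minimum age of an SSTable, in seconds, before
--   such a single-SSTable compaction is considered.
ALTER TABLE performer WITH compaction = {
  'class' : 'SizeTieredCompactionStrategy',
  'tombstone_threshold' : '0.2',
  'tombstone_compaction_interval' : '86400'
};
```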

Leveled Compaction Strategy:

With leveled compaction, instead of potentially huge SSTables, the system uses small, fixed-size SSTables (160 MB by default) divided into different “levels”.

Within each level above the first (L0), SSTables are guaranteed to be non-overlapping. Each level is ten times as large as the previous one.
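
A minimal sketch of enabling leveled compaction when creating a table is shown here. The table name performer_lcs is hypothetical, and sstable_size_in_mb is set to 160 only to make the default mentioned above explicit.

```cql
-- Hypothetical table; assumes a keyspace has already been selected with USE.
-- sstable_size_in_mb: target size of each fixed-size SSTable.
CREATE TABLE performer_lcs (
  first text PRIMARY KEY,
  last text,
  level text
) WITH compaction = {
  'class' : 'LeveledCompactionStrategy',
  'sstable_size_in_mb' : '160'
};
```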

 

Implications of Leveled Compaction Strategy:

  • With the leveled compaction strategy, SSTable reads are efficient. The large number of small SSTables doesn’t mean we need to look up a key in that many SSTables: the SSTables in each level have disjoint ranges, so we only need to look in one SSTable per level. In the typical case, we just need to read one SSTable.
  • The other factors making this compaction strategy efficient are that at most 10% of space will be wasted by obsolete rows, and only enough space for ~10x the SSTable size needs to be reserved for temporary use by compaction.
  • The downside of this method is roughly twice as much I/O on writes, so it is a poor fit for workloads that mostly write new data.

What compaction strategy should you use?

SizeTieredCompaction pros:

  • Relatively low overhead per compaction due to the small number of SSTables involved.
  • May be optimal for write-heavy applications.

SizeTieredCompaction cons:

  • Disk usage can temporarily reach up to 2x the size of the largest table during compaction, so that much free space must be available.
  • Frequently updated partitions may spread across many SSTables, making read queries slow.
  • A substantial amount of space can be wasted, since there is no guarantee as to how quickly obsolete columns will be merged out of existence.

LeveledCompaction pros:

  • Reduces the total number of SSTables that may be touched for each read.
  • Less disk space needed for compaction.
  • Better tombstone eviction.
  • May be optimal for read-heavy applications.

LeveledCompaction cons:

  • Compaction is more frequent, so higher I/O throughput is required (may be an issue for older, slower hardware).
  • No advantage if rows are written once and never updated.
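
If a table turns out to be read-heavy with frequent updates, the strategy can also be changed on an existing table. Here is a minimal sketch, assuming the performer table from earlier exists and currently uses STCS:

```cql
-- Assumes the performer table already exists.
-- Switches the table from its current strategy to leveled compaction.
ALTER TABLE performer WITH compaction = { 'class' : 'LeveledCompactionStrategy' };
```

Expect extra compaction I/O after the change while existing SSTables are reorganized into levels, so this is usually best done during a quiet period.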