paint-brush
How Cassandra Stores Data: An Exploration of Log Structured Merge Treesby@dulithag
24,160 reads
24,160 reads

How Cassandra Stores Data: An Exploration of Log Structured Merge Trees

by DulithaJune 14th, 2023
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

Cassandra is a distributed database that runs on multiple nodes. In Cassandra, each node stores only a fraction of the table’s rows. Memtables are an in-memory data structure that holds data before it is flushed to disk as an SSTable. SSTables are a persistent file format that stores data on disk in a sorted way.
featured image - How Cassandra Stores Data: An Exploration of Log Structured Merge Trees
Dulitha HackerNoon profile picture

Database storage is a crucial aspect of any data-intensive application. How data is stored, organized, and accessed can have a significant impact on the performance, scalability, and reliability of the system. In this article, we will explore Log Structured Merge Trees that use SSTables and Memtables to provide fast and reliable storage. We will explore each component in turn and understand how they work together.


Data Model

Before we explore how Cassandra stores data, let’s understand how it models data. Cassandra uses tables similar to SQL databases. It also supports an SQL-like language (CQL) that enables us to insert, update and delete rows. However, these tables are more akin to key-value stores when we consider how they are accessed.


Here is an example table for employee data.

Department (partition key)

Join Year (cluster column)

Name

Accounting

2022

John

Engineering

2021

Jack

Engineering

2022

James

Sales

2020

Sam


In any SQL database, a row is uniquely identified by a Primary Key.

In Cassandra and Scylla, a Primary Key also serves as the only way to access a row. A Primary Key in Cassandra can be either Simple or Composite.


  • A Simple Primary Key consists of only a Partition Key.
    • Cassandra is a distributed database that runs on multiple nodes. The Partition Key’s hash determines which node (or nodes) will store the row.

    • With a Simple Primary Key (i.e., only a Partition Key), our table acts as a key-value store. We can retrieve the row’s contents by providing the key to Cassandra.


  • A Composite Primary Key comprises both a Partition Key and a Cluster Column.
    • A Cluster Column enables us to have multiple rows with the same Partition Key value. The Cluster Column sorts the rows within the same Partition Key value.
    • Clustering columns let us perform range queries over the rows with the same Partition Key value.
      • Select * from Employees where department = ‘Engineering’ and Year >2020 and Year < 2022;


In a distributed database like Cassandra, each node stores only a fraction of the table’s rows.

Next, let’s examine how Memtables and SSTables store these rows.


Memtables

Memtables stand for Memory Tables. It is an in-memory data structure that holds data before it is flushed to disk as an SSTable. A Memtable is basically a hash map.


Memtables have several advantages:

  • They are fast to write and read, since they only use memory operations without involving disk I/O.
  • They reduce write amplification, since they consolidate multiple writes to the same key before flushing them to disk.
  • They improve write throughput, since they allow concurrent writes without locking or blocking.

Memtables also have some disadvantages:

  • They are volatile, since they can be lost in case of power failure or system crash. To ensure durability, Memtables are backed up by a write-ahead log (WAL) that records all mutations before applying them to the Memtable.
  • They are limited by memory size, since they cannot store more data than the available memory. To prevent memory exhaustion, Memtables are flushed to disk when they reach a certain size or when a flush is triggered manually or by other factors.



SSTables

SSTables stands for Sorted Strings Table. It is a persistent file format that stores data on disk in a sorted and immutable way. An SSTable consists of multiple data blocks, each containing a set of key-value pairs. The keys are sorted in ascending order within each block, and the blocks are sorted by their first key. Each SSTable also has an index file that maps each key to its corresponding block location, allowing for fast lookups.


SSTables are created when data stored in memory (in Memtables) is flushed to disk periodically or when a certain memory threshold is reached. Once an SSTable is written to disk, it cannot be modified. Therefore, any updates or deletes on existing data are stored in a new SSTable. This means that there can be multiple SSTables for the same data, with different versions or timestamps.


SSTables have several advantages:

  • They are fast to write, since they only append data to disk without overwriting existing data.
  • They are efficient to read, since they use binary search to locate keys within blocks and index files to locate blocks within SSTables.
  • They are easy to merge, since they are already sorted by key. This allows for efficient garbage collection and compaction of old or redundant SSTables. Merging SSTables can be done in O(n) time complexity using the merge step of merge sort.

SSTables also have some disadvantages:

  • They consume more disk space, since they store multiple versions of the same data and do not delete obsolete data until compaction.
  • They require more memory, since they need to keep track of all the existing SSTables and their index files.
  • They increase read latency, since they need to check multiple SSTables for the most recent version of a key.



Log Structured Merge Tree

An LSM tree is a tree or collection of Memtable and SSTables. The topmost level consists of a single Memtable. The second Level and below are one or more SSTables. As the levels grow so does the amount of SSTables or sizes of SSTables (this depends on the compaction strategy). The basic workflow is as follows:

Write

When data is inserted into the LSM tree, the following happens.

  • Write data to Memtable and Write Ahead Log
  • Once the Memtable threshold is reached (size or time), flush the memtable to create an SSTable at level 1.
  • Once there are too many SSTables in Level L, combine two or more SSTables to create a new SSTable at Level L+1 in a process called compaction. The two tables that were combined are removed.

Read

  • When a read operation arrives, it is first checked in the current Memtable. If the key is not found or if the value is outdated, it is then checked in the recently created SSTables in decreasing order of creation time until the most recent value is found or until all SSTables are exhausted.
  • The number of levels we must iterate over to construct a row has a direct effect on the performance of a read query.



Compaction

As the number of SSTables grows, we see the following trends

  • Wasted space on disk - each SSTable holds a set of updates or mutations to the rows. Some of these updates are redundant since a newer SSTable holds the most recent value.
  • Read performance drops as the number of SSTables we need to read increases.


Compaction helps to reduce these problems by merging two or more SSTables together. The resulting SSTable will only hold the newest copy of data. Since SSTables are sorted this process of merging is very efficient. Note here that we never alter an existing SSTable, as said before they are immutable. This means we can do compaction lock-free while the system is still using the SSTables.


There are many different compaction strategies. Size Tiered and Leveled compaction are two of the more popular strategies:


  • Size Tiered Compaction:
    • In Size Tiered Compaction Strategy the size of SSTable grows as we go down the levels. Simply once we have a fixed number of SSTables in level 1, they are all combined to create a larger SSTable in level 2. The SSTables in level 1 are deleted. When there is a predefined number of SSTables collected in level 2, they are again combined/compacted to create an even larger SSTable in level 3.


  • Level Compaction
    • Level compaction ensures that each key appears only once in a given level. When compaction occurs, an SSTable from level L-1 is merged with all the SSTables from level L that have a common key range. The number of SSTables in each level grows exponentially. This is based on LevelDB.
    • LevelDB is a useful resource to learn more about LSM. The term DB is misleading, as LevelDB is not a database but a library that provides LSM functionality.


Compaction is a complex topic that we will not cover in detail in this article, but it is important to note that the choice of compaction strategy can affect the database performance significantly.



Conclusion

Log Structured Merge Tree, Memtables and SSTables are common database storage concepts that are used by some NoSQL databases, such as Apache Cassandra and ScyllaDB, to store data on disk and in memory, respectively. They offer a trade-off between performance and durability, while also enabling scalability and fault tolerance in distributed systems. However, they also pose some challenges that require careful design and implementation choices.



References

1: MemtableSSTable - CASSANDRA2 - Apache Software Foundation 2: Storage Engine | Apache Cassandra Documentation 3: Memtable & SSTable (Sorted String Table) | Mauricio Poppe 4: What is a SSTable? Definition & FAQs | ScyllaDB

5: https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/


The lead image for this article was generated by HackerNoon's AI Image Generator via the prompt "a tree with computer fruits"