Apache Cassandra Architecture: How Does Cassandra Work?

Cassandra's architecture is built on one premise: hardware and software failures can and do happen, so mechanisms must be in place to deal with these scenarios without affecting the system's output.

How Cassandra Works

Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure. All nodes in the cluster are treated the same; there is no master-slave structure. Data is replicated among the nodes, with two or more replicas kept (as configured) to ensure data availability if any node goes down.

Nodes in a Cassandra cluster form a peer-to-peer distributed system to cope with failure; the nodes keep exchanging information every second to track each other's availability.

Data can be written to and read from any node in the cluster. A sequentially written commit log is maintained on every node to capture write activity and ensure data durability. Data is then indexed and written to an in-memory structure called a memtable, which resembles a write-back cache. Once the memtable is full, its contents are flushed to disk as an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Through a process called compaction, Cassandra periodically consolidates SSTables, discarding obsolete data and tombstones (markers indicating that data was deleted).
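The write path above (commit log, then memtable, then flush to SSTable, with periodic compaction) can be sketched with a toy in-memory model. This is an illustration only: the class and method names are invented, not Cassandra's actual internals, and real compaction also honors tombstone grace periods.

```python
# Toy sketch of Cassandra's write path (illustrative names, not real internals).
class ToyNode:
    def __init__(self, memtable_limit=3):
        self.commit_log = []        # sequential log written first, for durability
        self.memtable = {}          # in-memory, write-back-cache-like structure
        self.sstables = []          # immutable, key-sorted "files" (simulated)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value             # then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is an immutable file of rows sorted by key.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Merge all SSTables into one; newer values win. Real compaction
        # also drops obsolete data and expired tombstones.
        merged = {}
        for table in self.sstables:
            merged.update(dict(table))
        self.sstables = [sorted(merged.items())]

    def read(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            row = dict(table)
            if key in row:
                return row[key]
        return None
```

Note how a read consults the memtable before any SSTable, and newer SSTables before older ones; compaction exists precisely so that reads do not have to scan an ever-growing list of files.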

Reads and Writes in Cassandra

All nodes in the cluster are treated the same: a client can connect to any node, and that node acts as the coordinator between the client application and the cluster. The coordinator examines the client's request and determines which nodes in the cluster should fulfill it, based on how the cluster was configured.

Cassandra also provides a shell-based interface; a user can connect to any node and perform operations through it. CQL, the Cassandra Query Language, is used to create, update, and delete keyspaces, column families, and rows in the cluster. For simplicity, CQL has an SQL-like syntax.

Cassandra Components

A Cassandra cluster has a number of components that together make a reliable, consistent system for heavy data operations.

1) Node

A node is the basic component of the Cassandra architecture; it is where data is stored and queried.

2) Data Center

A data center is a collection of nodes; it can be a physical or a virtual data center. Data is written to multiple data centers based on the replication settings defined at keyspace creation.

3) Cluster

A cluster contains one or more data centers. It can span physical locations.

4) Commit Log

Each node in the cluster maintains a commit log; data is first written to this log for durability and later flushed to SSTables. Once flushed, commit log segments can be archived, deleted, or recycled.

5) Table

These are the tables queried against, also called column families. Every table has a primary key, which may also include clustering keys.

6) SSTable

A sorted string table (SSTable) is an immutable data file to which Cassandra periodically writes memtables. SSTables are append-only, stored sequentially on disk, and maintained per table.

Cassandra configuration factors

A number of background mechanisms and configuration factors keep all of this functioning:

1) Gossip

It is a peer-to-peer communication protocol through which nodes exchange state information about themselves and about the other nodes they know of, keeping every node's view of data location up to date. Each node also persists a local copy of this gossip state, so it can be used immediately when the node restarts.
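The core idea of gossip reconciliation can be shown with a small sketch. This is a simplification of my own, not Cassandra's actual gossip protocol: each node tracks a heartbeat version for every peer, and when two nodes gossip, the higher (newer) version wins for each entry.

```python
# Toy gossip reconciliation (illustrative; real gossip also carries
# application state such as load, schema version, and status flags).
def gossip_exchange(view_a, view_b):
    """Merge two nodes' views of the cluster; newest heartbeat wins."""
    merged = dict(view_a)
    for node, heartbeat in view_b.items():
        if heartbeat > merged.get(node, -1):
            merged[node] = heartbeat
    return merged

# node1 last saw node3 at heartbeat 5; node2 has a fresher view (8).
node1_view = {"node1": 10, "node2": 7, "node3": 5}
node2_view = {"node2": 9, "node3": 8}
shared = gossip_exchange(node1_view, node2_view)
# After the exchange, both nodes agree node3 is at heartbeat 8.
```

Because every exchange only ever moves versions forward, repeated pairwise gossip rounds converge all nodes to the same view without any central coordinator.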

2) Partitioner

A partitioner is a hash function that computes a token from a partition key. It determines how data is distributed across the nodes in the cluster and on which node the first copy of the data is placed.
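A minimal sketch of the idea, assuming a made-up 16-bit token ring rather than Cassandra's actual Murmur3Partitioner: the partition key is hashed to a token, and the token falls into the range owned by exactly one node.

```python
# Toy partitioner sketch (NOT Cassandra's real partitioner or token range).
import hashlib

RING_SIZE = 2 ** 16  # hypothetical small token space for illustration

def token_for(partition_key):
    """Deterministically hash a partition key to a token on the ring."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:2], "big") % RING_SIZE

def owner_node(token, node_tokens):
    """Return the first node (walking up the ring) whose token >= this token."""
    for node, node_token in sorted(node_tokens.items(), key=lambda kv: kv[1]):
        if token <= node_token:
            return node
    # Wrapped past the highest token: the node with the lowest token owns it.
    return min(node_tokens, key=node_tokens.get)

# Four nodes, each owning a quarter of the ring.
tokens = {"A": 16384, "B": 32768, "C": 49152, "D": 65535}
```

The same key always hashes to the same token, so every coordinator independently computes the same owner without consulting any central directory.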

3) Replication factor

The replication factor determines how many copies of the data are kept across the cluster. A replication factor of 1 means only a single copy of each row exists in the cluster. The replication factor should be at least 2 and no more than the number of nodes. All replicas across the cluster are equally important; there is no primary or secondary replica.

4) Replication strategies

Replication factor and replication strategy are both specified at keyspace creation. The factor determines the number of replicas of the same data across the cluster; the strategy determines where those replicas are placed. There are two replication strategies:

a) SimpleStrategy: This should be used when the cluster consists of a single data center.

b) NetworkTopologyStrategy: This is the recommended strategy when you plan to run a multi-data-center cluster.