What is Data Modeling in Cassandra?

Data in Apache Cassandra is stored in tables as rows and columns, and the Cassandra Query Language (CQL) is used to read and write those records. Data modeling is the process of identifying the entities and the relationships between them. Cassandra's data model differs from the relational one. A relational database stores data in normalized form and uses primary keys and foreign keys to relate the data in one table to the data in another.

Cassandra data modeling is query-driven: the tables are designed around the queries the application will run, so that each query can read everything it needs from a single table. A relational database combines several tables with joins at query time. Cassandra does not support joins, so all the columns a query needs must be stored together in one table, even if that means duplicating data across tables.
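As a minimal sketch of this query-first idea, the statements below design a table around a single question. The table orders_by_customer, its columns, and the UUID value are hypothetical names chosen for illustration, and the sketch assumes a keyspace has already been selected.

```cql
-- Assumes a keyspace has already been selected with USE.
-- Query to support: "list all orders placed by one customer, newest first".
CREATE TABLE IF NOT EXISTS orders_by_customer (
  customer_id   uuid,
  order_time    timestamp,
  order_id      uuid,
  customer_name text,      -- copied from the customer entity instead of joined
  order_total   decimal,
  PRIMARY KEY ((customer_id), order_time, order_id)
) WITH CLUSTERING ORDER BY (order_time DESC, order_id ASC);

-- The whole answer comes from one partition of one table; no join is needed.
SELECT order_id, order_time, customer_name, order_total
FROM orders_by_customer
WHERE customer_id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2;
```

The customer name is stored with every order on purpose: duplicating it is the price paid for answering the query from one table.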

The main goal of Cassandra data modeling is a well-organized, high-performance cluster. The following best practices help to achieve it.

  1. Cassandra is not a relational database, so do not model or use it like one.
  2. Design the data model around three data distribution goals.
    • Data should be spread evenly across the nodes of the cluster.
    • The size of each partition should be kept within a limit.
    • The number of partitions a query has to read should be minimized.
  3. The primary key plays an important role, so choose it carefully.
  4. Design the data model based on the queries.
  5. Test the model to verify that it performs well.
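The three distribution goals can be sketched with a composite partition key. The table readings_by_sensor_day and its columns are hypothetical, and the sketch assumes a keyspace has already been selected. Adding the date to the partition key is one common way to bound partition size, not the only one.

```cql
-- Assumes a keyspace has already been selected with USE.
CREATE TABLE IF NOT EXISTS readings_by_sensor_day (
  sensor_id    text,
  reading_date date,
  reading_time timestamp,
  reading      double,
  -- (sensor_id, reading_date) together form the partition key:
  --   many distinct key values spread the data across the nodes, and
  --   each partition holds at most one day of readings for one sensor.
  PRIMARY KEY ((sensor_id, reading_date), reading_time)
);

-- Both partition key columns are given, so the query reads exactly one partition.
SELECT reading_time, reading
FROM readings_by_sensor_day
WHERE sensor_id = 'sensor-42' AND reading_date = '2024-01-15';
```

With sensor_id alone as the partition key, a busy sensor would keep growing a single partition without limit.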

Structured Approach to Design Data Model in Cassandra

The following big data modeling methodology can be followed to create a high-performing Cassandra data model.

a. Data Discovery (DD)

This is an initial, high-level view of the data in which we identify the entities, their attributes, and their identifiers. This stage may be repeated as the data model develops.

b. Identify the Access Patterns (AP)

In this stage, we identify the queries the application needs to fetch its data, and answer related questions: which data is required together, what the search criteria are, what the update patterns look like, and so on. Depending on the answers, this stage may also be repeated during data modeling.

c. Map data and queries (MDQ)

In this stage, combine the results of steps a and b: map the data to the queries to create a logical design of each table. The outcome is a high-level design representation of the Cassandra tables.

d. Create the physical tables (PT)

Now use the CQL CREATE TABLE command to create the tables from the logical design.

e. Review and Refine physical data model

Review the physical data model and refine it as the requirements dictate.

The following diagram shows the flow of Cassandra Data Modeling.

[Figure: Cassandra data model design flow]
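The steps above can be sketched for a single entity as follows. The employee entity, the access pattern, and the table employees_by_department are all assumptions made for illustration, and the sketch assumes a keyspace has already been selected.

```cql
-- Assumes a keyspace has already been selected with USE.
-- a. Data discovery: entity "employee" with attributes emp_id, full_name,
--    department, hire_date; emp_id is the identifier.
-- b. Access pattern: "list the employees of one department, ordered by hire date".
-- c. Mapping: department is the search criterion -> partition key;
--    hire_date gives the order              -> clustering column;
--    emp_id keeps rows unique               -> last clustering column.
-- d. Physical table:
CREATE TABLE IF NOT EXISTS employees_by_department (
  department text,
  hire_date  date,
  emp_id     uuid,
  full_name  text,
  PRIMARY KEY ((department), hire_date, emp_id)
);

-- e. Review: run the query from step b and check that it reads one partition.
SELECT full_name, hire_date
FROM employees_by_department
WHERE department = 'engineering';
```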

Data Model Components of Cassandra

Cassandra data modeling is different from relational database modeling. The major components of the Cassandra data model, which together define how data is stored, are keyspaces, tables, and columns.

Let us see the Cassandra Data Model components in detail in the following section.

1. Keyspaces

A Cassandra keyspace is similar to a schema in a relational database: it is the container that holds the tables.

Some important features of a Cassandra keyspace are listed below.

  • There is no default keyspace in Cassandra, so a keyspace must be created before any table is created.
  • A keyspace can contain many tables, but each table belongs to exactly one keyspace. This is a one-to-many relationship.
  • Replication is defined at the keyspace level. A replication factor of 4 means that 4 copies of the data are kept.

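The following command creates a keyspace named clouddugguDB with a replication factor of 4, which means 4 copies of the data are kept. Because the name is not enclosed in double quotes, Cassandra treats it as case-insensitive and stores it in lowercase as cloudduggudb.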

CREATE KEYSPACE clouddugguDB WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 4 };
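For clusters that span data centers, replication is usually configured per data center instead. The following is a sketch only: the keyspace name cloudduggu_multi_dc and the data center name dc1 are assumptions, and dc1 has to match a data center name that the cluster actually reports.

```cql
-- Hypothetical keyspace; 'dc1' must match a data center name known to the cluster.
CREATE KEYSPACE IF NOT EXISTS cloudduggu_multi_dc
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3 };

-- Select the keyspace so that later statements can use unqualified table names.
USE cloudduggu_multi_dc;

-- Inspect the stored definition (a cqlsh command).
DESCRIBE KEYSPACE cloudduggu_multi_dc;
```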

The following figure shows a high-level overview of a Cassandra keyspace and its column families.

[Figure: Cassandra keyspace schematic view]

2. Column Family

A Cassandra column family is a collection of sorted rows that stores related data. Each row in a column family is an ordered collection of columns.

The following are attributes of a Cassandra column family.

2.1 Key Cache

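The key cache holds, per column family, the on-disk locations of keys. The key cache is enabled by default. In early Cassandra releases the default limit was 200,000 keys per column family; later releases size the key cache globally in cassandra.yaml.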

2.2 Row Cache

The row cache can provide substantial performance benefits by holding a subset of frequently accessed data in memory. It is most useful when a row or a set of columns is read very often, that is, when it is considered hot data.

In the following figure, both the row cache and the key cache are configured. If the requested data is not present in the row cache, the key cache is checked to find the location of the data on disk.

[Figure: Cassandra row cache and key cache flow]
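In recent Cassandra releases, both caches are controlled per table through the caching option. The sketch below uses a hypothetical table hot_products and assumes a keyspace has already been selected. It also assumes the row cache has been given a non-zero size in cassandra.yaml; otherwise the rows_per_partition setting has no effect.

```cql
-- Hypothetical table; assumes a keyspace has already been selected with USE.
CREATE TABLE IF NOT EXISTS hot_products (
  product_id uuid PRIMARY KEY,
  title      text,
  price      decimal
) WITH caching = { 'keys' : 'ALL', 'rows_per_partition' : '10' };
-- 'keys' : 'ALL'              -> cache the disk location of every key that is read
-- 'rows_per_partition' : '10' -> keep up to 10 rows of each partition in the row cache

-- Turn the row cache off again for this table while keeping the key cache.
ALTER TABLE hot_products
  WITH caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' };
```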

3. Tables

Cassandra tables hold the actual data in the form of rows and columns. Every table must be created with a primary key, and the primary key cannot be altered after the table has been created. To change it, create a new table with the new primary key and copy the existing data into it. The primary key is used to locate and order the data.

The following are some important features of Cassandra tables.

  • A table is also referred to as a column family and can have many rows and columns.
  • Each table must have a primary key defined, which is used for data access and sorting.
  • An unquoted table name must begin with a letter and may contain only letters, digits, and underscores.

The following command creates a table named emp_data in the clouddugguDB keyspace.

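CREATE TABLE clouddugguDB.emp_data (
  first_name text,
  last_name text,
  birthday timestamp,
  nationality text,
  -- A primary key is mandatory; this choice of key columns is an assumption for illustration.
  PRIMARY KEY (last_name, first_name)
);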

4. Columns

A Cassandra column stores a single piece of data. A column can hold various data types such as bigint, double, text, float, and boolean. Each column value has a timestamp associated with it that records when the value was last written. Cassandra also provides collection column types: list, set, and map.
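The column types and the write timestamp can be tried out with a small sketch. The table user_profiles, its columns, and the inserted values are hypothetical, and the sketch assumes a keyspace has already been selected.

```cql
-- Hypothetical table; assumes a keyspace has already been selected with USE.
CREATE TABLE IF NOT EXISTS user_profiles (
  user_id  uuid PRIMARY KEY,
  age      int,
  balance  double,
  active   boolean,
  emails   set<text>,          -- unordered, no duplicates
  phones   list<text>,         -- ordered, duplicates allowed
  settings map<text, text>     -- key-value pairs
);

INSERT INTO user_profiles (user_id, age, balance, active, emails, phones, settings)
VALUES (5b6962dd-3f90-4c93-8f61-eabfa4a803e2, 30, 120.5, true,
        {'a@example.com'}, ['555-0100'], {'theme' : 'dark'});

-- WRITETIME returns the timestamp stored with a regular column value,
-- expressed in microseconds since the Unix epoch.
SELECT age, WRITETIME(age)
FROM user_profiles
WHERE user_id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2;
```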

5. UUID and TimeUUID

UUID and TimeUUID are Cassandra column types used to store unique identifiers, with or without an embedded timestamp. UUID stands for Universally Unique Identifier. It is a 128-bit value that plays a role similar to a sequence number in a relational database. An example of a UUID is 01234567-0123-0123-0123-0123456789ed. A TimeUUID is a version 1 UUID that embeds a timestamp measured in 100-nanosecond intervals, so its values do not repeat and can be sorted by time. An example of a TimeUUID is d2166aa0-ebb2-11ae-a472-001b669c56e1.
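Both types can be generated by built-in CQL functions, as the following sketch shows. The table events_by_user and its columns are hypothetical, and the sketch assumes a keyspace has already been selected.

```cql
-- Hypothetical table; assumes a keyspace has already been selected with USE.
CREATE TABLE IF NOT EXISTS events_by_user (
  user_id  uuid,
  event_id timeuuid,
  payload  text,
  PRIMARY KEY ((user_id), event_id)   -- rows in a partition are sorted by event time
);

-- uuid() generates a random UUID; now() generates a TimeUUID for the current time.
INSERT INTO events_by_user (user_id, event_id, payload)
VALUES (uuid(), now(), 'login');

-- toTimestamp() extracts the timestamp embedded in a TimeUUID.
SELECT user_id, event_id, toTimestamp(event_id) AS created_at, payload
FROM events_by_user;
```

The generated identifiers differ on every run, so the output of the SELECT is not reproduced here.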

Cassandra Data Model Vs RDBMS Data Model

The Cassandra data model differs from the RDBMS data model. The following table lists the main differences between the two.

Cassandra | RDBMS
Cassandra handles unstructured data. | RDBMS handles structured data.
Cassandra provides a flexible schema to store the data. | RDBMS provides a fixed schema to store the data.
In Cassandra, the keyspace is the outermost container and holds the tables. | In RDBMS, the database is the outermost container and holds the tables.
In Cassandra, tables are represented as nested key-value pairs. | In RDBMS, tables are represented as arrays of rows and columns.
In Cassandra, entities are represented through tables and columns. | In RDBMS, entities are represented through tables.
In Cassandra, relationships are represented with collections. | In RDBMS, relationships are represented with foreign keys and joins.
In Cassandra, a column is a unit of storage. | In RDBMS, a column represents an attribute of the table.
In Cassandra, a row is the unit of replication. | In RDBMS, a row holds the actual data of a record.