Apache Cassandra Performance Benchmark on a large data set

Cassandra Performance Benchmark Results on Large Data Set (in Terabytes)

Hi,

This post is about the performance of the NoSQL database Apache Cassandra. Cassandra is one of the most popular massively scalable NoSQL databases, with read/write support across multiple data centers. The post looks at how Cassandra performs on read, write and update operations as the amount of data in the cluster grows. Various kinds of workloads (read, write, update, mixed) were applied to a Cassandra cluster to check how performance is affected by the increase in data size.

This benchmark was performed on a 6-node Cassandra cluster with the help of YCSB (Yahoo! Cloud Serving Benchmark). Three types of workloads (write only, read only, mixed) were applied to this cluster.

Write Only Workload: This workload writes a fixed number of records to the Cassandra cluster with the help of YCSB.

Read Only Workload: This workload reads a fixed number of records from the Cassandra cluster with the help of YCSB.

Mixed Workload: This workload combines read and write operations on the Cassandra cluster: 50% of the operations are reads and 50% are writes, performed simultaneously.
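
The three workloads map naturally onto the proportion settings of YCSB's core workload. The sketch below renders one property file per workload. The record and operation counts are placeholders (the post does not state them), and modelling the write half as inserts rather than updates is an assumption. The `workload=` class line is left out because its package name depends on the YCSB version.

```python
# Sketch: render YCSB core-workload property files for the three workloads.
# RECORD_COUNT and OPERATION_COUNT are placeholders, not values from the benchmark.

RECORD_COUNT = 100_000      # hypothetical
OPERATION_COUNT = 100_000   # hypothetical

# (read, insert) proportions per workload; writes are modelled as inserts (assumption).
WORKLOADS = {
    "write_only": (0.0, 1.0),
    "read_only": (1.0, 0.0),
    "mixed": (0.5, 0.5),
}


def workload_properties(name):
    """Return the property-file text for one of the workloads above."""
    read, insert = WORKLOADS[name]
    props = {
        "recordcount": RECORD_COUNT,
        "operationcount": OPERATION_COUNT,
        "readproportion": read,
        "insertproportion": insert,
        "updateproportion": 0.0,
        "scanproportion": 0.0,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    for name in WORKLOADS:
        print(f"# {name}")
        print(workload_properties(name))
        print()
```

Each rendered text can be saved to a file and passed to YCSB as a workload file. The proportions in each workload add up to 1, which is what the core workload expects.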

Cassandra Cluster Details:

     No. of nodes in Cassandra cluster : 6
     Replication factor : 3
     Consistency level : QUORUM (2 of 3 replicas in this case)
     Avg. record size : 380 KB
     No. of YCSB clients (instances) : 1
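
The "2 in this case" for the consistency level follows from Cassandra's quorum rule, floor(RF / 2) + 1. A minimal sketch of that arithmetic (the function names are illustrative, not part of any Cassandra API):

```python
def quorum(replication_factor):
    """Replicas that must acknowledge a QUORUM request: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor):
    """Replicas that can be unavailable while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    rf = 3
    print(quorum(rf))                      # 2
    print(tolerated_replica_failures(rf))  # 1
```

With a replication factor of 3, every read and write waits for 2 replicas, and one replica per record can be down without failing requests. Since 2 + 2 > 3, a QUORUM read always overlaps with the latest QUORUM write on at least one replica.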

Hardware Configurations:

A total of 6 machines were used, each with the following hardware configuration:

     RAM : 32 GB
     No. of cores : 8
     Hard disk : 2 TB
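
To put the data sizes used later in context, here is a rough capacity calculation for this cluster. It is only a sketch: it assumes decimal units (1 TB = 1000 GB), ignores compaction headroom and other overhead, and the post does not say whether the reported data sizes are counted before or after replication, so the second function treats its input as the pre-replication size.

```python
NODES = 6
DISK_PER_NODE_GB = 2000   # 2 TB per node, decimal units (assumption)
REPLICATION_FACTOR = 3


def unique_capacity_gb():
    """Upper bound on unique data the cluster can hold, ignoring all overhead."""
    return NODES * DISK_PER_NODE_GB / REPLICATION_FACTOR


def per_node_load_gb(unique_data_gb):
    """Average on-disk data per node, if unique_data_gb is the pre-replication size."""
    return unique_data_gb * REPLICATION_FACTOR / NODES


if __name__ == "__main__":
    print(unique_capacity_gb())     # 4000.0
    print(per_node_load_gb(800))    # 400.0
```

Under these assumptions the largest data set in the benchmark would still occupy well under a quarter of each node's disk.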

Benchmark Statistics:

Write performance (write-only workload): This shows how write throughput (number of write operations performed per second) behaves as the data size (number of records present) in the Cassandra cluster increases.

  • Write Throughput vs Data Size: Here the write workload was applied to the 6-node Cassandra cluster with a replication factor of 3, i.e. each record was written to 3 nodes. A fixed number of write operations was performed using 1 YCSB client. This process was repeated as the data in the cluster grew. The aim was to check how much write performance is affected when inserting data into a cluster that holds about 1 GB of data versus one that holds about 1 TB (terabyte): does write throughput increase, decrease or remain constant as the total amount of data in the cluster grows? The graph below shows write throughput (write operations per second) against the total data present in the 6-node Cassandra cluster.
Here the x axis shows the data size present in the Cassandra cluster before applying the write workload. The y axis represents the write throughput (number of write ops per second) for each data set size.
Result: The graph shows that as the data in the Cassandra cluster increases, the number of write operations per second (write throughput) decreases. It was 61 write ops per second when 36 GB of data was present in the cluster, and it decreased to 20 write ops per second when 800 GB of data was present in the same cluster. Note that each record was 380 KB and the replication factor was 3 with a consistency level of QUORUM.
  • Write Latency vs Data Size: This shows the behavior of latency (average time taken to process each record) during the write operations above. The graph below shows how the time taken by each write varies with the data present in the Cassandra cluster.
Here the x axis shows the data size present in the Cassandra cluster before applying the write workload.
The y axis represents the write latency (average time taken by each record) for each data set size.
Result: The graph shows that as the data in the Cassandra cluster increases, write latency increases, i.e. with more data present in the cluster, each record takes longer to complete its write than when little data is present. It was 694 ms (about 0.7 second) when 36 GB of data was present in the cluster, and it increased to 1967 ms (about 2 seconds) when the data grew to 800 GB on the same cluster. Note that each record was 380 KB and the replication factor was 3 with a consistency level of QUORUM.
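
The write throughput and latency figures above can be cross-checked against each other. The sketch below applies Little's law (average operations in flight = throughput × average latency) and also converts throughput into payload bandwidth using the 380 KB record size. It assumes the reported latency is the per-operation time seen by the YCSB client and uses decimal units (1 MB = 1000 KB); the function names are illustrative.

```python
RECORD_SIZE_KB = 380

# (data size in GB, throughput in ops/s, average latency in ms), from the results above
WRITE_POINTS = [(36, 61, 694), (800, 20, 1967)]


def ops_in_flight(throughput_ops, latency_ms):
    """Little's law: average concurrency = throughput * average latency."""
    return throughput_ops * latency_ms / 1000.0


def payload_mb_per_s(throughput_ops, replicas=1):
    """Payload bandwidth in MB/s (1 MB = 1000 KB), optionally counting every replica."""
    return throughput_ops * RECORD_SIZE_KB * replicas / 1000.0


if __name__ == "__main__":
    for size_gb, ops, latency_ms in WRITE_POINTS:
        print(
            f"{size_gb} GB: {ops_in_flight(ops, latency_ms):.1f} ops in flight, "
            f"{payload_mb_per_s(ops):.2f} MB/s from the client, "
            f"{payload_mb_per_s(ops, replicas=3):.2f} MB/s written across replicas"
        )
```

Both data sizes come out at roughly 40 operations in flight (about 42 and 39). That would be consistent with a fixed number of client threads, with throughput falling simply because each write takes longer, although the thread count for the write runs is not stated here.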

 

Read performance (read-only workload): This shows how reads are affected by data size, i.e. the impact of the data present in the Cassandra cluster on read throughput (number of reads per second) and latency (average time taken by each read operation to complete).

  • Read Throughput vs Data Size: This shows how read throughput (number of read operations performed per second) behaves as the data size (or number of records) on the 6-node Cassandra cluster increases. A fixed number of read operations was performed using 1 YCSB client with 100 threads. This read workload was applied multiple times, each time after inserting more data into the cluster, so that the reads ran against a larger data set each time. The aim was to check how much read performance is affected when reading from a cluster that holds about 1 GB of data versus one that holds about 1 TB (terabyte): does read throughput increase, decrease or remain constant as the total amount of data in the cluster grows? The graph below shows read throughput (read operations per second) against the total data present in the 6-node Cassandra cluster.
Here the x axis shows the data size (in GB) present in the Cassandra cluster when applying the read workload.
The y axis represents the read throughput (number of read ops per second) for each data set size.
Result: The graph shows that as the data in the Cassandra cluster increases, the number of read operations per second (read throughput) decreases. It was 188 read ops per second when 36 GB of data was present in the cluster, and it decreased to 36 read ops per second when 761 GB of data was present in the same cluster.
In the above graph, we can see that performance sometimes increased a little as data grew (it was 82 reads per sec on 253 GB of data and 85 reads per sec on 290 GB of data). This is most likely due to:
(1) Compaction, which runs in the background and reduces the number of SSTables in the cluster
(2) Caching, which also plays an important role in read operations
(3) The small difference in data size between 2 consecutive read workloads (290 - 253 = 37 GB), i.e. only 37 GB of data was inserted into the cluster before applying the next read workload, so some data was read from cache.
Note that each record was 380 KB and the replication factor was 3 with a consistency level of QUORUM.
  • Read Latency vs Data Size: This shows the behavior of latency (average time taken by each read to complete) during the read operations above. The graph below shows how the time taken by each read varies with the data present in the Cassandra cluster.
Here the x axis shows the data size (in GB) present in the Cassandra cluster when applying the read workload.
The y axis represents the read latency (average time taken by each read to complete) for each data set size.
Result: The graph shows that as the data in the Cassandra cluster increases, read latency increases, i.e. with more data present in the cluster, each record takes longer to complete its read than when little data is present. It was 265 ms (about 0.27 second) when 36 GB of data was present in the cluster, and it increased to 689 ms (about 0.69 second) when the data grew to 761 GB on the same cluster. The reason it decreases a little for some workloads is that the difference in data size between 2 consecutive read workloads was small (e.g. 290 - 253 = 37 GB), i.e. only 37 GB of data was inserted into the cluster before applying the next (4th) read workload, so some data was read from cache.
Note that each record was 380 KB and the replication factor was 3 with a consistency level of QUORUM.
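
The small bump in read throughput between 253 GB and 290 GB can be picked out mechanically. The sketch below lists every step between consecutive measurements where throughput rose instead of fell. It uses only the four data points quoted in the text (the graph itself contains more), and the function name is illustrative.

```python
# (data size in GB, read throughput in ops/s), as quoted in the results above
READ_THROUGHPUT = [(36, 188), (253, 82), (290, 85), (761, 36)]


def throughput_increases(points):
    """Return (from_gb, to_gb, delta_ops) for each step where throughput rose."""
    ordered = sorted(points)  # sort by data size
    rises = []
    for (gb_a, ops_a), (gb_b, ops_b) in zip(ordered, ordered[1:]):
        if ops_b > ops_a:
            rises.append((gb_a, gb_b, ops_b - ops_a))
    return rises


if __name__ == "__main__":
    for gb_a, gb_b, delta in throughput_increases(READ_THROUGHPUT):
        print(f"{gb_a} GB -> {gb_b} GB: +{delta} ops/s after adding {gb_b - gb_a} GB")
```

On these points the only rise is the 3 ops/s gain after adding 37 GB, which is small compared with the overall downward trend.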

 

Performance on mixed workload (reads + writes simultaneously): This shows the performance of Cassandra under a mixed workload (reads and writes occurring simultaneously) as the data in the cluster increases. It shows how throughput behaves when mixed operations are performed on a continuously growing data set in the Cassandra cluster. This is a common scenario in most web applications, where some clients are performing transactions while others are busy reading data.
  • Mixed (50% write + 50% read) throughput vs Data size: This shows how mixed throughput (number of read + write operations performed per second) behaves as the data size (or number of records) on the 6-node Cassandra cluster increases. A fixed number of read + write operations was performed using 1 YCSB client with 100 threads. This workload was applied multiple times, each time after inserting more data into the cluster, so that the reads and writes ran against a larger data set each time. The aim was to check how much read + write performance is affected on a cluster that holds about 1 GB of data versus one that holds about 1 TB (terabyte): does read + write throughput increase, decrease or remain constant as the total amount of data in the cluster grows? The graph below shows this variation.
Here the x axis shows the data size (in GB) present in the Cassandra cluster when applying the mixed workload.
The y axis represents the throughput (mixed ops per sec) for each data set size.
Result: The graph shows that as the data in the Cassandra cluster increases, throughput decreases. It was 73 mixed ops per sec when 72 GB of data was present in the cluster, and it decreased to 41 mixed ops per sec when the data grew to 761 GB on the same cluster.
Note that each record was 380 KB and the replication factor was 3 with a consistency level of QUORUM.
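
To compare how strongly each workload degraded, the start and end throughput figures from the three Result paragraphs can be turned into percentage drops. This is only a rough sketch: the ranges are not identical (the mixed runs start at 72 GB, the others at 36 GB, and the write runs end at 800 GB rather than 761 GB), so the percentages are indicative rather than strictly comparable.

```python
# workload -> ((start_gb, start_ops), (end_gb, end_ops)), from the Result paragraphs
ENDPOINTS = {
    "write": ((36, 61), (800, 20)),
    "read": ((36, 188), (761, 36)),
    "mixed": ((72, 73), (761, 41)),
}


def percent_drop(start_ops, end_ops):
    """Relative throughput loss between two measurements, in percent."""
    return 100.0 * (start_ops - end_ops) / start_ops


if __name__ == "__main__":
    for name, ((gb_a, ops_a), (gb_b, ops_b)) in ENDPOINTS.items():
        drop = percent_drop(ops_a, ops_b)
        print(f"{name}: {ops_a} -> {ops_b} ops/s ({gb_a} GB -> {gb_b} GB), {drop:.1f}% drop")
```

On these endpoints reads lose the most throughput (about 81%), writes about 67%, and the mixed workload about 44%.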

 

Overall performance (previous charts combined): This shows throughput and latency behavior as the data size (or number of records) in Cassandra increases, for all three types of workloads (read, write and mixed, shown individually in the previous graphs), now in a single graph.
  • Overall throughput vs data size: Shows the throughput variation of each type of workload with data size. The graph below simply combines the output of the previous graphs so that the throughput variation for all types of workloads (read, write and mixed) can be seen in a single view.
Here the x-axis represents the data size (in GB) and the y-axis represents the throughput (number of ops per sec).
The blue bar represents the read workload
The yellow bar represents the write workload
The red bar represents the mixed workload
  • Overall latency vs data size: Shows the latency variation of each type of workload with data size. The graph below simply combines the output of the previous graphs so that the latency variation for the read and write workloads can be seen in a single view.
Here the x-axis represents the data size (in GB) and the y-axis represents the latency (time taken by each operation to complete).
The blue bar represents the read workload
The yellow bar represents the write workload
  • Overall throughput vs data size (Line Graph): Shows the throughput variation of each type of workload with data size as a line graph. The graph below simply combines the output of the previous graphs into a single line graph so that the throughput of the different workloads (read, write and mixed) can be compared in one view.
Here the x-axis represents the data size (in GB) and the y-axis represents the throughput (number of ops per sec).
The blue line represents the read workload
The yellow line represents the write workload
The red line represents the mixed workload
  • Overall latency vs data size (Line Graph): Shows the latency variation of each type of workload with data size as a line graph. The graph below simply combines the output of the previous graphs into a single line graph so that the latency of the read and write workloads can be compared in one view.
Here the x-axis represents the data size (in GB) and the y-axis represents the latency (time taken by each operation to complete).
The blue line represents the read workload
The yellow line represents the write workload