Author Archives: Pawan

Querying Multiple Data Sources Using a Single Presto Query

Introduction

Presto is an open source distributed SQL engine for running interactive analytic queries against a variety of data sources such as Hadoop, Cassandra, and relational databases. It was developed at Facebook to query petabytes of data with low latency through a standard SQL interface.

With Presto, data from multiple sources can be accessed, combined, and analysed with a single SQL query. In this article we run join queries on two tables: one stored in Apache Cassandra and the other in Hive. We first set up a Presto cluster and then run standard SQL queries against the data stored in both systems. A single Presto query fetches data from the Cassandra and Hive tables, processes and analyses it, and finally stores the result in a new Hive table.

Prerequisites

To run Presto queries on Hive and Cassandra tables, the following components must be installed and configured.

A working Hadoop installation (single-node or multi-node). This can be achieved by following the steps given here.
Hive must be installed and configured, because Presto uses the Hive metastore to query Hive tables. This can be done by following the steps given here.
The Hive metastore service must be up and running. It can be started with the command $HIVE_HOME/bin/hive --service metastore &
A working Apache Cassandra installation (single-node or multi-node), since we are going to query data from an Apache Cassandra table. Presto requires Apache Cassandra 2.0.3 or later.

Installing and Configuring Presto for Hive and Cassandra Catalog

Presto can be installed and configured on multiple nodes to form a cluster. A Presto cluster consists of 3 components:

Presto Coordinator
Presto Worker
Discovery Server

A Presto cluster has a single coordinator and a single Discovery server. Multiple workers process data in parallel, and each worker runs on a separate node.

Install Presto

The Presto coordinator and the workers use the same installation setup. Use the steps below to install Presto on the coordinator and on each worker node.

– Download the Presto server tarball, presto-server-0.68.tar.gz, and unpack it using the command

tar -xvf presto-server-0.68.tar.gz

– Create a data directory for storing Presto logs and local metadata. This directory can be created anywhere, but it is recommended to create it outside the Presto installation directory. For example, create the data directory in /var/lib/presto using the command

mkdir -p /var/lib/presto/data

Note: The user running Presto should have read and write permissions on the Presto data directory (/var/lib/presto/data).

Configure Presto

Create an etc directory inside the $PRESTO_INSTALLATION_DIR directory on each node using the command below:

mkdir $PRESTO_INSTALLATION_DIR/etc

Note: Here $PRESTO_INSTALLATION_DIR is the path of the directory where presto-server-0.68 is installed.

This will hold the following configuration:

– Node Properties: environmental configuration specific to each node

– JVM Config: command line options for the Java Virtual Machine

– Config Properties: configuration for the Presto server.

– Catalog Properties: configuration for connectors (data sources)

Node Properties

On each node, create a node.properties (at location $PRESTO_INSTALLATION_DIR/etc/node.properties) file which contains configuration specific to each node. A node is a single installed instance of Presto on a machine. This file is typically created by the deployment system when Presto is first installed. The following is a minimal etc/node.properties. Add these properties to the created file.

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/lib/presto/data

Here node.id is the unique identifier for this installation of Presto. This must be unique for every node. This identifier should remain consistent across reboots or upgrades of Presto.
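Since every node needs its own node.id, it can help to generate the file rather than copy it by hand. The following is only a minimal sketch in Python; the helper name render_node_properties is hypothetical (it is not part of Presto), and it assumes a random UUID is an acceptable identifier, which matches the format of the example value above.

```python
import uuid


def render_node_properties(environment: str, data_dir: str, node_id: str = "") -> str:
    """Return the text of a minimal etc/node.properties file, one property per line."""
    # Generate a fresh random UUID only when the caller does not pass one in.
    # An already installed node should keep reusing the id it was first given,
    # so that the id stays the same across reboots and upgrades.
    node_id = node_id or str(uuid.uuid4())
    lines = [
        f"node.environment={environment}",
        f"node.id={node_id}",
        f"node.data-dir={data_dir}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the file content; redirect it to etc/node.properties on the node.
    print(render_node_properties("production", "/var/lib/presto/data"), end="")
```

Run it once per node when Presto is first installed and keep the generated file afterwards, so the identifier does not change later.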

JVM Config

On each node, create a jvm.config (at location $PRESTO_INSTALLATION_DIR/etc/jvm.config) file which contains a list of command line options used for launching the Java Virtual Machine. The format of the file is a list of options, one per line.

-server
-Xmx16G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-XX:PermSize=150M
-XX:MaxPermSize=150M
-XX:ReservedCodeCacheSize=150M
-Xbootclasspath/p:$PRESTO_INSTALLATION_DIR/presto-server-0.68/lib/floatingdecimal-0.1.jar

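Note: In the above configuration, -Xmx16G represents the maximum memory allocated to the Presto server on the current node. Change this parameter according to the memory available in your environment. Because the options in this file are not interpreted by a shell, write the actual installation path in place of $PRESTO_INSTALLATION_DIR in the -Xbootclasspath/p option.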

Config Properties

On each node, create a config properties file (at location $PRESTO_INSTALLATION_DIR/etc/config.properties) which contains the configuration for the Presto server. Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to only perform coordination work provides the best performance on larger clusters. For a presto cluster, there will be one coordinator and multiple workers running each on a separate machine.

Configurations For The coordinator:

Use the configuration properties below in the etc/config.properties file on the coordinator node.

coordinator=true
datasources=jmx
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=var/lib/presto/MetaStore
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080

Note: In the discovery.uri property, replace example.net with the IP address of the machine where the Discovery service runs, and make sure the port matches the port that service listens on.

Configurations For workers:

Use the configuration properties below in the etc/config.properties file on each worker node.

coordinator=false
datasources=jmx,hive,cassandra
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=var/lib/presto/MetaStore
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080

Note: In the discovery.uri property, replace example.net with the IP address of the machine where the Discovery service runs, and make sure the port matches the port that service listens on.
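The coordinator and worker files differ only in the coordinator flag and the list of data sources, so both can be produced from one template. The sketch below is one possible way to do that in Python; render_config_properties is a hypothetical helper, and the property values are simply the ones listed in this article.

```python
def render_config_properties(coordinator: bool, discovery_uri: str) -> str:
    """Return the text of etc/config.properties for a coordinator or a worker."""
    # Data sources per role, as listed in this article: the coordinator
    # mounts only jmx, the workers also mount hive and cassandra.
    datasources = "jmx" if coordinator else "jmx,hive,cassandra"
    properties = [
        ("coordinator", "true" if coordinator else "false"),
        ("datasources", datasources),
        ("http-server.http.port", "8080"),
        ("presto-metastore.db.type", "h2"),
        ("presto-metastore.db.filename", "var/lib/presto/MetaStore"),
        ("task.max-memory", "1GB"),
        ("discovery-server.enabled", "true"),
        ("discovery.uri", discovery_uri),
    ]
    # One key=value pair per line, which is the layout the server expects.
    return "".join(f"{key}={value}\n" for key, value in properties)


if __name__ == "__main__":
    # example.net is the placeholder host used in the article.
    print(render_config_properties(True, "http://example.net:8080"), end="")
```

Calling the function with coordinator=False and the same URI gives the worker file.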

Log Levels

The optional log levels file, log.properties, allows setting the minimum log level for named logger hierarchies. Create this file at $PRESTO_INSTALLATION_DIR/etc/log.properties and add the property below to it.

com.facebook.presto=DEBUG

Catalog Properties

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog.

Catalogs are registered by creating a catalog properties file in the $PRESTO_INSTALLATION_DIR/etc/catalog directory. For example, create the file $PRESTO_INSTALLATION_DIR/etc/catalog/jmx.properties with the following contents to mount the jmx connector as the jmx catalog:

connector.name=jmx

Create Hive Catalog

Presto includes Hive connectors for multiple versions of Hadoop:

hive-hadoop1: Apache Hadoop 1.x
hive-hadoop2: Apache Hadoop 2.x
hive-cdh4: Cloudera CDH 4
hive-cdh5: Cloudera CDH 5

Create $PRESTO_INSTALLATION_DIR/etc/catalog/hive.properties with the following contents to mount the hive-hadoop1 connector as the hive catalog, replacing hive-hadoop1 with the proper connector for your version of Hadoop and example.net:9083 with the correct host and port for your Hive metastore Thrift service:

connector.name=hive-hadoop1
hive.metastore.uri=thrift://example.net:9083

Note: Replace example.net with the IP address of the node where the Hive metastore service is running.

Create Cassandra Catalog

Create etc/catalog/cassandra.properties with the following contents to mount the Cassandra connector as the Cassandra catalog

connector.name=cassandra
# Comma separated list of contact points
cassandra.contact-points=host1,host2
# Port running the native Cassandra protocol
cassandra.native-protocol-port=9042
# Limit of rows to read for finding all partition keys.
cassandra.limit-for-partition-key-select=100000
# number of splits generated if partition keys are unknown
cassandra.unpartitioned-splits=1000
# maximum number of schema cache refresh threads, i.e. maximum number of parallel requests
cassandra.max-schema-refresh-threads=10
# schema cache time to live
cassandra.schema-cache-ttl=1h
# schema refresh interval
cassandra.schema-refresh-interval=2m
# Consistency level used for Cassandra queries (ONE, TWO, QUORUM, …)
cassandra.consistency-level=ONE
# fetch size used for Cassandra queries
cassandra.fetch-size=5000
# fetch size used for partition key select query
cassandra.fetch-size-for-partition-key-select=20000

Note: In cassandra.contact-points property above, replace host1, host2 with the IP addresses of your machines containing Apache Cassandra installation.
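A catalog file is a plain list of key=value lines, with comment lines starting with #. To sanity-check a file before starting Presto, a small reader like the sketch below can be used. It is a simplified stand-in written for this article (parse_properties is a hypothetical name) and does not cover everything the Java properties format allows, such as escapes or the ":" separator.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


# A shortened copy of the catalog file shown in the article.
SAMPLE = """\
connector.name=cassandra
# Comma separated list of contact points
cassandra.contact-points=host1,host2
# Port running the native Cassandra protocol
cassandra.native-protocol-port=9042
"""

props = parse_properties(SAMPLE)
contact_points = props["cassandra.contact-points"].split(",")
port = int(props["cassandra.native-protocol-port"])
print(props["connector.name"], contact_points, port)
```

The same reader also shows why each property needs its own line: if a comment ends up on the same line as a value, the comment text becomes part of that value.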

Install Discovery:

Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on start-up. This service needs to be installed and configured on any one node.

Download discovery-server-1.16.tar.gz, unpack it to create the installation directory, create the data directory, and then configure it to run on a different port than Presto. The standard port for Discovery is 8411.

Configure Discovery

As with Presto, create an etc directory inside the installation directory to hold the configuration files.

Node Properties

Create the Node Properties file (discovery-server-1.16/etc/node.properties) the same way as for Presto, but make sure to use a unique value for node.id. For example:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/lib/presto/discovery/data

JVM Config

Create the JVM Config file the same way as for Presto, but configure it to use fewer resources:

-server
-Xmx1G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p

Config Properties

Create etc/config.properties with the following lone option:

http-server.http.port=8411

Run Discovery

The installation directory contains the launcher script in bin/launcher. Discovery can be started as a daemon by running the following:

cd DISCOVERY_INSTALLATION_DIR
bin/launcher start

Run Presto

The installation directory contains the launcher script in bin/launcher. Presto can be started as a daemon by running the following:

cd $PRESTO_INSTALLATION_DIR
bin/launcher start

Start the Presto server using the above commands on every node in the cluster.

Install and Configure Presto Client Interface (Cli)

Download presto-cli-0.68-executable.jar from here
Rename it to presto using the Linux command mv presto-cli-0.68-executable.jar presto
Make it executable using the Linux command chmod +x presto
Now run the Presto CLI using the command below

./presto --server localhost:8080 --catalog hive --schema default

Note: Replace localhost with the IP address of the Presto server.

Creating Tables and Populating Data in Cassandra and Hive

Now create tables in Apache Cassandra and Hive and populate them with data so that we can query them using Presto.

Create Table in Apache Cassandra

Create a keyspace demodb and a table user_purchases in Apache Cassandra using the CQL commands below:

cqlsh> CREATE KEYSPACE demodb WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 1 };
cqlsh> USE demodb;
cqlsh> CREATE TABLE user_purchases (user_id INT, item TEXT, quanity INT, amount FLOAT, time timestamp, place TEXT, PRIMARY KEY (user_id, time));

Populate Data in Cassandra Table

After creating the table in Apache Cassandra, populate it with some data using cqlsh:

cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (1, 'Shirt', 2, 3050.50, 1395639405, 'New Delhi');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (1, 'Shoes', 3, 8140.60, 1398901516, 'Noida');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (2, 'Mobile Phone', 1, 18300.00, 1406195803, 'Gurgaon');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (3, 'Laptop', 1, 40140.60, 1401782401, 'New Delhi');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (6, 'chocolate', 5, 500.30, 1401782405, 'New Delhi');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (6, 'Tablet', 1, 20460.20, 1401782291, 'Gurgaon');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (10, 'Bat', 1, 4860.20, 1337070341, 'Mumbai');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (12, 'clothes', 4, 16450.00, 1295781836, 'Chennai');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (9, 'Bike', 1, 65320.00, 1384490305, 'Mumbai');
cqlsh> INSERT INTO user_purchases (user_id, item, quanity, amount, time, place) VALUES (11, 'Music System', 2, 26450.00, 1370489145, 'New Delhi');
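One detail worth checking in the data above is the time column. CQL reads an integer timestamp literal as milliseconds since the Unix epoch, while the inserted values look like epoch seconds (this is an assumption based on their size). The Python sketch below, using hypothetical helper names, shows how differently the first value is interpreted under each unit.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_utc_from_seconds(value: int) -> datetime:
    """Interpret an integer as seconds since the Unix epoch."""
    return EPOCH + timedelta(seconds=value)


def as_utc_from_millis(value: int) -> datetime:
    """Interpret an integer as milliseconds since the Unix epoch."""
    return EPOCH + timedelta(milliseconds=value)


raw = 1395639405  # time value of the first INSERT statement

print(as_utc_from_seconds(raw))  # a date in March 2014
print(as_utc_from_millis(raw))   # a date in January 1970
print(raw * 1000)                # the same instant expressed in milliseconds
```

If the values were indeed meant as seconds, multiplying them by 1000 before inserting stores the intended 2014 dates. The join and ranking queries in this article do not depend on the time values, so they behave the same either way.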

Create Table in Hive

Create a table user_info in Hive using the command below in the Hive CLI

hive> create table user_info (id INT, fname STRING, lname STRING, age INT, salary INT, gender STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Populate data in Hive table

Create a text file user_info_data.csv and add some user data for the Hive table created above, one record per line

1,Steven,Smith,24,42000,Male
2,Pawan,Lathwal,24,30000,Male
3,Mariya,Gilbert,25,44000,Female
4,Taylor,Lockwood,24,41000,Male
5,Sanjiv,Singh,25,51000,Male
6,Peter,Mcculum,43,191000,Male
7,Geeta,Rai,23,35000,Female
8,Priyanka,Sachdeva,23,34000,Female
9,Sanjiv,Puri,26,78000,Male
10,Sachin,Tyagi,43,250000,Male
11,Adam,Gilchrist,34,180000,Male
12,Monika,Chandra,24,46000,Female
13,Anamika,Malhotra,26,92000,Female

Now load this data into the Hive table using the command below:

load data local inpath 'user_info_data.csv' overwrite into table user_info;

Query data using Presto

After loading data into both the Cassandra table and the Hive table, we query it from the Presto client interface (CLI). With Presto, a single query can read data from both sources, Cassandra and Hive, and combine the results.

The combined results of a Presto query can either be streamed to the client or saved in a new table. Here we try both approaches.

Presto Query 1: Combine data from Cassandra & Hive using Presto Join Query

First start the Presto CLI using the command below

./presto --server localhost:8080 --catalog hive --schema default

Note: Replace localhost with the IP address of the node running the Presto server.

Then run the query below in the Presto CLI

presto:default> select hive_user_info.id, hive_user_info.fname, hive_user_info.age, hive_user_info.salary,
       cassandra_user_purchases.item, cassandra_user_purchases.time, cassandra_user_purchases.place
from hive.default.user_info hive_user_info
join cassandra.demodb.user_purchases cassandra_user_purchases
  on hive_user_info.id = cassandra_user_purchases.user_id;

The above Presto query combines data from 2 tables: the user_info table in Hive and the user_purchases table in Cassandra. The tables are joined on the user identifier (id in user_info, user_id in user_purchases), and the matching records are returned as the result.
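To make it clear which rows such a join returns, here is a small Python sketch that mimics the inner join on the sample data loaded earlier. It only illustrates the matching logic and says nothing about how Presto executes the query; only the columns needed for the illustration are included.

```python
# id -> fname, taken from user_info_data.csv
user_info = {
    1: "Steven", 2: "Pawan", 3: "Mariya", 4: "Taylor", 5: "Sanjiv",
    6: "Peter", 7: "Geeta", 8: "Priyanka", 9: "Sanjiv", 10: "Sachin",
    11: "Adam", 12: "Monika", 13: "Anamika",
}

# (user_id, item), taken from the INSERT statements for user_purchases
user_purchases = [
    (1, "Shirt"), (1, "Shoes"), (2, "Mobile Phone"), (3, "Laptop"),
    (6, "chocolate"), (6, "Tablet"), (10, "Bat"), (12, "clothes"),
    (9, "Bike"), (11, "Music System"),
]

# Inner join: keep a purchase only when its user_id exists in user_info.
joined = [
    (user_id, user_info[user_id], item)
    for user_id, item in user_purchases
    if user_id in user_info
]

# Users without any purchase do not appear in an inner join result.
users_without_purchases = sorted(
    set(user_info) - {user_id for user_id, _ in user_purchases}
)

for row in joined:
    print(row)
print("no purchases:", users_without_purchases)
```

With this sample data every purchase has a matching user, so the join returns one row per purchase, and users who bought nothing are left out.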

Presto Query 2: Get top 2 purchases from each user by combining data from Hive & Cassandra in single Presto Query

Run the Presto query below, which joins the data from the Cassandra and Hive tables and outputs only the top 2 purchases for each user based on the purchase amount (the amount column). This query uses the row_number() window function with partition by and order by clauses.

presto:default> select * from (
    select *, row_number() over (partition by id order by amount desc) as rnk
    from (
        select hive_user_info.id, hive_user_info.fname, hive_user_info.gender, hive_user_info.age, hive_user_info.salary,
               cassandra_user_purchases.item, cassandra_user_purchases.time, cassandra_user_purchases.place,
               cassandra_user_purchases.quanity, cassandra_user_purchases.amount
        from user_info hive_user_info
        join cassandra.demodb.user_purchases cassandra_user_purchases
          on hive_user_info.id = cassandra_user_purchases.user_id
    )
) where rnk <= 2;
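The ranking step of the query above can be pictured with a few lines of Python. This is a sketch of the idea only (group by user, sort each group by amount in descending order, number the rows, keep the first n); the function name top_n_per_user is made up for this illustration.

```python
from collections import defaultdict

# (user_id, item, amount), taken from the INSERT statements for user_purchases
purchases = [
    (1, "Shirt", 3050.50), (1, "Shoes", 8140.60),
    (2, "Mobile Phone", 18300.00), (3, "Laptop", 40140.60),
    (6, "chocolate", 500.30), (6, "Tablet", 20460.20),
    (10, "Bat", 4860.20), (12, "clothes", 16450.00),
    (9, "Bike", 65320.00), (11, "Music System", 26450.00),
]


def top_n_per_user(rows, n):
    """Return (user_id, item, amount, rnk) for the n largest amounts per user."""
    by_user = defaultdict(list)
    for user_id, item, amount in rows:
        by_user[user_id].append((item, amount))

    ranked = []
    for user_id in sorted(by_user):
        # Same ordering as "partition by id order by amount desc".
        ordered = sorted(by_user[user_id], key=lambda r: r[1], reverse=True)
        # enumerate(..., start=1) plays the role of row_number().
        for rnk, (item, amount) in enumerate(ordered, start=1):
            if rnk <= n:
                ranked.append((user_id, item, amount, rnk))
    return ranked


for row in top_n_per_user(purchases, 2):
    print(row)
```

In the sample data no user has more than 2 purchases, so the filter rnk <= 2 keeps every row; the effect of the filter becomes visible with n = 1 or with more purchases per user.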

Presto Query 3: Get top 2 purchases from each user and save result in a new Table

This query first finds the top 2 purchases of each user based on purchase amount and then stores the output in a new table, user_top_purchases, which is created as a result of this query.

CREATE TABLE user_top_purchases as select * from (
    select *, row_number() over (partition by id order by amount desc) as rnk
    from (
        select hive_user_info.id, hive_user_info.fname, hive_user_info.gender, hive_user_info.age, hive_user_info.salary,
               cassandra_user_purchases.item, cassandra_user_purchases.time, cassandra_user_purchases.place,
               cassandra_user_purchases.quanity, cassandra_user_purchases.amount
        from user_info hive_user_info
        join cassandra.demodb.user_purchases cassandra_user_purchases
          on hive_user_info.id = cassandra_user_purchases.user_id
    )
) where rnk <= 2;

Apache Cassandra Performance Benchmark on a large data set

Cassandra Performance Benchmark Results on Large Data Set (in Terabytes)

Hi,

This post is about the performance of the NoSQL database Apache Cassandra. Cassandra is one of the most popular massively scalable NoSQL databases, with read/write support across multiple data centers. The post looks at how Cassandra performs in read, write, and update operations as the amount of data in the cluster increases. Various kinds of workloads (read, write, update, mixed) were applied to a Cassandra cluster to check how performance is affected by the increase in data size.

This benchmark was performed on a 6 node Cassandra cluster with the help of YCSB (Yahoo Cloud Serving Benchmark). Three types of workloads (write only, read only, mixed) were applied to this cluster.

Write Only Workload: This workload writes data (a fixed number of records) to the Cassandra cluster with the help of YCSB for some time.

Read Only Workload: This workload reads data (a fixed number of records) from the Cassandra cluster with the help of YCSB.

Mixed Workload: This workload is a combination of read and write operations on the Cassandra cluster, with 50% reads and 50% writes performed simultaneously.

Cassandra Cluster Details:

No of Nodes in Cassandra Cluster : 6

Replication factor : 3

Consistency Level : Quorum (2 in this case)

Avg Record Size: 380 KB

No of YCSB Clients(instances) : 1
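The "2 in this case" next to the consistency level follows from the usual Cassandra quorum rule: more than half of the replicas must acknowledge an operation. A short Python sketch of that arithmetic (the function names are only for illustration):

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at consistency level QUORUM."""
    # Integer division rounds down, so RF = 3 gives 3 // 2 + 1 = 2.
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


print(quorum(3))                      # 2
print(tolerated_replica_failures(3))  # 1
```

So with a replication factor of 3, every read and write in this benchmark waits for 2 of the 3 replicas, and the cluster can still serve a given record while one of its replicas is down.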

Hardware Configurations:

A total of 6 machines were used, each with the hardware configuration below:

RAM : 32 GB

No of Cores : 8

Hard Disk : 2 TB

Benchmark Statistics:

Write performance (Write only workload): Shows how write throughput (number of write operations performed per second) behaves as the data size (number of records present) in the Cassandra cluster increases.

Write Throughput vs Data Size: Here the write workload was applied to the 6 node Cassandra cluster with a replication factor of 3, i.e. each record was written to 3 nodes. A fixed number of write operations was performed with the help of 1 YCSB client. This process was repeated as the data in the Cassandra cluster increased. The aim was to check how much write performance is affected when inserting data into a cluster holding 1 GB of data compared with one holding 1 TB (terabyte): does the write throughput increase, decrease, or remain constant as the total amount of data in the cluster grows? The graph below shows write throughput (write operations per second) against the total data present in the 6 node Cassandra cluster.

Here the x axis shows the data size present in the Cassandra cluster before applying the write workload. The y axis represents the write throughput (number of write ops per second) for each data set size.

Result: The graph shows that as the data in the Cassandra cluster increases, the number of write operations per second (write throughput) decreases. It was 61 write ops per second when 36 GB of data was present in the Cassandra cluster, which decreased to 20 write ops per second when 800 GB of data was present in the same cluster. Note that each record was 380 KB in size and the replication factor was 3 with a consistency level of Quorum.
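To put these numbers in perspective, the operations per second can be converted into an approximate data rate. The Python sketch below does this arithmetic under the assumption that every record is exactly the stated average of 380 KB; the real per-record sizes are not given, so treat the results as rough estimates.

```python
RECORD_KB = 380          # average record size from the cluster details
REPLICATION_FACTOR = 3   # each record is stored on 3 nodes


def payload_rate_kb(ops_per_sec: float) -> float:
    """Client-side data rate in KB/s for a given number of ops per second."""
    return ops_per_sec * RECORD_KB


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from 'before' to 'after'."""
    return (before - after) / before


rate_36gb = payload_rate_kb(61)    # throughput measured at 36 GB
rate_800gb = payload_rate_kb(20)   # throughput measured at 800 GB

print(rate_36gb, rate_800gb)
print(rate_36gb * REPLICATION_FACTOR)      # KB/s written across all replicas
print(round(relative_drop(61, 20) * 100, 1))
```

Under that assumption the client pushes roughly 23,000 KB per second at 36 GB and about 7,600 KB per second at 800 GB, a drop of about two thirds, and the cluster as a whole writes three times the client rate because of replication.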

Write Latency vs Data Size: Shows the behavior of latency (average time taken to process each record) during the above write operations. The graph below represents how the time taken by each write varies with the data present in the Cassandra cluster.

Here the x axis shows the data size present in the Cassandra cluster before applying the write workload.

The y axis represents the write latency (average time taken by each record) for each data set size.

Result: The graph shows that as the data in the Cassandra cluster increases, write latency increases, i.e. with more data present in the cluster, each record takes longer to complete its write than when little data is present. It was 694 ms (about 0.7 seconds) when 36 GB of data was present in the cluster, which increased to 1967 ms (about 2 seconds) when the data grew to 800 GB on the same cluster. Note that each record was 380 KB in size and the replication factor was 3 with a consistency level of Quorum.
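The throughput and latency figures for the write workload can be cross-checked against each other with Little's law, which says that the average number of operations in flight equals throughput multiplied by latency. The sketch below applies it to the two reported data points. It assumes a closed-loop client, where each thread waits for its operation to finish before sending the next one; the number of client threads used for the write workload is not stated in this post.

```python
def implied_concurrency(ops_per_sec: float, latency_ms: float) -> float:
    """Average number of in-flight operations according to Little's law."""
    return ops_per_sec * latency_ms / 1000.0


at_36gb = implied_concurrency(61, 694)     # 61 ops/s at 694 ms
at_800gb = implied_concurrency(20, 1967)   # 20 ops/s at 1967 ms
latency_ratio = 1967 / 694

print(round(at_36gb, 1), round(at_800gb, 1))
print(round(latency_ratio, 2))
```

Both data points give a value close to 40 operations in flight, which is what one would expect if the client kept a fixed number of requests outstanding: latency grew by a factor of about 2.8 and throughput fell by roughly the same proportion.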

Read performance (Read only workload): Shows how reads are affected by data size, i.e. the impact of the data present in the Cassandra cluster on read throughput (number of reads per second) and latency (average time taken by each read operation to complete).

Read Throughput vs Data Size: Shows how read throughput (number of read operations performed per second) behaves as the data size (or number of records) on the 6 node Cassandra cluster increases. A fixed number of read operations was performed with the help of 1 YCSB client with 100 threads. This read workload was applied multiple times, each time after inserting more data into the cluster, so that reads were performed on a larger data set each time. The aim was to check how much read performance is affected when reading from a cluster holding 1 GB of data compared with one holding 1 TB (terabyte): does the read throughput increase, decrease, or remain constant as the total amount of data in the cluster grows? The graph below shows read throughput (read operations per second) against the total data present in the 6 node Cassandra cluster.

Here the x axis shows the data size (in GB) present in the Cassandra cluster when applying the read workload.

The y axis represents the read throughput (number of read ops per second) for each data set size.

Result: The graph shows that as the data in the Cassandra cluster increases, the number of read operations per second (read throughput) decreases. It was 188 read ops per second when 36 GB of data was present in the Cassandra cluster, which decreased to 36 read ops per second when 761 GB of data was present in the same cluster.

In the above graph, we can see that performance sometimes increased slightly as the data grew (it was 82 reads per sec on 253 GB of data and 85 reads per sec on 290 GB of data). This is due to the following:

(1) Compaction, which runs in the background to decrease the number of SSTables in the cluster

(2) The cache, which also plays an important role in read operations

(3) The data size difference between 2 consecutive read workloads was small (290-253 = 37 GB), i.e. only 37 GB of data was inserted into the Cassandra cluster before applying the next read workload, so some data was read from the cache.

Note that each record was 380 KB in size and the replication factor was 3 with a consistency level of Quorum.

Read Latency vs Data Size: Shows the behavior of latency (average time taken to process each record) during the above read operations. The graph below represents how the time taken by each read varies with the data present in the Cassandra cluster.

Here the x axis shows the data size (in GB) present in the Cassandra cluster when applying the read workload.

The y axis represents the read latency (average time taken by each read to complete) for each data set size.

Result: The graph shows that as the data in the Cassandra cluster increases, read latency increases, i.e. with more data present in the cluster, each record takes longer to complete its read than when little data is present. It was 265 ms (about 0.27 seconds) when 36 GB of data was present in the cluster, which increased to 689 ms (about 0.69 seconds) when the data grew to 761 GB on the same cluster. The reason it decreases slightly for some workloads is that the data size difference between 2 consecutive read workloads was small (e.g. 290-253 = 37 GB), i.e. only 37 GB of data was inserted into the Cassandra cluster before applying the next (4th) read workload, so some data was read from the cache.

Note that each record was 380 KB in size and the replication factor was 3 with a consistency level of Quorum.

Performance on mixed workload (Reads + Writes simultaneously): Shows the performance of Cassandra under a mixed workload (reads and writes occurring simultaneously) as the data in the cluster increases. It shows how throughput and latency behave when mixed operations are performed on continuously increasing data in the Cassandra cluster. This is a common scenario in most web applications, where some clients are performing transactions while others are busy reading data.

Mixed (50% write + 50% read) throughput vs Data size: Shows how mixed throughput (number of read + write operations performed per second) behaves as the data size (or number of records) on the 6 node Cassandra cluster increases. A fixed number of read + write operations was performed with the help of 1 YCSB client with 100 threads. This workload was applied multiple times, each time after inserting more data into the cluster, so that reads and writes were performed on a larger data set each time. The aim was to check how much read+write performance is affected on a cluster holding 1 GB of data compared with one holding 1 TB (terabyte): does the read+write throughput increase, decrease, or remain constant as the total amount of data in the cluster grows? The graph below shows this variation.

Here the x axis shows the data size (in GB) present in the Cassandra cluster when applying the mixed workload.

The y axis represents the throughput (mixed ops per sec) for each data set size.

Result: The graph shows that as the data in the Cassandra cluster increases, throughput decreases. It was 73 mixed ops per sec when 72 GB of data was present in the cluster, which decreased to 41 mixed ops per sec when the data grew to 761 GB on the same cluster.

Note that each record was 380 KB in size and the replication factor was 3 with a consistency level of Quorum.

Overall performance (previous charts combined): Shows throughput and latency behavior as the data size (or number of records) in Cassandra increases, for all three types of workloads (read, write, and mixed, shown individually in the previous graphs) in a single graph.

Overall throughput vs data size: Shows the throughput variation of each type of workload with data size. The graph below simply combines the output of the previous graphs into one graph so that the throughput variation for all types of workloads (read, write, and mixed) can be seen in a single view.

Here the x-axis represents the data size (in GB) and the y-axis represents the throughput (number of ops done per sec).

Blue colored bar represents the read workload

Yellow colored bar represents the write workload

Red colored bar represents the mixed workload

Overall latency vs data size: Shows the latency variation of each type of workload with data size. The graph below simply combines the output of the previous graphs into one graph so that the latency variation for the read and write workloads can be seen in a single view.

Here the x-axis represents the data size (in GB) and the y-axis represents the latency (time taken by each operation to complete).

Blue colored bar represents the read workload

Yellow colored bar represents the write workload

Overall throughput vs data size (Line Graph): Shows the throughput variation of each type of workload with data size as a line graph. The graph below combines the output of the previous graphs into one line graph so that the throughput of the different workloads (read, write, and mixed) can be compared in a single view.

Here the x-axis represents the data size (in GB) and the y-axis represents the throughput (number of ops done per sec).

Blue colored line represents the read workload

Yellow colored line represents the write workload

Red colored line represents the mixed workload

Overall latency vs data size (Line Graph): Shows the latency variation of each type of workload with data size as a line graph. The graph below combines the output of the previous graphs into one line graph so that the latency of the read and write workloads can be compared in a single view.

Here the x-axis represents the data size (in GB) and the y-axis represents the latency (time taken by each operation to complete).

Blue colored line represents the read workload

Yellow colored line represents the write workload
