What are Spark SQL, SQLContext, SparkSession, DataFrames and Datasets?

In this article we will see what Spark SQL, SQLContext and SparkSession are, and how to create a SQLContext and a SparkSession in Spark.

We will also look at what DataFrames and Datasets are in Spark, the difference between them, and their implementation using the Spark Java API.

What is Spark SQL?

Spark SQL is a Spark module for structured data processing. Unlike the basic RDD API, it gives Spark more information about the structure of both the data and the computation being performed, which Spark can use to perform extra optimizations.

Spark SQL can be used to execute SQL queries, read data from an existing Hive installation, and work with the Dataset API across the languages Spark supports.

What is SQLContext?

SQLContext is the entry point for working with structured data in Spark 1.X. It has been replaced by SparkSession from Spark 2.X onwards; however, the class is still available in Spark 2.X for backward compatibility.
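In Spark 1.X style, a SQLContext is created from an existing SparkContext; a minimal sketch (the application name and local master below are placeholders):
		/* Build a SparkContext first (app name and master are placeholders) */
		SparkConf conf = new SparkConf().setAppName("sql-context-example").setMaster("local[*]");
		JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

		/* Wrap the existing context to get a SQLContext */
		SQLContext sqlContext = new SQLContext(javaSparkContext);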

What is SparkSession?

SparkSession was added in Spark 2.X as a replacement for SQLContext and HiveContext. Since 2.X, a SparkSession can be used both for structured data (Dataset and DataFrame) operations and for Hive-related SQL operations.

A SparkSession object can be created as shown below:
SparkSession sparkSession = SparkSession.builder().getOrCreate();
A Hive-enabled SparkSession object can be created as shown below:
SparkSession sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate();
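When running outside an environment that pre-configures Spark (such as spark-shell or spark-submit with defaults), the builder usually also needs an application name and a master; a minimal sketch (both values below are placeholders):
		SparkSession sparkSession = SparkSession.builder()
				.appName("spark-session-example") /* placeholder app name */
				.master("local[*]")               /* run locally on all cores */
				.getOrCreate();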

What is a Spark DataFrame (Untyped Dataset)?

DataFrames were added in Spark 1.3. A DataFrame is a Dataset organised into named columns, conceptually equivalent to a table in a relational database.

A DataFrame is represented as a Dataset of Rows (Dataset&lt;Row&gt;) and carries additional metadata due to its tabular format, which allows Spark to run certain optimisations on the finalised query.

Along with DataFrames, Spark also introduced the Catalyst optimizer, which leverages advanced programming-language features to build an extensible query optimizer.

DataFrames can be created using a SparkSession from an existing RDD, a Hive table, databases (e.g. MySQL) via JDBC drivers, or from Spark data sources such as Parquet, JSON and ORC files.
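As a sketch of the data-source route (all file paths, the JDBC URL and the credentials below are placeholders):
		/* From file-based data sources */
		Dataset<Row> fromJson = sparkSession.read().json("/path/to/cities.json");
		Dataset<Row> fromParquet = sparkSession.read().parquet("/path/to/cities.parquet");

		/* From a MySQL table via JDBC */
		Dataset<Row> fromJdbc = sparkSession.read()
				.format("jdbc")
				.option("url", "jdbc:mysql://localhost:3306/testdb")
				.option("dbtable", "cities")
				.option("user", "root")
				.option("password", "secret")
				.load();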

Spark's DataFrame Example

Let's create a DataFrame from a Java list of objects:
		/* Dummy data*/
		List<Geo> dummy = new ArrayList<Geo>();
		
		dummy.add(new Geo(1, "London", 20));
		dummy.add(new Geo(2, "New York", 30));
		dummy.add(new Geo(3, "Delhi", 40));
		dummy.add(new Geo(3, "Delhi", 60));
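The Geo class is not shown in the original listing; for createDataFrame (and, later, Encoders.bean) to work, it must be a public Java bean with a no-argument constructor and getters/setters. A minimal sketch of such a bean, with field names matching the columns in the output below:
		public class Geo implements java.io.Serializable {
			private int cityId;
			private String cityName;
			private int hits;

			public Geo() {}

			public Geo(int cityId, String cityName, int hits) {
				this.cityId = cityId;
				this.cityName = cityName;
				this.hits = hits;
			}

			/* Getters and setters are required for bean-style encoding */
			public int getCityId() { return cityId; }
			public void setCityId(int cityId) { this.cityId = cityId; }
			public String getCityName() { return cityName; }
			public void setCityName(String cityName) { this.cityName = cityName; }
			public int getHits() { return hits; }
			public void setHits(int hits) { this.hits = hits; }
		}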
A Java list of objects can be converted to a DataFrame as shown below:
SparkSession sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate();
Dataset<Row> dataFrame = sparkSession.createDataFrame(dummy, Geo.class);
We can use .show() to display the content of the DataFrame on stdout, as shown below:
		/* Show complete data of dataframe*/
		dataFrame.show();

		/* Show specific number of rows of dataframe*/
		dataFrame.show(2);
Output: The following output is written to stdout:
+------+--------+----+
|cityId|cityName|hits|
+------+--------+----+
|     1|  London|  20|
|     2|New York|  30|
|     3|   Delhi|  40|
|     3|   Delhi|  60|
+------+--------+----+

+------+--------+----+
|cityId|cityName|hits|
+------+--------+----+
|     1|  London|  20|
|     2|New York|  30|
+------+--------+----+
only showing top 2 rows
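Beyond show(), the same DataFrame can be queried with plain SQL or with untyped column operations; a short sketch using the columns from the output above (the view name "geo" is a placeholder):
		/* Register the DataFrame as a temporary view and query it with SQL */
		dataFrame.createOrReplaceTempView("geo");
		sparkSession.sql("SELECT cityName, SUM(hits) AS totalHits FROM geo GROUP BY cityName").show();

		/* The equivalent untyped column API */
		dataFrame.groupBy("cityName").sum("hits").show();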
Read more about: DataFrames in Apache Spark

What is a Spark Dataset?

Datasets were added in Spark 1.6. A Dataset is a distributed collection of data that combines the benefits of RDDs with those of Spark SQL's optimized execution engine.

The Dataset API provides compile-time type safety, which is not available with DataFrames.

Datasets include both the compile-time type safety of RDDs and the same efficient off-heap storage mechanism, along with the performance benefits of the DataFrame API's Catalyst query optimizer.

At the core of the Dataset API is a new concept called an encoder, which converts between JVM objects and Spark's tabular representation. Spark provides automatic encoders for primitives like String, Integer and Long, as well as for Scala case classes and Java beans.
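For example, the built-in encoders live on the Encoders class; a short sketch:
		/* Encoders for common primitives */
		Encoder<String> stringEncoder = Encoders.STRING();
		Encoder<Long> longEncoder = Encoders.LONG();

		/* Encoder generated from a Java bean's getters and setters */
		Encoder<Geo> geoEncoder = Encoders.bean(Geo.class);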

Spark's Dataset Example

Let's create a Dataset from a Java list of objects:
		/* Dummy data*/
		List<Geo> dummy = new ArrayList<Geo>();
		
		dummy.add(new Geo(1, "London", 20));
		dummy.add(new Geo(2, "New York", 30));
		dummy.add(new Geo(3, "Delhi", 40));
		dummy.add(new Geo(3, "Delhi", 60));
A Java list of objects can be converted to a Dataset as shown below:
SparkSession sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate();
Encoder<Geo> encoder = Encoders.bean(Geo.class);
Dataset<Geo> dataset = sparkSession.createDataset(dummy, encoder);
We can use .show() to display the content of the Dataset on stdout, as shown below:
		/* Show complete data of Dataset*/
		dataset.show();

		/* Show specific number of rows of Dataset*/
		dataset.show(2);
Output: The following output is written to stdout:
+------+--------+----+
|cityId|cityName|hits|
+------+--------+----+
|     1|  London|  20|
|     2|New York|  30|
|     3|   Delhi|  40|
|     3|   Delhi|  60|
+------+--------+----+

+------+--------+----+
|cityId|cityName|hits|
+------+--------+----+
|     1|  London|  20|
|     2|New York|  30|
+------+--------+----+
only showing top 2 rows
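Because the Dataset is typed as Dataset<Geo>, transformations can call the Geo getters directly and are checked at compile time; a minimal sketch (requires org.apache.spark.api.java.function.FilterFunction and MapFunction):
		/* Keep only rows with more than 25 hits, checked against Geo at compile time */
		dataset.filter((FilterFunction<Geo>) geo -> geo.getHits() > 25).show();

		/* Map each Geo to its city name, producing a Dataset<String> */
		Dataset<String> cityNames = dataset.map(
				(MapFunction<Geo, String>) geo -> geo.getCityName(),
				Encoders.STRING());
		cityNames.show();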
Read more about: Datasets in Apache Spark

In this article we have seen what Spark SQL, SQLContext and SparkSession are, as well as DataFrames and Datasets, their properties, and their implementation and operations using the Spark Java API.