Big data consists of large and complex data sets that can grow exponentially over time. These data sets are mostly unstructured in nature and consume large amounts of memory, making it impossible to store and process them effectively using a traditional RDBMS.

Big Data Analytics

Analytics means generating some meaningful insights from a given set of data. Big Data Analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other business information.

Big Data Analytics is of two types:

1) Batch Analytics

Collecting data and processing it at some later time is called batch processing; it is usually done on historical data. Example: telephone or electricity bill generation.

2) Real Time Analytics

Processing immediate data for instant results, usually done on real-time data. Examples: debit or credit card transactions, stock market analysis and live dashboards.

What is Apache Spark?

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R.

Spark also has supporting tools like Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Spark can run on top of HDFS to leverage its distributed, replicated storage. It can be used along with MapReduce in the same Hadoop cluster, or as a standalone processing framework without Hadoop.

Even though Spark can run standalone, it is not meant to replace Hadoop; in fact, it works very well when used with HDFS.

In most cases Spark and MapReduce can be used together, with Spark doing real-time processing and MapReduce doing batch processing.

Just as Mahout serves as the machine learning framework for MapReduce, Spark has MLlib, which acts as a substitute for Mahout on Spark.

Why Apache Spark when MapReduce is already there?

Hadoop is not meant for real-time processing and is used for batch processing only; Spark, being up to 100 times faster than MapReduce, can be used for real-time data processing.

That doesn't mean Spark can process only real-time data; Spark can deal with both historical and real-time data.

On the other hand, Spark programming is much easier and simpler than complicated MapReduce programming.

The major advantage of Spark over traditional Hadoop-based MapReduce is its faster processing, even for historical data. Apache Spark processes data in-memory, while Hadoop MapReduce persists data back to disk after every map or reduce action, so Spark performs faster than Hadoop's MapReduce.

When it comes to speed, Spark can run up to 100 times faster than MapReduce because it skips the disk I/O operations that are part of every map and reduce step in MapReduce.

Spark Ecosystem

The Spark ecosystem consists of all the components of the Spark engine built on top of Spark Core, each used for a different kind of task.

Spark Core

Creating RDDs is part of Spark Core. It is the main engine on top of which the other libraries are written, i.e. Spark SQL, Spark Streaming, MLlib (Machine Learning), GraphX and SparkR.

Spark Core has the responsibility of memory management, job scheduling, data distribution and monitoring across the cluster, and interaction with storage systems.

Once the data is loaded into memory from sources like HDFS, Kafka or Flume using Spark Core, any of the components can be applied to it for processing, and the output can be stored back to Cassandra, HDFS etc.
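As a minimal sketch of how Spark Core is used from Java (the class name, local-mode master and the numbers are illustrative assumptions, not from the original), an RDD can be created, transformed and reduced like this:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CoreExample {
	public static void main(String[] args) {
		// local[*] runs Spark with one worker thread per core on this machine
		SparkConf conf = new SparkConf().setAppName("CoreExample").setMaster("local[*]");
		JavaSparkContext sc = new JavaSparkContext(conf);

		// create an RDD from an in-memory collection
		// (could also be sc.textFile("hdfs://...") for HDFS data)
		List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
		JavaRDD<Integer> rdd = sc.parallelize(numbers);

		// transform the RDD, then reduce it to a single value
		int sumOfSquares = rdd.map(n -> n * n).reduce(Integer::sum);

		System.out.println("sum of squares: " + sumOfSquares); // 55

		sc.stop();
	}
}
```

The same pattern applies whatever the source: load into an RDD, transform, then trigger computation with an action.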

Spark SQL

Spark SQL is used for structured data and can run unmodified Hive queries on existing Hadoop clusters. Data processed in Spark SQL is structured in nature and has column information along with row information; these data sets are called DataFrames, not RDDs.

Spark SQL supports many input sources like JSON, Hive, Cassandra etc. The speed of Spark SQL is very high in comparison to Hadoop-based systems.

For input source connectivity, prebuilt drivers (JDBC, ODBC) as well as user-defined functions can be used.
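A small sketch of Spark SQL from Java (the file path, column names and JSON schema here are assumptions for illustration):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlExample {
	public static void main(String[] args) {
		SparkSession spark = SparkSession.builder()
				.appName("SqlExample")
				.master("local[*]")
				.getOrCreate();

		// read a JSON file into a DataFrame; each line holds one JSON object,
		// e.g. {"name":"Alice","age":34} (the path is a hypothetical example)
		Dataset<Row> people = spark.read().json("/home/techie/people.json");

		// register the DataFrame as a temporary view so plain SQL can query it
		people.createOrReplaceTempView("people");

		// run SQL directly against the structured data
		Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
		adults.show();

		spark.stop();
	}
}
```

Because the data carries column information, Spark SQL can optimise the query plan before executing it, unlike plain RDD operations.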

Spark Streaming

Spark Streaming enables real-time processing and analytics on streaming data. It provides high-throughput, fault-tolerant stream processing of live data streams.

The incoming live data is divided into a series of small batches called DStreams (discretized streams), which are processed in real time.
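A minimal Spark Streaming sketch in Java (the socket host/port and the "ERROR" filter are illustrative assumptions; such a stream can be fed with `nc -lk 9999`):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingExample {
	public static void main(String[] args) throws InterruptedException {
		// at least 2 local threads: one to receive data, one to process it
		SparkConf conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]");

		// cut the live stream into 5-second micro-batches; each batch is one RDD in the DStream
		JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

		// listen on a TCP socket for lines of text (host/port are assumptions)
		JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

		// in every batch, keep only the lines containing "ERROR" and print their count
		JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));
		errors.count().print();

		jssc.start();              // start receiving and processing
		jssc.awaitTermination();   // keep running until stopped
	}
}
```

Each micro-batch is processed with the same RDD operations used for batch data, which is what makes Spark's batch and streaming code look so similar.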


MLlib

MLlib is a rich set of machine learning libraries built on top of Spark Core, and can be considered a substitute for MapReduce's Mahout. Data processed with MLlib stays in an in-memory pipeline, so that multiple machine learning algorithms can operate on it in sequence.

MLlib algorithms can be categorised into two types:

1) Supervised algorithms: use labelled data, in which both the inputs and the expected outputs are provided to the algorithm.

2) Unsupervised algorithms: are not given the expected output in advance and are left to make sense of the data without labels.
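An unsupervised example using MLlib's K-Means can be sketched as follows (the data file path assumes the sample_kmeans_data.txt that ships with the Spark distribution; k and the seed are illustrative):

```java
import java.util.Arrays;

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MLlibExample {
	public static void main(String[] args) {
		SparkSession spark = SparkSession.builder()
				.appName("MLlibExample")
				.master("local[*]")
				.getOrCreate();

		// load unlabelled feature vectors in libsvm format
		Dataset<Row> data = spark.read().format("libsvm")
				.load("data/mllib/sample_kmeans_data.txt");

		// unsupervised learning: group the unlabelled points into 2 clusters
		KMeans kmeans = new KMeans().setK(2).setSeed(1L);
		KMeansModel model = kmeans.fit(data);

		// print the learned cluster centers
		System.out.println("centers: " + Arrays.toString(model.clusterCenters()));

		spark.stop();
	}
}
```

A supervised algorithm would instead be trained on a Dataset containing both a features column and a label column.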


GraphX

GraphX is a graph processing engine that includes graph data processing libraries.


SparkR

SparkR consists of the R language libraries and also provides a command-line shell for R programmers.

Spark Architecture

Spark architecture is based on a Directed Acyclic Graph (DAG), in which data sets undergo sequential computations: each node represents an RDD partition and each edge represents a transformation on top of an RDD.

A Spark cluster mainly consists of three components:

1) A Driver Node
2) Worker Nodes
3) Cluster Manager

1) Driver Node

The driver program (SparkContext) sits on the driver node. SparkContext is like Java's main() function: just as execution in Java starts from main() and each program has its own main(), each Spark application starts with a SparkContext and each application has its own SparkContext.

The driver node contains various components: DAGScheduler, TaskScheduler, BackendScheduler and BlockManager. They are responsible for translating Spark user code into actual Spark jobs executed on the cluster.

Each Spark application runs in its own JVMs, both on the driver and on the worker nodes, which isolates applications from each other; hence data cannot be shared between applications without writing it to an external storage system.

When a Spark application is submitted to the driver, it converts the code into a logical directed acyclic graph (DAG) and negotiates with the cluster manager to acquire workers; the cluster manager then allocates workers for the job. These workers have to register themselves with the driver before they start executing the given tasks.

Spark jobs are also scheduled by the driver; when the application execution ends, the driver terminates all the executors and releases the resources back to the cluster manager.

2) Worker Nodes

On worker nodes, the processes that hold the data blocks in memory are called executors; the logic to be executed on the data sets is called a task, and an executor may run multiple tasks.

It is the executor's duty to perform all the data processing, read and write data to external sources, and store the computation results in memory, in cache or on hard disk drives.

Driver and workers can run on a vertical architecture, as different Java processes on the same machine, or on a horizontal architecture, spread across different machines.

3) Cluster Manager

The cluster manager is responsible for the allocation and deallocation of physical resources such as memory, CPU, etc.

Hadoop YARN, Apache Mesos or the simple standalone Spark cluster manager can each be launched on-premise or in the cloud for a Spark application to run.

Spark Lazy Execution

Spark applications generally follow a three-step process: loading the data into memory, transformation, and action. Spark does not load the data into memory (RAM) as soon as the first and second steps are executed.

Spark loads the initial RDD, and the RDDs transformed in the second step, into memory only after the action step executes.

Until the program hits an action, Spark will not perform any computation; this is called lazy execution.

RDDs (Resilient Distributed Datasets) are collections of distributed data blocks held in memory; these RDDs are immutable in nature.
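The three-step process above can be sketched in Java; nothing executes until the actions in step 3 run (the class name and numbers are illustrative):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyExample {
	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("LazyExample").setMaster("local[*]");
		JavaSparkContext sc = new JavaSparkContext(conf);

		// step 1: define the initial RDD -- nothing is loaded or computed yet
		JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

		// step 2: transformations -- still nothing runs, Spark only records the DAG
		JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);
		JavaRDD<Integer> doubled = evens.map(n -> n * 2);

		// step 3: actions -- only now does Spark execute the whole pipeline
		long count = doubled.count();           // 3 even numbers: 2, 4, 6
		int sum = doubled.reduce(Integer::sum); // 4 + 8 + 12 = 24

		System.out.println("count: " + count + ", sum: " + sum);
		sc.stop();
	}
}
```

Because the two transformations are only recorded, Spark can optimise the whole pipeline (for example, processing each element through filter and map in one pass) when the action finally triggers execution.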

Code Examples

It is possible to write Spark applications in Java, Scala and Python; these applications can be submitted to Spark clusters for data processing.

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class Application {
	public static void main(String[] args) {
		// from 2.2 SparkContext is replaced with SparkSession
		SparkSession spark = SparkSession.builder().appName("FruitApp").getOrCreate();

		// from 2.2 RDD is replaced with Dataset; read the file as a Dataset of lines
		Dataset<String> logData = spark.read().textFile("/home/techie/fruits.txt");

		// this will count occurrences of Apple in the file
		long appleCount = logData.filter((FilterFunction<String>) s -> s.contains("Apple")).count();

		// this will count occurrences of Orange in the file
		long orangeCount = logData.filter((FilterFunction<String>) s -> s.contains("Orange")).count();

		System.out.println("appleCount: " + appleCount + ", orangeCount: " + orangeCount);

		spark.stop();
	}
}

The code above reads data from a file and prints the total counts of "Apple" and "Orange" lines on the console.

Install Apache Spark

Apache Spark can be installed in easy steps from prebuilt packages for Hadoop and similar environments, or standalone, even on a bare Ubuntu machine. To set up Apache Spark on Ubuntu machines, see: Install Apache Spark on Ubuntu.

Spark Standalone mode Multi-Node cluster

An Apache Spark multi-node cluster can be set up using cluster managers like Hadoop YARN, Apache Mesos or the standalone Spark cluster manager.

To set up an Apache Spark cluster on Ubuntu machines using the simple standalone Spark cluster manager, see: Apache Spark Standalone mode Multi-Node cluster setup on Ubuntu.

Spark Java Application with Maven

Spark Java applications can be created as Maven Java projects by adding the Spark dependencies to pom.xml; the compiled application, in the form of a .jar file, can then be submitted to Spark clusters. Know more in detail here: "How to create a Spark Java application".

In this article we have seen a basic overview of how Spark can be used in Big Data Analytics and how it is better than traditional MapReduce environments. In upcoming articles we will see more about Spark cluster setup, job submission and much more.
  • Nov 12, 2017
  • Big Data