Big Data Technologies and NoSQL - What are Apache Hadoop, MapReduce, HDFS, Hive and Pig

From our previous discussion we know that big data is a term describing the rapid growth and management of both structured and unstructured data. A huge amount of data is coming in from a number of different sources, and its variety and sheer volume make it hard to handle consistently.

"We have already entered the era of big data, so having a large amount of data is not the problem; the problem is managing that data efficiently, so that every useful piece of information can be fetched from it easily."

Today we will take a brief look at some popular big data technologies that are being used to maintain big data these days. We need advanced technologies that not only provide storage for large amounts of data but also keep that data available at any point in time. I went through a number of web searches and other sources to collect information about the tools and techniques that are available and in use today. Here is a brief introduction to the top 5 big data technologies of today.

Schema-less and Column-oriented Databases (NoSQL)
We have been using table- and row-based relational databases for years; these databases are just fine for online transactions and quick updates. But when unstructured data in large volumes comes into the picture, we need databases without a hard-coded schema. A number of databases fit into this category; they can store unstructured, semi-structured or even fully structured data.

Apart from other benefits, the finest thing about schema-less databases is that they make data migration very easy. MongoDB is a very popular and widely used NoSQL database these days. NoSQL and schema-less databases are used when the primary concern is storing a huge amount of data rather than maintaining relationships between elements.

"NoSQL (Not Only SQL) is a class of databases that does not primarily rely upon a schema-based structure and does not use SQL for data processing."
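To make the schema-less idea concrete, here is a minimal sketch in Python (a toy in-memory store, not tied to any particular database) showing documents in the same collection carrying different fields:

```python
# Toy document store: each "document" is just a dict, and documents in
# the same collection are free to have different fields -- there is no
# fixed schema to alter or migrate when the shape of the data changes.
collection = []

collection.append({"name": "Alice", "email": "alice@example.com"})
collection.append({"name": "Bob", "age": 32, "tags": ["hadoop", "hive"]})

def find(collection, **criteria):
    # Queries simply inspect whatever fields happen to be present.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Bob"))
```

In MongoDB itself the equivalent operations would go through its driver API (for example `insert_one` and `find` in the Python driver), but the point is the same: no table definition is needed up front.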

Apache Hadoop 
Apache Hadoop is one of the main supporting elements of big data technologies. It simplifies the processing of large amounts of structured or unstructured data in a cost-effective manner. Hadoop is an open source project from Apache that has been continuously improving over the years.

"Hadoop is basically a set of software libraries and frameworks to manage and process big amounts of data, scaling from a single server to thousands of machines. It provides an efficient and powerful error detection mechanism based on the application layer rather than relying upon hardware."

In December 2011 Apache released Hadoop 1.0.0; more information and an installation guide can be found in the Apache Hadoop Documentation. Hadoop is not a single project but includes a number of other technologies, including:

MapReduce
MapReduce was introduced by Google to build its large web search indexes. It is basically a framework for writing applications that process large amounts of structured or unstructured data. MapReduce takes a query, breaks it into parts and runs those parts on multiple nodes. This distributed query processing makes it easy to handle large amounts of data by dividing the work across several different machines.

Hadoop MapReduce is a software framework for easily writing applications that manage large data sets in a highly fault-tolerant manner. More tutorials and a getting started guide can be found in the Apache Documentation.
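The split-and-distribute idea above can be sketched in a few lines of Python: a single-machine simulation of the map, shuffle and reduce phases for the classic word-count job (Hadoop would run the map and reduce functions on many nodes in parallel):

```python
from collections import defaultdict

# Word count in miniature: the same map/shuffle/reduce shape that
# Hadoop distributes across many machines, simulated on one.
def map_phase(line):
    # Emit a (key, value) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key -- Hadoop does this between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values emitted for one key into a single result.
    return (key, sum(values))

lines = ["big data is big", "data is data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In a real Hadoop job the map and reduce functions look much the same; the framework supplies the shuffling, distribution and fault tolerance.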

HDFS (Hadoop Distributed File System)
HDFS is a Java-based file system used to store structured or unstructured data over large clusters of distributed servers. The data stored in HDFS has no restrictions or rules applied to it; it can be fully unstructured or purely structured. In HDFS the work of making the data meaningful is done by the developer's code only.

The Hadoop Distributed File System provides a highly fault-tolerant environment through deployment on low-cost hardware machines. HDFS is now a part of the Apache Hadoop project; more information and an installation guide can be found in the Apache HDFS documentation.
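The fault tolerance comes largely from how HDFS stores files: it splits them into fixed-size blocks and keeps each block on several machines. Here is a toy Python sketch of that idea, with made-up node names and a tiny block size for illustration (real HDFS uses large blocks, e.g. 64 MB, a default replication factor of 3, and rack-aware placement):

```python
from itertools import cycle

BLOCK_SIZE = 8        # bytes here; HDFS blocks are tens of megabytes
REPLICATION = 3       # each block is stored on this many nodes
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster

def place_blocks(data):
    """Split data into blocks and assign each block to REPLICATION nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    ring = cycle(NODES)
    placement = []
    for block in blocks:
        # Round-robin placement -- real HDFS is smarter (rack awareness).
        placement.append((block, [next(ring) for _ in range(REPLICATION)]))
    return placement

for block, nodes in place_blocks(b"hello hadoop distributed fs"):
    print(block, "->", nodes)
```

If one node fails, every block it held still exists on the other nodes that hold its replicas, which is why commodity hardware is good enough.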

Hive
Hive was originally developed by Facebook and has now been open source for some time. Hive works something like a bridge between SQL and Hadoop; it is basically used to run SQL-like queries on Hadoop clusters.

Apache Hive is basically a data warehouse that provides ad-hoc querying, data summarization and analysis of huge data sets stored in Hadoop-compatible file systems. Hive provides a SQL-like query language called HiveQL for working with huge amounts of data stored in Hadoop clusters. In January 2013 Apache released Hive 0.10.0; more information and an installation guide can be found in the Apache Hive Documentation.
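For a flavour of HiveQL, here is an illustrative query over a hypothetical `page_visits` table (table and column names are made up for this example):

```sql
-- Hypothetical daily summarization in HiveQL; reads like ordinary SQL.
SELECT visit_date, COUNT(*) AS visits
FROM page_visits
GROUP BY visit_date;
```

Behind the scenes Hive compiles such a query into MapReduce jobs that run over the data stored in HDFS, so analysts can query Hadoop data without writing MapReduce code themselves.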

Pig
Pig was introduced by Yahoo and was later made fully open source. It also provides a bridge for querying data over Hadoop clusters, but unlike Hive it uses a scripting language to make Hadoop data accessible to developers and business users.

Apache Pig provides a high-level programming platform for developers to process and analyse big data using user-defined functions and their own programming effort. In January 2013 Apache released Pig 0.10.1, which is defined for use with Hadoop 0.10.1 or later releases. More information and an installation guide can be found in the Apache Pig Getting Started Documentation.
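As a flavour of that scripting style, the word-count job from the MapReduce section could be written in Pig's language, Pig Latin, roughly as follows (the input file name is hypothetical):

```pig
-- Pig Latin sketch: load text, split it into words, group and count.
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
```

Like Hive, Pig translates this script into MapReduce jobs, but the data-flow style of one named step after another tends to suit programmers more than SQL does.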

Big data is far too broad a topic to cover in a single blog post, but I hope this post has given you a good overview of the major big data technologies and their implementations.