Introduction to Apache ZooKeeper

This series of Big Data tutorials has covered a number of technologies. This article looks at what Apache ZooKeeper is, what its key features are, how it works and where to use it.

What is ZooKeeper

ZooKeeper is a centralized, open-source service for maintaining configuration information, naming and synchronization in a distributed cluster environment. It reduces the management complexity of distributed systems while providing low latency and high availability. ZooKeeper started as a sub-project of Hadoop and is now an independent top-level project of the Apache Software Foundation.

What ZooKeeper does

ZooKeeper is robust and highly available because the persisted data is replicated across multiple nodes, and a client can connect to any other working node if the one it was using fails. One node is elected as the leader to coordinate updates; if that node goes down, the leader role moves to another node in the cluster.

Key benefits of ZooKeeper

ZooKeeper provides a very simple interface and reliable services. Its main features are:

1) Fast: ZooKeeper is very fast for workloads where reads are more frequent than writes. It performs best at a read/write ratio of around 10:1.

2) Reliable: ZooKeeper is distributed and the data is replicated on each server, so if a node goes down the service stays available as long as a majority of the servers are up. There is no single point of failure.

3) Simple: ZooKeeper follows notions familiar from traditional file systems, so it is easy to understand, configure and maintain.
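
The reliability point above comes down to simple arithmetic. The following is a minimal sketch, assuming that the "critical mass" of servers means a strict majority of the cluster; the function names are made up for this illustration and are not part of any ZooKeeper API.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("a cluster needs at least one server")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many servers may be down while a majority is still up."""
    return cluster_size - quorum_size(cluster_size)


def is_available(cluster_size: int, servers_up: int) -> bool:
    """The service is available while a majority of servers is up."""
    return servers_up >= quorum_size(cluster_size)


for size in (3, 4, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

Under this assumption a cluster of 3 and a cluster of 4 both survive only one failed server, which is why odd cluster sizes are the usual choice.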

How ZooKeeper Works

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data nodes called znodes. Much like in a traditional file system, a name in this namespace is a sequence of path elements separated by slashes (/). Every node is identified by its slash-separated path, and every node except the root (/) has a parent whose path is a prefix of the child's path.
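
To make the path rules concrete, here is a small sketch in plain Python. It only models the naming scheme described above; the helper names and the example path /app/config/db are hypothetical, and the validation is deliberately simpler than what a real server enforces.

```python
from typing import List, Optional


def validate_path(path: str) -> None:
    """Check the basic shape of a slash-separated znode path."""
    if not path.startswith("/"):
        raise ValueError("a path must start with '/'")
    if path != "/" and path.endswith("/"):
        raise ValueError("a path must not end with '/'")
    if "//" in path:
        raise ValueError("a path must not contain empty elements")


def parent_of(path: str) -> Optional[str]:
    """Return the parent path, or None for the root node."""
    validate_path(path)
    if path == "/":
        return None
    head, _, _ = path.rpartition("/")
    return head or "/"


def ancestors(path: str) -> List[str]:
    """List every ancestor from the direct parent up to the root."""
    chain = []
    current = parent_of(path)
    while current is not None:
        chain.append(current)
        current = parent_of(current)
    return chain


print(parent_of("/app/config/db"))  # /app/config
print(ancestors("/app/config/db"))  # ['/app/config', '/app', '/']
```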

ZooKeeper was designed to store configuration metadata only, so there is a limit on how much data each node can hold. Znodes are expected to hold small amounts of data, in the KB range, and ZooKeeper has a built-in sanity check that rejects data larger than 1 MB.

This restriction prevents nodes from storing large data sets. ZooKeeper was designed to process configuration and other small pieces of information quickly, so it expects small data in its nodes.
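
A client-side guard in the same spirit might look like the sketch below. The constant simply encodes the 1 MB figure from the text, and the function and exception names are invented for this example; the real check happens inside ZooKeeper itself.

```python
# Assumption: the limit is taken to be the 1 MB figure quoted in the text.
MAX_ZNODE_BYTES = 1024 * 1024


class ZnodeTooLargeError(ValueError):
    """Raised when a payload is too big to store in a single node."""


def check_payload(data: bytes) -> bytes:
    """Return the payload unchanged if it fits, otherwise raise."""
    if len(data) > MAX_ZNODE_BYTES:
        raise ZnodeTooLargeError(
            f"payload is {len(data)} bytes, limit is {MAX_ZNODE_BYTES}"
        )
    return data


# A short configuration string passes the check easily.
check_payload(b"db.host=10.0.0.5")
```

Checking before sending keeps the failure close to the code that produced the oversized value.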

The service is replicated on every server in the architecture. Each of these machines maintains an in-memory image of the data tree, along with transaction logs and snapshots in a persistent store.
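
The interplay of in-memory image, transaction log and snapshot can be sketched as follows. This is a toy model, not ZooKeeper's storage code: the class name is made up, and plain Python lists and dicts stand in for the persistent store.

```python
class TinyStore:
    """Toy model of one server: in-memory tree, log and snapshot."""

    def __init__(self):
        self.tree = {}      # in-memory image: path -> data
        self.log = []       # transactions recorded since the last snapshot
        self.snapshot = {}  # copy of the tree at the last snapshot

    @staticmethod
    def _apply_to(tree, op, path, data):
        if op == "set":
            tree[path] = data
        elif op == "delete":
            tree.pop(path, None)

    def apply(self, op, path, data=None):
        """Record a transaction in the log, then apply it in memory."""
        if op not in ("set", "delete"):
            raise ValueError(f"unknown operation: {op}")
        self.log.append((op, path, data))
        self._apply_to(self.tree, op, path, data)

    def take_snapshot(self):
        """Save the current tree and start a fresh log."""
        self.snapshot = dict(self.tree)
        self.log = []

    def recover(self):
        """Rebuild the tree from the snapshot plus the logged changes."""
        tree = dict(self.snapshot)
        for op, path, data in self.log:
            self._apply_to(tree, op, path, data)
        return tree
```

Replaying the log on top of the last snapshot yields the same tree that was held in memory, which is how a restarted server could get back to its previous state.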

Because all the data is held in memory, ZooKeeper has low latency and high throughput. Keeping data in memory is also the reason nodes are expected to hold only a small amount of data.

A client connects to a single server only and maintains a session with it over a TCP connection. If the connection fails or the server goes down, the client connects to another server, and the session carries on there as long as the client reconnects before the session times out.
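
The failover step can be illustrated with a few lines of Python. This is only a sketch of the server selection: the host names zk1, zk2 and zk3 are hypothetical, and a simple membership test stands in for an actual TCP connection attempt.

```python
def pick_server(servers, is_reachable, exclude=None):
    """Return the first reachable server, skipping the excluded one."""
    for server in servers:
        if server != exclude and is_reachable(server):
            return server
    raise ConnectionError("no reachable server in the list")


servers = ["zk1:2181", "zk2:2181", "zk3:2181"]  # hypothetical hosts
up = set(servers)                               # servers currently running

current = pick_server(servers, up.__contains__)
print(current)  # zk1:2181

# The server we were using goes down, so we move to another one.
up.discard(current)
current = pick_server(servers, up.__contains__, exclude=current)
print(current)  # zk2:2181
```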

Reads are much faster than writes. A read is served locally by the single server the client is connected to and involves no other server, whereas a write has to be replicated to the other servers in the cluster.
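
A rough way to see the cost difference is to count how many servers each operation touches. The sketch below is a simplification under stated assumptions: the class name is invented, each server is just a dict, and a write is copied to every replica directly rather than going through any agreement protocol.

```python
class ToyCluster:
    """Counts how many replicas each operation has to touch."""

    def __init__(self, size):
        self.replicas = [{} for _ in range(size)]
        self.replica_accesses = 0

    def read(self, server_index, path):
        """A read is answered by one server from its local copy."""
        self.replica_accesses += 1
        return self.replicas[server_index].get(path)

    def write(self, path, data):
        """A write has to reach every replica in the cluster."""
        for replica in self.replicas:
            replica[path] = data
            self.replica_accesses += 1


cluster = ToyCluster(5)
cluster.write("/config/mode", b"fast")  # touches 5 replicas
cluster.read(3, "/config/mode")         # touches 1 replica
print(cluster.replica_accesses)         # 6
```

The gap grows with the size of the cluster, which matches the observation that read-heavy workloads suit ZooKeeper best.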

Where to use ZooKeeper

The most common example of ZooKeeper usage is distributed-memory computation, where data is shared between distributed nodes and must be accessed and updated carefully to keep it synchronized.

ZooKeeper offers a library for constructing your own synchronization primitives, while its ability to run as a distributed service avoids the single-point-of-failure issue you have when using a centralized (broker-like) message repository.
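
One well-known primitive built this way is a lock in which every contender creates a numbered node and the lowest number holds the lock. The sketch below imitates that idea entirely in memory so it can run without a server; the class and method names are invented here and are not the ZooKeeper client API.

```python
import itertools


class ToyLockQueue:
    """In-memory stand-in for a lock built on numbered child nodes."""

    def __init__(self):
        self._next_seq = itertools.count()
        self._waiting = {}  # sequence number -> client name

    def enqueue(self, client):
        """Register a contender and hand back its sequence number."""
        seq = next(self._next_seq)
        self._waiting[seq] = client
        return seq

    def holder(self):
        """The contender with the lowest sequence number owns the lock."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]

    def release(self, seq):
        """Remove a contender, for example when it finishes or disconnects."""
        self._waiting.pop(seq, None)


queue = ToyLockQueue()
a = queue.enqueue("worker-a")
b = queue.enqueue("worker-b")
print(queue.holder())  # worker-a
queue.release(a)
print(queue.holder())  # worker-b
```

Because ownership is decided by the order of the numbers, no contender needs to talk to any other contender directly.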

Many technologies, such as Apache Hadoop, HBase, Apache Solr, Neo4j and Apache Kafka, rely on ZooKeeper for configuration management and fault-tolerant replication.

In this article we have seen what Apache ZooKeeper is, what its key features are, how it works and where to use it. Upcoming articles will cover more about ZooKeeper and other open-source technologies.

About The Author

Nagesh Chauhan

Nagesh Chauhan has 8+ years of software design and development experience in a variety of technologies such as Core Java, Java 8 (Streams, Lambda), J2EE (Servlet, JSP), Spring Framework (MVC, IOC, JDBC, Security etc.), Spring Boot and Microservices, Kafka, Redis, Cassandra and Spark.