Everyone knows about Hadoop, and everyone knows that it is mainly used for Big Data processing, i.e. distributed processing. However, the terminology used for its components sometimes causes confusion, particularly the terms covered below. Here are some notes from a discussion we had on Hadoop clusters.
Generally, a Hadoop cluster comprises two main types of node: Master nodes and Slave nodes.
Master node: This node manages all services and operations. A single Master node is enough for a cluster, but adding a secondary one improves scalability and high availability. The Master node's main job is to run the NameNode process, which coordinates Hadoop's storage operations.
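As an illustration, every node in the cluster finds the Master node through shared configuration. A typical core-site.xml entry pointing the cluster at the NameNode looks like the following sketch (the hostname and port are placeholders, not values from this article):

```xml
<!-- core-site.xml: all nodes point at the NameNode's filesystem URI.
     "master-node.example.com:9000" is a placeholder, not a real address. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-node.example.com:9000</value>
  </property>
</configuration>
```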
Slave node: This node provides the required infrastructure, such as CPU, memory, and local disk, for storing and processing data. It runs all the slave processes, the main one being the DataNode process. A cluster generally comprises at least three Slave nodes, but it can easily be scaled out by adding more Slave nodes.
NameNode: This process runs on the Master node and is responsible for coordinating HDFS functions. For example, when the location of a file block is requested, the Master node gets that location from the NameNode process.
DataNode: This process handles the actual reading and writing of data blocks from and to storage. It runs on a Slave node and acts as a slave to the NameNode.
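The division of labour above can be sketched in a few lines of Python. This is a toy model, not Hadoop code: the class and method names are invented for illustration. The point it shows is that the NameNode keeps only metadata (which DataNodes hold which blocks), while the DataNodes store the actual block contents.

```python
class DataNode:
    """Toy stand-in for the DataNode process on a Slave node: stores block data."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy stand-in for the NameNode process: holds metadata only, no file data."""
    def __init__(self):
        self.block_locations = {}  # (filename, block_no) -> [DataNode, ...]

    def register_block(self, filename, block_no, datanodes):
        self.block_locations[(filename, block_no)] = datanodes

    def locate_block(self, filename, block_no):
        return self.block_locations[(filename, block_no)]


# A client first asks the NameNode where a block lives, then reads the
# data directly from one of the DataNodes it was told about.
namenode = NameNode()
dn1, dn2 = DataNode("slave-1"), DataNode("slave-2")

# Write block 0 of "report.txt" to two DataNodes (replication factor 2).
for dn in (dn1, dn2):
    dn.write_block(("report.txt", 0), b"hello hdfs")
namenode.register_block("report.txt", 0, [dn1, dn2])

# Read path: metadata lookup on the NameNode, data read from a DataNode.
locations = namenode.locate_block("report.txt", 0)
print(locations[0].read_block(("report.txt", 0)).decode())  # hello hdfs
```

In real HDFS the client likewise contacts the NameNode only for metadata and then streams block data directly from DataNodes, which is why the Master node is not a bandwidth bottleneck for reads and writes.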
In addition to these, there are many other components, such as the JobTracker, TaskTracker, HDFS itself, etc. There are many articles on them; refer to the following if you need more details: