How to create a Hadoop cluster and contribute a limited/specific storage amount as a Data Node in the Hadoop cluster?

Hello, learners hope you are here to learn something about Hadoop and its storage management.

Pre-requisites for the blog


Let's get started with some of the vocabulary I will be using in this blog.

Hadoop: Hadoop, also known as Apache Hadoop, is an open-source technology/software used to solve the issues of managing and using Big Data for computation using a network of computers, also known as a cluster.

Hadoop File-System(HDFS): HDFS is a distributed filesystem used by Hadoop to allow multiple files to be stored simultaneously and with high speed in the Hadoop cluster.

Name/Master Node: Name node is the main centerpiece in HDFS, which manages the metadata of the data stored in the cluster. It doesn’t store the data, but it stores all the data nodes' data. The name node is also a single point of failure in the Hadoop filesystem.

Data/Slave Node: Data node stores the Hadoop filesystem data and is constantly connecting with the name node and receiving and giving data when required. All data nodes in the cluster store replica in them so that if one is down, the filesystem doesn’t fail.

As we are comfortable with the terminology we will be using it in the blog, so let's get started with the task.

We will be doing the task with two instances. We will be configuring one name node and one data node. You can add more data node as per your requirement with the same configuration.


First, we will go with the configuration of the master node. We have to launch an EC2 instance ( you can configure your system as a name node or use VM and configure the system there ).

For launching an AWS instance, you can check my previous blog. We launched two instances naming “master node” and “data node.”

After launching an instance, we need to have “Hadoop” and “JDK” software in our system. JDK supports Hadoop to run, so we need JDK. If you are using AWS instance or VM, you can download the .rpm file of the software and send it to the instance using SCP protocol using WinSCP (can be used to transfer files between Linux and Windows).

As shown in the below image, we have to open the WinSCP app and copy our instance public IP and paste it in the WinSCP hostname, and we, the user same will be “ec2-user” by default.

We use the same “.pem key,” which we downloaded while creating the instance for authentication. (I have talked about this in my previous blog).

After adding the .pem key and IP and username, we will be logged in to WinSCP when we log in. Now we can drag and drop the required programs into our Linux from Windows.

Now we have successfully transferred both “Hadoop” and “JDK” rpm file into our Linux. Now when we log in to our master instance and login with the root account using the command “$sudo -su root” and check the file using the “$ls” command, we can see the “Hadoop” and “JDK” file.

Now we need to install both the software in our system using the command:

$ rpm -i hadoop-1.2.1–1.x86_64.rpm --force   //for hadoop
$ rpm -i jdk-8u171-linux-x64.rpm //for jdk

We have to use force during the Hadoop installation because Hadoop cannot override some files on the system, so we installed it forcefully. You can see it on the image (I have this software already in my system, so it didn’t prompt any output.)

Now we can check if Hadoop is there in the system or not by getting inside Hadoop’s file directory using the command:

$cd /etc/hadoop

Now is the time for configuration. We have to configure our instance to act as a master node now.

For configuration, we have to configure two files inside /etc/Hadoop directory. the files will be “hdfs-site.xml” and “core-site.xml.”

Before configuring the name node, we need to make a directory in the root folder to know why after a few steps. We can use the command:

$mkfs //nn     

We will open the hdfs-site.xml file at first in our vi code editor inside the terminal using the command:

$cd /etc/hadoop
$vi hdfs-site.xml

We edit the file by first using the “i” character to go in insert mode and then write the <property> tag inside the <configuration> tag. Inside <property> tag, we write <name> tag where we define where on the local filesystem the dfs name node should store the file. And the <value> tag indicated the directory where it should include. For the value, we gave the “/nn” directory, which we created a few steps before in / directory: to store the name node details.

To save this configuration file, we need to come outside insert mode using the ESC key and then write “:wq” and press Enter.

Now we have to configure our core-site.xml file. The same way we open our core-site file in vi editors using the command:

$vi core-site.xml

Then as in hdfs-site.xml file we create <property> tag and add <name> and <value> tag inside core-site.xml file. In <name> tag, we define the name node's address, and all the HDFS command refers to the name node address. And inside <value> tag, we define the protocol at first, i.e., “hdfs” then we use the IP IP) to indicate we can receive a connection from any IP, and “9001” is the port where the name node will listen to all data node.

We save this file by coming out of insert mode and saving the file with the “:wq” command.

Now our configuration is done; we have to start our Hadoop service. We have installed Hadoop, configured it, and now we have to run the Hadoop service in our system.

But first, we have to clear our cache memory so that there is no issue with starting our Hadoop service, and we can do that by using the command:

$echo 3 > /proc/sys/vm/drop_caches

You can check the difference in the below image.

Now we start our Hadoop master service using the command:

# hadoop-daemon.sh start namenode

After starting the service, we can check if our service is started or not by using the command:

# jps

We can see in the below image our name node has been started.


Now we have to launch our data node instance install the software there using WinSCP, and do the same process as we did for the master node until Hadoop and JDK installation.

After the installation, we need to go to the same /etc/Hadoop file location and configure our “hdfs-site.xml.” In the hdfs-site file, we configure the <name> tag with dfs.data.dir, which means our data node will store the file in the allocated directory. And the <value> tag includes the directory /data node1, which we create in the root directory as we created for name node, and it takes all the storage of the data node instance.

After saving the file, now we have to open the “core-site.xml” file. And there in the <name> tag add the property where it tells the address of name node which is needed to be connected and <value> tag include the protocol, i.e., hdfs and the public IP of the name node, and the port is 9001.

Now we have to start the data node also, and we can do that using the command:

$ hadoop-daemon.sh start datanode

We can see that our data node service has been started.

Now to confirm, we can check the status of our Hadoop cluster using the command:

# hadoop dfsadmin -report

We can see that in our cluster, we have one data node connected, and we can see the IP of the data node and the storage we contributed.

We can even see the data node's storage, and we have 10GiB of storage in the below image from which we contributed 9.99GiB.

We can say that we have successfully designed a Hadoop cluster with one master and one slave node. We can add many slave nodes with the same configuration and contribute their storage to the cluster.


After configuring the Hadoop cluster, we will now move to using a specific storage amount or limiting the data node's storage amount to contribute to the cluster. Let's get started with that….

We have attached an EBS volume to our instance (it’s like we have added a new hard drive in our OS). And now we have to mount that hard drive, and we can do mounting in 3 easy steps. You can learn more about disk partition in my previous blog. In the below image, you can see a new disk of 5GiB, which is unmounted.

Now we start going for a hard disk partition. We can get inside the disk by using the command:

# fdisk /dev/xvdf

And now, we use the key ’n’ for the new disk partition. We use the key ‘p’ for the primary partition type; then, we have to select the start and end sector point of the hard drive. We left that as by default value. If you want a specific amount from the same hard drive to contribute to the cluster, you can set the end limit as per your requirement. Here I went with the full hard drive size. Then we save the partition with the key ‘w.’ If you want to be clearer about these steps, please take some reference from my blog about disk partition.

Now on the below image, we can see that our disk has been created.

Now we have to format the hard drive before using it to create an Inode Table.

Now here comes the main point, while mounting, we used to make a directory in the root folder, and then we used to mount the storage device in that folder. Still, this time we will be mounting the storage to the directory which our data node is using, i.e., “/datastore” (Above I used ‘/datanode1’ directory I updated the “hdfs-site.xml” file with the value tag value to ‘/datastore’). So we are mounting our new storage device to the same directory.

As an outcome of that mount, we can see in the below image that our cluster's total storage capacity has changed from 9.99GiB to 4.51GiB as we mounted a 5GiB hard drive.

Using the command “Hadoop dfsadmin -report” we can see the detail of the cluster.

We can now contribute a limited or specific amount of storage as a Data Node to the cluster.

One drawback of this method is we loose our data which was on the previous directory. But the data is saved as replica in other data nodes. I will be explaining how we can make this step dynamic using LVM in my upcoming blog.


Finally, we have completed our task of Creating a Hadoop Cluster with one name node and one data node, and we also learned how to contribute a specific or limited amount of storage as a data node to the cluster.

I hope I have explained every detailed bit by bit, and if you have any doubts or suggestions, you can comment on this blog or contact me on my LinkedIn post.

Thank you for staying till the end of the blog, and please do suggest to me some ideas for improvement. Your suggestions will really motivate me.

I blog about ML, Big Data, Cloud Computing. And improving to be the best.