Configuring & Starting Hadoop Cluster Using Ansible Playbook

An Ansible playbook that configures the Name Node and Data Node in Hadoop and starts the cluster.

MishanRG
Mar 20, 2021


Greetings Everyone!!! I hope you all are having a great day.

Let me start by explaining Ansible. In short, Ansible is a powerful automation tool for configuration management. At the industry level, we may have to set up hundreds or thousands of servers to provide smooth service, and configuring them manually one by one is not practical. Instead, we use tools that let us write the required configuration as code, run it from one system, and have the whole cluster of systems configured. Ansible is one of those tools. You can check my other blog for more information about Ansible and its use cases in industry.

You can also check my other blog if you want some insight into how a Hadoop cluster works and how we can configure it.

So, let’s get started with the configuration.

INSTALLATION

To write a playbook, we first need to be clear about what we want to automate, and we need to know every step of the manual setup. We can then go step by step while writing the playbook code. First, we need a minimum of two systems: one Name Node and one Data Node. We can add more Data Nodes later. Then we need the Hadoop and JDK installation files. So here is the playbook I created to install both of these packages on all the systems.

We transferred the packages from the Ansible Controller Node to the Managed Node and installed them there.
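The copy-then-install flow described above can be sketched roughly as follows. This is a minimal illustration, not the original playbook: the inventory group `hadoop`, the directory paths, the placeholder file names, and the assumption that both packages are RPMs on a RHEL-family host are all mine.

```yaml
# Sketch only: group name, paths, and file names are assumptions.
- name: Install JDK and Hadoop on all cluster nodes
  hosts: hadoop                      # hypothetical inventory group
  become: true
  vars:
    pkg_src_dir: /root/packages      # hypothetical directory on the controller
    pkg_dest_dir: /tmp               # where the files land on the managed node
    packages:                        # placeholder file names, JDK first
      - jdk.rpm
      - hadoop.rpm
  tasks:
    - name: Copy the installation files from the controller to the managed node
      ansible.builtin.copy:
        src: "{{ pkg_src_dir }}/{{ item }}"
        dest: "{{ pkg_dest_dir }}/{{ item }}"
        mode: "0644"
      loop: "{{ packages }}"

    - name: Install the copied packages from their local paths
      ansible.builtin.yum:
        name: "{{ pkg_dest_dir }}/{{ item }}"
        state: present
        disable_gpg_check: true      # local files are usually unsigned
      loop: "{{ packages }}"
```

Listing the JDK first matters because the loop runs in order and Hadoop needs Java to be present. Some Hadoop packages may need extra install options that this sketch does not cover.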

As the above image shows, the installation is complete on our Managed Node.
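If you prefer to confirm the installation from the controller instead of logging in to each node, a small check play along these lines could help. It is only a sketch, and the group name `hadoop` is an assumption.

```yaml
# Sketch only: the group name is an assumption.
- name: Check that Java and Hadoop are installed
  hosts: hadoop                      # hypothetical inventory group
  tasks:
    - name: Ask each tool for its version (the play fails if a command is missing)
      ansible.builtin.command: "{{ item }}"
      loop:
        - java -version
        - hadoop version
      register: version_checks
      changed_when: false            # read-only commands never change the host

    - name: Show what each command printed
      ansible.builtin.debug:
        # java -version writes to stderr, so both streams are shown
        msg: "{{ item.stdout_lines + item.stderr_lines }}"
      loop: "{{ version_checks.results }}"
      loop_control:
        label: "{{ item.item }}"
```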

CONFIGURING & STARTING THE NAME NODE

We will write a playbook that configures one of our Managed Nodes as the Name Node and starts Hadoop there.

The code below configures a system as a Hadoop Name Node. We ran this playbook on the Managed Node that will work as the Name Node.

We have two files: one is the configuration file, and the other is the variable file.
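To make the split between the two files concrete, a variable file for the Name Node might look like the sketch below. The file name `namenode_vars.yml` and every key and value in it are assumptions for illustration, not the contents of the original file.

```yaml
# namenode_vars.yml (hypothetical file name; all keys and values are illustrative)
hadoop_conf_dir: /etc/hadoop     # assumed location of the Hadoop configuration files
namenode_dir: /nn                # assumed directory for Name Node metadata
namenode_port: 9001              # assumed port the Name Node listens on
```

A play can load such a file with `vars_files: [namenode_vars.yml]` and then refer to the values as `{{ namenode_dir }}` and so on, which keeps node-specific settings out of the task list.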

I have included the Hadoop configuration file mentioned in the variable file in my GitHub repository. Here is the link.
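For readers who want a feel for the steps involved, here is a minimal sketch of a Name Node playbook. It is not the code from the repository. The group name, paths, and port are assumptions, the variables are defined inline to keep the example in one piece, and the property names `dfs.name.dir` and `fs.default.name` are the Hadoop 1.x names (newer releases call them `dfs.namenode.name.dir` and `fs.defaultFS`).

```yaml
# Sketch only: group name, paths, port, and Hadoop 1.x property names are assumptions.
- name: Configure and start the Hadoop Name Node
  hosts: namenode                    # hypothetical inventory group
  become: true
  vars:                              # could live in a separate variable file instead
    hadoop_conf_dir: /etc/hadoop     # assumed config location; varies by install method
    namenode_dir: /nn                # hypothetical metadata directory
    namenode_port: 9001              # hypothetical port
  tasks:
    - name: Create the directory that stores Name Node metadata
      ansible.builtin.file:
        path: "{{ namenode_dir }}"
        state: directory
        mode: "0755"

    - name: Write hdfs-site.xml with the metadata directory
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>{{ namenode_dir }}</value>
            </property>
          </configuration>

    - name: Write core-site.xml with the address the Name Node listens on
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://0.0.0.0:{{ namenode_port }}</value>
            </property>
          </configuration>

    - name: Format the Name Node only if it has not been formatted before
      ansible.builtin.shell: echo Y | hadoop namenode -format
      args:
        creates: "{{ namenode_dir }}/current"   # skip when metadata already exists

    - name: Start the Name Node daemon (assumes hadoop-daemon.sh is on the PATH)
      ansible.builtin.command: hadoop-daemon.sh start namenode
```

The `creates` guard is there so that re-running the playbook does not wipe an existing file system image. The start task is not idempotent in this sketch and will report an already-running daemon on a second run.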

In the above image, we can see that the Name Node has been started. We can confirm this with the “jps” command, which shows a NameNode service running after we ran the playbook.
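The same “jps” check can be automated so that the play fails when the service is missing. A minimal sketch, assuming a hypothetical inventory group named `namenode` and that `jps` is on the PATH:

```yaml
# Sketch only: the group name is an assumption.
- name: Confirm that the NameNode process is running
  hosts: namenode                    # hypothetical inventory group
  become: true
  tasks:
    - name: List the running Java processes
      ansible.builtin.command: jps
      register: jps_out
      changed_when: false            # jps only reads state

    - name: Fail unless a line of the jps output ends with " NameNode"
      ansible.builtin.assert:
        that:
          - "jps_out.stdout_lines | select('search', ' NameNode$') | list | length > 0"
        fail_msg: "NameNode is not running on {{ inventory_hostname }}"
```

Matching on the leading space keeps a `SecondaryNameNode` entry from being mistaken for the Name Node itself.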

CONFIGURING & STARTING THE DATA NODE

Now we will write a playbook to start the Data Node in our Hadoop cluster. Here, the variable file I used in the playbook points to the configuration file for the Data Node. You can see the code of my playbook below:

As we can see in the above code, I passed a pre-created configuration file to the system and started the service. The image below shows that the Data Node has started.
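The Data Node steps described above (pass a pre-created configuration file, then start the service) can be sketched as follows. This is an illustration under assumptions, not the original playbook: the group name, the paths, and the idea that the two XML files sit under `files/datanode/` next to the playbook are all hypothetical.

```yaml
# Sketch only: group name, paths, and file locations are assumptions.
- name: Configure and start a Hadoop Data Node
  hosts: datanode                    # hypothetical inventory group
  become: true
  vars:
    hadoop_conf_dir: /etc/hadoop     # assumed config location; varies by install method
    datanode_dir: /dn                # hypothetical block storage directory
  tasks:
    - name: Create the directory that stores HDFS blocks
      ansible.builtin.file:
        path: "{{ datanode_dir }}"
        state: directory
        mode: "0755"

    - name: Copy the pre-created configuration files to the node
      ansible.builtin.copy:
        src: "datanode/{{ item }}"   # expected under files/datanode/ beside the playbook
        dest: "{{ hadoop_conf_dir }}/{{ item }}"
        mode: "0644"
      loop:
        - core-site.xml              # should point at the Name Node address
        - hdfs-site.xml              # should name the block storage directory

    - name: Start the Data Node daemon (assumes hadoop-daemon.sh is on the PATH)
      ansible.builtin.command: hadoop-daemon.sh start datanode

    - name: List the running Java processes
      ansible.builtin.command: jps
      register: jps_out
      changed_when: false

    - name: Show the jps output so the DataNode entry can be checked
      ansible.builtin.debug:
        var: jps_out.stdout_lines
```

Adding more Data Nodes then only requires adding hosts to the inventory group, since the same play runs on every member.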

CONCLUSION

So this way, we can configure our Hadoop cluster using Ansible automation. We can also make the playbook more dynamic by using a variable for the software package version, so that any required version can be installed. I hope my work is explanatory.
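One possible way to do that is to build the package file name from a version variable. The sketch below assumes an RPM package, a hypothetical file naming pattern, a hypothetical group name and directories, and uses `1.2.1` purely as an example default.

```yaml
# Sketch only: naming pattern, paths, group name, and default version are assumptions.
- name: Install a chosen Hadoop version
  hosts: hadoop                      # hypothetical inventory group
  become: true
  vars:
    hadoop_version: "1.2.1"          # example default, override at run time
    hadoop_pkg: "hadoop-{{ hadoop_version }}.rpm"   # hypothetical file naming pattern
    pkg_src_dir: /root/packages      # hypothetical directory on the controller
    pkg_dest_dir: /tmp
  tasks:
    - name: Copy the selected Hadoop package to the managed node
      ansible.builtin.copy:
        src: "{{ pkg_src_dir }}/{{ hadoop_pkg }}"
        dest: "{{ pkg_dest_dir }}/{{ hadoop_pkg }}"
        mode: "0644"

    - name: Install the selected Hadoop package
      ansible.builtin.yum:
        name: "{{ pkg_dest_dir }}/{{ hadoop_pkg }}"
        state: present
        disable_gpg_check: true
```

Another version can then be chosen without editing the file, for example by passing `-e hadoop_version=<version>` to `ansible-playbook`, provided the matching package exists in the source directory.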

You can find the whole code on my GitHub at the link mentioned below.

I hope I have explained everything. If you have any doubts or suggestions, you can comment on this blog or contact me on LinkedIn.

Thank you for staying till the end of the blog, and please do suggest some ideas for improvement. Your suggestions will really motivate me.
