Integrating LVM with Hadoop and Providing Elasticity to Data Node Storage

How to integrate LVM in a Hadoop environment and provide elasticity to Data Node storage

Hello Geeks. Let’s learn something new today.

Today we will learn how to create a Logical Volume on a Data Node's storage and contribute it to a Hadoop Cluster, making the Data Node elastic by nature.

Some basic terms we need to understand:

Logical Volume Management: LVM is a way of allocating space on storage devices that is more flexible than conventional partitioning schemes. It is elastic by nature, which means we can increase and decrease the storage size whenever we need to, without losing data.

Elasticity: Elasticity refers to a storage system's capability to adapt to variable workload changes by allocating and deallocating resources as required by each application.

Data Node: In a Hadoop Cluster, the Data Nodes are the systems that hold the data of the filesystem. There are many Data Nodes in a single cluster, and their information is stored on and controlled by the Name Node.

Let's learn a bit more about LVM before getting to the topic. As we know, LVM is a storage-management scheme in which the user can increase and decrease the amount of storage in a device without losing its data and without taking the storage offline. It gives the user elasticity, flexibility, and control. LVM can also be called dynamic storage.

So, how is LVM set up? First, we create Physical Volumes from the available storage devices. We then contribute the storage of those Physical Volumes to a Volume Group, which pools their total capacity. Finally, from the Volume Group, we can create as many Logical Volumes as we want until the Volume Group runs out of storage.
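
As a minimal sketch of that flow (the disk, VG, and LV names below are placeholders chosen for illustration; on an EC2 instance the attached volumes might appear as /dev/xvdf and /dev/xvdg):

# pvcreate /dev/xvdf /dev/xvdg                //create Physical Volumes from the disks
# vgcreate myvolume /dev/xvdf /dev/xvdg       //pool them into a Volume Group
# lvcreate --name mylv1 --size 3G myvolume    //carve a Logical Volume out of the pool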

Now that we have some idea of what LVM is, let's get started…

Step 1

First, let's launch our AWS EC2 instances for the Name Node and the Data Node.

Now we will connect to our Data Node and set up LVM, for which we need the LVM software. First, we can check which package provides LVM using the command:

# rpm -q --whatprovides lvm2

Then we install the software using the yum command:

# yum install <output of above code>
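
In practice, the package that provides LVM on Red Hat based distributions is lvm2, so the install command typically ends up being:

# yum install lvm2 -y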

We have successfully installed the LVM software. Next, we will create and attach two EBS volumes to our instance. (If we are doing this on a local VM, we can simply add virtual disks to the VM instead.) Here we attached 2 volumes of 5GiB each.

Now we have to turn these disks into Physical Volumes, and we can do that with the command:

# pvcreate <diskname> <diskname>
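
For example, if the two attached EBS volumes show up as /dev/xvdf and /dev/xvdg (the device names on your instance may differ; check with "fdisk -l"):

# pvcreate /dev/xvdf /dev/xvdg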

After creating a Physical Volume, we can view it and get information about it, such as whether it is allocated and which Volume Group it belongs to, using the command:

# pvdisplay               //To see all physical volumes
# pvdisplay <diskname>    //To see a specific physical volume

After creating the Physical Volumes, we will create a Volume Group. A Volume Group combines the Physical Volumes we added so that both are available as one disk with their total size. The Volume Group size here will be 10GiB because we added two Physical Volumes of 5GiB each.

Here we created a VG with the name “myvolume” and added the two PVs to it. We can create a volume group using the command:

# vgcreate <name for VG> <pv name> <pv name>
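
For example, using the assumed device names from earlier:

# vgcreate myvolume /dev/xvdf /dev/xvdg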

We can see the details about our volume group using the command:

# vgdisplay               //To see all volume groups
# vgdisplay <VG name>     //To see a specific volume group

Now, from our Volume Group of 10GiB, we will create a Logical Volume. We can create many Logical Volumes from the VG until its capacity is fully used. To create a Logical Volume, we have to give it a name, allocate a size for it, and state which VG we want to create it from.

To create a logical volume, we can use the command:

# lvcreate --name <name for LV> --size <lv size> <VG name>
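
For example, to create a 3GiB Logical Volume named "mylv1" (a name picked here just for illustration) from "myvolume":

# lvcreate --name mylv1 --size 3G myvolume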

We can see the details about the LV using the command:

# lvdisplay                            //To see all logical volumes
# lvdisplay /dev/<VG name>/<LV name>   //To see a specific logical volume

Now that we have created a Logical Volume, we can check it using the command "fdisk -l", where we can see our Logical Volume of 3GiB.

To use our Logical Volume, we first have to format it to create a fresh inode table. We will be using the ext4 filesystem to format our Logical Volume, with the command:

# mkfs.ext4 /dev/<VG Name>/<LV Name>
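
With the example names used so far, that would be:

# mkfs.ext4 /dev/myvolume/mylv1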

After formatting, we mount the storage on a directory called "/datastore" using the command:

# mount /dev/<VG Name>/<LV Name>   /datastore
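
For example, creating the mount point first if it doesn't exist yet:

# mkdir /datastore
# mount /dev/myvolume/mylv1 /datastore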

We can check if this was successful or not by using the command “df -h.”

Now we have to use this Logical Volume as our Data Node storage. So we go to the Data Node's configuration file (hdfs-site.xml) and set the <value> tag of the data directory property to the Logical Volume's mount directory.
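
A minimal sketch of that hdfs-site.xml entry looks like this (the property name is dfs.datanode.data.dir on Hadoop 2.x/3.x; on Hadoop 1.x it is dfs.data.dir):

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/datastore</value>
    </property>
</configuration>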

We then restarted the HDFS service, and now we can see that the Data Node is contributing around 3GiB to the Hadoop cluster.

Now we created a test file inside the "/datastore" directory. The file size is around 555MiB.

Now here comes our use case for LVM. We created a file of around 559MB, but what if the file were bigger and the cluster needed more storage?

With static partitioning, we could add more storage, but we could not increase the existing storage size without taking it offline or formatting it, and we don't want to do that.

With LVM, we can increase the Logical Volume's size easily, without losing any data or shutting down the storage, using the command:

# lvextend --size +<size to increase> /dev/<VG Name>/<LV Name>
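
For example, to add the 5GiB described below to our example Logical Volume:

# lvextend --size +5G /dev/myvolume/mylv1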

We can add as much storage as we want, as long as space is available in our Volume Group. Here we added 5GiB more to our Logical Volume, as we can see in the image below.

As we can see in the image below, our Logical Volume size has increased from 3GiB to 8GiB.

Here is an important part: after increasing the size of the Logical Volume, the filesystem's inode table still only covers the old size. But we cannot afford to lose our data by reformatting, so there is a special command that resizes the filesystem and updates its inode tables without disturbing the previous entries:

# resize2fs /dev/<VG Name>/<LV Name>
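
With the example names used so far:

# resize2fs /dev/myvolume/mylv1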

We can see the volume size before and after running the command in the image below.

And we can see that the size of our "/datastore" directory, which is used by the Data Node, has also increased to 8GiB after resizing.

Even in our Hadoop cluster, we can see that the data node contributes around 8GiB of storage.

If our Logical Volumes fully use the Volume Group's storage capacity, we can attach more storage, create Physical Volumes from it, and then add them to our Volume Group.

In the image below, we attached one more disk of 2GiB, created a Physical Volume from it, and then used the command "vgextend <VG Name> <disk name>"; our Volume Group size increased from 10GiB to 12GiB.
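
For example, if the new 2GiB disk shows up as /dev/xvdh (again, an assumed device name):

# pvcreate /dev/xvdh
# vgextend myvolume /dev/xvdh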

We have even created a Python script so we don't have to type the whole sequence of commands. This script can also be used as a menu program for people who are not comfortable with Linux but want to use Logical Volumes. We can build a better menu by adding more options.
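
The script below is not the exact script, just a minimal sketch of the same idea: a menu-driven wrapper that shells out to the LVM commands shown above (it assumes the lvm2 tools are installed and that it is run as root):

import subprocess

def run(cmd):
    """Print and run a shell command, showing its output."""
    print(f"$ {cmd}")
    subprocess.run(cmd, shell=True, check=False)

menu = """
1. Create Physical Volume
2. Create Volume Group
3. Create Logical Volume
4. Extend Logical Volume (lvextend + resize2fs)
5. Exit
"""

while True:
    print(menu)
    choice = input("Enter your choice: ").strip()
    if choice == "1":
        disk = input("Disk name (e.g. /dev/xvdf): ")
        run(f"pvcreate {disk}")
    elif choice == "2":
        vg = input("Volume Group name: ")
        pvs = input("Physical Volumes (space-separated): ")
        run(f"vgcreate {vg} {pvs}")
    elif choice == "3":
        lv = input("Logical Volume name: ")
        size = input("Size (e.g. 3G): ")
        vg = input("Volume Group name: ")
        run(f"lvcreate --name {lv} --size {size} {vg}")
    elif choice == "4":
        vg = input("Volume Group name: ")
        lv = input("Logical Volume name: ")
        size = input("Size to add (e.g. +5G): ")
        run(f"lvextend --size {size} /dev/{vg}/{lv}")
        run(f"resize2fs /dev/{vg}/{lv}")
    elif choice == "5":
        break
    else:
        print("Invalid choice, please try again.")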

Finally, we have successfully learned how to create a Logical Volume and use it as our Data Node's storage to make it an elastic volume. We also created a Python script to turn the whole process into a menu-driven program.

I hope I have explained everything bit by bit, and if you have any doubts or suggestions, you can comment on this blog or contact me on LinkedIn.

Thank you for reading the blog, and please do suggest some ideas for improvement. Your suggestions will really motivate me.

I blog about ML, Big Data, and Cloud Computing, and I'm always improving to be the best.