Busting a Hadoop Myth!

Does Hadoop use parallelism or serialism to upload split data to the data nodes?

Hello Everyone, I hope you all are doing fine.

In this blog, we will learn how data gets uploaded to the data nodes in a Hadoop cluster. Many posts, blogs, and articles state that Hadoop uses parallelism, while others state that it uses serialism, so let's find out, with proof, how Hadoop actually uploads data to the data nodes. So let’s get started…

NOTE: Please refer to the above link for a clearer understanding of a Hadoop cluster and its formation.

First, we created a Hadoop Cluster with 1 Name Node, 4 Data nodes, and 1 Client.

Web-UI of the Cluster
Data Node in Cluster

We then uploaded a file named “testhadoop.txt” from the Client to see how the data gets stored across all the data nodes.

We can use the command:

# hadoop fs -put <filename>
Client Uploading Data

When we upload data from the Client, the Client contacts the Name Node (NN), the NN provides the details of the Data Nodes, and the Client then sends the data to the Data Nodes, where it gets replicated. So we used the `tcpdump` command on all 4 data nodes to capture the packets (data) coming from the client, to check whether all data nodes receive the packets from the client at the same time (PARALLELISM), or whether the data is given to one data node, which then transfers it to another data node (SERIALISM). Since the data is replicated, let's see whether it is sent to all nodes at once, or one data node takes the client's data and passes it on.
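To make the two hypotheses concrete before looking at the captures, here is a toy sketch with plain directories standing in for the client and the data nodes. This is purely illustrative (the directory names are made up; none of this is HDFS code):

```shell
# Directories stand in for the client and three data nodes (toy model).
mkdir -p client dn_a dn_b dn_c
echo "hello hdfs" > client/testhadoop.txt

# SERIALISM: the client writes to one node; each node forwards to the next.
cp client/testhadoop.txt dn_a/
cp dn_a/testhadoop.txt dn_b/
cp dn_b/testhadoop.txt dn_c/

# PARALLELISM would instead mean the client writes to every node itself:
#   for d in dn_a dn_b dn_c; do cp client/testhadoop.txt "$d"/; done

ls dn_a dn_b dn_c
```

In both cases all three replicas end up in place; the difference `tcpdump` will reveal is *who sends the data to whom*.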

We stored the output of `tcpdump` in a file so we could search it with the `grep` command and see from which IP each node received packets. We captured on port 50010 because that is the port the Data Node listens on for incoming data (the default `dfs.datanode.address` port in Hadoop 2.x).

# tcpdump -i eth0 port 50010 > <filename>
Data Node 1
Data Node 2
Data Node 3
Data Node 4

We left `tcpdump` running on all four data nodes while the client uploaded the data to the cluster. Afterwards, we checked the output of the `tcpdump` command on each data node.

After uploading the file, we can see in the Web-UI (image below) that the data has been replicated to 3 of the 4 IPs; one of the IPs didn’t get the data, because the replication factor in our cluster is 3.

Now, to check the packets received, we used the `cat` command together with `grep` on the Client IP to see whether one data node received the packets or all of them did.

CLIENT IP: 35-154-195-85

Command on Data Node 1
Command on Data Node 2
Command on Data Node 4

As we can see from the Web-UI, only Data Nodes 1, 2, and 4 got the data, so we checked the `tcpdump` output files, and we found that only one Data Node, i.e., Data Node 4, received packets from the Client; the other Data Nodes didn’t get any packets from it. We used the command:

# cat <filename> | grep 35-154-195-85   // searching for the Client IP in
#                                       // the tcpdump output file
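The same check can be reproduced offline on any `tcpdump` text output. Below is a self-contained sketch with simulated capture lines (the hostnames, ports, and timestamps are made up for illustration; real lines look similar: timestamp, source > destination, TCP flags, length):

```shell
# Hypothetical tcpdump output for the node the client talked to.
cat > dn4_dump.txt <<'EOF'
12:00:01.000100 IP ec2-35-154-195-85.ap-south-1.compute.amazonaws.com.49222 > dn4.50010: Flags [P.], length 65536
12:00:01.000300 IP dn4.50010 > ec2-35-154-195-85.ap-south-1.compute.amazonaws.com.49222: Flags [.], ack 65537
EOF
# Hypothetical tcpdump output for a node the client never contacted.
cat > dn1_dump.txt <<'EOF'
12:00:01.002000 IP dn4.33010 > dn1.50010: Flags [P.], length 65536
EOF

# Same check as in the post: count lines mentioning the client IP.
grep -c '35-154-195-85' dn4_dump.txt          # prints 2: client traffic seen
grep -c '35-154-195-85' dn1_dump.txt || true  # prints 0: no client traffic
```

A nonzero count means the client itself sent packets to that node; a zero count means whatever data the node holds arrived from another data node.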

When we checked the other Data Nodes, we didn’t see any data packets from the client. When we tracked the data further, using `grep` on all the `tcpdump` output files to see how the data nodes send data to each other, we found that the data flows in a SERIAL manner:

Client → Data Node 4 → Data Node 2 → Data Node 3
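The chain above can be recovered mechanically by comparing the timestamp and the source of the first inbound packet in each node's capture. A minimal sketch, again with simulated capture lines (hostnames, ports, and timestamps are hypothetical):

```shell
# Simulated first inbound packet on each node's capture file.
cat > dn4.txt <<'EOF'
12:00:01.000100 IP client.49222 > dn4.50010: Flags [P.], length 65536
EOF
cat > dn2.txt <<'EOF'
12:00:01.004000 IP dn4.33010 > dn2.50010: Flags [P.], length 65536
EOF
cat > dn3.txt <<'EOF'
12:00:01.008000 IP dn2.41822 > dn3.50010: Flags [P.], length 65536
EOF

# For each node, print "timestamp node fed-by:sender" for its first packet,
# then sort by timestamp: the sender column reveals the serial chain.
for f in dn4 dn2 dn3; do
  awk -v n="$f" 'NR==1 {sub(/\..*/, "", $3); print $1, n, "fed-by:" $3}' "$f.txt"
done | sort > flow.txt
cat flow.txt
```

Sorted by time, each node's first packet comes from the previous node in the chain rather than from the client, which is exactly the serial pipeline observed in the real captures.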

We concluded that the data is given to only one Data Node by the Client, and that Data Node gives it to another Data Node, and so on, until three replicas are made. So we can say that Hadoop uses SERIALISM instead of parallelism while transferring data to the Data Nodes.

I hope I have explained my points properly and made a conclusion with strong proof. If you have any doubts, feel free to contact me on my LinkedIn.

Thank You for your time. Have a Good Day.

I blog about ML, Big Data, and Cloud Computing, and I’m always improving to be the best.