BIG DATA AND HADOOP
We are living in a world where technology has become one of everyone's basic needs. Everywhere we go and everything we do is linked with technology. Nowadays social media and the internet have become part of our daily lives. From children to elderly people, everyone is using social media in their own way, and this usage generates a lot of data every day.
Data has become one of the most valuable assets of Multi National Corporations (MNCs). Companies with large amounts of user data lead the market: they use the data to improve the user experience, drive research, and much more.
Some questions may come to mind: "How much data is being generated?", "How is all this data being stored and circulated?", "How are MNCs managing their data storage?". I will try to answer these questions to the best of my understanding and information.
What is Big Data?
Big Data is not, on its own, the true problem. The problem lies in how the data is analyzed, managed, and used. As I said above, a lot of data is produced on a daily basis. Data that is so large, fast, or complex that it is difficult or impossible to process using traditional methods is known as Big Data. This data can be Facebook videos, WhatsApp messages, Google searches, Instagram images, and many more types.
Statistics and Example of Big Data
Below are statistics from big tech companies on the data they collect on a per-day, per-minute, or per-second basis.
- Facebook: 500 Terabytes Per Day
According to a TechCrunch report from 2012, Facebook was generating 2.5 billion pieces of content and more than 500 terabytes of data per day, made up of all the likes, comments, photos, and videos. Today there are around 2 billion monthly active users and 1.5 billion daily active users, all contributing to data production.
- Twitter: 12 Terabytes Per Day
Even though Twitter limits the length of posts and messages, its users still contribute more than 12 terabytes of data per day, which adds up to roughly 4,380 TB per year.
- Google: 40,000 Google Web Searches Per Second
Furthermore, most of this data and these searches come from mobile devices. It is easy to predict that the increasing demand for mobile devices will further increase the volume of data.
- Airline Sector
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes. So we can see that not only social media but every area where software is used or implemented produces a lot of data.
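The back-of-the-envelope arithmetic behind that petabyte figure is easy to sketch. The flight count, engine count, and flight length below are illustrative assumptions, not figures from any airline:

```python
# Rough estimate of daily jet-engine data, using the 10 TB per 30 minutes
# figure above. Flight count, engines, and flight length are assumptions.
tb_per_engine_half_hour = 10      # from the figure above
engines_per_flight = 2            # assumption: typical twin-engine jet
half_hours_per_flight = 4         # assumption: ~2-hour average flight
flights_per_day = 5_000           # assumption, for illustration only

tb_per_day = (tb_per_engine_half_hour * engines_per_flight
              * half_hours_per_flight * flights_per_day)
print(f"{tb_per_day:,} TB/day = {tb_per_day / 1_000:.0f} PB/day")
```

Even these conservative numbers land in the hundreds of petabytes per day, which is why engine telemetry is a textbook Big Data source.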
Problems with Big Data
Among the several commonly cited subproblems of Big Data, the two most fundamental are:
- Volume: It is difficult to store such a bulk of data. We could build high-capacity hardware storage, but that would not be an efficient idea, because it adds cost and creates more problems when we work with the data stored on it.
- Velocity: As we know, when we start working on a file, the file is loaded from the storage unit (hard drive) into the computing unit (RAM/CPU). When working with big data, this causes an I/O (input/output) bottleneck.
How to Solve These Problems
There are many ways to solve these problems, but the most efficient and widely used solution is Distributed Storage.
What is Distributed Storage?
A distributed storage system is an infrastructure that splits data across multiple physical servers (commodity hardware), and often across more than one data center, connected over a network. It takes the form of a large cluster of storage units, in which one main node acts as the master while the other nodes act as workers that contribute their resources; the cluster is formed through coordination among all the nodes.
How will Distributed Storage solve the big data problem?
- Storage: In distributed storage we can split our data into smaller parts and distribute them among multiple worker nodes.
- Speed: When we combine all the nodes into a single cluster, we can use the computing power of all the devices as if they were one, which lets us work with the data at a faster speed.
- Cost: Instead of buying costly specialized hardware, we use commodity hardware, so it is also cost-efficient.
- Scalability: When we have more data than storage capacity, we can always add more systems to the cluster, which adds more space.
- Accessibility: Since all the systems are linked within a network, the data is easy to access from the master system, and access can be controlled centrally.
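The storage idea above can be sketched in a few lines of plain Python: split the data into fixed-size blocks and assign the blocks round-robin to worker nodes. The node names and tiny block size are illustrative only; this is not any real system's API:

```python
# Minimal sketch of distributed storage: split data into fixed-size blocks
# and assign each block to a worker node round-robin.
BLOCK_SIZE = 4  # tiny for the demo; real systems use much larger blocks

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, nodes):
    """Assign block i to node i % len(nodes); return node -> blocks map."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

data = b"hello distributed storage!"
nodes = ["node-1", "node-2", "node-3"]  # hypothetical worker nodes
placement = distribute(split_into_blocks(data), nodes)
for node, blocks in placement.items():
    print(node, blocks)
```

No single node holds the whole file, yet joining the blocks back in order reconstructs it exactly, which is the essence of the Storage and Scalability points above.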
How to implement Distributed Storage?
Distributed Storage is a concept, and to implement it we need software. There are many software packages that help implement and design distributed storage; some of them are AWS S3, Ceph, and GlusterFS. Here, we will be talking about Hadoop.
What is Hadoop?
Hadoop is a product of the Apache Software Foundation. It is open-source software, which means its source code is released to the public. Hadoop is a collection of software utilities that help solve problems involving massive data and computation using a network of many computers.
Tech companies like Facebook, Cloudera, IBM, AWS, and many more use Hadoop. Hadoop manages data through clusters of servers on a network that together provide a single storage layer. Hadoop's approach of moving computation to where the data lives on the cluster enables faster data processing.
Hadoop helps create a cluster in which every system is a node. One main system is called the Master Node (NameNode) and the remaining connected systems are the Slave Nodes (DataNodes). All systems are connected in a network and designed in such a way that when the master node receives data, it divides the data into blocks and distributes them among the data nodes.
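Once the blocks are spread across the data nodes, Hadoop's processing layer, MapReduce, runs the computation where the data lives. Here is a toy, single-machine sketch of the map, shuffle, and reduce phases in ordinary Python; it illustrates the pattern only and uses none of the real Hadoop API:

```python
# Toy word count in the MapReduce style that Hadoop popularized.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data and hadoop", "hadoop handles big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # prints each word with its total count
```

In a real cluster, each data node would run the map phase over its own blocks in parallel, and only the small intermediate (word, count) pairs travel over the network, which is what makes the approach fast at scale.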
Finally, this is an overview of 'What is Big Data?', 'Who uses it?', and 'How is it formed, managed, and used?'. I will be updating this blog with more information on big data in the coming days, so please stay connected.
Thank you for reading the blog, and please do suggest some ideas for improvement. Your suggestions will really motivate me.