Big Data: The Big Problem

Nishant Ulhare
7 min read · Sep 18, 2020

Data is the new gold. Technological advancements and the ever-increasing amount of data are transforming the way businesses operate across industries. We use Google, Facebook, Amazon, and many other services on a daily basis, but these companies process data in terabytes every single day, and processing terabytes and petabytes of data is not a joke 😱. So how do they do it?

We know that there are lots of companies, like Dell EMC, Hitachi, and many more, that provide storage infrastructure to giant companies. But does a petabyte hard disk exist 🤔? No, and even if we could use one, pushing that much data through a single disk would make I/O painfully slow.


So if such storage does not exist, where and how do they store the data?

This is where Big Data technologies come into the picture. Big Data is not a technology; it is actually a problem. Yes, you read that right: processing and analyzing terabytes or petabytes of data is not easy. So the first question that comes to mind is: what is Big Data?

So let’s first see: what is data?

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Now let’s discuss: what is Big Data?

Big Data is also data, but of a huge size. Big Data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.

In August 2012, Facebook publicly announced some statistics on the amount of data its systems process and store. According to Facebook, its data systems process 2.5 billion pieces of content each day, amounting to 500+ terabytes of data daily. Facebook generates 2.7 billion Like actions per day, and 300 million new photos are uploaded daily.

Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.

The place where Google stores and handles all its data is a data center. Google doesn’t have the biggest data centers, but it still handles a huge amount of data. A data center normally holds petabytes to exabytes of data.

Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. In September 2007, the average MapReduce job ran across approximately 400 machines, and the jobs together consumed approximately 11,000 machine-years in that single month.
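To get a feel for what a single MapReduce job does, here is a minimal word-count sketch in plain Python. It only imitates the map, shuffle, and reduce phases that Hadoop or Google’s infrastructure would run in parallel across hundreds of machines; the function names here are illustrative, not a real MapReduce API.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted counts by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is a big problem", "data is the new gold"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# {'big': 2, 'data': 2, 'is': 2, 'a': 1, 'problem': 1, 'the': 1, 'new': 1, 'gold': 1}
```

In a real cluster, the map calls run on the machines that already hold the input data, and the shuffle moves results over the network, which is exactly why thousands of cheap machines can chew through petabytes.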

Where do they use this data?

1. Google Ads

Google is also very interested in collecting users’ data, like photos, to improve its ad delivery system.

2. “I Voted” Sticker

Facebook successfully tied political activity to user engagement when it ran a social experiment: a sticker allowing users to declare “I Voted” on their profiles. The experiment ran during the 2010 midterm elections and seemed effective. Users who noticed the button were more likely to vote and to be vocal about voting once they saw their friends participating. Of the 61 million users in the experiment, 20% of those who saw their friends voting also clicked the sticker. Facebook’s data science unit claimed that the stickers directly motivated close to 60,000 voters, and that social contagion prompted another 280,000 connected users to vote, for a total of 340,000 additional voters in the midterm elections. For the 2016 elections, Facebook expanded its involvement in the voting process with reminders and directions to users’ polling places.

3. Celebrate Pride

Following the Supreme Court’s judgment establishing same-sex marriage as a constitutional right, Facebook turned into a rainbow-drenched spectacle called “Celebrate Pride,” a way of showing support for marriage equality. Facebook provided an easy, simple way to transform profile pictures into rainbow-colored ones. Celebrations like these hadn’t been seen since 2013, when 3 million people updated their profile pictures to the red equals sign (the logo of the Human Rights Campaign). All this excitement also raised questions about what kind of research Facebook was conducting, given its history of tracking user moods and citing behavioral research. In a paper the company later published, The Diffusion of Support in an Online Social Movement, two Facebook data scientists analyzed the factors that predicted support for marriage equality on the platform.

How do they analyze this data?

They use Big Data technologies like Hadoop, Spark, Kafka, Cassandra, and many more.

Broadly, Big Data technologies split into two categories:

1. Operational Big Data Technologies

These deal with the data generated on a daily basis, such as online transactions, social media activity, or any sort of data from a specific firm, which is then analyzed by software based on Big Data technologies. This data acts as the raw input fed to Analytical Big Data technologies.

A few cases that illustrate Operational Big Data include executives’ particulars in an MNC, online trading and purchasing on Amazon, Flipkart, Walmart, etc., and online ticket booking for movies, flights, railways, and more.

2. Analytical Big Data Technologies

These refer to the more advanced adaptation of Big Data technologies, a bit more complicated than Operational Big Data. The real investigation of massive data that is crucial for business decisions falls under this category. Some examples in this domain are stock market analysis, weather forecasting, time-series analysis, and medical health records.

Facebook -

Facebook relies heavily on one technology: a massive installation of Hadoop, a highly scalable open-source framework that uses bundles of low-cost servers to solve problems. The company even designs its own in-house hardware for this purpose. Facebook’s head of analytics, Mr. Rudin, says, “The analytic process at Facebook begins with a 300 petabyte data analysis warehouse. To answer a specific query, data is often pulled out of the warehouse and placed into a table so that it can be studied. The team also built a search engine that indexes data in the warehouse. These are just some of the many technologies that Facebook uses to manage and analyze information.”

Google-

Google processes its data on standard cluster nodes, each consisting of two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives, and a gigabit Ethernet link.

These companies rely on distributed file systems to manage and analyze data; the open-source standard is the Hadoop Distributed File System (HDFS). (Facebook runs HDFS directly; Google uses its own Google File System, on which HDFS was modeled.)

The Hadoop Distributed File System is a distributed file system for storing very large data files, running on clusters of commodity hardware. It is fault-tolerant, scalable, and extremely simple to expand, and it comes bundled with Hadoop.

When data exceeds the storage capacity of a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage-specific operations across a network of machines is called a distributed file system, and HDFS is one such system.
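As a concrete taste of working with HDFS, the sketch below drives the standard `hdfs dfs` shell commands from Python. It assumes Hadoop is installed and a cluster is running; the paths and file names are made up for illustration.

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` shell command and return its output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo")             # create a directory in HDFS
hdfs("-put", "local_data.csv", "/user/demo")   # upload a local file
print(hdfs("-ls", "/user/demo"))               # list the directory contents
```

To the user this looks like an ordinary file system; behind the scenes the file has been split into blocks and scattered across the cluster, as described next.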

HDFS CLUSTER

An HDFS cluster primarily consists of a Name Node, which manages the file system, and Data Nodes, which store the actual data.

Name Node-

The Name Node can be considered the master of the system. It maintains the file system tree and the metadata for all the files and directories present in the system. The Name Node knows which Data Nodes hold the data blocks of a given file; however, it does not store block locations persistently. This information is reconstructed from the Data Nodes every time the system starts.
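The split between persistent and reconstructed metadata is easy to picture with a toy model (illustrative Python only, not Hadoop’s real data structures):

```python
# Persistent metadata: the file system tree, i.e. which blocks make up a file.
file_to_blocks = {"/user/demo/data.csv": ["blk_1", "blk_2", "blk_3"]}

# Non-persistent metadata: block locations, rebuilt from Data Node
# "block reports" every time the cluster starts.
block_locations = {}

def receive_block_report(datanode, blocks):
    """A Data Node reports which blocks it holds at startup."""
    for blk in blocks:
        block_locations.setdefault(blk, []).append(datanode)

receive_block_report("datanode-1", ["blk_1", "blk_2"])
receive_block_report("datanode-2", ["blk_2", "blk_3"])
receive_block_report("datanode-3", ["blk_1", "blk_3"])

# To read a file, a client asks the Name Node where each block lives.
for blk in file_to_blocks["/user/demo/data.csv"]:
    print(blk, "->", block_locations[blk])
```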

Data Nodes-

Data Nodes are the slaves (workers) that reside on each machine in the cluster and provide the actual storage. They are responsible for serving read and write requests from clients.

How does it work?

HDFS operates on the concept of data replication: multiple replicas of each data block are created and distributed across nodes throughout the cluster, so that data remains highly available even in the event of node failure.

For example, imagine we have a Name Node and 10 Data Nodes. If 100 GB of data comes in, it is split into 10 blocks of 10 GB each, and one block is sent to each Data Node. (Real HDFS actually splits files into fixed-size blocks, 128 MB by default, rather than dividing by the number of nodes, but the idea is the same.)
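Here is that splitting-plus-replication idea as a small Python sketch. The round-robin placement is a simplification assumed for illustration; real HDFS placement is rack-aware and more sophisticated.

```python
def place_blocks(file_size_gb, block_size_gb, datanodes, replicas=3):
    """Split a file into blocks and assign each block (plus its
    replicas) to data nodes round-robin."""
    n_blocks = -(-file_size_gb // block_size_gb)  # ceiling division
    placement = {}
    for i in range(n_blocks):
        placement[f"blk_{i}"] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replicas)
        ]
    return placement

nodes = [f"datanode-{n}" for n in range(1, 11)]
for block, holders in place_blocks(100, 10, nodes).items():
    print(block, "->", holders)
# blk_0 -> ['datanode-1', 'datanode-2', 'datanode-3'], and so on;
# losing any one node still leaves two copies of every block it held.
```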

This topology also reduces the time it takes to store data.

For example, if storing 100 GB of data on one single hard disk takes 10 minutes, with this topology we can store the same 100 GB in about one minute, because the 10 Data Nodes each write their 10 GB block in parallel.
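The back-of-the-envelope arithmetic, assuming every disk writes at the same speed:

```python
# Parallel writes divide the time by the number of nodes (idealized:
# ignores network overhead and assumes equal disk speeds).
size_gb, single_disk_minutes, nodes = 100, 10, 10
print(f"{size_gb} GB: {single_disk_minutes} min on one disk, "
      f"~{single_disk_minutes / nodes:.0f} min across {nodes} nodes")
# 100 GB: 10 min on one disk, ~1 min across 10 nodes
```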

Use of Big Data analytics in Agriculture

Problems-

- Farmers have limited knowledge of current technological advancements in agriculture, which limits production.

- Consultants who can guide farmers are scarce, especially in remote areas.

- Agricultural products are tried only in test environments, constraining their usability.

Solution-

- The agricultural Big Data collected from such interactions, covering crops, weather, terrain, geographic conditions, water, and much more, is stored and processed. This leads to the analytics part of the solution: by processing this data, the system can assist in several ways.

- Consultants will learn more about the most affected geographic areas, where farmers could be assisted to deliver higher productivity.

- Predictive analytics: predicting the success of a product or crop, or the ill effects of a natural event on crops.

- Historical data analysis: processing vast volumes of historical data regarding crops, geography, and more.

- Real-time analytics: providing farmers with real-time assistance by analyzing the information they supply in real time.

Benefits-

- Increased agricultural productivity through agriculture data.

- Greater success of fertilizing products across a variety of geographic conditions for agriculture companies.

- Avoidance of the ill effects of particular natural occurrences.

!! Thanks for Reading !!
