What Is Big Data Technology? Definition and Tutorial
Introduction To Big Data
In this section, let us try to understand what big data is. Big data refers to a volume of data so huge that it cannot be stored and processed using the traditional approach within a given time frame. The next big question that comes to mind is: how huge does this data need to be? There is a lot of misconception around what qualifies as big data.
We usually use the term big data to refer to data that is Gigabytes, Terabytes, Petabytes, Exabytes, or anything larger in size. But this does not define the term completely. Even a small amount of data can be referred to as big data depending on the context in which it is used. Let us explain with an instance: if we try to attach a document that is 100 megabytes in size to an email, we would not be able to do so, as the email system would not support an attachment of this size.
Therefore, in the context of email, this 100-megabyte attachment can be referred to as big data. Let us take another example. Say we have around 10 terabytes of image files upon which certain processing needs to be done; for instance, we may want to resize and enhance these images within a given time frame.
If we used a traditional system to perform this task, we would not be able to accomplish it within the given time frame, as the computing resources of the traditional system would not be sufficient to finish on time. Therefore, these 10 terabytes of image files can be referred to as big data. Now, let us try to understand big data using some real-world examples. You are probably aware of popular social networking websites such as Facebook, Twitter, LinkedIn, Google+ and YouTube.
Each of these websites receives a huge volume of data daily. It has been reported on some popular tech blogs that Facebook alone receives around 100 terabytes of data each day, whereas Twitter processes around 400 million tweets each day. As far as LinkedIn and Google+ are concerned, each of these sites receives tens of terabytes of data on a daily basis. Finally, coming to YouTube, it has been reported that around 48 hours of video are uploaded to YouTube every minute.
You can imagine how much data is being stored and processed on these websites. But as the number of users keeps growing, storing and processing this data becomes a challenging task. Since this data holds a lot of valuable information, it needs to be processed within a short period; by using this valuable information, companies can boost their sales and generate more revenue. With a traditional computing system, however, this would not be feasible.
We would not be able to accomplish this task within the given time frame, as the computing resources of a traditional system would not be sufficient for processing and storing such a huge volume of data. This is where Hadoop comes into the picture; we will discuss Hadoop in more detail in a later section. Therefore, we can term this huge volume of data big data. Let us take another real-world example, this time from the airline industry. Aircraft, while they are flying, keep transmitting data to the air traffic control located at the airports.
Air traffic control uses this data to track and monitor the status and progress of each flight on a real-time basis. Since multiple aircraft transmit this data simultaneously, a high volume of data accumulates at air traffic control within a short period. Therefore, it becomes a challenging task to manage and process this huge volume of data using the traditional approach, and hence we can term it big data. We hope you now understand the basics of what big data is.
How Is Big Data Classified?
In this section, let us try to understand the classification of big data. Big data can be classified into three different categories. The first is structured data: data that has a proper format associated with it. For example, the data present within databases, CSV files, and Excel spreadsheets can be referred to as structured data.
The next is semi-structured data: data that does not have a strict, predefined format, but still carries some internal structure, such as tags or markers. For example, the data present within emails, log files, and word documents can be referred to as semi-structured data. The last is unstructured data: data that does not have any format associated with it. For example, image files, audio files, and video files can be referred to as unstructured data. This is how big data can be classified.
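As a rough illustration of the three categories above, here is a minimal Python sketch; the file names and the extension-to-category mapping are our own illustrative assumptions, not an official taxonomy:

```python
# A minimal sketch (hypothetical file names and mapping) of the three
# big data categories, keyed on file extension.

STRUCTURED = {"csv", "xlsx", "db"}        # fixed schema: databases, CSV, spreadsheets
SEMI_STRUCTURED = {"eml", "log", "xml"}   # some internal markers, no rigid schema
UNSTRUCTURED = {"jpg", "mp3", "mp4"}      # image, audio, and video files

def classify(filename: str) -> str:
    """Return the big data category for a file, based on its extension."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext in STRUCTURED:
        return "structured"
    if ext in SEMI_STRUCTURED:
        return "semi-structured"
    if ext in UNSTRUCTURED:
        return "unstructured"
    return "unknown"

print(classify("sales.csv"))      # structured
print(classify("server.log"))     # semi-structured
print(classify("holiday.jpg"))    # unstructured
```

In practice the boundary is about how much schema the data carries, not the file extension; the extension is just a convenient stand-in for this sketch.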
Characteristics Of Big Data
In this section, let us learn some of the important characteristics of big data. Big data is characterized by three important properties. The first is volume: the amount of data that is getting generated. The next is velocity: the speed at which this data is getting generated. And the last is variety: the different types of data that are getting generated. These are the three important characteristics of big data.
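As a concrete, made-up illustration, the three Vs can be measured over a small batch of events; the event records and field layout below are our own assumptions for the sketch:

```python
# A minimal sketch computing the three Vs over a made-up batch of events.
# Each event is (timestamp_in_seconds, payload_bytes, data_type).

events = [
    (0.0, 120, "tweet"),
    (0.5, 2048, "image"),
    (1.0, 120, "tweet"),
    (1.5, 500_000, "video"),
    (2.0, 300, "log"),
]

volume_bytes = sum(size for _, size, _ in events)   # Volume: how much data arrived
duration = events[-1][0] - events[0][0]
velocity = len(events) / duration                   # Velocity: events per second
variety = len({kind for _, _, kind in events})      # Variety: distinct data types

print(volume_bytes, velocity, variety)              # 502588 2.5 4
```

Real pipelines measure the same three quantities, just at a vastly larger scale and continuously rather than over a fixed batch.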
Big Data Challenges
Like many of you, we enjoy our privacy on the internet, so the constant advances over the last few years in the ability to collect massive amounts of personal data about us have not sat well with us. But is the industry's favourite buzzword right now, Big Data, all bad? Or are we painting an otherwise potentially very useful tool for the advancement of society in a bad light because of one aspect of it?
First, let's discuss what big data is. There are many wildly different definitions: while we as individuals might define it as an intrusive threat to our society and the overall anonymity of the internet, corporations might see it as a revolution in data processing and computing.
But there are at least a few things everyone can agree on. Wikipedia defines big data as any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. That doesn't sound so threatening, right? So essentially, big data is just a collection of data so large that each individual becomes an insignificant number within it; whoever holds the data often cannot actually draw meaningful conclusions about any one person from it. And many times, the term big data is misused.
It has turned into a marketing buzzword, with software and IT companies claiming that their systems can handle the challenges big data presents, which is a great way to say absolutely nothing about the product while still managing to seem important or relevant in the current market. On top of that, as much as many of you might not like to hear it, privacy advocates tend to misuse the term as well: they act as though big data only describes user data compiled by websites and services, when big data encompasses many more types of data, such as scientific and business data sets.
For example, meteorological data from weather stations, and market data from financial exchanges all around the world. Now, before you burn us at the stake as secret NSA operatives for defending the idea of big data, hear us out. Where big data is an issue: websites that collect and store data about their users for marketing purposes can keep a massive amount of your personal information on hand for an indefinite period, and most of the time that data is not as well secured as they, or we, would like to believe. In addition, even deleting this information from public view often has no impact on their ability to store it on their servers.
So yes, one embarrassing picture of you from 2008 may live on in the depths of Facebook's servers forever. This, coupled with the massive amount of information sharing, voluntary or otherwise, and the data interception that agencies like the NSA carry out, means that your regional government, or really any other nefarious character, could potentially have every move you have made online on file. Now, we personally don't do anything illegal, or even questionably legal, on the internet.
But if someone were to go through every bit of information from the years we have been using it and cherry-pick things to paint us in a negative light, we are sure we could end up looking like a pretty bad dude. So let's make a quick comparison to solidify these ideas: torrenting can have an extremely positive impact on the world and is one of our greatest forms of information sharing, yet whenever we hear anything about torrenting, the topic revolves around piracy and its other unsightly aspects. The discussion around big data is quite similar.
It's not in itself a bad thing; it's a driving force behind a lot of great innovations in science, technology, and medicine. And yet, for the most part, we only hear about the negatives, the main one being the lack of control we have over our online data and the amount of control that large corporate entities do have over it. So don't blame big data for creating the need to protect your privacy online; blame the companies and, most importantly, those who use it inappropriately.
How Is Big Data Stored And Processed?
In this section, let us try to understand the traditional approach to storing and processing big data. In the traditional approach, data generated by organizations, by financial institutions such as banks or stock markets, or by hospitals, is given as input to an ETL (Extract, Transform, Load) system. The ETL system would then extract and transform this data.
That is, it would convert the data into a proper format and finally load it into a database. The end-users can then generate reports and perform analytics by querying this data. But as the data grows, it becomes very challenging to manage and process it this way; this is a fundamental limitation of the traditional approach. Now, let us look at some of its major drawbacks.
The first major drawback is that it is an expensive system: it requires a lot of investment to implement or upgrade, which puts it out of reach of small and mid-sized companies. The second drawback is scalability: as the data grows, expanding the system is a challenging task. The third major drawback is that it is time-consuming: it takes a lot of time to process the data and extract valuable information from it. We hope you now understand the traditional approach to storing and processing big data and its associated drawbacks.
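The extract-transform-load flow described above can be sketched in a few lines of Python; the record layout, the cleaning step, and the table name below are our own illustrative assumptions:

```python
import sqlite3

# A minimal sketch of the traditional ETL pipeline: extract raw records,
# transform them into a proper format, and load them into a database.

def extract():
    # Hypothetical raw feed from a bank: messy "account,amount" strings.
    return ["acc1, 100.50 ", "acc2, 200.00 ", "acc1, 50.25 "]

def transform(raw_rows):
    # Convert each raw string into a clean (account, amount) tuple.
    out = []
    for row in raw_rows:
        account, amount = row.split(",")
        out.append((account.strip(), float(amount)))
    return out

def load(rows, conn):
    # Load the cleaned rows into the target database table.
    conn.execute("CREATE TABLE IF NOT EXISTS txns (account TEXT, amount REAL)")
    conn.executemany("INSERT INTO txns VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)

# End-users can now generate reports by querying the database.
total = conn.execute("SELECT SUM(amount) FROM txns").fetchone()[0]
print(total)  # 350.75
```

The drawbacks discussed above show up exactly here: every record must pass through this single pipeline and land in one database, so cost, scalability, and processing time all grow with the data.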
Big Data: In-Depth Explanation
Big data starts with a progression, and that is an important place to start. Historically, data was generated and accumulated by workers; in other words, employees of companies were entering data into computer systems. Then things evolved to the internet, and users could generate their own data. Think about websites like Facebook: all these users sign up and enter the data themselves.
That is larger than the first stage by an order of magnitude. Now that we are talking about scale: it scaled up from employees entering data to users entering their own data, so the amount of data being accumulated was suddenly far higher than it was historically. And now there is a third level in this progression, because machines are accumulating data. The buildings in all of our cities are full of monitors tracking humidity, temperature, and electricity usage.
There are smart meters on our homes measuring the amount of energy our homes are consuming. There are satellites orbiting the earth, monitoring it 24 hours a day, taking pictures and accumulating data. Once machines are accumulating data, that is orders of magnitude more than users. So there is a progression from employees generating data, to users generating data, to machines generating data, and we are now at the machine stage. A colossal amount of data is being generated.
How does that change things? Back in the good old days, people used relational databases to process data, and we don't need to worry here about exactly what those are. But essentially, there has been a major shift. In the old days, we would take the data and bring it to the processor, the CPU, the computer chip, to process it. But now there is so much data that it overwhelms a single CPU; it cannot do the processing, because there is too much data.
So now, what people are doing is bringing multiple processors to the data. In other words, you might have a whole row of servers, each holding a small portion of the whole data set, and you put a processor on each one individually. This is called parallel processing. The data is processed in many different places in parallel, at the same time. So whereas before the data was brought to the processor,
now the processor is brought to the data. And why does that scale better? In the first case, you bring all the data to one CPU; now you can bring as many CPUs as you need, one per server. Parallel processing scales far better. So the data has grown orders of magnitude larger, and we now have a way to process it that scales as well.
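The "bring the processor to the data" idea can be sketched on a single machine; the shard contents below are our own example, and a thread pool stands in for the row of servers:

```python
from concurrent.futures import ThreadPoolExecutor

# A minimal sketch of data-local parallel processing. Each "server" holds
# its own shard of the data set; a worker counts words right next to its
# shard, and only the small partial counts travel back to be merged.
# (A thread pool stands in for the row of servers in this sketch.)

shards = [                      # hypothetical shards, one per server
    "big data needs big tools",
    "data moves less than code",
    "code goes to the data",
]

def count_words(shard: str) -> int:
    # Runs against one shard: only the count, not the data, is returned.
    return len(shard.split())

with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partial_counts = list(pool.map(count_words, shards))  # shards processed in parallel

total = sum(partial_counts)     # cheap final merge of the partial results
print(partial_counts, total)    # [5, 5, 5] 15
```

The design point is that the expensive part, scanning the data, happens where the data lives, while only tiny partial results cross the network to be combined.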
So, that is the technological shift. Now let's talk about some of the technologies that allow this to happen; there are two we want to mention here. The first is called Hadoop. Hadoop is an open-source platform: open-source means it is developed under a general public license, by developers all across the world, and it is free to use.
Other open-source examples include Linux; some of the website-building platforms, what they call content management systems, such as WordPress, Drupal, and Joomla; and Apache, which is server software. That said, the fact that it is free is a little bit of an illusion, because you need experts who understand how to use it, how to implement it, and how to customize it.
That expertise applies to your specific usage, but the basic infrastructure of Hadoop is open source, which means it is free. Hadoop organizes this parallel processing; it is the software that allows it to happen. The second technology is called MapReduce. MapReduce is a way of putting a summary, basically a Cliff Notes version, on each server of what data that server contains. It is a summary, a table of contents, for each one. And all those tables of contents go onto one central server, which essentially provides the search function.
So the search function can determine that the answer we are looking for is on a particular server. That is done through MapReduce and Hadoop working together; those are the technologies driving big data today. And who is at the cutting edge? Google. Google is at the cutting edge of so many things. Now think about how much data is being accumulated by Google, and not just in its search capacity.
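The per-server "table of contents" idea maps onto the classic MapReduce word-count pattern. Here is a minimal single-machine sketch; the shard contents are our own example, and real Hadoop jobs run the same two phases distributed across servers:

```python
from collections import Counter
from itertools import chain

# A minimal single-machine sketch of the MapReduce pattern. Each shard's
# map phase emits (word, 1) pairs; the reduce phase merges them all into
# one summary, like the central "table of contents" described above.

shards = [
    "hadoop stores the data",
    "mapreduce summarises the data",
]

def map_phase(shard):
    # Map: emit a (word, 1) pair for every word in this server's shard.
    return [(word, 1) for word in shard.split()]

def reduce_phase(pairs):
    # Reduce: merge all pairs into a single word -> count summary.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

mapped = [map_phase(s) for s in shards]             # one map output per server
summary = reduce_phase(chain.from_iterable(mapped)) # merged on a central node
print(summary["the"], summary["data"])              # 2 2
```

The map outputs are small relative to the shards they summarise, which is why shipping them to one central node is cheap even when the underlying data set is enormous.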