What is Hadoop?: Apache Hadoop Tutorials

Hadoop is an open-source Apache framework developed entirely in Java.

[Image: introduction-to-hadoop.png]

Hadoop is an Apache software framework that implements a distributed programming model to process large amounts of data (say, terabytes) across a large cluster of nodes.

Hadoop is popular for OLAP-style data warehouse processing.

Hadoop advantages:

1. An open-source framework built on Java.
2. It processes large chunks of data in parallel in a short amount of time.
3. Because the data is distributed across multiple nodes, we can easily add nodes as the data grows over time, and parallel processing is supported without much overhead.
4. Highly scalable and highly fault-tolerant, even with many thousands of nodes.

On top of Hadoop, we have the following other frameworks:

1. MapReduce
2. Hive
3. Pig
4. HBase
5. Sqoop
6. Flume

Hadoop analyzes and processes large amounts of data, i.e., petabytes of data, in parallel and in less time, in a distributed environment. Hadoop is not a single tool; it is a combination of different sub-frameworks such as Hadoop Common, MapReduce, HDFS, Pig, and HBase. Hadoop is mostly used for OLAP-style batch analytics rather than OLTP transactions; big companies like Facebook use it for large-scale analytics. Hadoop can be set up in a clustered environment as well as a single-node environment.

HDFS was developed based on the Google File System (GFS).

MapReduce was developed based on Google's MapReduce concept.

The Pig framework provides a high-level scripting language (Pig Latin) that is compiled into map-reduce jobs.

Basics of Hadoop HDFS

In Hadoop, data is stored in the Hadoop Distributed File System (HDFS) in the form of blocks (chunks of a fixed size), and each block is replicated on different nodes (machines) in a clustered (multi-node) environment. In real-world scenarios, petabytes of data are stored across thousands of nodes in a distributed environment. The advantage of storing replicated data is that it remains available even when one of the nodes is down, so client applications can access the data at all times. Do we need high-end hardware for all these nodes? The answer is no; commodity hardware works fine for these nodes.
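As a rough illustration, here is a minimal Java sketch that writes a file into HDFS using Hadoop's FileSystem API. The NameNode address and file paths are assumptions for this example; in a real cluster they come from the cluster configuration.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/data/sample.txt")), StandardCharsets.UTF_8))) {
            writer.write("Hello HDFS");
        }
        // Behind the scenes, HDFS splits the file into blocks and replicates
        // each block to multiple DataNodes, so the data survives node failures.
    }
}
```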

If the data grows rapidly, we can add nodes without taking down the whole system or losing data; in networking terminology, this is what we call a scalable system. The system also guards against data loss while new machines are being added to the cluster and after they have joined it.

As you know, a cluster has many nodes; if one node fails, Hadoop handles the failure without losing data and continues serving the required work as expected.

HDFS stores data in files, and these files sit on top of the underlying operating system's file system.

HDFS is suitable for storing large amounts of data, on the order of terabytes to petabytes, which is then processed with MapReduce for batch (OLAP-style) workloads.

Data storage in HDFS

In HDFS, data is split into units called blocks, and each block is replicated on different nodes (machines). The number of nodes on which each block is replicated (the replication factor) is configured in the Hadoop system, as shown below.
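For illustration, the replication factor is typically set with the dfs.replication property in hdfs-site.xml; the values below (3 replicas, 128 MB blocks) are common defaults, not a recommendation for any particular cluster.

```xml
<!-- hdfs-site.xml: example of configuring block replication and block size -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block is stored on 3 different DataNodes -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB block size -->
  </property>
</configuration>
```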

Basics of Hadoop Map Reduce

In a Hadoop system, petabytes of data are distributed across thousands of nodes in a clustered environment. To process the data stored in HDFS, we need an application.

MapReduce is a Java framework for writing programs that are executed in parallel to process large amounts of data in a clustered environment.

Hadoop provides MapReduce APIs for writing map-reduce programs. We make use of those APIs and plug our own data analysis logic into the code.

This map-reduce code fetches and processes the data in the distributed environment.

To process the data stored in HDFS, we need to write programs in a language such as Java or Python.

Map-reduce has two different tasks:

1. Mapper
2. Reducer

The map phase divides the input data and processes it in a set of parallel tasks; the output of these map tasks is handed to the reducer. The reduce phase then processes and combines this intermediate data to produce the final output.
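As an illustrative sketch (not any particular production job), here is the classic word-count program written against the Hadoop MapReduce Java API; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each input line into words and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts produced by the mappers for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The mapper emits (word, 1) pairs, the combiner/reducer sums them per word, and the framework takes care of distributing the tasks across the cluster.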

Basics of Hadoop Hive

Hive is an Apache framework that acts as an SQL wrapper over map-reduce programs. Hive provides an SQL-like language (HiveQL) that is translated into jobs the Hadoop system understands.

Most of the time, data analysis is done by database developers who are not familiar with Java programming; in that case, the Hive tool is useful.

The database developer writes Hive queries in the Hive tool. These queries invoke the underlying map-reduce jobs, which process the data, and the results are finally returned to the Hive tool.
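For example, a database developer might write a HiveQL query like the one below; the sales table and its columns are hypothetical, and Hive compiles the query into map-reduce jobs behind the scenes.

```sql
-- Hypothetical table; Hive translates this aggregation into map-reduce jobs.
SELECT store_id, SUM(amount) AS total_sales
FROM sales
GROUP BY store_id
ORDER BY total_sales DESC;
```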

The advantage of Hive is that no Java programming is needed to query data in the Hadoop environment.

In some cases, if SQL queries are not performing well, or some database feature (for example, a particular SQL function) is not implemented in Hive, we have to write a map-reduce plugin and register it with Hive. This is a one-time task.
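As a rough sketch, such a plugin is commonly written as a Java user-defined function (UDF) and then registered in Hive; the class name, jar path, and function name below are illustrative only.

```java
// Minimal sketch of a custom Hive UDF (illustrative names, old-style UDF API).
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class LowerCase extends UDF {
    // Hive calls evaluate() once for each row value passed to the function.
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}

// Registering and using the packaged UDF from the Hive shell (illustrative):
//   ADD JAR /path/to/custom-udfs.jar;
//   CREATE TEMPORARY FUNCTION my_lower AS 'LowerCase';
//   SELECT my_lower(customer_name) FROM customers;
```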

Alongside all of these, we have Hadoop Common, the core set of libraries and utilities that the other Hadoop modules rely on to process large, distributed data sets.

Hadoop Use Case Explained

Let me take a scenario from before Apache Hadoop was introduced into the software world.

Let me explain the data processing use cases for one big retail company.

The retail company has 20,000 stores around the world. Each store has data related to products in different regions. This data is stored in different data sets, including various popular databases, in a multi-software environment. The company needs licenses for different software, including databases, as well as hardware.

Each month, the company wants to process the data store-wise to find the best-selling products and the most loyal customers. That means we need to process the data and work out the best customers and best-selling products in each region so we can offer better deals.

Assume that each of these stores holds data about all of its products, customer details, and customer purchase information.

We want to target the use cases below.

Find the popular products sold last Christmas in Store A, so that this year we can target customers with discounts on those products.

Find the top 10 products sold.

Select the top 100 customers per store to receive more offers.

From a technical point of view, we need a large data infrastructure to store this data, and we also need to process it; for that we need data warehousing tools that can handle both normalized and unorganized data, and those tools are costly. Storage also has to be reliable, because any failure can mean data loss.

Over time, the data processing becomes more complex and the license and maintenance costs grow.

Suppose 10,000 more stores are added: the data grows and more nodes are added to the current infrastructure, but the overall performance of the system degrades as nodes are added.

Apache Hadoop solves the above problems.
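For example, once the purchase data is loaded into Hadoop, a use case such as "top 10 products sold in Store A last Christmas" could be expressed as a Hive query roughly like the one below; the purchases table, its columns, and the date range are hypothetical.

```sql
-- Hypothetical purchases table; Hive runs this as map-reduce jobs on the cluster.
SELECT product_id, SUM(quantity) AS units_sold
FROM purchases
WHERE store_id = 'A'
  AND purchase_date BETWEEN '2023-12-01' AND '2023-12-26'
GROUP BY product_id
ORDER BY units_sold DESC
LIMIT 10;
```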

This topic has been a very basic start to exploring what Hadoop is. Hopefully, you now have enough information to get started.

If you have any questions, please feel free to leave a comment and I will get back to you.