Q). What are the key benefits of Hadoop for Big Data use cases?
Problem: Data is too big to store on one machine.
HDFS: Store the data on multiple machines!
Problem: To store and process big data, we may need very high-end machines that are too expensive.
HDFS: Run on commodity hardware!
Problem: Commodity hardware will fail!
HDFS: The software is intelligent enough to handle hardware failure!
Problem: What happens to the data if the machine storing it fails?
HDFS: Replicate the data!
Problem: How can distributed machines organize the data in a coordinated way?
HDFS: Master-slave architecture!
Q). What is Apache Hadoop?
Answer).
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
Originally designed for computer clusters built from commodity hardware—still the common use—it has also found use on clusters of higher-end hardware.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
Q). What are the Differences between Regular File System and HDFS?
Typical File System:
In a typical file system, data is maintained on a single system. If the machine crashes, data recovery is difficult because fault tolerance is low. Seek times are longer, so processing the data takes more time. A typical file system also does not fulfill the needs of big data, because it cannot scale out using inexpensive hardware.
HDFS
Data is distributed and maintained across multiple systems. If a DataNode crashes, the data can still be recovered from replicas on other nodes in the cluster. The time taken to read data can be comparatively longer, because blocks are read from local disks on several machines and must be coordinated across the cluster.
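To make the difference concrete, here is a minimal HDFS client sketch in Java. It assumes the Hadoop client libraries are on the classpath and a reachable NameNode; the address hdfs://namenode:8020 and the file path are hypothetical. The client writes and reads through the FileSystem abstraction while HDFS splits the data into blocks and replicates them across DataNodes behind the scenes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real deployment this normally
        // comes from core-site.xml on the classpath.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");

        // Write: the client streams data to DataNodes; HDFS replicates the blocks.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: blocks are fetched from whichever DataNodes hold replicas.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```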
Q). What are the Hadoop Components?
Answer).
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking. (A minimal word-count example follows the module list below.)
The base Apache Hadoop framework is composed of the following modules:
a). Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
b). Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
c). Hadoop YARN – introduced in 2012, a platform responsible for managing computing resources in clusters and using them to schedule users’ applications;
d). Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
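The following is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API, to make the MapReduce programming model concrete. The input and output paths are passed on the command line and are assumptions of this example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the nodes holding the input blocks (data locality)
    // and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```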
Q). What are the Three Configuration Modes in which Hadoop can be run?
Hadoop Configuration Modes
Hadoop supports three configuration modes when it is deployed on commodity hardware:
- Standalone mode
- Pseudo-distributed mode
- Fully distributed mode
In standalone mode, all Hadoop services run in a single JVM (Java Virtual Machine) on a single machine.
In pseudo-distributed mode, each Hadoop service runs in its own JVM, but all of them run on a single machine.
In fully distributed mode, the Hadoop services run in individual JVMs that reside on separate commodity machines forming a single cluster.
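The three modes are ultimately configuration differences. The small sketch below shows how a client can tell which mode its configuration targets; the property values in the comments are illustrative assumptions and are normally supplied by core-site.xml and mapred-site.xml rather than set in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowMode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Standalone mode:      fs.defaultFS = file:///                  (local FS, single JVM)
        // Pseudo-distributed:   fs.defaultFS = hdfs://localhost:9000     (all daemons on one host)
        // Fully distributed:    fs.defaultFS = hdfs://<namenode-host>:8020 (daemons on many hosts)
        System.out.println("fs.defaultFS             = " + conf.get("fs.defaultFS", "file:///"));

        // "local" runs MapReduce in-process; "yarn" submits jobs to a YARN cluster.
        System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name", "local"));

        // The file system actually resolved from the configuration.
        System.out.println("resolved file system     = " + FileSystem.get(conf).getUri());
    }
}
```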
HDFS Architecture
A typical HDFS setup consists of the three essential services of Hadoop:
- NameNode
- DataNode
- Secondary NameNode
The NameNode and the Secondary NameNode services constitute the master service, whereas the DataNode service falls under the slave service.
The master server is responsible for accepting jobs from clients and ensuring that the data required for the operation is loaded and split into data blocks. HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and these blocks are distributed across the DataNode systems within the cluster and replicated, ensuring that multiple copies of the data are maintained.
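The block layout that the NameNode maintains can be inspected from a client. Here is a hedged sketch using FileSystem.getFileBlockLocations; the cluster configuration and the file path /data/sample.log are assumptions of the example. It lists which DataNodes host each block of the file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.log");   // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each entry is one block; getHosts() returns the DataNodes holding its replicas.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
    }
}
```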
Hadoop is a framework that allows you to store Big Data in a distributed environment so that you can process it in parallel. There are basically two core components in Hadoop: HDFS for storage and YARN for processing.
Q). Explain the Architecture of HDFS?
HDFS
HDFS creates an abstraction over storage resources. Similar to virtualization, you can see HDFS logically as a single unit for storing Big Data, but the data is actually stored across multiple nodes in a distributed fashion. HDFS follows a master-slave architecture: the NameNode is the master node and the DataNodes are the slaves.
a). NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. location of blocks stored, the size of the files, permissions, hierarchy, etc. It records each and every change that takes place to the file system metadata.
For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog. It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live. It keeps a record of all the blocks in HDFS and in which nodes these blocks are stored.
b). DataNode
These are the slave daemons, which run on each slave machine. The actual data is stored on the DataNodes. They are responsible for serving read and write requests from clients. They are also responsible for creating blocks, deleting blocks, and replicating them based on the decisions taken by the NameNode.
YARN
YARN performs all your processing activities by allocating resources and scheduling tasks.
It has two major daemons, i.e. ResourceManager and NodeManager.
a). ResourceManager
It is a cluster-level component (one per cluster) and runs on the master machine. It manages resources and schedules applications running on top of YARN.
b). NodeManager
It is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and log management. It continuously communicates with the ResourceManager to remain up to date. Together, these services let you perform parallel processing on the data stored in HDFS using MapReduce.
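On the client side, submitting a MapReduce job to YARN rather than running it locally is a configuration choice. Below is a minimal hedged sketch; the ResourceManager host and port are assumptions of this example and are normally supplied by yarn-site.xml and mapred-site.xml rather than set in code.

```java
import org.apache.hadoop.conf.Configuration;

public class YarnClientConfig {
    public static Configuration forYarn() {
        Configuration conf = new Configuration();
        // Route job submission to YARN instead of the local job runner.
        conf.set("mapreduce.framework.name", "yarn");
        // Hypothetical ResourceManager address (cluster-level, one per cluster).
        conf.set("yarn.resourcemanager.address", "rm-host:8032");
        return conf;
    }
}
```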
Q). What is the Difference between a Single-Node Cluster and a Multi-Node Cluster?
The table below shows the differences between a single-node cluster and a multi-node cluster.
| Single-node cluster | Multi-node cluster |
|---|---|
| Hadoop is installed on a single system or node. | Hadoop is installed on multiple nodes, ranging from a few to thousands. |
| Single-node clusters are used to run trivial processes and simple MapReduce and HDFS operations, and also serve as a testbed. | Multi-node clusters are used for complex computational requirements, including analytics. |
Q). What are the Data Types in Hadoop?
The table below lists commonly used Hadoop Writable data types; a small custom Writable sketch follows the table.
| Data type | Function |
|---|---|
| Text | Stores String data |
| IntWritable | Stores Integer data |
| LongWritable | Stores Long data |
| FloatWritable | Stores Float data |
| DoubleWritable | Stores Double data |
| BooleanWritable | Stores Boolean data |
| ByteWritable | Stores Byte data |
| NullWritable | Placeholder when a value is not needed |
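All of these are implementations of the Writable interface from org.apache.hadoop.io. When the built-in types are not enough, a custom type can implement the same interface. A minimal sketch follows; the PageView class and its fields are hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical custom value type: serialized with write(), rebuilt with readFields().
public class PageView implements Writable {
    private String url;
    private long hits;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(hits);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        hits = in.readLong();
    }
}
```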
Q). What is Distributed Cache in Hadoop?
Distributed Cache is a Hadoop feature used to cache files needed by applications; a usage sketch follows the list below.
The following are the functions of the distributed cache:
- It helps to boost efficiency when a map or a reduce task needs access to common data.
- It lets a cluster node read the imported files from its local file system, instead of retrieving the files from other cluster nodes.
- It allows both single files and archives such as zip and tar.gz.
- It copies files only to slave nodes. If there are no slave nodes in the cluster, distributed cache copies the files to the master node.
- It gives mapper and reducer tasks access to the cached files, which are added to the task's application path.
- It allows referencing the cached files as though they are present in the current working directory.
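In the MapReduce v2 API, the cache is used through the Job at submission time and through the task Context at run time. The sketch below is hedged: the lookup file URI is an assumption of the example, and the localization detail (a symlink under the file name in the task's working directory) is the usual behavior rather than something this document states.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheExample {

    // At job submission time: register a file to be cached on every task node.
    static void addLookupFile(Job job) throws Exception {
        job.addCacheFile(new URI("hdfs:///ref/lookup.txt"));   // hypothetical HDFS path
    }

    // Inside a mapper: the cached file is localized to the node and is normally
    // symlinked into the task's working directory under its file name.
    public static class CacheAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            URI[] cached = context.getCacheFiles();
            if (cached != null && cached.length > 0) {
                try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
                    String firstLine = reader.readLine();   // ...load lookup data as needed...
                }
            }
        }
    }
}
```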
Spark Defined
The Apache Spark developers bill it as “a fast and general engine for large-scale data processing.”
Although critics of Spark’s in-memory processing admit that Spark is very fast (up to 100 times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it also runs up to ten times faster on disk.
Spark can also perform batch processing; however, it really excels at streaming workloads, interactive queries, and machine learning.
Spark’s big claim to fame is its real-time data processing capability as compared to MapReduce’s disk-bound, batch processing engine. Spark is compatible with Hadoop and its modules; in fact, Spark is listed on Hadoop’s project page as a related project.
Spark has its own project page because, while it can run in Hadoop clusters through YARN (Yet Another Resource Negotiator), it also has a standalone mode. The fact that it can run both on a Hadoop cluster and as a standalone solution makes it tricky to directly compare and contrast.
However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.
Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn’t have its own distributed filesystem, but can use HDFS.
Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. The primary difference between MapReduce and Spark is that MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which is covered in more detail under the Fault Tolerance section.
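To show the RDD model side by side with MapReduce, here is a minimal word count using Spark's Java API (Spark 2.x style). The input path and the local[*] master are assumptions of this sketch; cache() marks the RDD to be kept in memory between actions, which is where the in-memory speedup comes from.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {

            // RDDs can be read from HDFS; cache() keeps the data in memory for reuse.
            JavaRDD<String> lines = sc.textFile("hdfs:///input/words.txt").cache();

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // Print a small sample of (word, count) pairs.
            counts.take(10).forEach(pair -> System.out.println(pair._1() + "\t" + pair._2()));
        }
    }
}
```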
Q). What are the Differences between Hadoop 1 and Hadoop 2 (YARN)?
The biggest difference between Hadoop 1 and Hadoop 2 is the addition of YARN (Yet Another Resource Negotiator), which replaced the MapReduce engine in the first version of Hadoop. YARN strives to allocate resources to various applications effectively. It runs two daemons, which take care of two different tasks: the ResourceManager, which does job tracking and resource allocation to applications, and the ApplicationMaster, which monitors the progress of the execution.
Q). What are the Differences between Hadoop 2 vs Hadoop 3?
Hadoop 3 provides several important new features. For example, while Hadoop 2 has a single active NameNode, Hadoop 3 supports multiple NameNodes, which addresses the single-point-of-failure problem.
In Hadoop 3, there are containers that work on the principle of Docker, which reduces the time spent on application development.
One of the biggest changes is that Hadoop 3 decreases storage overhead with erasure coding: 3-way replication stores 200% extra data, whereas a Reed-Solomon (6, 3) erasure coding scheme stores only 50% extra while still tolerating failures.
Also, Hadoop 3 permits the use of GPU hardware within the cluster, which is a substantial benefit for executing deep learning algorithms on a Hadoop cluster.
Q). Explain the core Principles of Hadoop Architecture?
Data is Distributed around the Network
– No centralized data server
– Every node in the cluster can host data
– Data is replicated to support fault tolerance
Computation is sent to data, rather than vice versa
– Code to be run is sent to nodes
– Results of computations are aggregated
Basic Architecture is Master/Worker
– The master (the JobTracker in Hadoop 1, the ResourceManager under YARN) launches the application
– The workers (TaskTrackers or NodeManagers) perform the bulk of the computation